Central Washington looks a whole lot more like Texas than I imagined
Athens Twilight Pro Criterium
I’ve accepted an offer to work as a software engineer for Amazon in Seattle starting at the end of April.
I’m excited at the prospect of creating experiences that delight on a stupidly large scale. I’m also looking forward to working with a group of people who are frighteningly good at what they do.
I’m going to miss Atlanta, the people I’ve gotten to know, and the experiences I’ve had over 11 years here. There is truly very little else in the world like it. It is responsible more than anyplace else for making me the person that I am, but the moment was right, and new things are sometimes the right things.
I’m working on implementing a system for keeping track of frequently observed items in a large stream of items. Part of the algorithm I’m implementing depends on having many hashing algorithms available.
I ran across this paper, which describes a solution to this exact problem: implementing the many hash functions needed for Bloom filters in a non-naïve fashion. It looks like it might be useful for my situation as well.
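The usual non-naïve trick here (and, I believe, the one the paper is getting at) is to derive the i-th hash from two base hashes as h1(x) + i * h2(x). Purely as an illustrative sketch under that assumption, and not the system I’m actually building, a toy counting sketch wired up that way might look like:

```lisp
;; Illustration only: K rows of counters, each indexed by a hash derived
;; from two base hashes via h1 + i*h2 rather than K independent hashes.
(defparameter *rows* 4)     ; number of derived hash functions
(defparameter *width* 4096) ; counters per row

(defun derived-index (item i)
  "The I-th derived hash of ITEM, in the range [0, *WIDTH*)."
  (let ((h1 (sxhash item))
        (h2 (sxhash (list item)))) ; crude second hash, for illustration only
    (mod (+ h1 (* i h2)) *width*)))

(defun make-counters ()
  (make-array (list *rows* *width*) :initial-element 0))

(defun observe (counters item)
  "Record one occurrence of ITEM."
  (dotimes (i *rows*)
    (incf (aref counters i (derived-index item i)))))

(defun estimate (counters item)
  "An over-estimate of how many times ITEM has been observed."
  (loop for i below *rows*
        minimize (aref counters i (derived-index item i))))
```

After a stream of observe calls, estimate takes the smallest counter across the rows, which bounds how often the item could have appeared, and only two base hash computations are needed per item no matter how many rows you use.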
Graphs are a good way to make certain sorts of data relationships understandable. For example, the graph below represents all the co-appearances of characters in Victor Hugo’s Les Misérables.
Imagine trying to make sense of the tabular data that represents the hundreds of vertices and thousands of edges in the graph instead of the visualization. One (albeit obvious) inference we can draw from the visualization is that Valjean is a central character in the story, which would be considerably harder to infer from the tabular data without a good deal of work. Another great example of graph visualizations aiding exploratory data analysis is exploring enron, a project which attempted to marry algorithmic and visualization techniques to explore the Enron email corpus.
GraphViz is a tool I’ve historically used to great effect for visualizing graph data: it takes text files written in a syntax that is easy to generate programmatically and converts them into graphs in a number of print and image formats. It’s handy for quickly modeling things like the dependency graph of build steps in a complicated Ant buildfile or creating diagrams of class hierarchies in your documentation. For all of its virtues, GraphViz suffers from rudimentary styling abilities, and the tool is built around the idea of batch processing of files, making interactive visualization and manipulation of a graph difficult.
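To give a flavor of that syntax, a DOT file as small as the following (a made-up fragment, not from a real buildfile) renders to an image with a single `dot -Tpng deps.dot -o deps.png`:

```
// deps.dot -- a toy dependency graph of build targets (illustrative only)
digraph deps {
    rankdir=LR;
    compile -> test;
    compile -> jar;
    test    -> dist;
    jar     -> dist;
}
```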
Gephi is a tool which seems to be built to better address the idea of using graphs as visualization tools for interactive data exploration. It is a Java application with an OpenGL-accelerated graph canvas, built atop the NetBeans platform. Data can either be input directly within the program, imported from text files in a number of formats, or pulled in from relational databases, graph databases such as Neo4J, or (through an extensible plugin interface) any number of other data stores. An interesting data import plugin I noticed allowed connecting to an IMAP server and downloading all of an account’s email in order to create the graph of the account’s email social network.
For the sake of demonstrating Gephi, I decided to try to build an undirected graph modeling the rooms in my apartment, their connectivity, and their square footage. Vertex size corresponds to the square footage of the room, and vertex color corresponds to the degree of the vertex.
Here’s how you can do the same:
- Download and install Gephi from http://gephi.org/
Launch Gephi. The initial interface looks something like this:
Users of Photoshop or any programming IDE will probably recognize the general shape of it: dockable panels on the left and right with a central workspace. One of the customer pull quotes on their site even refers to Gephi as being ‘like Photoshop for graphs’. Across the top are three tabs representing “modes” of the software: “Overview” for graph exploration, “Data Laboratory” for data input and manipulation, and “Preview” for proofing graphs before they are output to PDF/SVG/PNG.
Create new project
Input data using the Data Laboratory
Gephi has separate panes within the Data Laboratory for inputting node and edge data. We will be putting node data in as id, label, and square footage. I’ve included my data as a Gist so you can follow along. Edge data will consist of source, target, type, and id. For my purposes, labeling edges didn’t make sense, nor did making the type anything besides undirected, as I have no trapdoors or firepoles in my house. The vertex data pane looked like this:
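To give a sense of the shape of the data (made-up values here, not my actual gist), the node table imports cleanly as CSV along these lines:

```
Id,Label,SquareFootage
1,Living Room,250
2,Kitchen,120
3,Hallway,60
4,Bedroom,180
```

and the edges:

```
Source,Target,Type,Id
1,3,Undirected,1
2,3,Undirected,2
3,4,Undirected,3
```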
Graph layout
There are two ways to do graph layout: manually or automatically. To provide a rough spatial correspondence between my house and the graph, I opted for manual layout. For automated layout there are a number of algorithms available, but the ones which would potentially be most useful to us are the class of force-directed algorithms, where edges are modeled as springs to provide reasonable separation of vertices while still keeping the graph relatively compact. Even when doing manual layout, it can be useful to run a force-directed layout first to get a rough arrangement and then tweak things by hand until they look as desired.
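To make the spring metaphor concrete, here is a toy sketch of a single force-directed iteration (my own illustration, nothing like Gephi’s actual layout implementations), with vertex positions stored as complex numbers so 2D vector math is plain arithmetic:

```lisp
(defun layout-step (positions edges &key (repulsion 1000.0) (spring 0.01) (step 0.1))
  "POSITIONS maps each vertex to a complex position; EDGES is a list of
(vertex-a . vertex-b) pairs. Nudges every vertex once and returns POSITIONS."
  (let ((forces (make-hash-table :test 'equal)))
    ;; Every pair of vertices pushes apart (inverse-square repulsion).
    (loop for a being the hash-keys of positions do
      (loop for b being the hash-keys of positions
            unless (eql a b)
              do (let* ((delta (- (gethash a positions) (gethash b positions)))
                        (dist  (max (abs delta) 0.01)))
                   (incf (gethash a forces 0)
                         (* (/ repulsion (* dist dist)) (/ delta dist))))))
    ;; Every edge pulls its endpoints together like a spring.
    (loop for (a . b) in edges
          do (let ((delta (- (gethash b positions) (gethash a positions))))
               (incf (gethash a forces 0) (* spring delta))
               (incf (gethash b forces 0) (* spring (- delta)))))
    ;; Move each vertex a little in the direction of its net force.
    (loop for v being the hash-keys of positions
          do (incf (gethash v positions) (* step (gethash v forces 0))))
    positions))
```

Running a few dozen such steps from random starting positions gives the kind of rough spread you can then clean up by hand.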
Graph Styling
Styling the vertices and edges can help with making sense of the graph and can encode more data into the visualization. For the purposes of my example graph, I decided to encode square footage into the size of the individual vertices, where larger nodes have more square footage than smaller ones. I mapped the color of each node to the degree of the vertex.
Other sorts of useful stylings might use coloring to correspond to categories, say if we wanted to classify the vertices in our house graph according to the sort of use each room has, be it living space, storage space, private space, or public space. The stylings which are useful depend on the nature of the data and the questions being asked in exploratory data analysis.
Data Exploration
Gephi has a wealth of statistical tools for understanding a data set. Examples include statistics on the denseness or sparseness of the graph, network diameter, connected components, or even things like PageRank.
Querying and filtering tools allow us to interactively explore the graph and, say, view only the storage areas in our house graph, or only those nodes with degree greater than 2.
Obviously this has been a relatively small example graph with which to test out Gephi, but the tool scales quite well to data sets whose size puts them just beyond the reach of manual creation, and it provides good tools for making sense of large graphs.
There are several other features beyond the scope of this article which are worth exploring. Using Gephi to visualize graph data which includes time-series or parametric data or visualizing data which updates in real time are but two examples. There is information available on Gephi’s user website which delves deeper into how to get the most out of Gephi.
ATL -> SEA
Curling is fun.
With that in mind, I decided to write a custom UIGestureRecognizer which recognizes continuous press-drag gestures and release it under a BSD license for the learning benefit of others.
The included project has an interface which, on detecting a press-drag gesture, drops a red dot at the site of the press and updates the radius of a blue circle to coincide with where your finger is dragged.
It’s available here. Have at it and feel free to fork it and send a pull request if you improve on it.
In a bout of Doxie-inspired scanning and trashing of my closet-filling trove of old school papers, I came across a printout of some old Lisp code I had written trying to answer Project Euler problems. It’s neat looking back at your old code with fresh eyes, especially when it’s from as early in your learning about computing as this code was. In keeping with the spirit of digitization that originally found it, I decided to transcribe it all into a gist on GitHub and verify that it all still worked.
This consisted of:
- launching emacs
- being welcomed by a spew of errors on loading my .emacs file
- declaring emacs bankruptcy and deleting the .emacs file I’ve been carrying with me all these years, whose contents I no longer care to understand
- installing Steel Bank Common Lisp, an open source Lisp implementation with a high-performance compiler
- installing SLIME for interacting with Lisp from emacs
- wiring sbcl up to SLIME
Good News! Everything worked. Then it was time to take a stab at answering a new problem in Lisp.
Here’s the problem I picked:
The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17. Find the sum of all the primes below two million.
Easy enough to get a correct solution, if not a fast one. The assertion they make is that all problems should be answerable with one minute of computational work on a modestly capable computer. Mine took closer to 8 minutes. Here’s what my solution ended up being:
I think there’s a different approach I could take which would perhaps be faster, but that will have to wait for a bit, as I’m limiting my time expenditure on ostensibly frivolous programming exercises. I did end up observing a few mildly interesting idiosyncrasies in getting reacquainted with Lisp that I thought were worth sharing.
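One classic way to speed this sort of thing up is a sieve of Eratosthenes; a quick sketch of that idea (my illustration here, not the solution above) looks something like:

```lisp
;; Sum all primes below LIMIT with a simple sieve of Eratosthenes.
;; A bit-vector marks composites; anything left unmarked is prime.
(defun sum-primes-below (limit)
  (let ((composite (make-array limit :element-type 'bit :initial-element 0))
        (sum 0))
    (loop for i from 2 below limit
          unless (plusp (bit composite i))
            do (incf sum i)
               (loop for j from (* i i) below limit by i
                     do (setf (bit composite j) 1)))
    sum))

;; (sum-primes-below 2000000) comes back well within Project Euler's
;; one-minute budget, even without any type declarations.
```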
Mildly Interesting Idiosyncrasies
Lisp (specifically SBCL) was one of the first programming environments I ever got comfortable with to a level of algorithmic implementation competency. Working with very large numbers in a tool built from the ground up for arbitrary-precision, arbitrarily sized numbers is kind of a big deal. The frustration or annoyance I’ve felt at every subsequent numerical overflow or precision bug I’ve encountered is firmly rooted in the niceties of this early experience. What’s (/ 5 2)? Why, 5/2 of course! That said, if you don’t need this power, do the work to put type hints into your code so that the compiled result computes much faster. This is one place where a lot of speedup can be had in numerical work.
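As a tiny, made-up illustration of what that hinting looks like (the function names are mine, not from the old code):

```lisp
;; Unhinted: SBCL has to use generic arithmetic that can spill into bignums.
(defun sum-of-squares (a b)
  (+ (* a a) (* b b)))

;; Hinted: the declarations promise SBCL that everything involved fits in a
;; fixnum, so it can compile this down to raw machine arithmetic.
(defun sum-of-squares-fast (a b)
  (declare (type fixnum a b)
           (optimize (speed 3) (safety 0)))
  (the fixnum (+ (* a a) (* b b))))
```

Comparing (disassemble 'sum-of-squares) against the fast version makes the difference immediately visible.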
Looking back at my old Lisp code, I’ve also noticed that I was a huge fan of the loop macro. It’s certainly not functional, recursively defined, quasi-classical, or programming-language-survey-class-approved, but it is quite expressive for the iterative constructs I’ve seen come up in real programming that are awkward to express even functionally. It does, however, compile into who-the-hell-knows-what, so things get interesting if your loop doesn’t behave as expected.
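A made-up example of the sort of mixed iteration where loop reads well:

```lisp
;; Walk a list once, summing the evens, counting the odds, and tracking the
;; largest value seen -- the kind of mixed bookkeeping that gets awkward as
;; a fold but reads almost like prose with LOOP.
(loop for x in '(3 14 15 9 26 5 35 8)
      when (evenp x) sum x into even-sum
      when (oddp x)  count x into odd-count
      maximize x into biggest
      finally (return (list even-sum odd-count biggest)))
;; => (48 5 35)
```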
Performance profiling and enhancement is an interesting endeavor with SLIME and Lisp. The biggest bottlenecks I’ve noticed when moving a naive implementation of an algorithm from “This Code Is Correct” to “This Code Is Fast” are places where large numbers of conses are performed, where very large data structures are passed around by copying, and (this is a unique cultural artifact amongst a subset of PL-survey-class/Intro-AI-educated people) where lists are used naively instead of more sophisticated Common Lisp types. The idea that the cons cell makes lists and trees natural data structures is a nice one, but the emphasis on that idea, at least in my own computer science education, seems to make me and many other Lisp novices not investigate what else the language offers, resulting in weirdly naïve but strictly correct implementations that I and others would never dream of writing in other languages.
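A trivial illustration of that last point (hypothetical functions, not from the old code): tracking which items have been seen before with a list versus a hash table.

```lisp
;; List version: every membership check walks the whole list, and the list
;; itself is an ever-growing chain of cons cells.
(defun seen-p/list (item seen-list)
  (member item seen-list :test #'equal))

;; Hash-table version: the same check in constant expected time, against an
;; EQUAL hash table instead of a list.
(defun seen-p/table (item seen-table)
  (gethash item seen-table))
```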
In short, I’m glad that I learned to write Lisp and it’s been lovely retreading that path with older code. It’s provided yet another reminder that no matter how fast your code is, if your algorithm sucks, it’s going to take a long time. It’s given me a little perspective on how I’ve progressed in my reasoning about algorithmic implementation. It’s been a pleasant revisit of some of the things I used to cherish in my tools. But best of all, it’s reminded me that the tools I use will change how I think about solving problems and not to settle for shitty tools.
I went to a really interesting talk about IBM’s Watson by Bill Murdock, a member of the Algorithms team (and a fellow Georgia Tech alum). The room was one of the largest lecture rooms the College of Computing has access to, and interest was well beyond capacity. Here are some notes I took of his presentation for those of you who were unable to attend. He gave a pretty good explanation of some of the more egregious snafus that happened during the course of the tournament, including the Final Jeopardy clue where Watson seemed to indicate that it thought Toronto was a city in the US.
I think I was most surprised in the end by how much of Watson was based on statistical reasoning over unstructured text documents, given that the problem domain naturally lends itself to symbolic reasoning and inference techniques.
I realized recently that I have a terrible habit of doing a lot of things in my work and personal life and not documenting them in any way, shape or form. This is an attempt at fixing that.