
About

One evening, my friend & coworker Michael Silberman emailed me a link to the Sunlight Foundation Mashup Contest. "We should do something for this!" he said, and I agreed. I checked out the just-launched opencongress.org and was impressed by the depth of the data available — I knew I wanted to do something cool with it. By the time I went to bed that night I had already started pulling data from the site.

But what was I going to do with it? The "Voting Trends Analysis" section of the site caught my eye, and it seemed like there was room for expansion. There were a lot of interesting questions that an analysis of voting data could answer. How loyal are each party's members? Who caucuses with whom? Is the divide between states more marked than the one between rural and urban districts? Over the next six weeks I worked on the project in my spare time.

Getting the Data

Although there are a lot of great API options available for getting congressional data — many of them thanks to the Sunlight Foundation — I knew I'd be taking a trial & error approach to my analysis. I also knew that I wanted to process as much data as I could. Pulling the information off the net for every trial run would have been painful — getting a local copy of everything was definitely in order.

Perl-based screen scraping isn't glamorous, but it works. And thanks to OpenCongress's well-formed, Rails-powered XHTML, it wasn't too hard to extract a complete list of legislators. Next, I grabbed each one's available voting history, extracting information about each vote and how the legislator voted. Along the way I also copied over the issues under which each bill had been categorized, to make slicing-and-dicing by legislative topic possible.

Analysis

How do you compare voting histories, anyway? Here I took a cue from what I remembered of the neural network classes I took in college. Each voting history can be thought of as a vector: a list of numerical values. In this case I assigned -1 to a "Nay" vote, 0 to an abstention, and 1 to an "Aye" vote. A vector can be represented as a point in space. A vector with two entries in it is called a two-dimensional vector, and can be plotted on a piece of paper: the first number defines the position along the x-axis in Cartesian space, and the second number defines the position along the y-axis (you could switch them; it doesn't really matter). Similarly, a three-dimensional vector can be plotted in three-dimensional space. A four-dimensional vector? Well, those get a little harder to draw.

But you can keep increasing the number of dimensions; the math doesn't mind. In the case of voting records, I simply took the set of overlapping votes between each pair of legislators and turned the votes that each legislator cast on those bills into an N-dimensional vector, where N is the number of votes the pair has in common.
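The encoding and the overlap step can be sketched in a few lines. This is a minimal illustration in Python rather than the original Perl; the function name `paired_vectors`, the "Abstain" label, and the sample vote ids are assumptions made up for the example.

```python
# Encoding from the text: Nay = -1, abstention = 0, Aye = 1.
# The "Abstain" label is an assumption; the scraped data may use another name.
VOTE_VALUES = {"Nay": -1, "Abstain": 0, "Aye": 1}


def paired_vectors(history_a, history_b):
    """Build two equal-length vectors from the votes both legislators share.

    Each history is a dict mapping a vote id to "Aye", "Nay" or "Abstain".
    """
    # Sorting keeps both vectors in the same, repeatable vote order.
    shared = sorted(set(history_a) & set(history_b))
    vec_a = [VOTE_VALUES[history_a[vote_id]] for vote_id in shared]
    vec_b = [VOTE_VALUES[history_b[vote_id]] for vote_id in shared]
    return vec_a, vec_b


# Hypothetical histories: the vote ids and positions are invented.
alice = {"roll-1": "Aye", "roll-2": "Nay", "roll-3": "Aye"}
bob = {"roll-2": "Nay", "roll-3": "Abstain", "roll-4": "Aye"}

vec_a, vec_b = paired_vectors(alice, bob)
print(vec_a, vec_b)  # [-1, 1] [-1, 0]
```

Only "roll-2" and "roll-3" appear in both histories, so each legislator ends up with a two-dimensional vector for this pair.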

How do you tell how similar two vectors are? A good way is to determine the angle between them. If you draw a line from the origin (the point that all other points are positioned relative to, e.g. (0,0) in two-dimensional Cartesian space) out to the vector's position in N-dimensional space, you get a line segment. If you draw a second segment from the origin to another point, you can find the angle between the two.

There's a shortcut to doing this with vectors. First, you normalize them so that each of these lines is of length 1 (while retaining its direction). That makes sure that only the direction of the voting (so to speak) is considered, and not the magnitude, which is meaningless in this case. Then you take the dot product of the two vectors. This value is equal to the cosine of the angle between them: 1 when two records point in exactly the same direction, -1 when they point in opposite directions. That bounded range makes it easy to convert into a percentage. With enough processing time I soon had similarity values for every possible pair of legislators in the legislative branch of the federal government.
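The normalize-then-dot-product shortcut looks roughly like this. It is a sketch in Python, not the original Perl; the name `cosine_similarity` and the choice to return 0.0 for an all-abstention vector are assumptions of the example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero (all-abstention) vector has no direction.
        # Returning 0.0 here is an assumption of this sketch.
        return 0.0
    # Scale each vector to length 1, keeping its direction.
    unit_a = [x / norm_a for x in a]
    unit_b = [x / norm_b for x in b]
    # The dot product of two unit vectors is the cosine of their angle.
    return sum(x * y for x, y in zip(unit_a, unit_b))


print(round(cosine_similarity([1, -1, 1], [1, -1, 1]), 6))  # 1.0
print(round(cosine_similarity([1, -1], [-1, 1]), 6))        # -1.0
print(round(cosine_similarity([1, 0], [0, 1]), 6))          # 0.0
```

Identical records score 1, exactly opposite records score -1, and records with nothing in common score 0.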

I didn't just compare legislators to one another, though. I also created a modal voting vector for each party, which represented the predominant vote value for each party's members, for each vote — the voting record of the perfect Democrat and perfect Republican, in other words. By calculating a similarity value for each legislator against this hypothetical perfect partisan I could see how close each officeholder was to voting in lock-step with their party — and their political opponents'.
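A modal vector of this kind can be sketched with a counter per vote. This is a Python illustration under stated assumptions, not the original Perl: the function name `modal_vector`, the data layout, and the handling of ties and of votes nobody cast are all invented for the example, since the text doesn't describe them.

```python
from collections import Counter


def modal_vector(member_votes, vote_ids):
    """Return the most common vote value among a party's members, per vote.

    member_votes maps a legislator's name to a dict of vote id -> -1, 0 or 1.
    Members with no recorded position on a vote are skipped for that vote.
    """
    modal = []
    for vote_id in vote_ids:
        counts = Counter(
            votes[vote_id] for votes in member_votes.values() if vote_id in votes
        )
        # Assumptions: use 0 when no member voted; on a tie, take whichever
        # value Counter.most_common happens to list first.
        modal.append(counts.most_common(1)[0][0] if counts else 0)
    return modal


# Hypothetical three-member party.
party = {
    "A": {"v1": 1, "v2": -1, "v3": 1},
    "B": {"v1": 1, "v2": -1},
    "C": {"v1": -1, "v2": -1, "v3": 1},
}
print(modal_vector(party, ["v1", "v2", "v3"]))  # [1, -1, 1]
```

The resulting list can then be compared against any legislator's own vector in the same way as a legislator-to-legislator comparison.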

I also tried to run similarity comparisons across issues. I thought it would be interesting to see how closely some legislators voted by issue. Two legislators might vote together on the environment, but radically differently on tax cuts, for example. Unfortunately, the amount of voting history data currently available on OpenCongress wasn't sufficient to make these comparisons meaningful enough to be worth showing. But the infrastructure is built, and as the site's data accumulates perhaps it'll become possible to make these comparisons usefully.

The Visualization

The first thing I had to do was to check my work. Here's a similarity matrix that I generated after calculating all the values. It only shows congressmen and congresswomen. I figured the larger dataset would reveal more errors early on, so I started with the House before tackling the Senate.

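[Figure: interactive similarity matrix of House members, with "X Axis" and "Y Axis" readouts for the legislators under the cursor.]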

If you mouse over the graph you should be able to see who each column and row represents. I haven't spent a huge amount of time on the DHTML, so I apologize if the JavaScript is finicky (it seems to be in non-Firefox browsers). This was really just a way to check my work.

Legislators are sorted by party, then by state, then by district. If you look carefully you can see bands representing individual state delegations within parties. You can also see small blocks where delegations vote together across parties, such as Florida Republicans and Democrats. And if you switch to the black and white view ("toggle color"), you can see that the Republicans have a bit more party discipline than the Democrats do (in the black and white view, darker pixels mean more voting similarity).

An individual will obviously have the same voting record as himself or herself, so these pixels are meaningless, and are represented by the white diagonal stripe. Similarly, the upper-right half of the image is a mirror image of the lower-left. It's the difference between comparing Legislator A vs. Legislator B and Legislator B vs. Legislator A: the results are the same. It makes it a little easier to see patterns if you include the redundant data, though.
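Because the two halves mirror each other, each pair only needs to be computed once. Here is a minimal Python sketch of filling such a matrix; `build_similarity_matrix`, `lookup`, and the scores are hypothetical, and the diagonal is simply left empty to match the white stripe.

```python
def build_similarity_matrix(names, similarity):
    """Fill a square matrix by computing each unordered pair only once."""
    size = len(names)
    matrix = [[None] * size for _ in range(size)]  # diagonal stays None
    for row in range(size):
        for col in range(row):
            value = similarity(names[row], names[col])
            matrix[row][col] = value  # lower-left half
            matrix[col][row] = value  # mirrored into the upper-right half
    return matrix


# Hypothetical scores, keyed by an alphabetically sorted pair of names.
SCORES = {("A", "B"): 0.9, ("A", "C"): -0.4, ("B", "C"): 0.1}


def lookup(first, second):
    """Stand-in for a real similarity calculation."""
    return SCORES[tuple(sorted((first, second)))]


matrix = build_similarity_matrix(["A", "B", "C"], lookup)
for line in matrix:
    print(line)
# [None, 0.9, -0.4]
# [0.9, None, 0.1]
# [-0.4, 0.1, None]
```

For N legislators this makes N(N-1)/2 similarity calculations instead of N squared.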

The above graph is interesting, but it's not very flashy. I wanted something with a little more pop. Fortunately, I've been experimenting with the Processing project over the last year. This seemed like a good way to create groupings of similar legislators that were easily understandable, and to provide an interface that would allow more exploration of similarity between voting habits.

When the applet loads, it starts by adding a highlighted legislator. By default, this is the majority leader in each chamber. It then looks for the strongest connection between the set of not-yet-added legislators and the set of already-added legislators. For the first iteration, this means the legislator who votes most similarly to the highlighted legislator. It then repeats the process until every legislator has been added. This means that groupings are generally weaker/less meaningful among the nodes that are added last. To get around this, you can regenerate the graph using any legislator as the starting point by double-clicking on their node.
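The growth loop described above can be sketched as follows. It is a Python illustration, not the applet's Processing code: the function name `growth_order`, the frozenset-keyed score table, and the sample scores are assumptions, and the sketch records only the single strongest connection for each new node. The greedy step is the same one Prim's algorithm uses to build a spanning tree.

```python
def growth_order(similarity, start):
    """Return (new_node, attached_to, score) tuples in the order nodes join.

    similarity maps frozenset({a, b}) to the similarity score for that pair.
    """
    everyone = set().union(*similarity.keys())
    added = {start}
    remaining = everyone - added
    order = []
    while remaining:
        # Every connection between a not-yet-added node and an added one.
        candidates = [
            (similarity[frozenset((new, old))], new, old)
            for new in remaining
            for old in added
            if frozenset((new, old)) in similarity
        ]
        if not candidates:
            break  # nothing left is connected to the graph built so far
        score, new, old = max(candidates)  # strongest connection wins
        order.append((new, old, score))
        added.add(new)
        remaining.remove(new)
    return order


# Hypothetical scores for four legislators.
sims = {
    frozenset(("Leader", "A")): 0.9,
    frozenset(("Leader", "B")): 0.2,
    frozenset(("Leader", "C")): 0.1,
    frozenset(("A", "B")): 0.7,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): 0.8,
}
print(growth_order(sims, "Leader"))
# [('A', 'Leader', 0.9), ('B', 'A', 0.7), ('C', 'B', 0.8)]
```

Calling `growth_order(sims, "C")` instead corresponds to regenerating the graph from a different starting legislator.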

Because of the nature of the physics simulation, I had to introduce a cap on the number of connections that any given legislator can have. Otherwise the whole thing blows apart and you're faced with a blank screen. But the cap is fairly high, and doesn't come into play for the majority of legislators.

There's other information to be had, too. If you click on a node you'll see that legislator's picture, name, state, district (if they're a representative) and the "Dem Score" and "GOP Score" — this is the similarity between them and the aforementioned perfect Democrat/perfect Republican. It's also reflected in the color of their node — nodes that are more purple represent legislators that are more moderate than their partisan peers. The color weighting is somewhat nonlinear to account for the human eye's inability to easily distinguish between similar shades of purple — moderate legislators will still be purple, but the purplishness of partisans is inhibited.

Finally, if you'd like to compare the voting histories of any two legislators directly, you can do so. Simply highlight the first one by left-clicking on them — their node will expand and their picture will appear. Then right-click on another legislator. Your browser will pop up a new window comparing the two on a vote-by-vote basis (be sure you have popup blocking disabled).

Technology

As I mentioned, I used Perl to scrape the data from OpenCongress.org and place it in a MySQL database. I realize that Perl isn't the cool kid anymore, but for regular-expression-heavy work, I still prefer it.

The data analysis code was also written in Perl and done from scratch, as all the existing matrix math libraries that I found seemed to be designed solely for three-dimensional vectors (this sort of math is used heavily in 3D graphics). My initial implementation was very slow, and took almost two days to calculate the similarity values for the House of Representatives. But after optimizing my algorithm and moving some of the processing load to MySQL, I was able to cut that time down to about an hour.

Processing is an incredibly fun visualization environment: it's easy to learn & use, but offers all of the power of Java (if you need it). The visualization depends heavily on the Traer Physics library from Princeton. I based my code on the source of a very cool HTML DOM visualizer applet by a Swiss artist named Sala; it was an invaluable resource for learning Traer and, as is probably obvious, provided much of the inspiration for the visualization.

The legislator-vs-legislator web app portion is written in PHP and frankly isn't very interesting.

Thanks

First and foremost, thanks to EchoDitto for encouraging and humoring me. A less cool company wouldn't have.

I am also hugely in Sala's debt, and can't thank him enough for open-sourcing his code. You can check out his other notable project here, if you haven't already.

Finally, thanks to the Sunlight Foundation for sponsoring this contest and making all of this great data available.

Contact

If you've got any questions, I'd be happy to answer them. You can email me at tom (at) echoditto (dot) com.

Source

I haven't bothered to clean it up at all. But if you're interested in it (including the 36M of scraped and generated data and the 11M of congressional photos), you can download it here. This archive contains defunct, duplicate, and generally horrific scripts. But if you're curious about how I did anything, digging through this and then emailing me is probably your best bet.