Sweep Banner
MIDS Capstone Project Fall 2017


Problem Background

Today's political climate is the most polarized it has ever been, with members of Congress voting with their party 95% of the time, up 12 percentage points in the past 17 years. As a result, the fate of important legislation typically rests with just a handful of representatives.

Significant work is being done to track legislators, voting, and elections. However, voters must still sift through catalogs of disparate data and draw their own insights.


Our goal is to apply a machine learning approach to identify potential swing voters so that engaged citizens have a productive channel for political activism between elections.


  • Combine data sources into a cohesive data store
  • Expose data to voters through interactive visualization
  • Identify fence-sitters: legislators who may vote against their party majority

Data Sources

We collected data from publicly available sources such as ProPublica, OpenSecrets, Vote Smart, and FEC election results, covering the following areas:


  • Bills
  • Roll Call Votes
  • Legislators


  • PAC Contributions
  • Special Interest Group Ratings
  • District Economics


Anomaly Detection

We framed the machine learning problem as predicting which legislators vote against their party, rather than predicting votes outright, because this framing has not been studied extensively in the literature. By treating it as an anomaly detection problem, we hope to achieve better predictive power than by predicting raw voting results.
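As an illustration of this framing, the sketch below labels each vote as a party-line defection or not. It is a minimal example under stated assumptions, not the project's actual pipeline: the tuple layout, the function name `label_defections`, and the rule that "party majority" means the most common position within a party on a roll call are all hypothetical choices made here.

```python
from collections import Counter, defaultdict


def label_defections(votes):
    """Label each vote as a defection from the party majority.

    votes: list of (roll_call_id, member_id, party, position) tuples.
    This layout is an assumption for illustration only.
    Returns {(roll_call_id, member_id): bool}, where True means the member
    voted against the most common position of their own party.
    """
    # Count positions per (roll call, party).
    tallies = defaultdict(Counter)
    for roll_call, _member, party, position in votes:
        tallies[(roll_call, party)][position] += 1

    labels = {}
    for roll_call, member, party, position in votes:
        ranked = tallies[(roll_call, party)].most_common(2)
        # If the party split evenly there is no majority, so skip the vote.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        labels[(roll_call, member)] = position != ranked[0][0]
    return labels


# Toy data: member C breaks with the majority of party D on roll call rc1.
votes = [
    ("rc1", "A", "D", "Yes"),
    ("rc1", "B", "D", "Yes"),
    ("rc1", "C", "D", "No"),
    ("rc1", "X", "R", "No"),
    ("rc1", "Y", "R", "No"),
]
labels = label_defections(votes)
```

Because defections are rare relative to party-line votes, the positive class produced by a labeling like this is small, which is what makes the anomaly detection view a natural fit.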

Ensemble Models

Similarly, although significant research has been done on this topic, it was often limited to one particular feature set, e.g. voting history or PAC funding. We believe that combining models trained on each of these individual feature sets into an ensemble model will provide greater predictive power and generalization than any one model on its own. An overview of the model approach can be seen to the right.
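One simple way to combine per-feature-set models is a weighted average of their predicted probabilities. The sketch below shows that idea only; the source does not say how the ensemble was actually built, and the function name `ensemble_probability`, the feature-set names, and the weights are hypothetical.

```python
def ensemble_probability(model_probs, weights=None):
    """Combine per-model defection probabilities with a weighted average.

    model_probs: {model_name: probability} for a single legislator and vote.
    weights: optional {model_name: weight}; defaults to equal weights.
    Weighted averaging is an assumed combiner, not the project's method.
    """
    names = sorted(model_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return sum(weights[name] * model_probs[name] for name in names) / total


# Hypothetical outputs from three feature-set models for one vote.
probs = {"voting_history": 0.9, "pac_funding": 0.6, "ideology": 0.75}

equal_weighted = ensemble_probability(probs)
history_heavy = ensemble_probability(
    probs, {"voting_history": 3.0, "pac_funding": 1.0, "ideology": 1.0}
)
```

In practice the weights could be learned by a second-stage model trained on held-out predictions, which is the usual stacking approach; the fixed weights here are just for illustration.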

Model Results

Overall Performance

The overall model is able to faithfully identify members of Congress who vote against their own party. Setting a probability threshold at 80% identifies the vast majority (90%) of this population. Unfortunately, because members vote with their party so frequently, such a threshold results in an unacceptable number of false positives. Despite the model's impressive performance over baseline, additional work is needed to reduce false positives in the ensemble model's predictions.
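The tradeoff described here can be made concrete by computing precision and recall at a chosen threshold. The sketch below uses made-up scores and labels, and assumes the score is the predicted probability that a vote is a defection; the source does not specify this, and the numbers are not the project's results.

```python
def precision_recall(scores, labels, threshold):
    """Return (precision, recall) when flagging scores >= threshold.

    scores: predicted probabilities of defection (assumed meaning).
    labels: True where the vote really was a defection.
    """
    predicted = [score >= threshold for score in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 10 votes, only 2 of which are true defections.
scores = [0.95, 0.85, 0.90, 0.82, 0.81, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [True, True, False, False, False, False, False, False, False, False]

precision, recall = precision_recall(scores, labels, threshold=0.8)
```

In this toy case every true defection is caught, but most flagged votes are false positives, which mirrors how a rare positive class can make high recall coexist with low precision.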

Relative Performance

All of the feature-set models perform better than their baselines. Member ideology, the approach most commonly used to explain voting patterns, performed the best. We obtained a small yet reliable gain in performance by combining the results of these models in an ensemble.

More Information

Last updated: December 12, 2017