MIDS Capstone Project Summer 2016

True Business Data


Small/medium businesses are the heartbeat of the U.S. economy, yet a severe lack of open, reliable data on them hinders further data science. Even simple questions, like how many businesses are active, are debated. With better information, policy makers, business owners, researchers and data scientists could unlock better understanding and immense value for the American economy.


Using big data processing and machine learning techniques, we’ve created a tool that extracts rich, recent business data from the web by ZIP Code. The raw data source is the Common Crawl, an open source snapshot of the public web (50 TB) updated roughly monthly. We approached this data problem using a combination of MapReduce frameworks on SoftLayer, manual data classification and supervised machine learning.
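As a rough illustration of the map and reduce idea behind extraction by ZIP Code, here is a minimal sketch in plain Python. It is not the project's actual pipeline: the regular expression, the function names (map_page, reduce_by_zip, run) and the sample pages are all hypothetical, and real address parsing over Common Crawl data would need to be far more robust.

```python
import re
from collections import defaultdict

# Assumed pattern (not from the project): a two-letter state abbreviation
# followed by a 5-digit ZIP Code, optionally in ZIP+4 form.
ADDRESS_ZIP = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def map_page(url, text):
    """Map step: emit (zip_code, url) once per distinct ZIP found on a page."""
    for zip_code in sorted(set(ADDRESS_ZIP.findall(text))):
        yield zip_code, url


def reduce_by_zip(pairs):
    """Reduce step: group page URLs under each ZIP Code."""
    grouped = defaultdict(set)
    for zip_code, url in pairs:
        grouped[zip_code].add(url)
    return {zip_code: sorted(urls) for zip_code, urls in grouped.items()}


def run(pages):
    """Chain the map and reduce steps over (url, text) records."""
    pairs = (pair for url, text in pages for pair in map_page(url, text))
    return reduce_by_zip(pairs)


# Hypothetical sample records standing in for crawled pages.
pages = [
    ("http://example.com/a", "Visit us at 12 Main St, Berkeley, CA 94704"),
    ("http://example.com/b", "Offices: Austin, TX 78701 and Berkeley, CA 94704-1234"),
    ("http://example.com/c", "No address here, call 55512 for details"),
]
result = run(pages)
```

In a sketch like this, the third page contributes nothing because its five-digit number is not preceded by a state abbreviation. Deciding which of the matched pages actually belong to businesses is where the manual classification and supervised learning described above would come in.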


We focused this project on building out the methodology and proving the thesis that a novel open source dataset could be created from the Common Crawl. We feel that we have proven this decisively, while acknowledging that major improvements can still be made to the groundwork we have laid. This work will prove valuable in two ways: as a spur to other data science projects built on this novel source of U.S. business data, and as a blueprint for using the web (and the Common Crawl) as the starting point for open source data creation.

Project resources:


"Data is the new oil"

Creating open source data from the web.

Last updated:

March 30, 2017