This project uses several Big Data technologies (Spark, Dask, Elasticsearch, Spark Streaming, Logstash, and Hadoop) to analyze characteristics of the 2016 presidential candidates in both batch and real-time data processing scenarios. We chose four candidates for our analysis: Donald Trump, Hillary Clinton, Ted Cruz, and Bernie Sanders.
We used the entire Reddit corpus from October 2007 through August 2015 to evaluate several characteristics of the candidates: post volume over time, the most popular keywords, the parts of speech used to describe them, and sentiment. The intent was to track the rise or decline of each candidate and to tease out insights. We also processed Twitter data to evaluate sentiment toward Donald Trump in particular, given the rising interest in his candidacy and the competing views about him.
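
As a rough illustration of the "post volume over time" measurement, the sketch below counts comments that mention each candidate per month. It is plain Python rather than the project's Spark or Dask code, and it rests on assumptions: the search terms in `CANDIDATES` and the helper `monthly_mentions` are hypothetical, and each record is assumed to carry the `body` and `created_utc` fields found in public Reddit comment dumps.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical search terms; the project's actual matching rules are not stated.
CANDIDATES = {
    "Donald Trump": ("trump",),
    "Hillary Clinton": ("hillary", "clinton"),
    "Ted Cruz": ("ted cruz",),
    "Bernie Sanders": ("bernie", "sanders"),
}


def monthly_mentions(comments):
    """Count comments mentioning each candidate, keyed by (candidate, 'YYYY-MM')."""
    counts = Counter()
    for comment in comments:
        text = comment["body"].lower()
        # created_utc is assumed to be a Unix timestamp in seconds.
        ts = datetime.fromtimestamp(int(comment["created_utc"]), tz=timezone.utc)
        month = ts.strftime("%Y-%m")
        for name, terms in CANDIDATES.items():
            # Naive substring match: a comment counts once per candidate.
            if any(term in text for term in terms):
                counts[(name, month)] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    {"body": "Trump announced his campaign today", "created_utc": 1434412800},
    {"body": "Bernie Sanders drew a big crowd", "created_utc": 1435708800},
    {"body": "Clinton and Sanders on the economy", "created_utc": 1435795200},
]

for (name, month), n in sorted(monthly_mentions(sample).items()):
    print(name, month, n)
```

Substring matching is deliberately simple and will over-count (for example, "trump" also matches "trumpet"). On the full corpus the same grouping would be expressed as a distributed aggregation instead of a single-process loop.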