MIDS Capstone Project Spring 2017

Make News Credible Again

Team members

Problem Statement:

We aim to build a model that can discern whether an article is credible based on features derived solely from its text (e.g. word choice, writing style, and title).

Background:

The widespread propagation of false information online (“fake news”) is not a recent phenomenon, but its perceived impact on the 2016 U.S. presidential election has thrust the issue into the spotlight. In this project, we explore a number of machine learning-based approaches to the problem. Our first step was to identify the various forms of “fake news”.

Four Common Forms of “Fake News”:

Clickbait — Shocking headlines meant to generate clicks to increase ad revenue. Oftentimes these stories are highly exaggerated or totally false.
Propaganda — Intentionally misleading or deceptive articles meant to promote the author’s agenda. Oftentimes the rhetoric is hateful and incendiary.
Commentary/Opinion — Biased reactions to current events. These articles oftentimes tell the reader how to perceive recent events.
Humor/Satire — Articles written for entertainment. These stories are not meant to be taken seriously.

In this project, we focused on developing a classifier that detects clickbait and propaganda articles.

Data:

To acquire a sufficiently large labeled corpus of articles to train on, we scraped the websites of both credible and non-credible sources listed in the OpenSources (http://www.opensources.co/) database for new articles daily. Articles were given the same label as their source.
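
The source-level labeling described above can be sketched roughly as follows. This is only an illustration: the domain names, the label strings, and the `SOURCE_LABELS` mapping are hypothetical stand-ins, and the real OpenSources data has its own format and tags.

```python
from urllib.parse import urlparse

# Hypothetical source list (invented domains and labels for illustration only).
SOURCE_LABELS = {
    "credible-example.com": "credible",
    "clickbait-example.com": "non-credible",
}


def source_domain(url):
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def label_article(url, source_labels=SOURCE_LABELS):
    """Give an article the label of its source, or None if the source is unknown."""
    return source_labels.get(source_domain(url))
```

Labeling by source is cheap and scales to a daily scrape, but it also means an individual article can carry a label that does not match its own content.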

Approach:

Scrape source websites for new article content (text and title) daily and store it on a cloud server.
Preprocess articles for content-based classification using various widely used NLP techniques.
Train different machine learning models to classify the news articles.
Create a web application (using the Flask API) to serve as the front end for our classifier, returning a classification, a confidence metric, and a few important features in the model.
A more detailed description of our approach can be found here
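
To make the preprocess, train, and classify steps concrete, here is a minimal sketch that uses only the Python standard library. It is not the project's actual pipeline: the choice of a bag-of-words multinomial Naive Bayes model is an assumption made for illustration, and the four training headlines are invented toy data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def _log_scores(self, text):
        total_docs = sum(self.class_counts.values())
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        """Return (label, confidence); confidence is the normalized posterior."""
        scores = self._log_scores(text)
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        label = max(exp, key=exp.get)
        return label, exp[label] / sum(exp.values())


# Invented toy data; a real model would train on the scraped corpus.
train_texts = [
    "You won't believe this shocking secret",
    "Shocking trick doctors hate",
    "City council approves annual budget",
    "Senate committee publishes budget report",
]
train_labels = ["non-credible", "non-credible", "credible", "credible"]

model = NaiveBayes().fit(train_texts, train_labels)
label, confidence = model.predict("Shocking secret revealed")
```

A web front end could return `label` and `confidence` directly as the classification and the confidence metric. The posterior of a Naive Bayes model tends to be overconfident, so a production system would likely calibrate it before display.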

( www.classify.news )*

*Please note that during user testing, certain computer models and/or web browsers had difficulty loading the banner video. This video demonstrates how the page should load: https://streamable.com/xn5zl. If you face issues, please try another laptop or web browser (preferably with ad blockers temporarily disabled).

Course

Data Science 210. Capstone, Spring 2017

Class Project Gallery

More Information

Web Application

Google Cloud Deployment Code

Web Application Code

Approach

Visualizations

Model Classification Results

Model Accuracy

Last updated: April 27, 2017