Student Project

Looking for a Needle in a Haystack: Predicting Wikipedia Edits

Team members

Ugur Yildirim

Nathaniel Weinman

This project aims to predict new edits on Wikipedia, a widely used public encyclopedia. If it were possible to predict which articles are likely to be edited soon, Wikipedia could notify readers that new information may be on the way. Separately, well-predicted near-future edits could serve as a feature in other models (for example, those predicting vandalism or movie box office success), allowing them to identify trends slightly sooner.

Using size, view, and edit features, our model did a reasonably good job of predicting edit likelihood on a down-sampled, balanced dataset. Of the seven classifiers we trained, a Gradient Boosting model was the most effective. It used about 50 features drawn from both the main and talk namespaces, based on each article's text size, view count, edit count, and minor edit count. With these, it predicted article edits with roughly 76% accuracy on a sampled dataset in which half of the articles were edited. However, more work is needed before it is reliable "in the wild," where fewer than 0.1% of articles in the namespace are edited on any given day.
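To make the gap between the balanced sample and the "in the wild" setting concrete, here is a minimal Python sketch, not the project's actual code. It shows one way to down-sample to a balanced dataset and how precision changes with the share of edited articles. The function names, the `edited` label key, and the choice of a 76% true positive rate with a 24% false positive rate are illustrative assumptions; the summary reports only overall accuracy, not these two rates.

```python
import random


def downsample_balanced(rows, label_key="edited", seed=0):
    """Return a shuffled sample with equal numbers of edited and unedited rows.

    `rows` is a list of dicts; `label_key` (hypothetical name) marks whether
    the article was edited. The larger class is randomly cut down to the
    size of the smaller one.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    n = min(len(positives), len(negatives))
    sample = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(sample)
    return sample


def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision of a classifier when `prevalence` of items are positive.

    Applies Bayes' rule: P(edited | flagged) = TP / (TP + FP).
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged > 0 else 0.0


if __name__ == "__main__":
    # Assumed rates for illustration only (not reported in the summary).
    tpr, fpr = 0.76, 0.24
    for prevalence in (0.5, 0.001):
        p = precision_at_prevalence(tpr, fpr, prevalence)
        print(f"prevalence={prevalence:.3f} -> precision={p:.4f}")
```

Under these assumed rates, the same classifier that looks reasonable on a half-edited sample would flag mostly unedited articles when only about 0.1% of articles are edited per day, which is why the balanced-sample accuracy does not carry over directly.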

Course

Info 251. Applied Machine Learning, Fall 2017

Class Project Gallery

More Information

The project's GitHub repo

info-251-writeup.pdf

™ Wikimedia Foundation, Inc.

Last updated: December 12, 2017