MIDS Capstone Project Spring 2021

Privacy Xplorer

Privacy Xplorer enables regulators to examine how companies have updated their privacy policies in response to GDPR by leveraging state-of-the-art NLP and a corpus of nearly 1 million privacy policies from over 100K companies.

The General Data Protection Regulation (GDPR) is a major overhaul of privacy regulation introduced in 2016 and enforced in 2018 in the EU. GDPR aims to give individuals greater transparency and control over the collection and use of their personal data. Specifically, GDPR requires that companies:

  • Provide users with the ability to access, edit and request deletion of their personal data

  • Provide users with information on the collection of cookies and the option to disable them

Privacy policies are internal statements that govern an organization's practices for handling personal data. Over the last twenty years, as more privacy regulations have been introduced, privacy policies have almost doubled in length and increased in complexity.

Privacy Xplorer enables regulators to analyze the impact of GDPR on privacy policies, specifically whether companies give users the right to view, edit and delete their personal data, and whether they explain how to disable cookies. We provide dashboards that let regulators track a single company over time, view regional differences, compare industries, and compare adoption rates between more popular and less popular sites.

Privacy Xplorer is powered by a two-stage LegalBERT model that outperforms the baseline by as much as 30%. Our models are trained on the Online Privacy Policies (OPP-115) corpus, a collection of 115 website privacy policies labeled by law students, with the training data augmented through semi-supervised labeling. We then apply our models to the Princeton-Leuven Longitudinal Corpus, an unlabeled dataset of 910,000 privacy policy snapshots from 130,000 websites spanning more than two decades. We built dashboards and visualizations to help regulators draw insights from this rich dataset.
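
As a rough illustration of how per-snapshot model predictions could feed a dashboard, the sketch below aggregates binary flags into a yearly adoption rate. It is a minimal sketch under stated assumptions, not the project's actual pipeline: the row layout, the `adoption_rate_by_year` helper, and the example sites are all hypothetical, and the rows are placeholders rather than results from the corpus.

```python
from collections import defaultdict

# Hypothetical per-snapshot predictions: (site, year, policy grants the right).
# These rows are made-up placeholders, not results from the project.
snapshots = [
    ("a.example", 2017, False),
    ("b.example", 2017, False),
    ("a.example", 2019, True),
    ("b.example", 2019, False),
]


def adoption_rate_by_year(rows):
    """Return {year: share of snapshots flagged True}, ordered by year."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for _site, year, has_right in rows:
        totals[year] += 1
        flagged[year] += int(has_right)
    return {year: flagged[year] / totals[year] for year in sorted(totals)}


rates = adoption_rate_by_year(snapshots)
for year, rate in rates.items():
    print(f"{year}: {rate:.0%} of snapshots flagged")
```

The same grouping could be keyed by region, industry, or site popularity tier instead of year to support the other comparisons the dashboards offer.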

Last updated: April 14, 2021