MIDS Capstone Project Spring 2023

SimpLAWfy

Team members

Problem & Motivation

Every day, consumers mindlessly accept the terms of service and privacy policies of the most popular websites and mobile applications. These documents range from roughly 2,000 to 8,000 words and continue to grow longer as companies expand their product and service offerings and respond to ever-changing legal regulations and requirements. As a result, most of us skip to the bottom and hit “I accept” instead of even attempting to interpret the legalese stuffed into these bloated documents.

SimpLAWfy is a tool that aims to remedy the challenges of reading and comprehending the terms of service and privacy policies by using an NLP summarization model to provide a simplified explanation of the terms that can be understood in a fraction of the time. As an additional feature, our NLP model is equipped to answer a set of common questions that users care most about regarding privacy policies.

Data Source & Data Science Approach

Our team developed an NLP summarization and question answering model utilizing summaries generated manually by our team and by OpenAI's ChatGPT. We leveraged the bart-large-cnn-samsum model for summarization and the bart-lfqa model for question answering.
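As a rough illustration of the summarization step, the sketch below splits a long policy into word-based chunks and summarizes each one with the Hugging Face `transformers` summarization pipeline. Several things here are assumptions rather than details of our implementation: the Hub identifier `philschmid/bart-large-cnn-samsum`, the 400-word chunk size (chosen only to stay well under BART's input limit), and the helper names `chunk_words` and `summarize_policy`. It does not reflect any fine-tuning.

```python
from __future__ import annotations


def chunk_words(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def summarize_policy(
    text: str,
    model_name: str = "philschmid/bart-large-cnn-samsum",  # assumed Hub ID
    max_words: int = 400,  # assumed chunk size, not a tuned value
) -> str:
    """Summarize a long document chunk by chunk and join the results."""
    # Imported lazily so the chunking helper works without transformers installed.
    from transformers import pipeline

    summarizer = pipeline("summarization", model=model_name)
    parts = []
    for chunk in chunk_words(text, max_words):
        # truncation=True guards against chunks that still exceed the model limit.
        result = summarizer(chunk, truncation=True)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Calling `summarize_policy(policy_text)` on the full text of a privacy policy would return one combined summary. Chunking by word count is the simplest option; splitting on section headings would likely keep related clauses together better.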

Mission

Our mission is to prevent users from blindly signing away their data and digital rights.

Key Learnings & Impact

A key theme throughout our project has been maintaining the balance between summarization and understandability. We wanted our summaries to be concise compared to the original documents while also being easy to understand and more accessible. Throughout this process, we relied on our legal experts to help us wade through the opaque language and understand both the risks associated with blind acceptance and the scale and scope of how user data is being collected and used. US data privacy laws and regulations are rapidly changing and can vary based on where the user is located or where the company operates, and as a result it can be difficult for the common consumer to stay up to date.

Evaluation

We evaluated the SimpLAWfy model on several factors, including length, difficulty, and performance. Each matters if our target users are to comprehend what they are agreeing to and if the summary is to provide a useful representation of the full text. We used length to compare the number of words and sentences in our summaries against the original document. To evaluate difficulty, we used Flesch Reading Ease scores. Finally, we used ROUGE-1 metrics to evaluate our model's performance. We prioritized ROUGE-1 recall and the Flesch Reading Ease score over precision, as we felt it was important to retain a high level of the information and to have a summary that was easy to understand.
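The three measures described above can be sketched in plain Python. This is a minimal illustration, not our evaluation code: the function names are hypothetical, the tokenizer is a simple regular expression, and the syllable counter is a crude vowel-group heuristic, so its Flesch scores will differ somewhat from those of dedicated readability libraries. ROUGE-1 is computed here as clipped unigram overlap, and the Flesch Reading Ease formula is the standard 206.835 - 1.015 (words/sentences) - 84.6 (syllables/words).

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rouge1(reference: str, candidate: str) -> tuple:
    """Return (recall, precision) based on clipped unigram overlap."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())  # min count per shared unigram
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    return recall, precision


def count_syllables(word: str) -> int:
    """Heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def length_ratio(original: str, summary: str) -> float:
    """Summary word count as a fraction of the original word count."""
    n_original = len(tokenize(original))
    return len(tokenize(summary)) / n_original if n_original else 0.0
```

For example, `rouge1("the cat sat on the mat", "the cat sat")` gives a recall of 0.5 and a precision of 1.0: the summary keeps half of the reference unigrams and adds nothing that is not in the reference. Favoring recall over precision, as we did, rewards summaries that retain more of the source information even if they include some extra words.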

Acknowledgments

We would like to acknowledge our capstone advisors, Fred Nugen and Joyce Shen, for their guidance and support during the development of our product. We would also like to thank our legal experts, Nina Chang and Alex Lemberg, for their expertise in helping us understand terms of service and privacy policies.

Course

Data Science 210. Capstone, Spring 2023

Class Project Gallery

More Information

Product Website

Video

Last updated: April 22, 2023