MIDS Capstone Project Spring 2024

Juniper: Privacy Interface for Large Language Model Interaction

Team members

Summary

Our project, JUNIPER, is a proxy interface that enables individuals and organizations to unlock the potential of Large Language Models (LLMs) while upholding user and organization privacy. As an initial proof of concept, it is applied to medical diagnosis, a domain where large quantities of private data are present. Individuals and companies should no longer have to fear sensitive data leaks when they utilize the power of open-source LLMs.

Mission Statement

We are committed to enabling individuals and organizations to unlock the potential of Large Language Models while upholding their privacy.

Background

Privacy breaches and data exposure are significant concerns with the use of LLMs. Users might unintentionally share sensitive information in their LLM prompts, which can be accessed by LLM providers and potentially utilized elsewhere. Organizations are subject to strict regulations such as GDPR, which dictate the handling of personal data. Furthermore, traditional data anonymization methods, while intended to protect privacy, can sometimes compromise the effectiveness of downstream tasks.

MVP

Our MVP focuses on three core objectives. First, private data in prompts is redacted or replaced before the prompts reach an external LLM provider such as OpenAI, supporting privacy and compliance with regulations. Second, the integrity of the diagnostic process is maintained: the diagnosis produced from a treated prompt should be consistent with the diagnosis produced from the original prompt, which supports confidence in system accuracy. Third, user autonomy is prioritized: users can review and modify treated prompts, which builds trust. By addressing these aspects, our MVP aims to deliver a user-centric solution for medical diagnosis that upholds both privacy and accuracy.
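The first and third objectives can be illustrated with a minimal sketch. This is not Juniper's actual implementation: the regular expressions, the placeholder format, and the function names `redact` and `restore` are all assumptions made for illustration, and a real system would rely on a dedicated PII detector rather than a handful of patterns. The sketch replaces detected spans with numbered placeholders, keeps a mapping so the user can inspect or edit the treated prompt, and restores the original values in a response afterwards.

```python
import re

# Hypothetical patterns for illustration only; they cover a few simple
# formats and are not a substitute for a real PII detection component.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(prompt):
    """Return the treated prompt and a placeholder-to-original mapping."""
    mapping = {}
    treated = prompt
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            # Number placeholders in order of detection so each is unique.
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        treated = pattern.sub(substitute, treated)
    return treated, mapping


def restore(text, mapping):
    """Put the original values back into text that contains placeholders."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


if __name__ == "__main__":
    prompt = ("Patient jane.doe@example.com, phone 555-123-4567, "
              "has had a fever since 03/02/2024.")
    treated, mapping = redact(prompt)
    print(treated)   # the user could review or edit this before it is sent
    print(mapping)
    print(restore(treated, mapping))
```

Because the mapping never leaves the proxy, the user can review the treated prompt before it is sent, and placeholders that appear in the model's reply can be swapped back locally.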

Data Sources

We used multiple datasets to address the problem of privacy-preserving computation for LLMs.

Symptom_to_diagnosis - https://huggingface.co/datasets/gretelai/symptom_to_diagnosis
Names by gender - https://archive.ics.uci.edu/dataset/591/gender+by+name
Race and ethnicity data for first, middle, and surnames - https://www.nature.com/articles/s41597-023-02202-2
Data for: Demographic aspects of first names - https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DV…
Disease Symptoms and Patient Profile Dataset - https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient…

Course

Data Science 210. Capstone, Spring 2024

More Information

Website Page

Product

GitHub Repo

Motivation

Microsoft Presidio architecture

RAG & two tower architecture

Video

Last updated: April 17, 2024