5th Year MIDS Capstone Project 2023


Problem and Motivation

Congress has an insider trading problem. In 2022, the New York Times reported that "stock trades reported by nearly a fifth of congress show possible conflicts." This is not just an optics problem. Since 2012, members of Congress have made over 25,000 trades with a total volume of around half a billion dollars. That is a lot of trades, and a lot of money, to leave without ethical guardrails.

Some progress has been made to hold Congress accountable. Notably, in 2012, the STOCK Act was passed, requiring members of Congress to publicly disclose personal transactions. The public disclosures required by the STOCK Act represent real progress towards holding members of Congress accountable for their personal finances. In 2020, the Department of Justice used disclosures from the STOCK Act to uncover an insider trading scandal at the start of the COVID-19 pandemic. The DOJ probe found that at least 6 senators had used early information about the COVID-19 pandemic to invest in biotech companies that ended up seeing substantial gains over 2020.

So far, it has mostly been investigative journalism that holds congressional members accountable for their trades. However, for journalists to analyze potential instances of insider trading, they must painstakingly gather a corpus of congressional activity. Such work often begins when a journalist following a particular congressional member observes a suspicious transaction. As a mechanism for accountability, investigative journalism does not scale well to the thousands of transactions made per year. Without extensive investigation and analysis, it is impossible to tell whether individual disclosures show possible conflicts of interest.

Intervention and Impact

This is where our team intervenes. We seek to provide investigative journalists, regulatory bodies, and watchdogs with an easy-to-use dashboard that describes a congressional member's activity around a particular industry they are trading in. This includes contextualizing transactions with the member's committee assignments, attended hearings, and sponsored legislation, any of which might give them insider knowledge of the industry they are trading in. With such information easily accessible, investigative experts can more easily analyze transactions that might be subject to criminal or ethical charges.

Our product is a web dashboard that uses AI on top of a first-of-its-kind congressional activity golden dataset to contextualize stock transactions with relevant information. Our novel dataset gathers information on committee assignments, sponsored bills, attended hearings, sponsored travel, and press releases from various news and governmental sources. We then use a BERT-based retriever to rank documents according to their relevance to a queried transaction disclosure. Our final relational database schema consists of 8 tables spanning 300,000 documents relevant to 286 members of Congress.

Data Source and Data Science Approach 

Our retriever works by first enhancing transaction metadata with a verbose description using GPT-4. In the ingestion step, we embed our database on congressional activity using a BERT transformer and place the embeddings in an S3 store. We then compute the cosine similarity between the embedding of the enhanced transaction description and each embedding in the congressional activity database. This similarity score is used to rank the documents in the congressional activity database, and the returned documents are ordered accordingly.
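The ranking step described above can be illustrated with a minimal sketch. It assumes the embeddings have already been computed and loaded into memory; the function names, document IDs, and 3-dimensional toy vectors are hypothetical and stand in for the real BERT embeddings, which are not reproduced here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_documents(
    query_embedding: Sequence[float],
    doc_embeddings: Dict[str, Sequence[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, emb))
        for doc_id, emb in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would have hundreds of dimensions.
query = [1.0, 0.0, 1.0]  # embedding of the enhanced transaction description
docs = {
    "hearing_health": [0.9, 0.1, 0.8],
    "bill_agriculture": [0.0, 1.0, 0.0],
    "press_release_biotech": [1.0, 0.0, 0.2],
}

ranking = rank_documents(query, docs, top_k=2)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy example the health hearing ranks first and the biotech press release second, while the agriculture bill, which is orthogonal to the query, falls outside the top 2.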


We assess our model through the real-life use case of the COVID insider-trading scandal. Of the congressional activity data keyed by a particular transaction, we expect our model to return activity that would enable investigative journalists to write the articles that broke the story on insider trading. We analyze the rhetoric of articles written by The New York Times and The Associated Press to derive a set of 13 distinct pieces of information used.

Of the 13 distinct pieces of information used by the New York Times and Associated Press, 12 are in our database. Our model retrieves all 12 of these in the top-10 most relevant documents for their respective categories. We further examine all 58 transactions listed in the DOJ probe on congressional trading during COVID and, together with an SF Chronicle investigative journalist, manually identify 117 pieces of information in our database that are potentially relevant as context to the transactions. Of those, 106 (91%) are retrieved in the top-10 most relevant documents for their respective categories. The high performance of our model shows that our product can be used by investigators as an analytical sieve to identify potentially suspect transactions.
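The evaluation above amounts to counting how many manually identified items appear in the top-k results for their category. The following is a minimal sketch of that count, not the team's actual evaluation code; the function name, category names, and document IDs are hypothetical. The last lines only re-derive the reported percentage from the two counts given in the text.

```python
from typing import Dict, List, Tuple


def hits_in_top_k(
    relevant: List[Tuple[str, str]],
    rankings: Dict[str, List[str]],
    k: int = 10,
) -> int:
    """Count relevant (category, doc_id) pairs found in the top k of their category."""
    hits = 0
    for category, doc_id in relevant:
        # Each category has its own ranked list, best match first.
        if doc_id in rankings.get(category, [])[:k]:
            hits += 1
    return hits


# Hypothetical ranked results per category for one transaction.
rankings = {
    "hearings": ["h3", "h1", "h9"],
    "bills": ["b2", "b5", "b4"],
}
# Hypothetical items a human reviewer marked as relevant context.
relevant = [("hearings", "h1"), ("bills", "b4"), ("bills", "b8")]

hits = hits_in_top_k(relevant, rankings, k=2)
recall = hits / len(relevant)
print(f"{hits} of {len(relevant)} retrieved in the top 2")

# Figures reported in the text: 106 of 117 items retrieved in the top 10.
reported_recall = 106 / 117
print(f"reported recall: {reported_recall:.0%}")
```

Measuring hits per category, rather than over one pooled ranking, mirrors the way the results are described: each piece of evidence only has to rank highly among documents of its own type.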

Key Learnings

The primary challenge we faced in this project was consolidating data from various sources, which necessitated the creation of a centralized, linked database. The data engineering output of this project is first-of-its-kind and represents a significant value add for any future data project in the space of congressional analysis.

With a robust set of documents, the next challenge was retrieving the documents most likely to indicate a conflict of interest while still keeping the displayed information as complete as possible. To solve this problem, we ranked the documents according to their semantic similarity to the transaction metadata rather than filtering any out. As such, our product is more of an optimized search tool than a chat-my-docs type tool. We also ran into the computational demands of computing embeddings and similarity scores. To address this, we batch processed database documents and vectorized the similarity computation. This solution enabled us to develop a low-latency web page, optimized for user experience.

Another important challenge was the loss of one of our teammates about halfway into the project due to unforeseen personal circumstances. This required us to rapidly re-scale the project and adopt an agile work framework in which each phase deliverable produced deployable value.

Next Steps

Given our goal of creating a tool for investigative bodies to easily analyze potential instances of congressional insider trading, a chatbot focused on summarization is the intuitive next step for our product. In fact, our model is well adapted to serve as the retriever in a RAG-type architecture. We can envision an interface where users ask questions directly about a transaction and the chatbot summarizes the relevant context, highlighting sources of market signal that indicate congressional knowledge. We are optimistic that novel data solutions can be implemented in this space to further move the needle on congressional ethics.

Our existing product can be easily expanded to include a wide variety of data solutions, including live data endpoints and chatbot interfaces. Beyond additional data solutions, our product is well-suited to ingest data on non-congressional transaction disclosures. In fact, since the SEC requires disclosure of stock transactions made by various corporate executives, our solution can be leveraged to promote accountability in the corporate space. Our team hopes to engage with legal experts in the future to tailor indications of conflict around actionable legal frameworks. Congress’ insider trading problem isn’t going away anytime soon. Let’s hold Congress accountable and restore the American public’s faith in government.



PoliWatch Presentation (MIDS FY2023 Capstone)

Last updated: March 21, 2024