MIDS Capstone Project Summer 2021

TRESPI: TRansformers for Expansion of SParse Indexes

Team members

Joanna Wang

Corporations, government agencies, educational institutions, and other large organizations often need to retrieve specific information from large document datasets. An active area of research is incorporating recent advances in deep neural networks for natural language processing (NLP) into information retrieval systems. Because deep neural networks are generally slower than traditional probabilistic retrieval algorithms, modern information retrieval systems often use a two-stage process to retrieve documents for a user query:

First Stage Retrieval: A probabilistic algorithm quickly retrieves a set of candidate documents using an inverted index.
Reranking: An advanced (but slower) algorithm based on a deep neural network compares the query to the candidate documents and sorts the documents by relevance to the user's query.
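
The two stages above can be illustrated with a minimal sketch. It is not the project's code: the function names, the toy documents, and the `overlap_score` function (a cheap stand-in for a neural reranker) are all assumptions made for illustration, and the first stage simply sums stored term frequencies rather than using a real probabilistic ranking function.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def first_stage(index, query, k=2):
    """Fast stage: sum stored weights of matching query terms, keep the top k."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return [doc_id for doc_id, _ in scores.most_common(k)]

def overlap_score(query, text):
    """Toy relevance score (assumption): share of distinct query terms in the text."""
    query_terms = set(query.lower().split())
    return len(query_terms & set(text.lower().split())) / len(query_terms)

def rerank(query, candidates, docs, score_fn):
    """Slow stage: re-sort only the candidates with a costlier scoring function."""
    return sorted(candidates, key=lambda d: score_fn(query, docs[d]), reverse=True)

# Hypothetical toy collection.
docs = {
    "d1": "neural networks for text retrieval",
    "d2": "retrieval retrieval retrieval of documents",
    "d3": "cooking pasta at home",
}
index = build_inverted_index(docs)
candidates = first_stage(index, "neural retrieval", k=2)
ranked = rerank("neural retrieval", candidates, docs, overlap_score)
print(candidates, ranked)
```

In this toy run the first stage favors the document that repeats a query term most often, while the reranker promotes the document that covers both query terms. The reranker only ever sees the short candidate list, which is why the quality of the first-stage index matters.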

Our research focuses on applying deep learning to first-stage retrieval and was inspired by two recent information retrieval models:

Context-aware Hierarchical Document Term weighting (HDCT) framework by Zhuyun Dai and Jamie Callan at Carnegie Mellon University.
DocT5query by Rodrigo Nogueira and Jimmy Lin at the University of Waterloo.

For our research project, we combined indexes generated by the HDCT and DocT5query models. We named our system for combining indexes TRESPI, or TRansformers for Expansion of SParse Indexes. The goal was to build an inverted index for ad-hoc document search that incorporates a term's context into its term weight and also addresses vocabulary mismatch, which occurs when a query term is relevant to a document's topic but does not appear in the document. To address it, the index includes terms that are relevant to a document's topic even though they do not explicitly appear in the document.
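
As a rough illustration of combining term weights from a document's own text with weights from expansion terms, consider the sketch below. It makes several assumptions: `toy_term_weights` uses scaled relative term frequency as a placeholder for a learned, context-aware term weighter, the `predicted_queries` string is a hypothetical output of a query-generation model, and summing weights per term is only one possible way to merge the two maps. None of this is taken from the TRESPI implementation.

```python
from collections import Counter

def toy_term_weights(text, scale=100):
    """Placeholder weighter (assumption): scaled relative term frequency."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: round(scale * tf / total) for term, tf in counts.items()}

def combine_weights(original, expansion):
    """Merge two sparse term-weight maps by summing the weights of shared terms."""
    combined = Counter(original)
    combined.update(expansion)  # Counter.update adds values instead of replacing
    return dict(combined)

doc_text = "the cat sat on the mat"
# Hypothetical query text generated for this document by an expansion model.
predicted_queries = "where does the cat sleep"

original = toy_term_weights(doc_text)
expansion = toy_term_weights(predicted_queries)
combined = combine_weights(original, expansion)

# "sleep" never occurs in the document but is now an indexed term.
print(combined["sleep"], "sleep" in original)
```

Passing an empty mapping as `original` produces an index built from expansion terms alone, which loosely mirrors an expansion-only index variant.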

We expected that an index consisting of term weights generated from both DocT5query and the original text would have the best performance. Surprisingly, our best-performing index contained only term weights generated by passing DocT5query terms through HDCT, omitting HDCT term weights generated from the original document text. TRESPI's performance was comparable to that of HDCT and DocT5query.

Course

Data Science 210. Capstone, Summer 2021

Class Project Gallery

More Information

Project Website

GitHub Repository

<cite>Banner image is from xkcd comic #1256</cite>

Last updated: August 5, 2021