MIDS Capstone Project Fall 2023

Clean Data Is All You Need

Team members

We have built a software pipeline that processes unstructured PDF documents. For each document, the pipeline produces an output package containing the extracted text, the figures and tables as images, and a JSON file with a structured map of the document. Our approach is built on a visual transformer model fine-tuned specifically for this task.

Our solution is fast while maintaining high accuracy. It is easy to implement, and its modular design makes it straightforward to extend and update as newer techniques become available.
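To make the idea of an output package more concrete, here is a minimal sketch of what a structured document map could look like when written to JSON. The class names (`Element`, `DocumentMap`), the field names, the file name `document_map.json`, and the example paths are all hypothetical and chosen for illustration only; they are not taken from the project's repository, and the actual package layout may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class Element:
    """One extracted item. All field names here are illustrative assumptions."""
    kind: str  # "text", "figure", or "table"
    page: int  # 1-based page number in the source PDF
    path: str  # location of the extracted file, relative to the package root


@dataclass
class DocumentMap:
    """Hypothetical structured map tying extracted files back to the PDF."""
    source_pdf: str
    elements: List[Element] = field(default_factory=list)


def write_package(doc_map: DocumentMap, out_dir) -> Path:
    """Write the map as JSON into out_dir and return the path of the JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    map_path = out_dir / "document_map.json"  # assumed file name
    # asdict converts the nested dataclasses into plain dicts and lists.
    map_path.write_text(json.dumps(asdict(doc_map), indent=2))
    return map_path


if __name__ == "__main__":
    doc = DocumentMap(
        source_pdf="paper.pdf",
        elements=[
            Element("text", 1, "text/page_001.txt"),
            Element("figure", 2, "figures/figure_001.png"),
            Element("table", 3, "tables/table_001.png"),
        ],
    )
    with tempfile.TemporaryDirectory() as tmp:
        print(write_package(doc, tmp).read_text())
```

Keeping the map as a flat list of typed elements is only one possible design; it makes it simple for a downstream consumer to filter by `kind` or `page` without knowing how the extraction was done.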

Course

Data Science 210. Capstone, Fall 2023

Class Project Gallery

More Information

Website

GitHub Repository

Clean_data_is_all_you_need

Process PDFs of scientific papers into structured data: fast, accurate, easy to implement, expandable, and modular.

Last updated: December 18, 2023