MIDS Capstone Project Spring 2023

Financial Report Summarization - SEC 10-K

Problem & Motivation: Financial professionals must analyze vast amounts of financial literature to make informed decisions. To streamline this process, we aim to create a comprehensive financial reporting solution using NLP algorithms to generate concise summaries focused on decision-making information.

Data Source & Data Science Approach: We collected a dataset of 191 10-K reports from the SEC for fiscal year 2021. We focused on Item 7 (Management's Discussion and Analysis, MD&A) and had subject matter experts write gold-standard summaries for 48 reports. We applied the BART model architecture and fine-tuned it with a Keywords_attention layer to improve performance.
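As an illustration of what isolating Item 7 from a 10-K filing might involve, here is a minimal sketch in Python. It is not the project's actual preprocessing code: the function name extract_item7, the regular expressions, and the "keep the longest candidate" heuristic (meant to skip the short table-of-contents entry) are all assumptions made for this example, and real filings would likely need more robust handling of HTML markup and cross-references.

```python
import re

# Hypothetical patterns: "Item 7" starts the MD&A section, and the next
# "Item 7A" or "Item 8" heading is assumed to end it.
START = re.compile(r"\bitem\s+7\b", re.IGNORECASE)
END = re.compile(r"\bitem\s+(?:7a|8)\b", re.IGNORECASE)


def extract_item7(filing_text: str) -> str:
    """Return the longest span that starts at an 'Item 7' heading.

    A 10-K usually mentions 'Item 7' more than once (for example in the
    table of contents), so every candidate is tried and the longest one
    is kept as a rough heuristic.
    """
    best = ""
    for match in START.finditer(filing_text):
        end = END.search(filing_text, match.end())
        stop = end.start() if end else len(filing_text)
        section = filing_text[match.start():stop].strip()
        if len(section) > len(best):
            best = section
    return best


# Toy filing text, invented for illustration only.
sample = (
    "Table of Contents\n"
    "Item 7. Management's Discussion and Analysis 30\n"
    "Item 7A. Quantitative Disclosures 45\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue grew on higher volume.\n"
    "Item 7A. Quantitative and Qualitative Disclosures\n"
    "Rates."
)
mdna = extract_item7(sample)
print(mdna)
```

The extracted text would then be the input to the summarization model, with the expert-written summary as the target.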

Evaluation: We used automated evaluation metrics (ROUGE scores and BERTScore) and human evaluation to assess the quality of generated summaries. Our fine-tuned BART model outperformed the GPT-3.5 Turbo model and showed significant improvement over the baseline model.
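To make the ROUGE metric concrete, the sketch below computes ROUGE-N precision, recall, and F1 from clipped n-gram overlap between a generated summary and a reference summary. It is a simplified illustration, not the evaluation code used in the project: the function name rouge_n, the whitespace tokenization, and the example sentences are assumptions, and established ROUGE packages typically add stemming and other normalization.

```python
from collections import Counter


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Compute ROUGE-N precision, recall, and F1 for a pair of texts."""

    def ngrams(text: str) -> Counter:
        # Simplifying assumption: lowercase and split on whitespace.
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # which is the "clipped" overlap ROUGE uses.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example sentences, not taken from the project's data.
generated = "revenue increased due to higher sales"
gold = "revenue increased because of higher sales volume"
scores = rouge_n(generated, gold, n=1)
print(scores)
```

Here four unigrams are shared, so precision is 4/6 and recall is 4/7. BERTScore differs in that it compares contextual embeddings rather than exact n-gram matches, so it can reward paraphrases that ROUGE misses.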

Key Learnings & Impact: Proper data preprocessing and fine-tuning with domain-specific data significantly improved the performance of our model. However, there's still room for improvement to reach human performance levels. Future work includes training with more labels, expanding the scope beyond Item 7, incorporating table data, and deploying the model on the cloud for real-time outputs.

Acknowledgements: We are grateful for the valuable input provided by our subject matter experts, who helped us evaluate the summaries and identify areas for improvement. Their expertise played a crucial role in refining our model and ensuring its effectiveness in summarizing financial information for decision-makers and analysts.

Last updated: April 20, 2023