MIDS Capstone Project Summer 2024

Very Intelligent Portfolio

Problem & Motivation

The U.S. boasts the world's largest capital markets, yet many employees face a unique challenge: concentrated risk from stock compensation. This creates a dual threat: a potential loss in net worth if the company's stock declines, and risk to future compensation tied to company performance. Compounding the issue, vesting periods often prevent immediate diversification, leaving employees vulnerable.

To address this, we've developed VIP, an open-source solution that uses machine learning to optimize portfolio weights and recommend hedging strategies. VIP democratizes advanced investment techniques, empowering both employees with vested stock and retail investors to better manage their financial futures.

Data Source & Data Science Approach

We accessed our data through Wharton Research Data Services (WRDS), provided by The Wharton School of the University of Pennsylvania.

Our primary dataset is the Center for Research in Security Prices, LLC (CRSP) Daily Stock Dataset. It contains daily U.S. market data for all active and inactive securities with primary listings on the NYSE, NYSE American, NASDAQ, NYSE Arca, and Cboe BZX exchanges. Key variables include security identifiers, price, return, shares outstanding, volume, and other security-level metadata. We sourced data for all years from January 1, 2018 through December 31, 2023.

Our secondary dataset is the Compustat Daily Updates - Fundamentals Quarterly Dataset provided by S&P Global Market Intelligence Capital IQ. It contains quarterly financial fundamentals for publicly traded companies in the U.S. Key variables include security identifiers, total assets, total liabilities, retained earnings, sales, cost of goods sold, expenses, net income, common and preferred shares outstanding, and other security-level metadata. We sourced this data pre-merged with CRSP for all years from January 1, 2018 through December 31, 2023.
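
For reference, both raw tables can be pulled directly through the WRDS Python API. The sketch below is illustrative only; the table and column names (e.g., crsp.dsf, comp.fundq) are our best-guess identifiers and should be verified against the current WRDS schema documentation.

```python
import wrds

# Illustrative sketch: pull CRSP daily stock data and Compustat quarterly
# fundamentals for 2018-2023. Table/column names should be confirmed in WRDS.
conn = wrds.Connection()

crsp_daily = conn.raw_sql("""
    select permno, date, prc, ret, shrout, vol
    from crsp.dsf
    where date between '2018-01-01' and '2023-12-31'
""", date_cols=["date"])

fundamentals = conn.raw_sql("""
    select gvkey, datadate, atq, ltq, req, saleq, cogsq, niq, cshoq
    from comp.fundq
    where datadate between '2018-01-01' and '2023-12-31'
""", date_cols=["datadate"])
```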

We started with the raw CRSP dataset, which contains daily market data for more than 7,000 stocks from 2018 to 2023. We cleaned the data by removing or imputing invalid values and keeping only S&P 500 constituents, then engineered additional features; this dataset feeds the production model shown in the Demo. We also cleaned the Compustat dataset by filling in missing values from the relevant SEC filings, and joined it with the CRSP dataset on security and date. The joined dataset is used for one of our models in development.
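
The following is a minimal sketch of these cleaning and join steps, using hypothetical file names, column names, and example features; the production pipeline differs in its details.

```python
import pandas as pd

# Hypothetical file and column names; the actual CRSP/Compustat schemas differ in detail.
crsp = pd.read_csv("crsp_daily_2018_2023.csv", parse_dates=["date"])
compustat = pd.read_csv("compustat_quarterly_2018_2023.csv", parse_dates=["date"])
sp500_permnos = pd.read_csv("sp500_constituents.csv")["PERMNO"].unique()

# Keep only S&P 500 constituents and drop rows with invalid returns.
crsp = crsp[crsp["PERMNO"].isin(sp500_permnos)]
crsp = crsp[pd.to_numeric(crsp["RET"], errors="coerce").notna()]
crsp["RET"] = crsp["RET"].astype(float)
crsp["PRC"] = crsp["PRC"].abs()  # CRSP stores bid/ask-average prices as negative values

# Example engineered features: rolling volatility and momentum per security.
crsp = crsp.sort_values(["PERMNO", "date"])
returns_by_stock = crsp.groupby("PERMNO")["RET"]
crsp["vol_21d"] = returns_by_stock.transform(lambda r: r.rolling(21).std())
crsp["mom_63d"] = returns_by_stock.transform(lambda r: r.rolling(63).sum())

# Join cleaned CRSP daily data with the most recent quarterly fundamentals per security.
merged = pd.merge_asof(
    crsp.sort_values("date"),
    compustat.sort_values("date"),
    on="date", by="PERMNO", direction="backward",
)
```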

At the core of our approach is a deep learning model inspired by the recent success of attention mechanisms in Natural Language Processing (NLP), applied here to financial time series. This model generates numerical embedding representations of stocks based on their historical price movements and relationships.

Building upon these embeddings, we implemented an optimization algorithm to determine optimal asset allocation weights. The primary goal of this algorithm is to minimize portfolio variance and downside risk, leveraging the rich information captured in the stock embeddings. Through this innovative combination of deep learning and portfolio optimization techniques, we aim to provide a powerful tool for retail investors to make more informed and effective investment decisions.

Evaluation

We experimented with various models, starting with static stock embeddings learned through a self-attention mechanism, inspired by word2vec, which served as our machine learning baseline. We then explored various approaches with the Transformer model, including using each stock's own history for prediction. Ultimately, the most accurate and consistent results came from a Transformer model that treats dates as the sequence dimension and uses all stocks' historical returns as the features at each time step. This model has been productionized.
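
The sketch below illustrates this idea under our stated framing (dates as the sequence, all stocks' returns as features); the layer sizes, heads, and prediction head are placeholder choices, not the production configuration.

```python
import torch
import torch.nn as nn

class ReturnsTransformer(nn.Module):
    """Minimal sketch: each time step is one trading day, and the feature vector
    at that step holds the daily returns of all n_stocks stocks. The encoder
    output at each step is a context-aware representation of that date; per-stock
    embeddings are derived from these representations (the exact extraction in
    the production model may differ)."""

    def __init__(self, n_stocks: int, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.input_proj = nn.Linear(n_stocks, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_stocks)  # predict next-day returns per stock

    def forward(self, returns: torch.Tensor):
        # returns: (batch, seq_len, n_stocks)
        hidden = self.encoder(self.input_proj(returns))  # (batch, seq_len, d_model)
        return self.head(hidden), hidden                 # predictions and representations

model = ReturnsTransformer(n_stocks=500)
x = torch.randn(8, 60, 500)          # 8 windows of 60 trading days
preds, representations = model(x)
```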

A key aspect of our evaluation involved comparing static and dynamic (context-aware) embeddings. Static embeddings assign a fixed vector to each stock, regardless of time or market conditions. In contrast, our dynamic embeddings are context-aware, allowing the representation of a stock to change based on current market conditions and recent performance. This approach captures the evolving nature of stock relationships and market dynamics, potentially providing more accurate and timely representations for portfolio optimization.

We built our model with PyTorch and trained it in AWS SageMaker on an instance with 8 NVIDIA A10G GPUs. The model trained for 50 epochs with early stopping and completed in roughly 4 hours. We used mean absolute error (MAE) as the loss, RAdam as the optimizer, and a learning rate of 5e-5. We evaluated the model with MAE and monitored the learning curve during training, observing that the model was still improving after 50 epochs, a promising learning progression. We also plotted predictions against true values to visually inspect their quality and found that the model made reasonably accurate predictions. Notably, it was able to capture and predict the market downturn during COVID-19.
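
A minimal sketch of this training setup follows, using the settings described above (MAE loss, RAdam, learning rate 5e-5, up to 50 epochs with early stopping). It assumes `model`, `train_loader`, and `val_loader` are already defined, with the model returning (predictions, representations) as in the earlier sketch; the early-stopping patience is a placeholder.

```python
import copy
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
criterion = nn.L1Loss()                                   # mean absolute error
optimizer = torch.optim.RAdam(model.parameters(), lr=5e-5)

best_val, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(50):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        preds, _ = model(x)
        loss = criterion(preds, y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x.to(device))[0], y.to(device)).item()
                       for x, y in val_loader) / len(val_loader)

    if val_loss < best_val:                               # early stopping on validation MAE
        best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```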

We produced a cosine similarity matrix from the learned embeddings. This matrix reveals the relationships between stocks based on their learned representations. None of the cosine similarities fall below 0, which is in line with our expectation that companies in the market generally move in a similar direction. We also visualized the embeddings in a low-dimensional space using a dimensionality reduction technique and found that similar companies cluster near each other, consistent with our expectation that similar companies should move in similar directions in the market.
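
A brief sketch of these two steps is shown below. It assumes per-stock embedding vectors have already been extracted from the trained model into an (n_stocks, d_model) array, and uses PCA as one possible dimensionality reduction technique.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder embeddings; in practice these come from the trained model.
stock_embeddings = np.random.randn(500, 128)

# Cosine similarity matrix between all pairs of stocks.
unit = stock_embeddings / np.linalg.norm(stock_embeddings, axis=1, keepdims=True)
cos_sim = unit @ unit.T                     # (n_stocks, n_stocks), values in [-1, 1]

# Project embeddings to 2-D for visual inspection of clustering.
coords_2d = PCA(n_components=2).fit_transform(stock_embeddings)
# coords_2d can be scatter-plotted and labeled by ticker or sector to check
# that similar companies cluster near each other.
```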

This cosine similarity matrix replaces the traditional correlation matrix in our portfolio optimization, which generates portfolio recommendations by balancing the maximum Sharpe ratio (which compares an investment's return to its risk) against the minimum variance.
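
The sketch below illustrates one way to set up such an optimization, with the cosine similarity matrix standing in for the correlation matrix and a single objective trading off Sharpe ratio against variance. The exact objective, constraints, and inputs used in VIP may differ; `exp_returns`, `vols`, and the placeholder values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights(exp_returns, vols, cos_sim, risk_free=0.0, risk_aversion=1.0):
    """Long-only, fully-invested weights balancing Sharpe ratio and variance,
    using a similarity-based stand-in for the covariance matrix."""
    n = len(exp_returns)
    cov_proxy = np.outer(vols, vols) * cos_sim       # cosine similarity replaces correlation

    def objective(w):
        port_ret = w @ exp_returns
        port_var = w @ cov_proxy @ w
        sharpe = (port_ret - risk_free) / np.sqrt(port_var)
        return -sharpe + risk_aversion * port_var    # maximize Sharpe, penalize variance

    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]  # fully invested
    bounds = [(0.0, 1.0)] * n                                        # long-only
    result = minimize(objective, np.full(n, 1.0 / n), bounds=bounds, constraints=constraints)
    return result.x

# Example usage with placeholder inputs (annualized returns and volatilities).
weights = optimize_weights(
    exp_returns=np.array([0.08, 0.06, 0.05, 0.07, 0.04]),
    vols=np.array([0.25, 0.20, 0.18, 0.22, 0.15]),
    cos_sim=np.eye(5) * 0.5 + 0.5,   # placeholder similarity matrix
)
```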

Key Learnings & Impact

To deliver the best product, we also developed two more advanced models that incorporate enriched features such as market return context and S&P Compustat financial fundamentals; these show promising performance on smaller datasets. We plan to bring these models to production next.

Through this project, we gained valuable insights into several key areas of financial technology and machine learning applications. We explored the effectiveness of deep learning embeddings in capturing complex relationships between stocks, discovering their potential to represent intricate market dynamics in a more nuanced way than traditional methods. Our work also sheds light on the applicability of NLP-inspired techniques to financial time series data, demonstrating how approaches successful in one domain can be adapted to solve challenges in another. Additionally, we encountered and navigated the unique challenges and opportunities in developing AI-driven tools specifically tailored for retail investors, providing us with a deeper understanding of this user group's needs and constraints.

The potential impact of our project extends beyond these technical learnings. By developing this tool, we aim to democratize access to sophisticated portfolio optimization techniques, traditionally available only to institutional investors or high-net-worth individuals. This democratization has the potential to empower retail investors, enabling them to make more informed decisions and potentially achieve better investment outcomes. In the broader societal context, our project could contribute to reducing the wealth gap by improving financial outcomes for a wider range of investors. By providing advanced tools to those who might not otherwise have access, we hope to level the playing field in the realm of personal finance and investment.

Acknowledgements

We would like to thank UC Berkeley for providing access to the Wharton Research Data Services, which has been instrumental in our data collection efforts. We also acknowledge the open-source community for the various libraries and tools that have facilitated our development process. Special thanks to our advisors, Korin Reid and Joyce Shen, and the MIDS community for their valuable feedback and support throughout this project.

References

  • Cao, Larry. "AI Pioneers in Investment Management."
  • Li, Haifeng, and Mo Hai. "Deep Reinforcement Learning Model for Stock Portfolio Management."
  • Sokolov, Alik, et al. "Neural Embeddings of Financial Time-Series Data."
  • Sutiene, Kristina, et al. "Enhancing Portfolio Management Using Artificial Intelligence."
  • Vaswani, Ashish, et al. "Attention Is All You Need."
  • Wang, et al. "Stock2Vec: A Hybrid Deep Learning Framework for Stock Market Prediction with Representation Learning and Temporal Convolutional Network."
  • Wang, Zhicheng, et al. "DeepTrader: A Deep Reinforcement Learning Approach."
  • Yang, Shantian. "Deep Reinforcement Learning for Portfolio Management."
  • Zhang, Zihao, et al. "Deep Learning for Portfolio Optimization."
  • Zhang, Weiwei, and Chao Zhou. "Deep Learning Algorithm to Solve Portfolio Management."
Last updated: August 8, 2024