MIDS Capstone Project Summer 2023

Movie Magic

Problem & Opportunity

Producing movies is expensive: since 1973, movie budgets have had a nominal mean of $27M and a median of $15M [1]. Identifying likely successes is also largely manual; industry experts say it can take up to 90 minutes to review a single script thoroughly, even without taking notes for comparison. More than a third of movies fail to break even, which further complicates the problem. Despite these challenges, the total addressable market is massive, with the US box office reaching $7.4B in 2022 [2].

Traditionally, most modeling approaches predicting movie success focus on features available at later stages in the movie development process such as budget, cast and crew, and runtime. However, by these points in production, there’s already been a major investment into the project that can’t be recouped.

In order to avoid such large sunk costs, we focus on the development phase and use features related to plot, genre, and writers. This not only differentiates us from existing competitors and research but also empowers producers to make smarter financial decisions about which projects to pursue.

Our value proposition is clear: we enable moviemakers to be smart about their time and money. By creating a quantitative view of success, our customers will be better positioned to make greenlighting decisions, find hidden gems, and work more efficiently.

Data Source

We built Movie Magic using data from the following sources:

  • Web scraping Wikipedia to collect American films released since 1973
    • Iterating through the “List of American films of <year>” pages to get release year, film names, and hypertext references (e.g., List of American films of 2023)
    • Visiting individual pages using the wikipedia Python library and getting information on plot, writers, adaptation / based on, actors, budget, and box office
  • Requests to the OMDb API for information such as genre(s) and specific release date in addition to backup information on plot and writers
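
As a rough illustration of the collection steps above, the sketch below builds the Wikipedia list-page titles and an OMDb request URL without making any network calls. The function names and the 2023 end year are assumptions made for this example; the `t`, `y`, `plot`, and `apikey` query parameters are the ones OMDb documents for title lookups, and the API key shown in the usage is a placeholder.

```python
from urllib.parse import urlencode

FIRST_YEAR = 1973  # first release year covered, per the text
LAST_YEAR = 2023   # assumed end year for this sketch


def list_page_titles(first_year=FIRST_YEAR, last_year=LAST_YEAR):
    """Titles of the per-year Wikipedia list pages to iterate over."""
    return [
        f"List of American films of {year}"
        for year in range(first_year, last_year + 1)
    ]


def omdb_url(title, year, api_key):
    """Build an OMDb lookup URL for one film (no request is sent here)."""
    params = {"apikey": api_key, "t": title, "y": year, "plot": "full"}
    return "https://www.omdbapi.com/?" + urlencode(params)


titles = list_page_titles()
example_url = omdb_url("Jaws", 1975, "YOUR_KEY")  # placeholder key
```

Each title could then be passed to a page-fetching call in the wikipedia library, and each URL to an HTTP client, with the responses parsed for the fields listed above.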

After building our initial dataset, we spent significant time on the following preprocessing tasks to improve data quality and reduce the risk of feature leakage:

  • Parsed people fields such as writers and actors to ensure consistent formatting
  • Parsed financial fields such as budget and box office to ensure consistent data scaling and handle any instances where a range of values was returned
  • Selected the plot field to use for modeling based on text length and descriptive quality
  • Searched for actor names in the plot field and removed any instances to reduce the risk of feature leakage given that cast is not available during the movie development stage
  • Removed any duplicates that found their way into the dataset during the scraping process (e.g., some lists of American films have an additional section noting the top 10 highest-grossing films that year)
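
Two of the preprocessing steps above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the helper names are hypothetical, and resolving a budget range to its midpoint is an assumption, since the text does not say how ranges were handled.

```python
import re

_NUM = r"(\d[\d,]*(?:\.\d+)?)"
_UNIT = r"(thousand|million|billion)?"
# Matches "$1,500,000", "$1.5 million", "$10–15 million", "$10 million–$15 million".
_MONEY = re.compile(
    rf"\$?\s*{_NUM}\s*{_UNIT}\s*(?:[-–]\s*\$?\s*{_NUM}\s*{_UNIT})?",
    re.IGNORECASE,
)
_SCALE = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def _to_number(digits, unit):
    scale = _SCALE[unit.lower()] if unit else 1.0
    return float(digits.replace(",", "")) * scale


def parse_money(text):
    """Return a dollar amount as a float, or None if no number is found.

    Ranges are collapsed to their midpoint (an assumption for this sketch).
    """
    match = _MONEY.search(text)
    if match is None:
        return None
    low_digits, low_unit, high_digits, high_unit = match.groups()
    # In "$10–15 million" the unit is written once and applies to both ends.
    low = _to_number(low_digits, low_unit or high_unit)
    if high_digits is None:
        return low
    high = _to_number(high_digits, high_unit or low_unit)
    return (low + high) / 2


def strip_actor_names(plot, actors):
    """Remove actor names (and surrounding parentheses) from a plot summary."""
    for name in sorted(actors, key=len, reverse=True):
        if not name.strip():
            continue  # an empty name would match at every word boundary
        pattern = r"\s*\(?\b" + re.escape(name) + r"\b\)?"
        plot = re.sub(pattern, "", plot, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", plot).strip()
```

For example, `parse_money("$10–15 million")` returns `12500000.0`, and `strip_actor_names("Brody (Roy Scheider) closes the beach.", ["Roy Scheider"])` returns `"Brody closes the beach."`.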

Machine Learning Pipeline

Movie Magic combines natural language processing and time-series-based feature engineering with a gradient boosted tree classifier to predict a movie’s financial success. We engineered 70 features, and the pipeline connects a total of 28 models.

Feature Engineering

Our final modeling features fell into three categories:

  • Movie-Related Features: Represent what we know about the movie in the development phase, such as genre and the track record of the writers’ previous works.
  • Natural Language Processing Features: Extract language features such as sentiment and emotion and generate three themes to encapsulate the nuances in the plot.
  • Time Series Based Features: Predict the popularity of specific genres, topics, sentiments, and emotions, enabling the model to learn market trends in the movie industry.
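
To make the time-series idea concrete, here is a deliberately simple stand-in for the forecasting models: a trailing-window genre share that uses only films released before the target year, so no information from the film's own year or later leaks into the feature. Measuring popularity as a share of releases, the three-year window, and the function name are all assumptions for this sketch.

```python
from collections import Counter, defaultdict


def genre_popularity(films, window=3):
    """Map (year, genre) to the genre's share of films in the prior `window` years.

    `films` is an iterable of (release_year, genre) pairs. Only years strictly
    before the target year contribute, which keeps the feature leakage-free.
    """
    counts = defaultdict(Counter)
    for year, genre in films:
        counts[year][genre] += 1

    features = {}
    for year in counts:
        past = Counter()
        for prior in range(year - window, year):
            past.update(counts.get(prior, Counter()))
        total = sum(past.values())
        for genre in counts[year]:
            features[(year, genre)] = past[genre] / total if total else 0.0
    return features


sample = [
    (2000, "Horror"), (2000, "Drama"),
    (2001, "Horror"), (2001, "Horror"),
    (2002, "Horror"), (2002, "Comedy"),
]
popularity = genre_popularity(sample)
```

In this toy data, a 2002 horror film gets a popularity of 0.75 (three of the four films released in 2000 and 2001), while a 2002 comedy gets 0.0 because no comedies appear in the window.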

Experiments and Data Splits

We ran three machine learning modeling experiments. The first was aligned with the published research paper that served as our benchmark, training on data from 1973 through 2009 and testing on 2010-2019 [3]. For the second experiment, we extended the training period to 2016 to include more recent data based on feedback from our domain experts. Lastly, to test the sensitivity of our models to more recent data, we trained on only the last 20 years of films and excluded films prior to 1993.
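
A year-based split like the one in the first experiment can be sketched as follows. The function and field names are hypothetical; only the experiment 1 boundaries (train 1973 through 2009, test 2010 through 2019) come from the text, and the other experiments would pass their own boundaries.

```python
def split_by_year(films, train_start, train_end, test_start, test_end):
    """Split films into train and test sets by release year (inclusive bounds)."""
    if train_end >= test_start:
        # Overlapping periods would leak future information into training.
        raise ValueError("training period must end before the test period starts")
    train = [f for f in films if train_start <= f["year"] <= train_end]
    test = [f for f in films if test_start <= f["year"] <= test_end]
    return train, test


films = [
    {"title": "a", "year": 1980},
    {"title": "b", "year": 2009},
    {"title": "c", "year": 2010},
    {"title": "d", "year": 2021},
]
# Experiment 1 boundaries as stated in the text.
train, test = split_by_year(films, 1973, 2009, 2010, 2019)
```

Splitting on release year rather than at random matters here because the time-series features are only meaningful if every training film predates every test film.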

Model Design 

Our machine learning pipeline’s initial steps involve constructing movie-related features and extracting natural language processing features. These features are subsequently used to perform popularity forecasting using time series techniques. The final classification model was selected after training on these features across the three experiments. We leverage the gradient boosted tree algorithm because it outperformed alternative machine learning models in our experiments, owing to its sequential training process that adeptly rectifies errors from preceding trees. This enables the algorithm to effectively capture complex patterns in data.
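
The sequential error correction described above can be shown with a toy example. This is not the project's model: it is a squared-error gradient boosting sketch over one-feature decision stumps on 0/1 labels, written in plain Python, with made-up data, names, and hyperparameters. Each round fits a stump to the residuals left by the rounds before it.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and every stump's scaled correction."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (left if x <= t else right)
                      for t, left, right in stumps)


model = fit_boosted_stumps([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

On this data the first stump halves the error and every later round halves what remains, so `predict(model, 2)` ends up close to 0 and `predict(model, 5)` close to 1. Production libraries apply the same idea with deeper trees, a classification loss, and regularization.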

We utilized L1 regularization, tree-based feature importance, and SHAP values to determine the most relevant features. Overall, the results were relatively consistent, highlighting important features such as plot embedding, genre popularity, topic clusters, and sentiment shift. We dropped irrelevant features and gained comprehensive insights into feature importance.


We used the F1 score as our performance metric for model selection because it balances the risk of missing a hit against the risk of investing in a flop, a balance of priorities suggested by our domain experts. We evaluated our final model on the held-out test data for experiments 1 and 2 and achieved an F1 score of 0.85 for both, surpassing the benchmark set by the research paper by a notable 13%. Comparing training, validation, and test results, we did not identify any overfitting issues.
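
For reference, the F1 score is the harmonic mean of precision (how many predicted hits were real hits) and recall (how many real hits were found). The sketch below computes it from scratch on made-up labels, where 1 marks a successful film; it illustrates the metric only and does not reproduce the project's results.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, treating 1 as the positive (hit) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # backed a flop
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed a hit
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


score = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is 2/3 as well.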

Key Learning & Impact

Our project has overcome significant challenges to ultimately create an MVP that delivers value to our users, helping to quantify the decision-making process that is traditionally done on “gut feel.”

We compiled a proprietary dataset by scraping and transforming data from over 13,000 web pages that cover half a century of movies. Because the raw data offered few inherent features, we used NLP techniques to extract additional information and increase the feature count to 70. Each feature was carefully curated and handled to avoid data leakage given our time-dependent data. In all, 28 models were employed in our pipeline to learn and predict movie success.

We found that our time series models performed noticeably better when both more recent and more historical data were included in training, suggesting that the data contains both short-term trends and long-term repeating patterns. We also observed that the NLP models were sensitive to the wording and writing style of the input text, perhaps indicating that the training texts had a specific style associated with successful movies.

Inspecting our model’s important features and evaluating correct and incorrect predictions for fairness yielded interesting examples for discussion. Within the scope of the films in our dataset, we learned that the semantic and syntactic relationships of the plot text, the number of sentiment changes in plots, the timing of movie releases, whether a film is an adaptation, and the performance of writers’ past films are important features for predicting whether a movie is worth the investment.

Our project roadmap is full of exciting new features and value to be created: identifying plots with “actor bait” roles that are more likely to attract star talent, flagging special production elements that increase budgets such as international shooting, and expanding the dataset to reduce historical biases. We also plan to build a critical success prediction tool that learns from a diverse range of critical opinions. As movie enthusiasts, we are fascinated by film trends and believe that there are many opportunities to improve and disrupt this industry. Watch out for Movie Magic coming soon to a theatre near you.


We'd like to thank our instructors, Joyce Shen and Cornelia Ilin, for their guidance throughout this project. Additionally, we’d like to thank the following for specific subject matter expertise: Mark Butler for NLP guidance; Quincie Li and Mel Luis for industry advice and feedback; Robert Wang for AWS overviews; Ethan Fast, Lance Thibodeau, and Ben Schreiber for front-end troubleshooting; and Mike Sarchet for VC style feedback on our project presentation.


  1. Determined via web scraping Wikipedia pages for 50 years of American films
  2. Box Office Mojo - Domestic Yearly Box Office
  3. Exploiting time series based story plot popularity for movie success prediction (2023)
Last updated: August 10, 2023