MIDS Capstone Project Summer 2023

Predicting Turnover Through Machine Learning

Problem & Motivation

The goal of this project was to create an easy-to-use web application that lets HR professionals upload basic employee history reports and generate a dashboard of historical and predictive turnover insights. Turnover is an extremely expensive problem for companies, and analysis of the topic often falls to HR teams that may not have dedicated data analysis or data science support to pursue predictive analytics. Our project gives them a plug-and-play solution: all they need to provide is a basic employee history report that a non-technical HR professional can generate from any standard HR information system.

For more context on the problem of turnover for companies: Employees represent a company's most significant cost, and turnover can be a highly expensive challenge, from the administrative costs of offboarding employees and recruiting for their replacements, to lost productivity from vacant jobs. According to Gallup in 2019, U.S. businesses lost $1 trillion to voluntary turnover. By understanding the factors that contribute to turnover and being able to forecast it accurately, organizations can proactively implement strategies to reduce turnover and retain valuable talent. However, most HR departments are not equipped to conduct a thorough analysis of critical metrics such as employee turnover. While HR teams often have access to raw data or basic reporting of historical trends, they often lack the resources or expertise to do predictive analysis. Our project builds a tool that takes a structured employee dataset from the user and generates analysis of historical trends as well as forecasts of future periods for key metrics such as turnover and employee replacement costs. This tool is aimed at HR professionals who have some access to basic reporting on their employees but lack dedicated support from data analysts or data scientists.

Data Source & Data Science Approach

Our project was developed using a training dataset from Steven Shoemaker, a professional and thought leader in the corporate People Analytics space. The format of the dataset mirrors a basic report of all active and terminated employees - i.e. everyone who has ever been employed - at a fictional company, with associated information such as their hire date, termination date (if applicable), department, job level, and salary.

This project is a proof of concept for how future users could submit their own employee history reports and generate a dashboard of historical and predictive turnover insights. As of the time of project delivery, the models used in our application are tuned to the structure of our training data. Were the team to develop this product beyond the final delivery date, we could make adjustments to parts of our processing and modeling pipelines, such as automating outcome variable definition and model hyperparameter tuning to improve generalizability.

Data submitted to the application is processed for two machine learning models: a turnover prediction model that makes individual-level turnover predictions, and a time series model that forecasts monthly turnover rates for the entire company. The output of these models can be seen in the dashboard in separate modules. The output of the turnover prediction model is displayed in a table that shows employees who have not yet left the company along with their individual turnover predictions. The monthly turnover forecasting results can be seen in the time series chart labeled “Turnover Forecasting.”

For the turnover prediction model, we define the outcome variable as what is known in the industry as “regrettable turnover” - an instance of an employee leaving the company under circumstances the company did not want or plan, such as accepting a job at another company. From a business perspective, not all turnover is undesirable, such as in the case of low performers exiting - and so the value our application provides lies in predicting turnover that the company would like to prevent. We treat the problem as a binary classification task, where a 1 indicates a regrettable turnover and a 0 represents all other cases. This outcome variable is defined through analysis of termination reason fields in the submitted data, and the model incorporates a number of features that are either native to the training data (age, tenure, department) or engineered in our processing scripts (salary percentile within an employee’s department/location/job level cohort). Our model produces a prediction score for the likelihood of regrettable turnover for an individual employee, which we display in a summary table in the dashboard.
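The engineered salary-percentile feature described above can be sketched in a few lines of pandas. This is a minimal illustration with a toy dataframe; the column names are illustrative, not the real schema of the training data.

```python
import pandas as pd

# Toy employee history; column names are illustrative, not the real schema.
df = pd.DataFrame({
    "employee_id": [1, 2, 3, 4, 5, 6],
    "department":  ["Sales", "Sales", "Sales", "Eng", "Eng", "Eng"],
    "location":    ["NY"] * 6,
    "job_level":   [2, 2, 2, 3, 3, 3],
    "salary":      [50_000, 60_000, 70_000, 90_000, 100_000, 110_000],
})

# Salary percentile within each department/location/job-level cohort:
# rank each salary against its cohort peers and normalize to (0, 1].
cohort = ["department", "location", "job_level"]
df["salary_pctile"] = df.groupby(cohort)["salary"].rank(pct=True)

print(df[["employee_id", "salary_pctile"]])
```

The resulting column feeds into the classifier alongside the native features such as age, tenure, and department.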

For the time series model, we used Prophet to forecast monthly turnover rates at the company level. To do so, we first converted the data from its original employee history format into a dataframe of monthly turnover rates. The dataset spans eight years of the fictional company’s history and contains considerable noise in the first few years, when the company was smaller, so only the latest six years of data were used to train the forecasting model. The model generates predicted turnover rates for the months remaining in the calendar year as of the time of data submission. Since our training data was collected in February of 2022, our proof of concept displays predictions for February through December 2022.
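The conversion from an employee history to a monthly turnover-rate series can be sketched as below. The toy data and column names are illustrative; the Prophet fitting step is shown in comments since the library call itself is a one-liner once the dataframe has Prophet's expected `ds`/`y` columns.

```python
import pandas as pd

# Toy employee history: hire/termination dates (NaT = still employed).
emps = pd.DataFrame({
    "hire_date": pd.to_datetime(["2020-01-15", "2020-02-01",
                                 "2020-02-20", "2020-03-05"]),
    "term_date": pd.to_datetime(["2020-03-10", None, "2020-04-02", None]),
})

rows = []
for m in pd.period_range("2020-01", "2020-04", freq="M"):
    start, end = m.start_time, m.end_time
    # Headcount: hired on/before month end, not terminated before month start.
    active = (emps.hire_date <= end) & (emps.term_date.isna()
                                        | (emps.term_date >= start))
    terms = emps.term_date.between(start, end).sum()
    rows.append({"ds": start, "y": terms / active.sum()})

monthly = pd.DataFrame(rows)  # Prophet expects columns named "ds" and "y"

# from prophet import Prophet
# model = Prophet().fit(monthly)
# future = model.make_future_dataframe(periods=10, freq="MS")
# forecast = model.predict(future)
```

In practice the turnover-rate definition (e.g. terminations divided by average monthly headcount) would need to match whatever convention the HR team reports against; the sketch above uses end-of-month headcount for simplicity.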

Additionally, we include statistical analysis of relevant metrics for our target audience, such as a feature importance ranking table and chi-square test table, each with information about statistical significance. The feature importance table displays the magnitude of the coefficient for each feature as well as the p-value (i.e. statistical significance) for that feature, to highlight to users which features were most impactful in the turnover prediction model. The chi-square test results allow the user to explore turnover metrics for different categorical variables and view whether differences between sub-groups were significant.
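The chi-square comparison of turnover across sub-groups can be illustrated with `scipy.stats.chi2_contingency` on a small contingency table. The counts below are invented for illustration only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy contingency table: stayed/left counts by department (invented numbers).
table = pd.DataFrame(
    {"stayed": [180, 120], "left": [20, 40]},
    index=["Engineering", "Sales"],
)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# A small p-value suggests the turnover rates of the two
# departments differ by more than chance would explain.
```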


Our key performance metric for logistic regression is the F5 score. In the business context of turnover, false negatives are significantly more costly than false positives. In other words, the cost to the business of falsely predicting that an employee would stay when they instead submit a surprise resignation is significantly higher than a case in which the model falsely predicts the employee will leave, but they end up staying. In the former case, the company incurs all of the costs associated with regrettable turnover, from the administrative costs of advertising and recruiting for a replacement to the productivity costs of a team temporarily operating understaffed. In the latter case, the company may waste some investment in a retention program aimed at an employee who was never a retention risk; however, that investment is likely to be a fraction of the cost of an instance of regrettable turnover. In light of this, we optimized our model evaluation accordingly by weighting recall much more heavily than precision.
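The recall weighting falls directly out of the F-beta formula: with beta = 5, recall counts 25 times as much as precision. A small self-contained sketch (the precision/recall numbers are hypothetical) shows the effect:

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical models with mirrored precision/recall trade-offs.
high_recall = fbeta(precision=0.40, recall=0.90, beta=5)
high_precision = fbeta(precision=0.90, recall=0.40, beta=5)
print(high_recall, high_precision)
```

Under F5 the recall-heavy model scores far higher, even though both would tie under F1; scikit-learn's `fbeta_score(y_true, y_pred, beta=5)` computes the same quantity from raw predictions.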

Key Learnings & Impact

We developed and evaluated five models to predict turnover: logistic regression, random forest, gradient boosted trees, linear support vector machines, and non-linear support vector machines. These models were selected for their suitability for binary classification tasks. We treated logistic regression as our baseline model and the other techniques as more sophisticated, experimental attempts to improve on that baseline.

We observed a significant imbalance in our target variable in our sample dataset, with only 16% of employees having left the company. We experimented with resampling techniques to improve model performance, and ultimately decided on the SMOTE + ENN technique, which oversamples the minority class with SMOTE and then removes noisy samples with Edited Nearest Neighbors (ENN).

These models were trained on a training split of the dataset and evaluated against a held-out test split, with the F5 score as the primary metric of success. In the current version of the application, the logistic regression model achieved the highest F5 score, so it was chosen as the turnover prediction model used within the application.

We should note that while the logistic regression model appeared to have the highest test scores, the differences between model scores were small, so we conducted additional tests of statistical significance to determine whether these differences were meaningful. Ultimately, we determined the differences were not statistically significant; however, this did not change our selection of logistic regression as our preferred model, which retained value over the alternatives due to its simplicity and explainability.

We learned how to integrate statistical analysis into our machine learning workflow. 5x2 cross-validation t-testing was performed to determine whether differences in model performance were statistically significant. Chi-squared testing was used to assess the statistical significance of our feature importances.
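The 5x2cv comparison follows Dietterich's paired t-test: train each model pair on five different 2-fold splits, record the score difference on each fold, and form a t statistic with 5 degrees of freedom. A minimal sketch of the statistic itself (the score differences below are hypothetical):

```python
import math

def paired_5x2cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: five (p1, p2) pairs of score differences (model A minus
    model B), one pair per 2-fold repetition. The result is compared
    against a t distribution with 5 degrees of freedom.
    """
    variances = []
    for p1, p2 in diffs:
        p_bar = (p1 + p2) / 2
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
    # Numerator is the difference from the first fold of the first repetition.
    return diffs[0][0] / math.sqrt(sum(variances) / 5)

# Hypothetical F5-score differences from five 2-fold repetitions.
t_stat = paired_5x2cv_t([(0.02, 0.01), (0.015, 0.02), (0.01, 0.005),
                         (0.02, 0.015), (0.005, 0.01)])
print(f"t = {t_stat:.3f} (df = 5)")
```

Libraries such as mlxtend package this end to end, running the cross-validation and returning the t statistic and p-value directly from two fitted estimators.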

Web design, web development, and server hosting were completely foreign to us when first starting the project, but we learned enough about the relevant tools and platforms to host the application. From Flask and Socket.IO to Bootstrap and Plotly, our team learned a significant amount about the full stack of a web application.


Acknowledgments

We would like to thank our Capstone instructors, Joyce Shen and Cornelia Ilin, for their excellent instruction and feedback throughout the semester.

We would also like to thank Dylan Gandossy, a Vice President of Human Resources at Workday, who consulted with us on the topic of HR analytics applications and provided invaluable feedback on how to keep our project focused on actionable insights for our target users.

We would also like to thank Martin Lim, a fellow MIDS student who provided a valuable ethical and privacy assessment of our application, including a number of suggestions that we implemented.

Last updated:

August 2, 2023