MIDS Capstone Project Fall 2023

Handwriting for Hope

Team members:

Problem & Motivation

Alzheimer’s disease, the leading cause of dementia, presents a significant and growing challenge. With cases expected to triple by 2050 and costs potentially exceeding $1.1 trillion, early detection is urgent (Alzheimer's Facts and Figures). Early diagnosis is often difficult because initial symptoms, such as memory loss and confusion, are subtle and easily mistaken for normal aging. Identifying Alzheimer's at an early stage, before extensive neuronal damage occurs, is crucial for improving patients' quality of life and reducing healthcare costs. Timely diagnosis allows for lifestyle and dietary modifications that can slow cognitive decline, along with access to treatments for managing symptoms (NIH Lifestyle Changes to Reduce Alzheimer's). If every American alive today who will develop Alzheimer's disease were diagnosed at the stage of mild cognitive impairment, before dementia, the collective savings in health and long-term care costs would be approximately $7 trillion (Alzheimer's Facts and Figures).

Our project is driven by our desire to make early detection of Alzheimer’s possible in a cost-efficient, easily accessible, and scalable manner. Central to our approach is training a machine learning model to identify early signs of Alzheimer’s through handwriting analysis. We chose handwriting analysis because emerging research has highlighted its potential as a supportive tool in Alzheimer’s diagnosis (Handwriting Analysis to Support Alzheimer’s Disease Diagnosis). To make this technology widely accessible, we have wrapped our model in an easy-to-use web application. The application serves as a potential tool for both medical professionals and patients, facilitating early detection and intervention. It also acts as a growing database, collecting data that will aid ongoing research and the continual improvement of our machine learning model.

Data Source & Data Science Approach

We used the DARWIN dataset (DARWIN), a specialized dataset designed for Alzheimer's disease research, particularly for developing machine learning models for early detection through handwriting analysis. It consists of handwriting data from 174 participants: 89 Alzheimer’s patients and 85 healthy controls. Each participant's record is described by 450 distinct features. The resulting dataset is small but complex and high-dimensional, with more features than participants.

We explored a variety of machine learning models, including Logistic Regression, Neural Networks, Random Forest, and XGBoost, with the primary objective of accurately classifying participants as high or low risk for Alzheimer’s. To address the specific challenges posed by our dataset, namely its small size and high dimensionality, we applied a range of techniques. We used synthetic data augmentation to enlarge the dataset and mitigate the issues arising from its limited size. We also experimented with Principal Component Analysis (PCA) and feature selection methods to manage the high dimensionality. After extensive experimentation and analysis, we identified our best model: a Neural Network trained on our synthetically augmented dataset.
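The two data-handling ideas in the paragraph above, augmentation and dimensionality reduction, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual pipeline: the write-up does not specify the augmentation technique, so Gaussian jitter is used here as a stand-in, the data is random noise with DARWIN's shape, and the function names, the noise scale, the number of copies, the split size, and the choice of 20 components are all hypothetical.

```python
import numpy as np


def augment_with_jitter(X, y, copies=2, noise_scale=0.05, seed=0):
    """Return X and y extended with noisy copies of every row.

    The noise is Gaussian and scaled per feature by that feature's standard
    deviation. Illustrative stand-in only, not the project's actual method.
    """
    rng = np.random.default_rng(seed)
    feature_std = X.std(axis=0, keepdims=True)
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        noise = rng.normal(0.0, 1.0, size=X.shape) * noise_scale * feature_std
        X_parts.append(X + noise)
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)


def pca_fit(X, n_components):
    """Fit PCA with an SVD; return the training mean and the principal axes."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project rows of X onto the fitted principal axes."""
    return (X - mean) @ components.T


# Random stand-in data with DARWIN's shape: 174 participants, 450 features,
# 89 labeled as patients (1) and 85 as healthy controls (0).
rng = np.random.default_rng(42)
X = rng.normal(size=(174, 450))
y = np.array([1] * 89 + [0] * 85)

# Split first, then augment only the training rows, so that no noisy copy of
# a held-out participant ends up in the training set.
order = rng.permutation(len(y))
test_idx, train_idx = order[:35], order[35:]
X_train_aug, y_train_aug = augment_with_jitter(X[train_idx], y[train_idx])

# Fit PCA on the training data only and reuse the same projection for test.
mean, components = pca_fit(X_train_aug, n_components=20)
Z_train = pca_transform(X_train_aug, mean, components)
Z_test = pca_transform(X[test_idx], mean, components)
print(Z_train.shape, Z_test.shape)
```

The ordering is the main design point of the sketch: both the augmentation and the PCA fit see only training rows, which keeps the held-out evaluation honest. With real handwriting features, which sit on very different scales, standardizing each feature before PCA would likely also be needed.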

Evaluation

In evaluating our models, we used a combination of F1 score, precision, recall, and accuracy. We prioritized the F1 score as our primary metric because it balances precision and recall. This balance is crucial for diagnostic support tools in healthcare, where both false positives and false negatives can have profound consequences. Our final model (a Neural Network trained on our synthetically augmented dataset) achieved an F1 score of 97%.
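For readers less familiar with these metrics, the sketch below shows how all four are derived from the same confusion counts. The labels and predictions are made-up toy values, not our results, and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and accuracy for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy example: 1 = high risk, 0 = low risk (invented values).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one high-risk participant is missed (a false negative) and two low-risk participants are flagged (false positives). Because F1 is a harmonic mean, it is pulled down by whichever of precision or recall is weaker, which is why it suits a setting where both kinds of error matter.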

Key Learnings & Impact

Synthetic Data Augmentation: The effectiveness of synthetic data augmentation in situations with limited data is a significant insight. This technique notably improved our model's performance and can be applied in similar research contexts.
Handwriting Analysis for Alzheimer's Detection: Our exploration aligns with growing research indicating that handwriting analysis can be a valuable tool for detecting early-stage Alzheimer's disease (AD) before more typical symptoms emerge. However, we recognize the need for a more unified approach in this research area. The variability in methods and the lack of a standardized data collection protocol, as highlighted in the proposal by N. D. Cilia et al. (Handwriting Analysis to Support Alzheimer’s Disease Diagnosis), represent significant challenges. Bridging this gap and reaching consensus on standardized protocols is crucial for scaling this method and improving its reliability across research and clinical settings.
Potential for Clinical Application: Despite being a proof of concept, our tool has garnered interest from clinicians for trial in clinical settings. This indicates promising potential for future diagnostic applications. Overcoming challenges such as privacy and security could make this a viable, scalable alternative to more expensive, formal diagnostic tests like CT scans or MRIs.

Acknowledgments

We extend our gratitude to our Capstone professors, Cornelia Ilin and Zona Kostic, for their invaluable support and guidance throughout the duration of our project.

Our thanks also go to Dr. Francesco Fontanella, a key researcher in the compilation of the DARWIN dataset, for his correspondence.

Additionally, we are thankful to Dr. Barry Gordon for his interest in our application.