Total Recall: An NLP Approach To The Early Detection Of Alzheimer's Disease
MIDS Capstone Project Fall 2017

Total Recall

A Natural Language Processing Approach To Predicting Alzheimer's Disease

Problem Background

Alzheimer's disease is a chronic neurodegenerative disease that usually begins in people over 65 years of age with the early symptom of short-term memory loss. It is currently ranked as the sixth leading cause of death in the United States and one in 10 people age 65 and older has Alzheimer’s.

The cause of the disease is not clearly understood, and no effective treatment can stop or reverse its progression. Early detection can therefore be crucial.

Prior research suggests that language degradation in Alzheimer’s patients can occur many years and even decades before primary symptoms.


By applying NLP/ML techniques to a person's transcribed speech over a period of time, we aim to identify linguistic trends that presage Alzheimer's disease.


The data-science-powered solution can be automated and inexpensive, with relatively high accuracy. Such “language markers” can serve as a non-clinical indication of approaching Alzheimer’s disease.

Data Sources

Control Groups

  • US Presidents’ news conferences
  • Hansard: British House of Parliament Debates

Test Group

  • U.S. Congressional Records from the last 10 years


Linguistic Features Computation

Based on a number of academic papers and research efforts, Total Recall computed more than 300 linguistic features that reflect spoken language capabilities. Examples include vocabulary richness, repetition, and use of the passive pronoun.
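As an illustration of what such features can look like, the sketch below computes two simple measures from a transcript: a type-token ratio as a stand-in for vocabulary richness, and the share of repeated word pairs as a stand-in for repetition. These definitions, the function names, and the sample sentence are assumptions chosen for illustration; they are not necessarily the definitions Total Recall used.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Vocabulary richness proxy: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def repeated_bigram_rate(tokens):
    """Repetition proxy: share of adjacent word pairs already seen earlier."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return (len(bigrams) - len(set(bigrams))) / len(bigrams)

# Hypothetical one-sentence transcript, for illustration only.
sample = "Well, I think that, well, I think we will see."
tokens = tokenize(sample)
features = {
    "type_token_ratio": type_token_ratio(tokens),
    "repeated_bigram_rate": repeated_bigram_rate(tokens),
}
print(features)
```

In practice each feature would be computed per speaker and per time period, so that its trend over age can be modeled.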

Difference-In-Difference (DID) Analysis

The statistical DID method assesses the effect of linguistic features on Alzheimer's disease (AD) by comparing the average change over time for the Test group with that of the Control group.

A linear regression model was computed for each linguistic feature (Y) in the dataset. The model is formulated as:

Y = β0 + β1 × A + β2 × T + β3 × A × T

where T is the age of the person, and A is an indicator variable for AD. We computed these models over a group of 12 politicians, 5 of whom were diagnosed with dementia and 7 of whom are healthy.
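A minimal sketch of fitting this model by ordinary least squares is shown below. It assumes A is coded 1 for AD and 0 otherwise, uses synthetic noise-free data generated from known coefficients, and solves the normal equations with a small hand-written solver. The function names and the data are hypothetical and do not come from the project.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_did(y, a, t):
    """Least-squares fit of Y = b0 + b1*A + b2*T + b3*A*T; returns [b0, b1, b2, b3]."""
    rows = [[1.0, ai, ti, ai * ti] for ai, ti in zip(a, t)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(4)]
    return solve_linear_system(xtx, xty)

# Synthetic data (an assumption): one feature observed at ages 60..80 for a
# control series (A = 0) and an AD series (A = 1), with no noise added.
true_b = [0.8, 0.3, -0.002, -0.005]
a, t, y = [], [], []
for group in (0.0, 1.0):
    for age in range(60, 81, 2):
        a.append(group)
        t.append(float(age))
        y.append(true_b[0] + true_b[1] * group + true_b[2] * age
                 + true_b[3] * group * age)

b0, b1, b2, b3 = fit_did(y, a, t)
print("beta3 (extra change per year of age in the AD group):", b3)
```

In this parameterization β2 is the slope of the feature with age in the Control group and β2 + β3 is the slope in the AD group, so β3 is the difference between the two slopes.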

Feature Selection From the DID Model

Total Recall identified numerically negative β3 coefficients, indicating a relative decline for the Test group compared to the Control group.

Given the small sample size, we ran a permutation test on each of the models to identify statistically significant changes in the Test group vs. the Control group.
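One way such a permutation test can be set up is sketched below. It assumes the AD labels are reshuffled across the 12 speakers and β3 is recomputed for every possible assignment of 5 AD labels (792 in total), giving an exact one-sided p-value. Here β3 is computed as the pooled slope of the AD group minus the pooled slope of the remaining speakers, which equals the interaction coefficient of the model above. The data are synthetic and the function names are hypothetical; the project's actual permutation scheme is not described in this page.

```python
from itertools import combinations

def ols_slope(points):
    """Slope of the least-squares line through (t, y) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return sxy / sxx

def beta3(series, ad_ids):
    """Interaction coefficient: pooled AD-group slope minus pooled slope of the rest."""
    ad = [p for i, s in enumerate(series) if i in ad_ids for p in s]
    rest = [p for i, s in enumerate(series) if i not in ad_ids for p in s]
    return ols_slope(ad) - ols_slope(rest)

def permutation_p_value(series, ad_ids):
    """Exact one-sided p-value: share of relabelings with beta3 <= observed."""
    observed = beta3(series, ad_ids)
    k = len(ad_ids)
    stats = [beta3(series, set(c)) for c in combinations(range(len(series)), k)]
    return sum(s <= observed + 1e-12 for s in stats) / len(stats)

# Synthetic data (an assumption): 12 speakers observed at ages 60..70.
# Speakers 0-4 decline by about 0.01 per year; speakers 5-11 stay roughly flat.
series = []
for i in range(12):
    slope = -0.010 + 0.0005 * i if i < 5 else 0.0005 * (i - 8)
    base = 0.6 + 0.01 * i
    series.append([(age, base + slope * (age - 60)) for age in range(60, 71)])

ad_ids = {0, 1, 2, 3, 4}
p = permutation_p_value(series, ad_ids)
print("one-sided permutation p-value:", p)
```

Because every relabeling is enumerated, the smallest attainable p-value with 5 of 12 speakers labeled is 1/792, which is worth keeping in mind when many features are tested at once.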

We computed cross-feature correlations to identify groups of related features.
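A rough sketch of this grouping step follows. It assumes Pearson correlation as the measure and a threshold of 0.9 on its absolute value, and links features into groups whenever a pair exceeds the threshold. The feature names, values, and threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_groups(features, threshold=0.9):
    """Group feature names linked by |r| >= threshold (connected components)."""
    names = list(features)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the group's representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for u, v in combinations(names, 2):
        if abs(pearson(features[u], features[v])) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())

# Hypothetical feature values for one speaker across five time periods.
features = {
    "type_token_ratio": [0.70, 0.68, 0.65, 0.61, 0.58],
    "unique_word_count": [210, 205, 196, 183, 175],
    "mean_sentence_length": [14, 17, 13, 16, 15],
}
groups = correlated_groups(features)
print(groups)
```

Picking one feature per group is one way to arrive at a smaller, less redundant set before comparing against the literature.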

We cross-referenced these features with the literature to identify a representative set of features that serve as early indicators of AD.

Last updated:

December 13, 2017