Mar 21, 2018

Big Data Meets Literary Analysis: Digital Humanities Research at the I School

Machine learning and big data don’t intuitively go hand in hand with studies of literary fiction; however, new research from School of Information Professor David Bamman, using a machine learning algorithm and natural language processing, has revealed surprising trends related to gender in novels of the 20th century.

Professor Bamman recently published “The Transformation of Gender in English-Language Fiction” in the Journal of Cultural Analytics with Professor Ted Underwood and Sabrina Lee from the University of Illinois. Since publication, their work has received considerable media attention, including articles in Smithsonian Magazine, The Economist, and The Guardian. All three articles explore how the researchers used algorithms to identify patterns in the treatment of gender in literature.

Professor Bamman’s work reflects the interdisciplinary approach of much of the research at the I School, and adds to the ever-growing field of digital humanities. Professor Bamman points to the varying applications of machine learning and natural language processing (NLP): “Part of my motivation for working on projects like these is to show how NLP can be useful for problems outside of the commercial applications we all know about (like Siri or Google Translate). Text is a form of data, and using NLP to reason about its structure has the potential to tell us something new and interesting about the world.”

Smithsonian Magazine examines the finding that the representation of female characters decreased as the ratio of female to male authors fell. The machine learning methods used in the study identified trends that individual readers could not have found on their own, simply because of the sheer amount of data involved.

The Economist further explores the research findings by looking at the algorithms that Bamman and Underwood used. The article identifies the different bases for those algorithms, such as the association of certain words with specific genders. It also references Professor Bamman’s previous research from 2013, in which he “was able to identify character stereotypes from 42,000 Wikipedia film summaries.” The article concludes that although artificial intelligence has yet to write literature the way people do, it can be used as a tool to understand more technical aspects of the humanities, as shown by Professor Bamman’s research.

The article in The Guardian focuses on the researchers’ conclusion that fewer female authors mean less female representation in novels, because male writers are resistant to writing female characters. The author notes that the data indicates “the blurring of the boundaries of gender as fiction moves into the 20th century.” Professor Bamman was surprised by these findings as well, telling us, “I think what's so fascinating about these findings is how stark they are. I've done work on gender bias in Wikipedia and Twitter and expected to find some disparity in characterization here, but the difference in how men and women as authors allocate attention to men and women as characters is really just staggering.”

Image: David Bamman
Image: Graphs showing representation of women in fiction (Source: The Economist)

Last updated: March 22, 2018