May 28, 2010

Geoff Nunberg Discusses Google Books and Automated Textual Analysis

From The Chronicle of Higher Education

The Humanities Go Google

By Marc Parry

Matthew L. Jockers may be the first English professor to assign 1,200 novels in one class.

Lucky for the students, they don't have to read them.

As grunts in Stanford University's new Literature Lab, these students investigate the evolution of literary style by teaming up like biologists and using computer programs to "read" an entire library.

It's a controversial vision for changing a field still steeped in individual readers' careful analyses of texts. And it could become a more common way of doing business in the humanities as millions of books are made machine-readable through new tools like Google's digital library. History, literature, language studies: For any discipline where research focuses on books, some experts say, academe is at a computational crossroads....

But here's the rub. Google Books, as others point out, wasn't really built for research. It was built to create more content to sell ads against. And it was built thinking that people would read one book at a time.

That means Google Books didn't come with the interfaces scholars need for vast data manipulation. And it isn't marked with rigorous metadata, a term for information about each book, like author, date, and genre.

Back in August 2009, Geoffrey Nunberg, a linguist who teaches at the University of California at Berkeley's School of Information, wrote an article for The Chronicle that declared Google's metadata a "train wreck." The tags remain a "mess" today, he says. When scholars start trying large-scale projects on Google Books, he predicts, they'll have to engage in lots of hand-checking and hand-correction of the results, "because you can't trust these things."

Classification is particularly awful, he adds. A book's type—fiction, reference, etc.—is key information for a scholar like Mr. Jockers, who can't track changes in fiction if he doesn't know which books are novels. "The average book before 1970 at Google Books is misclassified," Mr. Nunberg says....

Last updated: October 4, 2016