Dec 16, 2010

Geoff Nunberg Provides Context to the Google Books Research Corpus

From Chronicle of Higher Education

Counting on Google Books

By Geoffrey Nunberg

Humanities scholars may someday count as a watershed the paper that appeared on Wednesday in Science, titled "Quantitative Analysis of Culture Using Millions of Digitized Books." But they'll have certain things to get past before they can appreciate that.

The paper describes some examples of quantitative analysis performed on what is by far the largest corpus ever assembled for humanities and social-science research. Culled from Google Books, it contains more than five million books published between 1800 and 2000 (at a rough estimate, 4 percent of all books ever published), of which two-thirds are in English and the others distributed among Chinese, French, German, Hebrew, Russian, and Spanish. The English corpus alone contains some 360 billion words, a size that permits analyses on a scale that isn't possible with collections like the Corpus of Historical American English, at Brigham Young University, which tops out at a mere 410 million words....

It's unlikely that "the whole field" of literary studies, or any other field, will take up these methods, though the data will probably figure in the literature the way observations about origins and etymology do now. But I think Trumpener is quite right to predict that second-rate scholars will use the Google Books corpus to churn out gigabytes of uninformative graphs and insignificant conclusions. Then again, it isn't as if those scholars would be doing more valuable work if they were approaching literature from some other point of view.

This should reassure humanists about the immutably nonscientific status of their fields. Theories of what makes science science come and go, but one constant is that it proceeds by the aggregation of increments great and small, so that even the dullards have something to contribute. As William Whewell, who coined the word "scientist," put it, "Nothing which was done was useless or unessential." Humanists produce reams of work that is precisely that: useless because it's merely adequate. And the humanities resist the standardizations of method that make possible the structured collaborations of science, with the inevitable loss of individual voice. Whatever precedents yesterday's article in Science may establish for the humanities, the 12-author paper won't be one of them.

Geoffrey Nunberg, a linguist, is an adjunct full professor in the School of Information at the University of California at Berkeley.

Last updated: October 4, 2016