Revealing a web of knowledge: mining quotations and ideas from a very large digital library of books

Tuesday, November 17, 2009

12:30 pm - 2:00 pm

Bill Schilit, Google Research, Book Search

Scanning books, magazines, and newspapers is widespread because people believe a great deal of the world's information still resides offline. In general, after works are scanned they are OCR'ed, indexed for search, and processed to add links. In this talk I will describe a new approach to automatically adding links by mining repeated passages. This technique connects elements that are semantically rich, so the resulting relations are strong. Moreover, link targets point to a passage within a work rather than to the entire work, facilitating navigation. Our system has been run on a digital library of many millions of books (Google Book Search), has been used by thousands of people, and has generated the world's largest collection of quotations. I will also present a follow-on project based on the theory that authors copy passages from book to book because these quotations capture an idea particularly well: Jefferson on liberty, Stanton on women's rights, and Gibson on cyberpunk. These projects suggest that mining quotations for links and ideas is an important mechanism for understanding the knowledge contained in books.

Bill Schilit is part of the Google Research team and an adopted member of the Book Search group. Before joining Google, Bill was co-director of the Intel Research lab in Seattle, managed digital library and mobile computing research at Fuji-Xerox (FXPAL), worked on distributed computing at AT&T's Bell Labs, and was part of the team that developed Ubiquitous Computing at PARC. He is a Fellow of the IEEE, Associate Editor-in-Chief of Computer Magazine and a past member of the Board of Governors of the IEEE Computer Society. Bill received a Ph.D. from Columbia University in 1995.

Last updated: March 26, 2015
