Speakers and Topics (Preliminary)

 
Joe Hellerstein: Federated Facts and Figures
It has become apparent that the "World Wide Web" comprises a small fraction of the data available on the Internet. The metaphor of a web was motivated by network hypertext, but the volume of hypertext on the Internet is dwarfed by the amount of data made available in networked databases provided by directory services, information portals, government agencies, scientists and a host of other providers. This data is sometimes called the "deep web" or the "hidden web", but in fact it is not very web-like: it has no static inbound hyperlinks, so it is not accessed by the webcrawlers of search engines. Moreover, a large fraction of it is made up of structured "facts and figures", not full-text prose. For both of these reasons, this data is largely untapped as a resource for any use other than point lookups. Recent studies estimate the size of this data as being 7.5 petabytes -- 400 to 550 times larger than the hypertext indexed by search engines. This enormous information resource is clearly underutilized by today's web technologies. The Telegraph project at UC Berkeley is exploring the mechanisms for -- and consequences of -- aggressively leveraging this resource.

Telegraph is an adaptive dataflow system, which can be used to run live queries over unpredictable deep web sources. We are also extending Telegraph with the capability to "trawl" large amounts of data from these Federated Facts and Figures (FFF), by running recursive queries over multiple data sources. In this talk, I will give an overview of the technology behind Telegraph, and demonstrate the power of queries and trawls over the deep web. As part of the talk, I will demonstrate the Telegraph application we developed for the 2000 presidential election, which joins and breaks down facts and figures about presidential campaign donations, donor data, demographics, and other deep web data of interest.

In addition, I will briefly discuss issues of privacy, economics and social policy that arose during the election, and discuss some of the future work on these issues that is beginning with colleagues at Berkeley. See http://fff.cs.berkeley.edu for some demos.

Marti Hearst: Empirical Foundations for Designing Usable Web Sites
There is currently much debate about what constitutes good web site design. Many detailed usability guidelines have been developed for both general user interfaces and for web page design. However, designers have historically experienced difficulties following design guidelines. Furthermore, there is no general agreement about which web design guidelines are correct. A recent survey of 21 web guidelines found little consistency among them. We suspect this might result from the fact that there is a lack of empirical validation for such guidelines.

Surprisingly, until now, no studies have derived web design guidelines directly from web sites that have been assessed by human judges. We report the results of empirical analyses of the page-level elements on a large collection of expert-reviewed web sites. These metrics concern page composition (e.g., word count, link count, graphic count), page formatting (e.g., emphasized text, text positioning, and text clusters), and overall page characteristics (e.g., page size and download speed). If we constrain predictions to be among pages within categories such as education, community, living, and finance, we can predict with 80 percent accuracy on average whether a web site will receive high ratings.
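
To make the page-composition metrics concrete, here is a minimal sketch of how a few of them (word count, link count, graphic count, page size) might be computed from raw HTML. It is an illustration only, not the measurement tool used in the study: the class name, function name, and the exact counting rules (for example, ignoring text inside script and style elements) are assumptions.

```python
from html.parser import HTMLParser


class PageMetrics(HTMLParser):
    """Collects a few page-composition counts (hypothetical counting rules)."""

    def __init__(self):
        super().__init__()
        self.word_count = 0
        self.link_count = 0
        self.graphic_count = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
        elif tag == "img":
            self.graphic_count += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count whitespace-separated tokens in visible text only.
        if self._skip_depth == 0:
            self.word_count += len(data.split())


def page_metrics(html):
    """Return a dict of simple page-level metrics for one HTML string."""
    parser = PageMetrics()
    parser.feed(html)
    parser.close()
    return {
        "word_count": parser.word_count,
        "link_count": parser.link_count,
        "graphic_count": parser.graphic_count,
        "page_bytes": len(html.encode("utf-8")),
    }
```

Metrics like these would then serve as input features to whatever model predicts the expert ratings; that prediction step is not sketched here.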

These results provide an empirical foundation for web site design guidelines and also suggest which metrics can be most important for evaluation via user studies.

Robert Wilensky: Robust Hyperlinks and Robust Locations

URLs can be made robust so that if a web page moves to another location anywhere on the web, you can still find it, even if that page has been edited. (If the page has been deleted and no mirrors are available, you'll have to try something else, obviously.) To make a Robust Hyperlink, today's address-based URLs are augmented with a content-based lexical signature of about five words. When the URL's address-based portion breaks, the signature is fed into any web search engine to find the new site of the page. Using our free, open source software, you can automatically rewrite your web pages and bookmarks files to make them robust.
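
The idea of augmenting a URL with a lexical signature can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' software: the abstract does not say how the signature words are chosen, so the sketch simply picks the words that are rarest according to a hypothetical document-frequency table (`doc_freq`), and the query parameter name `lexical-signature` is likewise assumed for illustration.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit


def lexical_signature(text, doc_freq, n_terms=5):
    """Pick the n_terms rarest words of a page as its signature.

    doc_freq maps a word to how many documents contain it (a made-up
    table here); words missing from the table count as rarest.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    ranked = sorted(words, key=lambda w: (doc_freq.get(w, 0), w))
    return ranked[:n_terms]


def make_robust(url, signature, param="lexical-signature"):
    """Append the signature to an address-based URL as a query parameter."""
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({param: " ".join(signature)})


def fallback_query(robust_url, param="lexical-signature"):
    """Recover the signature words to hand to a search engine
    when the address-based part of the URL no longer resolves."""
    values = parse_qs(urlsplit(robust_url).query).get(param, [""])
    return values[0]
```

If the page moves, a client would try the plain address first and, on failure, submit the string returned by `fallback_query` to a search engine of its choice.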

Richard Tong: Detecting and Tracking Opinions in On-Line Discussions
On-line public discussions (e.g., chat rooms, Usenet newsgroups) are a potentially rich source of information about emerging patterns and trends. These sources are very "noisy", though, and present us with a number of interesting technical challenges as we try to extract the useful "signal."

The work we report in this talk focuses on techniques for detecting and tracking opinions in such on-line forums. A key objective for us is to develop a robust, scalable approach to modeling opinions about various subjects of interest, such as world events, stock market performance, the latest movies, and new technologies.

We will present some results from our experiments with a family of models for detecting and tracking movie "buzz" on Usenet. These models were implemented using available commercial technology and our results show that interesting patterns of discussion do emerge as movies move from initial announcements, through pre-release marketing, to the opening weekend, and then on to general release.

 

 

Madeline Schnapp and Tim Allwine: Mining of Book Data from Amazon.com
O'Reilly Research, a division of O'Reilly & Associates, has developed a suite of tools that access data gathered by mining the Amazon.com web site. These tools include a real time graphing tool that will graph rank data as a function of time for all of O'Reilly's books plus a discrete number of competitors' books. Additional tools include a web interface to a specialized real time search tool that searches the Amazon web site for books that match a query string and returns a file sorted by rank. Most recently, we have developed a tool that now gathers information on over 20,000 computer titles. With this information, we are able to understand the who, what, when and how of other publishers in the computer book publishing space with a degree of granularity unavailable up to this point.

In our presentation, we will present what types of tools we use to gather the data and why we selected these tools. We will also present what type of data is returned, how it is organized, and what we are doing with this data. Finally, we will discuss the advantages and disadvantages of this type of technique and our experience with both Amazon and other web data mining activities we have been involved in, both short term and long term.

 

 
Rashmi Sinha: Comparing Human Recommenders to Online Systems
The design of better online Recommender Systems (RS) requires a thorough analysis of the social filtering process that RS are trying to replace. In a series of studies, we directly compared book and movie recommendations made by the user's friends to those made by online RS. We tested three book RS (Amazon's Recommendation Wizard, RatingZone's QuickPicks, and Sleeper), and three movie RS (Amazon's Recommendation Wizard, Reel.com's QuickPicks, and MovieCritic). Our results indicated that the human recommenders (the user's friends) consistently provided better recommendations than RS. However, users did find that items recommended by online RS were often "new" and "unexpected", while the items recommended by friends mostly served as reminders of items users had already planned to pursue. Usability evaluation of the RS showed that users did not mind providing more input to the system in order to get better recommendations. Also, users trusted a system more if it recommended items that they had previously liked.
 
 
Warren Sack: Conversation Map
Conversation Map is a system that combines and extends a set of discourse analysis techniques from sociolinguistics, computational corpus-based linguistics, and sociology. Given the archive of a very large-scale conversation (VLSC), e.g., an archive of Usenet newsgroup messages, the Conversation Map system analyzes the content and the relationships between several thousand messages and then uses the results of the analysis to create a graphical interface. With the graphical interface, one can see the social and semantic relationships that have emerged over the course of a VLSC. The Conversation Map system computes and then graphs out who is "talking" to whom and what they are "talking" about. Also, it identifies themes of conversation and the central terms and possible metaphors or definitions that have been produced by a VLSC. Demonstrations of the system can be found online here: http://www.sims.berkeley.edu/~sack/CM. I am engaged in a participatory design process in conjunction with collaborators in social science, computational linguistics and interface design in order to extend and improve the current system. Moreover, the system is being tested and critiqued by newsgroup participants. The text analysis portions of the system are implemented in 8000 lines of Perl and the graphical interface is implemented mostly in 4000 lines of Java with the addition of several CGI scripts written in Perl.
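
As a rough illustration of the "who is talking to whom" computation, the sketch below builds a weighted reply graph from a message archive. It is not the Conversation Map implementation (which is written in Perl and Java): the message fields `id`, `author`, and `reply_to` are hypothetical, and treating a reply as one person "talking" to another is an assumption made for this example.

```python
from collections import Counter


def reply_graph(messages):
    """Count how often each author replies to each other author.

    Each message is a dict with hypothetical keys:
      "id"       - unique message identifier
      "author"   - who wrote the message
      "reply_to" - id of the message being answered, or None
    Returns a Counter mapping (replier, original_author) to a reply count.
    """
    author_of = {m["id"]: m["author"] for m in messages}
    edges = Counter()
    for m in messages:
        parent = m.get("reply_to")
        # Skip thread starters and replies whose parent is not in the archive.
        if parent in author_of:
            edges[(m["author"], author_of[parent])] += 1
    return edges
```

The resulting edge counts could be handed to any graph-drawing tool; identifying what the participants are talking about would require separate text analysis that is not sketched here.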

Mike Gebbie: Competitors+
Competitors+ creates an at-a-glance list of a company's competitors. This is a boon to researchers, stock market junkies, managers and others who need to quickly understand who's who in industries that are new to them or rapidly changing. In our web prototype, over 130,000 news articles spanning 2 months were used to create lists that were 87 percent accurate. Unlike services like Hoover's Online that rely on "industry experts" to create competitor lists, Competitors+ is completely automated, and therefore much cheaper and easier to maintain.