Telegraph is an adaptive dataflow system that can be used to run live queries over unpredictable deep web sources. We are also extending Telegraph with the capability to "trawl" large amounts of data from the FFF by running recursive queries over multiple data sources. In this talk, I will give an overview of the technology behind Telegraph and demonstrate the power of queries and trawls over the deep web. As part of the talk, I will demonstrate the Telegraph application we developed for the 2000 presidential election, which joins and breaks down facts and figures about presidential campaign donations, donor data, demographics, and other deep web data of interest.
In addition, I will briefly discuss issues of privacy, economics, and social policy that arose during the election, and describe some of the future work on these issues that is getting under way with colleagues at Berkeley. See http://fff.cs.berkeley.edu for some demos.
Surprisingly, until now, no studies have derived web design guidelines directly from web sites that have been assessed by human judges. We report the results of empirical analyses of page-level metrics on a large collection of expert-reviewed web sites. These metrics concern page composition (e.g., word count, link count, graphic count), page formatting (e.g., emphasized text, text positioning, and text clusters), and overall page characteristics (e.g., page size and download speed). If we constrain predictions to pages within a single category, such as education, community, living, or finance, we can predict with 80 percent accuracy on average whether a web site will receive high ratings.
These results provide an empirical foundation for web site design guidelines and also suggest which metrics can be most important for evaluation via user studies.
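To make the page composition and formatting metrics concrete, here is a minimal sketch of how a few of them might be computed from raw HTML. The abstract does not say how the metrics were actually measured, so everything here is an assumption for illustration: the class name `PageMetrics`, the function `page_metrics`, the choice of tags counted as emphasis, and the use of Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class PageMetrics(HTMLParser):
    """Collects simple page-level counts (hypothetical, not the study's tool)."""

    # Assumption: these tags are treated as "emphasized text".
    EMPHASIS_TAGS = {"b", "strong", "i", "em"}
    # Text inside these tags is not visible to the reader, so it is ignored.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.word_count = 0
        self.link_count = 0
        self.graphic_count = 0
        self.emphasized_word_count = 0
        self._emphasis_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
        elif tag == "img":
            self.graphic_count += 1
        elif tag in self.EMPHASIS_TAGS:
            self._emphasis_depth += 1
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS_TAGS and self._emphasis_depth > 0:
            self._emphasis_depth -= 1
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        words = len(data.split())
        self.word_count += words
        if self._emphasis_depth:
            self.emphasized_word_count += words


def page_metrics(html: str) -> dict:
    """Return a dictionary of page-level metrics for one HTML document."""
    parser = PageMetrics()
    parser.feed(html)
    parser.close()
    return {
        "word_count": parser.word_count,
        "link_count": parser.link_count,
        "graphic_count": parser.graphic_count,
        "emphasized_word_count": parser.emphasized_word_count,
        "page_size_bytes": len(html.encode("utf-8")),
    }
```

Calling `page_metrics(html_text)` on each page of a rated site would yield one feature vector per page, which could then be fed to any standard classifier to predict ratings; the choice of classifier is left open here because the abstract does not name one.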
Robert Wilensky Robust Hyperlinks and Robust Locations
The work we report in this talk focuses on techniques for detecting and tracking opinions in such on-line forums. A key objective for us is to develop a robust, scalable approach to modeling opinions about various subjects of interest, such as world events, stock market performance, the latest movies, and new technologies.
We will present some results from our experiments with a family of models for detecting and tracking movie "buzz" on Usenet. These models were implemented using available commercial technology, and our results show that interesting patterns of discussion emerge as movies move from initial announcements, through pre-release marketing, to the opening weekend, and then on to general release.
In our presentation, we will describe the types of tools we use to gather the data and why we selected them. We will also describe the type of data returned, how it is organized, and what we are doing with it. Finally, we will discuss the advantages and disadvantages of this technique and our experience, both short term and long term, with Amazon and the other web mining and data collection activities we have been involved in.