Web Mining

at

UC Berkeley

June 20, 2001

 

The "surface Web" of static HTML pages is only a small fraction of the content accessible via the WWW; much more data lies in "deep Web" databases that are accessible via HTML forms. Large amounts of additional content are available on email lists, on Usenet news sites, and in chat rooms. Recently, a number of researchers in different fields have independently developed ways to mine the trove of textual and numeric data available in both the "surface" and the "deep" Web.

Examples:

  1. A database researcher at Berkeley has been able to cross-tabulate public information from the Federal Election Commission with other online databases to offer revealing information about donor characteristics.
  2. A major publisher has used online book transactions to determine patterns of sales by topic, thereby improving its set of offerings and its inventory management.
  3. Another researcher has used natural language processing techniques to extract opinions from online discussion groups about which movies are generating "buzz".
  4. Usability researchers at UC Berkeley have found that they can accurately predict human ratings of web sites using computer-collected measures of page composition and page formatting. These results can be used as preliminary indicators of effective Web site design.

There are many other examples of using the Web to extract competitive intelligence, real-time social statistics, and other sorts of data that can be useful for business decisions. The one-day workshop on "Web Mining" at Berkeley will bring together a number of people working in this area to describe their work. The intention is to provide a forum to exchange ideas, techniques, and applications in Web mining.

The Web mining workshop is by invitation only, but is open to researchers in the area, SIMS Affiliates, and other interested parties.