MIMS Final Project 2009

    The established formats for publishing metadata about a website’s pages, the robots.txt file and sitemaps, tell crawlers where not to go and where to go on a site. This is sufficient as input for crawlers, but it does not allow websites to publish richer metadata about their structure, such as the navigational structure. This project studies the availability of website metadata on today’s Web in terms of available information resources and quantitative aspects of their contents. Such an analysis not only makes it easier to understand what data is available today, but also serves as the foundation for investigating what kinds of information retrieval processes could be driven by that data. Using data gathered in this study, we designed and prototyped a system for generating a site’s most useful pages (called ulinks). Our system is similar to, albeit much simpler than, the sitelinks shown by leading search engines. Our analysis of the limitations of our ulink generation system shows that if websites had richer data formats for publishing metadata, ulink generation could be much improved.
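
To make the crawler-oriented nature of these formats concrete, the following is a minimal sketch of what a robots.txt file exposes: exclusion rules and pointers to sitemaps, but nothing about navigational structure. It uses Python's standard urllib.robotparser module (site_maps() requires Python 3.8 or later). The robots.txt content, the example.org URLs, and the summarize_robots helper are hypothetical illustrations, not data or code from the project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (not taken from the study).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.org/sitemap.xml
"""


def summarize_robots(text: str, agent: str = "*") -> dict:
    """Return the metadata a crawler can extract from robots.txt text.

    Hypothetical helper: it reports only what the format can express,
    namely listed sitemaps and whether given URLs may be fetched.
    """
    parser = RobotFileParser()
    parser.parse(text.splitlines())  # parse in-memory lines, no network access
    return {
        # site_maps() returns None when no Sitemap line is present
        "sitemaps": parser.site_maps() or [],
        "private_allowed": parser.can_fetch(
            agent, "https://example.org/private/page.html"
        ),
        "public_allowed": parser.can_fetch(
            agent, "https://example.org/index.html"
        ),
    }


if __name__ == "__main__":
    print(summarize_robots(ROBOTS_TXT))
```

The summary contains where a crawler may and may not go and where the sitemap lives; nothing in it indicates which pages matter most to visitors, which is the kind of gap that richer metadata formats would have to fill.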


October 7, 2016