A domain-independent architecture for efficient information retrieval on the 
World Wide Web 

by Marc Langheinrich 


Advisors 

Professor Ipke Wachsmuth, University of Bielefeld, Faculty of Technology 

Professor Oren Etzioni, University of Washington, Computer Science & 
Engineering 


Abstract 

The World Wide Web's rapid growth in recent years has provided a wealth of 
online information. With the Web already comprising more than a hundred 
million documents, finding a particular page has become a daunting task of 
battling the Web's "information overload". 

Most popular methods of finding information on the Web are either 
notoriously imprecise or often incomplete: some searches can easily return 
hundreds or thousands of irrelevant pages, while others might fail to 
include even a single relevant one. 

This thesis presents a novel architecture called "Dynamic Reference 
Sifting", which attempts to combine the comprehensiveness of Web indices, 
such as AltaVista or Hotbot, with the accuracy of Web directories, such as 
Yahoo. Dynamic Reference Sifting uses the output of general-purpose search 
services, combined with additional, orthogonal information sources; 
domain-specific heuristics; and a flexible categorization scheme to filter 
out all but the single correct page. 
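
As a rough illustration of the filtering idea described above, the following 
Python sketch sorts search-engine candidates into buckets and returns the 
best non-empty one. It is not the thesis's implementation: the names 
Candidate, matches_name, matches_host and sift, the bucket labels, and the 
example data are all hypothetical, and the two checks merely stand in for a 
domain-specific heuristic and an orthogonal information source. 

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Candidate:
    """One reference returned by a general-purpose search service."""
    url: str
    title: str


def matches_name(candidate, name):
    """Hypothetical domain heuristic: every name token occurs in title or URL."""
    haystack = (candidate.title + " " + candidate.url).lower()
    return all(token in haystack for token in name.lower().split())


def matches_host(candidate, known_hosts):
    """Hypothetical orthogonal check: the page lives on an expected host."""
    host = urlparse(candidate.url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in known_hosts)


def sift(candidates, name, known_hosts):
    """Categorize candidates into buckets; return the best non-empty bucket."""
    buckets = {"both": [], "name_only": [], "host_only": []}
    for c in candidates:
        by_name = matches_name(c, name)
        by_host = matches_host(c, known_hosts)
        if by_name and by_host:
            buckets["both"].append(c)
        elif by_name:
            buckets["name_only"].append(c)
        elif by_host:
            buckets["host_only"].append(c)
        # Candidates that match neither check are discarded.
    for label in ("both", "name_only", "host_only"):
        if buckets[label]:
            return label, buckets[label]
    return "none", []


# Made-up example data, for illustration only.
candidates = [
    Candidate("http://www.cs.example.edu/~jdoe/", "Jane Doe's Home Page"),
    Candidate("http://news.example.com/story", "Interview with Jane Doe"),
    Candidate("http://www.cs.example.edu/", "Department of Computer Science"),
]
label, best = sift(candidates, "Jane Doe", {"example.edu"})
```

In this sketch only the top bucket is reported, so a single strong match 
suppresses the weaker ones; if no candidate passes both checks, the weaker 
buckets serve as a fallback rather than returning nothing. 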

Our experiments show that for certain types of pages, this approach can 
provide nearly twice the accuracy and at least the same coverage as any 
existing service. We have implemented a prototype called "Ahoy! The Homepage 
Finder", which demonstrates the feasibility of our approach. Ahoy! is 
publicly accessible on the Web, and has served more than 500,000 queries since 
it was fielded in May 1996. 

To demonstrate the domain independence and generality of our architecture, 
we also present two simple prototypes using Dynamic Reference Sifting in 
the domains of academic papers and jokes. Both systems were developed and 
implemented in less than ten days, yet proved highly successful in our 
initial experiments.