Using the Web Infrastructure for Just-In-Time Recovery of Missing Web Pages
[Extended Abstract]

Martin Klein
Department of Computer Science
Old Dominion University
Norfolk, VA 23529
+1-757-683-6001
mklein@cs.odu.edu

ABSTRACT
The Internet provides access to a great number of web sites, but the structure of the web is constantly changing. Missing web pages remain a pervasive problem that users experience every day. This dissertation is about creating a method to overcome this problem by automatically mapping between Uniform Resource Identifiers (URIs) and the textual content of web pages using lexical signatures (LSs) and tags. We introduce a “just-in-time” approach to supporting the preservation of web content that relies on the “living” web. We propose a method to harness the collective behavior of the Web Infrastructure and investigate the suitability of lexical signatures and tags to give a “good enough” description of the “aboutness” of missing pages. Querying Internet search engines with these LSs will return the replacement page, or a very similar page, which can be provided to the user. We investigate the evolution of lexical signatures over time and propose a framework to aid in the creation of LSs. Analyzing snapshots of the web from recent years will enable us to investigate the decay of such lightweight descriptions as well as the characteristics of missing pages (HTTP error code 404). We propose to evaluate and measure the quality of the framework with information retrieval methods such as precision and recall.

1. INTRODUCTION
Research on digital libraries (DLs) usually includes information retrieval and digital preservation aspects, along with diverse models for creating and adding technical, descriptive and administrative metadata to the digital objects. Digital preservation projects typically involve controlled environments and collections.
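As background for the approach outlined in the abstract, the core lexical-signature idea (reduce a page to a handful of high-weight terms and submit them as a search-engine query) can be sketched as follows. This is a minimal illustration, assuming a simple TF-IDF weighting over a made-up background corpus; the term-weighting scheme, stopword list, and example texts are assumptions for demonstration, not the dissertation's exact method.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "with", "from", "to", "in"}

def tokenize(text):
    """Lowercase alphabetic tokens, minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def lexical_signature(doc, corpus, k=5):
    """Top-k terms of `doc` ranked by TF-IDF against a background corpus."""
    tf = Counter(tokenize(doc))
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n = len(corpus)
    def tfidf(term):
        # Smoothed IDF so terms unseen in the corpus do not divide by zero.
        return tf[term] * math.log((1 + n) / (1 + df[term]))
    return sorted(tf, key=tfidf, reverse=True)[:k]

# Hypothetical page text and background corpus.
corpus = [
    "mars rover landing site selection report",
    "digital library preservation metadata models",
    "web archive crawler statistics for missing pages",
]
doc = "recovering missing web pages with lexical signatures from web archives"
ls = lexical_signature(doc, corpus, k=5)
print(" ".join(ls))  # this string would be submitted as a search-engine query
```

The resulting handful of terms acts as the lightweight description of the page's “aboutness”: when the original URI returns a 404, the signature is sent to a search engine and the top-ranked result is offered as a replacement or near-duplicate page.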
They tend to focus on actively providing in-depth preservation services such as refreshing, migration and emulation within these environments.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL ’07 Vancouver, Canada
Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.

Refreshing is the copying of bits to different systems and migration is the transferring of data to newer system environments [27]. Emulation is replicating the functionality of an obsolete system [23]. We define the Web Infrastructure (WI) as the collection of commercial web search engines (Google, Yahoo!, MSN, etc.), web archives operated by non-profit institutions (e.g., the Internet Archive’s “Wayback Machine”) and research projects (e.g., CiteSeer and NSDL). The WI provides what can be called in vivo preservation: preservation that occurs naturally in the “living web”. It is not guaranteed by an in-depth institutional commitment to a particular archive, but is achieved by the often involuntary, low-fidelity, distributed efforts of millions of individual users, web administrators and commercial services; it can consequently be considered a passive approach to preservation. Although the WI does not yet offer emulation, it does offer refreshing and migration, albeit with somewhat uneven results. Figure 1 shows the WI refreshing and migrating web documents. The original document was published as a NASA technical memorandum in 1993 as a compressed PostScript (.ps.Z) file on a “nasa.gov” machine. There are 13 versions indexed by Google Scholar, 3 versions in CiteSeer and 3 cached versions in the Internet Archive (IA). Although NASA eventually migrated the report to PDF, CiteSeer performed that migration independently, as well as a migration to PNG. Yahoo! and Google provide dynamic conversion to HTML. Even if the report were to be removed from nasa.gov web servers, it has been refreshed and migrated so deeply into the WI that it would be difficult to eradicate completely. By the same token, the various copies are not easy to locate; there could be many more copies in the WI than are shown in Figure 1.

1.1 Background
“We can’t save everything” is often heard (cf. [17]) when discussing digital preservation, specifically the preservation of web pages. While this is certainly true, it provides no guidance as to what should be saved. Intuitively, one would probably say: save all of the important materials, such as the Bible or the United States Declaration of Independence. But the truth is that, due to their historical, religious and philosophical value, these texts have already been migrated from physical to digital format, and those digital documents have been refreshed to many locations and migrated to many different formats. So it is safe to say we are in little danger of