Web Crawler Middleware for Search Engine Digital Libraries: A Case Study for CiteSeerX

Jian Wu†, Pradeep Teregowda‡, Madian Khabsa‡, Stephen Carman†, Douglas Jordan‡, Jose San Pedro Wandelmer†, Xin Lu†, Prasenjit Mitra† and C. Lee Giles†‡
†Information Sciences and Technology
‡Department of Computer Science and Engineering
University Park, PA, 16802, USA
jxw394@ist.psu.edu

ABSTRACT

Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata into the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible, as it provides a universal interface to the crawl database. It supports input from multiple open source crawlers and archival formats, e.g., ARC and WARC, and can also import files downloaded via FTP. To use this middleware with another crawler, the user only needs to write a new log parser which returns a resource object with the standard metadata attributes and tells the middleware how to access the downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/postscript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit import jobs. The middleware package can also perform supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we believe this design would be appropriate for many search engine web crawling systems.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

Keywords

search engine, information retrieval, web crawling, ingestion, middleware

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WIDM'12, November 2, 2012, Maui, Hawaii, USA. Copyright 2012 ACM 978-1-4503-1720-7/12/11 ...$15.00.

1. INTRODUCTION

Crawling is a prerequisite and an essential process for operating a search engine. A focused crawler should be able to efficiently harvest designated documents from the internet. The CiteSeerX [3] digital library and search engine is designed to provide open access to academic documents in PDF and postscript formats. While a well designed vertical crawler can efficiently select documents based on their content types, it is also desirable to crawl all potentially useful files first and then selectively import documents of certain formats into the search engine repository.

Most available open source crawlers are designed for general purposes and are not customized for a particular search engine. Some web crawlers, such as Heritrix [2], have been well maintained and are widely used by digital libraries, archives, and companies¹. To take advantage of these crawlers for a digital library that mainly indexes academic documents, it is necessary to define a clear interface that integrates them with the ingestion system of the search engine. In addition, this interface should also be able to import documents that are downloaded directly via FTP.

Here, we develop a middleware, named the Crawl Document Importer (CDI), to import documents of selected formats, harvested by an open source crawler or downloaded via FTP, into the crawl database and repository before ingesting them into the CiteSeerX search engine.
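The format-selective import step described above can be sketched as a simple MIME-type filter over crawled files. This is an illustrative sketch, not CDI's actual API: the function and variable names below are our own, and the accepted types mirror the PDF/postscript formats CiteSeerX targets.

```python
import mimetypes

# Hypothetical illustration of CDI-style format selection: only files
# whose guessed MIME type is in the accepted set are imported into the
# crawl repository; everything else the crawler fetched is skipped.
ACCEPTED_TYPES = {"application/pdf", "application/postscript"}

def should_import(file_path):
    """Return True if the crawled file matches an accepted MIME type."""
    mime, _ = mimetypes.guess_type(file_path)
    return mime in ACCEPTED_TYPES

crawled = ["paper.pdf", "figure.png", "preprint.ps", "index.html"]
to_import = [f for f in crawled if should_import(f)]
```

In the real middleware the MIME type would typically come from the crawler's log or the HTTP response headers rather than the file extension; the filter logic, however, is the same.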
Heritrix is one of the most highly rated and widely used open source crawlers, so we take it as an example application. However, the middleware is designed to be extensible: for another web crawler, the user only needs to write a log parser/extractor which returns the standard metadata tuple and tells the middleware how to access the downloaded files.

The Python crawler written by Shuyi Zheng (hereafter the SYZ crawler) has been the dedicated harvester for the CiteSeerX project since 2009. This crawler is able to crawl about 5000–10000 seed URLs daily to a depth of two using a breadth-first policy. As a focused crawler, it applies a number of filter rules to selectively download free access online documents in PDF and postscript formats. As CiteSeerX expands its service to other types of documents, switching to other more reliable and well maintained crawlers becomes more efficient and desirable. The SYZ crawler is not able to import documents downloaded directly via FTP. In addition, a well-defined interface to the crawler system is necessary in order to integrate the crawler with the CiteSeerX code base. These considerations drove us to design a middleware that can work with multiple open source crawlers.

¹https://webarchive.jira.com/wiki/display/Heritrix/Users+of+Heritrix
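The plug-in contract described above can be sketched as follows. This is a minimal illustration under our own assumptions: the class name, field names, and log-line layout below are hypothetical, not the actual CDI resource object or any real crawler's log format; the point is only that a new crawler's parser must yield objects carrying the standard metadata attributes plus the location of the downloaded file.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """Hypothetical stand-in for CDI's standard metadata tuple."""
    url: str            # URL the document was fetched from
    parent_url: str     # referring page recorded by the crawler
    crawl_date: str     # fetch timestamp from the crawl log
    content_type: str   # MIME type reported by the crawler
    file_path: str      # where the downloaded file lives on disk

def parse_log_line(line):
    """Parse one line of a whitespace-delimited crawl log into a
    Resource. The field order (date, status, url, parent, mime, path)
    is illustrative, not the format of any particular crawler."""
    date, status, url, parent, mime, path = line.split()
    if status != "200":
        return None  # skip failed fetches
    return Resource(url, parent, date, mime, path)

line = ("2012-11-02T12:00:00Z 200 http://example.org/a.pdf "
        "http://example.org/ application/pdf /data/crawl/a.pdf")
res = parse_log_line(line)
```

A parser written against such a contract is all the middleware needs from a new crawler: the rest of the pipeline (format filtering, text extraction, repository and database writes) remains unchanged.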