Template extraction based on menu information

Julian Alarte, David Insa, Josep Silva, Salvador Tamarit
Universitat Politècnica de València
Departamento de Sistemas Informáticos y Computación
Camino de Vera s/n, E-46022, Valencia, Spain.
jalarte@dsic.upv.es  dinsa@dsic.upv.es  jsilva@dsic.upv.es  stamarit@dsic.upv.es

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. Templates are also useful for final users, because they provide uniformity and a common look and feel across all webpages. From the point of view of crawlers and indexers, however, templates are a significant problem, because they usually contain irrelevant information such as advertisements and banners. Processing and storing this information is likely to waste resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work, we propose a novel method for automatic template extraction that is based on a similarity analysis between the DOM trees of a collection of webpages that are detected using menu information. Our implementation and experiments demonstrate the usefulness of the technique.

1 Introduction

A web template is a prepared HTML page where the formatting is already implemented and the visual components are ready to receive content. Web templates are used as a basis for composing new webpages that share a common look and feel. This benefits web development because many tasks can be automated thanks to the reuse of components. In fact, many websites are maintained automatically by code generators, which generate webpages using templates. Web templates are also good for users, who can benefit from intuitive and uniform designs with a common vocabulary of colored and formatted visual elements.
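To give an intuition of the kind of DOM-based similarity analysis mentioned in the abstract, the following minimal sketch compares the DOM trees of two webpages from the same site and keeps the nodes they share (a rough approximation of the template). The `Node`, `TreeBuilder`, and `shared_template` names, as well as the two toy pages, are illustrative inventions, not the paper's actual algorithm, which additionally exploits menu information to select the pages to compare.

```python
from html.parser import HTMLParser

class Node:
    """A minimal DOM node: tag name, accumulated text, children."""
    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a Node tree from an HTML string using the stdlib parser."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()

def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root

def shared_template(a, b):
    """Top-down intersection of two DOM trees: a node survives when its
    tag matches in both pages; its text survives only when identical."""
    if a.tag != b.tag:
        return None
    node = Node(a.tag)
    if a.text == b.text:
        node.text = a.text
    for child_a, child_b in zip(a.children, b.children):
        child = shared_template(child_a, child_b)
        if child is not None:
            node.children.append(child)
    return node

# Two pages sharing a menu pagelet but with different main content.
page1 = "<html><body><div id='menu'>Home News</div><p>Article one</p></body></html>"
page2 = "<html><body><div id='menu'>Home News</div><p>Article two</p></body></html>"

template = shared_template(parse(page1), parse(page2))
```

In this toy run the shared menu text survives in the intersection while the article text, which differs between the two pages, is discarded; a real extractor would of course need a more robust tree alignment than positional `zip`.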
In contrast, web templates are a significant problem for crawlers and indexers, because these tools judge the relevance of a webpage according to the frequency and distribution of its terms and hyperlinks. Since templates contain a considerable number of common terms and hyperlinks that are replicated across a large number of webpages, the computed relevance may turn out to be inaccurate, leading to incorrect results [1, 15, 17]. Moreover, templates generally do not contain relevant content; they usually contain one or more pagelets [5, 1] (i.e., self-contained logical regions with a well-defined topic or functionality) where the main content must be inserted. The main content of a webpage is often complementary to its template. Therefore, detecting templates allows indexers to identify the main content, which usually resides inside a specific pagelet of the template.

Modern crawlers and indexers do not treat all terms in a webpage in the same way. Webpages are preprocessed to identify the template, because template extraction allows them to identify those pagelets that only contain noisy information such as advertisements and banners. This content should not be indexed. Indexing the non-content part of templates not only affects accuracy; it also affects performance and is, in general, a waste of storage space, bandwidth, and time. Template extraction helps indexers isolate the main content. This allows us to enhance indexers by assigning higher weights to the truly relevant terms. Once templates have been extracted, they should