Generating and Retrieving Text Segments for Focused Access to Scientific Documents

Caterina Caracciolo, Maarten de Rijke

ISLA, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
{caterina, mdr}@science.uva.nl

Abstract. When presented with a retrieved document, users of a search engine are usually left with the task of pinning down the relevant information inside the document. Often this is done by a time-consuming combination of skimming, scrolling and Ctrl+F. In the setting of a digital library for scientific literature the issue is especially urgent when dealing with reference works, such as surveys and handbooks, as these typically contain long documents. Our aim is to develop methods for providing a “go-read-here” type of retrieval functionality, which points the user to a segment where she can best start reading to find out about her topic of interest. We examine multiple query-independent ways of segmenting texts into coherent chunks that can be returned in response to a query. Most (experienced) authors use paragraph breaks to indicate topic shifts, thus providing us with one way of segmenting documents. We compare this structural method with semantic text segmentation methods, both with respect to topical focus and relevancy. Our experimental evidence is based on manually segmented scientific documents and a set of queries against this corpus. Structural segmentation based on contiguous blocks of relevant paragraphs is shown to be a viable solution for our intended application of providing “go-read-here” functionality.

1 Introduction

The growing number of scientific publications available in electronic format has changed the way people relate to documents.
Working within the scientific domain, Tenopir and King [32] observe that researchers now tend to read more articles than before, but that, on average, the time dedicated to each article has shrunk and readers very rarely read an entire article; instead, they browse and skim the document, possibly doing attentive reading of only some parts of it. Increasingly, people use a “locate-and-read” strategy instead of the more traditional “read-and-locate” typical of a paper environment.

Currently, there are several examples where a kind of “go-read-here” functionality is available or being explored. For example, some general web search engines help users in their search “within” retrieved documents by providing links labeled “HTML version” (for non-HTML documents) and “In cache” (which takes the user to a cached version of the document where query words are highlighted). In the setting of document-centric XML retrieval, the search engine