K-Graph: Selecting Top-k Data Sources for XML Keyword Queries

Khanh Nguyen and Jinli Cao

Department of Computer Science and Computer Engineering
La Trobe University, Melbourne, Australia
{tuan.nguyen, j.cao}@latrobe.edu.au

Abstract. Existing approaches to XML keyword search mostly focus on querying a single data source. However, searching over hundreds or even thousands of (distributed) data sources by sequentially querying every one of them is extremely costly and can therefore be impractical. In this paper, we propose an approach for selecting the top-k data sources for a given query in order to avoid the high cost of searching numerous, potentially irrelevant data sources. The proposed approach can efficiently select the top-k most relevant data sources without querying the data sources themselves. We propose a ranking function for measuring the strength of correlation between keywords in a data source, and we summarize the data sources as keywords correlation graphs (K-Graphs). The top-k relevant data sources are then selected by estimating the relevance of the corresponding K-Graphs to the query. Experimental results show that the approach achieves good performance under a variety of experimental parameters.

1 Introduction

Extensible Markup Language (XML) has become a de facto standard for representing and exchanging data, resulting in the proliferation of XML documents distributed over the internet. Traditionally, XML data are retrieved using structured query languages such as XPath and XQuery, for which users have to learn both the data schema and the query language in order to issue queries effectively. Since the data schema and the query languages may be complex, retrieving XML data using XPath/XQuery is usually limited to advanced users.
In that context, keyword-based search over XML data has been proposed as a means to free users from the learning curve of structured query languages, and it has attracted significant attention from researchers in both information retrieval and databases. Querying XML data using keyword-based search has been widely studied in the literature [1–7]; however, most existing approaches focus on query processing over a single data source. Searching through hundreds or even thousands of data sources by sequentially querying each one is extremely expensive and may not be practical, and efficient query processing even over a single data source is already a challenging problem [8–12]. Efficient query processing over a system that integrates numerous data sources is much more challenging still.