Mining “Hidden Phrase” Definitions from the Web

Hung. V. Nguyen, P. Velamuru, D. Kolippakkam, H. Davulcu, H. Liu
Department of Computer Science and Engineering
Arizona State University, Tempe, AZ, 85287, USA
{hung, prasanna.velamuru, n.kolippakkam, hdavulcu, hliu}@asu.edu

M. Ates
Cash-Us.com
21 Helen Way, Berkeley Heights, NJ, 07922, USA
mates@cash-us.com

Abstract. Keyword searching is the most common form of document search on the Web. Many Web publishers manually annotate the META tags and titles of their pages with frequently queried phrases in order to improve their placement and ranking. A “hidden phrase” is defined as a phrase that occurs in the META tag of a Web page but not in its body. In this paper we present an algorithm that mines the definitions of hidden phrases from the Web documents. Phrase definitions allow (i) publishers to find relevant phrases with high query frequency, and, (ii) search engines to test if the content of the body of a document matches the phrases. We use co-occurrence clustering and association rule mining algorithms to learn phrase definitions from high-dimensional data sets. We also provide experimental results.

1. Introduction

Keyword searching is the most common form of document search on the Web. Most search engines do their text query and retrieval using keywords. Search engines pull out and index words and phrases that are deemed significant. Phrases and words that are mentioned in the URL, TITLE or META tags of the document as well as those that are repeated many times in the body are more likely to be deemed important. The average keyword query length is under three words (2.2 words [3], 2.8 words [5]). Nowadays, many Web publishers frequently use phrase frequency databases, like Overture [12] and Word Tracker [13], to identify phrases that are queried with high frequency and attach them to their document titles or META tags in order to improve their placement and ranking.
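The hidden-phrase test stated in the abstract (a phrase that occurs in the META tag of a page but not in its body) can be illustrated with a minimal sketch. It is not the mining algorithm of this paper. It assumes that phrases are taken from a comma-separated `keywords` META tag and that "occurs in the body" means a case-insensitive match of the same word sequence in the visible body text. The names `hidden_phrases`, `_PageParser` and `_normalize` are hypothetical.

```python
from html.parser import HTMLParser
import re


class _PageParser(HTMLParser):
    """Collects META keyword phrases and visible body text from one page."""

    def __init__(self):
        super().__init__()
        self.meta_phrases = []
        self.body_chunks = []
        self._in_body = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Assumption: phrases in the META tag are comma-separated.
            content = attrs.get("content") or ""
            self.meta_phrases += [p.strip() for p in content.split(",") if p.strip()]
        elif tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self.body_chunks.append(data)


def _normalize(text):
    """Lowercase the text and reduce it to space-separated alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def hidden_phrases(html):
    """Return the META phrases whose word sequence does not appear in the body."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    # Pad with spaces so that matches respect word boundaries.
    body = " " + _normalize(" ".join(parser.body_chunks)) + " "
    return [p for p in parser.meta_phrases
            if " " + _normalize(p) + " " not in body]
```

For a page whose `keywords` META tag lists "leather jacket, motorcycle jacket" and whose body mentions only a leather jacket, `hidden_phrases` returns `["motorcycle jacket"]`. Exact word-sequence matching is the simplest reading of the definition; stemming or stop-word removal would change which phrases count as hidden.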
If a phrase occurs in the META tag of a Web page but not in its body, we call it a “hidden phrase”. Mining the definitions of hidden phrases, or of phrases in general, would allow (i) publishers to easily find relevant phrases with high query frequency and (ii) search engines to test whether the content of the body of a document matches the phrases in its TITLE and META tags. As an example, if a catalog publisher knows that a “leather jacket” is a “motorcycle jacket”, then the publisher can use the second phrase as a