Discovering Domain Specific Concepts within User-Generated Taxonomies

Jonathan Klinginsmith, Malika Mahoui, Yuqing Wu and Josette Jones

School of Informatics and Computing, Indiana University, Bloomington, IN, USA
Email: {jklingin, yuqwu}@indiana.edu

School of Informatics, IUPUI, Indianapolis, IN, USA
Email: {mmahoui, jofjones}@iupui.edu

Abstract—Collaborative tagging of resources on the Web has become a commonplace occurrence. Web sites that allow resources to be tagged provide a tremendous amount of user-generated taxonomic information. However, information seekers are hindered by the lack of organization within these tags as well as by the multitude of domains encompassed within these sites. To address these issues, we propose a multi-step approach for creating domain specific concept hierarchies from collaborative tags. Each concept hierarchy is based on domain specific subject matters, which may span more than one tag, as opposed to related work, which is concerned only with the relationships between single tags.

Keywords-folksonomy; sequence mining; suffix tree; subsumption; concept hierarchy

I. INTRODUCTION

Collaborative tagging of resources on the Web has become a commonplace occurrence. As part of the second generation of the Web, these sites allow users to provide their own annotations (tags) to classify digital resources. The resulting collection of user-generated tags is called a folksonomy. The word folksonomy combines the words folk and taxonomy, emphasizing that the taxonomic information within these Web sites is produced by common users (folk). The main advantage of folksonomies over formal taxonomies is the low cost of building the classification and assigning resources to it, which allows communities of users to contribute to the classification process. Because of this collaborative annotation effort, there is a great deal of user-generated taxonomic information to discover within folksonomies.

However, the resources classified within these sites span a multitude of domains. A researcher interested in a particular domain, whether for analyzing domain specific trends or for using the data for marketing or advertising purposes, would need to discover the taxonomic information for that domain. As a result, a multi-step framework is needed to first extract the subset of appropriate domain specific tags and then to discover the concepts and relationships among the information extracted.

Several approaches have been proposed for organizing user-generated tags [1]–[5]. The structures generated in these approaches vary from a forest of trees, as in [2], [5], to a directed acyclic graph, as in [1], [4], to clusters of directed graphs, as in [3]. The current approaches, although promising, do not provide a methodology for discovering domain specific concepts within a collaborative tagging Web site. Additionally, the current approaches lack the ability to organize topics into a conceptual hierarchy in which a concept may span more than one sequence of tags.

In [6], the subsumption hierarchy calculation was introduced as a means to derive conceptual topic/sub-topic relationships. Building on this work, [4] utilized the co-occurrence of single tags within Flickr¹ to create a concept hierarchy. This work provided promising results; however, it did not take into account multiple tags in a sequence.

Folksonomies represent an organization of topics shared by a large number of users. Topic identification can be mapped to the problem of discovering frequent sequences of tags. An algorithm for finding frequent maximal text sequences was proposed in [7]; it refines the algorithm for discovering sequential patterns introduced in [8]. The problems of sequential pattern mining and incremental sequence mining have been discussed in a collection of studies [9]–[15].
Expanding on this research, [16] introduced the concept of a decreasing support constraint on length-increasing sequences.

Towards discovering domain specific concepts within user-generated tags, we propose the following:

• An approach for discovering domain specific tags from within a folksonomy. The proposed approach leverages curated domain sources, such as reliable Web sites, to seed the set of domain specific topics and terms for querying collaborative tagging sites.
• The use of a tag sequence suffix tree for discovering important sequences of user-annotated tags.
• The generation of a concept hierarchy from the important tag sequences discovered.
• An implementation of our overall approach on the domain of information specific to patient and customer health.

1 http://www.flickr.com
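
To make the idea of a tag sequence suffix tree more concrete, the following is a minimal sketch and not the implementation described in this paper. It assumes an uncompressed suffix trie in which every suffix of every tag sequence is inserted and each node counts how often its path occurs as a contiguous run of tags. The names Node, build_suffix_trie, frequent_sequences, and min_count, as well as the sample tag sequences, are hypothetical and chosen for illustration only. A fixed occurrence threshold stands in for whatever notion of "important" sequences is used later in the paper.

```python
from collections import defaultdict


class Node:
    """One node of a suffix trie over tags; count = occurrences of the path."""

    def __init__(self):
        self.children = defaultdict(Node)
        self.count = 0


def build_suffix_trie(tag_sequences):
    """Insert every suffix of every tag sequence into a counted trie."""
    root = Node()
    for seq in tag_sequences:
        for start in range(len(seq)):
            node = root
            for tag in seq[start:]:
                node = node.children[tag]
                node.count += 1
    return root


def frequent_sequences(root, min_count):
    """Return {tag tuple: count} for paths seen at least min_count times.

    A longer path can never occur more often than its prefix, so a branch
    is abandoned as soon as its count drops below the threshold.
    """
    found = {}
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        for tag, child in node.children.items():
            if child.count >= min_count:
                found[path + (tag,)] = child.count
                stack.append((path + (tag,), child))
    return found


# Hypothetical tag sequences, invented for illustration only.
posts = [
    ["heart", "disease", "diet"],
    ["heart", "disease", "exercise"],
    ["diet", "heart", "disease"],
]
trie = build_suffix_trie(posts)
result = frequent_sequences(trie, min_count=2)
```

In this sketch the two-tag sequence ("heart", "disease") survives the threshold alongside the single tags, which is the kind of multi-tag topic that single-tag co-occurrence methods would not surface. Note that the counts are raw occurrences, so a tag repeated within one sequence is counted more than once; counting distinct sequences instead would require tracking which sequence last touched each node.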