KeaKAT – An Online Automatic Keyphrase Assignment Tool

Rabia Irfan, Sharifullah Khan, Irfan Ali Khan, Muhammad Asif Ali
School of Electrical Engineering & Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
{09msitrirfan, sharifullah.khan, 08bitikhan, 08bitasifa}@seecs.edu.pk

Abstract—Kea++ is a well-known tool for assigning keyphrases to documents. However, its keyphrase result set contains noise and irrelevant terms. The extended refinement methodology was developed to fine-tune the results of Kea++ for multiple domains. Nevertheless, using Kea++ and its refinement as a system for assigning keyphrases to documents is not simple for users from domains other than computing: the system must be installed and configured, and it has no GUI to help users assign keyphrases. The objective of KeaKAT is to provide a web-based keyphrase assignment tool that lets users assign relevant keyphrases to their documents online. KeaKAT not only saves users from installing and configuring the system, but also improves its usability.

Index Terms—Keyphrase assignment, information extraction, usability, classification system

I. INTRODUCTION

Keyphrases are words that present a concise summary of a document or text [1], [5], [11]. They can be used in a variety of applications that involve the organization and management of huge amounts of information. They can help in browsing document collections [8], serve as metadata [16], be used to index document collections [16], [7], and assist in the classification and clustering of document collections [10]. Because of their usage in different applications, many tools have been developed to generate keyphrases automatically.
Two main approaches have been used: keyphrase extraction, which extracts keyphrases from the text of a document, and keyphrase assignment, which aligns a document with a classification system or taxonomy. Kea++ [14] is a well-known tool that can perform both keyphrase assignment and extraction, depending on the given input. However, the output produced by Kea++ contains noise and irrelevant terms. The work in [4] proposed a refinement methodology that takes the keyphrases assigned by Kea++ as input and generates refined keyphrases. The methodology exploits the hierarchical structure of a taxonomy [15], [6] as well as common heuristics to fine-tune the results of Kea++. The refinement methodology was extended in [9]; the extended refinement methodology aims to improve and generalize it for multiple domains. Kea++ and the extended refinement methodology work together to produce better results for keyphrase assignment to documents.

As a system, Kea++ and its refinements need installation and configuration before they can assign keyphrases to documents accurately, and no graphical user interface (GUI) is provided to help users operate the system. Automatic keyphrase assignment techniques are helpful not only for computing experts but are equally important and applicable in other disciplines of academia, particularly library science. Users from fields other than computing are not computer experts. Most of the time they are not comfortable with techniques that demand extensive computer knowledge, nor do they wish to study the technical details of such systems. Therefore, the available automatic tools are not widely used in academia.

The objective of KeaKAT is to improve the usability of automatic keyphrase tools. KeaKAT is a web-based automatic keyphrase assignment tool.
The tool uses Kea++ and the extended refinement methodology in the background for keyphrase assignment. KeaKAT lets users train and test Kea++ and applies the refinement in the background to assign relevant keyphrases to their documents online. It saves users from installing and configuring the system and provides a GUI through which they can get their job done easily, thereby improving the usability of the system.

The rest of the paper is organized as follows: Section 2 discusses the related work and existing systems. Section 3 explains the architecture and working of the proposed tool and presents a comparative analysis of the existing systems and the proposed system. Section 4 concludes the paper and discusses future work.

II. RELATED WORK

This section discusses existing systems for automatic keyphrase assignment. Kea [17] and its later version Kea++ [14] are well-known tools developed at the University of Waikato for generating keyphrases automatically. Kea is a machine-learning-based tool that works in two phases: a training phase and an extraction phase. Kea was initially used to extract keyphrases from documents; it was later extended to perform keyphrase assignment, and the extended version is known as Kea++. Like Kea, Kea++ works in two phases, training and extraction. During each phase it works in two sub-steps: candidate identification and filtering. During the candidate identification step, language-dependent techniques such as input cleaning and stemming are applied to form pseudo-phrases. During the filtering step, the keyphrases that are the most suitable candidates are identified based on four features: Term

2012 10th International Conference on Frontiers of Information Technology 978-0-7695-4927-9/12 $26.00 © 2012 IEEE DOI 10.1109/FIT.2012.14 30