International Journal of Digital Content Technology and its Applications, Volume 4, Number 3, June 2010

A Visual Mining System for Theme Development Evolution Analysis of Scientific Literature

Jinlong Wang, Can Wen, Shunyao Wu
School of Computer Engineering, Qingdao Technological University, Qingdao, 266033, China
{wangjinlong,wencan314,shunyaowu}@gmail.com

Huy Quan Vu
School of Information Technology, Deakin University, Victoria 3125, Australia
hqv@deakin.edu.au

doi: 10.4156/jdcta.vol4.issue3.21

Abstract

Analyzing how themes develop and evolve in the literature helps researchers find and study frontier problems more efficiently. This paper designs and develops a visual mining system for theme development evolution analysis that handles large volumes of literature information. To improve the accuracy of the system, it analyzes related themes at the sub-theme level and adopts a dynamic threshold strategy. Experimental results show that the theme correlations obtained by the system are accurate and achieve a better practical effect than our earlier work.

Keywords: clustering, text mining, literature, dynamic

1. Introduction

With the spread of Internet information, knowledge and information of all kinds are continually emerging and being updated. As a carrier of knowledge, the scientific literature has grown explosively in recent years. Sorting and summarizing these vast literature resources rapidly and efficiently is therefore very important to researchers. In particular, theme evolution analysis of scientific literature gives researchers higher-level semantics, such as how content changes dynamically over time. This information helps researchers find research hot spots quickly, grasp developing trends in a research field in a timely manner, and investigate more efficiently.
Currently, many researchers at home and abroad focus on mining and analyzing literature data [1-10]. However, evolution analysis of large-scale dynamic text streams by theme has not been implemented satisfactorily, because it poses great challenges in both efficiency and accuracy. At present, the main analysis methods are statistical theme models and text clustering. Statistical theme models [11] construct a generative model from a large amount of text stream data, but their efficiency drops when themes must be obtained in different time slices. Text clustering is an effective method for text analysis, and correlation analysis between themes is a key problem in improving its accuracy. Many advanced clustering approaches have been proposed to solve practical application problems [12-14]. The paper [15] proposed a clustering-based method for theme evolution analysis of scientific literature. From a semantic perspective, however, the related themes in a dynamic text stream become fewer as time passes, which causes errors to accumulate and reduces clustering accuracy. Topic detection and tracking (TDT) originated from early event detection and tracking (EDT), which identifies new events and tracks subsequent news stories that discuss the event of interest [16-19]. It is similar to unsupervised clustering research. The dynamic threshold model [20] and the division and comparison of subtopics [21] have been used effectively in the TDT field and can also be applied to theme clustering to improve its accuracy. Meanwhile, information visualization technology [22-23] plays an important role in analyzing content development and evolution. Combining these two kinds of technology, this paper designs and develops a visual mining system for theme development evolution analysis of scientific literature. Experiments prove that more reliable theme evolutionary