International Journal of Computer Engineering and Applications, Volume XI, Special Issue, May 17, www.ijcea.com ISSN 2321-3469

Topic Modeling on Online News and News Clustering

Aashka Sahni¹ and Prof. Sushila Palwe²
Department of Computer Engineering, MIT College of Engineering, Pune

ABSTRACT: News media comprises print media, broadcast news, and the internet. Print media covers newspapers and news magazines; broadcast news covers radio and television; the internet covers online newspapers, news blogs, etc. Online news has become the prevalent form of information on the internet. Often, the same event is depicted differently by different news websites or sources because of varied perceptions of the same circumstance. The proposed system intends to collect news data from such diverse sources, capture the varied perceptions, and summarize and present them in one place. Another goal of the proposed system is to detect topics accurately in short news data. Previous approaches such as LDA and its variants identify topics efficiently for long texts (news) but fail to do so for short texts (news) because of the data sparsity problem. Short news delivers sophisticated signals and is therefore an important resource for topic modeling; however, acute sparsity and irregularity are prevalent in it, and these pose new difficulties for existing topic models such as LDA and its variations. In this paper, a lucid but generic explanation of topic modeling in online news is provided. The system presents a word co-occurrence network based model named WNTM, which works for both long and short news articles by managing the sparsity and imbalance issues simultaneously. WNTM is modeled by assigning and reassigning (according to a probability calculation) a topic to every word in the document rather than modeling topics for every document.
It effectively increases the density of the information space without much additional time or space complexity. The rich context preserved in the word-word space likewise helps detect new and uncommon topics with convincing quality. The system extracts real-time online news data and uses it for implementation. First, the topic modeling algorithm is applied to this online news data to identify the key topic of each incoming news item and the most trending topic. Once the topic of a news item is identified, the system uses the k-means document clustering algorithm to group all the latest news associated with a particular topic, thereby classifying the news by topic. After clustering, a summary is generated from the output, and the summarized news is presented to the user along with its topic.

Keywords: Data mining, topic modeling, document clustering, online news

[1] INTRODUCTION

Recently, a generative probabilistic model of textual corpora has been considered to segregate representations of the news (information). It decreases the depiction length and discloses inter- and intra-document factual structure. Such models ordinarily will be