The Risk Level Estimation Based on Deep Learning Method for Tianya Forum

Jindong Chen, Xijin Tang
Institute of Systems Science, Academy of Mathematics and Systems Science
Chinese Academy of Sciences, Beijing, 100190 P.R. China
j.chen@amss.ac.cn, xjtang@iss.ac.cn

Abstract

Using the societal risk indicators from social psychology, a deep learning method is applied to estimate the risk level of Tianya Forum. Because it effectively extracts semantic and word-order information from documents, the deep learning method Post Vector is used to generate distributed representations of BBS posts. In an experimental comparison on societal risk classification of BBS posts, kNN based on Post Vector outperforms kNN based on Bag-of-Words, edit distance, or the Lucene-based search method. Therefore, using kNN based on Post Vector and the annotated data of the Tianya Zatan board, the risk level of the Baixing Shengyin board in different months is estimated, and the reasonableness of the estimated results is analyzed.

Keywords: Tianya Forum, Societal Risk Classification, Deep Learning, kNN, Post Vector

1 Introduction

Nowadays, more and more Chinese people use social media (such as blogs, micro-blogs, and BBS) to express their opinions on daily phenomena and social events, so these online data offer a good way to monitor the relative societal risk level [1]. By measuring the topics expressed online and their frequency, the current relative societal risk level can be estimated. "Tianya Forum is a famous Internet forum in China, and provides BBS, blogs, micro-blogs and photo album services etc." (source: http://en.wikipedia.org/wiki/Tianya_Club). Tianya Forum includes multiple boards; the posts on boards such as Tianya Zatan and Baixing Shengyin mainly cover the hot and sensitive topics of current society [2].
Therefore, the boards of Tianya Forum are selected as the data sources to explore effective strategies for online societal risk monitoring.

Following a comprehensive analysis and comparison [1], the framework of societal risk indicators constructed by Zheng et al. [3] from word association tests, which comprises 7 categories and 30 subcategories, is adopted as the set of risk categories. To evaluate the current risk level, the main challenge is to classify each post into one of 8 classes: the 7 main societal risk categories and 1 risk-free category. However, the massive number of posts and their negative effects make manual classification impractical. Since machine learning methods are effective in text classification, they are a better approach for this task [4].

The basic principle of text classification is to use machine learning to assign predefined labels to new documents, based on a model learned from a training set of labeled documents [5]. Generally, two main procedures affect the accuracy of text classification: document representation and classifier construction. The traditional document representation is Bag-of-Words. In a Bag-of-Words representation, the vector size equals the vocabulary size; the elements at the indexes of the words that occur in the document hold the word frequencies, while all other elements are 0 [6]. The quality of a Bag-of-Words document vector is improved mainly through the extraction and selection of feature words [7]. Many studies have demonstrated the effectiveness of the Bag-of-Words representation in text classification, for example in news classification [8] and personality classification [9]. The representative machine learning methods for text classification are k-Nearest Neighbor (kNN) [10], naïve Bayes [11] and support vector ma-