International Journal of Computer Applications (0975-8887), Volume 67, No. 6, April 2013

What the Masses Want: A Case Study in Knowledge Discovery from Politically Oriented Data

Samhaa R. El-Beltagy, Moustafa Ghanem, Heba Ezzat, Sourya Ezzat, Mohmmed Aboelhouda, Ahmed Gamal, Mohamed Elkalioby, and Shady Alaa Issa
Nile University, Smart Village - B2 - Km 28, Cairo-Alex Desert Rd, Cairo 12677, Egypt

ABSTRACT
This paper describes an approach taken to analyze and categorize a sizable dataset of politically oriented posts that were submitted to a popular idea bank, Egypt 2.0, created following the Egyptian revolution. The aim of the analysis was to organize and present the data in a simple way that allows the voice of the people to be heard by decision makers and activists in a critical six-week period in February and March 2011. The constraints faced when developing the approach included the absence of a classification scheme, the unavailability of training data, the need to assign more than one category, or label, to individual posts, and the need to complete the task in a short period of time. The goal of this paper is twofold: first, to present and evaluate the rapid development framework and algorithms used to organize the data; second, to document the challenges encountered both in developing the system and in analyzing the data, and to present our experience to the research community with the aim of identifying potentially interesting new research topics.

General Terms
Knowledge discovery, Taxonomy building, Classification

Keywords
Text Mining; Text Analysis; Topic Categorization; Multi-Labeling.

1. INTRODUCTION
The wide use of social media, online forums, and other online means of communication is encouraging people to become more vocal with their political ideas, points of view, and aspirations.
Any government that is serious about hearing the voices of its people must be attuned to their ideas and suggestions and must take these into consideration in its decision making and policy setting process.

This paper describes an approach that was taken to do precisely that after the Egyptian revolution. The approach was developed to address a real-life problem that emerged immediately following the revolution, when a prominent political activist created a Google Moderator series (Egypt 2.0) to allow Egyptians to express their dreams and ideas for what is to become the new Egypt. (Google Moderator is a tool that allows distributed communities to submit ideas, events, presentations, etc., as well as vote on them.) In just one week, Egyptians, at that point eager to participate in shaping their country, submitted over fifty thousand postings with more than one million votes. A single person would typically enter a single posting expressing all of his/her ideas in the fields that s/he deemed important. A single posting could therefore contain ideas covering topics ranging from political reform and restructuring of the police force to improving education and making the streets cleaner.

Valuable information could be extracted from this resource, such as the areas that people feel need the most improvement and concrete ideas for making changes in a given area. However, the size of the submitted data became a primary obstacle to its immediate utilization and made it necessary to carry out further analysis and processing in order to make it useful for decision makers, researchers, or an average user interested in learning more about what others have suggested.

The goal of the presented work is twofold. Firstly, it presents the framework developed for the rapid development of a system capable of organizing this data by categorizing it and presenting it in a way that would allow the voice of the people to be heard even in the absence of a classification scheme.
Secondly, it documents the challenges encountered when developing the system and presents them to the community with the aim of identifying new research topics.

The presented system has been deployed, and one of its outputs was a report highlighting the main demands and ideas of the people using a category tree that was derived from the data itself. The report was included as part of the “national dialogue” congress initiative organized by the interim Egyptian government in late March 2011. The aim of the congress was to bring together activists and politicians to discuss the future of Egypt.

The rest of this paper is organized as follows: Section 2 presents a brief problem description; Section 3 provides an overview of the proposed system; Section 4 presents the results of evaluating the proposed system; Section 5 describes an enhancement that was brought about as a result of analyzing evaluation outcomes; Section 6 overviews related work; Section 7 briefly describes some of the insights gained by classifying user postings; Section 8 summarizes the lessons learned and identifies future research challenges; and finally, Section 9 concludes this paper.

2. PROBLEM DESCRIPTION
The input to the proposed system is a large set of postings, written in both Arabic and English, collected from the idea bank, each of which can span multiple topics. The majority of the postings were relatively short, with people trying to express their opinions in as few words as possible. The main problem that this work had to address was the need to classify, or annotate, these posts in the absence of a classification scheme. The second problem was that classification had to take place without any training data. The first problem requires an understanding of the existing data, and the creation of a classification taxonomy to capture it.
Manual inspection of the data and the consequent manual development of a classification taxonomy are not only time-consuming but also error-prone, as many categories or subcategories may be missed in the presence of a large amount of data. To address both problems, the following steps were taken: