Text Mining and Knowledge Discovery from Big Data: Challenges and Promise

Amal Mahmoud Yehia 1, Lamiaa Fattouh Ibrahim 1,2, Maysoon Fouad Abulkhair 2

1 Department of Computer Science and Information, Institute of Statistical Studies and Research, Cairo University, Cairo, Egypt
2 Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, B.P. 42808, 21551 - Girls Section, Jeddah, Saudi Arabia

Abstract
With the fast development of networking, data storage, and data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents text mining and the techniques used to categorize document structure in big data. Guaranteeing the quality of the features extracted from text documents to describe user interests or preferences poses a major challenge because of the large amount of noise. Many models and algorithms address this problem, but achieving the best results for users remains an open issue that needs further research.

Keywords: Text Mining, Big Data, Knowledge Discovery.

1. Introduction

We live in a flood of data that is too big, too fast, or too hard for existing tools to process. "Too big" means that organizations increasingly must deal with petabyte-scale collections of data that come from click streams, transaction histories, sensors, and elsewhere. "Too fast" means that not only is the volume of data large, but it must also be processed quickly. "Too hard" is a catchall for data that does not fit neatly into existing processing tools or that needs a kind of analysis that existing tools cannot readily provide. A Big Data problem has three distinct characteristics: the data volume is huge, the data-producing velocity is very high, and the data types are diverse (a mixture of structured, semi-structured, and unstructured data).
These characteristics pose great challenges to traditional data processing systems, which either cannot scale to such data volumes in a cost-effective way or fail to handle data of varied types [1]. The unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data. Exploring large volumes of data to extract information or knowledge for future action is a principal task of Big Data applications [2].

The term data mining has been stretched beyond its limits to apply to any form of data analysis. One of the numerous definitions of data mining, or Knowledge Discovery in Databases (KDD), is the extraction of interesting information or patterns from data in large databases. According to Cheng et al., "Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [3]. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. According to Prabhu, "Data mining is the search for relationships and global patterns that exist in large databases but are 'hidden' among the vast amount of data, such as a relationship between patient data and their medical diagnosis" [4].

Text mining is a technique that extracts information from unstructured data and finds patterns in it. Also known as knowledge discovery from text (KDT), it deals with the machine-supported analysis of text [5]. Text documents come in semi-structured or unstructured formats, such as emails, full-text documents, and HTML files.
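To make the idea of extracting patterns from unstructured text concrete, the following is a minimal sketch of one of the simplest text mining operations: counting term frequencies across a small document collection to surface its most prominent terms. The sample documents and the stop-word list are invented for illustration; real text mining pipelines would add stemming, weighting schemes such as TF-IDF, and scalable processing for big data volumes.

```python
# Minimal text mining sketch: find the most frequent terms in a
# small collection of unstructured text documents (illustrative only).
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "for", "from"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(documents, k=3):
    """Count non-stop-word tokens across all documents and return
    the k most frequent terms with their counts."""
    counts = Counter(
        token
        for doc in documents
        for token in tokenize(doc)
        if token not in STOP_WORDS
    )
    return counts.most_common(k)

# Hypothetical sample documents.
docs = [
    "Text mining extracts patterns from unstructured text.",
    "Mining big data requires scalable text analysis.",
]
print(top_terms(docs))  # 'text' and 'mining' dominate this tiny corpus
```

Even this toy example shows the core loop of text mining: turn raw, unstructured text into discrete features (tokens), then aggregate those features into patterns a user can act on.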
IJCSI International Journal of Computer Science Issues, Volume 13, Issue 3, May 2016
ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784
www.IJCSI.org, doi:10.20943/01201603.5461

The problem