International Journal of Computer Applications (0975 – 8887) Volume 136 – No.3, February 2016

Challenges for Information Retrieval in Big data: Product Review Context

Sanjib Kumar Sahu, Dept of Computer Science, Utkal University, Odisha, India
D. P. Mahapatra, Dept of Computer Science, NIT, Rourkela, Odisha
R. C. Balabantaray, Dept of Computer Science, IIIT Bhubaneswar, Odisha

ABSTRACT
The ever-increasing scale of e-commerce today presents customers with a wide range of choices. Customers use online product reviews as a primary criterion for making purchase decisions. These product reviews are scattered all around the internet, and this data has great potential value. However, it is also unstructured and written in natural language, which poses great problems for data mining and data analytics. The scale, non-uniformity and complexity of product reviews make them classic big data elements. This paper discusses the big data challenges and opportunities involved in the mining and analytics of product review data. It formally studies the problem under a big data framework and formulates a plan for extraction, mining and analysis. This paper also reviews some of the mining approaches for product reviews and implements a feature/attribute-based method for finding the reviews of products.

Keywords
Big Data; Information Retrieval; Data Mining; Product Reviews; Text Mining; Sentiment Classification; e-commerce

1. INTRODUCTION
E-commerce is a 21st-century innovation that has changed the 150,000-year-old practice of trading physical goods [1]. It has brought a whole marketplace to our phone and laptop screens, with a huge variety of capital goods available to purchase at any time. It is becoming a major driver of economic growth in populous and developing economies like India and China, where booming businesses are bringing investments, creating jobs and, most importantly, empowering the customer [2].
Never before in history has a customer had so wide a range of options. So many merchants sell hundreds of brands and thousands of products, which can be shipped to every corner of the world, that customers are spoilt for choice today. For example, a simple search for a mobile phone in the price range of Rs. 10,001-18,000 on the popular online store Flipkart returns more than 300 product results. The dilemma of a customer today is to make the decision that gives him the best value for his money. Product reviews have become the primary resource from which a customer gets information about an item, and a wise purchase decision is usually the result of reading a large number of reviews on the internet.

These reviews come from many places. There are dedicated websites and blogs where professionals use a product and post their impressions, there are many online forums where users of a product can submit their comments, and merchants selling these products also allow customers to rate and review them on the web. A huge amount of this review data is generated every day; it is mostly free to access and has great value to the customer. Yet procedural analytics for this data has been largely neglected, mainly due to its associated complexity. Such reviews are highly unorganized, as they are written in natural language. They are ever growing, and they originate from multiple heterogeneous sources. Moreover, the user perception of a product also changes over time due to the availability of better alternatives in the marketplace.

The characteristics of review data fit the classic definition of big data well. This data is voluminous and heterogeneous, it originates from autonomous sources with distributed and decentralized control, and it has a complex and evolving nature. Therefore, the techniques and practices traditionally employed for big data must be adopted for the analysis of such product reviews.
There are tremendous advantages and opportunities in undertaking such data analysis [3]. It creates value out of the unstructured data floating around the web. The analysis report has great significance for a customer, who can now focus on the key positive and negative sentiments about a product without spending a lot of time searching the internet, compare the product with similar products in the market, and shortlist his feature preferences to make a quicker and more effective decision. Such reports are also more beneficial than limited surveys, and manufacturers can use them to better understand market demand and design improved and more attractive products. Product reviews serve as valuable feedback that improves the profitability of all the stakeholders involved in an online retail business, such as the manufacturers, the retailers and the customers.

The remainder of the paper is organized as follows. Section-II discusses the big data challenges and solutions for product review analysis. It also defines and classifies the complexity level of this data. Section-III proposes a general process for the analysis of such data. Section-IV discusses the existing methodologies of product review. Section-V discusses sample collection, result analysis and the procedure of a methodology for product review. Finally, the conclusion and future directions for investigation are presented in Section-VI.

2. DATA CHARACTERISTICS
In a recent and widely cited paper [4], Wu et al. proposed the HACE theorem to model big data characteristics. According to the HACE theorem, Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data. This paper applies the above-mentioned model to understand the characteristics of product review data.
Later, this paper will also classify this data using the popular matrices to understand the challenges involved in the analysis. Below, we first explain the basic concepts of the HACE theorem.

2.1 Huge Heterogeneous Data with Diverse Dimensionality
Product reviews are unstructured and complex. Generally, there is no common standard template, and rather these are