A Study of Similarity Functions Used in Textual Information Retrieval in Wide Area Networks

Jaswinder Singh #1, Parvinder Singh *2, Yogesh Chaba #3
# Department of Computer Science & Engineering, Guru Jambheshwar University of Science & Technology, Hisar, Haryana, India
* Department of Computer Science & Engineering, Deenbandhu Chhotu Ram University of Science & Technology, Murthal, Sonepat, Haryana, India

Abstract: The World Wide Web is a rich source of information. It continues to expand in size and complexity with the increasing use of the internet and social media, and retrieving relevant documents from the Web is becoming a challenge. This paper discusses the goals, challenges and importance of similarity functions in information retrieval in wide area networks. It also reviews the different similarity functions that various authors have used to measure the similarity of a document to a query in information retrieval in wide area networks.

Keywords: Similarity Function, Textual Information Retrieval, Wide Area Networks

I. INTRODUCTION

The continuous growth of the web and users' expectation that a search engine will anticipate their needs have led to the development of the field of information retrieval in wide area networks. The tool used to extract relevant information from the web is called a search engine. A survey claims that 85% of internet users use search engines or some kind of search tool to find specific information of interest [1]. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day [2]. The objective of a search engine is to provide the user with high-quality results that are relevant to the user's query. Search engines use automated software programs known as spiders to survey the web and build their database.
Web documents retrieved by these programs are analysed. Data collected from each web page are then added to the search engine index. When the user enters a query at the search engine site, the input is checked against the search engine's index of all the pages it has analysed. The best URLs are then returned to the user as hits, ranked in order with the best results at the top.

The aim of this paper is to study the goals and challenges of information retrieval in wide area networks and, in particular, the importance of similarity functions in that field. The remainder of the paper is organized as follows. The first section briefly describes the working of a search engine. The second section describes the information retrieval process in wide area networks. It covers the goals and challenges of information retrieval and the problems faced by an information retrieval system in wide area networks, whether they arise from the nature of the web, the activity of the user, or the searching process. It also describes the information retrieval system and the classical models of information retrieval in wide area networks. The third section describes the various similarity functions, which are used to determine the textual similarity between the user query and documents. Related work on the similarity functions is reviewed, and the paper concludes that a proper combination of similarity functions can further improve search results.

II. INFORMATION RETRIEVAL IN WIDE AREA NETWORKS

Information retrieval has become the subject of much research in recent years because the amount of information available in digital formats has grown exponentially and the need to retrieve relevant information has become crucial. The most common text retrieval task is to retrieve documents in response to the user query.
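As a rough illustration of the indexing and query-matching steps described in the Introduction, the following minimal sketch builds a small inverted index and ranks pages by the number of distinct query terms they contain. It is a toy under stated assumptions, not the method of any particular search engine: the function names (`tokenize`, `build_index`, `search`), the example URLs, and the overlap-count scoring are all hypothetical choices made for this example.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]

def build_index(pages):
    """Map each term to the set of URLs whose text contains it (an inverted index)."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Return URLs ranked by how many distinct query terms they contain.

    Ties are broken alphabetically by URL so the ordering is deterministic.
    """
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for url in index.get(term, ()):
            scores[url] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical pages standing in for documents collected by a spider.
pages = {
    "http://example.org/a": "Similarity functions for information retrieval",
    "http://example.org/b": "Information about wide area networks",
    "http://example.org/c": "Cooking recipes",
}
index = build_index(pages)
hits = search(index, "information retrieval")
print(hits)
```

In this sketch the page matching both query terms is listed before the page matching only one, and pages with no matching term are not returned at all. Counting shared terms is only the crudest possible notion of similarity between a query and a document.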
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information” [3]. Information retrieval deals with the representation, storage, and organization of, and access to, information items such as documents and web pages [4]. An information retrieval system differs from a DBMS in that its retrieval is probabilistic, whereas retrieval in a DBMS is deterministic [5]. Modern information retrieval can be accessed through search engines such as Google, Bing and AltaVista.

A. Goals of Information Retrieval

The main goal of information retrieval in wide area networks is to search for the documents that are relevant to the user’s query. Keyword search is the simplest and most popular query method for search engines in information systems. In some cases, the results returned for an entered keyword might not include the required documents. This can result from a shortcoming of the search method or from a lack of knowledge of how to use the specific keyword. Fig. 1 shows the information retrieval process that a user mostly follows when searching for information in wide area networks. The user formulates a query expressing the information need, chooses a search tool or search system, and sends the query to the information retrieval system. The information retrieval system searches the document database for matches and retrieves the results. The user evaluates the results based on their relevance [4].

Jaswinder Singh et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014, 7880-7884, www.ijcsit.com

Relevance is subjective in nature as it depends on the