International Conference on Computational Science and Technology – 2014 (ICCST'14)

A Comparison of Distance Methods Effectiveness in Retrieving Relevant Articles in Agricultural Domain

Kim Soon Gan, Faculty of Computer and Informatics, University Malaysia Sabah, Kota Kinabalu, Malaysia, e-mail: g_k_s967@yahoo.com
Rayner Alfred, Faculty of Computer and Informatics, University Malaysia Sabah, Kota Kinabalu, Malaysia, e-mail: ralfred@ums.edu.my
Kim On Chin, Faculty of Computer and Informatics, University Malaysia Sabah, Kota Kinabalu, Malaysia, e-mail: kimonchin@ums.edu.my
Patricia Anthony, Faculty of Environment, Society and Design, Lincoln University, Christchurch, New Zealand, e-mail: patricia.anthony@lincoln.ac.nz

Abstract: The large volume of online and offline information available today has overwhelmed users' ability to process it efficiently and effectively in order to extract what is relevant. The exponential growth of information on the Internet further complicates accessing and retrieving relevant information, making it a time-consuming and complex task for users. Information retrieval (IR) is a branch of artificial intelligence that tackles the problem of accessing and retrieving relevant information. The aim of IR is to enable the available data sources to be queried for relevant information efficiently and effectively. Several methods have been proposed to measure the similarity between the posted query and the retrieved articles, and different distance methods will rank these articles differently. This paper studies and compares the effectiveness of different distance methods in retrieving relevant documents based on 17 specific queries in the agricultural domain. The results of the experiment are empirically evaluated.

Keywords: Information Retrieval, Similarity Measurement, Euclidean Distance, Extended Jaccard Coefficient

I. INTRODUCTION

Information retrieval (IR) is one of the streams of artificial intelligence. It deals with the task of retrieving, from a repository of data, information that is relevant to a particular user or to a specific context [1]. These data may take the form of images, text, audio, video, and so on. The study of information retrieval covers data representation, storage, organization, and the process of retrieving this information.

The advancement of the Internet has brought a breakthrough in spreading and publishing information through the web, and the Internet is now the biggest repository of information resources. However, the exponential growth of information has both advantages and disadvantages. With this large volume of information, dissemination and sharing become easier because tons of information are available to users. The disadvantages include information overload, unprocessed information, and the difficulty of retrieving relevant information based on a particular context posted by users. In addition, the unstructured data posted on the web increase the difficulty of retrieving information because they are not modelled properly. Since information retrieval deals with human natural language, the semantic ambiguity of the posted text also needs to be taken into consideration.

Most information retrieval researchers are now shifting their focus toward web information retrieval in order to tackle these problems. The earliest approach to tackling them is the search engine. The search engine is one of the most prominent applications for retrieving information from the web and is used by most Internet users. Well-known search engines today include Google, Yahoo and Bing. However, these search engines mostly deal with generic search tasks.
They perform well on generic queries, but for a specific query in a specific domain, e.g., the agricultural domain, the retrieval task requires more domain-specific knowledge. In addition, most search engines are still based on keyword search, which cannot deal with the semantic ambiguity of human natural language. Thus, many researchers in both academia and industry have built their own IR systems. These IR systems can handle more specific retrieval processes, and some of them address specific retrieval problems such as scalability, semantic ambiguity, cross-language retrieval, security, and privacy.

In a previous study, a robust IR system was developed to tackle both specific and generic retrieval tasks [2]. The main concern of the retrieval task is how efficient the IR system is in retrieving relevant documents. Relevancy has always been the main concern of the retrieval task, and it is related to the similarity and dissimilarity measures that are commonly used in data mining. Similarity is a numerical measure of the degree to which two or more objects are alike. Dissimilarity is the opposite of similarity, where the