A Survey on Search Results Annotation

Rosamma K S, Jiby J Puthiyidam
Dept. of Computer Science, College of Engineering Poonjar, Cochin University of Science and Technology, Kottayam, Kerala, India

Abstract—Web search engines are used frequently and worldwide by end users for many different purposes. A web search engine takes a query from the end user and executes it against the relational database that stores information on behalf of that search engine. Based on the input query, the search engine generates a dynamic response in the form of HTML pages backed by web databases. Each generated page contains many results for a particular query, called Search Result Records (SRRs). An SRR may contain data units that share a common semantic, and these data units must be assigned proper labels. Manual methods for record extraction and labeling scale poorly, so automatic annotation methods are needed to improve both the accuracy and the scalability of web search engines. This paper reviews such systems.

Key words—Web Database, Annotation, Data alignment

I. INTRODUCTION

Web information extraction and annotation have been two important research areas in recent years. A large portion of the deep web is database-driven: the data encoded in returned result pages comes from underlying structured databases, called web databases (WDBs) [12,13]. A typical search result page may contain multiple search result records (SRRs). Each SRR contains multiple data units, each of which describes one aspect of a real-world entity and corresponds to the value of a record under an attribute. There is high demand for collecting data of interest from multiple WDBs. Unfortunately, the data units in SRRs are often not provided with semantic labels.
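The data model implied by these definitions can be sketched as follows. This is an illustrative sketch only: the class and function names are assumptions made for this example, not structures defined by any surveyed system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical model of the structures described above: a search result
# page holds SRRs, each SRR holds data units, and annotation assigns a
# semantic label (attribute name) to each unit.

@dataclass
class DataUnit:
    text: str                    # raw text taken from the result page
    label: Optional[str] = None  # semantic label, initially unknown

@dataclass
class SRR:
    units: List[DataUnit] = field(default_factory=list)

def annotate(srr: SRR, labels: List[str]) -> SRR:
    """Toy stand-in for automatic annotation: assign one label per unit."""
    for unit, label in zip(srr.units, labels):
        unit.label = label
    return srr

record = SRR([DataUnit("Database Systems"), DataUnit("R. Elmasri"), DataUnit("$79.99")])
annotate(record, ["title", "author", "price"])
print([(u.label, u.text) for u in record.units])
# → [('title', 'Database Systems'), ('author', 'R. Elmasri'), ('price', '$79.99')]
```

In this toy version the labels are supplied by hand; the annotation methods surveyed below aim to infer such labels automatically from alignment and visual cues.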
Having semantic labels for data units is important for the record linkage task and for storing collected SRRs in a database table for later analysis. Since the data units in SRRs are initially unstructured and unlabeled, an automatic labeling method is required; manual labeling methods scale poorly, so many automatic alignment and annotation methods have been introduced. These methods improve the efficiency of searching and updating data.

II. ANNOTATION METHODS

The area of web information extraction and annotation has grown rapidly in recent years. Many of the systems developed are based on marking specified areas in a sample page and then using a set of rules to extract the information. Many systems achieve higher extraction accuracy through supervised learning techniques.

A. Vision-Based Approaches for Web Data Extraction

These methods [1,2,3] use visual features of deep web pages to extract web data, and several systems follow this approach. ViDE is based on common visual features of the deep web. It first builds a visual block tree using the VIPS algorithm; using this tree, data record extraction and data item extraction are carried out based on the proposed visual features. A visual wrapper is then generated to improve the efficiency of both data record extraction and data item extraction. Another system uses an enhanced co-citation algorithm [3]. Unlike other systems, which develop a new set of APIs for extracting visual information, this algorithm retrieves the visual information of deep web pages directly from the web database. The framework proceeds in three phases. The first phase extracts web pages using the enhanced co-citation algorithm, which follows two strategies for extracting the visual information of web pages from the web database: content-based and link-based.
The former extracts the textual content of the links for the user's query and their siblings, while the latter uses only the link structure among the web pages collected for the enhanced co-citation algorithm. The second phase is data record extraction. The objective of this phase is to determine the boundaries of data records and extract them from deep web pages. The following assumptions must be satisfied:

i. All data records present in a multi-data region are extracted.
ii. For every extracted data record, no data item is neglected and no erroneous data item is incorporated.

Rosamma K S et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (6), 2014, 7715-7719, www.ijcsit.com
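The record-boundary idea behind this second phase can be sketched in a minimal form: sibling blocks of a result region that have similar structure are grouped into candidate data records. The block "signatures" and the similarity measure below are illustrative assumptions for this sketch, not the algorithm of any specific system surveyed here.

```python
from difflib import SequenceMatcher

def similar(sig_a: str, sig_b: str, threshold: float = 0.8) -> bool:
    """Structural similarity of two block signatures (e.g. tag paths)."""
    return SequenceMatcher(None, sig_a, sig_b).ratio() >= threshold

def extract_records(sibling_signatures):
    """Group consecutive, structurally similar siblings into records."""
    records, current = [], []
    for sig in sibling_signatures:
        # A dissimilar sibling marks the boundary of the previous record group.
        if current and not similar(current[-1], sig):
            records.append(current)
            current = []
        current.append(sig)
    if current:
        records.append(current)
    return records

# Siblings of a result region: three structurally identical result
# records followed by a dissimilar footer block.
sigs = ["div/a/span", "div/a/span", "div/a/span", "table/tr"]
print(extract_records(sigs))
# → [['div/a/span', 'div/a/span', 'div/a/span'], ['table/tr']]
```

A real system would additionally verify assumptions (i) and (ii) above, e.g. by checking that every data item inside a grouped block is carried over into the extracted record.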