EXTRACT AND ANALYSIS OF SEMI STRUCTURED DATA FROM WEBSITES AND DOCUMENTS

Prashant M Ahire 1, Anil P Gagare 2, Yogesh B Pawar 3, Savan S Vidhate 4
1,2,3,4 B.E. Computer, Department of Computer Engineering, GES’s R H Sapat College of Engineering, Nashik 422005, Maharashtra, India
1 ahireprashant77@gmail.com, 2 anilgagare840@gmail.com, 3 yogesh.pawar95@gmail.com, 4 sa1.vidhate@gmail.com

Abstract: Searching the World Wide Web and portable documents for useful information is a tedious task given the limitations of currently available browsers, even though a large amount of work is being carried out to improve efficiency. A large share of the information on the web and in portable digital documents is stored in backend databases that are not indexed by traditional search engines. Such databases are known as semi-structured databases, and extracting and analyzing web content and documents is a time-consuming and complex task. Hence, there has been increased interest in the retrieval and integration of semi-structured web data and digital document data, with a view to providing quality information to users who wish to analyze the data. This paper presents an approach that identifies web page templates and the tag structures of a portable document in order to extract semi-structured data from web pages and documents, and then analyzes the extracted data according to user requirements using various SQL queries.

Keywords: Web page extraction, Analysis, Web Page Service, portable documents

I. Introduction

Data mining is the process of extracting useful information from collected databases. Extracting information from large databases is called “knowledge discovery”. Data mining is an analytical tool: it allows users to analyze data from many different angles, categorize it, and summarize the relationships identified. Technically, it is the process of finding correlations or patterns among the many fields in large relational databases [7].
Web mining is similar to data mining, except that the data is extracted from web pages. It is the application of data mining techniques to discover patterns from the web. Web mining can be divided into three types: Web Usage Mining, Web Content Mining and Web Structure Mining. Web Usage Mining is the process of extracting useful information from server logs; for example, it can reveal what users are looking for on the Internet. Web Structure Mining is the process of using graph theory to analyze the node and connection structure of a web site. Web Content Mining is the mining, extraction and integration of useful data, information and knowledge from web page content [3].

Extraction and analysis of web pages and portable documents is an active research area in the fields of data mining and web mining. Extraction means fetching the relevant information from web pages and portable documents, and it is a crucial step before the data can be analyzed. Different web sites contain information on varied topics in various formats, so a large amount of effort is often required for a user to manually locate and extract the data of interest from web pages and portable documents. Consider, for example, the MSBTE results, which are stored in HTML page format. If an analysis is to be made, the user must gather the information and convert it into an Excel sheet before running various queries to obtain subject-wise results, the topper of each subject, the overall topper, and so on. Analyzing Portable Document Format (PDF) files in the same way requires similarly great effort [5].
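The MSBTE results example can be illustrated with a minimal sketch, which is not the implementation described in this paper. It assumes a hypothetical result page whose rows sit in a plain HTML table with student, subject and marks columns; the actual MSBTE markup is not shown here. The table layout, the `results` schema and all function names are assumptions made for illustration. The sketch pulls the rows out with Python's standard `html.parser`, loads them into an in-memory SQLite database, and answers the "topper of each subject" and "overall topper" questions with SQL queries.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical result page; the real MSBTE markup is an unknown here.
SAMPLE_HTML = """
<table>
  <tr><th>Student</th><th>Subject</th><th>Marks</th></tr>
  <tr><td>Asha</td><td>Maths</td><td>88</td></tr>
  <tr><td>Ravi</td><td>Maths</td><td>92</td></tr>
  <tr><td>Asha</td><td>Physics</td><td>75</td></tr>
  <tr><td>Ravi</td><td>Physics</td><td>70</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows hold only <th> cells, so their row list stays empty.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data.strip()


def load_results(html):
    """Parse the table rows and load them into an in-memory SQLite table."""
    parser = TableRowParser()
    parser.feed(html)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results (student TEXT, subject TEXT, marks INTEGER)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?)",
        [(student, subject, int(marks)) for student, subject, marks in parser.rows],
    )
    return conn


def subject_toppers(conn):
    """Return (subject, student, marks) for the highest scorer in each subject."""
    query = """
        SELECT r.subject, r.student, r.marks
        FROM results AS r
        JOIN (SELECT subject, MAX(marks) AS top
              FROM results GROUP BY subject) AS t
          ON r.subject = t.subject AND r.marks = t.top
        ORDER BY r.subject
    """
    return conn.execute(query).fetchall()


def overall_topper(conn):
    """Return (student, total) for the student with the highest total marks."""
    query = """
        SELECT student, SUM(marks) AS total
        FROM results
        GROUP BY student
        ORDER BY total DESC, student
        LIMIT 1
    """
    return conn.execute(query).fetchone()


if __name__ == "__main__":
    conn = load_results(SAMPLE_HTML)
    print(subject_toppers(conn))
    print(overall_topper(conn))
```

Once the rows are in a relational table, each analysis becomes a single query rather than a manual spreadsheet step. A real result page would likely need a template-specific parser in place of the generic `<td>` collector used here.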