International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 05 | May-2017 www.irjet.net p-ISSN: 2395-0072

An Automatic Extraction of Educational Digital Objects and Metadata from Institutional Websites

Kajal K. Nandeshwar 1, Praful B. Sambhare 2
1 M.E. IInd year, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India
2 Assistant Professor, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India

Abstract: The internet is used, among other things, as a source of educational information. This project provides a tool that detects educational digital objects (EDOs), in any format, that are already published on institutional websites and could be uploaded to a repository. Compiling these objects is a tedious task that is usually performed manually. The proposed system architecture automates the task: it collects the documents within an educational web domain and detects those that are candidates for loading into a repository. In addition, metadata such as the abstract and author name are extracted automatically when available. The aim of the proposed system is to automatically extract the EDOs published on institutional websites and store them in the repository.

Keywords: EDOs, automatic, website, Repository, links, extraction, information gathering

1. INTRODUCTION

The internet is a powerful source of information and is widely used as a source of educational information. In the field of technology-enhanced learning, an educational resource is also called a digital object, learning object, learning resource, digital resource, digital content, reusable learning object, or educational content (McGreal, 2004).
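The document-detection step described in the abstract can be illustrated with a minimal sketch. It assumes that an EDO candidate is any link whose path ends in a known document extension; the extension list, the class name DocumentLinkCollector, and the function collect_document_links are hypothetical and are not taken from the proposed system. The sketch works on an HTML string that has already been downloaded and does no crawling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: these extensions stand in for "any format"; the paper gives no list.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".ppt", ".pptx")


class DocumentLinkCollector(HTMLParser):
    """Collects absolute URLs of <a href> targets that look like documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(DOCUMENT_EXTENSIONS) and absolute not in self.links:
            self.links.append(absolute)


def collect_document_links(html, base_url):
    """Return the document-like links found in one HTML page, in page order."""
    collector = DocumentLinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

For example, calling collect_document_links(page_html, "http://example.edu/dept/index.html") on a page that links to "/docs/paper.PDF" and "about.html" would keep only the first link, resolved to an absolute URL. The example.edu address is a placeholder.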
Nowadays the web is one of the most important sources of educational material, and students and teachers have a large amount of information at their disposal [4]. An Educational Digital Object (EDO) is any material in digital format that can be used as an educational resource. For example, a scientific publication or educational material used in a class is an educational resource [1].

Manual data extraction is time consuming and error prone. Web content comes in different formats, including text, HTML pages, PDF documents, and other proprietary formats. Web pages may present the same or analogous information using entirely different formats or wording, which makes integrating the information a challenging task [14]. Web links are themselves a source of valuable information.

In this project, a system architecture is used to collect documents, assisting the managers of institutional repositories in compiling the EDOs within a website. In this way, documents that are plausible candidates for upload to a repository can be detected. Their metadata, such as abstract, author name, and affiliation, are also extracted automatically when available. A common problem in extracting EDOs is that the required data are often not in the document itself; they may instead be on different pages of the same website. The proposed system architecture takes advantage of this to improve the automation of information extraction: data are searched for in the document and also in other pages of the same site.

The proposed system gathers the EDOs, as a list of links to documents published on an institutional website or any other website, together with their metadata, and stores them in the repository. The system receives as input the URL of a website or a text in which the search is performed. The output shows the retrieved documents together with the extracted information in a database.

2.
EXISTING WORK

Automatic extraction plays an important role in processing results from search engines [7]. Various systems for automatic information gathering have been proposed.

DeLa (Data Extraction and Label Assignment for Web Databases): DeLa, described by J. Wang and F.H. Lochovsky [9], automatically extracts data from a website and assigns meaningful labels to the data. The technique concentrates on pages that query a back-end database through complex search forms rather than through keywords.

ViPER (Visual Perception based Extraction of Records): ViPER, described by K. Simon and G. Lausen [13], is a fully automated information extraction tool. The technique is based on the assumption that a web page contains at least two consecutive data records that exhibit some kind of structural and visible similarity. ViPER extracts the relevant data with respect to the user's visual perception of the web page. It extracts only contiguous data and does not handle nested structures effectively. It performs good data extraction, but an implementation is not available [12].

© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 775
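The assumption stated for ViPER, that a page contains at least two consecutive data records with similar structure, can be illustrated with a small sketch. This is not the ViPER algorithm: it ignores visual similarity entirely and treats two sibling elements as similar only when they have the same tag and the same sequence of child tags. The names Node, TreeBuilder, and repeated_record_runs are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def signature(self):
        # Structural fingerprint: own tag plus the tags of the direct children.
        return (self.tag, tuple(child.tag for child in self.children))


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree (no text, no attributes) from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def repeated_record_runs(html):
    """Return (parent_tag, record_tag, count) for each run of two or more
    consecutive siblings that share the same structural signature."""
    builder = TreeBuilder()
    builder.feed(html)
    runs = []

    def visit(node):
        run = []
        for child in node.children:
            if run and child.signature() == run[-1].signature():
                run.append(child)
            else:
                if len(run) >= 2:
                    runs.append((node.tag, run[0].tag, len(run)))
                run = [child]
        if len(run) >= 2:
            runs.append((node.tag, run[0].tag, len(run)))
        for child in node.children:
            visit(child)

    visit(builder.root)
    return runs
```

On a result list where every item has the same inner layout, the function reports one run for the list element. A page with a single record produces no run, which corresponds to the case the stated assumption excludes.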