International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395 -0056
Volume: 04 Issue: 05 | May -2017 www.irjet.net p-ISSN: 2395-0072
An Automatic Extraction of Educational Digital Objects and Metadata
from Institutional Websites
Kajal K. Nandeshwar¹, Praful B. Sambhare²
¹M.E. IInd year, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India
²Assistant Professor, Dept. of Computer Science, P. R. Pote College of Engg, Amravati, Maharashtra, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract: The internet is used, among other things, as a
source of educational information. This project provides
a tool that detects educational digital objects (EDOs), in
any format, that are already published on institutional
websites and could be uploaded to a repository. This
collection task is tedious and is usually performed
manually. The proposed system architecture automates
it: the system collects the documents within an
educational web domain and detects those that are
candidates for loading into a repository. In addition,
metadata such as the abstract and author name are
extracted automatically when available. The aim of the
proposed system is to automatically extract the EDOs
published on institutional websites and store them in
the repository.
Keywords: EDOs, automatic, website, Repository,
links, extraction, information gathering
1. INTRODUCTION
The internet is a powerful source of information and is
widely used as a source of educational information. In
the field of technology-enhanced learning, an educational
resource is also called a digital object, learning object,
learning resource, digital resource, digital content,
reusable learning object, or educational content
(McGreal, 2004). Nowadays, one of the most important
sources of educational material is the web, where
students and teachers have a large amount of
information at their disposal [4].
An Educational Digital Object (EDO) is any material in
digital format that can be used as an educational
resource. For example, a scientific publication or
material used in a class is an educational resource [1].
Manual data extraction is time-consuming and
error-prone. Web content comes in different formats,
including text, HTML pages, PDF documents, and other
proprietary formats. Web pages may present the same or
analogous information using entirely different formats
or wording, which makes integrating the information a
challenging task [14]. Web links are a valuable source
of information. In this project, a system architecture
is used to collect documents in order to assist the
managers of institutional repositories in gathering the
EDOs within a website. In this way, documents that are
plausible candidates for upload to a repository can be
detected. In addition, their metadata, such as abstract,
author name, and affiliation, are extracted
automatically when available.
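As an illustration of extracting metadata "if available", the following minimal sketch pulls an abstract and a keyword list out of a document's plain text. It is not the authors' implementation: the function name `extract_metadata`, the reliance on "Abstract" and "Keywords" labels, and the use of a numbered heading as the end marker are all assumptions made for this example. Extracting author names and affiliations generally needs stronger heuristics and is not attempted here.

```python
import re

# Hypothetical helper: label-based extraction from already-extracted plain text.
# Assumes the document marks its sections with "Abstract" and "Keywords" labels
# and that the body starts with a numbered heading such as "1. INTRODUCTION".
SECTION_END = r"\n\s*\d+\.\s+[A-Z]"


def extract_metadata(text):
    """Return a dict holding only the metadata fields that could be found."""
    metadata = {}

    # Abstract: text after the label, up to "Keywords", a numbered heading, or the end.
    abstract = re.search(
        r"Abstract\s*[:\-]?\s*(.+?)(?=\n\s*Keywords\b|" + SECTION_END + r"|\Z)",
        text,
        re.S | re.I,
    )
    if abstract:
        # Collapse line breaks and repeated spaces left over from the page layout.
        metadata["abstract"] = " ".join(abstract.group(1).split())

    # Keywords: comma-separated list after the label, up to a numbered heading or the end.
    keywords = re.search(
        r"Keywords\s*[:\-]?\s*(.+?)(?=" + SECTION_END + r"|\Z)",
        text,
        re.S | re.I,
    )
    if keywords:
        flat = " ".join(keywords.group(1).split())
        metadata["keywords"] = [k.strip() for k in flat.split(",") if k.strip()]

    return metadata
```

A field that cannot be located is simply left out of the result, which mirrors the "if available" behaviour described above.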
A problem in extracting EDOs is that the required data
are often not in the document itself; they may instead
appear on other pages of the same website. The proposed
system architecture takes advantage of this to improve
the automation of information extraction: some data are
searched for in the document and also on other pages of
the same site. The proposed system gathers the EDOs and
their metadata, in the form of a list of links to
documents published on an institutional website or any
other website, and stores them in the repository. The
system receives as input the URL of a website or a text
in which the search is performed. The output shows the
retrieved documents together with the extracted
information, stored in a database.
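The link-gathering step described above can be sketched as follows, using only the Python standard library. This is a simplified illustration under stated assumptions, not the system's actual code: the names `LinkCollector`, `find_document_links`, and `DOCUMENT_EXTENSIONS` are hypothetical, and the list of file extensions treated as candidate EDOs is an assumption. The sketch works on HTML that has already been downloaded, so fetching pages and storing results in a database are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of file types treated as candidate EDOs (not specified in the paper).
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".ppt", ".pptx")


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    """Split the same-site links of a page into (documents, pages)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    site = urlparse(base_url).netloc

    documents, pages = [], []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.netloc != site:
            continue  # stay inside the institutional website
        if parsed.path.lower().endswith(DOCUMENT_EXTENSIONS):
            documents.append(link)  # candidate EDO
        else:
            pages.append(link)  # other page of the same site, to visit later
    return documents, pages
```

Returning the remaining same-site pages separately fits the observation that metadata missing from a document may be found on another page of the same website: those pages can be queued for a later visit.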
2. EXISTING WORK
Automatic extraction plays an important role in
processing results from search engines [7]. Various
systems have been proposed for automatic information
gathering.
DeLa (Data Extraction and Label Assignment for
Web Databases): DeLa, described by J. Wang and F.H.
Lochovsky [9], automatically extracts data from a
website and assigns meaningful labels to the data.
This technique concentrates on pages that query a
back-end database through complex search forms rather
than keywords.
ViPER (Visual Perception-based Extraction of
Records): ViPER, described by K. Simon and G. Lausen [13],
is a fully automated information extraction tool.
The technique is based on the assumption that a web
page contains at least two consecutive data records
that exhibit some structural and visual similarity.
ViPER is able to extract the relevant data with
respect to the user's visual perception of the web page.
It handles only contiguous data within a page and does
not process nested structures effectively. It performs
data extraction well, but no implementation is available
[12].
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 775