RESEARCH ARTICLE
177 1936-6612/2011/4/400/008 doi:10.1166/asl.2011.1261
Copyright © 2011 American Scientific Publishers Advanced Science Letters
All rights reserved Vol. 4, 400–407, 2014
Printed in the United States of America
HTML Format Tables Extraction with
Differentiating Cell Content as Property Name
*Detty Purnamasari¹, Lintang Yuniar Banowosari², I Wayan Simri Wicaksana³, Suryadi Harmanto⁴
¹ Information System, Gunadarma University, Jl. Margonda Raya No. 100 Pondok Cina Depok, Indonesia
² Information Management, Gunadarma University, Jl. Margonda Raya No. 100 Pondok Cina Depok, Indonesia
³,⁴ Information Technology, Gunadarma University, Jl. Margonda Raya No. 100 Pondok Cina Depok, Indonesia
Websites present data in various forms and formats, one of which is the table. Tables on the Internet can be retrieved
by copy and paste, but this is impractical when many tables are involved and the extracted results must then be
merged with other tables. This article discusses research on extracting HTML tables and storing them in database
form. The approach uses one algorithm to determine the number of rows and the number of columns of the
table, and another algorithm to match the contents of the extracted table cells against a Property Name database, so
that it is known whether the extracted table has its properties in a row, in a column, or has no properties at all. The tables and the
Property Name database present data in the Indonesian language. The pre-processing stage prepares the Property Name
database, together with techniques to enrich the instances of that database. The tables extracted are simple tables in HTML
format, that is, tables in which no merged rows or columns are found at the row 1/column 1
position. This research provides techniques to enrich the instances of a database and, with the use of illustrations,
an approach by which the extraction of tables in HTML format can be carried out semi-automatically. In addition, the properties in the
extracted table can be distinguished from the cell contents that are table data.
Keywords: HTML, Property Name of Table, Table Extraction, Website
1. INTRODUCTION
Data sources are available on the Internet in various
forms and formats, one of which is the table.
A table consists of cells, where each cell can contain
either a label (the name of an attribute) or data
(the content or value of an attribute)¹⁰. For example,
the web site of a travel agent may show data on the
sales of airline tickets in the form of a table.
Data in the form of tables are retrieved from the Internet
either to be processed further or to merge
the extraction results with existing data.
* Email Address: detty@staff.gunadarma.ac.id / dettydepe@yahoo.com
Data retrieval on the Internet can be done by
means of copy and paste, but this is not easy when
many tables are involved. A table extraction approach is
useful when several tables are to be taken from various
sources on the Internet. An illustration is given in
Figure 1, where the results of the data retrieval can be
merged for further interoperability
processing.
Figure 1 shows two tables of ticket
pricing information whose property names are
different but have the same meaning. Table
extraction is useful for combining the contents of both
tables into a single table.
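The merging idea described above can be illustrated with a minimal Python sketch. It is not the algorithm of this research; it only assumes simple tables (no merged cells) whose property names sit in the first row. The lookup dictionary PROPERTY_NAMES, the Indonesian sample words in it, the two sample tables and all function names are hypothetical and stand in for the Property Name database and for the tables of Figure 1.

```python
from html.parser import HTMLParser


class SimpleTableParser(HTMLParser):
    """Collect the cell text of a simple HTML table (no rowspan/colspan)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being read, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return the table as a list of rows of cell strings."""
    parser = SimpleTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical stand-in for the Property Name database:
# each known property name maps to one canonical name.
PROPERTY_NAMES = {
    "tujuan": "tujuan",
    "destinasi": "tujuan",
    "harga": "harga",
    "tarif": "harga",
}


def to_records(rows):
    """Treat row 1 as property names and the remaining rows as data."""
    header = [PROPERTY_NAMES.get(cell.lower()) for cell in rows[0]]
    if None in header:
        raise ValueError("row 1 is not fully matched by the property names")
    return [dict(zip(header, row)) for row in rows[1:]]


def merge_tables(*html_tables):
    """Merge several tables into one list of records with canonical names."""
    merged = []
    for html in html_tables:
        merged.extend(to_records(parse_table(html)))
    return merged


# Hypothetical sample tables with different but equivalent property names.
TABLE_A = ("<table><tr><th>Tujuan</th><th>Harga</th></tr>"
           "<tr><td>Denpasar</td><td>750000</td></tr></table>")
TABLE_B = ("<table><tr><td>Destinasi</td><td>Tarif</td></tr>"
           "<tr><td>Surabaya</td><td>500000</td></tr></table>")

merged = merge_tables(TABLE_A, TABLE_B)
```

In this sketch both tables end up as records keyed by the same canonical property names, so they can be stored together. Tables whose property names lie in the first column, or that have no property names, would need additional handling that the sketch does not attempt.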