TELKOMNIKA Telecommunication, Computing, Electronics and Control
Vol. 19, No. 1, February 2021, pp. 317~326
ISSN: 1693-6930, accredited First Grade by Kemenristekdikti, Decree No: 21/E/KPT/2018
DOI: 10.12928/TELKOMNIKA.v19i1.16205
Journal homepage: http://journal.uad.ac.id/index.php/TELKOMNIKA

WEIDJ: Development of a new algorithm for semi-structured web data extraction

Ily Amalina Ahmad Sabri, Mustafa Man
Faculty of Ocean Engineering Technology and Informatics, Universiti Malaysia Terengganu, Terengganu, Malaysia

Article Info
Article history: Received Mar 29, 2020; Revised Aug 9, 2020; Accepted Aug 29, 2020

ABSTRACT
In the era of industrial digitalization, people are increasingly investing in solutions that support their processes for data collection, data analysis, and performance improvement. This paper considers advancing web-scale knowledge extraction and alignment by integrating a few sources and exploring different methods of aggregation and attention, with a focus on image information. The main aim of data extraction from semi-structured data is to retrieve useful information from the web. Data on the deep web is retrievable, but only through requests made by form submission, because search engines cannot reach it. As HTML documents grow larger, the data extraction process has been found to suffer from lengthy processing times. In this work, we propose an improved model, wrapper extraction of image using document object model (DOM) and JavaScript object notation (JSON) data (WEIDJ), in response to the promising results of mining a higher volume of images in various formats. To assess the efficiency of WEIDJ, we compare its data extraction performance at different levels of page extraction with that of VIBS, MDR, DEPTA, and VIDE. WEIDJ yielded the best results, with a precision of 100, a recall of 97.93103, and an F-measure of 98.9547.
Keywords: Document object model; JavaScript object notation; Web data extraction; Wrapper extraction of image

This is an open access article under the CC BY-SA license.

Corresponding Author:
Ily Amalina Ahmad Sabri
Faculty of Ocean Engineering Technology and Informatics
Universiti Malaysia Terengganu
Kuala Nerus, Terengganu, Malaysia
Email: ilylina@umt.edu.my

1. INTRODUCTION
The number of devices and gadgets connected to the Internet is on the rise. This increase in connectivity makes the web the largest source of information worldwide. With the large amount of data residing on the web, complemented by advanced technologies in database processing, gathering, collecting, and processing the data has become a seamless effort. As a consequence of this exponential data growth, it is important for users to adopt advanced data analytics technologies for efficient storage, retrieval, and analysis of the data. The main aim is to put this data to good use by learning about patterns and trends that can make a positive impact on our lifestyle. However, the data itself does not achieve these objectives; rather, the solutions arise from analyzing it and finding the answers we need. This accumulation of data, in terms of volume, technology, and techniques, is often discussed in relation to mining data from the World Wide Web. Figure 1 shows the number of scholarly works over time by publication type, such as book, dissertation, journal article, report, and conference proceeding, as reported by lens.org. The graph makes the trend in this research field easy to see.