Bulletin of Electrical Engineering and Informatics
Vol. 13, No. 3, June 2024, pp. 2036~2047
ISSN: 2302-9285, DOI: 10.11591/eei.v13i3.7665
Journal homepage: http://beei.org

Proposed threshold-based and rule-based approaches to detecting duplicates in bibliographic database

M. Miftakul Amin 1,2, Deris Stiawan 3, Ermatita 3, Rahmat Budiarto 4
1 Department of Computer Engineering, Politeknik Negeri Sriwijaya, Palembang, Indonesia
2 Faculty of Engineering, Universitas Sriwijaya, Palembang, Indonesia
3 Department of Computer Engineering, Faculty of Computer Science, Universitas Sriwijaya, Palembang, Indonesia
4 Department of Computer Science, College of Computing and Information, Al-Baha University, Alaqiq, Saudi Arabia

Article Info
Article history: Received Oct 4, 2023; Revised Feb 22, 2024; Accepted Feb 28, 2024

ABSTRACT
Bibliographic databases are used to measure the performance of researchers, universities, and research institutions, so high data quality is required and duplicate records must be avoided. One weakness of the threshold-based approach to duplicate detection is its low accuracy, so another approach is needed to improve detection. This study proposes a method that combines threshold-based and rule-based approaches to duplicate detection. Both approaches are implemented in the comparison stage. The cosine similarity function is used to create weight vectors from the features. A comparison operator then determines whether a pair of records is classified as a duplicate or not. Three research databases on the Science and Technology Index (SINTA) database are investigated: Web of Science (WoS), Scopus, and Google Scholar (GS). Rule 4 and Rule 5 provide the best performance. For the WoS dataset, the accuracy, precision, recall, and F1-measure values were all 100.00%. For the Scopus dataset, the accuracy and precision values were 100.00%, recall was 98.00%, and the F1-measure was 98.00%.
For the GS dataset, the accuracy was 100.00%, precision was 99.00%, recall was 97.00%, and the F1-measure was 98.00%. The proposed method is a potential tool for accurately detecting duplicate records in publication databases.

Keywords: Duplicate detection, Research database, Rule-based, Similarity function, Threshold-based

This is an open access article under the CC BY-SA license.

Corresponding Author:
Deris Stiawan
Department of Computer Engineering, Faculty of Computer Science, Universitas Sriwijaya
Palembang, Indonesia
Email: deris@unsri.ac.id

1. INTRODUCTION
Information is increasingly stored electronically, and it can be conveniently accessed and exchanged as both interaction and internet usage grow. Users have access to digital information sources at any time and from any location, and they can search information collections according to their needs. These easily accessible electronic data collections can be used to disseminate knowledge in research and education. Public understanding and scientific literacy can both rise with open and extensive access to scientific knowledge. Amorim et al. [1] found that research on data governance is emerging as a major concern for researchers; therefore, universities and research institutes need to procure a number of tools that facilitate the management of scientific publication data. Meanwhile, on the governance side, Heidorn [2] highlights that scientific asset management requires information collections to contain valid information. In the context of Indonesia, the Garba Rujukan Digital (GARUDA) database is one of several databases that play a