(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 12, 2022
www.ijacsa.thesai.org

Data Clutter Reduction in Sampling Technique

Nur Nina Manarina Jamalludin 1, Zainura Idrus 2, Zanariah Idrus 3, Ahmad Afif Ahmarofi 4, Jahaya Abdul Hamid 5, Nurul Husna Mahadzir 6

College of Computing, Informatics and Media, Universiti Teknologi MARA (UiTM), Selangor, Malaysia 1, 2
Faculty of Computer & Mathematical Sciences, Universiti Teknologi MARA Kedah, Kedah, Malaysia 3, 4, 6
Kolej Matrikulasi Kedah, Kementerian Pendidikan Malaysia, Changlun, Kedah 5

Abstract—Visualization is the process of converting data into visual form so that data patterns can be extracted. Data patterns are knowledge hidden behind the data. However, when data is big, it tends to overlap and clutter the visualization, which distorts the data patterns. Because the data are overly crowded on the visualization, extracting knowledge patterns becomes a challenge. Besides, big data is costly to visualize because its size demands expensive hardware facilities. Moreover, plotting is time-consuming, since big data takes a long time to render on visualizations. For those reasons, there is a need to reduce the size of big datasets while maintaining the data patterns. There are many methods of data reduction: preprocessing operations, dimension reduction, compression, network theory, redundancy elimination, data mining, machine learning, data filtering, and sampling techniques. Among them, the most commonly used is the sampling technique, which derives samples from data populations. Thus, the sampling technique is chosen as the subject of this study on data reduction. However, studies on sampling techniques are scattered and have not been discussed in a single paper. Consequently, the objective of this paper is to gather them in a single paper for further analysis, so that they can be understood in detail.
To achieve the objective, three interdisciplinary databases, namely ACM Digital Library, IEEE Xplore, and ScienceDirect, have been selected. From these databases, a total of 48 studies published between 2017 and 2021 have been extracted. In addition to sampling techniques, this paper also covers big data, data visualization, data clutter, and data reduction.

Keywords—Sampling technique; probability sampling; non-probability sampling; data clutter; big data; data visualization; data reduction

I. INTRODUCTION

Data visualization is a technique for converting data into visual form in order to extract the knowledge hidden behind the data through data patterns. According to [1], data visualization involves a combination of people with distinct visualization-related skills. Data visualization is also a technology for exploring data interactively; through data exploration, various data patterns can be revealed.

Big data plays an ever larger role in today's technologies. Communities depend on data to gain information for decision making [2]. The advantage of data visualization is that it supports analysis, identifies issues, and tackles problems faster through data patterns [3]. Big data technology is designed to process enormous datasets for process optimization and decision making [4, 5]. However, enormous datasets, both structured and unstructured, are complex because they involve an extensive amount of data. Thus, they frequently cannot be handled by conventional processing techniques and algorithms [2, 6]. This is especially true when data come from various sources in various forms and formats, yet need to be integrated prior to processing. Other challenges in dealing with big data are the effectiveness and efficiency of understanding, storing, managing, and developing data visualization [7]. Plotting such big data to form visualizations requires high-end, expensive hardware and software facilities.
Nevertheless, when data have been successfully plotted and converted into visual form, it is common for data points to overlap on top of each other, which leads to data clutter issues. Data clutter can be defined as data that overlap on top of each other, which can lead to a massive number of false detections over a search space that relies on pixel patterns [8]. Fig. 1 below shows an example of data clutter.

Fig. 1. Example of data clutter (Source: [51])

Another concern regarding data clutter is plotting efficiency. It takes longer to plot cluttered data into visual form, a computational overhead that is costly [9]. Data clutter also leads to unrecognizable data patterns, when in fact extracting those patterns is the main objective of data visualization [10] for strategic planning and decision making. In other words, the whole point of data visualization is to extract data patterns in order to uncover the gems of knowledge hidden behind the data. Thus, there is a need to overcome data clutter, and one of the techniques for doing so is data reduction. Data reduction is one of the methods for shrinking computational overhead [11]. Although the dataset is reduced, the original information in the dataset should be preserved without sacrificing any data patterns. However, data reduction may remove some information from the original dataset, which can lead to unexpected output [12]. Nevertheless, data reduction can solve the difficulties that both data and visualization scientists face [13].
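As an illustration of the sampling idea behind this kind of data reduction, the following minimal Python sketch draws a simple random sample from a large point set before plotting. It is a hypothetical example, not taken from any of the surveyed studies: the function name `sample_for_plot` and the synthetic Gaussian data are assumptions for demonstration only. The intent is that a sufficiently large random sample preserves the overall distribution, and hence the visible data patterns, while greatly reducing overdraw and rendering time.

```python
import random

def sample_for_plot(points, sample_size, seed=None):
    """Return a simple random sample of the data points.

    Drawing a sample before plotting reduces overplotting (data
    clutter) while, for a reasonably large sample, preserving the
    overall distribution and hence the visible data patterns.
    """
    if sample_size >= len(points):
        return list(points)  # nothing to reduce
    rng = random.Random(seed)  # seeded for reproducible figures
    return rng.sample(points, sample_size)

# Example: reduce 100,000 synthetic (x, y) points to 1,000 before plotting.
gen = random.Random(0)
population = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100_000)]
sample = sample_for_plot(population, 1_000, seed=42)
```

The reduced `sample` list can then be handed to any plotting library in place of the full population; only one point in a hundred is rendered, yet the scatter retains the same Gaussian shape.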