A Smart Data Pre-Processing Approach by Using ML Algorithms on IoT Edges: A Case Study

Şükrü Mustafa Kaya
Computer Engineering Department, Istanbul Aydin University, Istanbul, Turkey
smustafakaya@stu.aydin.edu.tr

Ali Güneş
Computer Engineering Department, Istanbul Aydin University, Istanbul, Turkey
aligunes@aydin.edu.tr

Atakan Erdem
Department of Biological Sciences, University of Calgary, Calgary, Canada
atakan.erdem1@ucalgary.ca

Abstract—The internet of things (IoT) is a technology that allows many objects used in daily life to produce a variety of data and transfer those data to other objects or systems. Its application domain is growing day by day, and the technologies used for its infrastructure are varied. However, processing the huge amount of sensor data effectively requires smart and fast filtering solutions. As a data pre-processing task, smart data filtering improves not only data processing speed but also data quality. In other words, big data management is facilitated because more effective results are obtained from meaningful data with little noise. In this study, we examined big IoT data stored on IoT edges to detect anomalies in temperature, age, gender, weight, height, and time data. In this context, the Logistic Regression algorithm was applied at both the sensing and network layers for anomaly detection. Furthermore, the performance of the classification algorithm in terms of speed and accuracy was reported as the output of the study.

Keywords: internet of things; big data management; big data analytics; data filtering

I. INTRODUCTION

As digitalization gains momentum worldwide, the generation, collection, analysis, and storage of data that facilitate our daily lives, together with the establishment of decision-making mechanisms based on meaningful data, have gained importance.
In parallel with these developments, IoT technology has emerged. It includes cloud computing and database systems together with sensing networks, devices, or people that can observe the physical world, produce and process data, and perform decision-making processes. The devices that make up this technology can communicate with each other over the internet and share information. As a result, IoT technology is being used effectively in smart agriculture, smart homes, smart industry, smart cities, and smart energy systems. However, IoT devices cannot filter data while producing them [1, 2]. IoT edges are the first place where generated data can be pre-processed before they go to the cloud. Filtering data before they reach the cloud is important because, without filtering, the speed and accuracy of cloud services decrease [3, 4]. Therefore, speed and accuracy are two important criteria to consider. Since there are no similar studies prioritizing the speed and accuracy criteria within this scope, we expect that our study and the obtained experimental results will make important contributions to studies in this field.

Studies in different IoT areas can be mentioned as examples to show the importance of the problems we focus on. Eugene S. et al. [5] examine the benefits of a wide range of efficient, successful, and innovative applications and services for the IoT and big data analysis. The study aims to examine data analysis applications in different IoT areas, to provide a classification of analytical approaches, and to put forward a layered taxonomy from IoT data to analytics. The taxonomy supplies insight into the appropriateness of analytical techniques, and the information obtained yields a meaningful result that provides the technology and infrastructure for IoT analytics. Finally, the study investigates developments that will shape future research on the IoT.
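To make the edge-side filtering idea concrete, the following is a minimal sketch of a logistic regression filter that drops anomalous temperature readings before they are forwarded to the cloud. It is not the implementation used in this study. The training values, the nominal temperature `NOMINAL_TEMP_C`, the scaling constant `SCALE`, and all function names are assumptions made for illustration, and only one of the examined attributes (temperature) is used.

```python
import math

# Illustrative assumptions, not values from the study.
NOMINAL_TEMP_C = 36.8   # assumed nominal reading
SCALE = 10.0            # assumed scaling so the feature stays small

# Synthetic labelled readings: 0 = normal, 1 = anomaly.
TRAIN_TEMPS = [36.5, 36.8, 37.0, 36.6, 37.2, 45.0, 20.1, 60.3, 0.0, 41.5]
TRAIN_LABELS = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def feature(temp_c):
    """Scaled absolute deviation from the assumed nominal temperature."""
    return abs(temp_c - NOMINAL_TEMP_C) / SCALE


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(temps, labels, lr=0.5, epochs=2000):
    """Fit a one-feature logistic regression with batch gradient descent."""
    xs = [feature(t) for t in temps]
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(w * x + b) - y   # gradient of the log loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def is_anomaly(temp_c, w, b, cutoff=0.5):
    """Classify a reading as anomalous when its probability reaches cutoff."""
    return sigmoid(w * feature(temp_c) + b) >= cutoff


def edge_filter(readings, w, b):
    """Keep only the readings that the model classifies as normal."""
    return [t for t in readings if not is_anomaly(t, w, b)]
```

A caller would train once with `w, b = train(TRAIN_TEMPS, TRAIN_LABELS)` and then pass each batch of readings through `edge_filter` on the edge node. Using the deviation from a nominal value as the feature is a design choice of this sketch: it lets a linear model flag readings that are either too high or too low.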
In their article, Gunasekaran M. et al. suggest a new architecture for applying the internet of things to store and process scalable big sensor data for healthcare implementations. The suggested architecture consists of two key sub-architectures: the meta fog routing (MF-R) and the grouping and selection (GC) architectures. The MF-R architecture uses big data technologies such as Apache Pig and Apache HBase to collect and store the big sensor data produced by distinct sensor devices. The GC architecture is used to enable the integration of fog computing with cloud computing. In addition, a MapReduce-based prediction model is used to predict heart diseases using the architecture [5, 6]. Yasmin F. et al. propose an adaptive method to reduce data. The proposed method is an estimation-based data reduction utilizing least mean squares (LMS) adaptive filters. Specifically, the method runs on both the source and base station nodes and is based on a convex combination of two decoupled LMS window filters of different sizes that predict the next measured values, so that the sensor nodes need to transmit the detected values only when they deviate significantly from the predicted values [7]. Another article proposes a new model for the effective management of big data generated by different sources, such as sensor data that do not require human intervention, by optimizing virtual machine selection. The planned model aims to optimize the storage of patients' data to provide a real-time data retrieval mechanism and thus to improve the performance of health systems [8].
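The estimation-based reduction idea summarized for [7] can be illustrated with a simplified sketch. It is not the method of [7]: instead of a convex combination of two LMS filters, it uses a single normalized LMS predictor, and the function name `lms_reduce` and the parameters `window`, `mu`, and `threshold` are assumptions made for illustration. The source node transmits a reading only when it deviates from the prediction by more than `threshold`. Because the weights are updated only on transmitted readings, a base station running the same code can reproduce the same predictions.

```python
def lms_reduce(readings, window=3, mu=0.5, threshold=0.5):
    """Simulate prediction-based data reduction with one normalized LMS filter.

    Returns (transmitted, reconstructed), where transmitted is a list of
    (index, value) pairs actually sent and reconstructed is the series
    the receiver would hold.
    """
    weights = [1.0 / window] * window   # start as a moving average
    history = []                        # values known to both ends
    transmitted = []
    reconstructed = []

    for i, x in enumerate(readings):
        if len(history) < window:
            # Warm-up: not enough history to predict, so send the raw value.
            transmitted.append((i, x))
            value = x
        else:
            recent = history[-window:]
            prediction = sum(w * h for w, h in zip(weights, recent))
            error = x - prediction
            if abs(error) > threshold:
                # Significant deviation: send the reading and adapt the
                # weights (normalized LMS update) on both ends.
                transmitted.append((i, x))
                norm = sum(h * h for h in recent) + 1e-9
                weights = [w + mu * error * h / norm
                           for w, h in zip(weights, recent)]
                value = x
            else:
                # Prediction is close enough: nothing is sent and both
                # ends continue with the predicted value.
                value = prediction
        history.append(value)
        reconstructed.append(value)

    return transmitted, reconstructed
```

For a constant signal, only the first `window` readings are transmitted, since every later prediction stays within the threshold. By construction, each reconstructed value differs from the true reading by at most `threshold`.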
2021 International Conference on Artificial Intelligence of Things (ICAIoT) 978-1-6654-0176-0/21/$31.00 ©2021 IEEE DOI 10.1109/ICAIoT53762.2021.00014
In another study, studies on the internet of