Improving Clustering Results through Active Learning

Marjan Qazvini ∗
Independent Researcher, Iran

Abstract

Data labelling is a task that arises in various fields, including image processing, voice recognition, and text classification. Active Learning (AL) is a method that can be used to simplify this task. This study focuses on tabular data and the classification of disabilities. We use the English Longitudinal Study of Ageing (ELSA) and different socio-demographic, disease, and disability factors to group participants into various disability levels. Since the ground truth is unknown, we employ different clustering methods. The results show that by combining AL strategies, even with small amounts of data, we can achieve accuracy comparable to that of the entire dataset.

JEL classification: C1, C8
Keywords: Active Learning, Coclus, K-modes

1 Introduction

One area of Machine Learning (ML) focuses on classifying data based on known features. When the ground truths are known, the problem is classified as supervised. When the ground truths are unknown or difficult to obtain, the problem falls under unsupervised learning. In such cases, we must employ different clustering methods. The common issue with homogeneous data (such as images, voices, and texts) is the shortage of labels. However, this problem has not been fully explored in the context of inhomogeneous data, such as tabular data.

Active Learning (AL) encompasses a set of strategies designed to address this label shortage by selecting the most informative data. The idea behind AL is that a small amount of informative data can be sufficient for learning a pattern. AL is frequently used in image classification, and therefore, most AL strategies have been developed in this area by combining different neural networks and AL methods. For example, the cost-effective AL (CEAL) method combines Convolutional Neural Networks (CNNs) with AL [26].
This approach selects the least confident data points for manual labelling, while the most confident ones are automatically pseudo-labelled. Challenges arise when trying to combine deep learning models with AL. Deep learning typically requires large datasets, and the process of labelling just one sample in each iteration can be time-consuming. Furthermore, deep models tend to achieve high training accuracy, which reduces uncertainty.

∗ Corresponding author: marjan.qazvini@gmail.com

To introduce