Challenges in Predicting Community Periodontal Index from Hospital Dental Care Records

Daniel Vieira
X-akseli Oy
Espoo, Finland
daniel.vieira@x-akseli.fi

Jaakko Hollmén
Aalto University School of Science
Department of Information and Computer Science
Espoo, Finland
Jaakko.Hollmen@aalto.fi

Jari Linden
Chief Dentist, Health Care Center
Lohja, Finland
Jari.Linden@lohja.fi

Jorma Suni
Dental Director, CEO, Oral Health Corporation
Vantaa, Finland
Jorma.Suni@vantaa.fi

Abstract

Many studies have predicted periodontal diseases from genetic information, dental images or patients' habits, but few have used dental visit records. This paper proposes a methodology based on Random Forest to classify the periodontal disease condition of patients, and a way to assess the most important features that lead to a successful classification. We investigate three problematic issues found in dental care records: noise, class imbalance and concept drift. We propose to overcome them by, respectively, detecting and removing noise, undersampling, and considering only recent data. Experiments performed on records from Finnish public hospitals in two cities gave good classification results, and feature importance was able to detect dentists with poor performance with respect to diagnosis and treatment application.

1 Introduction

In recent years many studies on periodontal disease prediction using machine learning methods have emerged. These studies have shown that periodontal disease can be successfully predicted using decision trees with genetic information [1] or using Support Vector Machines on saliva samples [2]. Dental care image classification has also proven effective when stepwise linear discriminant analysis is applied to features extracted from the images [3] [4]. Few studies [5] [6] have, however, been carried out using only patient visit records.
Dental care visit records were used in a 2010 study [5] for periodontal risk assessment using Artificial Neural Networks. This study was performed on a small set of 230 patients and included comprehensive information about each subject, such as pan chewing habits, family history of periodontitis and patient questionnaires regarding their food habits. In 2002 a statistical analysis of periodontal disease prediction was carried out on a group of 523 subjects who were enrolled in the Veterans Affairs Dental Longitudinal Study [6]. The data included radiography and patient questionnaires about past dental surgeries. Both studies classified the subjects into five groups from low to high risk. They demonstrated that periodontal disease can be successfully predicted from visit records if the necessary information about the patient is available.

Even though the number of dental care visit records is quite large, they are difficult to use in machine learning for two reasons. First, the data is usually entered by humans, so it is subjective and prone to mistakes. Second, data properties change over time as a result of factors such as changes in treatments, new doctors or evolution in patient health habits. One way to mitigate these problems is to select the subjects and observations carefully so as to avoid imbalance between classes, and either to choose a small subset of observations or to choose features whose changes over the years do not degrade the predictor's accuracy.

This paper seeks to build a periodontal disease predictor using solely dental care visit records. It differs from previous studies in that questionnaires, the family history of each patient and radiography are not used. All patients who had more than one examination were considered, and the problems of class imbalance, noise and concept drift were addressed.
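As a rough illustration of two of the remedies named above (considering only recent data against concept drift, and undersampling against class imbalance), the following minimal Python sketch filters toy records by examination date and then randomly undersamples the larger classes. The record fields (`patient`, `exam_date`, `cpi`), the cutoff date and the data values are hypothetical and chosen only for this example; the sketch is not the procedure used in the experiments.

```python
import random
from collections import defaultdict
from datetime import date


def keep_recent(records, cutoff):
    """Keep only records whose examination date is on or after the cutoff."""
    return [r for r in records if r["exam_date"] >= cutoff]


def undersample(records, label_key="cpi", seed=0):
    """Randomly drop records from larger classes until every class
    has as many records as the smallest class."""
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)
    if not by_class:
        return []
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], n_min))
    return balanced


# Toy, made-up records (hypothetical field names, not the hospital data).
records = [
    {"patient": 1, "exam_date": date(2005, 3, 1), "cpi": 0},
    {"patient": 2, "exam_date": date(2006, 7, 9), "cpi": 2},
    {"patient": 3, "exam_date": date(2010, 2, 1), "cpi": 0},
    {"patient": 4, "exam_date": date(2010, 5, 4), "cpi": 0},
    {"patient": 5, "exam_date": date(2011, 1, 8), "cpi": 0},
    {"patient": 6, "exam_date": date(2011, 6, 2), "cpi": 0},
    {"patient": 7, "exam_date": date(2011, 9, 3), "cpi": 2},
    {"patient": 8, "exam_date": date(2012, 4, 7), "cpi": 2},
]

recent = keep_recent(records, cutoff=date(2010, 1, 1))
balanced = undersample(recent)
```

Filtering by date before undersampling means the class balance is computed on the data that will actually be used for training. The order of the two steps is an assumption of this sketch.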
Features not related to periodontal disease were taken into account to study the effects of other factors on the matter, such as the doctor who treated the patient.

978-1-4799-1053-3/13/$31.00 © 2013 IEEE CBMS 2013 107

The Community Periodontal Index (CPI) is the most widely