A System to Improve Data Quality in Healthcare Using Naïve Bayes Classifier

Shwetkranti N. Taware
PG Student, Computer Department, D.Y. Patil College of Engg., Pune, India

Vaishali Kolhe
Assistant Professor, Computer Department, D.Y. Patil College of Engg., Pune, India

Abstract— Data accuracy is very important in modern databases. Inaccurate data leads to inaccurate decisions. Data errors in some domains, such as medicine and banking, may have particularly severe consequences. Data errors can arise at many points in the life cycle of information, from data entry through storage, integration and cleaning, and every step presents a chance to address data accuracy. At data entry time, errors can be caught and corrected most quickly, yet the information community has paid comparatively little attention to data quality at collection time. The Bayesian Network Model is an end-to-end system for form design, data entry and data quality assurance. Using previous form submissions, the system learns a probabilistic model over the fields of the form. The model is applied at every step of the data entry process to improve data quality. Before entry, it finds an ordering of form fields that promotes rapid information capture, driven by greedy information gain, and it can statically reformulate form fields to elicit more accurate responses. During entry, it dynamically adapts the form based on the values already entered, facilitating re-asking, reformulation and real-time interface feedback in the spirit of providing appropriate entry friction. After entry, it automatically identifies possibly erroneous inputs, guided by contextualized error likelihoods, and re-asks those form fields, possibly reformulated, to verify their correctness. The Bayesian Network Model has drawbacks, which the proposed work aims to overcome. The proposed system has the potential to improve data quality significantly at a reduced cost compared to the current approach.
Keywords— Data accuracy; data entry; form design; dataset; Bayesian network; dynamic form; learning.

I. INTRODUCTION

Data quality is defined in terms of accuracy, completeness, consistency, uniqueness, usability and so on. A number of data quality improvement techniques have been proposed by different authors. Organizations and individuals make important decisions based on inaccurate data stored in supposedly authoritative databases. Accurate data is especially important for critical applications such as healthcare and banking, where even a single error can have drastic consequences. Data entry is one of the best opportunities to assure data quality. Data entry is everywhere: organizations all over the world rely on clerks to transcribe information from paper forms into supposedly authoritative databases. Maintaining high quality during data entry is a challenge for many smaller institutions, particularly those operating in the developing world. Because they lack expertise in form design, field constraints and other validation logic are often not specified correctly. They also lack the resources needed for double entry. For low-resource organizations, data entry is the first and best opportunity to address data accuracy.

In a banking system, the data entry operator may not know the constraints on the form fields and may therefore fill in the same form several times. When one record is stored in the database twice, it causes several drawbacks:
1. Wastage of memory.
2. Increased cost of retrieving account details.

For patient data, the data entry operator often keeps a diary, which means that the data is in paper form, and the operator may not be an expert in data entry. In survey methodology, soft constraints are used: for example, when age = 60, the pregnant field can still take the value yes. The system only gives a warning, which means that inaccurate data is stored in the database. Various researchers have presented approaches for improving accuracy during data entry.
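The soft-constraint example above (age = 60 with the pregnant field set to yes) can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the record layout and the threshold `age >= 60` are assumptions made only for illustration. It contrasts a soft check, which warns but still stores the record, with a hard check, which refuses it.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    stored: bool   # True if the record would be written to the database
    message: str   # feedback shown to the data entry operator


def is_implausible(record: dict) -> bool:
    # Hypothetical rule based on the example in the text; the cut-off is an assumption.
    return record.get("age", 0) >= 60 and record.get("pregnant") == "yes"


def soft_check(record: dict) -> CheckResult:
    # Soft constraint: warn only, so an implausible record is still stored.
    if is_implausible(record):
        return CheckResult(True, "warning: pregnant = yes is unlikely for this age")
    return CheckResult(True, "")


def hard_check(record: dict) -> CheckResult:
    # Hard constraint: the implausible record is rejected outright.
    if is_implausible(record):
        return CheckResult(False, "rejected: pregnant = yes is not allowed for this age")
    return CheckResult(True, "")


entry = {"age": 60, "pregnant": "yes"}
soft_result = soft_check(entry)
hard_result = hard_check(entry)
```

With the soft check the implausible entry reaches the database despite the warning. A hard check would block it, but it would also block the rare entry that is unusual yet correct.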
Recently, the authors of [1] presented a new approach called the Bayesian Network Model. The foundation of this approach is a probabilistic representation of data collection forms, induced from existing form values and the relationships between form fields. The Bayesian Network Model system trains and applies these probabilistic models in order to automatically derive process improvements in question flow, question wording and re-confirmation. Since form layout and question selection are often ad hoc, the Bayesian Network Model optimizes form field ordering according to a probabilistic objective function that aims to maximize the information content of form answers as early as possible. Applied before entry, the model generates an entropy-optimal ordering, which places the most important form fields first. Applying its probabilistic model (a Bayesian network) during data entry, the Bayesian Network Model can evaluate the conditional distribution of answers to a form field. Where a form has fields with many extraneous choices, the Bayesian Network Model can opportunistically reformulate them to be easier to answer and better suited to the information already available. The model is also consulted to predict which responses may be erroneous, so that those form fields can be re-asked to verify their correctness. Intelligent re-asking approximates the benefits of double entry at a fraction of the cost.

The proposed system reduces the time a data entry operator needs to fill out a form. The system uses a diabetes dataset to evaluate the quality of the Bayesian network and naïve Bayes algorithms. It also uses a naïve Bayes classifier to obtain predicted values, and it compares the prediction accuracy of the two algorithms.

Shwetkranti N. Taware et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4), 2014, 5206-5209, www.ijcsit.com
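To make the idea of predicting a field value and re-asking unlikely entries concrete, the following is a minimal sketch of a categorical naïve Bayes classifier over form fields. It is not the authors' implementation. The class name `NaiveBayesField`, the Laplace smoothing constant, the re-ask threshold of 0.2 and the toy records are all assumptions chosen for illustration, and the toy records are made up rather than drawn from the diabetes dataset mentioned above.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesField:
    """Categorical naive Bayes that predicts one target form field from the others."""

    def __init__(self, target, alpha=1.0):
        self.target = target
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, records):
        # Count how often each target value occurs, and how often each
        # (field, value) pair occurs together with each target value.
        self.class_counts = Counter(r[self.target] for r in records)
        self.feature_counts = defaultdict(Counter)  # (field, class) -> value counts
        self.values = defaultdict(set)              # field -> values seen in training
        for r in records:
            c = r[self.target]
            for field, value in r.items():
                if field != self.target:
                    self.feature_counts[(field, c)][value] += 1
                    self.values[field].add(value)
        return self

    def posterior(self, record):
        # P(target = c | other fields), computed in log space and normalized.
        total = sum(self.class_counts.values())
        log_scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / total)
            for field, value in record.items():
                if field == self.target or field not in self.values:
                    continue
                count = self.feature_counts[(field, c)][value]
                denom = n_c + self.alpha * len(self.values[field])
                score += math.log((count + self.alpha) / denom)
            log_scores[c] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def should_reask(self, record, threshold=0.2):
        # Flag the entered target value if the model finds it unlikely in context.
        entered = record[self.target]
        return self.posterior(record).get(entered, 0.0) < threshold


# Made-up previous form submissions, for illustration only.
history = (
    [{"age_group": "60+", "pregnant": "no"}] * 4
    + [{"age_group": "20-39", "pregnant": "yes"}] * 3
    + [{"age_group": "20-39", "pregnant": "no"}] * 3
)
model = NaiveBayesField("pregnant").fit(history)

flag_old = model.should_reask({"age_group": "60+", "pregnant": "yes"})
flag_young = model.should_reask({"age_group": "20-39", "pregnant": "yes"})
```

On this toy data the entry with age group 60+ and pregnant = yes falls below the threshold and would be re-asked, while the same answer for age group 20-39 would be accepted. The threshold sets how much entry friction the operator experiences, and it would need tuning on real form submissions.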