An automatic electronic nursing records analysis system based on the text classification and machine learning

Zhang Wei, Zheng Xian Ju, Xie Chun
Dept. of Computer Engineering
Chengdu Technological University
Chengdu, China
zhangwei0317@gmail.com

Jiang Hua, Peng Jin
Metabolomics and Multidisciplinary Laboratory
Institute for Emergency and Disaster Medicine
Sichuan Provincial People’s Hospital
Sichuan Academy of Medical Sciences
Chengdu, China
cdjianghua@gmail.com

Abstract-The enormous volume of unstructured electronic health records is invaluable to medical research for finding the relationship between a patient's disease and the final diagnosis. How to mine this information automatically by computer has long been a hot topic. To capture the relationship between clinical outcomes and the free text written by nurses, we developed an automatic categorization system that processes natural-language nursing records based on the vector space model. 210 electronic nursing records of patients diagnosed with pancreatitis were included in this study. We filtered the restricted corpus for acute pancreatitis classification by information gain (IG) and constructed a text classification system based on partial least squares discriminant analysis (PLS-DA) and support vector machine (SVM). PLS loading value analysis found 20 terms that can be used to classify medical record text. Our machine-learning algorithm effectively classified the free text of nursing care records associated with normal and acute pancreatitis diagnoses after PLS training on pre-classified sets. Applied to large-scale medical documents, this automatic identification technology may provide important clues for the study of acute pancreatitis and other important common diseases.

Keywords-Text classification; Partial Least Squares; Support vector machine; Information gain; Pancreatitis

I.
INTRODUCTION

With the rapid development and popularity of electronic medical record (EMR) technology, the volume of medical record text stored in computer-readable form has increased sharply. How to automatically classify, organize, and manage this voluminous literature, information, and data (most of it text) has become an important topic for research in text mining and machine learning. Nursing records include a great deal of analysis and description of original information from patients. Combined with physical examination and various clinical laboratory test results, this original information can often reflect overall changes in the patient's condition and shows a high correlation with the final clinical diagnosis. Until now, no one has been able to develop software that can automatically identify and understand nursing medical record text. However, developing software and algorithms for a particular disease process plays an important role in understanding the disease [1].

Supported by Research Foundation of Chengdu Technological University No. KY1211009B and Sichuan Provincial Education Board No. 13ZA0047.

How to distinguish severe acute pancreatitis from normal acute pancreatitis in clinical diagnosis has been an important challenge for clinicians. A patient presenting with abdominal pain, vomiting, fever, hematuresis, and elevated amylase may be diagnosed with acute pancreatitis, but a series of imaging examinations, including ultrasound, is key to reaching the final diagnosis. There has long been a lack of quantitative research mining the relationship between the final diagnosis and the early clinical diagnosis. The electronic nursing record summary is a clinical summary recorded by the nurse after the first nursing ward round, and it contains unbiased digital text summarizing clinical observations.
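To make the idea of relating record text to a diagnosis concrete, the following is a minimal sketch of information gain (IG) scoring, the term filter named in the abstract. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `top_terms`), the toy records, and the placeholder token `term_x` are all assumptions made for illustration, and the toy data carries no clinical meaning. Each record is assumed to be already segmented into a set of terms.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of a term: class entropy minus the expected class entropy
    after splitting the records on presence/absence of the term."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def top_terms(docs, labels, k):
    """Return the k highest-IG terms as (term, score) pairs."""
    vocab = sorted({t for d in docs for t in d})
    scored = [(t, information_gain(docs, labels, t)) for t in vocab]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hypothetical, hand-made records (sets of segmented terms); not study data.
docs = [
    {"abdominal_pain", "vomiting", "fever"},
    {"abdominal_pain", "fever", "term_x"},
    {"abdominal_pain", "vomiting"},
    {"abdominal_pain", "term_x", "vomiting"},
]
labels = ["normal", "severe", "normal", "severe"]

for term, score in top_terms(docs, labels, 3):
    print(f"{term}: {score:.3f}")
```

In this toy example a term that appears in every record scores zero, because its presence says nothing about the class, while a term that appears only in one class scores highest. A filter of this kind would keep the top-scoring terms as features before a classifier such as PLS-DA or SVM is trained.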
Information contained in nursing record summaries is likely to be closely related to the final outcome and prognosis of patients.

A large number of nursing records are written in Chinese characters. To construct a proper classifier, we need to convert this large body of unstructured Chinese nursing documents into structured documents through automatic word segmentation by computer. At present, there are two ways to transform unstructured documents into structured documents: knowledge engineering (KE) and machine learning (ML) for extracting the key definitions [2]. For extraction from relatively small amounts of data, KE technology gives better results. For a large number of unknown records, machine learning has more advantages. On the one hand, the nursing medical record system consists of short texts described with a large number of professional terms; on the other hand, no document shows that a patient's condition with a specific disease can be described accurately with a small number of professional terms. To obtain the most important professional terms from a large number of descriptive sentences, machine learning is a very important method for extracting medical record information.

2013 Fifth International Conference on Intelligent Human-Machine Systems and Cybernetics. 978-0-7695-5011-4/13 $26.00 © 2013 IEEE. DOI 10.1109/IHMSC.2013.265

Using the HowNet knowledge base to convert and structure the nursing records, we performed pattern recognition using partial least squares and support vector machine (SVM) to find a way to get specific