Mining Medical Records for Computer Aided Diagnosis

R. Bharat Rao, Romer Rosales, Stefan Niculescu, Sriram Krishnan, Luca Bogoni, Xiang S. Zhou, Balaji Krishnapuram
Siemens Medical Solutions, 51 Valley Stream Parkway, Malvern, PA, USA

1 Introduction

Over the last five years, a new generation of medical data mining tools has dramatically impacted the health care industry by improving the diagnosis of medical diseases and by reducing the time pressure on physicians and nurses. Our demonstration highlights three products for the health care industry, showcasing the potential of novel data mining technologies to save lives on a large scale. During the demonstration, our products will use real-life (de-identified) patient data, in an effort to convey the practical and theoretical challenges unique to data from the medical domain.

2 Early-stage diagnosis of colorectal cancer

2.1 Background

Colorectal cancer (CRC) affected 147,000 patients in the US in 2004, of whom 57,000 died. Unlike many other forms of cancer, CRC is removable if it is found at an early stage. In its early stage, it manifests as a colonic polyp. The recommendation is that each individual over age 50 undergo optical colonoscopy so that any polyps may be removed, repeating the procedure after 10 years if the result is negative and more frequently if any polyps are found. The prevalence of polyps in the general population is roughly 5% to 8%, with only 10% of these showing any signs of cancer (adenomatous polyps).

Virtual colonography (VC), also known as CT Colonography (CTC), was introduced as a means to address problematic cases that could not be accurately diagnosed by earlier methods like optical colonoscopy (OC). It was predicted that CTC could be used as a screening tool, so that only patients with positive findings from CTC would be sent to OC.

2.2 Results of clinical studies

However, radiologists need substantial training to perform CTC, and it is a long procedure.
As a result, in clinical practice, non-expert readers often show a substantially lower sensitivity with CTC (75% or less on medium- to large-sized polyps) [1]. However, in large clinical trials on 145 individuals, our computer aided detection (CAD) system can accurately diagnose patients based on CTC images, with a sensitivity of around 90% for medium- and large-sized polyps [2]. Further, when inexperienced radiologists are assisted by our CAD product in clinical studies, they decrease their false positive rate by 66% while improving their sensitivity on medium- and large-sized polyps to 97% [1]. Thus, CAD-assisted inexperienced radiologists were clinically shown to diagnose polyps as accurately as experienced radiologists. Figure 1 shows screenshots from the software that will be demonstrated.

Figure 1: Polyp detection in virtual colonoscopy

2.3 Novel technical contributions

Although product development is still ongoing, many new theoretical and technical contributions have already been proposed. Our demonstration will highlight the following novel contributions from the fields of machine learning, data mining and computer vision.

Batch-wise classification of non-iid data: Unlike most algorithms, which assume the data to be drawn iid, we exploit inter-sample correlations to improve accuracy while classifying a set of samples simultaneously [3].

Multiple instance learning: A novel, convex-hull based algorithm finds optimal classifiers when the disease status of image regions is imputed (guessed statistically) based on proximity to radiologist marks in training images [4].

Curvature pattern feature descriptions [5]: Computes the principal curvatures on the surface and characterizes patterns of curvature, with the intuition that polyps are ellipsoidal