The Smart Data Extractor, a Clinician Friendly Solution to Accelerate and Improve the Data Collection During Clinical Trials

Sophie QUENNELLE a,b,c,1, Maxime DOUILLET d, Lisa FRIEDLANDER b,c, Olivia BOYER b,c, Antoine NEURAZ a,b,c, Anita BURGUN a,b,c and Nicolas GARCELON a,b,d

a HeKA Team, Inria Inserm UMR_S1138, PariSantéCampus, Paris, France
b Université de Paris Cité, Paris, France
c Hôpital Universitaire Necker-Enfants malades, APHP, Paris, France
d Data Science Platform, Imagine Institute, Paris, France

Abstract. In medical research, the traditional way to collect data, i.e. browsing patient files, has been shown to introduce bias and errors and to require substantial human labor and cost. We propose a semi-automated system able to extract every type of data, including notes. The Smart Data Extractor pre-populates clinical research forms by following rules. We performed a cross-testing experiment to compare semi-automated with manual data collection. Twenty target items had to be collected for 79 patients. The average time to complete one form was 6’81’’ for manual data collection and 3’22’’ with the Smart Data Extractor. There were also more mistakes during manual data collection (163 for the whole cohort) than with the Smart Data Extractor (46 for the whole cohort). We present an easy-to-use, understandable and agile solution to fill out clinical research forms. It reduces human effort and provides higher-quality data, avoiding data re-entry and fatigue-induced errors.

Keywords. Electronic Health Records, Clinical Research Forms, Clinical Data Reuse, Observational Study

1. Introduction

Most of the patient information required for clinical trials and registries is available in patients’ electronic health records (EHRs).[1] The most common way to fill in Case Report Forms (CRFs) is still to browse patients' documents in search of the information required by the study protocol. This process induces delays, human effort, costs, and a risk of transcription errors.
Recent efforts have been dedicated to reusing EHR data to identify patients eligible for trials, to optimize clinical trial protocols, and to transcribe the variables of interest from EHRs to CRFs automatically.[2, 3] However, several pitfalls remain: EHR data are heterogeneous, the completeness of structured data elements is low, and most clinical information is locked in medical notes and needs to be transformed into a structured format before secondary use.[4] Our objective was to develop

1 Corresponding Author: Sophie Quennelle, E-mail: sophie.quennelle@protonmail.com

Caring is Sharing – Exploiting the Value in Data for Health and Innovation
M. Hägglund et al. (Eds.)
© 2023 European Federation for Medical Informatics (EFMI) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI230112