The Smart Data Extractor, a Clinician
Friendly Solution to Accelerate and
Improve the Data Collection During
Clinical Trials
Sophie QUENNELLE (a,b,c,1), Maxime DOUILLET (d), Lisa FRIEDLANDER (b,c), Olivia BOYER (b,c), Antoine NEURAZ (a,b,c), Anita BURGUN (a,b,c) and Nicolas GARCELON (a,b,d)
a HeKA Team, Inria Inserm UMR_S1138, PariSantéCampus, Paris, France
b Université de Paris Cité, Paris, France
c Hôpital Universitaire Necker-Enfants malades, APHP, Paris, France
d Data Science Platform, Imagine Institute, Paris, France
Abstract. In medical research, the traditional way to collect data, i.e. browsing
patient files, has been proven to induce bias, errors, human labor and costs. We
propose a semi-automated system able to extract every type of data, including notes.
The Smart Data Extractor pre-populates clinical research forms by following rules.
We performed a cross-testing experiment to compare semi-automated to manual
data collection. Twenty target items had to be collected for 79 patients. The average time
to complete one form was 6’81’’ for manual data collection and 3’22’’ with the
Smart Data Extractor. There were also more mistakes during manual data collection
(163 for the whole cohort) than with the Smart Data Extractor (46 for the whole
cohort). We present an easy-to-use, understandable and agile solution to fill out
clinical research forms. It reduces human effort and provides higher-quality data,
avoiding data re-entry and fatigue-induced errors.
Keywords. Electronic Health Records, Clinical Research Forms, Clinical Data
Reuse, Observational Study
1. Introduction
Most of the patient information required for clinical trials and registries is available
in patients’ electronic health records (EHRs).[1] The most common way to fill in Case
Report Forms (CRFs) is still to browse patients’ documents in search of the information
required by the study protocol. This process induces delays, human effort, costs, and
a risk of transcription errors. Recent efforts have been dedicated to reusing EHR data to
identify patients eligible for trials, to optimize clinical trial protocols, and to transcribe the
variables of interest from EHRs to CRFs automatically.[2, 3] However, several pitfalls
remain: EHR data are heterogeneous, the completeness of structured data elements is
low, and most of the clinical information is locked in medical notes and needs to be
transformed into a structured format before secondary use.[4] Our objective was to develop
1 Corresponding Author: Sophie Quennelle, E-mail: sophie.quennelle@protonmail.com
Caring is Sharing – Exploiting the Value in Data for Health and Innovation
M. Hägglund et al. (Eds.)
© 2023 European Federation for Medical Informatics (EFMI) and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/SHTI230112