EDITED BY
Uwe Aickelin,
The University of Melbourne, Australia
REVIEWED BY
Terri Elizabeth Workman,
George Washington University, United States
Paul M. Heider,
Medical University of South Carolina,
United States
*CORRESPONDENCE
Arlene Casey
arlene.casey@ed.ac.uk
RECEIVED 12 March 2023
ACCEPTED 06 September 2023
PUBLISHED 28 September 2023
CITATION
Casey A, Davidson E, Grover C, Tobin R,
Grivas A, Zhang H, Schrempf P, O’Neil AQ,
Lee L, Walsh M, Pellie F, Ferguson K, Cvoro V,
Wu H, Whalley H, Mair G, Whiteley W and Alex B
(2023) Understanding the performance and
reliability of NLP tools: a comparison of four
NLP tools predicting stroke phenotypes in
radiology reports.
Front. Digit. Health 5:1184919.
doi: 10.3389/fdgth.2023.1184919
COPYRIGHT
© 2023 Casey, Davidson, Grover, Tobin, Grivas,
Zhang, Schrempf, O’Neil, Lee, Walsh, Pellie,
Ferguson, Cvoro, Wu, Whalley, Mair, Whiteley
and Alex. This is an open-access article
distributed under the terms of the Creative
Commons Attribution License (CC BY). The use,
distribution or reproduction in other forums is
permitted, provided the original author(s) and
the copyright owner(s) are credited and that the
original publication in this journal is cited, in
accordance with accepted academic practice.
No use, distribution or reproduction is
permitted which does not comply with these
terms.
Understanding the performance
and reliability of NLP tools: a
comparison of four NLP tools
predicting stroke phenotypes in
radiology reports
Arlene Casey¹*, Emma Davidson², Claire Grover³, Richard Tobin³,
Andreas Grivas³, Huayu Zhang¹, Patrick Schrempf⁴,⁵,
Alison Q. O’Neil⁴,⁶, Liam Lee⁷, Michael Walsh⁸, Freya Pellie⁹,¹⁰,
Karen Ferguson², Vera Cvoro²,¹¹, Honghan Wu¹²,¹³,
Heather Whalley²,¹⁴, Grant Mair²,¹⁵, William Whiteley²,¹⁵
and Beatrice Alex¹⁶,¹⁷
¹Advanced Care Research Centre, Usher Institute, University of Edinburgh, Edinburgh, United Kingdom
²Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
³School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
⁴Canon Medical Research Europe Ltd., AI Research, Edinburgh, United Kingdom
⁵School of Computer Science, University of St Andrews, St Andrews, United Kingdom
⁶School of Engineering, University of Edinburgh, Edinburgh, United Kingdom
⁷Medical School, University of Edinburgh, Edinburgh, United Kingdom
⁸Intensive Care Department, University Hospitals Bristol and Weston, Bristol, United Kingdom
⁹National Horizons Centre, Teesside University, Darlington, United Kingdom
¹⁰School of Health and Life Sciences, Teesside University, Middlesbrough, United Kingdom
¹¹Department of Geriatric Medicine, NHS Fife, Fife, United Kingdom
¹²Institute of Health Informatics, University College London, London, United Kingdom
¹³Alan Turing Institute, London, United Kingdom
¹⁴Generation Scotland, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
¹⁵Neuroradiology, Department of Clinical Neurosciences, NHS Lothian, Edinburgh, United Kingdom
¹⁶Edinburgh Futures Institute, University of Edinburgh, Edinburgh, United Kingdom
¹⁷School of Literatures, Languages and Cultures, University of Edinburgh, Edinburgh, United Kingdom
Background: Natural language processing (NLP) has the potential to automate the
reading of radiology reports, but there is a need to demonstrate that NLP methods
are adaptable and reliable for use in real-world clinical applications.
Methods: We used the F1 score, precision, and recall to compare NLP tools on a
cohort from a study on delirium using images and radiology reports from NHS Fife
and a population-based cohort (Generation Scotland) that spans multiple National
Health Service health boards. We compared four off-the-shelf rule-based and
neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and
reported on their performance for three cerebrovascular phenotypes, namely,
ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from
the EdIE-R team defined phenotypes using labelling techniques established during
the development of EdIE-R, in conjunction with an expert researcher who read the
underlying images.
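As an illustration of the three metrics named above, the following minimal sketch computes precision, recall, and the F1 score for a single binary phenotype at report level. The function name `precision_recall_f1` and the example labels are hypothetical and are not taken from the study; the sketch assumes one gold-standard label and one tool prediction per report.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for one binary phenotype.

    gold and predicted are equal-length sequences of booleans,
    one entry per report (hypothetical input format).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count true positives, false positives and false negatives.
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical labels for six reports (illustration only, not study data).
gold = [True, True, False, True, False, False]
pred = [True, False, False, True, True, False]
p, r, f = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

A tool can therefore have high precision but a lower F1 score if it misses many positive reports, since the F1 score penalises low recall as much as low precision.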
Results: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke,
≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst
that of Sem-EHR was ≥66%, although ESPRESSO had the highest precision in both
cohorts, 90% and 98%. For SVD, EdIE-R achieved F1 scores of ≥98% and ALARM+
≥90%. ESPRESSO scored lowest at ≥77%, with Sem-EHR at ≥81%. In NHS Fife, F1
scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation
Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest
for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When
comparing NLP tool output with brain image reads using F1 scores, ALARM+
TYPE Original Research
Frontiers in Digital Health 01 frontiersin.org