Data quality in European primary care research databases. Report of a workshop held in London, September 2013

A. Rosemary Tate¹, Dipak Kalra², Rachael Boggon³, Natalia Beloff¹, Shivani Puri³ and Tim Williams³

Abstract— Primary care research databases provide a significant resource for health services and epidemiological research. However, since data are recorded primarily for clinical care, their suitability for research may vary widely according to the research application or the recording practices of individual general practitioners. A methodological approach for characterising data quality is required. We describe a one-day workshop entitled “Towards a common protocol for measuring and monitoring data quality in European primary care research databases”. Researchers, database experts and clinicians were invited to give their perspectives on data quality and to exchange ideas on what data quality metrics should be made available to researchers. We report the main outcomes of this workshop, including a summary of the presentations and discussions and a suggested way forward.

I. INTRODUCTION

The potential for using routinely collected patient records for research purposes has been steadily increasing with recent advances in data storage and information processing and the resulting reduction in technical barriers. Primary care records are created on, or close to, the date that an event occurs and record all interactions with the general practitioner (GP), including tests, prescriptions and referrals to secondary care. However, records vary in quality, and data may be missing or incompletely recorded. Since the validity of results relies on the quality of the data, it is important to have processes in place for assessing this variability and ensuring that the data are of high quality with respect to their intended use.
Although there is a vast literature on data quality in general, and many different frameworks have been proposed, there is still a need to categorise the different dimensions of quality and to standardise the benchmarks for each dimension. Data quality is a multidimensional concept which depends on the use that is being made of the data, i.e. “fitness for use” [1]. Different dimensions will be more important for some groups of users than for others. This workshop brought together clinicians, users of the data and database experts to discuss what data quality means to them and to develop a common approach for measuring data quality in European primary care databases. The specific aims were to:

1) Share experiences of assessing data quality in electronic health records (EHRs).
2) Discuss the issues and challenges involved with measuring data quality in EHRs for epidemiological and clinical research.
3) Work towards development of an approach to ensure compatibility of data quality measures for different European primary and secondary care databases.
4) Discuss how to help data contributors improve data quality (for both clinical care and research) at source.

*This work was supported by the Medicines and Healthcare Products Regulatory Agency.
¹ A. R. Tate and N. Beloff are with the School of Informatics and Engineering, University of Sussex, Falmer BN1 9QJ, UK. rosemary@sussex.ac.uk
² Dipak Kalra is with the Centre for Health Informatics and Multiprofessional Education, University College London, UK. d.kalra@ucl.ac.uk
³ Shivani Puri, Rachael Boggon and Tim Williams are with the Medicines and Healthcare Products Regulatory Agency, Buckingham Palace Road, London, UK. shivani.padmanabhan@mhra.gsi.gov.uk

The workshop was held at the Clinical Practice Research Datalink in London and was organised by the authors, who chaired and facilitated the four sessions. These were arranged as two sets of short 10-minute presentations: A. Data quality in European research databases and B.
Data quality from the users' point of view, and two discussion sessions: C. The clinical perspective (panel session) and D. Break-out discussions. The 42 invited attendees included statisticians, epidemiologists, general practitioners, clinician researchers, IT professionals and representatives from the Primary Care Information Services (PRIMIS). In advance of the workshop, all invitees were asked to answer a questionnaire aimed at understanding what drives interest in data quality and how it is approached. We summarise the presentations, discussions and questionnaire answers and provide suggestions for a way forward.

II. SUMMARY OF PRESENTATIONS

A. Data quality in European research databases

1) Data quality in the Clinical Practice Research Datalink (CPRD): Rosemary Tate described an investigation of data quality in the CPRD Gold database [2]. Percentages of data elements relating to different dimensions of data quality were extracted for all 538 practices contributing to the database between 2000 and 2011 and investigated using summary statistics, graphs and correlation analysis. Recording of most elements improved over time. There were large inter-practice variations, and most percentages had left-skewed distributions with several outliers. Most percentages were only weakly inter-correlated, except those related to specific conditions (e.g. tests and measures for diabetes). GP practices that were weak at recording one aspect were generally fine at recording all others. She concluded that practice-based data quality (DQ) scores should be tailored to the intended use of the data.

2) Data quality in a primary care Catalan Database: Leonardo Méndez (SIDIAP) described the Registry Quality