GENERAL PAPER

Proficiency testing: binary data analysis

Emil Bashkansky · Vladimir Turetsky

Received: 23 December 2015 / Accepted: 5 April 2016
© Springer-Verlag Berlin Heidelberg 2016

Abstract  A method for evaluating qualitative proficiency testing (PT) of laboratories conducting binary tests is proposed. The method is based on the scale-invariant item response model proposed by the authors in earlier publications. We consider the case where the laboratories under PT conduct a test consisting of a set of test items/species that present different, but beforehand unknown, levels of difficulty in detecting a particular property of theirs, and we need to evaluate/compare both the intrinsic abilities of the participating laboratories and the levels of difficulty of the test items. We assume that the responses to different test items do not affect one another, and we discuss how to obtain and interpret the most likely estimates/scores. The method is illustrated by an example presented in a recent publication by our colleagues from QuoData GmbH and can be considered an alternative to the scoring method proposed in that publication.

Keywords  Proficiency test (PT) · Binary data · Ability and difficulty · Item response model · Maximal likelihood

Introduction

Qualitative testing resulting in a binary output of the type "good/bad," "pass/fail," "true/false," "yes/no," "positive/negative," "detected/non-detected," "1/0," etc. is often used in various fields of science, educational measurement, quality control, industry and healthcare, including analytical chemistry and microbiology [1–9]. The performance reliability of such binary measurement systems (BMSs) is usually assessed by false-positive and false-negative rates [2], probability of detection (POD) [3–5], or sensitivity and specificity [6], which, in turn, may significantly depend on the extent to which the tested property is present in the tested object [7].
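The performance measures listed above can all be computed from a 2 × 2 table of binary outcomes. As a minimal sketch (the counts below are invented for illustration only, not taken from any cited study):

```python
# Hypothetical counts from a binary test of 100 specimens
# (rows: true condition of the specimen, columns: test outcome).
tp, fn = 45, 5   # property present: detected / missed
fp, tn = 8, 42   # property absent: false alarm / correct rejection

sensitivity = tp / (tp + fn)          # = probability of detection (POD)
specificity = tn / (tn + fp)
false_negative_rate = fn / (tp + fn)  # = 1 - sensitivity
false_positive_rate = fp / (fp + tn)  # = 1 - specificity

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
# → sensitivity = 0.90, specificity = 0.84
```

The pairs (sensitivity, false-negative rate) and (specificity, false-positive rate) are complementary, which is why the literature cited above can describe the same BMS with either vocabulary.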
This is the main reason why, for a full assessment of a BMS's ability (hereafter denoted a), the test usually comprises a series of test items presenting various levels of difficulty (hereafter denoted d). Unfortunately, the use of non-established terminology sometimes complicates the understanding of what is essentially the same thing. The difficulty in detecting a test item's property may be variously called "level of task challenge" [8], "level of difficulty of the task" [9], or "degree of difficulty of the challenge" [10]. Ability may be termed "level of competence" [9] or "detection capability" [10] when applied to proficiency testing. Further, some ambiguity exists in the distinction between the test subject and the test object. If a laboratory is testing microbiological specimens to determine the presence or absence of some property, the laboratory is the subject of testing and the specimens are the objects under test. If, however, the results of these tests are used to evaluate the competence of the laboratory, which is what proficiency testing (PT) aims to do, then the evaluator providing the specimens becomes the subject and the laboratory the object of the testing.

The difficulty that different species pose to the tested laboratories can be known or unknown beforehand. The present article addresses the latter situation (difficulties unknown before testing): it is very common, but also considerably more complicated to analyze. Recently, the authors proposed a new approach to evaluation of binary test results when checking the one-dimensional

Correspondence: Emil Bashkansky, ebashkan@braude.ac.il; Ort Braude College of Engineering, P.O. Box 78, 2161002 Karmiel, Israel

Accred Qual Assur, DOI 10.1007/s00769-016-1208-x
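To make the ability/difficulty notation concrete, the sketch below uses a generic Rasch-type logistic item response model, chosen only because it is the most familiar illustration of the idea; it is not the authors' scale-invariant model, whose exact form is developed in their earlier publications. In this generic model, the probability that a laboratory of ability a correctly classifies an item of difficulty d depends only on the gap a − d:

```python
import math

def p_correct(a: float, d: float) -> float:
    """Rasch-type success probability: logistic in (ability - difficulty).

    Illustrative only; the paper's scale-invariant model differs.
    """
    return 1.0 / (1.0 + math.exp(-(a - d)))

# A lab succeeds more often on an easier item (smaller d),
# and a more able lab (larger a) succeeds more often on the same item.
print(round(p_correct(1.0, 0.0), 3))  # → 0.731 (easy item)
print(round(p_correct(1.0, 2.0), 3))  # → 0.269 (hard item)
```

Whatever the specific functional form, the key point carried through the rest of the paper is the same: both the laboratory abilities and the item difficulties are unknown parameters that must be estimated jointly from the observed binary responses.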