Centroids: Gold standards with distributional variations

Ian Lewin*, Şenay Kafkas, Dietrich Rebholz-Schuhmann

*Linguamatics, St. John’s Innovation Centre, Cowley Road, Cambridge, UK
European BioInformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
ian.lewin@linguamatics.com, {kafkas,rebholz}@ebi.ac.uk

Abstract

Motivation: Gold Standards for named entities are, ironically, not standard themselves. Some specify the “one perfect annotation”. Others specify “perfectly good alternatives”. The concept of a Silver Standard is relatively new. Its objective is consensus rather than perfection. How should the two concepts best be represented and related?

Approach: We examine several Biomedical Gold Standards and motivate a new representational format, centroids, which simply and effectively represents name distributions. We define an algorithm for finding centroids, given a set of alternative input annotations, and we test the outputs quantitatively and qualitatively. We also define a metric of relative acceptability on top of the centroid standard.

Results: Precision, recall and F-scores of over 0.99 are achieved for the simple sanity check of giving the algorithm Gold Standard inputs. Qualitative analysis of the differences very often reveals errors and incompleteness in the original Gold Standard. Given automatically generated annotations, the centroids effectively represent the range of those contributions, and the quality of the centroid annotations is highly competitive with the best of the contributors.

Conclusion: Centroids cleanly represent alternative name variations for Silver and Gold Standards. A centroid Silver Standard is derived just like a Gold Standard, only from imperfect inputs.

Keywords: Centroid, Gold Standard, Silver Standard, Evaluation

1. Introduction

We examine several Gold Standard assessment datasets available in biomedicine and motivate a new representation for Gold Standard markup: centroids.
Centroids provide a simple and effective representation for name distributions and a more fine-grained method for measuring how good a user annotation is. In this way, centroids represent an extension of classical gold standard markup. In addition, we define an algorithm for finding centroids, given a set of alternatively annotated inputs, and test it quantitatively and qualitatively against both Gold Standard inputs and automatically annotated inputs.

Given a set of alternative inputs, each of which is Gold Standard, we verify that the algorithmically discovered centroids are also overwhelmingly gold standard, as traditionally conceived. Even when (infrequently) they are not, they very often reveal errors in the original gold standard.

We also apply the algorithm to sets of alternative automatic annotations as submitted to the CALBC challenge competition. We thereby derive a Silver Standard, a representation of a consensus-driven standard. We show that silver standard centroids are very highly competitive with the best of the contributing annotations. Further experiments also show that Silver Standard scores correlate with Gold Standard scores, suggesting that silver might indeed stand proxy for gold as well as representing a consensus annotation (with distributions).

We conclude that it is indeed highly desirable to represent distributions directly within the Gold Standards and not just implicitly, for example through fuzzy or partial matching schemes.

2. Background

The Gold Standard data-sets available for biomedical named entity recognition are few in number. Yet they differ in more than just subject matter. Here we show how they vary in the representation and suggested evaluation of name variations.

The SCAI corpus (Kolarik et al., 2008) for chemical entities, for instance, assigns gold standard I-O-B labels to pre-defined tokens within sentences.
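To make token-level I-O-B markup concrete, the following is a minimal sketch of how a label sequence over pre-defined tokens determines entity spans. The sentence, the class name CHEM and the function iob_to_spans are illustrative assumptions and are not taken from the SCAI corpus or its tooling.

```python
from typing import List, Tuple


def iob_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, class_name) spans from I-O-B labels."""
    spans = []
    start, cls = None, None
    for i, label in enumerate(labels):
        # "B-CHEM" -> ("B", "-", "CHEM"); "O" -> ("O", "", "")
        prefix, _, name = label.partition("-")
        if prefix == "I" and start is not None and name == cls:
            continue  # the current span simply continues
        if start is not None:
            spans.append((start, i, cls))  # close the open span
            start, cls = None, None
        if prefix in ("B", "I"):
            # Assumption: a stray I- label leniently opens a new span.
            start, cls = i, name
    if start is not None:
        spans.append((start, len(labels), cls))
    return spans


# Hypothetical example sentence and labels (not corpus data).
tokens = ["Treatment", "with", "acetylsalicylic", "acid", "reduced", "fever"]
labels = ["O", "O", "B-CHEM", "I-CHEM", "O", "O"]

spans = iob_to_spans(labels)
names = [" ".join(tokens[s:e]) for s, e, _ in spans]
```

Because the labels attach to fixed token positions, any system that tokenizes the sentence differently cannot be compared label by label without first being re-aligned to the same tokens.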
Good scores require reproducing the same tokenization as well as the one correct label for each token. Some variation is possible through the use of special class labels. For example, the token "compounds" within "uridine compounds" is assigned the label B-Modifier. This perhaps indicates that "compounds" is not functioning here as an essential part of the name of the chemical. Other labels, such as B-Trivial and B-Family, are similarly suggestive. It is therefore at least possible for an evaluation procedure to be sensitive to these