How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

Jerid Francom, Amy LaCross, Adam Ussishkin
Wake Forest University, University of Arizona, University of Arizona
Winston-Salem, NC, USA; Tucson, AZ, USA; Tucson, AZ, USA
francojc@wfu.edu, lacross@email.arizona.edu, ussishki@email.arizona.edu

Abstract

In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.

1. Developing corpus resources for low-density languages

The benefits of developing language corpora that provide ready access to quantifiable language use in naturalistic settings have been widely embraced by many scholars in a diverse set of language disciplines.
In addition to the strong tradition of corpus analysis and development in applied research (dictionary making (Sinclair, 1987), language teaching (Biber and Conrad, 2001), translation (McEnery and Xiao, 2007) and machine-learning applications), corpus data are increasingly employed in theoretical investigation, in particular psycholinguistic studies on language processing (see Gilquin and Gries, 2009, for a survey).

However, the great majority of the estimated 5,000-7,000 languages of the world are 'low-density', i.e. languages for which robust language resources are limited or non-existent (Borin, 2009). This fact highlights an obvious lack of empirical coverage of the range of possible linguistic diversity, an obstacle both for theoretical and applied work on particular languages and for theoretical investigation more generally. To address this gap, many researchers have focused their efforts on developing resources for low-density languages (LDLs) (McEnery et al., 2006; Scannell, 2007, inter alia).

Despite best efforts on the part of language researchers, there are unique challenges related to the quality and quantity of available data that researchers must face when developing corpora for LDLs, and these challenges may ultimately call into question the general applicability of the final product. Access to primary data may be limited in both print and electronic form, creating sometimes insurmountable problems,¹ and the language data that is available is often also restricted in terms of its overall representativeness of the target language (i.e. genres/registers, modalities, etc.) (Biber, 1993). Compared to languages such as English, where resources are literally samples of the language, techniques for attaining representativeness for other language corpora are not as straightforward, given that the resources that are available represent almost complete coverage of all language data in existence (Scannell, 2007); this is, in effect, a representativeness bottleneck.

Under standard evaluation practices many existing projects are considered 'specialized', that is, less-than-representative language samples. Accordingly, without some assurance of corpus validity, the credibility of results from low-density language research is limited. However, these smaller and less diverse language samples do not necessarily misrepresent distributional properties for those linguistic units that have been collected. It is logically possible that some, or all, of the linguistic units contained in the corpus are indeed representative of the larger language body from which it was sampled. Yet the question is: how do you know (i.e. how do you determine representativeness)?

In what follows we describe a novel approach to evaluating corpus representativeness that exploits the relationship between corpus linguistics and psycholinguistics.

We gratefully recognize our colleagues Dr. Albert Gatt (U of Malta) and Jeff Berry (U of Arizona) for their invaluable assistance and acknowledge funding from the United States National Science Foundation (BCS-0715500) to Adam Ussishkin.

¹ Difficulties in attaining data do not always stem from the number of speakers of a language, but may in fact reflect the interaction of various extra-linguistic factors (cultural, economic, political, etc.).

2. Behavioral data as external validation for corpus resources

Corpus-based evidence is inherently limited in gauging the representativeness of a corpus in an absolute sense. A general characteristic of corpus design and evaluation, this limitation is typically addressed by collecting large amounts of data rigorously sampled from a wide variety of sources. For LDL corpora, which often lack accessible sources, this is a pressing issue. In this case an external, non-corpus-based metric is needed. We propose that evidence from psycholinguistic investigation based on data