A Progress Report from the Linguistic Data Consortium: recent activities in resource creation and distribution and the development of tools and standards

Christopher Cieri and Mark Liberman
University of Pennsylvania, Linguistic Data Consortium
3600 Market Street, Suite 810, Philadelphia, PA 19104-2653, USA
{ccieri|myl}@ldc.upenn.edu

Abstract

This paper describes recent activities of the Linguistic Data Consortium in the collection, annotation and distribution of language data, the development of tools and standards for using that data, and the creation of metadata to facilitate the search for linguistic resources.

Introduction

Rapid changes in the landscape of linguistic research and technology development require continuous adaptation from international data centers that would serve the research communities involved. The only constant among these communities is the need for greater volumes of high quality data and for tools with which to process the data. What varies are the human languages, data types, annotations, standards and formats needed. This presentation reports on the progress the Linguistic Data Consortium (LDC) has made in distributing existing resources, in collecting and annotating new resources, and in developing and sharing standards and tools to address the needs of multiple research communities that are joined by their use of digital linguistic resources.

Resource Distribution

LDC’s original, and still primary, mission is to support education, research and technology development by serving as a central distribution point and repository of language resources. LDC’s operational model is strongly tied to the notion of a consortium in which members who believe in the work of the organization provide yearly support and receive benefits well in excess of what their membership fees would acquire on the open market. LDC members receive ongoing rights to each database released in the years in which they support the consortium.
LDC released 24 data sets in 2002 and 27 in 2003. Membership agreements, differing by organization type, govern the use of LDC data. On rare occasions, corpus-specific agreements supersede the membership agreement and further constrain the use of a corpus. Most LDC corpora are also available for licensing to non-members.

LDC membership fees have not changed since the Consortium was founded. The annual fee is significantly less than the cost to produce just one corpus. There is clear evidence that this model provides extraordinary support to research organizations worldwide. Since its founding, just over 12 years ago, LDC has distributed more than 21,200 copies of 288 different corpora to more than 1,720 organizations in 89 countries, not counting the data sets available for free download from its web pages. Membership and licensing fees completely support this distribution activity.

The percentage of LDC members in each of the commercial, government and non-profit sectors has remained stable since we last reported it at LREC 2002. Figure 1 shows that about three-fourths of LDC members are in the non-profit sector. Commercial organizations make up nearly one-fifth of LDC members, while government organizations, including the research branches of governments around the world, account for the remainder.

Figure 1: LDC Membership by Organization Type (Commercial 19%, Government 5%, Non-Profit 76%)

LDC data users are by no means limited to, or even concentrated in, the United States. Figure 2 shows the geographical distribution of organizations that use LDC data. This map is not limited to Consortium members, or even to organizations that license data for a fee; it also includes those who have requested corpora distributed without fee under the NSF-funded Talkbank program and those who have registered to download free data from LDC’s web page.

Figure 2: Geographical distribution of LDC Data Users