Design Principles and Data Collection for CELEN: A Corpus of Learner Spanish in Japan

Pilar Valverde
Kansai Gaidai University
Hirakata, Japan
pilar-vi@kansaigaidai.ac.jp

Abstract

This paper describes the first steps in the creation of a new resource for Spanish, CELEN, a written corpus of Spanish as a Foreign Language in Japan. First, we introduce the situation of Spanish in higher education in Japan and the design principles of the corpus. Second, we describe the workflow, which consists of the collection of background and permission forms, the collection of texts, transcription, and the registration of task metadata. Third, we present some results on the data collected at Kansai Gaidai University during the first semester of the current academic year (from April to August 2018): the resulting corpus contains 963 texts from 449 learners, totalling 74,631 words, and represents three of the six CEFR levels: A1, A2 and B1. The work on CELEN is still ongoing, with texts from the second semester as well as texts from other institutions waiting to be added to the corpus. In the future, we plan to annotate the corpus with morpho-syntactic information and make it accessible to the research community under a CC-BY-NC license, so that researchers can not only view the data but also manipulate and further annotate it.

Keywords: learner corpora, learner Spanish, Spanish as a Foreign Language

1. Background: Spanish in Japan

Spanish is the second most spoken language worldwide by number of native speakers (around 477 million), after Chinese (Cervantes Institute, 2015), and is also one of the most commonly learned foreign languages in many countries. Overall, it is estimated that approximately 21 million people are studying Spanish as a foreign language worldwide. The countries with the highest absolute number of learners are the United States, Brazil, and France. However, unlike other European languages, Spanish is not widely studied as a foreign language in most parts of Asia.
In this paper, we present the design principles and data collection process of a learner corpus of Spanish in Japan (CELEN).¹ Specifically, we focus on Spanish in higher education, since the teaching of this language in primary and secondary schools in this country is almost non-existent, as is the case with other languages except English. In Japan, it is estimated that roughly 60,000 undergraduates study Spanish (Cervantes Institute, 2015). The great majority study it as a second foreign language, since many universities require students to study a language other than English for one year. In addition, about fifteen universities offer four-year programs in Hispanic Studies, with around 1,000 new students each year altogether. Therefore, one of the challenges in gathering data for the corpus is the scarcity of students, especially those with an intermediate or advanced level of the language.²

Students of other majors enrolled in a Spanish as a second foreign language class meet twice a week for one or two years and usually achieve a very basic knowledge of the language, below A1.³ For our corpus, we focus mainly on the students majoring in Spanish, so as to gather data from various proficiency levels. Students majoring in Spanish are usually expected to reach an A1 level during their first year of studies, A2 during the second year, and B1 in the third year. Their level during the third and fourth years can vary from A2 to B2, depending on the program offered by each university, the background and academic level of the students, the opportunities to spend time in a Spanish-speaking country during their studies, etc. The B2 level is reached only by some of the students who have studied abroad for at least one semester, while the C1 and C2 levels are practically exclusive to a few graduate students or professionals who have studied the language extensively.

¹ Corpus del Español como Lengua Extranjera en Japón.
² For the sake of comparison, the United States, the country with the most learners of Spanish, has around 800,000 students enrolled in Spanish language courses at the university level, with around 17% in advanced courses (Lacorte and Suárez-García, 2016).
³ We use the classification of levels proposed by the Common European Framework of Reference for Languages (Council of Europe, 2001) and adopted by the Cervantes Institute (A1, A2, B1, B2, C1 and C2).

On the other hand, we target quite a homogeneous population with regard to proficiency level and knowledge of other languages. In Japan, the vast majority of students have not studied any foreign language apart from English before entering university, so when they start to study a second foreign language at the university level, they do so from the beginning. In addition, for the