Towards Construction of an Error-Corrected Corpus of Indonesian Second Language Learners

Budi Irmawati (1), Mamoru Komachi (2), and Yuji Matsumoto (1)
(1) Nara Institute of Science and Technology, {budi-i,matsu}@is.naist.jp
(2) Tokyo Metropolitan University, komachi@tmu.ac.jp
CLIC 2014, May 22-24

Motivations
- Indonesian is spoken by 23 million native speakers and 240 million second language speakers [a].
- It is a morphologically rich language, with both derivational and inflectional morphology as well as clitics.
- Available language resources for Indonesian are very limited.
- There is no automatic grammatical error identification for Indonesian.

Goals
- Create an annotated learner corpus and make it accessible online.
- Develop an error identification system for second language learners of Indonesian.

Data
Source: Lang-8 [b], crawled in 2011 [c].

  Learners        107
  L1s              15
  Journals        783
  Sentences     6,559
  Tokens       77,201
  Word types    8,673

Data pre-processing:
- Filtering: strip HTML tags and discard sentences that are not in Indonesian.
- Pair each learner sentence with its native-speaker correction.

Error distribution (by POS of the erroneous token):

  OOV   16.71%    Pron   7.13%
  Noun  15.58%    Adv    4.93%
  Verb  14.74%    X      4.40%
  Adp   11.24%    Det    2.73%
  .     10.64%    Conj   2.32%
  Adj    7.43%    Other  2.14%

Alignment
Two-step alignment: a Rule-based step, then a Hybrid step that combines a confusion matrix with the rules.
- Automatic one-word-to-one-word alignment for replacement, affixation, word-omission, and unnecessary-word errors based on POS tags, comparing both stems and complete word forms.
- A word pair is aligned when its neighbouring pairs align:

  align(R_L^0, R_C^0) <=> align(R_L^-1, R_C^-1) and align(R_L^+1, R_C^+1)

- A native speaker exhaustively post-edited 658 correction sentences (spelling, punctuation, and capitalization), applying multi-word expression writing rules [1,2].
- The Hybrid step automatically aligns sentence pairs and assigns error tags based on a candidate set extracted from the Rule-based results.
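The one-word-to-one-word rule above can be sketched in code. This is a minimal illustration, not the poster's implementation: the toy stemmer, the affix lists, and all function names are assumptions (the actual system relies on MorphInd [3] for morphological analysis).

```python
# Illustrative sketch of the rule-based one-word alignment described above.
# The stemmer and its affix lists are toy assumptions, not the real system.

def stem(word):
    """Strip a few common Indonesian affixes (toy approximation)."""
    for prefix in ("mem", "men", "me", "ber", "di"):
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in ("kan", "i", "an"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word

def tokens_align(learner_tok, corr_tok):
    """Align two (word, POS) tokens: exact match, shared stem
    (affixation error), or same POS (replacement candidate)."""
    (lw, lpos), (cw, cpos) = learner_tok, corr_tok
    if lw == cw:
        return "identical"
    if stem(lw) == stem(cw):
        return "affixation"
    if lpos == cpos:
        return "replacement"
    return None

def neighbour_align(learner, correction, i, j):
    """Neighbour rule: align(R_L^0, R_C^0) holds when both the previous
    pair and the next pair align."""
    left = i > 0 and j > 0 and tokens_align(learner[i - 1], correction[j - 1])
    right = (i + 1 < len(learner) and j + 1 < len(correction)
             and tokens_align(learner[i + 1], correction[j + 1]))
    return bool(left and right)
```

On the example pair from the poster, memdapat/mendapatkan would be tagged as an affixation error via their shared stem, while koment/komentar would surface as a same-POS replacement candidate.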
Data Processing Pipeline
raw data -> Filtering -> Rule-based Alignment -> Evaluation -> Re-corrected correction (660 sentences) -> Candidate set extraction -> Candidate set list -> random selection (200 sentences) -> Hybrid -> Evaluation -> Learner corpus
POS tagging uses MorphInd [3]; dependency relations are annotated both automatically and as a gold standard.

Human judgment evaluation of error types
Two native speakers evaluated the error types of 100 sentences, as well as partial sentences extracted as sequences of 5 and 7 words, in three tasks:
- Task_A: does the partial sentence contain an error?
- Task_B: which word is incorrect?
- Task_C: what is the suggested correction?

Annotation Schema
- Each sentence has an ID, a journal ID, a learner ID, the learner's sentence, and the native correction.
- Each token has a POS tag, a token index, an alignment index, an error type, and the token's correction.
- Each sentence also has a dependency relation annotation.

Example
  English gloss:      I get a comment that I have not read
  Learner sentence:   Saya memdapat koment yang saya belum membaca
  Native correction:  Saya mendapatkan komentar yang belum saya baca
Each token carries a MorphInd POS tag (e.g. PS1, VSA, NSD) and a dependency relation (nsubj, dobj, rcmod, ref, agent, neg).

Evaluation
Alignment precision:

  P = (# correct alignments given by the system) / (# alignments given by the system)

Inter-annotator agreement [4] (Cohen's kappa):

  kappa = (P_a - P_e) / (1 - P_e)

[Figure: alignment precision (0.5-1.0) of the Rule-based and Hybrid methods per error type (Capitalization, Affixation, Replacement, Spelling, Unnecessary, Missing), together with the number of correct sentences and average precision.]

Table 1: Confusion matrix of error-corrected sentences
(rows: machine annotation; columns: corrected annotation; bracketed counts mark the main confusions)

                           C    A    R    S    U    M
  Capitalization    (C)   26    0    0    0    0    0
  Affixation        (A)    0    6    0    0    1    0
  Replacement       (R)    0    1   88    1  [14]   4
  Spelling          (S)    0   17    3   80    0    0
  Unnecessary words (U)    0    1  [16]   3  114    0
  Missing words     (M)    0    0  [24]   0    0  126

Table 2: Inter-annotator agreement of partial sentence error identification

  Agreement        5 words   7 words
  Task_A (kappa)    0.7067    0.8652
  Task_B (kappa)    0.5333    0.8333
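The two evaluation formulas above translate directly into code. This is a small illustrative sketch; the function names and the sample figures in the comment are assumptions, not the poster's reported results.

```python
# Illustrative implementations of the two evaluation metrics above.

def precision(num_correct, num_given):
    """P = (# correct alignments by the system) / (# alignments by the system)."""
    return num_correct / num_given

def cohen_kappa(p_a, p_e):
    """kappa = (P_a - P_e) / (1 - P_e): observed agreement P_a
    corrected for the expected chance agreement P_e."""
    return (p_a - p_e) / (1.0 - p_e)

# Hypothetical figures, for illustration only: two annotators agreeing
# 90% of the time, with 50% expected chance agreement, give kappa = 0.8.
```

Note that kappa is 0 when annotators agree no more often than chance (P_a = P_e), which is why it is preferred over raw agreement for Tasks A and B.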
Results

Table 3: Error classification based on human judgment

  Kind of error                   5 words   7 words
  (1) Unidentified                   2         3
  (2) Error detection mismatch      15         4
  (3) Error correction mismatch     17         7

Future Works
- Identify learner errors using syntactic information.
- Develop an interactive error identification system for second language learners.

Notes
[a] Wikipedia
[b] http://lang-8.com
[c] http://cl.naist.jp/nldata/lang-8

Acknowledgment
This study is supported in part by the Directorate General of Higher Education, Republic of Indonesia, under the BPPLN Scholarship Batch 7, fiscal years 2012-2015.

References
[1] H. Alwi (2000). Tata Bahasa Baku Bahasa Indonesia. Balai Pustaka, Indonesia, third edition.
[2] J. N. Sneddon, A. Adelaar, D. N. Djenar, and M. C. Ewing (2010). Indonesian: A Comprehensive Grammar. Routledge, Australia, second edition.
[3] S. D. Larasati, V. Kuboň, and D. Zeman (2011). Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. In SFCM, volume 100 of CCIS, pp. 119-129, Zurich, Switzerland.
[4] M. Chodorow, M. Dickinson, R. Israel, and J. R. Tetreault (2012). Problems in evaluating grammatical error detection systems. In COLING, pp. 611-628, Indian Institute of Technology Bombay.