Towards Construction of an Error-Corrected Corpus of Indonesian Second Language Learners

Budi Irmawati (1), Mamoru Komachi (2), and Yuji Matsumoto (1)
(1) Nara Institute of Science and Technology, {budi-i,matsu}@is.naist.jp
(2) Tokyo Metropolitan University, komachi@tmu.ac.jp
CLIC 2014, May 22-24

Motivations
- Indonesian is spoken by 23 million native speakers and 240 million second language speakers [a].
- It is a morphologically rich language, with both derivational and inflectional morphology as well as clitics.
- Available language resources for Indonesian are very limited.
- There is no automatic grammatical error identification for Indonesian.

Goals
- Create an annotated learner corpus and make it accessible online.
- Develop an error identification system for second language learners of Indonesian.

Data
Source: Lang-8 [b], crawled in 2011 [c].

  Learners        107
  L1s              15
  Journals        783
  Sentences     6,559
  Tokens       77,201
  Word types    8,673

Data pre-processing:
- Filtering: strip HTML tags and discard sentences that are not in Indonesian.
- Pair each learner sentence with its native-speaker correction.

Error distribution (by POS of the erroneous token):

  OOV   16.71%    Pron   7.13%
  Noun  15.58%    Adv    4.93%
  Verb  14.74%    X      4.40%
  Adp   11.24%    Det    2.73%
  .     10.64%    Conj   2.32%
  Adj    7.43%    Other  2.14%

Alignment
Two-step alignment: a Rule-based step, then a Hybrid step that combines a confusion matrix with the rules.
- Automatic one-word-to-one-word alignment for replacement, affixation, word-omission, and unnecessary-word errors based on POS tags, comparing both stems and complete word forms.
- A word pair is aligned when its neighbouring pairs align:

  align(R_L^0, R_C^0) <=> align(R_L^-1, R_C^-1) and align(R_L^+1, R_C^+1)

- A native speaker exhaustively post-edited 658 correction sentences (spelling, punctuation, and capitalization), applying multi-word expression writing rules [1,2].
- The Hybrid step automatically aligns sentence pairs and assigns error tags based on a candidate set extracted from the Rule-based results.
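The one-word-to-one-word rule above can be sketched in code. This is a minimal illustration, not the poster's implementation: the toy stemmer, the affix lists, and all function names are assumptions (the actual system relies on MorphInd [3] for morphological analysis).

```python
# Illustrative sketch of the rule-based one-word alignment described above.
# The stemmer and its affix lists are toy assumptions, not the real system.

def stem(word):
    """Strip a few common Indonesian affixes (toy approximation)."""
    for prefix in ("mem", "men", "me", "ber", "di"):
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in ("kan", "i", "an"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word

def tokens_align(learner_tok, corr_tok):
    """Align two (word, POS) tokens: exact match, shared stem
    (affixation error), or same POS (replacement candidate)."""
    (lw, lpos), (cw, cpos) = learner_tok, corr_tok
    if lw == cw:
        return "identical"
    if stem(lw) == stem(cw):
        return "affixation"
    if lpos == cpos:
        return "replacement"
    return None

def neighbour_align(learner, correction, i, j):
    """Neighbour rule: align(R_L^0, R_C^0) holds when both the previous
    pair and the next pair align."""
    left = i > 0 and j > 0 and tokens_align(learner[i - 1], correction[j - 1])
    right = (i + 1 < len(learner) and j + 1 < len(correction)
             and tokens_align(learner[i + 1], correction[j + 1]))
    return bool(left and right)
```

On the example pair from the poster, memdapat/mendapatkan would be tagged as an affixation error via their shared stem, while koment/komentar would surface as a same-POS replacement candidate.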
Data Processing Pipeline
raw data -> Filtering -> Rule-based Alignment -> Evaluation -> Re-corrected correction (660 sentences) -> Candidate set extraction -> Candidate set list -> random selection (200 sentences) -> Hybrid -> Evaluation -> Learner corpus
POS tagging uses MorphInd [3]; dependency relations are annotated both automatically and as a gold standard.

Human judgment evaluation of error types
Two native speakers evaluated the error types of 100 sentences, as well as partial sentences extracted as sequences of 5 and 7 words, in three tasks:
- Task_A: does the partial sentence contain an error?
- Task_B: which word is incorrect?
- Task_C: what is the suggested correction?

Annotation Schema
- Each sentence has an ID, a journal ID, a learner ID, the learner's sentence, and the native correction.
- Each token has a POS tag, a token index, an alignment index, an error type, and the token's correction.
- Each sentence also has a dependency relation annotation.

Example
  English gloss:      I get a comment that I have not read
  Learner sentence:   Saya memdapat koment yang saya belum membaca
  Native correction:  Saya mendapatkan komentar yang belum saya baca
Each token carries a MorphInd POS tag (e.g. PS1, VSA, NSD) and a dependency relation (nsubj, dobj, rcmod, ref, agent, neg).

Evaluation
Alignment precision:

  P = (# correct alignments given by the system) / (# alignments given by the system)

Inter-annotator agreement [4] (Cohen's kappa):

  kappa = (P_a - P_e) / (1 - P_e)

[Figure: alignment precision (0.5-1.0) of the Rule-based and Hybrid methods per error type (Capitalization, Affixation, Replacement, Spelling, Unnecessary, Missing), together with the number of correct sentences and average precision.]

Table 1: Confusion matrix of error-corrected sentences
(rows: machine annotation; columns: corrected annotation; bracketed counts mark the main confusions)

                           C    A    R    S    U    M
  Capitalization    (C)   26    0    0    0    0    0
  Affixation        (A)    0    6    0    0    1    0
  Replacement       (R)    0    1   88    1  [14]   4
  Spelling          (S)    0   17    3   80    0    0
  Unnecessary words (U)    0    1  [16]   3  114    0
  Missing words     (M)    0    0  [24]   0    0  126

Table 2: Inter-annotator agreement of partial sentence error identification

  Agreement        5 words   7 words
  Task_A (kappa)    0.7067    0.8652
  Task_B (kappa)    0.5333    0.8333
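The two evaluation formulas above translate directly into code. This is a small illustrative sketch; the function names and the sample figures in the comment are assumptions, not the poster's reported results.

```python
# Illustrative implementations of the two evaluation metrics above.

def precision(num_correct, num_given):
    """P = (# correct alignments by the system) / (# alignments by the system)."""
    return num_correct / num_given

def cohen_kappa(p_a, p_e):
    """kappa = (P_a - P_e) / (1 - P_e): observed agreement P_a
    corrected for the expected chance agreement P_e."""
    return (p_a - p_e) / (1.0 - p_e)

# Hypothetical figures, for illustration only: two annotators agreeing
# 90% of the time, with 50% expected chance agreement, give kappa = 0.8.
```

Note that kappa is 0 when annotators agree no more often than chance (P_a = P_e), which is why it is preferred over raw agreement for Tasks A and B.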
Results

Table 3: Error classification based on human judgment

  Kind of error                   5 words   7 words
  (1) Unidentified                   2         3
  (2) Error detection mismatch      15         4
  (3) Error correction mismatch     17         7

Future Works
- Identify learner errors using syntactic information.
- Develop an interactive error identification system for second language learners.

Notes
[a] Wikipedia
[b] http://lang-8.com
[c] http://cl.naist.jp/nldata/lang-8

Acknowledgment
This study is supported in part by the Directorate General of Higher Education, Republic of Indonesia, under the BPPLN Scholarship Batch 7, fiscal years 2012-2015.

References
[1] H. Alwi (2000). Tata Bahasa Baku Bahasa Indonesia. Balai Pustaka, Indonesia, third edition.
[2] J. N. Sneddon, A. Adelaar, D. N. Djenar, and M. C. Ewing (2010). Indonesian: A Comprehensive Grammar. Routledge, Australia, second edition.
[3] S. D. Larasati, V. Kuboň, and D. Zeman (2011). Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus. In SFCM, volume 100 of CCIS, pp. 119-129, Zurich, Switzerland.
[4] M. Chodorow, M. Dickinson, R. Israel, and J. R. Tetreault (2012). Problems in evaluating grammatical error detection systems. In COLING, pp. 611-628, Indian Institute of Technology Bombay.