Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 1353–1361, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC

Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers

Núria Gala ¹, Anaïs Tack ²,³,⁴, Ludivine Javourey-Drevet ⁵,⁶, Thomas François ², Johannes C. Ziegler ⁵

¹ Aix Marseille Univ, Laboratoire Parole et Langage, LPL CNRS (UMR 7309), France
² CENTAL, UCLouvain
³ ITEC, imec research group at KU Leuven
⁴ F.R.S.-FNRS Research Fellow, Belgium
⁵ Aix Marseille Univ, Laboratoire de Psychologie Cognitive, LPC CNRS (UMR 7290), France
⁶ Aix Marseille Univ, Apprentissage, Didactique, Évaluation, Formation (EA 4671), France

{nuria.gala,ludivine.javourey,johannes.ziegler}@univ-amu.fr
{anais.tack,thomas.francois}@uclouvain.be

Abstract

In this paper, we present a new parallel corpus aimed at researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged 7 to 9. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to provide a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. The transcripts of their reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population.
By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and available on demand for research purposes.

Keywords: Parallel corpora, text simplification, readability, linguistic complexity, misreading, poor-readers, dyslexia

1. Introduction

Reading is a complex cognitive task. Since reading comprehension is necessary for all school learning activities, poor reading and comprehension skills compromise children's academic and professional success. Typical readers tend to progress quickly in reading because, as the process becomes more and more automatized, they enter a virtuous circle in which good reading comprehension skills boost word identification and vice versa (Stanovich et al., 1986; Stanovich, 2009). By contrast, a child facing difficulties will tend to read less and therefore will not enter this virtuous circle. His/her reading difficulties will increase as the grade level becomes more demanding in terms of reading speed and comprehension (Tunmer and Hoover, 2019).

Given that the reading comprehension skills of French-speaking students have decreased over recent years (Mullis et al., 2017), we have decided to address this issue in the framework of the Alector project¹. Our aim was to develop and to test resources that make it possible to propose simplified texts to children facing problems in reading. For these children, text simplification might be a powerful, and possibly the only, way to leverage document accessibility. The idea is not to impoverish written language, but to propose simplified versions of a given text that convey the exact same meaning.
The main assumption is that the simplification of a text will allow children with reading difficulties to eventually get through a text and thus discover the pleasure of reading through understanding what they actually read. This will allow them to enter the above-mentioned virtuous circle, whereby word recognition and decoding skills are trained through reading more. The promise of this enterprise is that training children on simpler texts will lower their give-up threshold and improve their decoding, word recognition, and comprehension skills, which ultimately would allow them to move on to more complex texts.

¹ https://alectorsite.wordpress.com/

In order to test our hypothesis on text simplification and readability, we compiled a corpus of 183 texts (including 79 authentic texts), which was tested in schools during a three-year study. In this paper, we describe the corpus, its possibilities of use, and its availability. The resource is mainly addressed to a community of professionals interested in helping French-speaking learners who struggle with learning to read. It could also be of interest for research, e.g. for developing and training automatic text simplification systems.

The paper is organized as follows. In Section 2, we give an overview of related work (currently available simplified corpora and corpora annotated with errors). In Section 3, we specify how the corpus was created and provide quantitative details about it. Section 4 describes how a sub-part of the corpus was annotated with reading errors from poor and dyslexic readers.

2. Related work

The use of corpora is essential in many domains for different purposes. For reading, there are a number of standardized reading tests such as the International Reading Tests (IReST) (Vital-Durand, 2011), which exist in a variety of languages. However, standardized or specifically annotated corpora (i.e., with errors) are very costly to build and not al-