Analysis of a Brazilian Indigenous corpus using machine learning methods

Tiago Barbosa de Lima 1 (ORCID 0000-0002-0707-522X), André C. A. Nascimento 1, Pericles Miranda 1, Rafael Ferreira Mello 1

1 Universidade Federal Rural de Pernambuco, Rua Dom Manuel de Medeiros, Recife, Pernambuco, 52171-900, Brazil

{tiago.blima,andre.camara,rafael.mello}@ufrpe.br, periclesmiranda@gmail.com

Abstract. In Brazil, several minority languages are at serious risk of extinction. Appropriate documentation of these languages is a fundamental step towards avoiding that outcome. However, for some of them, only a small amount of text is digitally accessible. Meanwhile, many open issues remain in the identification of Indigenous languages, a task that may help to reveal key similarities among them and to connect related languages and dialects. Therefore, this paper proposes to study and automatically classify 26 neglected Brazilian native languages, considering a small amount of training data, under both supervised and unsupervised settings. Our findings indicate that the use of machine learning models for the analysis of Brazilian Indigenous corpora is very promising, and we hope this work encourages more research on this topic in the coming years.

1. Introduction

The documentation of Brazilian Indigenous languages is a recent effort [Moore and Galucio 2016]. Depending on the classification criteria adopted, there are between 160 and 180 Indigenous languages in Brazil [Drude et al. 2007, Moore and Galucio 2016]. Most of them are at risk of extinction by the end of this century [Drude et al. 2007], which makes it important to encourage more research on the documentation of such languages. Although Language Classification (LC) is well established for languages spoken by a larger part of the population, Indigenous languages have not received much attention, with very few studies focused on such language groups [Drude et al. 2007, Moore et al.
2008, Jauhiainen et al. 2019b]. Therefore, automatic language identification might improve the process of categorising those languages, since more data could be collected in a short amount of time.

Artificial Intelligence (AI) algorithms have already been shown to accurately categorise a wide range of languages [Jauhiainen et al. 2019b]. Nonetheless, it is not always easy to train new AI models for every language that is currently known [Jauhiainen et al. 2019b]. Further, some documents are written in more than one language, which makes it difficult to separate them [Jauhiainen et al. 2019b]. It is necessary to develop methods capable of considering as many languages as possible regardless of the amount of data available. Moreover, it is important to evaluate how well the classifiers perform as the number of languages they are supposed to classify grows [Jauhiainen et al. 2019b]. The study of a language from a computational perspective usually starts with the application of machine learning models [Linares and Oncevay-Marcos 2017]. Therefore, the investigation and application of