ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law

Daniele Licari 1, Giovanni Comandè 1

1 EMbeDS, Sant’Anna School of Advanced Studies, Pisa, 56127, Italy.

Abstract

The state of the art in natural language processing is based on transformer models that are pre-trained on general knowledge and enable efficient transfer learning in a wide variety of downstream tasks, even with limited data sets. However, the performance of these models decreases significantly in specific, sectoral domains. This is problematic in the Italian legal context, as there are many discrepancies between the language found in generic open source corpora (e.g., Wikipedia and news articles) and legal language, which can be cryptic, Latin-based, and full of idiolectal, domain-specific formulas. In this paper, we introduce the ITALIAN-LEGAL-BERT model, obtained by additional pre-training of the Italian BERT model on Italian civil law corpora. It achieves better results than the ‘general-purpose’ Italian BERT in different domain-specific tasks.

Keywords

Legal artificial intelligence, Pre-trained language model, Italian Legal BERT

1. Introduction

In many domains, specialized models have performed better than models pre-trained on general domains [1, 2, 3, 4, 5]. In general, the more semantically distant a domain-specific language is from common language, the greater the advantages of using specialized models, especially in complex tasks.

In the Italian legal context, the discrepancy between specific language and general language is even more pronounced. Italian legal language has its unavoidable complexity, like all technical languages, but it is made even more obscure by unnecessary stylistic expedients that often forcibly display continuity with the languages of the past (Latin or old Italian). Full understanding of judicial texts remains the exclusive prerogative of domain experts.
Italian legal language contains technical terms with specific and unambiguous meanings (“contumacia”, “anticresi”, “anatocismo”, “sinallagma”). It also makes extensive use of everyday terms that carry their own specific meanings, sometimes entirely different from those in common use. For example, “nullità”, “annullabilità”, “inefficacia”, and “inutilizzabilità”, which outside legal language are synonyms of annulment, denote entirely distinct concepts and situations. Locutions such as “buon padre di famiglia” (good family man) and “possessore di buona fede” (possessor in good faith) denote concepts different from those they have in common use [6].

The Knowledge Management for Law Workshop (KM4LAW), September 26, 2022, Bozen-Bolzano, Italy
Corresponding author. d.licari@santannapisa.it (D. Licari); g.comande@santannapisa.it (G. Comandè)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073