ITALIAN-LEGAL-BERT: A Pre-trained Transformer
Language Model for Italian Law
Daniele Licari¹,∗, Giovanni Comandè¹
¹ EMbeDS, Sant’Anna School of Advanced Studies, Pisa, 56127, Italy.
Abstract
The state of the art in natural language processing is based on transformer models that are pre-trained on
general knowledge and enable efficient transfer learning in a wide variety of downstream tasks, even with
limited data sets. However, these models perform significantly worse when operating in specific,
sectoral domains. This is problematic in the Italian legal context, as there are many discrepancies
between the language found in generic open source corpora (e.g., Wikipedia and news articles) and legal
language, which can be cryptic, Latin-based, and full of domain-specific idiolectal formulas.
In this paper, we introduce the ITALIAN-LEGAL-BERT model, obtained by additional pre-training of the
Italian BERT model on Italian civil law corpora. It achieves better results than the ‘general-purpose’
Italian BERT in different domain-specific tasks.
Keywords
Legal artificial intelligence, Pre-trained language model, Italian Legal BERT
1. Introduction
In many domains, specialized models have performed better than models pre-trained on general
domains [1, 2, 3, 4, 5]. In general, the more semantically distant a domain-specific language
is from the common language, the greater the advantages of using specialized models,
especially in complex tasks.
In the Italian legal context, the discrepancy between specific language and general language
is even more pronounced. Italian legal language has an unavoidable complexity, like
all technical languages, but it is made even more obscure by needless stylistic devices that
often force a continuity with the languages of the past (Latin or old Italian). The
full understanding of judicial texts is the exclusive prerogative of domain experts. Legal language contains
technical terms with specific and unambiguous meanings (“contumacia”, “anticresi”, “anatocismo”,
“sinallagma”). It also makes extensive use of terms that are in general use but take on
their own specific meanings, sometimes entirely different from those in common use. For
example, “nullità”, “annullabilità”, “inefficacia”, and “inutilizzabilità”, which outside legal language
are synonyms of annulment, denote entirely distinct concepts and situations.
Locutions such as “buon padre di famiglia” (good family man) and “possessore di buona fede”
(possessor in good faith) likewise denote concepts that differ from their common-use meanings [6].
The Knowledge Management for Law Workshop (KM4LAW), September 26, 2022, Bozen-Bolzano, Italy
∗ Corresponding author.
Email: d.licari@santannapisa.it (D. Licari); g.comande@santannapisa.it (G. Comandè)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org