On Link Validity and entity resolution

Research report RR-11010

Léa Guizol, Madalina Croitoru, Michel Leclère

LIRMM (University of Montpellier II & CNRS), INRIA Sophia-Antipolis, France

Abstract. The Entity Resolution problem has been widely addressed in the literature. In its simplest version, the problem takes as input a knowledge base composed of records describing real world entities and outputs the sets of records judged to correspond to the same real world entity. More elaborate versions take into account links among records, which represent relationships between the entities those records describe. However, none of the approaches in the literature questions the validity of certain links between records. In this paper we highlight this new aspect of “link validity” in knowledge bases and show how Entity Resolution approaches should take this aspect into consideration.

1 Introduction

Knowledge base systems (KBs) make it possible to store and query an abstract model of the real world using a representation and reasoning language based on formal logic. One of the main problems when managing such a system is to ensure that its users share the same “representation/interpretation” relationship between the conceptual primitives of the language and their corresponding notions in the real world. The development of domain ontologies, which fix the vocabulary for classes and properties and specify their semantics by axioms (some specific formulas), provides a first solution to this problem. For individual entities, this solution is not applicable. Indeed, new individuals have to be referenced continually, and the number of individual references to manage can reach several thousand (or several million). To tackle this problem, each individual reference is associated with a record that specifies the characteristics of the individual entity referred to.
At a minimum, in addition to the reference that identifies it, this record generally contains a name attribute, which gives the names used in the real world to designate the corresponding entity, and a type attribute, which gives its class. For instance, a record corresponding to a literary text contains the “work” class as type and a title as name. Often, users of the knowledge base have very little information about an individual entity, and this information is rather contextual. For instance, when a user inserts a new book in a bibliographic base, often the only information (s)he has about the author is the author’s name on the cover. Unfortunately, names identify neither a real world entity nor its corresponding record, because of the excessive use of abbreviations, variants, homonyms, etc. As a consequence, many records (and thus references) in the knowledge base represent the same real world individual entity.

lirmm-00647284, version 1 - 3 Jan 2012
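
To make the record structure described above concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the authors' model: the `Record` class, its field names, the sample data and the `group_by_exact_name` helper are all hypothetical. It shows a record carrying a reference, a type and a list of names, and why grouping records by exact name fails to bring together name variants of the same person.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout: reference, type and names, as in the text."""
    reference: str   # identifier of the record in the knowledge base
    type: str        # class of the entity, e.g. "work" or "person"
    names: tuple     # names used in the real world to designate the entity


# Invented sample data: three name variants for one author, plus one work.
records = [
    Record("r1", "person", ("Victor Hugo",)),
    Record("r2", "person", ("V. Hugo",)),
    Record("r3", "person", ("Hugo, Victor",)),
    Record("r4", "work", ("Les Misérables",)),
]


def group_by_exact_name(recs):
    """Naive grouping: records fall together only if type and name match exactly."""
    groups = defaultdict(list)
    for rec in recs:
        for name in rec.names:
            groups[(rec.type, name)].append(rec.reference)
    return dict(groups)


groups = group_by_exact_name(records)
for key, refs in groups.items():
    print(key, refs)
```

In this sketch the three person records stay in three separate groups although they describe the same real world entity, which is the situation that Entity Resolution is meant to detect.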