Leveraging Probabilistic Existential Rules for Adversarial Deduplication

Jose N. Paredes, Maria Vanina Martinez, Gerardo I. Simari, and Marcelo A. Falappa
{jose.paredes,mvm,gis,mfalappa}@cs.uns.edu.ar
Dept. of Computer Science and Engineering, Universidad Nacional del Sur (UNS)
Institute for Computer Science and Engineering (UNS–CONICET)
San Andres 800, (8000) Bahia Blanca, Argentina

Abstract. The entity resolution problem in traditional databases, also known as deduplication, seeks to map multiple virtual objects to their corresponding set of real-world entities. Though the problem is challenging, it can be tackled in a variety of ways by leveraging several simplifying assumptions, such as the assumption that multiple virtual objects appear as the result of name or attribute ambiguity, clerical errors in data entry or formatting, missing or changing values, or abbreviations. However, in cyber security domains the entity resolution problem takes on a whole different form, since malicious actors that operate in certain environments like hacker forums and markets are highly motivated to remain semi-anonymous: though they wish to keep their true identities secret from law enforcement, they also have a reputation with their customers. The above simplifying assumptions cannot be made in this setting, and we therefore coin the term “adversarial deduplication”. In this paper, we propose the use of probabilistic existential rules (also known as Datalog+/–) to model knowledge engineering solutions to this problem; we show that tuple-generating dependencies can be used to generate probabilistic deduplication hypotheses, and equality-generating dependencies can later be applied to leverage existing data towards grounding such hypotheses.
The main advantage over existing deduplication tools is that our model operates under the open-world assumption, and thus is capable of modeling hypotheses over unknown objects, which can later become known if new data becomes available.

1 Introduction

Deduplication of objects in databases, also known as entity resolution, is a classical problem in data cleaning [18]; the basic idea is that databases contain objects that are potential duplicates (seemingly different records that correspond to the same entity or object in the real world) that need to be identified and merged [8, 14]. There has been a lot of interest in this topic in the last 20 years, yielding a large body of work that lies mostly in the databases literature. Traditionally, entities are resolved using pairwise similarity over the attributes of reference [18]. The