Local Constraints in Semistructured Data Schemas

Andrea Calì, Diego Calvanese, Maurizio Lenzerini

Dipartimento di Informatica e Sistemistica
Università di Roma “La Sapienza”
Via Salaria 113, 00198 Roma, Italy
{cali,calvanese,lenzerini}@dis.uniroma1.it

Abstract. Recently, there have been several proposals of formalisms for modeling semistructured data, which is data that is neither raw, nor strictly typed as in conventional database systems. Semistructured data models are graph-based models, where graphs are used to represent both databases and schemas. We study the basic problem of schema subsumption, which amounts to checking whether all databases conforming to a schema also conform to another schema, in the presence of constraints, which are used to enforce additional conditions on databases. In particular, we study the relationship between various constraint languages and the basic property of locality, which allows one to check subsumption between schemas in time polynomial in the number of nodes of the schemas. We show that locality holds when both numeric constraints and disjunction are added to a simple constraint language. On the other hand, locality is lost when we consider constraints on both outgoing and incoming edges of databases.

1 Introduction

The ability to represent data whose structure is less rigid and strict than in conventional databases is considered a crucial aspect in modern approaches to data modeling, and is important in many application areas, such as web information systems, biological databases, digital libraries, and data integration [21, 1, 5, 20, 16, 17]. Semistructured data is data that is neither raw, nor strictly typed as in conventional database systems [1]. Recently, several formalisms for modeling semistructured data have been proposed, such as OEM (Object Exchange Model) [2], and bdfs (Basic Data model For Semistructured data) [5].
In such formalisms, data is represented as graphs with labeled edges, where information on both the values and the schema of data is kept. In particular, bdfs is an elegant graph-based data model, where graphs are used to represent both databases and schemas, the former with edges labeled by data, and the latter with edges labeled by formulae of a suitable logical theory. The notion of a database g conforming to a schema S is given in terms of a special relation, called simulation, between the two graphs. Roughly speaking, a simulation is a correspondence between the edges of g and those of S such that, whenever there is an edge labeled a in g, there is a corresponding edge in S labeled with a formula satisfied by a. The notion of simulation is less rigid than the usual notion of satisfaction, and suitably reflects the need to deal with less strict structures of data.

For several tasks related to data management, it is important to be able to check subsumption between two schemas, i.e., to check whether every database conforming to one schema always