Generalized Inter-Cloud Structured Data Sharing

Malek Athamnah and Krishna Kant, Temple University
mathamna@temple.edu, kkant@temple.edu

Abstract: In this paper, we discuss the issue of collaborative data sharing among a number of parties to provide rich online services to their clients. We assume that each party hosts its data in a private cloud infrastructure, but that the parties collectively agree on a well-defined set of accesses and access restrictions to one another’s databases. We show how these can be used to derive access rules that become the basis of access control and query planning. We focus entirely on relational database sharing, and present efficient and near-optimal heuristic algorithms for access rule derivation and query planning in spite of the NP-hardness and high complexity of the underlying problems.¹

I. INTRODUCTION

The rapid deployment of cyber and cyberphysical systems has resulted in increasingly rich data repositories. Product and service providers use these repositories to support their operations and to provide rich online services to clients, based on mash-ups and analytics over data available from other collaborating parties. As a concrete example, consider the health-care ecosystem involving hospitals, insurance companies, diagnostic labs, drug companies, nursing homes, etc. In this environment, answering a query regarding payments by patients requires matching the patient id from the hospital database with the corresponding customer id from the insurance database, which is effectively a “join” operation if the data is stored in relational form. Similarly, in order to smoothly distribute products to retailers, it is necessary to have coordination and restricted data sharing across trucking companies, third-party logistics (3PL) operators, cold-chain suppliers, distribution center operators, etc. Numerous other examples abound; one of them, namely e-commerce, will be covered in detail in the paper.
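To make the health-care example concrete, the following is a minimal sketch of the payment query as an equi-join between a hospital table and an insurance table. All record fields, values, and the function name `join_payments` are hypothetical and chosen only for illustration; the sketch also assumes that the hospital's patient id and the insurer's customer id have already been mapped to a common identifier for the same person.

```python
# Hypothetical records; field names and values are illustrative only.
# Assumption: patient_id and customer_id already share one id space.
hospital_visits = [
    {"patient_id": "P1", "procedure": "x-ray", "billed": 200.0},
    {"patient_id": "P2", "procedure": "mri", "billed": 900.0},
    {"patient_id": "P3", "procedure": "checkup", "billed": 80.0},
]
insurance_claims = [
    {"customer_id": "P1", "plan": "basic", "covered": 150.0},
    {"customer_id": "P2", "plan": "premium", "covered": 900.0},
]


def join_payments(visits, claims):
    """Equi-join visits and claims on patient_id == customer_id."""
    # Build a hash index over the insurer's records, keyed by customer id.
    claims_by_id = {}
    for claim in claims:
        claims_by_id.setdefault(claim["customer_id"], []).append(claim)

    # Probe the index with each hospital record; unmatched visits are dropped,
    # as in an inner join.
    result = []
    for visit in visits:
        for claim in claims_by_id.get(visit["patient_id"], []):
            result.append({
                "patient_id": visit["patient_id"],
                "billed": visit["billed"],
                "covered": claim["covered"],
                "patient_owes": visit["billed"] - claim["covered"],
            })
    return result


payments = join_payments(hospital_visits, insurance_claims)
```

In this sketch the visit for `P3` has no matching insurance record and therefore does not appear in `payments`. Note also that only a few attributes of the joined records are returned, which hints at why the parties may want to agree explicitly on which attributes of a join are shared.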
The increasing penetration of automation and cloud computing means that the collaboration happens across private or semiprivate clouds owned by various parties, each hosting its own data. We expect the richness of services to rise rapidly in the near future, driven by the push of big data analytics across willing parties. Although current data sharing practices across private parties depend on undisclosed one-on-one agreements between them, we have argued in our prior research that a more direct multi-party model offers many advantages, including less information leakage [1]. In our multi-party model, each party is explicitly provided a set of mutually agreed access rules for accessing data, either directly from another party or as a composition over data from two or more parties. For example, with relational databases, a party may be allowed access to R ⋈ S[A, B], meaning attributes A and B over the relational join of relations R and S that belong to two different parties. The explicit nature of such accesses leads to better control over sharable information and to more efficient access to it, as discussed in our earlier work [1].

¹ This research was supported by NSF grant CNS-1527346.

The main advancement of this paper over the prior work is twofold: (a) to present a user-friendly way of describing the accesses, and from there to automatically derive the “access rules”, and (b) to support conditional access to data in specification, access rule derivation, and enforcement of the rules.

The rest of the paper is organized as follows. In Section II we describe our multiparty model and review some key issues essential for understanding the contributions of this paper. Section III then discusses the extended collaboration model studied in this paper. Section IV discusses details of rule derivation, and Section V concerns rule enforcement under conditional accesses. Section VI presents a comprehensive evaluation of the proposed algorithms.
Finally, Section VII concludes the discussion.

II. MULTIPARTY MODEL AND RELATED WORK

We consider a group of collaborating parties, each hosting its relational databases in its own private cloud, but willing to work together on a mutual data access plan. Accessing data across parties can be technically challenging due to varying formats and semantics, but these issues are not the focus of this work. Thus we proceed with the simple assumption that each party creates a “stub” for its data for uniform access by others. For example, in order to do a “join” of records from a hospital and an insurance company, we need stubs to ensure that patient ids and insured ids translate to the same ID for the same person. The full assumptions are discussed in our earlier work [2], which defines the notion of meaningful joins across parties, similar in spirit to the traditional notion of lossless joins within a single party. We also assume that the parties are not malicious and will correctly provide the agreed-upon data; it is certainly desirable to have some trust-but-verify mechanisms in place, but that too is beyond the scope of this paper.

As discussed above, it is often necessary to give a party access to data that is composed from data belonging to two or more parties (or private clouds). Here “composition” could refer to meaningful operations over relations from multiple parties, such as join, union, intersection, difference, etc. Of these, join is invariably the most important operation and the most challenging to handle. Other operations across parties are often not useful and, in any case, are easy to handle; they are thus not discussed further here.

A. Fundamental Issues in Multiparty Models

Providing access to the results of arbitrary compositions over relations poses the following crucial problem, that we