Efficient Pattern Mining of Uncertain Data with Sampling

Toon Calders 1, Calin Garboni 2, and Bart Goethals 2

1 TU Eindhoven, The Netherlands
2 University of Antwerp, Belgium

Abstract. Mining frequent itemsets from transactional datasets is a well-known problem with good algorithmic solutions. In the case of uncertain data, however, several new techniques have been proposed. Unfortunately, these proposals often suffer when a lot of items occur with many different probabilities. Here we propose an approach based on sampling by instantiating “possible worlds” of the uncertain data, on which we subsequently run optimized frequent itemset mining algorithms. As such we gain efficiency at a surprisingly low loss in accuracy. This is confirmed by a statistical and an empirical evaluation on real and synthetic data.

1 Introduction

In frequent itemset mining, the transaction dataset is typically represented as a binary matrix M where each row represents a transaction and every column corresponds to an item. An element M_ij represents the presence or the absence of item j in transaction i by the value 1 or 0, respectively. For this basic traditional model, where an item is either present or absent in a transaction, many algorithms have been proposed for mining frequent itemsets, i.e., sets of columns of M that have all ones in at least a given number of transactions (see e.g. [Goe05] for an overview of frequent itemset mining).

In many applications, however, an item is not simply present or absent in a transaction; instead, a probability of its existence in the transaction is given. This is the case, for example, for data collected from experimental measurements or from noisy sensors. Mining frequent patterns from this kind of data is more difficult than mining from traditional transaction datasets.
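As a point of reference for the traditional model, the following minimal sketch counts the support of an itemset in a binary matrix and checks it against a threshold. The function names `support` and `is_frequent`, the toy matrix, and the threshold value are assumptions made for illustration; they do not come from the paper, and the sketch is not one of the optimized algorithms it refers to.

```python
from typing import Sequence


def support(matrix: Sequence[Sequence[int]], itemset: Sequence[int]) -> int:
    """Count the transactions (rows) in which every item (column) of `itemset` is 1."""
    return sum(1 for row in matrix if all(row[j] == 1 for j in itemset))


def is_frequent(matrix: Sequence[Sequence[int]],
                itemset: Sequence[int],
                min_support: int) -> bool:
    """An itemset is frequent if it has all ones in at least `min_support` rows."""
    return support(matrix, itemset) >= min_support


# Toy data, invented for this example: 4 transactions (rows) over 3 items (columns).
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
]

# Items 0 and 1 occur together in rows 0, 1 and 3.
print(support(M, [0, 1]))
print(is_frequent(M, [0, 1], min_support=3))
```

This brute-force count scans every row for each candidate itemset, so it only illustrates the definition of support in the binary setting.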
After all, computing the support of an itemset now has to rely on the existence probabilities of the items, which leads to an expected support as introduced by Chui et al. [CKH07]. If the binary matrix is transformed into a probabilistic matrix, where each element takes values in the interval [0, 1], we have the so-called uncertain data model. Under the assumption of statistical independence of the items in all transactions in the dataset, the support of an itemset in this model, as defined by Chui et al. [CKH07], is based on the possible world interpretation of uncertain data. Basically, for every item x and every transaction t there exist two sets of possible worlds, one with the worlds in which x is present in t and one with