Approximation of Frequentness Probability of Itemsets in Uncertain Data

Toon Calders, Eindhoven University of Technology, The Netherlands
Calin Garboni, University of Antwerp, Belgium
Bart Goethals, University of Antwerp, Belgium

Abstract—Mining frequent itemsets from transactional datasets is a well-known problem with good algorithmic solutions. Most of these algorithms assume that the input data is free from errors. Real data, however, is often affected by noise. Such noise can be represented by uncertain datasets in which each item has an existence probability. Recently, Bernecker et al. (2009) proposed the frequentness probability, i.e., the probability that a given itemset is frequent, to select itemsets in an uncertain database. A dynamic programming approach to evaluate this measure was given as well. We argue, however, that for the setting of Bernecker et al. (2009), which assumes independence between the items, well-known statistical tools already exist. We show how the frequentness probability can be approximated extremely accurately using a form of the central limit theorem. We experimentally evaluated our approximation and compared it to the dynamic programming approach. The evaluation shows that our approximation method is extremely accurate even for very small databases, while at the same time it has much lower memory overhead and computation time.

I. INTRODUCTION

In frequent itemset mining, the considered transaction dataset is typically represented as a binary matrix M where each row represents a transaction and every column corresponds to an item. An element M_ij represents the presence or the absence of item j in transaction i by the value 1 or 0, respectively, as in Table I (left). This is the basic traditional model, where we are certain that an item is present or absent in a transaction.
For this type of data, many algorithms have been proposed for mining frequent itemsets, i.e., sets of columns of M that have all ones in at least a given number of transactions (see, e.g., [6] for an overview of frequent itemset mining). In several applications, however, an item is not simply present or absent in a transaction; rather, the probability of it being in the transaction is given. This is the case for data collected from experimental measurements susceptible to noise. For example, in satellite picture data the presence of an object or feature can be expressed more faithfully by a probability score when it is obtained by subjective human interpretation or an image segmentation tool. Such data is called uncertain data, and Table I (right) presents a popular type of uncertain database. This example dataset consists of 4 transactions and 3 items. For every transaction, a score between 0 and 1 is given to reflect the probability that the item is present in the transaction.

    TID   a   b   c        TID   a     b     c
    t1    1   1   0        t1    0.9   0.8   0.2
    t2    0   0   1        t2    0.1   0.1   0.9
    t3    1   1   0        t3    0.6   0.8   0.3
    t4    1   1   0        t4    0.9   0.9   0.2

    Table I. Certain dataset (left) and uncertain dataset (right)

For example, the existence probability of 0.9 associated with item a in the first transaction indicates that there is a 90% probability that a is present in transaction t1 and a 10% probability that it is absent. Table I (left) actually represents an instantiation of the uncertain dataset depicted in Table I (right). Such instantiations are called possible worlds. There are 2^(|T|·|I|) possible worlds, where |T| is the total number of transactions and |I| the total number of items in the dataset. Under the assumption that the presence and absence of the different items are statistically independent, the probability of a possible world is obtained by simply multiplying the individual item probabilities. We will call this model the independent uncertain database model.
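Under the independence assumption, the probability of a possible world factorizes into one term per cell: the item's existence probability if it is present in the world, and one minus that probability if it is absent. The following minimal sketch illustrates this for the example of Table I; the data layout and the function name are illustrative, not from the paper.

```python
# Existence probabilities from Table I (right): columns are items a, b, c.
db = [
    [0.9, 0.8, 0.2],
    [0.1, 0.1, 0.9],
    [0.6, 0.8, 0.3],
    [0.9, 0.9, 0.2],
]

# One instantiation (possible world), Table I (left).
world = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 0],
]

def world_probability(db, world):
    """Probability of a possible world under item independence."""
    p = 1.0
    for probs, row in zip(db, world):
        for q, present in zip(probs, row):
            # Each cell contributes q if the item is present, 1 - q otherwise.
            p *= q if present else 1.0 - q
    return p

print(round(world_probability(db, world), 4))  # 0.0914, as in the running example
```

Applying the same function to the complement world (ones and zeroes switched) yields the far smaller probability 1.92 × 10^(-10) mentioned in the text.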
For example, the probability of the world in Table I (left) is 0.9 × 0.8 × 0.8 × 0.9 × ... × 0.8 = 0.0914. A far less probable world is obtained if we take the complement of Table I (left), i.e., we switch the ones to zeroes and the zeroes to ones. The probability of this world is 1.92 × 10^(-10). The probabilities of all possible worlds sum up to 1.

Mining frequent patterns from this kind of dataset is more difficult than mining from traditional transaction datasets. After all, computing the support of an itemset now has to take the existence probabilities of the items into consideration. To provide information about the frequency of an itemset, two approaches exist. One is based on the expected support, introduced by Chui et al. [5]. For every itemset, its expected support is computed, and those itemsets for which it exceeds a minimum threshold are reported as frequent. The second one is called the frequentness probability and was introduced in [2]. For a particular itemset, it takes the probability distribution of the support into consideration and gives the probability that the itemset is frequent at a given minimum support threshold. The existing methods have the drawback of being computationally costly and of being exposed to rounding errors when dealing with low probability values.
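To make the two measures concrete, here is a small sketch for the running example. The function names and data layout are illustrative. The exact frequentness probability uses the standard Poisson-binomial convolution, in the spirit of the dynamic programming approach of [2], and the normal approximation with continuity correction mirrors the CLT-based idea this paper develops.

```python
from math import erf, prod, sqrt

def containment_probs(db, itemset):
    # P(transaction t contains the itemset), per transaction,
    # under item independence: product of the item probabilities.
    return [prod(t[i] for i in itemset) for t in db]

def expected_support(db, itemset):
    # Expected support (Chui et al.): sum of per-transaction probabilities.
    return sum(containment_probs(db, itemset))

def frequentness_probability(db, itemset, minsup):
    # Exact P(support >= minsup): the support follows a Poisson-binomial
    # distribution; convolve one Bernoulli per transaction (O(|T|^2) DP).
    dist = [1.0]  # dist[k] = P(support == k) over the transactions seen so far
    for p in containment_probs(db, itemset):
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1.0 - p)      # itemset absent from this transaction
            new[k + 1] += q * p          # itemset present in this transaction
        dist = new
    return sum(dist[minsup:])

def frequentness_probability_normal(db, itemset, minsup):
    # CLT-style approximation with continuity correction:
    # support ~ N(mu, sigma^2), mu = sum p_i, sigma^2 = sum p_i * (1 - p_i).
    ps = containment_probs(db, itemset)
    mu = sum(ps)
    sigma = sqrt(sum(p * (1.0 - p) for p in ps))
    z = (minsup - 0.5 - mu) / sigma
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Itemset {a, b} (columns 0 and 1) of the uncertain dataset in Table I:
db = [[0.9, 0.8, 0.2], [0.1, 0.1, 0.9], [0.6, 0.8, 0.3], [0.9, 0.9, 0.2]]
print(expected_support(db, (0, 1)))             # ≈ 2.02
print(frequentness_probability(db, (0, 1), 2))  # ≈ 0.76
```

Note that the DP multiplies many small probabilities, which is exactly where the rounding-error and cost concerns mentioned above arise; the normal approximation only needs the mean and variance of the containment probabilities.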