SMC 2009

Mining High Average-Utility Itemsets

Tzung-Pei Hong
Dept. of Computer Science and Information Engineering
National University of Kaohsiung
Kaohsiung, Taiwan
tphong@nuk.edu.tw

Cho-Han Lee
Institute of Electrical Engineering
National University of Kaohsiung
Kaohsiung, Taiwan
prescott2005@hotmail.com

Shyue-Liang Wang
Dept. of Information Management
National University of Kaohsiung
Kaohsiung, Taiwan
slwang@nuk.edu.tw

Abstract: The average utility measure is adopted in this paper to reveal a better utility effect of combining several items than the original utility measure. A mining algorithm is then proposed to efficiently find the high average-utility itemsets. For each target itemset, it sums the maximal item utility in every transaction that includes the itemset and uses this sum as an upper bound that overestimates the itemset's actual average utility. The mining is then carried out in two phases. As expected, fewer high average-utility itemsets than high utility itemsets are mined under the same threshold. Experimental results also show the performance of the proposed algorithm.

Keywords: utility mining, average utility, two-phase mining, downward closure

I. INTRODUCTION

Liu et al. presented a two-phase algorithm for quickly discovering all high utility itemsets [2, 3]. In this paper, we propose a new idea for evaluating the utilities of itemsets. Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions, regardless of its length. The utility of an itemset in a transaction therefore increases with its length. That is, longer itemsets in a transaction result in higher utility values. Using the same minimum threshold to judge itemsets of different lengths is thus not fair.
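As a minimal illustration of this length effect, the following Python sketch computes the traditional utility of three nested itemsets within a single transaction. The unit profits and quantities are hypothetical values chosen for illustration and are not taken from the data used in this paper.

```python
# Hypothetical unit profits and purchased quantities for one transaction.
# These numbers are illustrative assumptions, not data from the paper.
profit = {"A": 3, "B": 10, "C": 1}
quantity = {"A": 2, "B": 1, "C": 5}


def utility(itemset):
    """Traditional utility of an itemset in the transaction:
    the sum of quantity * profit over the items it contains."""
    return sum(quantity[item] * profit[item] for item in itemset)


# Each itemset extends the previous one by a single item.
for itemset in (("A",), ("A", "B"), ("A", "B", "C")):
    print(itemset, utility(itemset))
```

Because every item contributes a non-negative amount, adding an item to an itemset can never lower its utility within the same transaction, so a fixed threshold is easier for longer itemsets to reach.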
In order to alleviate the effect of the length of itemsets and identify really good utility itemsets, the average utility measure is adopted in this paper to reveal a better utility effect of combining several items than the original utility measure. It is defined as the total utility of an itemset divided by the number of items it contains. The average utility of an itemset is then compared with a threshold to decide whether it is a high average-utility itemset.

An algorithm is also proposed to find all the high average-utility itemsets. Like two-phase mining for high utility itemsets, the proposed mining algorithm uses average-utility upper bounds to overestimate the actual average utilities of itemsets so that the downward closure property is satisfied. The average-utility upper bound of an itemset is designed here as the summation of the maximal utility among the items in each transaction that includes the itemset. Only the combinations of itemsets whose average-utility upper bounds reach the user-defined threshold are added to the candidate set in a level-wise way. The downward closure property can thus be maintained. Finally, the performance of the proposed mining algorithm is verified on real-world market data.

II. REVIEW OF RELATED MINING ALGORITHMS

Agrawal and Srikant proposed the Apriori algorithm [1] to mine association rules from a set of transactions. In each pass, Apriori employs the downward-closure (anti-monotone) property to prune impossible candidates, thus improving the efficiency of identifying frequent itemsets. Many other algorithms based on this property have since been proposed to discover frequent itemsets rapidly [4-7]. Traditional association-rule mining does not, however, consider the quantities sold in transactions or the profit of each item sold, which are important in some applications as well. Yao et al.
thus proposed the utility model to measure how “useful” an itemset is by considering both the quantities and the profits of items [8]. In utility mining, the downward-closure property no longer holds, since the utility of an itemset grows monotonically while its frequency decreases monotonically as the number of items in the itemset increases. These two opposing monotonic properties make the downward-closure property invalid in utility mining. Thus, Barber and Hamilton proposed the Zero pruning (ZP) and Zero subset pruning (ZSP) approaches to exhaustively search for all high utility itemsets in the database [9]. Li et al. then proposed the FSM, the ShFSM and the DCG methods [10, 11] to discover all high utility itemsets by taking advantage of the level-closure property. Besides, Yao proposed a framework for mining high utility itemsets based on mathematical properties of utility constraints [12]. Liu et al. then presented a two-phase algorithm for quickly discovering all high utility itemsets [2, 3]. The approach proposed in this paper is based on that two-phase algorithm.

III. MINING HIGH AVERAGE-UTILITY ITEMSETS

Traditionally, the utility of an itemset is the summation of the utilities of the itemset in all the transactions, regardless of its length. The utility of an itemset in a transaction therefore increases with its length. That is, longer itemsets in a transaction result in higher utility values. For example, assume a transaction is given as shown in Table 1. There are five items in the transaction, respectively denoted A to E. The value attached to each item is the quantity sold in the transaction.

TABLE 1. A TRANSACTION AS THE EXAMPLE.

TID   A   B   C   D   E
tx    1   1   4   1   0

Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics, San Antonio, TX, USA, October 2009. 978-1-4244-2794-9/09/$25.00 ©2009 IEEE
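The average utility and the average-utility upper bound defined in the Introduction, together with the level-wise two-phase idea, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the exact procedure of the proposed algorithm: the unit profits in `PROFIT` are hypothetical, transactions `t2` and `t3` are invented (only `tx` copies the quantities of Table 1), an item with quantity 0 is treated as absent from the transaction, and the threshold is an absolute value compared with `>=`.

```python
from itertools import combinations

# Hypothetical unit profits (assumption; not given in the text so far).
PROFIT = {"A": 3, "B": 10, "C": 1, "D": 6, "E": 5}

# Quantities per transaction. "tx" copies Table 1; "t2" and "t3" are invented.
DB = {
    "tx": {"A": 1, "B": 1, "C": 4, "D": 1, "E": 0},
    "t2": {"A": 0, "B": 2, "C": 0, "D": 1, "E": 1},
    "t3": {"A": 2, "B": 0, "C": 3, "D": 0, "E": 0},
}


def item_utils(tx):
    """Utility (quantity * profit) of each item present in a transaction.
    Assumption: an item with quantity 0 is not in the transaction."""
    return {i: q * PROFIT[i] for i, q in tx.items() if q > 0}


def average_utility(itemset, db):
    """Total utility of the itemset over all transactions that include it,
    divided by the number of items in the itemset."""
    total = 0
    for tx in db.values():
        u = item_utils(tx)
        if all(i in u for i in itemset):
            total += sum(u[i] for i in itemset)
    return total / len(itemset)


def upper_bound(itemset, db):
    """Average-utility upper bound: sum of the maximal item utility in each
    transaction that includes the itemset."""
    total = 0
    for tx in db.values():
        u = item_utils(tx)
        if all(i in u for i in itemset):
            total += max(u.values())
    return total


def mine(db, threshold):
    """Two-phase sketch: (1) level-wise candidates filtered by the upper
    bound, (2) exact average-utility check on the surviving candidates."""
    level = [(i,) for i in sorted(PROFIT) if upper_bound((i,), db) >= threshold]
    candidates = []
    while level:
        candidates.extend(level)
        kept = set(level)
        k = len(level[0]) + 1
        pool = sorted({i for s in level for i in s})
        # Keep a k-itemset only if all its (k-1)-subsets survived (downward
        # closure on the upper bound) and its own bound reaches the threshold.
        level = [c for c in combinations(pool, k)
                 if all(sub in kept for sub in combinations(c, k - 1))
                 and upper_bound(c, db) >= threshold]
    result = {}
    for c in candidates:
        au = average_utility(c, db)
        if au >= threshold:
            result[c] = au
    return result


print(mine(DB, 20))  # {('B',): 30.0, ('B', 'D'): 21.0}
```

With these assumed values, items A and C are pruned in the first level because their upper bounds (16 each) fall below the threshold of 20, so no itemset containing them is ever generated. Seven candidates survive the first phase, and only {B} and {B, D} pass the exact average-utility check in the second phase.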