Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases

Vincent S. Tseng, Bai-En Shie, Cheng-Wei Wu, and Philip S. Yu, Fellow, IEEE

Abstract: Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility like profits. Although a number of relevant algorithms have been proposed in recent years, they incur the problem of producing a large number of candidate itemsets for high utility itemsets. Such a large number of candidate itemsets degrades the mining performance in terms of execution time and space requirement. The situation may become worse when the database contains lots of long transactions or long high utility itemsets. In this paper, we propose two algorithms, namely utility pattern growth (UP-Growth) and UP-Growth+, for mining high utility itemsets with a set of effective strategies for pruning candidate itemsets. The information of high utility itemsets is maintained in a tree-based data structure named utility pattern tree (UP-Tree) such that candidate itemsets can be generated efficiently with only two scans of database. The performance of UP-Growth and UP-Growth+ is compared with the state-of-the-art algorithms on many types of both real and synthetic data sets. Experimental results show that the proposed algorithms, especially UP-Growth+, not only reduce the number of candidates effectively but also outperform other algorithms substantially in terms of runtime, especially when databases contain lots of long transactions.

Index Terms: Candidate pruning, frequent itemset, high utility itemset, utility mining, data mining

1 INTRODUCTION

Data mining is the process of revealing nontrivial, previously unknown and potentially useful information from large databases. Discovering useful patterns hidden in a database plays an essential role in several data mining tasks, such as frequent pattern mining, weighted frequent pattern mining, and high utility pattern mining.
Among them, frequent pattern mining is a fundamental research topic that has been applied to different kinds of databases, such as transactional databases [1], [14], [21], streaming databases [18], [27], and time series databases [9], [12], and to various application domains, such as bioinformatics [8], [11], [20], Web click-stream analysis [7], [35], and mobile environments [15], [36]. Nevertheless, the relative importance of each item is not considered in frequent pattern mining. To address this problem, weighted association rule mining was proposed [4], [26], [28], [31], [37], [38], [39]. In this framework, weights of items, such as unit profits of items in transaction databases, are considered. With this concept, even if some items appear infrequently, they might still be found if they have high weights. However, this framework still does not consider the quantities of items. Therefore, it cannot satisfy the requirements of users who are interested in discovering the itemsets with high sales profits, since profits are composed of unit profits, i.e., weights, and purchased quantities.

In view of this, utility mining has emerged as an important topic in the data mining field. Mining high utility itemsets from databases refers to finding the itemsets with high profits. Here, itemset utility means the interestingness, importance, or profitability of an item to users. Utility of items in a transaction database consists of two aspects: 1) the importance of distinct items, which is called external utility, and 2) the importance of items in transactions, which is called internal utility. Utility of an itemset is defined as the product of its external utility and its internal utility. An itemset is called a high utility itemset if its utility is no less than a user-specified minimum utility threshold; otherwise, it is called a low-utility itemset.
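As an illustration of the definitions above, the following Python sketch computes itemset utilities on a toy database and applies a minimum utility threshold. It is only a sketch under stated assumptions: the items, unit profits, quantities, and the function names itemset_utility and high_utility_itemsets are hypothetical, and the utility of an itemset is assumed to be the sum, over all transactions containing the whole itemset, of quantity times unit profit for each of its items. The exhaustive enumeration is used for clarity only and is not one of the algorithms proposed in this paper.

```python
from itertools import combinations

# Hypothetical external utilities: unit profit of each item.
UNIT_PROFIT = {"A": 5, "B": 2, "C": 1}

# Hypothetical transactions: item -> purchased quantity (internal utility).
TRANSACTIONS = [
    {"A": 1, "B": 2},
    {"B": 4, "C": 3},
    {"A": 2, "B": 1, "C": 1},
]


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction containing the itemset."""
    items = set(itemset)
    total = 0
    for tx in transactions:
        # Only transactions that contain the whole itemset contribute.
        if items.issubset(tx):
            total += sum(tx[i] * unit_profit[i] for i in items)
    return total


def high_utility_itemsets(transactions, unit_profit, min_util):
    """Enumerate every itemset (brute force) and keep those with utility >= min_util."""
    all_items = sorted({i for tx in transactions for i in tx})
    result = {}
    for size in range(1, len(all_items) + 1):
        for itemset in combinations(all_items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= min_util:
                result[itemset] = utility
    return result


if __name__ == "__main__":
    print(itemset_utility({"B"}, TRANSACTIONS, UNIT_PROFIT))       # 14
    print(itemset_utility({"A", "B"}, TRANSACTIONS, UNIT_PROFIT))  # 21
    print(high_utility_itemsets(TRANSACTIONS, UNIT_PROFIT, 15))
```

With a minimum utility threshold of 15 on this toy data, {A} (utility 15) and {A, B} (utility 21) qualify as high utility itemsets, while {B} (utility 14) does not, even though its superset {A, B} does.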
Mining high utility itemsets from databases is an important task with a wide range of applications, such as website click stream analysis [16], [25], [29], business promotion in chain hypermarkets, cross-marketing in retail stores [3], [10], [19], [30], [32], [33], online e-commerce management, mobile commerce environment planning [24], and even finding important patterns in biomedical applications [5].

However, mining high utility itemsets from databases is not an easy task, since the downward closure property [1] used in frequent itemset mining does not hold. In other words, pruning the search space for high utility itemset mining is difficult because a superset of a low-utility itemset may be a high utility itemset. A naïve method to address this problem is to enumerate all itemsets from databases by the principle of exhaustion. Obviously, this method suffers from a large search space, especially when databases contain lots of long transactions or a low minimum utility threshold is set.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 8, AUGUST 2013, p. 1772

V.S. Tseng, B.-E. Shie, and C.-W. Wu are with the Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City, Taiwan 70101, R.O.C. E-mail: tsengsm@mail.ncku.edu.tw, brianshie@gmail.com, silvemoonfox@idb.csie.ncku.edu.tw.

P.S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, and the Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia. E-mail: psyu@cs.uic.edu.

Manuscript received 31 Jan. 2011; revised 2 Nov. 2011; accepted 28 Feb. 2012; published online 9 Mar. 2012. Recommended for acceptance by J. Freire. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2011-01-0045. Digital Object Identifier no. 10.1109/TKDE.2012.59.

Hence, how to effectively prune the search