An effective tree structure for mining high utility itemsets

Chun-Wei Lin a, Tzung-Pei Hong b,c, Wen-Hsiang Lu a

a Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan, ROC
b Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
c Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan, ROC

Keywords: Utility mining; High utility pattern; HUP-tree; HUP-growth; Two-phase mining; Downward closure

Abstract

In the past, many algorithms were proposed to mine association rules, most of which were based on item frequency values. Because a customer may buy many copies of an item and each item may have a different profit, mining frequent patterns from a traditional database is not suitable for some real-world applications. Utility mining was thus proposed to consider costs, profits and other measures according to user preference. In this paper, the high utility pattern tree (HUP tree) is designed and the HUP-growth mining algorithm is proposed to derive high utility patterns effectively and efficiently. The proposed approach integrates the previous two-phase procedure for utility mining with the FP-tree concept in order to utilize the downward-closure property and generate a compressed tree structure. Experimental results show that the proposed approach outperforms Liu et al.'s two-phase algorithm in execution time. Finally, the numbers of tree nodes generated by three different item ordering methods are compared, and the results show that frequency ordering produces fewer tree nodes than the other two.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Mining frequent itemsets from a transaction database is a fundamental task for knowledge discovery. Its goal is to identify the itemsets whose frequencies of appearance are above a certain threshold.
It usually serves as a basic procedure in finding association rules (Agrawal, Imielinski, & Swami, 1993a; Agrawal, Imielinski, & Swami, 1993b; Agrawal & Srikant, 1994; Chen, Han, & Yu, 1996; Cheung, Lee, & Kao, 1997) and sequential patterns (Agrawal & Srikant, 1995).

In the past, numerous methods were proposed to discover frequent itemsets. The approaches can be divided into two categories: level-wise approaches and pattern-growth approaches. The Apriori algorithm (Agrawal et al., 1993a) was first proposed to mine association rules in a level-wise manner. The FP-growth algorithm was then proposed to construct a compressed tree structure and to mine rules based on it (Han, Pei, & Yin, 2000).

Both the Apriori and the FP-growth approaches treat all the items in a database as binary variables. That is, they only consider whether or not an item is bought in a transaction. In this case, frequent itemsets reveal only how often the itemsets occur in the transactions, and do not reflect other implicit factors such as prices or profits. For example, a sale of diamonds may occur less frequently than a sale of clothing in a department store, but the former gives a much higher profit per unit sold than the latter. Frequency alone is thus not sufficient to identify highly profitable items.

Utility mining (Yao & Hamilton, 2006; Yao, Hamilton, & Butz, 2004) was thus proposed to partially solve the above problem. It may be thought of as an extension of frequent-itemset mining in which sold quantities and item profits are considered. The utility expresses how "useful" an itemset is. Utility mining usually aims to find high utility itemsets, that is, itemsets whose utility values are larger than or equal to a user-defined threshold. In practice, the utility value of an itemset can be measured in terms of costs, profits or other measures according to user preference.
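
The notion of utility described above can be illustrated with a minimal sketch. It assumes the common definition in which the utility of an itemset in a transaction is the sum of quantity times unit profit over its items, and the utility of the itemset is the sum over all transactions that contain it. The item names, profits, quantities and function names below are hypothetical and are not taken from the paper; the brute-force enumeration only illustrates the definition and is not the mining algorithm proposed here.

```python
from itertools import combinations

# Hypothetical unit profits per item (illustrative values, not from the paper).
PROFIT = {"diamond": 500, "shirt": 10, "socks": 2}

# Hypothetical transactions: each maps an item to the quantity sold.
TRANSACTIONS = [
    {"shirt": 3, "socks": 5},
    {"diamond": 1, "shirt": 1},
    {"shirt": 2, "socks": 4},
    {"socks": 10},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for tx in transactions if all(item in tx for item in itemset))


def itemset_utility(itemset, transactions, profit):
    """Sum of quantity * unit profit over the transactions containing the itemset."""
    total = 0
    for tx in transactions:
        if all(item in tx for item in itemset):
            total += sum(tx[item] * profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, profit, min_utility):
    """Brute-force enumeration of itemsets with utility >= min_utility.

    Exponential in the number of items; acceptable only for a toy example.
    """
    items = sorted(profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, profit)
            if utility >= min_utility:
                result[itemset] = utility
    return result


if __name__ == "__main__":
    for itemset, utility in high_utility_itemsets(TRANSACTIONS, PROFIT, 100).items():
        print(itemset, utility, support_count(itemset, TRANSACTIONS))
```

With this illustrative data and a threshold of 100, only the itemsets containing the diamond qualify, although the diamond appears in a single transaction; the shirt appears in three transactions but has a utility of only 60.
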
For example, one user may be interested in finding the itemsets with good profits, while another may focus on the itemsets that cause little pollution during manufacturing.

Liu et al. then presented the two-phase algorithm for quickly discovering all high utility itemsets (Liu, Liao, & Choudhary, 2005) based on the downward-closure property. The property indicates that any superset of a non-frequent itemset is also non-frequent. It is thus also called the anti-monotone property. The property is used to reduce the search space by pruning non-frequent itemsets early. The two-phase algorithm generates candidate high utility itemsets in a level-wise way. The database-scanning time is, however, a bottleneck of the approach.

Expert Systems with Applications 38 (2011) 7419–7424
doi:10.1016/j.eswa.2010.12.082
Corresponding author at: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC.
E-mail addresses: p7895122@mail.ncku.edu.tw (C.-W. Lin), tphong@nuk.edu.tw (T.-P. Hong), whlu@mail.ncku.edu.tw (W.-H. Lu).

In this paper, a new utility-mining approach with the aid of a tree structure is proposed. A new tree structure called the high