A Statistical Machine Learning Approach for Ticket Mining in IT Service Delivery

Ea-Ee Jan, Jian Ni, Niyu Ge, Naga Ayachitula, Xiaolan Zhang
IBM T.J. Watson Research Center, 19 Skyline Dr, Hawthorne, NY
{ejan|nij|niyuge|nagaaka|cxzhang}@us.ibm.com

Abstract—Ticketing is a fundamental management process of IT service delivery. Customers typically express their requests in the form of tickets related to problems or configuration changes of existing systems. Tickets contain a wealth of information which, when connected with other sources of information such as asset, configuration, and monitoring information, can yield new insights that would otherwise be impossible to gain from one isolated source. Linking these various sources of information requires a common key shared by these data sources. That key is the server name. Unfortunately, for historical as well as practical reasons, server names are not always present in the tickets as a standalone field. Rather, they are embedded in unstructured text fields such as the abstract and description. Thus, automatically identifying server names in tickets is a crucial step in linking various information sources. In this paper, we present a statistical machine learning method called Conditional Random Field (CRF) that can automatically identify server names in tickets with high accuracy and robustness. We then illustrate how such linkages can be leveraged to create new business insights.

Keywords: Service management, Data model, Analytics, Statistical Models, Machine Learning

I. INTRODUCTION

The success of IT service delivery depends heavily on making strategic decisions based on insights from multiple business components. These decisions require a comprehensive examination of several IT service management processes, such as incident management, configuration management, and service request management [7,9,10,11]. Forming such an encompassing view is a complex process.
First, it requires unifying various tools and data models that are produced independently by different business units. Second, it needs to take into account the human factors, which may result in missing or inconsistent content. Third, as the service delivery business evolves, new problems arise that may demand previously unexplored concepts or features to be derived from the operations content. Thus, existing data models cannot be used and the new knowledge must be inferred from available content.

Recent research has focused on building semi-automated knowledge discovery tools that are capable of locating information across a variety of business units and data models [9,12]. For example, data center operational issues, such as outages and IT system incidents, are recorded and tracked as tickets in IPC (Incident, Problem, Change) management systems. In another department, data center asset management, a different set of system management tools keeps track of the available servers in the data center and the hardware and software installed on each server. By linking the tickets and the server configuration details, one can learn about the top server features that lead to the creation of the tickets, and can drive targeted actions for productivity improvement. One can also learn how groups of servers with different hardware and software compare with respect to their ticket handling labor costs and equipment costs.

In some IT environments, legacy IPC and asset management systems are not integrated to provide a structured ticket field that captures an accurate reference to related servers. In other instances, although dedicated fields are available, the information might be missing or inaccurate because system administrators may have forgotten to capture it appropriately as they rush to close the ticket and meet productivity targets.
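
To make the ticket-to-server linkage described above concrete, the following is a minimal sketch, not the system used in this work. It assumes tickets and asset records are available as simple in-memory records; all field names (`server_name`, `labor_hours`, `os`, `hardware`) and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical ticket records; the field names are assumptions for this sketch.
tickets = [
    {"ticket_id": "T1", "server_name": "srv-a01", "labor_hours": 2.0},
    {"ticket_id": "T2", "server_name": "srv-a01", "labor_hours": 1.5},
    {"ticket_id": "T3", "server_name": "srv-b07", "labor_hours": 4.0},
    {"ticket_id": "T4", "server_name": None, "labor_hours": 0.5},  # name not captured
]

# Hypothetical asset/configuration records, keyed by server name.
servers = {
    "srv-a01": {"os": "Linux", "hardware": "x86"},
    "srv-b07": {"os": "AIX", "hardware": "POWER"},
}


def link_tickets(tickets, servers):
    """Join each ticket to its server configuration using the server name as key."""
    linked, unlinked = [], []
    for ticket in tickets:
        config = servers.get(ticket["server_name"])
        if config is None:
            # No usable server name: the ticket cannot be linked.
            unlinked.append(ticket)
        else:
            linked.append({**ticket, **config})
    return linked, unlinked


def summarize_by(linked, feature):
    """Return {feature value: (ticket count, total labor hours)}."""
    counts = Counter()
    hours = defaultdict(float)
    for row in linked:
        counts[row[feature]] += 1
        hours[row[feature]] += row["labor_hours"]
    return {value: (counts[value], hours[value]) for value in counts}


linked, unlinked = link_tickets(tickets, servers)
summary = summarize_by(linked, "os")
print(summary)
print("tickets that could not be linked:", len(unlinked))
```

The join only succeeds for tickets that already carry a reliable server name; ticket `T4` in the sample data falls out of the analysis because its name was never captured.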
Therefore, in a business-knowledge discovery framework that aspires to gain insights across business units, text mining on the ticket descriptions is essential for linking ticketing records and server configuration records. The tickets and the server configuration details are linked by their common server names. Finding server names in the tickets is thus an essential first step toward bridging the gap. Although ideally the ticketing application should provide users with a structured interface with a required ‘server name’ field, there are also many applications that allow users to enter free text. In this paper we focus on such free-text tickets. We present a statistical machine learning method that can identify server names in unstructured ticket text accurately and robustly.

II. PREVIOUS WORK

Tickets comprise extensive details about the service requests and problem symptoms in unstructured fields such as the abstract, description, and resolution. Previous work [9,12] on finding server names in tickets has been largely heuristic-based. Before processing the tickets, a dictionary of server references is first extracted from the server configuration database. Because the tickets consist mostly of free text, well-formatted text cannot be assumed. In fact, quotes, parentheses, and other special characters are frequently used around server information. Space delimiters are sometimes missing, and prefixes and suffixes are added by various system operations. To build heuristic rules, a set of tickets is set aside for rule training and creation as follows. Each ticket is tokenized, and the tokens are matched against the dictionary using fuzzy matching algorithms. The prefix and suffix matches between each token and the dictionary entries are collected. Histograms are created for each prefix and suffix pattern. The patterns with highest frequency counts are used as
541 978-3-901882-50-0 © 2013 IFIP