IJIRST – International Journal for Innovative Research in Science & Technology | Volume 2 | Issue 1 | June 2015
ISSN (online): 2349-6010
All rights reserved by www.ijirst.org

Online SVM based Optimized Bug Report Triaging using Feature Extraction

Ms. Neetika Sharma
M. Tech. Scholar
Department of Computer Science and Engineering
Kautilya Institute of Technology and Engineering, Jaipur (Rajasthan)

Dr. Vijay Kumar
Professor and Head
Department of Computer Science and Engineering
Kautilya Institute of Technology and Engineering, Jaipur (Rajasthan)

Abstract
Triage is a medical term referring to the process of prioritizing patients based on the severity of their condition so as to maximize benefit (help as many as possible) when resources are limited. Bug report triaging is the process by which tracker issues are screened and prioritized. Triage should help ensure that all reported issues are properly managed: bugs as well as improvements and feature requests. The large number of new bug reports received in the bug repositories of software systems makes their management a challenging task. Handling these reports manually is time-consuming and often delays the resolution of important bugs. The most critical issue with bug reports is that their number is vast and most of them are duplicates of some previously submitted bug report. Solving this problem requires that bug reports be categorized into groups, where each group consists of all the bug reports that belong to the same bug, and the number of groups equals the number of unique bugs reported so far. A bug report corresponding to a new bug is placed in a separate group, followed by its duplicates, if any. Deciding whether a bug report submitted by a user and written in natural language is a duplicate or a unique report is a time-consuming task, especially when the number of bug reports received is large. Thus, this process needs to be automated.
Bug reports have textual, contextual and categorical features, and these features need to be extracted to distinguish duplicates from non-duplicates. Moreover, within a group of reports, one report can be designated as the master, and all other reports that correspond to the same bug are linked to it. Duplicates therefore need not be discarded, so that together they can later provide a complete description of the bug. In this paper, a considerably extended set of textual features is considered for duplicate bug report detection. A Support Vector Machine (SVM) classifier is used to classify each incoming bug report as a duplicate or a non-duplicate. The proposed model is simulated using the R statistical package on a sample of bug reports from the Mozilla repository. The simulation results indicate that the proposed classifier has higher efficiency than the existing technique BM25F, which employs 25 feature sets.

Keywords: Bug Report Triaging, Feature Extraction, Machine Learning Algorithms, Bayesian Classifier
_______________________________________________________________________________________________________

I. INTRODUCTION

An open source project typically maintains an open bug repository so that bug reports from all over the world can be gathered. When a new bug report is submitted to the repository, a person called a triager examines whether it is a duplicate of an existing bug report. If it is, the triager marks it as a duplicate and the bug report is removed from consideration for further work. In the literature, there are approaches that exploit only natural language information to detect duplicate bug reports. Such software repositories provide an abundance of valuable information about open source projects [1]. With the increase in the size of the data maintained by these repositories, automated extraction of such data from individual repositories, as well as of linked information across repositories, has become a necessity.
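The triage step described above, in which an incoming report is either linked to an existing master report or becomes the master of a new group, can be illustrated with a minimal sketch. The sketch is not the SVM classifier proposed in this paper: it substitutes a simple Jaccard overlap of word tokens for the learned classifier, and the function names (`tokens`, `jaccard`, `triage`), the group layout, the threshold of 0.5 and the sample reports are all hypothetical choices made for illustration only.

```python
import re


def tokens(text):
    """Lower-case a report and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def triage(report, groups, threshold=0.5):
    """Attach `report` to the most similar group, or open a new group.

    `groups` is a list of dicts {"master": str, "duplicates": [str]}.
    The threshold is an assumed value, not one taken from the paper.
    Returns the index of the group that now holds the report.
    """
    new_tokens = tokens(report)
    best_index, best_score = None, 0.0
    for index, group in enumerate(groups):
        score = jaccard(new_tokens, tokens(group["master"]))
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score >= threshold:
        # Duplicates are kept and linked to the master, not discarded.
        groups[best_index]["duplicates"].append(report)
        return best_index
    # No sufficiently similar master: the report starts a new group.
    groups.append({"master": report, "duplicates": []})
    return len(groups) - 1


# Made-up sample reports, used only to exercise the sketch.
groups = []
triage("Browser crashes when opening a new tab", groups)
triage("Crashes when opening new tab in browser", groups)
triage("Bookmarks toolbar is missing icons", groups)
```

With these three sample reports, the second is linked to the first as a duplicate and the third opens its own group, so the number of groups equals the number of distinct bugs seen. A learned classifier such as an SVM would replace the fixed threshold with a decision function trained on labelled duplicate and non-duplicate pairs.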
In this paper, a framework is described that uses web scraping to automatically mine repositories and link information across them. Mining software repositories is an important activity when analyzing large-scale projects, and mining information across multiple data sources is one of its challenges [2]. It is observed that relevant information from one repository can complement the mining activity on another repository. For example, the Bugzilla bug trackers for the Mozilla and Red Hat projects contain custom "keywords", textual tags that help identify specific categories of bugs in the database. The keyword "security" relates to a security bug, and it could have been used to identify the security bugs in projects such as Fedora and Firefox. However, the keyword was not used consistently across bug reports.

The Common Vulnerabilities and Exposures (CVE) [3] site maintains information about publicly known vulnerabilities. The vulnerabilities tagged by CVE are contained in the National Vulnerability Database (NVD), a U.S. government repository devised to manage vulnerability data. The NVD lists vulnerabilities specific to different types of products, including Fedora (Red Hat) and Firefox (Mozilla). The external resource section of each vulnerability listed in the NVD has a mapping or link to the corresponding bugs in the respective bug-tracking system. This implies that information from the NVD can be utilised in mining security bugs in projects such as Firefox and Fedora. For projects such as Ubuntu, which deploy the Launchpad bug tracker, the search engine allows searching for bugs with CVE tags [4]. These CVE tags in turn can be used to collect linked vulnerability characteristics in terms of the nature of exploits,