Circle: Design and Implementation of a Classifier Based on Circuit Minimization

Mehmet Dalkilic *
Center for Genomics & Bioinformatics
School of Informatics
Indiana University
Bloomington, IN 47405
dalkilic@indiana.edu

Arijit Sengupta
Information Systems
Kelley School of Business
Indiana University
Bloomington, IN 47405
asengupt@indiana.edu

Abstract

We present Circle, a classification algorithm based on the principles of Boolean function minimization. The classification process uses a recursive method to generate a set of implicants (or rules). The novelty of the algorithm is that the generated rules contain information not only about the presence of features but also about their absence in determining class values. Although function minimization is inherently exponential in the number of attributes, we introduce several optimization techniques that reduce the complexity to the extent that the algorithm scales to 1000 columns, the limit in most commercial database systems. Circle is a levelwise algorithm that iteratively produces implicants. For portions of the training set that are misclassified, Circle recurs, producing additional implicants that are logically conjoined to the previous implicants. We describe several optimization techniques applied to the base Circle algorithm, as well as numerous ways it can be configured for specific data mining tasks. Circle is implemented entirely in Java with a JDBC-compliant database backend. One of the primary applications of Circle is mining bioinformatics data, particularly genomic data. We present results of experiments running Circle on several well-known machine learning data sets, as well as on special large genomic data sets.
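To make concrete the idea of a rule that records both the presence and the absence of features, the following is a minimal sketch, not Circle's actual implementation (which the abstract states is written in Java). The representation of an implicant as a mapping from feature name to required truth value, the function name `matches`, and the feature names `f1` and `f2` are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical representation (not taken from the paper): an implicant maps a
# feature name to the value it requires. True means the feature must be
# present, False means it must be absent. Features not named are "don't care".
Implicant = Dict[str, bool]


def matches(implicant: Implicant, record: Dict[str, bool]) -> bool:
    """Return True if the record satisfies every literal of the implicant.

    A feature missing from the record is treated as absent (False).
    """
    return all(
        record.get(feature, False) == required
        for feature, required in implicant.items()
    )


# The rule "f1 AND NOT f2": f1 must be present and f2 must be absent.
rule: Implicant = {"f1": True, "f2": False}

covered = matches(rule, {"f1": True, "f2": False, "f3": True})
not_covered = matches(rule, {"f1": True, "f2": True, "f3": True})
```

Here `covered` is true and `not_covered` is false: the two records differ only in `f2`, so it is the absence of `f2` that decides whether the rule applies. How Circle generates such implicants and combines them across recursion levels is not shown in this sketch.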
* Partially supported by NSF-IIS-0082401

1 Introduction

Both inexpensive storage and the ability to generate and collect information have outpaced any reasonable expectation of interpreting this information without automatic, or at least semi-automatic, techniques. In the area of bioinformatics, e.g., genomics, the accumulation and kinds of information discovered from high-throughput sequencing of proteins¹ far outpace even Moore's law. There are two kinds of significant challenges facing bioinformatics: operations and computation. Operations, defined loosely, is the design and implementation of information systems that allow general search; provide grid services via the web for deployment of software and data sets; provide web portals for scientists focused on various aspects of bioinformatics to submit and post new findings to public repositories of bioinformatic information that are shared throughout bioinformatics communities; and provide a suitably structured environment for in silico science, i.e., computation. See [14] for an excellent overview. The other challenge is computation, including the development of models, algorithms, and data mining². One of the primary tasks of bioinformaticians is to make sense of the sequenced proteins, i.e., to determine their function. A protein's function, it has come to be widely believed, is determined by its three-dimensional structure, which is in turn determined directly by the linear sequence of amino acids making up the protein. Crystallization is currently the only means of directly determining structure, but it is labor intensive and very difficult. High-throughput techniques have matured to the point that it is far easier to sequence many thousands of proteins than to crystallize a few. In treating proteins as collections of strings, bioinformaticians realized that, by "aligning" strings, they could identify similar proteins, which would likely share similar 3D structure and, therefore, function.
As an example we present an alignment of about the last 30 residues of human alpha globulin

¹ A protein, for the purposes of this paper, is a string over an alphabet Σ with |Σ| = 20, where the symbols denote amino acids, organic molecules that, when connected, form a chain that is biologically active, e.g., as an enzyme.

² Data mining and general search from operations are significantly different. In general search, scientists might be interested in seeking authors, citations, or protein sequences.