Dimension Reduction Techniques for Training Polynomial Networks

William M. Campbell    P27439@EMAIL.MOT.COM
Kari Torkkola    A540AA@EMAIL.MOT.COM
Motorola Human Interface Lab, 2100 East Elliot Road, M/D EL508, Tempe, AZ 85284

Sreeram V. Balakrishnan    FSB027@EMAIL.MOT.COM
Motorola Human Interface Lab, 3145 Porter Drive, Palo Alto, CA 94304

Abstract

We propose two novel methods for reducing dimension in training polynomial networks. We consider the class of polynomial networks whose output is the weighted sum of a basis of monomials. Our first method for dimension reduction eliminates redundancy in the training process. Using an implicit matrix structure, we derive iterative methods that converge quickly. A second method for dimension reduction involves a novel application of random dimension reduction to "feature space." The combination of these algorithms produces a method for training polynomial networks on large data sets with decreased computation over traditional methods, along with model complexity reduction and control.

1. Introduction

We consider polynomial networks of the following type. The inputs, $x_1, \ldots, x_N$, to the network are combined with multipliers to form a vector of basis functions, $p(\mathbf{x})$; for example, for two inputs $x_1$ and $x_2$ and a second-degree network, we obtain

$$p(\mathbf{x}) = \begin{bmatrix} 1 & x_1 & x_2 & x_1^2 & x_1 x_2 & x_2^2 \end{bmatrix}^{t}. \qquad (1)$$

A second layer linearly combines all these inputs to produce scores, $s = \mathbf{w}^{t} p(\mathbf{x})$. We call $\mathbf{w}$ the classification (or verification) model. In general, polynomial basis terms of the form $x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}$ are used, where $k_1 + k_2 + \cdots + k_N$ is less than or equal to the polynomial degree, $K$. For each input vector, $\mathbf{x}$, and each class, $i$, a score is produced by the inner product, $\mathbf{w}_i^{t} p(\mathbf{x})$. If a sequence of input vectors, $\mathbf{x}_1, \ldots, \mathbf{x}_M$, is introduced to the classifier, the total score is the average score over all inputs, $s = \frac{1}{M} \sum_{m=1}^{M} \mathbf{w}^{t} p(\mathbf{x}_m)$. The total score is used for classification or verification. Note that we do not use a sigmoid on the output as is common in higher-order neural networks (Giles & Maxwell, 1987).
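The scoring pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the helper names `poly_basis` and `total_score` are our own. For two inputs and degree 2, `poly_basis` reproduces the ordering of equation (1), and `total_score` averages the inner-product scores over a sequence of input vectors.

```python
import numpy as np
from itertools import combinations_with_replacement


def poly_basis(x, degree):
    """All monomials of the inputs up to the given total degree,
    starting with the constant term 1 (illustrative helper)."""
    terms = [1.0]
    for d in range(1, degree + 1):
        # Each index tuple selects one monomial, e.g. (0, 1) -> x_1 * x_2.
        for idx in combinations_with_replacement(range(len(x)), d):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms)


def total_score(w, xs, degree=2):
    """Average score of a sequence of input vectors for one class model w."""
    return np.mean([w @ poly_basis(x, degree) for x in xs])
```

For example, `poly_basis([2.0, 3.0], 2)` yields the six terms `[1, x1, x2, x1^2, x1*x2, x2^2]` evaluated at (2, 3), i.e. `[1, 2, 3, 4, 6, 9]`.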
Training techniques for polynomial networks fall into several categories. The first category estimates the parameters of the polynomial expansion from in-class data (Fukunaga, 1990; Specht, 1967). These methods approximate the class-specific probabilities (Schürmann, 1996). Since out-of-class data is not used when training a specific model, accuracy is limited. A second category of methods involves discriminative training (Schürmann, 1996) with a mean-squared-error criterion. The goal of these methods is to approximate the a posteriori distribution for each class. This approach traditionally involves the decomposition of large matrices, so it is intractable for large training sets in terms of both computation and storage. A more recent method uses support vector machines, which apply the technique of structural risk minimization. We use an alternate training technique, based on the method in Campbell and Assaleh (1999), which approximates a posteriori probabilities.

The use of polynomial networks and our training/classification method is motivated from an application perspective. First, the discriminative training method in Campbell and Assaleh (1999) can be applied efficiently to very large data sets. For the speech-processing application we consider, this property is critical, since we want to be able to train systems in a reasonable amount of time without custom hardware. Second, discriminative training of polynomial networks produces state-of-the-art performance in terms of the number of parameters needed, accuracy, and computational effort for several applications, including speaker recognition and isolated word recognition. Polynomial networks outperform many other common techniques for these applications because they are discriminatively trained and approximate a posteriori probabilities.
In contrast, techniques such as Gaussian mixture models and hidden Markov models use maximum-likelihood training and approximate in-class probabilities. For open-set problems, this leads to difficulties, since the out-of-class set is not modeled well (although partial solutions such as "cohort normalization" have been proposed (Campbell, Jr.,