Automatic Data Selection for MLP-based Feature Extraction for ASR

Carmen Peláez-Moreno 1, Qifeng Zhu 2, Barry Chen 2, Nelson Morgan 2

1 Department of Signal Theory and Communications, University Carlos III, Madrid, Spain
carmen@tsc.uc3m.es
2 International Computer Science Institute (ICSI), Berkeley, California, USA
qifeng, byc, morgan@icsi.berkeley.edu

Abstract

The use of huge databases in ASR has become an important source of ASR system improvements in recent years. However, their use demands an increase in the computational resources necessary to train the recognizers. Several techniques have been proposed in the literature with the purpose of making better use of these enormous databases by selecting the most 'informative' portions and thus reducing the computational burden. In this paper, we present a technique to select samples from a database that allows us to obtain similar results in MLP-based feature extraction stages while using around 60% of the data.

1. Introduction

In recent years, databases for speech recognition have been growing ever larger, because the use of these huge databases has become an important source of accuracy improvement. However, the inclusion of these enormous resources into the recognition engines does not come without disadvantages: the computational demands have increased considerably.

The use of the outputs of MLP neural networks as features, which allows us to incorporate long-term information into the feature vectors, has proven to successfully increase recognition performance [1, 2, 3]. Nevertheless, this inclusion again demands significantly more computational effort. Not only do these demands pose a problem for the final recognition systems, but they also play an important role in the research stages, which has motivated the use of intermediate tasks and several strategies to make the process easier and more efficient [4, 5].
In this context, we propose to look into the data provided in those databases and study its 'usefulness', i.e., to look for and eliminate both redundancies and potentially harmful data such as outliers or mislabelled samples. Moreover, we argue that some types of data are more easily learned than others, and therefore part of the data belonging to those easy groups can be safely removed from the training set without any loss of recognition accuracy. Here we report experiments carried out to demonstrate the validity of these assertions.

2. Data Selection

Several data selection techniques have been proposed in the literature, and variations of them can be found under names such as novelty detection, selective sampling, or active learning. However, the goals of these techniques can differ [6]:

Generative methods aim at selecting the best samples from unlabeled data to maximize data labeling investments.

Selective methods try to select an adequate subset from labeled data to maximize performance, or to reduce the computational effort while maintaining a similar performance. Here we can further distinguish between wrapper and filter approaches. The former employs a statistical re-sampling technique (such as cross-validation) and uses the actual target learning algorithm to estimate the accuracy of the subsets. Its disadvantage is its high cost, because the learning algorithm has to be called repeatedly. Filter techniques are based on selectors that operate independently of the learning algorithm, i.e., undesirable samples are filtered out of the data before induction commences [7].

Although generative methods also have important applications in speech recognition, here we are primarily concerned with the selection of already labeled data.
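As a rough illustration of the filter idea, the following sketch trains a small stand-in scorer on labeled samples and discards the fraction that it already classifies most confidently, i.e., the most 'redundant' ones. This is a hypothetical simplification, not the paper's actual system: a linear softmax classifier stands in for the reduced learner, and `keep_frac`, `train_small_scorer`, and `select_informative` are illustrative names, not part of any published implementation.

```python
import numpy as np

def train_small_scorer(X, y, n_classes, epochs=200, lr=0.5):
    """Train a linear softmax scorer -- a cheap stand-in for a reduced
    version of the target learner (hypothetical simplification)."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / len(X)                      # softmax cross-entropy gradient
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def select_informative(X, y, n_classes, keep_frac=0.6):
    """Filter-style selection: rank samples by the scorer's confidence
    and keep the least confident fraction (the 'hard' samples)."""
    W, b = train_small_scorer(X, y, n_classes)
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    confidence = P.max(axis=1)                    # high = easy / redundant
    n_keep = int(len(X) * keep_frac)
    keep_idx = np.argsort(confidence)[:n_keep]    # least confident first
    return np.sort(keep_idx)
```

Because the scorer runs once and is far cheaper than the full learner, this keeps the filter property of operating before induction, while the confidence ranking borrows the wrapper idea of letting (a reduced form of) the learner itself judge the data.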
As we will explain further, the technique we propose is based on the filter approach but does not employ the actual labels, which also makes it suitable for use as a generative method. However, if used in the latter fashion, a labeling stage must be undertaken after the selection. In addition, it shares with the wrapper approach the idea of using the target learning algorithm to perform the selection. This makes the selection and learning criteria match, while using a reduced version of that algorithm for the selection to avoid the high computational cost.

To gain a better understanding of selective methods, it is worth noting that they obtain their benefits from two facts:

Reducing the redundancy existing in the database can help to reduce the costs of learning, achieving the same performance with less effort. Redundancy, however, should not be measured in terms of the number of examples present for each class, for two reasons: first, not all classes are equally separable, which makes certain classes easier to learn and therefore their samples more redundant for the learning machine; second, the most common samples in the training set are usually the most common samples in the test set as well, making it wise to model some classes better than others.

On the other hand, over-represented examples in the database can harm the generalization capabilities of a given learning machine by biasing its modeling toward those classes. This can be negative if the distribution of testing samples among classes is not the same as that seen in training.

2.1. Evaluation methods and sampling criteria

For the selection of data based on the filter approach we need an evaluation method that allows us to sort the data according