Evolution of Speech Recognizer Agents by Artificial Life

Ramin Halavati, Saeed Bagheri Shouraki, Saman Harati Zadeh, and Caro Lucas

Abstract—Artificial Life can be used as an agent-training approach in large state spaces. This paper presents an artificial life method that increases the training speed of speech recognizer agents which were previously trained by genetic algorithms. In this approach, vertical training (genetic mutation and selection) is combined with horizontal training (individual learning through reinforcement learning), resulting in much faster evolution than a plain genetic algorithm. The approach is tested on a standard speech database and compared with the genetic algorithm baseline.

Keywords—Artificial Life, Speech Recognition, Fuzzy Modeling.

I. INTRODUCTION

ARTIFICIAL Life is a field of artificial intelligence that models real life in order to learn more about it. The first A-Life model was introduced in 1966 by von Neumann [8] as the idea of creating cyber-organisms capable of exhibiting living creatures' phenomena such as reproduction, self-healing, and growth. Von Neumann's first A-Life environment was a simple cellular automaton that evolved patterns which could influence their surrounding cells and change them so that they would come to resemble the same pattern. In the following years, A-Life continued in several different forms and with several different aims, such as studying the evolution of complex behaviors [5], [4], [1], the emergence of social behaviors [6], the simulation of natural creatures [7], and the emergence of computer programs [9].

In this paper, an A-Life based training approach is used to combine the advantages of genetic-algorithm training methods and reinforcement learning approaches. To this end, the subject of training is an agent that receives its initial configuration from its parent(s) as its genome but can also learn and optimize itself during its lifetime.
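The combination of vertical training (mutation and selection across generations) and horizontal training (learning within an individual's lifetime) described above can be sketched as a generic hybrid loop. This is an illustrative sketch, not the paper's exact algorithm: the population representation, the `lifetime_learn` step (standing in for the reinforcement-learning phase), and the mutation scheme are all assumptions.

```python
import random

def evolve_population(pop, fitness_fn, lifetime_learn, n_generations,
                      mutation_rate=0.1):
    """Hybrid training loop: each individual first refines its inherited
    parameters during its lifetime (horizontal training), then the
    population is selected and mutated (vertical training)."""
    for _ in range(n_generations):
        # Horizontal training: individual learning during the lifetime.
        pop = [lifetime_learn(ind) for ind in pop]
        # Evaluate and keep the better half.
        scored = sorted(pop, key=fitness_fn, reverse=True)
        survivors = scored[:len(pop) // 2]
        # Vertical training: offspring inherit a mutated copy of a survivor.
        children = [mutate(ind, mutation_rate) for ind in survivors]
        pop = survivors + children
    return max(pop, key=fitness_fn)

def mutate(ind, rate):
    """Gaussian perturbation of a real-valued genome (an assumed encoding)."""
    return [g + random.gauss(0, 1) if random.random() < rate else g
            for g in ind]
```

Because the lifetime-learning step improves every individual before selection, good genomes are evaluated closer to their learned optimum, which is the mechanism the paper credits for the speed-up over plain genetic search.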
We have previously used this method to train animal-like agents in the Zamin artificial life model [2], [11], where it resulted in the evolution of simple but sufficient living behaviors for artificial creatures. In this paper, the combined training approach is tested on a speech recognition algorithm that uses fuzzy representations to recognize human speech phonemes [4]. The major weakness of that speech recognition work was the very low speed with which the genetic algorithm optimized the phoneme definitions when the training set was large; it is shown that the presented approach overcomes that weakness.

The paper is organized as follows: Section II presents the speech recognizers' structure and methods. Section III presents the training approach, and Section IV presents the experimental results and a comparison with the genetic algorithm results. Conclusions and future work come last.

Ramin Halavati, Saeed Bagheri Shouraki, and Saman Harati Zadeh are with the Artificial Creatures Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran (e-mail: {halavati, sbagheri, harati}@ce.sharif.edu). Caro Lucas is with the Control Lab, Electrical Engineering Department, Tehran University, Tehran, Iran (e-mail: lucas@ipm.ir).

II. SPEECH RECOGNIZER AGENTS

The first step of both the recognition and training processes is the conversion of the spectrogram of the speech signal 1 into a fuzzy description. The fuzzification approach is based on four major ideas. First, a human recognizer does not read the spectrogram with full precision and pays attention only to local features. Second, a human does not decide based on precise speech amplitudes; a rough measure is sufficient. Third, we do not count speech frames but use relative lengths such as long or short. Fourth, we are more sensitive to lower frequencies than to higher ones.
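The third idea, describing durations with relative lengths such as "long" or "short" rather than frame counts, can be sketched with fuzzy membership functions over the segment's length relative to the whole utterance. This is a minimal illustration; the breakpoints (10% and 30%) and the two-label vocabulary are assumptions, not the paper's actual definitions.

```python
def membership_short(n_frames, total_frames):
    """Fuzzy degree to which a segment is 'short', judged relative to
    the utterance length (breakpoints are illustrative assumptions)."""
    r = n_frames / total_frames
    if r <= 0.1:
        return 1.0          # clearly short
    if r >= 0.3:
        return 0.0          # clearly not short
    return (0.3 - r) / 0.2  # linear transition between the breakpoints

def membership_long(n_frames, total_frames):
    """Complement of 'short' in this two-label sketch."""
    return 1.0 - membership_short(n_frames, total_frames)
```

A segment occupying 20% of the utterance would thus be "short" and "long" to equal degrees, which is exactly the graded judgment a crisp frame count cannot express.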
Based on these ideas, the frequency axis is divided into 25 ranges according to MEL filter banks. The MEL filter bank frequencies are chosen so that the ranges are narrower at low frequencies and wider at high ones (Stevens and Volkmann, 1940). Figure 1 shows the spectrogram separated with horizontal lines based on the MEL bands and vertical lines based on phoneme positions. For local data reduction, each sample column in each MEL band is treated as one block and represented by a single value: the average of the 10% highest-amplitude points in that block. Figure 2 shows the result of this stage. Note that the new image has only 25 values along each vertical line (the frequency axis), while the time axis is unchanged.

1 A spectrogram is a 2-D image of a speech signal in which the vertical axis represents the frequencies present and the horizontal axis represents time; the brightness of each point shows the amplitude of a given frequency at a given time.

World Academy of Science, Engineering and Technology 11, 2005
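The block reduction described above can be sketched as follows: each (MEL band x frame) block is collapsed to the mean of its top 10% amplitudes. This is a sketch under stated assumptions: band edges are given as frequency-bin indices, and `reduce_spectrogram` is an illustrative name, not the paper's code.

```python
import numpy as np

def reduce_spectrogram(spec, band_edges, top_frac=0.10):
    """Collapse a spectrogram (freq_bins x frames) to one value per MEL
    band per frame: the mean of the top `top_frac` amplitudes in each
    band/frame block. `band_edges` lists bin indices (an assumption)."""
    n_bands = len(band_edges) - 1
    out = np.empty((n_bands, spec.shape[1]))
    for b in range(n_bands):
        block = spec[band_edges[b]:band_edges[b + 1], :]  # rows of band b
        # at least one point per block, even for very narrow bands
        k = max(1, int(np.ceil(top_frac * block.shape[0])))
        # mean of the k largest amplitudes in each column (frame)
        top = np.sort(block, axis=0)[-k:, :]
        out[b] = top.mean(axis=0)
    return out
```

With 25 MEL band edges covering the frequency bins, the output has exactly 25 rows while keeping the original number of frames, matching the reduction shown in Figure 2.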