Perceptual Experiment on Number Production for Speaker Identification*

Byunggon Yang
Dongeui University

ABSTRACT

Acoustic parameters of the nine Korean numbers were analyzed with Praat, a speech analysis program, and the numbers were synthesized with SenSynPPC, a Klatt formant synthesizer. The overall intensity, pitch and formant values of the numbers were modified dynamically in steps of 1 dB, 1 Hz and 2.5%, respectively. The study explored the sensitivity of listeners to changes in these three acoustic parameters. Twelve male and female subjects listened to 390 pairs of synthesized numbers and judged whether each pair sounded the same or different. Results showed that subjects perceived the same sound quality within a range of 6.6 dB of intensity variation, 10.5 Hz of pitch variation and 5.9% of variation in the first three formants. The male and female groups showed almost the same perceptual ranges. In addition, the high and low boundaries of the ranges were asymmetrical. The ranges may be applicable to the development of a speaker identification system, while the method of synthesis modification may be used to produce its evaluation data.

Key words: perception, synthesis, Korean numbers, speaker identification

1. Introduction

People produce numbers frequently in everyday life. Each person has a series of unique numbers for identification. It is therefore natural and easy for an individual to produce numbers to identify himself or herself in a web-based business transaction. Because most personal computers are equipped with sound input and output systems, it would be desirable to make the most of these systems to record each individual's voice and analyze it so that future voice inputs can be matched for speaker identification. According to Hirahara and Kato (1992), the absolute formant frequencies provide cues to speaker identification, whereas the relative differences among the formants can be employed to identify vowels.
However, since the formant values only partially represent the speech output in the source-filter model (Fant, 1960), amplitude and pitch information should also be included to correctly identify the speaker among tens of thousands of registered customers. In particular, productions of a number by the same speaker are not always acoustically the same, so it may be difficult to find what identifies the individual among several possible candidates. Even though human perception, which differs from person to person, may not be more accurate than machine comparison, it may be meaningful to investigate first how human beings process differences in sound. In other words, the combinations of acoustic parameters that a machine must compare become almost unlimited if the acoustic data are used with all their decimal places. One way to solve the problem may be to collect a large speech database and extract unique statistical patterns of individual differences. Another way may be to run a perceptual experiment to find the range within which the sound quality is judged to be the same. The author believes that a successful machine identification system will need to be only a little more sensitive than the range of human discrimination. The perceptual results may provide some insight into where computers should focus during the identification procedure. In addition, the synthesis method may be applicable to training the machine and to evaluating whether it correctly identifies synthesized pairs that differ by a gradual increment or decrement of the parameters. This is useful because it is not always possible to obtain enough data to train computers on all possible sets of human speech.

Previous studies on vowel perception (Yang, 1995:142, Table 5) revealed that there were certain formant ranges within which the vowel quality remained the same. The first formant could vary by almost 200 Hz unnoticed by the listener. The F2 range was around 400 Hz and that of F3 around 800 Hz.
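As a rough illustration of how such ranges could be applied, the sketch below checks whether two sets of formant values fall within the approximate same-quality ranges quoted above. It is only a sketch under stated assumptions: the function name, the example formant values, and the reading of each range as the largest tolerated distance between two tokens are hypothetical and are not taken from the original study.

```python
# Approximate same-quality ranges in Hz, rounded as quoted from Yang (1995).
SAME_QUALITY_RANGE_HZ = {"F1": 200.0, "F2": 400.0, "F3": 800.0}


def within_same_quality(ref, test, ranges=SAME_QUALITY_RANGE_HZ):
    """Return one flag per formant: True if |test - ref| <= range width.

    Assumption (not from the source): each quoted range is treated as the
    largest distance between two tokens that still sounds the same.
    """
    return {
        name: abs(test[name] - ref[name]) <= width
        for name, width in ranges.items()
    }


# Hypothetical formant values in Hz, for illustration only.
reference = {"F1": 700.0, "F2": 1200.0, "F3": 2500.0}
shifted = {"F1": 850.0, "F2": 1700.0, "F3": 3000.0}

flags = within_same_quality(reference, shifted)
print(flags)
```

In this example the same 500 Hz shift is flagged as outside the range for F2 but inside it for F3, because the tolerated range is wider for the higher formant.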
In general, the listener showed a wider range of the same sound quality for the higher formants. The range became wider with diphthongs (Yang, 1996). The uniformly modified diphthongs yielded perceptual ranges comparable to those of the monophthong study. The result reflected the psychoacoustic characteristics of critical bands (Zwicker, 1962).

* This work was supported by Korea Research Foundation Grant (KRF-99-041-A00010).

In those experiments,