A Statistical Approach for Similarity Measurement Between Sentences for EBMT

Niladri Chatterjee
Department of Mathematics
Indian Institute of Technology
Hauz Khas, New Delhi 110016
Email: niladri@maths.iitd.ernet.in

Abstract

The success of Example-Based Machine Translation depends heavily on how efficient the retrieval scheme is. The more similar the retrieved sentence is to the input one, the easier it is to adapt the retrieved translation to the current requirement. However, there is no suitable scheme for measuring similarity between sentences. This paper reports preliminary results of a similarity measurement scheme based on a linear model whose coefficients are determined by multiple regression. The data for the analysis have been collected from a survey of a number of respondents. Three major aspects of similarity, namely pragmatic, syntactic and semantic, have been considered. Each respondent was asked to evaluate the similarity between different pairs of sentences that were carefully designed to reflect one of the above types of similarity. A statistical analysis of these evaluations reveals general human perception of sentential similarity, which will help in designing a suitable retrieval scheme.

1. Introduction

Example-Based Machine Translation (EBMT) [Nagao, 1984][Brown, 1996] has of late become popular as a means of automatic and/or semi-automatic Machine Translation. EBMT is based on the idea of performing translation by imitating past translation examples. In this type of translation system, a large number of translation examples between two languages (L1 and L2, say, the source and the target language respectively) are stored in a textual database. These examples are subsequently used as guidance for future translation tasks. In EBMT one does not go through the rigour of syntax and semantics of the source and target languages.
Rather, in order to translate a sentence given in L1 into L2, the scheme first retrieves some similar sentence(s) in L1 from its knowledge base. The translation of the retrieved sentence(s) is then modified (or adapted) suitably to derive a translation of the given input sentence. Evidently, the quality of the scheme depends on how good and effective the retrieval is. The closer the retrieved sentence is to the input one, the easier its adaptation to the present translation requirement will be, and consequently the overall translation quality will improve.

However, no scheme has so far been developed to quantify the similarity between two sentences in an objective way. The primary cause may be attributed to the general variation in human expression, which is manifested in different ways of producing sentences that essentially convey the same meaning. Consider, for example, the following sentences:

    She is good looking
    She is good to look at
    She looks good

Not only are these sentences made of the same key words, they convey the same meaning too. On the other hand, the following sentences

    This horse is running good
    This horse is good to run on
    It was a good running by this horse

convey completely different senses, although the key words are again the same.
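
The point made by the two groups of sentences above can be illustrated with a minimal sketch of a purely key-word-based similarity score. This is not the scheme proposed in this paper. It assumes a Jaccard overlap of key-word sets, together with a hypothetical stop-word list and a hand-written stem table chosen only to cover the six example sentences. Under these assumptions, such a score cannot tell the two groups apart, because every pair within either group shares exactly the same key words.

```python
import re

# Hypothetical stop-word list and stem table, hand-written for this
# illustration only; they are assumptions, not part of the paper.
STOP_WORDS = {"is", "to", "at", "on", "it", "was", "a", "by", "this"}
STEMS = {"looking": "look", "looks": "look", "running": "run"}


def key_words(sentence):
    """Return the set of stemmed key words of a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {STEMS.get(t, t) for t in tokens if t not in STOP_WORDS}


def jaccard(a, b):
    """Jaccard overlap between the key-word sets of two sentences."""
    ka, kb = key_words(a), key_words(b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


same_meaning = [
    "She is good looking",
    "She is good to look at",
    "She looks good",
]
different_meaning = [
    "This horse is running good",
    "This horse is good to run on",
    "It was a good running by this horse",
]

# Score every pair of sentences within each group.
for group in (same_meaning, different_meaning):
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            score = jaccard(group[i], group[j])
            print(f"{score:.2f}  {group[i]!r} vs {group[j]!r}")
```

Every pair in both groups receives the same score, even though only the first group preserves meaning. This suggests that key-word overlap alone is not an adequate basis for retrieval.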