C L BRINDHA DEVI AND P NAVANEETHAN: NETWORK PERFORMANCE FOR MULTI LINGUAL DATA TRANSMISSION
DOI: 10.21917/ijct.2012.0068

NETWORK PERFORMANCE FOR MULTI LINGUAL DATA TRANSMISSION

C.L. Brindha Devi 1 and P. Navaneethan 2
1 Department of Computer Science, Arignar Anna Government Arts College for Women, India
E-mail: clbrindhadevi@gmail.com
2 Department of Electrical and Electronics Engineering, PSG College of Technology, India
E-mail: pnn@eee.psgtech.ac.in

Abstract
This paper compares character encoding schemes used to encode the characters of different languages. A new character encoding protocol called PANDITHAM has been developed for this purpose. English and Tamil are taken as a case study, and network performance is compared for the PANDITHAM, Unicode and UTF-8 encodings. The study shows that PANDITHAM is optimal for all languages, as it reduces network congestion.

Keywords: Multilingual, PANDITHAM, Unicode, UTF-8

1. INTRODUCTION

Internet traffic is the flow of data around the Internet. It includes web traffic, which is the amount of data related to the World Wide Web, along with the traffic from other major uses of the Internet, such as electronic mail and peer-to-peer networks. Some companies offer advertising schemes that, in turn, contribute to an increase in web traffic.

The World Wide Web has become a major channel for information services. There are web pages in almost every popular language, including various European, Asian, and Middle Eastern languages [2]. While approximately 70% of web content is in English, native English speakers constitute only 36.5% of the world's online population [5]. The rapidly accelerating globalization of businesses and the success of e-Governance solutions require data to be stored and manipulated in many different natural languages.

2. CHARACTER ENCODING

In computers, and in data transmission between them, data is represented internally as octets, commonly called bytes. A character is thought of as the smallest component of written language that has a semantic value. The set of all the characters in a language is called a Repertoire [3]. Each character in the repertoire is assigned a unique numerical code called its Code Position. A character encoding defines how sequences of these numeric codes are represented as sequences of octets.

For many years, Americans have transmitted data using the ASCII character set, but ASCII cannot handle the characters of other languages. Different countries have adopted different techniques for exchanging text in their own languages, which makes it difficult to exchange data in an interconnected world. There are many character encoding systems, such as ASCII, Unicode, EBCDIC and ISO-8859 [4]. This paper compares the encoding of multilingual characters using Unicode, UTF-8 and PANDITHAM (A Protocol for ApplicatioNs Development In THAmizh and Multilingual Computing), taking Tamil and English as a case study.

2.1 UNICODE CHARACTER ENCODING

Unicode [8] is a universal character encoding scheme, designed to cover all world languages. In its 16-bit form it offers 65,536 slots to assign to the various languages. Each language (except a few, such as Chinese) is given a 128-slot block. All Indic languages are allocated 128 slots each. The assignment of characters to specific slots within a block is based on the ISCII (Indian Script Code for Information Interchange) [9] scheme, which uses Devanagari as the basic reference language. Refer to Table.1 for Tamil characters in Unicode. Thus the vowels, consonants and modifiers of each Indic language appear at the same relative slot within their respective blocks. For example, "Ka" of Tamil and "Ka" of Telugu are separated by exactly 128 slots, which greatly facilitates programming.
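The block alignment described above can be illustrated with a short Python sketch. It is only an illustration, not part of the encoding comparison in this paper: the helper name transpose and the constant names are hypothetical, and the block start values 0x0B80 (Tamil) and 0x0C00 (Telugu) are taken from the Unicode code charts.

```python
import unicodedata

BLOCK_SIZE = 128        # slots per Indic script block, as described in the text
TAMIL_BASE = 0x0B80     # start of the Tamil block (Unicode code charts)
TELUGU_BASE = 0x0C00    # start of the Telugu block (Unicode code charts)


def transpose(ch, src_base, dst_base):
    """Hypothetical helper: map a character to the same relative slot
    in another 128-slot script block."""
    offset = ord(ch) - src_base
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError("character is outside the source block")
    return chr(dst_base + offset)


tamil_ka = "\u0B95"
telugu_ka = transpose(tamil_ka, TAMIL_BASE, TELUGU_BASE)

print(unicodedata.name(tamil_ka))          # TAMIL LETTER KA
print(unicodedata.name(telugu_ka))         # TELUGU LETTER KA
print(ord(telugu_ka) - ord(tamil_ka))      # 128

# In the 16-bit encoding form, a single code position occupies 2 bytes.
print(len(tamil_ka.encode("utf-16-be")))   # 2
```

The same offset arithmetic does not guarantee a valid character for every slot, because some scripts leave slots unassigned (Tamil has fewer consonants than Devanagari), so a real transliteration routine would need to check that the target slot is assigned.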
The Tamil character set can be divided into frequently used and infrequently used characters. The language contains a total of 313 (247 + 66) characters. The frequently used set is divided into vowels, consonants and combined characters. Tamil has 12 vowels (Uyir Eluthukkal) and 18 consonants (Mei Eluthukkal). A vowel follows a consonant and combines with it to form a composite consonant; in this way, the 12 vowels and 18 consonants form 216 (12 × 18) composite consonants.

The Unicode coding for Tamil does not follow the Tamil alphabetical (Akkara Varisai) order. Because Unicode is based on ISCII, it needs 2 bytes to encode Uyir Eluthukkal [ ] and Akkara Mei Eluthukkal [ ... ], and 4 bytes to encode Mei Eluthukkal [ ... ] and Uyir Mei Eluthukkal [ ], since each of the latter is stored as a consonant code followed by a modifier code.

The English language has 26 characters, all of which fit within the given slot, so Unicode uses 2 bytes to encode each English character.

Consider, for example, the name "மாதேஸ்வரன் R." (the Tamil spelling is reconstructed here from the code sequence given below). Encoded in Unicode, it needs 26 bytes:

Name:                 மா   தே   ஸ்   வ   ர   ன்   (space)   R   .
No. of bytes needed:  4    4    4    2   2   4    2         2   2    Total: 26 bytes

The corresponding Unicode sequence is:

0BAE, 0BBE : 0BA4, 0BC7 : 0BB8, 0BCD : 0BB5 : 0BB0 : 0BA9, 0BCD : 0020 : 0052 : 002E

2.2 UTF-8 CHARACTER ENCODING

UTF stands for Unicode Transformation Format. The '8' means that it uses 8-bit units to represent a character. The number of bytes needed to represent a character varies from 1 to 6 in the original design (the current definition, RFC 3629, limits this to 1 to 4). Most software is not designed to handle 16-bit or 32-bit