Georgian Electronic Scientific Journal: Computer Science and Telecommunications 2010|No.1(24)

A Hybrid Approach to Identify Sentence Boundaries

Vivek Dubey
Shri Shankaracharya College of Engineering and Technology, Bhilai, Chhattisgarh, India
e-mail: vivekdubey22@gmail.com

Abstract
Nowadays, everyone uses the Internet to enrich their knowledge. At the same time, growing Internet access has increased the use of natural language applications. Words and sentences are the basic units of Natural Language Processing. This paper explores the problem of identifying sentence boundaries and presents a model developed for this task using a hybrid approach.

Keywords: Natural Language Processing, Word, Sentence.

1 Introduction
These days electronic text is readily available via the World Wide Web (WWW) in formats such as Microsoft Word, Adobe PDF, PostScript, HTML, SGML and XML, and preformatters are available on the market to convert such documents into plain text. However, such text is noisy [1], [3], [4], [7], [10] and has to be preprocessed for natural language applications. Plain text is a sequence of characters [2], [6], such as capital letters (A-Z), small letters (a-z), digits (0-9) and special symbols (*, +, ?, etc.), in the form of headings, words, numbers and sentences.

The sentence is normally considered an important building block in the majority of Natural Language Processing (NLP) tasks, such as Part-of-Speech (POS) tagging, parsing, document processing, Machine Translation (MT), text alignment, Information Extraction (IE), text summarization, smart editors and text-to-speech conversion.

Before any syntactic analysis of the original plain text, two transformations usually take place: isolating words, called tokenization, and isolating sentences, called sentence splitting. There are two types of tokens: those defined by their character structure, such as punctuation, numbers and dates, and ordinary words such as "went", "they" and "school". The second type undergoes morphological analysis.
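The distinction between the two token types can be illustrated with a minimal sketch. The regular expressions and the names DATE, NUMBER, WORD, PUNCT and tokenize below are assumptions chosen for illustration; the paper does not specify how its tokenizer recognises either type.

```python
import re

# Hypothetical patterns; the paper does not define them.
DATE = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"   # e.g. 12/05/2010
NUMBER = r"\d+(?:\.\d+)?"                 # e.g. 42 or 3.14
WORD = r"[A-Za-z]+"                       # e.g. went, they, school
PUNCT = r"[^\w\s]"                        # any single punctuation character

# Dates are tried before numbers so that "12/05/2010" is kept as one token.
TOKEN_RE = re.compile(
    f"(?P<date>{DATE})|(?P<number>{NUMBER})|(?P<word>{WORD})|(?P<punct>{PUNCT})"
)

def tokenize(text):
    """Return (token, kind) pairs, where kind is 'structural' or 'word'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        # Only 'word' tokens would be passed on to morphological analysis.
        kind = "word" if match.lastgroup == "word" else "structural"
        tokens.append((match.group(), kind))
    return tokens
```

Calling tokenize("They went to school on 12/05/2010.") labels the date and the final period as structural tokens and the remaining five tokens as words.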
A sentence is a sequence of words and numbers terminated by an end-of-sentence mark: a period, an exclamation mark or a question mark. Techniques to identify sentence boundaries using maximum entropy and IBM word alignment Model 1 were proposed in [8], [9], [11]. In this paper, a hybrid approach, both rule based and example based, is used for identifying abbreviated tokens.

The rest of the paper is organized as follows. Section 2 discusses cases related to sentence boundaries. Section 3 gives an overview of the developed system. Section 4 presents the system algorithm. Section 5 describes the experiment and results. Section 6 gives the conclusion and further work.

2 Cases Related to Sentence Boundary
A. There are new-lines or blank spaces before sentences, as below.
   I am a boy. You are a girl.