Research Cell : An International Journal of Engineering Sciences, Issue December 2016
ISSN: 2229-6913 (Print), ISSN: 2320-0332 (Online), Web Presence: http://www.ijoes.vidyapublications.com
© 2016 Vidya Publications. Authors are responsible for any plagiarism issues.
Culling Scientific and Technical Terms from Text Corpora for
Compiling a TermBank in Bangla
Niladri Sekhar Dash
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Email: ns_dash@yahoo.com
ABSTRACT
In this paper I describe the steps we adopted to develop a digital TermBank by culling
Scientific and Technical Terms (STTs) from a text corpus of Bangla. By following these stages
and methods of corpus processing and analysis, we have successfully developed a TermBank
that now contains nearly 10,000 terms for use in various works of linguistics and
language technology. The strategy we use can be applied effectively to corpora of other
Indian languages for the same purposes. This underlines its utility and relevance for NLP work on
Indian languages.
Keywords: Scientific and technical terms, corpus, POS tagging, collocation, lemmatization,
TreeBank, terminology, frequency
1. Introduction
The development of a comprehensive digital database of scientific and technical terms (STTs)
in a language is important for work in linguistics and language technology, such as termbank
compilation, linguistic resource generation, machine translation, machine learning,
information retrieval, knowledge representation, text classification, language planning, online
language education, dictionary compilation, text composition, and mass literacy (Sager
1994). With such activities in mind, we have developed, as a project within our NLP activities,
a comprehensive database of nearly 10,000 STTs extracted from a Bangla corpus of scientific
texts compiled with data collected from the TDIL corpus developed for the language.
For precision, we first define the concepts of scientific term (Section 2) and
technical term (Section 3) to draw a line of distinction between the two. Next, we describe
the methods we use to process the corpus (Section 4) and the architecture we use for TermBank
compilation (Section 5). In conclusion (Section 6), we identify the people who use this
TermBank to address various needs of linguistics and language technology (Wright and
Budin, 1997, p. 370).
2. Scientific Term
The expression scientific term refers to single-word and multiword units that are used in
scientific texts in specialized senses. Although the expression literally refers to
specialized terms used in scientific texts, it is not confined to the fields of science.
Rather, it encompasses the specialized terms used in any discipline of human knowledge.