Automatically Identifying Citations in Hebrew-Aramaic Documents

YAAKOV HACOHEN-KERNER (1), NADAV SCHWEITZER (2), and DROR MUGHAZ (2)

(1) Department of Computer Science, Jerusalem College of Technology, Jerusalem, Israel
(2) Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel

Cybernetics and Systems: An International Journal, 42:180–197
Copyright © 2011 Taylor & Francis Group, LLC
ISSN: 0196-9722 print/1087-6553 online
DOI: 10.1080/01969722.2011.567893

Address correspondence to Yaakov HaCohen-Kerner, Department of Computer Science, Jerusalem College of Technology - Machon Lev, 21 Havaad Haleumi St., POB 16031, 91160 Jerusalem, Israel. E-mail: kerner@jct.ac.il

Citations in documents contain important information about the sources that authors cite and about the importance and impact of those sources. Therefore, automatic identification of citations in documents is an important task. For various reasons, citations in rabbinic literature are more difficult to identify and extract than citations in scientific papers written in English. The aim of this novel research is to automatically identify undated citations included in a unique data set: rabbinic documents written in Hebrew-Aramaic. We formulate four feature sets: orthographic, quantitative, stopword-based, and n-gram-based. Experiments were performed on all combinations of these feature sets using six common machine learning methods and Infogain. A combination of all four feature sets using logistic regression achieves an accuracy of 91.98%, which is an improvement of 16.53% compared to a baseline result.

KEYWORDS citation identification, Hebrew-Aramaic documents, knowledge discovery, machine learning methods, undated documents

INTRODUCTION

A citation is a mention of a work in the body of a text, whereas a reference provides full bibliographic information about a cited work and appears in a list of works at the end of a document. Citations are a defining feature of many kinds of documents, for example, academic, legal, and religious documents.