Digitalizing Scheme of Handwritten Hanja Historical Documents

Min-Soo Kim, Man-Dae Jang, Hyun-Il Choi, Taik-Heon Rhee and Jin-Hyung Kim
Korea Advanced Institute of Science and Technology
Artificial Intelligence and Pattern Recognition Laboratory
373-1, Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea
{mskim, mdjang, hichoi, three, jkim}@ai.kaist.ac.kr

Hee-Kue Kwag
AI R&D center, Dongbang SnC Co.,Ltd.
10th Floor, BaekSang Bldg, Gwanhun-dong, Jongno-gu, Seoul, Korea
hkkwag@dbsnc.co.kr

Abstract

In Korea, the historical archives of the Lee dynasty have been under digitization for years. Although metal type printing had been used in Korea as early as the 13th century, most of these documents were handwritten by the King’s chroniclers and secretaries. In addition, ancient characters that are not used in contemporary texts make up a considerable proportion of the text. As a consequence, it is extremely difficult to use conventional OCR systems, and most of the process has been performed manually. However, this manual processing has been unsatisfactory in terms of cost and efficiency. As an alternative, we built a system composed of a dedicated handwritten Hanja recognizer and an easy-to-use verifier. Preliminary experiments show that the proposed system can help enhance the overall efficiency of the process and reduce its cost.

1. Introduction

Historical documents are invaluable. From various kinds of documents, such as records of historical events, descriptions of ancient culture, a king’s diary and classical literature, we can learn a great deal about the past. As time goes on, the value placed on historical documents is expected to grow.

Korean national agencies have also been preserving historical documents. However, merely keeping the documents in physical form has limitations in maintenance and accessibility.
As for maintenance, since the documents were written on paper, the government spends a large budget to keep them safe. Maintenance in turn limits access to the documents. Although many people want to read these valuable documents, access is restricted to authorized persons for security reasons. Constructing a digital library for historical documents may reduce maintenance costs and expand accessibility. Once the documents are archived digitally, data management and information retrieval can be performed efficiently, and the limitations in maintenance and accessibility described above no longer apply. As digitalization is expected to broaden the use of historical documents, the Korean government has been digitalizing historical documents for several years.

The digitalization of documents related to science and technology, education, culture and history is active in Korea. For instance, the National Computerization Agency started digitalizing some historical documents in 2000. The project will continue until 2005. However, the project covers only 5% of historical documents, i.e., 95% of the documents remain to be digitalized (Table 1). The reason why the digitalization of historical documents takes so much time is explained in the following section.

Kind of Literature    Total (unit: volume)
Documents             2,365,561
Books                 2,034,871
Others                5,868,689

Table 1. Statistics on the number of classical literatures

Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL’04) 0-7695-2088-X/04 $20.00 © 2004 IEEE

From 108 B.C., Koreans began using Chinese symbols and characters in writing as well as learning about Chi-