Very long-term digital preservation and archival strategies for Tamil documents

Mani M. Manivannan
Senior Director of Engineering
Symantec Corporation
Chennai, India
mmanivannan@gmail.com

ABSTRACT

The decipherment of the Indus script remains controversial. For several years, the inscriptions in Tamil Brahmi script were not recognized as Tamil. The inscriptions in Vaṭṭeḻuttu, Pallava Grantha, and evolving Tamil Brahmi scripts required careful analysis to decipher. Probable errors in the inscriptions and in their reading have led to multiple interpretations. As writing instruments and technology changed, Tamils have lost valuable literature and public documents. In the short time that digital Tamil has existed, we are already finding it difficult to retrieve early Tamil writings on the internet. As we embark on e-Governance and the digital archival of precious documents, we have to remember that unless we are careful, we may find it difficult to preserve these documents over the long span of centuries. In this paper, we explore some strategies to minimize the loss of valuable documents.

1. Background and Overview

Tamils have been creating digital documents in Tamil since the mid-1980s [1] using MS-DOS based editors. Since the early 1990s, Tamils have been privately exchanging e-mails in Tamil, and with the advent of the tamil.net mailing list in 1996, Tamils worldwide started to communicate with each other in Tamil script. Since then, multiple Tamil mailing lists, web pages, webzines, and blogs have added Tamil digital content. Project Madurai, a community project, has been digitizing classical Tamil literature and Tamil books in the public domain since January 1998, creating a major Tamil corpus with its digital documents.

Though the Tamils of the diaspora were instrumental in creating much of the early Tamil digital content on the web, the mainland Tamil Nadu media have since taken to the web, and nearly all the Tamil news media now have Tamil content online. Tamil Nadu government departments have been creating Tamil digital documents in various encodings, including TAB, TAM, and Baamini. With a nationwide push for e-Governance in India, the Tamil Nadu government is about to embark on one of the largest Tamil digitization efforts.

Before the standardization efforts that followed the TamilNet’97 conference in Singapore, there were multiple fonts, each with its own proprietary encoding. The discussions after the conference led to the TSCII encoding, the first open, non-proprietary Tamil encoding specification, in 1998. At the TamilNet’99 conference in Chennai, the Government of Tamil Nadu formally declared two encoding standards, TAB and TAM. During this period, neither ISCII, the Indian national standard, nor Unicode, the global standard based on ISCII, had much support among Tamil developers.

However, the TSCII, TAB, and TAM standards did not stop others from creating fonts based on proprietary encodings. Baamini was popular among the Eezham Tamil diaspora before TSCII, and it is still used by some in the Government of Tamil Nadu. Other encodings that were used to create Tamil digital documents include Vanavil, Indoweb, Murasoli, Webulagam, Thinathanthi, Dinamani, Thinaboomi, Murasu Anjal, Mylai, Thatstamil, ShreeLipi, Amudham (Dinakaran), Vikatan, Anu Graphics (Pallavar), and Senthamizh (Nakkeeran). These encodings are popular enough that some converters recognize all of them.

With strong support for the Unicode encoding on the Microsoft Windows and Apple platforms as well as in Google’s search and e-mail applications, the use of Unicode has started to spread. Tamil Wikipedia is a popular site that uses Unicode, as does the Tamil Lexicon at the University of Chicago.

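To illustrate why a collection in mixed encodings is hard to archive, the following minimal Python sketch estimates how much of a text lies in the Unicode Tamil block (U+0B80 to U+0BFF). It is only an illustration under stated assumptions: the function names and the 0.5 threshold are hypothetical, and the heuristic assumes that text stored in a legacy font encoding surfaces as code points outside the Tamil block. It does not identify which legacy encoding was used, nor does it convert anything.

```python
# Hypothetical helpers for illustration; not taken from any converter
# mentioned in this paper.

# Code points of the Unicode Tamil block, U+0B80 through U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def tamil_unicode_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for ch in chars if ord(ch) in TAMIL_BLOCK)
    return in_block / len(chars)


def looks_like_unicode_tamil(text: str, threshold: float = 0.5) -> bool:
    """Assumed heuristic: call the text Unicode Tamil at or above threshold."""
    return tamil_unicode_ratio(text) >= threshold
```

A document that scores near zero but is known to be Tamil is a candidate for legacy-encoding conversion before it enters an archive; a real workflow would still need per-encoding mapping tables and human verification.
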
With the National e-Governance Plan of the Government of India notifying Unicode as the standard for e-Governance applications, the creation of Tamil digital documents in Unicode is expected to increase substantially. INFITT has recognized Unicode as the main standard for Tamil computing. INFITT has also acknowledged that some commercial applications, such as those used in publishing and other industries, do not yet support Unicode. For these applications, INFITT has recognized the Tamil All Character Encoding, TACE-16, as the only Alternate Standard for Tamil Computing. High-end publishing is expected to produce PDF documents, such as Government notifications and textbooks, as well as books published by vendors in Tamil, using the TACE-16 encoding; these will also form part of the Tamil digital document collection.

In this paper, we review the impact of technological obsolescence on the Tamil digital documents created in the past 25 years and compare it with the impact on English-language digital documents. We also study the preservation and archival strategies that are evolving in the rest of the world to address threats to the stability of digital documents. Drawing inspiration from the seminal paper by Paul Conway [3], we hope that this paper will stimulate in-depth discussions among technical specialists and the broader audiences interested in preserving not only the historical and cultural heritage of Tamils but also the more mundane public records that the government and citizens depend on across generations.

2. What is the problem? Digital Dilemma!

All current digital storage media are ephemeral. As Conway’s graph demonstrates, the millennia-old Indus valley signs and the ancient Tamil inscriptions, while fragile, are still legible [2]. Even the palm-leaf manuscripts have managed to survive long enough to transfer information to future generations. Books printed