Normalizing Digital News-Stories for Preservation

Muzammil Khan*, Arif Ur Rahman, M. Daud Awan* and Syed Mehtab Alam
* Department of Computer Science, Preston University, Islamabad, Pakistan
Email: muzammilkhan86@gmail.com, drdaudawan@preston.edu.pk
Department of Computer Science, Bahria University, Islamabad, Pakistan
badwanpk@bui.edu.pk, mehtabalamshah@gmail.com

Abstract—Preserving news stories is important for several reasons: they provide detailed information about events, and they may be used for research purposes in the long term. However, news stories published online are in danger because of constant change in the technologies used to publish information and in the formats used for publication. Certain institutions or individuals may be interested in preserving news stories related to a particular event or topic. Such stories should be collected from various online newspapers and preserved for the long term. The major issue in the preservation process is that newspapers use different formats for the online publication of stories. This paper presents a tool developed to address this issue. The tool helps users extract news stories from various online newspapers and migrate them to a normalized format.

Keywords: News archiving, News preservation

I. INTRODUCTION

Newspapers cover stories about various types of events, such as acts of parliament, events of political importance for countries, court proceedings related to important cases, births, deaths and sports. In the past few years, newspapers have changed the way news is published, moving from printed versions to digital versions. News generation in the digital environment is no longer a periodic or linear process with a fixed single output like the printed newspaper. News is generated instantly and updated continuously.
However, because of the short lifespan of digital information and the speed at which information is generated, it has become vital to preserve digital news for the long term. Digital preservation includes various actions to ensure that digital information remains accessible and usable for as long as it is considered important [1]. Several approaches have been developed in the past to preserve digital information, such as the model migration approach for database preservation and the preservation of research data [2], [3].

The lifespan of news stories published online varies from one newspaper to another, from one day to a month. Though a newspaper may be backed up and archived by the news publisher or by national archives, in the future it will be difficult to access particular information published in various newspapers about the same story. The issue becomes even more complicated if a story is to be tracked through an archive of many newspapers which require different access technologies. In addition, it may be difficult to extract explicitly available metadata associated with a news story, e.g. author name and date of publication, and even more difficult to extract metadata which is not explicitly available, e.g. the list of named entities in an article [4].

The focus of this paper is on the extraction and normalization of news stories from various newspapers published online. Any Web resource can be captured using one of three techniques, namely by browser, by crawler, or by developing a customized tool [5]. The choice of technique may also depend on the type of resources to be captured and on the extraction frequency. News stories related to the same topic may be extracted from various online news publishing websites using specially developed software systems. However, issues including differences in format, metadata and the technologies used to develop websites make the identification of similar stories and their preservation a challenging task. In this paper a tool, i.e.
Digital News Story Extractor (DNSE), is proposed for digital news story preservation (DNSP).

The DNSE is a tool developed to facilitate the extraction of news stories from online newspapers and their migration to a normalized format. Normalization also includes a step to add metadata, which is stored for future use in the Digital News Stories Archive (DNSA). The DNSA contains three directories, namely Header, DNSArchive and URLsArchive, containing normalized news stories and related files (if any). The tool is developed in Java using jsoup, URL and other related Java packages. The preservation format is developed using XML.

II. RELATED WORK

Libraries and archives preserve newspapers by carefully digitizing collections, as newspapers are a good source for learning about history. The National Endowment for the Humanities (NEH) funded the United States Newspaper Program (USNP) until 2011. The Library of Congress also provided technical support to the USNP. The goal of the initiative was to create an archive and make it publicly available. The initiative focused on the preservation of historical newspapers published in the United States. The preservation strategy consists of several steps, including scanning a newspaper in TIFF or JPEG2000 and recording metadata in accordance with Dublin Core and METS-ALTO [4], [6]. The British Library and Findmypast¹ maintain the British Newspaper Archive, which contains more than 40 million scanned historical newspapers.

¹ http://www.findmypast.com/ (Accessed on August 30, 2016)
978-1-5090-2641-8/16/$31.00 ©2016 IEEE

The archive is