Normalizing Digital News-Stories for Preservation
Muzammil Khan*, Arif Ur Rahman†, M. Daud Awan* and Syed Mehtab Alam†
*Department of Computer Science, Preston University, Islamabad, Pakistan
Email: muzammilkhan86@gmail.com, drdaudawan@preston.edu.pk
†Department of Computer Science, Bahria University, Islamabad, Pakistan
badwanpk@bui.edu.pk, mehtabalamshah@gmail.com
Abstract—Preserving news stories is important for several
reasons: they provide detailed information about events, and
they may be used for research purposes in the long term.
However, news stories published online are at risk because the
technologies and formats used to publish information change
constantly.
Certain institutions or individuals may be interested in pre-
serving news stories related to a particular event or topic. The
stories should be collected from various online newspapers and
preserved for the long term. The major issue in the preservation
process is that newspapers use different formats for online
publication of the stories. This paper presents a tool
developed to address this issue. The tool helps users extract
news stories from various online newspapers and migrate them
to a normalized format.
Keywords: News archiving, News preservation
I. INTRODUCTION
Newspapers cover stories about various types of events
like acts of parliaments, events of political importance for
countries, proceedings of courts related to important cases,
births, deaths and sports. In the past few years, newspapers
have changed the way news is published, from printed versions
to digital versions. News generation in the digital
environment is no longer a periodic or linear process with a
fixed single output like the printed newspaper. News is
generated instantly and updated continuously. Because of the
short lifespan of digital information and the speed at which
it is generated, it has become vital to preserve digital news
for the long term.
Digital preservation includes various actions to ensure that
digital information remains accessible and usable as long as
it is considered important [1]. Many approaches have been
developed in the past to preserve digital information, such as
the model migration approach for database preservation and
approaches for the preservation of research data [2], [3].
The lifespan of news stories published online varies from one
newspaper to another, e.g. from one day to a month. Though
a newspaper may be backed up and archived by the news
publisher or national archives, in the future it will be difficult
to access particular information published in various
newspapers about the same story. The issue becomes even more
complicated if a story is to be tracked through an archive of
many newspapers which require different access technologies.
In addition, it may be difficult to extract explicitly
available metadata, e.g. author name and date of publication,
and even more difficult to extract metadata which is not
explicitly available with a news story, e.g. the list of named
entities in an article [4]. The focus of this paper is on the
extraction and normalization of news stories from various
newspapers published online.
Web resources can be captured using three different
techniques, i.e. by browser, by crawler, or by developing a
customized tool [5]. The choice of technique may also depend
on the type of resources to be captured and the extraction
frequency.
News stories related to the same topic may be extracted
from various online news publishing websites using specially
developed software systems. However, some issues, including
differences in the formats, metadata and technologies used to
develop websites, make the identification of similar stories
and their preservation a challenging task. In this paper a
tool, the Digital News Story Extractor (DNSE), is proposed for
digital news story preservation (DNSP).
The Digital News Story Extractor (DNSE) is a tool developed
to facilitate the extraction of news stories from online
newspapers and their migration to a normalized format.
The normalization process also includes a step to add metadata,
which is stored for future use in the Digital News Stories
Archive (DNSA). The DNSA contains three directories,
namely Header, DNSArchive and URLsArchive, containing
normalized news stories and related files (if any). The
tool is developed in Java using JSOUP, URL and other related
Java packages. The preservation format is developed using
XML.
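As a rough illustration of what migrating a story to a normalized
XML record with added metadata might look like, the following is a
minimal sketch using only the standard Java XML APIs. The class name
NormalizedStorySketch, the method toNormalizedXml, and all element
names (newsStory, metadata, source, url, author, published, headline,
body) are assumptions made for this example; they are not the DNSE
schema, which is not specified in this section.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: element names below are illustrative assumptions,
// not the normalized format actually used by DNSE.
class NormalizedStorySketch {

    /** Builds one normalized XML record from already-extracted story fields. */
    static String toNormalizedXml(String source, String url, String author,
                                  String published, String headline, String body) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            Element story = doc.createElement("newsStory");
            doc.appendChild(story);

            // Metadata added during normalization, kept apart from the content.
            Element metadata = doc.createElement("metadata");
            story.appendChild(metadata);
            addChild(doc, metadata, "source", source);
            addChild(doc, metadata, "url", url);
            addChild(doc, metadata, "author", author);
            addChild(doc, metadata, "published", published);

            // Story content, independent of the publisher's page layout.
            addChild(doc, story, "headline", headline);
            addChild(doc, story, "body", body);

            // Serialize the DOM tree to an indented UTF-8 XML string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build normalized record", e);
        }
    }

    /** Appends a child element containing only text; the text is escaped on output. */
    private static void addChild(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from any real newspaper.
        System.out.println(toNormalizedXml("Example Daily", "http://example.com/story",
                "A. Writer", "2016-08-30", "Floods & relief", "Body text."));
    }
}
```

Because every story ends up with the same element structure whatever
the source newspaper, stories about the same event from different
publishers could later be compared or searched with a single access
method.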
II. RELATED WORK
Libraries and archives preserve newspapers by carefully
digitizing collections, as newspapers are a good source for
learning about history. The National Endowment for the Humanities
(NEH) funded the United States Newspaper Program (USNP)
until 2011. The Library of Congress also provided technical
support to the USNP. The goal of the initiative was to create an
archive and make it publicly available. The initiative focused
on the preservation of historical newspapers published in the
United States. The preservation strategy consists of several
steps including scanning a newspaper in TIFF or JPEG2000
and recording metadata in accordance with Dublin Core and
METS-ALTO [4], [6]. The British Library and Findmypast
(http://www.findmypast.com/, accessed on August 30, 2016)
maintain the British Newspaper Archive, which contains more
than 40 million scanned historical newspapers. The archive is
978-1-5090-2641-8/16/$31.00 ©2016 IEEE