Partition Aware Duplicate Records Detection (PADRD) Methodology in Big Data Decision Support Systems
Anusuya Kirubakaran¹(✉) and Aramudhan Murugaiyan²

¹ Mother Teresa Women's University, Kodaikanal, India
Anusuya.kirubakaran@outlook.com
² PKIET, Karaikal, Puducherry, India
Aranagai@yahoo.co.in
Abstract. Today, big data analytics and business intelligence (BI) decision support systems (DSS) are vital pillars of leadership ability, translating raw data into intelligence to make the 'right decision at the right time' and share the 'right decision with the right people'. DSS are often challenged to process massive volumes of data (terabytes, petabytes, exabytes, zettabytes, etc.) and to overcome issues such as data quality, scalability, storage, and query performance. Failure in DSS was one of the causes clearly highlighted by the United States Senate report on the 2008 American economic collapse. With these issues in mind, this work explores a preventive methodology for the "Data Quality - Duplicates" dimension with optimized query performance in the big data era. In detail, a BI team periodically (daily, weekly, monthly, quarterly, or half-yearly) extracts historical operational structured data (data feeds) from multiple sources and loads it into its repository for analytics and reporting. During this load, unpremeditated duplicate data feed insertion occurs due to lack of expertise, lack of history, or missing integrity constraints, which raises the error ratio of intelligence reporting and impairs leadership ability. Hence the need arises to prevent unintentional injection of data quality issues. Overall, this paper proposes a methodology to "Improve Data Accuracy" through detection of duplicate records between the big data repository and the data feed before the data load, with "Optimized Query Performance" through partition-aware search query generation and "Faster Data Block Address Search" through braided B+ tree indexing.
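The partition-aware pre-load check summarized above can be sketched as follows. This is a minimal illustration only: the record layout, the dictionary-based partition map, and all function names are assumptions for exposition, not the paper's actual implementation.

```python
# Sketch: before loading a feed, check each incoming record only against
# the repository partition it maps to (e.g., its feed date), instead of
# scanning the entire repository. Names and data layout are hypothetical.

def detect_duplicates(feed_records, repository, partition_key, record_key):
    """Return feed records whose key already exists in the repository.

    `repository` maps a partition value (e.g., a date) to the set of
    record keys already loaded into that partition, so only the touched
    partition is searched for each record.
    """
    duplicates = []
    for rec in feed_records:
        # Partition-aware step: look up just this record's partition.
        existing = repository.get(rec[partition_key], set())
        if rec[record_key] in existing:
            duplicates.append(rec)
    return duplicates

# Example: a daily feed checked against a date-partitioned repository.
repo = {"2017-01-01": {"A1", "A2"}, "2017-01-02": {"B1"}}
feed = [
    {"date": "2017-01-02", "id": "B1"},  # already loaded -> duplicate
    {"date": "2017-01-02", "id": "B2"},  # new record
]
dups = detect_duplicates(feed, repo, "date", "id")
```

Rejecting (or flagging) `dups` before the load prevents the duplicate insertion at the source, rather than cleansing it after the fact.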
Keywords: Duplicate record detection · Braided tree indexing · Decision support system
1 Introduction
Today, most firms, governments, and regulators across geographies have understood and learned from the 2008 financial crisis that data quality must be in place for effective and healthy macro-economic behavior. Hence, regulators are insisting that financial sectors manage the quality of data used for identifying risks across portfolios, taking proactive measures toward business stabilization and challenges,
© Springer Nature Singapore Pte Ltd. 2018
Shriram R and M. Sharma (Eds.): DaSAA 2017, CCIS 804, pp. 86–98, 2018.
https://doi.org/10.1007/978-981-10-8603-8_8