Decaying Telco Big Data with Data Postdiction

Constantinos Costa*, Andreas Charalampous*, Andreas Konstantinidis*‡, Demetrios Zeinalipour-Yazti* and Mohamed F. Mokbel§
* Department of Computer Science, University of Cyprus, 1678 Nicosia, Cyprus
‡ Department of Computer Science & Engineering, Frederick University, 1036 Nicosia, Cyprus
§ Qatar Computing Research Institute, HBKU, Qatar and University of Minnesota, Minneapolis, MN 55455, USA
{costa.c, achara28, akonstan, dzeina}@cs.ucy.ac.cy; mmokbel@hbku.edu.qa

Abstract—In this paper, we present a novel decaying operator for Telco Big Data (TBD), coined TBD-DP (Data Postdiction). Unlike data prediction, which aims to make a statement about the future value of some tuple, our formulated data postdiction term aims to make a statement about the past value of some tuple, which no longer exists because it had to be deleted to free up disk space. TBD-DP relies on existing Machine Learning (ML) algorithms to abstract TBD into compact models that can be stored and queried when necessary. Our proposed TBD-DP operator has the following two conceptual phases: (i) in an offline phase, it utilizes an LSTM-based hierarchical ML algorithm to learn a tree of models (coined the TBD-DP tree) over time and space; (ii) in an online phase, it uses the TBD-DP tree to recover data within a certain accuracy. In our experimental setup, we measure the efficiency of the proposed operator using a ∼10GB anonymized real telco network trace, and our experimental results in Tensorflow over HDFS are extremely encouraging, as they show that TBD-DP saves an order of magnitude of storage space while maintaining high accuracy on the recovered data.

Index Terms—telco, big data, spatio-temporal analytics, data decaying, data reduction, machine learning.

I. INTRODUCTION

In recent years there has been considerable interest from telecommunication companies (telcos) in extracting concealed value from their network data.
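The two-phase idea can be illustrated with a minimal, self-contained sketch: in the offline phase raw tuples are abstracted into a compact model and then deleted; in the online phase a past value is "postdicted" from the stored model. The paper's operator uses LSTM-based hierarchical models in TensorFlow; here, purely for illustration, a least-squares fit over a sin/cos basis stands in, and all names (`traffic`, `postdict`) are hypothetical.

```python
# Illustrative sketch of data postdiction: replace raw historical
# tuples with a compact model, then recover past values on demand.
# A plain least-squares fit stands in for the paper's LSTM models.
import numpy as np

rng = np.random.default_rng(0)

# Offline phase: one week of per-cell hourly traffic (synthetic stand-in).
hours = np.arange(168)
traffic = 50 + 30 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, 168)

# Abstract the 168 raw tuples into 3 model coefficients.
X = np.column_stack([np.sin(2 * np.pi * hours / 24),
                     np.cos(2 * np.pi * hours / 24),
                     np.ones_like(hours, dtype=float)])
coeffs, *_ = np.linalg.lstsq(X, traffic, rcond=None)
del traffic                              # raw tuples are decayed away

# Online phase: postdict a deleted past value from the stored model.
def postdict(h):
    return float(np.array([np.sin(2 * np.pi * h / 24),
                           np.cos(2 * np.pi * h / 24), 1.0]) @ coeffs)

estimate = postdict(36)                  # recover hour 36 within some error
```

Storing 3 coefficients instead of 168 raw values gives the kind of order-of-magnitude space saving the abstract refers to, at the cost of a bounded reconstruction error.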
Consider, for example, a telco in the city of Shenzhen, China, which serves 10 million users. Such a telco is shown to produce 5TB per day [1] (i.e., thousands to millions of records every second). Huang et al. [2] break their 2.26TB-per-day Telco Big Data (TBD) down as follows: (i) Business Supporting Systems (BSS) data, which is generated by the internal workflows of a telco (e.g., billing, support), accounting for a modest 24GB per day; and (ii) Operation Supporting Systems (OSS) data, which is generated by the Radio and Core equipment of a telco, accounting for 2.2TB per day and occupying over 97% of the total volume. Effectively storing and processing TBD workflows can unlock a wide spectrum of applications, ranging from churn prediction of subscribers [2], city-scale localization [3] and 5G network optimization / user-experience assessment [4]–[6], to road traffic mapping [7].

Even though the acquisition of TBD is instrumental to the success of the above scenarios, telcos are reaching a point where they are collecting more data than they could possibly exploit. This has the following two implications: (i) it introduces a significant financial burden on the operator to store the collected data locally. Notice that the deep storage of data in public clouds, where economies-of-scale are available (e.g., AWS Glacier), is not an option due to privacy reasons; and (ii) it imposes a high computational cost for accessing and processing the collected data. For example, a petabyte Hadoop cluster, using between 125 and 250 nodes, costs ∼1M USD [8], and a linear scan of 1PB would require almost 15 hours. Additionally, in [9] it is shown that the amount of stored data doubles every year, while storage media costs decline at a rate of less than 1/5 per year.

Fig. 1. Data Prediction (top): aims to find the future value of some tuple. Data Postdiction (bottom): aims to recover the past value of some tuple, which has been deleted to reduce the storage requirements, using an ML model.
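The 15-hour scan figure can be sanity-checked with a back-of-envelope calculation. The per-node throughputs it implies fall within typical commodity HDD sequential read rates (roughly 75-150 MB/s, an assumption on our part, not a number from the paper):

```python
# Back-of-envelope check of the 1PB linear-scan figure for a
# 125-250 node Hadoop cluster.
PB = 10 ** 15                         # 1 petabyte in bytes
scan_seconds = 15 * 3600              # "almost 15 hours"

aggregate_bps = PB / scan_seconds     # cluster-wide throughput needed
per_node_125 = aggregate_bps / 125    # bytes/s per node, 125-node cluster
per_node_250 = aggregate_bps / 250    # bytes/s per node, 250-node cluster

print(f"aggregate: {aggregate_bps / 1e9:.1f} GB/s")        # ~18.5 GB/s
print(f"per node (125): {per_node_125 / 1e6:.0f} MB/s")    # ~148 MB/s
print(f"per node (250): {per_node_250 / 1e6:.0f} MB/s")    # ~74 MB/s
```

Since 74-148 MB/s per node is at the edge of what a single spinning disk sustains sequentially, the quoted scan time is plausible, underlining how expensive full scans of undecayed TBD become.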
Finally, high-availability storage mandates low-level data replication (e.g., in HDFS the default replication factor is 3). Consequently, we claim that the vision of indefinitely storing all IoT-generated high-velocity data on fast high-availability or even deep storage will gradually become too costly and impractical for many analytics-oriented processing scenarios.

To this end, data decaying [10], [11] (or data rotting) has recently been suggested as a powerful concept to complement traditional data reduction techniques [12], [13], e.g., sampling, aggregation (OLAP), dimensionality reduction (SVD, DFT), synopses (sketches) and compression. Data decaying refers to "the progressive loss of detail in information as data ages with time". In data decaying, recent data retains complete resolution, which is practical for operational scenarios that must continue to operate at full data resolution, while older data is either compacted or discarded [5], [10], [11]. Additionally, the decaying cost can be amortized over time, matching current trends in micro-batching (e.g., Apache Spark). Unfortunately, data decaying currently relies on rather straightforward methodologies, such as rotational decaying (i.e., FIFO) [10], or decaying based on specific queries [5] rather than on the complete dataset itself. Our aim in this work is to expand upon these developments to provide more intelligent and generalized decaying operators.
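Rotational (FIFO) decaying, the straightforward baseline mentioned above, amounts to keeping only the K most recent partitions at full resolution and evicting the oldest outright as new data arrives. A minimal sketch (the names `K` and the partition layout are illustrative, not taken from [10]):

```python
# Rotational (FIFO) decaying: a bounded store where appending a new
# daily partition silently evicts the oldest one.
from collections import deque

K = 7                                   # retention window: one week
store = deque(maxlen=K)                 # oldest partition auto-evicted

for day in range(10):                   # 10 days of incoming partitions
    store.append({"day": day, "rows": f"raw OSS records for day {day}"})

retained = [p["day"] for p in store]
print(retained)                         # only the 7 most recent days survive
```

The shortcoming motivating this work is visible in the sketch: days 0-2 are gone entirely. A model-based decaying operator would instead replace evicted partitions with compact models from which their values can later be postdicted.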