ItCompress: An Iterative Semantic Compression Algorithm

H. V. Jagadish (1), Raymond T. Ng (2), Beng Chin Ooi (3), Anthony K. H. Tung (3,+)

(1) University of Michigan, 1301 Beal Ave, Ann Arbor, MI 48109-2122. email: jag@eecs.umich.edu
(2) University of British Columbia, 2366 Main Mall, Vancouver, B.C., V6T 1Z4. email: rng@cs.ubc.ca
(3) National University of Singapore, 3 Science Dr 2, Singapore 117543. email: {ooibc, atung}@comp.nus.edu.sg
(+) Contact Author

Abstract

Real datasets are often large enough to necessitate data compression. Traditional ‘syntactic’ data compression methods treat the table as a large byte string and operate at the byte level. The tradeoff in such cases is usually between the ease of retrieval (the ease with which one can retrieve a single tuple or attribute value without decompressing a much larger unit) and the effectiveness of the compression. In this regard, the use of semantic compression has generated considerable interest and motivated certain recent works. In this paper, we propose a semantic compression algorithm called ItCompress (ITerative Compression), which achieves good compression while permitting access even at the attribute level without requiring the decompression of a larger unit. ItCompress iteratively improves the compression ratio of the compressed output during each scan of the table. The amount of compression can be tuned based on the number of iterations. Moreover, the initial iterations already provide significant compression, making it a cost-effective compression technique. Extensive experiments were conducted, and the results indicate the superiority of ItCompress with respect to previously known techniques, such as ‘SPARTAN’ and ‘fascicles’.

1 Introduction

Advances in information technology have necessitated the creation of massive high-dimensional tables for new applications such as corporate data warehouses, network-traffic monitoring and bio-informatics.
The sizes of such tables are often in the range of terabytes, making it a challenge to store them efficiently. An obvious solution for reducing the sizes of such tables is to use traditional data compression methods, which are statistical or dictionary-based (e.g., Lempel-Ziv [17]). Such methods are ‘syntactic’ in nature since they view the table as a large byte string and operate at the byte level.

More recently, compression techniques which take the semantics of the table into consideration during compression [9, 1] have received considerable attention. In general, these algorithms first derive a descriptive model, M, of the database by taking into account the semantics of the attributes, and then separate the data values into the following three groups with respect to M:

1. Data values that can be derived from M.
2. Data values essential for deriving the data values in (1) using M.
3. Data values that do not fit M, i.e., outliers.

By storing only the model M together with the second and third groups of data values, compression is achieved, since M typically takes up substantially less storage space than the original database. Such semantic compression generally has the following advantages over syntactic compression:

• More Complex Analysis
Since the semantics of the data are taken into consideration, semantic compression can exploit complex correlations and data dependencies between the attributes. This is not possible with syntactic compression methods, since they view the database as a large byte string. Further, the exploratory nature of many data analysis applications implies that exact answers are usually not needed, and analysts may prefer a fast approximate answer with an upper bound on the error of approximation.
By taking into consideration the error tolerance that is acceptable in each attribute, semantic compression can perform lossy compression to enhance the compression ratio. (The benefits can be substantial even when the level of error tolerance is low.)

• Fast Retrieval
Given a massive table, typically only certain rows of the table are accessed to answer database queries. As such, it is desirable to be able to decompress only certain tuples in the database, while allowing the other tuples to remain compressed. Since syntactic compression methods, such as gzip, are unaware of record boundaries, this is usually not possible without uncompressing the whole database. In fact, it has even been suggested [12, 13] that tables are better compressed column-wise. Separate compression of individual tuples, and even individual attributes, is possible; however, syntactic compression is usually not effective on very small strings, so this sort of fine-granularity compression is not used frequently.

Semantic compression permits local reconstruction of selected tuples, and even attributes, without having to reconstruct the entire table. In fact, it is even possible to store the compressed data in a relational database, thereby making the query optimization and indexing techniques of relational databases available for compressed data as well.
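To make the model-based scheme above concrete, the following is a minimal sketch of one common instantiation of it (the representative-row idea underlying fascicles and ItCompress): the model M is a small set of representative rows, each row is encoded as a pointer to its best representative plus a bitmap of which attributes fall within the per-attribute error tolerance, and only the non-matching values (the outliers) are stored explicitly. The function names, the greedy nearest-representative assignment, and the numeric-only attributes are illustrative assumptions, not the exact algorithm of the paper.

```python
def compress(rows, reps, tolerance):
    """Encode each row against its best-matching representative.

    rows, reps: lists of equal-length numeric tuples.
    tolerance:  per-attribute error bounds for lossy matching.
    Returns one (rep_index, bitmap, outliers) triple per row, where
    bitmap[i] is True if attribute i is covered by the representative
    (within tolerance) and outliers holds the values stored explicitly.
    """
    encoded = []
    for row in rows:
        best = None
        for r, rep in enumerate(reps):
            # An attribute "matches" if the representative's value is
            # within the acceptable error tolerance for that attribute.
            bitmap = [abs(v - w) <= t for v, w, t in zip(row, rep, tolerance)]
            outliers = [v for v, ok in zip(row, bitmap) if not ok]
            # Greedily keep the representative with the fewest outliers.
            if best is None or len(outliers) < len(best[2]):
                best = (r, bitmap, outliers)
        encoded.append(best)
    return encoded

def decompress(encoded, reps):
    """Reconstruct approximate rows; each tuple needs only its own
    triple and the (small) set of representatives."""
    rows = []
    for r, bitmap, outliers in encoded:
        it = iter(outliers)
        rows.append(tuple(reps[r][i] if ok else next(it)
                          for i, ok in enumerate(bitmap)))
    return rows
```

Note that `decompress` touches only the triple of the tuple being reconstructed plus the representatives, which is exactly the tuple-level (and attribute-level) retrieval property discussed above; a syntactic compressor would have to decompress a much larger unit.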