IJIRST - International Journal for Innovative Research in Science & Technology | Volume 2 | Issue 11 | April 2016 | ISSN (online): 2349-6010
All rights reserved by www.ijirst.org

Optimization of XML Compression

Lalita T. Dhekwar, Department of Computer Science and Engineering, Nagpur Institute of Technology
Jagdish Pimple, Department of Computer Science and Engineering, Nagpur Institute of Technology

Abstract

XML is a standard for exchanging and presenting information on the Web because it makes data flexible in representation and easily portable. However, XML is also recognized as verbose: repeated tags and structures greatly increase the size of the data. This verbosity creates challenges for conventional query processing and data exchange, and it increases the overhead on bandwidth- and memory-limited devices. The increased size affects both query processing and data exchange, and XML files require considerably more storage space and network bandwidth. XML compression and optimization are one solution to the verbosity problem. Many effective XML compressors, such as XMill, have been proposed to solve the data size problem, but they do not address the problem of running queries on compressed XML data. Other compressors have been proposed to query compressed XML data. However, their compression ratio is usually worse than that of XMill and of the generic compressor gzip, while their query performance and the expressive power of the query language they support are inadequate. The objective of this work is twofold: first, the design and development of an XML compression method; second, the optimization of existing XML compression methods.

Keywords: XML, data compression, query processing, Web applications

I. INTRODUCTION

Extensible Markup Language (XML) is becoming the standard format for electronic data storage and exchange. With its increasing popularity, XML is also widely used in business applications, both as a database format and for storing metadata. XML defines a set of rules for encoding data in a standard format. The main characteristics that make XML popular for data communication are:

Self-describing: An XML document can be understood without external context.
Evolvable and extensible: XML formats can be designed to evolve gracefully over time while retaining backward and forward compatibility.
Isomorphic: XML formats are the same in every environment, so XML is used for data communication in hybrid environments.
Human-readable: XML, as a text format, is easy for people to read and understand with common, non-proprietary tools (e.g., a text editor), and is simple to modify with those same tools. This encourages the adoption of XML-based formats and aids greatly in debugging and ad hoc uses of them.
Simplicity: The XML format is simple to use, with few encodings. It supports Unicode, so it can represent the characters of the world's languages.

XML is often referred to as self-describing data because it is designed so that the schema is repeated for each record in the document. On the one hand, this self-describing feature gives XML immense flexibility; on the other hand, it introduces the main problem of XML documents, verbosity, which results in very large document sizes. As a result, the amount of information that has to be transmitted, processed, stored, and queried is often larger than with other data formats. This can be a serious problem in many situations, since data has to be transmitted quickly and stored compactly. Large XML documents consume not only transmission time but also large amounts of storage space.
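The verbosity described above can be made concrete with a small measurement. The following Python sketch is only an illustration and is not part of the original work: the sample document, its tag names, and the helper `markup_overhead` are all hypothetical. It estimates what share of a record-oriented XML document is markup rather than text content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: every record repeats the same tag names.
SAMPLE = (
    "<employees>"
    + "".join(
        f"<employee><name>User{i}</name><dept>Sales</dept></employee>"
        for i in range(100)
    )
    + "</employees>"
)


def markup_overhead(xml_text: str) -> float:
    """Return the fraction of characters that are markup, not text content."""
    root = ET.fromstring(xml_text)
    # itertext() yields every piece of character data in document order.
    content_chars = sum(len(piece) for piece in root.itertext())
    return 1 - content_chars / len(xml_text)


overhead = markup_overhead(SAMPLE)
print(f"markup share of the sample document: {overhead:.0%}")
```

In a document of this shape, where short values are wrapped in repeated tags, most of the characters are tags. This is the redundancy that XML compressors try to remove.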
The problem can be addressed by using XML compression techniques to reduce the space requirements of XML. Compressors fall into two types according to their awareness of XML: general text-based compressors and XML-conscious compressors. General text-based compressors are XML-blind: they treat XML documents as ordinary plain text and apply traditional text compression techniques. gzip, WinRAR, and 7-Zip belong to this group. By default, 7-Zip uses the 7z format, and that format by default uses the LZMA method to compress the data. LZMA is an improved version of the LZ77 algorithm that achieves a better compression ratio. XML-conscious compressors are aware of the XML structure and take advantage of it to increase the compression ratio. XMill is one such XML compressor; it eliminates redundant data by identifying similarities between semantically related data. XMill also uses the gzip library to compress XML string data. To improve the compression ratio of XMill, this work adds the 7-Zip library alongside gzip to compress the XML string data.
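The idea of grouping related string data and compressing it with either gzip or LZMA can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation used in this work or in XMill: it groups text values by tag name only, ignores the document structure and attributes, and assumes that values contain no newline characters. The function names and the sample document are hypothetical; the standard-library modules `gzip` and `lzma` stand in for the gzip and 7-Zip libraries.

```python
import gzip
import lzma
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_into_containers(xml_text):
    """Group text values by tag name (a simplified, container-like grouping)."""
    containers = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        if elem.text and elem.text.strip():
            containers[elem.tag].append(elem.text.strip())
    return dict(containers)


def compress_container(values):
    """Compress one container with both back ends and keep the smaller result."""
    # Assumption: values contain no newline, so "\n" is a safe separator.
    raw = "\n".join(values).encode("utf-8")
    candidates = {
        "gzip": gzip.compress(raw),
        "lzma": lzma.compress(raw),
    }
    method = min(candidates, key=lambda name: len(candidates[name]))
    return method, candidates[method]


def decompress_container(method, blob):
    """Invert compress_container for the given method name."""
    raw = gzip.decompress(blob) if method == "gzip" else lzma.decompress(blob)
    return raw.decode("utf-8").split("\n")


# Hypothetical sample data with two kinds of semantically related values.
SAMPLE = (
    "<orders>"
    + "".join(
        f"<order><id>{i}</id><city>Nagpur</city></order>" for i in range(500)
    )
    + "</orders>"
)

containers = split_into_containers(SAMPLE)
for tag, values in containers.items():
    method, blob = compress_container(values)
    assert decompress_container(method, blob) == values
    print(f"{tag}: {len(values)} values, {method}, {len(blob)} bytes")
```

Choosing the smaller output per container is one possible way to combine the two libraries. Which back end wins depends on the size and redundancy of each container, so the sketch makes no claim about the resulting compression ratios.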