Reducing the Storage Burden via Data Deduplication
David Geer

Enterprise data volumes are exploding as organizations collect and store increasing amounts of information for their own use and to comply with government regulations. According to the Enterprise Strategy Group, an industry analysis firm, 13 percent of midsized companies surveyed in 2004 used more than 10 terabytes of data storage; by 2008, the figure had increased to 42 percent.

These growing volumes require companies to buy more storage, consume more processing power and energy in handling and managing the information, utilize more network resources in transmitting the material, and spend more time on related functions such as data backup and replication.

However, much of the information in storage is duplicate data. Different sources within an organization often create identical files or duplicate existing files so that they can work with them independently.

To cope with this, organizations are increasingly using data-deduplication technology, also known as intelligent compression, single-instance storage, data reduction, and capacity-optimized storage. As Figure 1 shows, deduplication identifies redundant data, eliminates all but one copy, and creates logical pointers to the information so that users can access the material as needed.

Deduplication lets an organization keep 20 times more data in a given amount of storage, said Jerome Wendt, president of DCIG Inc., an industry analysis firm. The technology thus reduces storage needs and costs, as well as the amount of energy and processing power used to run storage systems. This has given the approach momentum in the marketplace.

However, data deduplication faces challenges. For example, it's not a standardized technology, and many vendors offer widely varying techniques, thereby confusing potential customers.

DIVING INTO DEDUPLICATION

The concept of data deduplication has been around since the 1960s, according to Juan Orlandini, staff engineer with Datalink, an information-storage vendor and service provider. However, he noted, early systems weren't as technically sophisticated as today's offerings. Vendor Data Domain released the first modern deduplication product, the DD200, in 2004, he said.

Today, major vendors include EMC Corp. with its Avamar and Disk Library offerings, ExaGrid Systems with its Disk-Based Backup product, FalconStor Software with the FalconStor Single Instance Repository, NEC with its DataRedux technology and HYDRAstor product, Quantum with its DXi Series offerings, Sepaton with DeltaStor, and Symantec with its Enterprise Vault and NetBackup PureDisk 6.5.

Deduplication works with most types of files, including e-mail attachments, but not with images, noted Enterprise Strategy Group analyst Lauren Whitehouse.

How it works

Deduplication is usually implemented as part of a storage or backup system, according to DCIG's Wendt. Deduplication products integrate with multiple brands and types of storage systems via gateways and Fibre Channel connections, said Datalink's Orlandini.

Vendors sell products either as freestanding appliances, as is the case with Quantum, or as software run by a backup or other storage-related server, as is the case with Asigra.
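To make the pointer scheme in Figure 1 concrete, here is a minimal Python sketch, not any vendor's implementation: it splits data into fixed-size chunks (a simplifying assumption; real products often use variable-size chunking), keeps one physical copy of each unique chunk, and represents each stored object as a list of logical pointers. The class name and chunk size are illustrative.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunks; products often chunk adaptively


class DedupStore:
    """Toy single-instance store: one physical copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes (the single stored copy)

    def write(self, data: bytes) -> list:
        """Store data; return a 'recipe' of logical pointers (fingerprints)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # keep the chunk only if it is new
            recipe.append(fp)
        return recipe

    def read(self, recipe) -> bytes:
        """Follow the logical pointers to rebuild the original data."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = DedupStore()
first = store.write(b"quarterly report " * 1000)   # stores the unique chunks
second = store.write(b"quarterly report " * 1000)  # duplicate copy stores nothing new
assert store.read(first) == store.read(second)
print(len(store.chunks), "unique chunks held for two identical files")
```

Two identical writes consume the space of one, which is the effect behind space-savings figures like the 20-to-1 ratio Wendt cites.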
Hashing. Most deduplication systems use hashing to identify and compare data chunks to determine whether they are redundant, said FalconStor vice president of technology John Lallier. Because the hashes, also called fingerprints, are small relative to the data they represent, comparing them is much quicker and more efficient than comparing entire data sets.

Deduplication systems use algorithms to produce the hash value that represents a data set. The Secure Hash Algorithm-1 (SHA-1) returns a 20-byte hash value, while the older Message Digest 5 (MD5) returns a 16-byte value, according to Rory Bolt, chief technology officer of EMC's Avamar product family.
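The digest sizes Bolt cites are easy to verify with Python's standard hashlib module; the sample chunk below is an arbitrary placeholder.

```python
import hashlib

chunk = b"an arbitrary data chunk"
print(len(hashlib.sha1(chunk).digest()))  # 20 bytes (160 bits)
print(len(hashlib.md5(chunk).digest()))   # 16 bytes (128 bits)
# Comparing two fingerprints means comparing 16 to 20 bytes,
# not the full chunks they stand for.
```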