IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.12, December 2009, p. 101
Manuscript received December 5, 2009. Manuscript revised December 20, 2009.
A Highly Scalable and Efficient Distributed File Storage System
Fawad Riasat Raja, Dr. Adeel Akram
Faculty of Telecommunication & Information Engineering, University of Engineering & Technology, Taxila, Pakistan
ABSTRACT
The need for and use of large-scale distributed storage has increased rapidly in the last few years. Organizations need terabytes of storage for their operational data and backups. Large storage systems such as a Storage Area Network (SAN) are the usual solution, but they are very expensive and require a high degree of skill to operate and maintain. We propose "A Highly Scalable and Efficient Distributed File Storage System" that is reliable, inexpensive and easy to maintain. Our system is based on a peer-to-peer network architecture. To ensure the reliability of the system, we use an erasure coding technique known as Luby Transform (LT) codes. The system is designed for deployment in Local Area Networks (LANs), but with minimal changes it can be extended to Wide Area Networks (WANs) and the Internet.
Key words: Distributed Storage Systems, Peer-to-Peer Networks, Consistent Hashing, Data Blocks.
1. INTRODUCTION
Over time, the storage space that organizations, from small businesses to large enterprises, need for archiving their operational data and backups has increased manyfold. Information in the form of e-mails, documents, presentations, databases, images and multimedia content requires terabytes of storage space. Storing this information and managing its storage on a limited budget is a critical issue for small businesses and large enterprises alike. Vendors regularly introduce new solutions, but these solutions are very expensive and hard to maintain.
Some organizations use file servers to meet their storage requirements, and when the need for storage grows, they add more hard disks or tape drives to their storage server farm to increase capacity. For reliability, replication is used between the dedicated servers, while their disk drives are organized as RAID arrays, e.g. RAID 1+0 or RAID 5. These types of storage solutions are not scalable, and their management is another important issue [1]. Some storage systems use clustering technology [2] [3]. In cluster technology, many computers or storage nodes are connected together using a SAN. However, storage nodes connected in a cluster may share the same account information with each other, which can result in security issues. Another problem with this solution is its cost and management. We propose a solution that addresses the above-mentioned problems.
Nowadays, a standard desktop PC has enormous computing and storage capacity. A standard PC usually has a Hard Disk Drive (HDD) of more than 100 GB, 1 GB of RAM and a processor running at 2 GHz or higher. A typical installation of an operating system and other required application software does not consume more than 20 to 30 GB of HDD storage. This leaves about 70% of the storage space unused on average, especially on computers used in laboratories and office environments. A small organization typically has more than 20 PCs. A university lab, for example, may contain around 30 PCs with the above-mentioned specifications. If the roughly 70 GB of unused capacity on each of these PCs is combined, a single lab can provide 30 x 70 GB = 2100 GB of storage capacity. This surplus multi-terabyte storage capacity remains unused in most of these labs and can be utilized if combined to form a Large Virtual Storage Space to store huge amounts of data. This motivation guided us in developing a Large Distributed File Storage System based on the available storage capacity of existing PCs.
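The capacity estimate above can be reproduced with a short calculation. The following is a minimal sketch, assuming a 100 GB disk per PC with 30 GB consumed by the operating system and applications (the upper figure quoted above). The function names are illustrative only and are not part of the proposed system.

```python
def unused_gb_per_pc(disk_gb: float, used_gb: float) -> float:
    """Return the disk space left over after the OS and applications."""
    if used_gb > disk_gb:
        raise ValueError("used space cannot exceed disk size")
    return disk_gb - used_gb


def pooled_capacity_gb(num_pcs: int, disk_gb: float, used_gb: float) -> float:
    """Return the raw capacity obtained by pooling the unused space of all PCs."""
    return num_pcs * unused_gb_per_pc(disk_gb, used_gb)


# Assumed figures: 30 PCs, 100 GB disks, 30 GB used on each.
lab_gb = pooled_capacity_gb(num_pcs=30, disk_gb=100, used_gb=30)
print(f"Pooled raw capacity: {lab_gb} GB")
```

This is raw capacity only. Any redundancy added for reliability, such as erasure-coded blocks, would reduce the space available for user data.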
Our proposed system utilizes the unused storage capacity of desktop machines (PCs) operating in small businesses, large enterprises or universities. Our design is based on a completely decentralized (peer-to-peer) architecture. The main reasons for using a peer-to-peer architecture instead of a client-server architecture are:
- Resilience to failure
- Load balancing
- Higher availability of resources