Since I worked on de-duplication as part of my undergraduate thesis, I found the articles from NetApp and EMC very interesting.

De-duplication is the process of removing replicas of a file; the file may be of any type (for example .jpg, .txt, .doc, etc.). There are two major approaches to data de-duplication: one is file level and the other is block level.

A few companies employ file-level de-duplication, while the majority employ block-level de-duplication.

Block Level De-duplication:

De-duplication at the data block level compares blocks of data with subsequent blocks. Block-level de-duplication allows you to de-duplicate data within a given object. If an object (a file, database, etc.) contains blocks of data that are identical to each other, block-level de-duplication avoids storing the redundant blocks and reduces the size of the object in storage.
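To make the idea concrete, here is a minimal sketch of block-level de-duplication in Python. The fixed 4KB block size, the in-memory block store, and the function names are all assumptions for illustration; real systems often use variable-size chunking and persistent storage.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real systems may chunk differently


def dedup_blocks(data: bytes):
    """Split data into fixed-size blocks and keep only one copy of each
    unique block; duplicates become references (hashes) to the stored block."""
    store = {}  # block hash -> block bytes (unique blocks only)
    refs = []   # ordered list of hashes that reconstructs the object
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        store.setdefault(digest, block)  # store the block only if unseen
        refs.append(digest)
    return store, refs


def reconstruct(store, refs):
    """Rebuild the original object by following the references."""
    return b"".join(store[digest] for digest in refs)
```

Notice that an object made of four blocks, two of which repeat, only needs two blocks in the store; the reference list preserves enough information to rebuild it exactly.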

In de-duplication, a single copy of the file is maintained and the other copies are turned into references to that particular file, which drastically reduces storage use when the redundancy for that file is very high. The references can work much like the soft-link (symbolic-link) approach used in Linux.

For example, if an image file of 3MB is stored in 5 different locations, the total space occupied by that file is 15MB. With data de-duplication, a single copy of the file is maintained and the remaining copies become references to that single file location. So the result after de-duplication is far smaller than before, perhaps only slightly more than 3MB.

Files may be named differently, which poses a great challenge; hence the MD5/SHA-1 hash of each file is calculated and checked for duplicates, and links are established between identical files. For my project I use Amazon S3 for storing data on the cloud. I found it to be an easy and efficient way of storing and accessing my data. Amazon AWS provides SDK support for various languages like C#, Java, PHP, etc. How-to guides are provided under the Developer section of the Amazon AWS website.

The links given below provide some useful resources regarding de-duplication.

http://www.informationweek.com/blog/229205878

http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/134-inline-or-post-process.html/

http://www.evaluatorgroup.com/document/data-de-duplication-%E2%80%93why-when-where-and-how-infostor-article-by-russ-fellows/

And of course, the Wikipedia article:

http://en.wikipedia.org/wiki/Data_deduplication

With a lot of research being carried out on how to decrease storage costs, de-duplication proves to be an effective tool in this regard.