
In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent. The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns'), which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis, and compared to other chunks within existing data.
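To make the chunking idea concrete, the following Python sketch splits incoming data into fixed-size chunks, hashes each one, and stores only previously unseen byte patterns; the chunk size, hash function, and names used here are assumptions chosen for illustration, not features of any particular product.

    # Minimal sketch of chunk-level deduplication (illustrative assumptions only).
    import hashlib

    CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes

    class DedupStore:
        def __init__(self):
            self.chunks = {}   # hash -> chunk bytes; each unique chunk stored once

        def put(self, data: bytes) -> list[str]:
            """Split data into chunks, store only unseen chunks, and return the
            recipe (list of chunk hashes) needed to reconstruct the data."""
            recipe = []
            for i in range(0, len(data), CHUNK_SIZE):
                chunk = data[i:i + CHUNK_SIZE]
                digest = hashlib.sha256(chunk).hexdigest()
                if digest not in self.chunks:      # new byte pattern: store it
                    self.chunks[digest] = chunk
                recipe.append(digest)              # duplicates become references
            return recipe

        def get(self, recipe: list[str]) -> bytes:
            """Reassemble the original data from the stored chunks."""
            return b"".join(self.chunks[d] for d in recipe)

    if __name__ == "__main__":
        store = DedupStore()
        payload = b"A" * 8192 + b"B" * 4096      # two identical 4 KiB runs of "A"
        recipe = store.put(payload)
        assert store.get(recipe) == payload
        print(f"logical chunks: {len(recipe)}, unique chunks stored: {len(store.chunks)}")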

For example, a typical email system might contain 100 instances of the same 1 MB file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage saving: deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks.

In computer code, deduplication is done by, for example, storing information in variables so that it doesn't have to be written out individually but can be changed all at once at a central referenced location. Examples are CSS classes and named references in MediaWiki.
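A rough sketch of the attachment example, pairing deduplication with compression (zlib, SHA-256, and the sizes below are assumptions made for the illustration): duplicates are eliminated first, and each unique stored chunk is then compressed.

    # Hypothetical illustration: 100 identical attachments collapse to one
    # stored (and compressed) copy.
    import hashlib
    import zlib

    def dedup_then_compress(chunks: list[bytes]) -> dict[str, bytes]:
        """Return a mapping of chunk hash -> compressed unique chunk."""
        stored = {}
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in stored:                   # deduplication step
                stored[digest] = zlib.compress(chunk)  # compression step
        return stored

    if __name__ == "__main__":
        attachment = b"report" * 1000          # a highly compressible 6 kB blob
        mailbox = [attachment] * 100           # the same attachment 100 times
        stored = dedup_then_compress(mailbox)
        logical = sum(len(c) for c in mailbox)
        physical = sum(len(c) for c in stored.values())
        print(f"logical bytes: {logical}, physically stored bytes: {physical}")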

Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which are routinely performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or by storing differences between files. Neither approach captures all redundancies, however. Hard linking does not help with large files that have changed only in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).
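The contrast can be sketched as follows: a whole-file comparison (the basis of hard linking) sees two different files once a single byte changes, while a chunk-level comparison still finds that almost all of the data is shared. The file size and chunk size below are arbitrary assumptions.

    # Sketch of why whole-file approaches miss redundancy that chunk-level
    # deduplication catches.
    import hashlib

    CHUNK = 4096

    def chunk_hashes(data: bytes) -> list[str]:
        return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
                for i in range(0, len(data), CHUNK)]

    old = bytes(range(256)) * 4096               # a 1 MiB file, e.g. yesterday's mailbox
    new = bytearray(old)
    new[500_000:500_010] = b"0123456789"         # a tiny in-place change today

    # Whole-file comparison: the files differ, so a hard-link-based backup
    # must keep two full copies.
    print("identical files:", hashlib.sha256(old).hexdigest() ==
          hashlib.sha256(bytes(new)).hexdigest())

    # Chunk-level comparison: almost every chunk is unchanged and can be shared.
    old_chunks, new_chunks = chunk_hashes(old), chunk_hashes(bytes(new))
    shared = sum(a == b for a, b in zip(old_chunks, new_chunks))
    print(f"chunks shared between versions: {shared} of {len(new_chunks)}")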

In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.
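One way such a scheme can work, sketched here with assumed names and a fixed chunk size, is for the sender to offer chunk hashes first and transfer only the chunks the receiving endpoint does not already hold.

    # Minimal sketch of in-line network deduplication between two endpoints.
    import hashlib

    CHUNK = 4096

    def split(data: bytes) -> list[bytes]:
        return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

    class Receiver:
        def __init__(self):
            self.cache = {}                       # hash -> chunk already held

        def missing(self, digests: list[str]) -> set[str]:
            return {d for d in digests if d not in self.cache}

        def accept(self, digest: str, chunk: bytes):
            self.cache[digest] = chunk

    def send(data: bytes, receiver: Receiver) -> int:
        """Transfer data, returning the number of chunk bytes actually sent."""
        chunks = split(data)
        digests = [hashlib.sha256(c).hexdigest() for c in chunks]
        needed = receiver.missing(digests)        # receiver reports unknown hashes
        sent = 0
        for digest, chunk in zip(digests, chunks):
            if digest in needed:
                receiver.accept(digest, chunk)    # transfer only unseen chunks
                sent += len(chunk)
                needed.discard(digest)            # don't resend within this transfer
        return sent

    if __name__ == "__main__":
        rx = Receiver()
        first = send(b"X" * 4096 + b"Y" * 4096, rx)    # both chunks are new
        second = send(b"Y" * 4096 + b"Z" * 4096, rx)   # only the "Z" chunk is new
        print(f"first transfer: {first} bytes, second transfer: {second} bytes")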

Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines, something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.

Classification

Post-process versus in-line deduplication

Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded.
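A toy sketch of the post-process path (all names here are illustrative assumptions): blocks are written verbatim, with no hashing on the write path, and a later pass scans the stored blocks and collapses duplicates into references.

    # Toy contrast of the two approaches: writes are never slowed by hashing
    # or lookups; a later post-process pass finds and collapses duplicates.
    import hashlib

    class Volume:
        def __init__(self):
            self.blocks = []        # physically stored blocks (or reference indices)

        def write(self, block: bytes):
            self.blocks.append(block)        # an in-line path would hash *here* instead

        def post_process_dedup(self):
            """Later pass: replace duplicate blocks with references to the first copy."""
            seen = {}                        # hash -> index of first occurrence
            for i, block in enumerate(self.blocks):
                if isinstance(block, int):   # already a reference
                    continue
                digest = hashlib.sha256(block).hexdigest()
                if digest in seen:
                    self.blocks[i] = seen[digest]    # keep a reference, free the copy
                else:
                    seen[digest] = i

    if __name__ == "__main__":
        vol = Volume()
        for block in (b"alpha", b"beta", b"alpha", b"alpha"):
            vol.write(block)                 # fast: no hashing on the write path
        vol.post_process_dedup()
        print(vol.blocks)                    # [b'alpha', b'beta', 0, 0]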
