Data deduplication is a data-compression technique in which duplicate data is deleted, so that a system maintains a single copy of each unit of information rather than storing multiples. The duplicates are replaced with references pointing to the retained copy, allowing the system to retrieve the data as before. This technique reduces the need for storage space, can keep systems running more quickly, and limits the expenses associated with data storage. It can work in a number of ways and is used on many types of computer systems.
In file-level data deduplication, the system looks for any duplicated files and deletes the extras. Block-level deduplication looks at blocks of data within files to identify extraneous data. People can end up with duplicated data for a wide variety of reasons, and using data deduplication can streamline a system, making it easier to use. The system can periodically pore through the data to check for duplicates, eliminate extras, and generate references pointing to the copies left behind.
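To make the block-level idea concrete, the following Python sketch splits files into fixed-size chunks, stores each unique chunk once keyed by its content hash, and records a per-file list of chunk references. The names here (ChunkStore, CHUNK_SIZE) are purely illustrative; real deduplicating systems typically use more sophisticated variable-size chunking and indexing, but the principle is the same.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; real systems often use variable-size chunking


class ChunkStore:
    """Toy block-level deduplicating store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}   # content hash -> chunk bytes (stored once)
        self.files = {}    # file name -> ordered list of chunk hashes (references)

    def add_file(self, name, data):
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if an identical one has not been seen before.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def read_file(self, name):
        # Reassemble the file by following its chunk references.
        return b"".join(self.chunks[digest] for digest in self.files[name])


if __name__ == "__main__":
    store = ChunkStore()
    shared = b"A" * 8192
    store.add_file("report_v1.doc", shared + b"first draft")
    store.add_file("report_v2.doc", shared + b"final draft")  # shares most chunks with v1
    assert store.read_file("report_v1.doc").endswith(b"first draft")
    print(f"unique chunks stored: {len(store.chunks)}")  # far fewer than the chunks written
```

Because the two versions of the report differ only at the end, most of their chunks hash to the same values and are stored just once, while each file keeps its own list of references for retrieval.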
Such systems are sometimes referred to as intelligent compression systems, or single-instance storage systems. Both terms reference the idea that the system works intelligently to store and file data in order to reduce the load on the system. Data deduplication can be especially valuable in large systems that store data from a number of sources, where storage costs rise steadily as the system is expanded over time.
These systems are designed to be part of a larger system for compressing and managing data. Data deduplication cannot protect systems from viruses and faults. It is important to use adequate antivirus protection to keep a system safe and limit viral contamination of files, and to back data up at a separate location to address concerns about data loss due to outages, damage to equipment, and so forth. Having the data compressed before backing it up will save time and money.
Systems utilizing data deduplication in their storage can run more quickly and efficiently. They will still require periodic expansion to accommodate new data and to address concerns about security, but they should be less prone to filling up quickly with duplicated data. This is an especially common concern on email servers, where the server may store large amounts of data for users and significant chunks of it could consist of duplicates, such as the same attachments repeated over and over. For example, many people emailing from work attach footers with email disclaimers and company logos, and these can eat up server space quickly.
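The email case can be illustrated with a minimal Python sketch of single-instance storage for attachments. The MailStore class and its methods are hypothetical, and a real mail server would handle the bookkeeping very differently, but the effect is the one described above: a footer attached to a thousand messages is physically stored only once.

```python
import hashlib


class MailStore:
    """Toy single-instance store for email attachments.

    Each distinct attachment is stored once; messages keep only
    references (content hashes) to the shared copies.
    """

    def __init__(self):
        self.attachments = {}  # content hash -> attachment bytes (one copy each)
        self.messages = []     # list of (subject, [attachment hashes])

    def save_message(self, subject, attachments):
        refs = []
        for data in attachments:
            digest = hashlib.sha256(data).hexdigest()
            self.attachments.setdefault(digest, data)  # keep only the first copy
            refs.append(digest)
        self.messages.append((subject, refs))


if __name__ == "__main__":
    logo = b"<company logo image bytes>"
    disclaimer = b"This email and any attachments are confidential..."

    store = MailStore()
    # A thousand outgoing messages, each carrying the same footer attachments.
    for i in range(1000):
        store.save_message(f"Status update {i}", [logo, disclaimer])

    print(f"messages stored: {len(store.messages)}")            # 1000
    print(f"attachment copies kept: {len(store.attachments)}")  # 2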