There are two very different types of de-duplication: source & target, and they work completely differently.
(This blog entry is in a series. The previous entry is What is De-duplication, and the next entry is De-duplication & remote restores.)
Source and target de-duplication are very different, both in how they work and how you use them. Let's take a look at them.
Target de-duplication is what's found in Intelligent Disk Targets (IDTs), most of which are virtual tape libraries (VTLs). You continue using whatever backup software floats your boat (as long as the VTL supports it) and send your backups to your de-dupe IDT/VTL, which de-dupes them for you. This reduces the amount of disk needed to store your data, but it does not change the amount of bandwidth needed to get the backups to the backup server. Target de-dupe can reduce bandwidth usage in one place, though: if the de-dupe IDT/VTL replicates the de-duped data to another IDT/VTL in another location, only the de-duped data crosses the wire. Now you have an on-site and an off-site copy without making an actual tape. (If you want to make an actual tape, you can make it from the onsite or offsite IDT/VTL.)

Source de-duplication requires you to use different backup software on the client(s) where you want to use it. Those clients may be in your data center or in a remote data center, or they can even be laptops. The client software talks to the backup server (which is also running the de-dupe backup software) and says "hey, I've got this piece of data here with this hash. Have you seen that hash before?" (This piece of data is a piece of a file, not the whole file.) If the server has seen that piece of data before, the client doesn't send the data again; the server just notes that there's another copy of that block of data at that client. That way, if a file has already been backed up to the backup server (such as the same file being stored by multiple people), the client won't transfer that file across the LAN/WAN. In addition, if a previous version of a file has been backed up before, de-dupe will notice the parts of the file it has seen (and not back them up again) and the parts of the file it hasn't seen (and back them up). This reduces both the amount of disk required to store your data AND the amount of bandwidth necessary to send the data.
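To make the "have you seen that hash before?" conversation concrete, here is a minimal sketch in Python. It is not how any particular product works. The fixed 4-byte chunk size, the use of SHA-256, and the names `DedupeServer`, `split_chunks`, and `backup` are all assumptions made for illustration; real products use much larger pieces and often choose their boundaries differently.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; unrealistically small, chosen only to keep the demo readable


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut data into fixed-size pieces (an assumption; see the lead-in)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupeServer:
    """Hypothetical backup server: stores each unique piece of data once."""

    def __init__(self):
        self.chunks = {}     # hash -> piece of data
        self.manifests = {}  # (client, file name) -> ordered list of hashes

    def has_chunk(self, digest: str) -> bool:
        return digest in self.chunks

    def store_chunk(self, digest: str, chunk: bytes) -> None:
        self.chunks[digest] = chunk

    def record_manifest(self, client: str, name: str, digests: list) -> None:
        self.manifests[(client, name)] = digests

    def restore(self, client: str, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.manifests[(client, name)])


def backup(server: DedupeServer, client: str, name: str, data: bytes) -> int:
    """Back up one file and return how many bytes of data were actually sent."""
    sent = 0
    digests = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        # "Have you seen that hash before?" Only unseen pieces are sent.
        if not server.has_chunk(digest):
            server.store_chunk(digest, chunk)
            sent += len(chunk)
        digests.append(digest)
    # The server notes which pieces make up this client's copy of the file.
    server.record_manifest(client, name, digests)
    return sent


server = DedupeServer()
first = backup(server, "laptop-a", "report.txt", b"AAAABBBBCCCC")   # all new: 12 bytes sent
second = backup(server, "laptop-b", "report.txt", b"AAAABBBBCCCC")  # same file elsewhere: 0 bytes sent
third = backup(server, "laptop-a", "report.txt", b"AAAABBBBDDDD")   # one changed piece: 4 bytes sent
```

In this sketch the second client sends nothing because the server already holds every piece, and the edited file sends only the piece that changed. In a real deployment the hashes themselves still cross the network, so the saving is large but not total.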
Source De-duplication

Advantages:
- Reduces bandwidth usage all over
- Can protect a remote office without any hardware installed there (up to a certain amount of data)
- Designed to use disk
- Design incorporates automated onsite & offsite (and even really offsite) copies

Disadvantages:
- Requires change of backup software
- Typically slower than target de-dupe on large volumes of data (many TB)

Target De-duplication

Advantages:
- Some implementations very fast (100s of MB/s to 1000s of MB/s)
- Does not require change in backup software

Disadvantages:
- Considered a "band-aid" by some to help backup software that was not designed to use disk
- Requires hardware at each remote site to be protected via de-dupe
- Onsite & offsite copies may be outside of knowledge of backup software