


Written by W. Curtis Preston
Monday, 30 July 2007 14:56
There are two very different types of de-duplication: source & target, and they work completely differently.
(This blog entry is in a series. The previous entry is What is De-duplication, and the next entry is De-duplication & remote restores.)
Source and target de-duplication are very different, both in how they work and how you use them. Let's take a look at them.
Target de-duplication is what's found in Intelligent Disk Targets (IDTs), most of which are virtual tape libraries (VTLs). You continue using whatever backup software floats your boat (as long as the VTL supports it), send your backups to your de-dupe IDT/VTL, and it de-dupes them for you. This reduces the amount of disk needed to store your data, but it does not change the amount of bandwidth needed to get the backups to the backup server. De-dupe can still reduce bandwidth usage if the de-dupe IDT/VTL then replicates the de-duped data to another IDT/VTL in another location. Now you have an on-site and an off-site copy without making an actual tape. (If you want to make an actual tape, you can make it from the onsite or offsite IDT/VTL.)
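To make the disk-versus-bandwidth point concrete, here is a minimal sketch of a target that de-dupes whatever it receives. Everything in it is an assumption for illustration: the `DedupeTarget` class, the tiny fixed 4-byte chunks, and SHA-256 as the hash. Real IDTs/VTLs chunk and index data in their own ways.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, purely for illustration


class DedupeTarget:
    """Hypothetical de-dupe IDT/VTL: receives full streams, stores unique chunks."""

    def __init__(self):
        self.chunks = {}         # hash -> chunk bytes
        self.bytes_received = 0  # what crossed the wire to reach this target

    def write(self, stream):
        # The backup software sends everything; de-dupe happens on arrival.
        self.bytes_received += len(stream)
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            self.chunks.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)

    def bytes_stored(self):
        return sum(len(c) for c in self.chunks.values())

    def replicate_to(self, other):
        # Only chunks the other target lacks are sent; returns bytes sent.
        sent = 0
        for digest, chunk in self.chunks.items():
            if digest not in other.chunks:
                other.chunks[digest] = chunk
                sent += len(chunk)
        return sent


onsite, offsite = DedupeTarget(), DedupeTarget()
onsite.write(b"AAAABBBBCCCCDDDD")   # Monday's backup
onsite.write(b"AAAABBBBCCCCEEEE")   # Tuesday's backup, one chunk changed

received = onsite.bytes_received            # full size of both backups
stored = onsite.bytes_stored()              # unique chunks only
replicated = onsite.replicate_to(offsite)   # unique chunks only
print(received, stored, replicated)
```

In this toy run the target receives all 32 bytes but stores and replicates only 20, which is the pattern described above: no savings on the way in, savings on disk and on the replication link.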
Source de-duplication requires you to use different backup software on the client(s) where you want to use it. Those clients may be in your data center or in a remote data center, or they can even be laptops. This client software talks to the backup server (that is also running the de-dupe backup software) and says "hey, I've got this piece of data here with this hash. Have you seen that hash before?" (This piece of data is a piece of a file, not the whole file.) If the server has seen that piece of data before, the client doesn't send the data again; the server just notes that there's another copy of that block of data at that client. That way, if a file has already been backed up by the backup server (such as the same file being stored by multiple people), then it won't be transferred across the LAN/WAN again. In addition, if a previous version of a file has been backed up before, de-dupe will notice the parts of the file it has seen (and not back them up again) and the parts of the file it hasn't seen (and back them up). This reduces both the amount of disk required to store your data AND the amount of bandwidth necessary to send the data.
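The "have you seen that hash before?" conversation can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular product does it: `DedupeServer`, `backup_file`, the 4-byte chunk size, and the use of SHA-256 are all made up for the example, and the count of bytes sent ignores the (small) hashes themselves.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; real products use much larger pieces


class DedupeServer:
    """Hypothetical backup server that keeps each unique chunk once."""

    def __init__(self):
        self.chunks = {}    # hash -> chunk bytes
        self.catalog = {}   # (client, path) -> ordered list of chunk hashes

    def has_chunk(self, digest):
        return digest in self.chunks

    def store_chunk(self, digest, data):
        self.chunks[digest] = data

    def record_file(self, client, path, digests):
        # Noting which client holds which chunks costs no data transfer.
        self.catalog[(client, path)] = digests


def backup_file(server, client, path, data):
    """Back up `data` from `client`; return the chunk bytes actually sent."""
    sent, digests = 0, []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        digests.append(digest)
        if not server.has_chunk(digest):    # "Have you seen that hash before?"
            server.store_chunk(digest, chunk)
            sent += len(chunk)
    server.record_file(client, path, digests)
    return sent


server = DedupeServer()
v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBXXXXDDDD"   # same file, one piece changed

first = backup_file(server, "laptop1", "report.xls", v1)
same_file_elsewhere = backup_file(server, "laptop2", "report.xls", v1)
new_version = backup_file(server, "laptop1", "report.xls", v2)
print(first, same_file_elsewhere, new_version)
```

The first backup sends everything, the identical file on a second client sends nothing, and the new version sends only the changed piece.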
Source De-duplication
Advantages
Reduces bandwidth usage everywhere (LAN and WAN)
Can protect a remote office without any hardware installed there (up to a certain amount of data)
Designed to use disk
Design incorporates automated onsite & offsite (and even really offsite) copies
Disadvantages
Requires change of backup software
Typically slower than target de-dupe on large volumes of data (many TB)
Target De-duplication
Advantages
Some implementations very fast (100s of MB/s to 1000s of MB/s)
Does not require change in backup software
Disadvantages
Considered a "band-aid" by some to help backup software that was designed to use disk
Requires hardware at each remote site to be protected via de-dupe
Onsite & offsite copies may be outside of knowledge of backup software
Comments
'Considered a "band-aid" by some to help backup software that was designed to use disk'
Do you mean "...designed to use tape"? I.e., legacy apps such as TPM? Or have I misunderstood something?
Also, you cannot completely dismiss Symantec like that. I realize you're a competitor and you need to position against them, but they are absolutely NOT "worthless." I know of several very large installations that are very happy. Also, how you slam a very successful product with a beta product is beyond me.
I fixed the link to your blog.
You missed a point.
Target/inline de-duplication doesn't need a custom/modified backup agent, i.e. the backup agent sends the incremental data without bothering about the duplicates. This saves storage but not bandwidth.
Source de-duplication has a modified agent and sends hashes before sending the data. This saves time, bandwidth and storage.
And BTW, the simple checksum-matching technique used in PureDisk is of little use: a single byte insertion can shift all the blocks.
An old post from my blog - blog.druvaa.com/2008/06/15/data-de-duplication/
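The byte-insertion objection in the comment above can be illustrated with a small sketch. It assumes fixed-size blocks (as the comment implies; this is not a statement about how any named product actually chunks data) and compares them with content-defined boundaries. The boundary test here hashes a short window with SHA-256 purely as a stand-in; a real system would use a true rolling hash such as a Rabin fingerprint. All names and sizes are hypothetical.

```python
import hashlib
import random

BLOCK = 64     # fixed block size (illustrative)
WINDOW = 16    # bytes examined when looking for a content-defined boundary
DIVISOR = 64   # a boundary is declared roughly once every 64 bytes


def fixed_chunks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def content_defined_chunks(data):
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1):i + 1]
        fingerprint = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if fingerprint % DIVISOR == 0:   # boundary depends only on nearby content
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(chunker, old, new):
    """Count chunks of `new` whose hash already exists among chunks of `old`."""
    seen = {hashlib.sha256(c).digest() for c in chunker(old)}
    return sum(hashlib.sha256(c).digest() in seen for c in chunker(new))


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
modified = b"!" + original   # a single byte inserted at the front

shared_fixed = shared(fixed_chunks, original, modified)
shared_cdc = shared(content_defined_chunks, original, modified)
print(shared_fixed, shared_cdc)
```

With fixed blocks, the one-byte insertion shifts every block and nothing matches; with content-defined boundaries, the chunks realign shortly after the insertion point and most of them are recognized again.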
1. How big is it? (i.e. how much disk do you give me and what de-dupe ratio do I get with my data?)
2. How fast is it? (i.e. how fast are backups, restores, and the overall de-dupe process?)
3. How much does it cost?
All of your arguments against post-process above are aimed at #2. My opinion is that neither in-line nor post-process de-dupe systems can claim any kind of victory. Both have advantages and disadvantages that have to be tested out with your data and your servers. Then, when all that testing is done, you get to compare how big, fast, and expensive the systems are. THAT's all that matters.
Having said that, I'd like to comment on some of your statements, as I think they represent common misunderstandings about the process. Instead of doing it here, I'm going to do it in another blog post.
First, some or all delayed de-dupe systems seem to require scheduling of a de-dupe batch process where the data or data location may not have full functionality during the de-dupe process.
Second, the de-dupe process is extremely resource (especially CPU) hungry. While it is running, the appliance may not perform its other functions reasonably well. Of course, this will change for the better over time. Inline de-dupe appliances are built to withstand high resource consumption during backup ... there really are no significant back-end processes.
Third, if the de-duped data is sent "off site", such as with a proprietary system sending the de-dupe information to a similar box, the movement off site is necessarily delayed until de-dupe can be accomplished. No such delay is required with inline de-dupe.
Fourth, if de-dupe is done "inline", the process of getting data from its source to off site is simpler. Simple is good.
For these reasons, I see the vendors with "inline" de-dupe as having a significant advantage ... one that shouldn't be waved off as unimportant.
Of course, it's possible that I didn't realize my hearing aid batteries needed changing at the time I listened to Curtis' podcast. :wink: ... and my generalizations may well be worse than I think Curtis' were. cheers, wayne
Nice try, though. :wink:
I have been looking at options other than NBU for de-dupe, and there are some good ones and bad ones in comparison to PureDisk.
BTW, there is file-level de-dupe and sub-file-level de-dupe. File-level de-dupe is also called CAS, or content-addressable storage, and yes -- they do the hash at the file level.
But what we're talking about here is sub-file-level de-dupe. This catches not only duplicated files, but also duplicated pieces of files (i.e. blocks) that have already been seen. So when you back up a spreadsheet every day because it gets updated every day, each day you should back up only the new blocks in that spreadsheet.
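Here's a rough sketch of the difference, using a made-up "spreadsheet" that changes a little each day. The function names, the 4-byte block size, and SHA-256 are assumptions for illustration only.

```python
import hashlib

BLOCK = 4  # illustrative block size


def file_level_new_bytes(store, data):
    """File-level (CAS-style): one hash per file; any change means a new file."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in store:
        return 0
    store.add(digest)
    return len(data)


def block_level_new_bytes(store, data):
    """Sub-file-level: one hash per block; only unseen blocks are new."""
    new = 0
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store.add(digest)
            new += len(block)
    return new


monday = b"ROW1ROW2ROW3ROW4"
tuesday = b"ROW1ROW2ROW3ROW5"   # one "row" updated

file_store, block_store = set(), set()
backups = (monday, tuesday, tuesday)   # the third backup is an unchanged file
file_level = [file_level_new_bytes(file_store, d) for d in backups]
block_level = [block_level_new_bytes(block_store, d) for d in backups]
print(file_level, block_level)
```

File-level de-dupe only helps on the third backup, when the file is byte-for-byte identical; sub-file-level de-dupe also helps on the second, where only the changed block is new.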
Was that just for ease of understanding for the reader?
_I_ talk about remote restores! Check out my next blog entry: "De-duplication & remote restores."