Two different types of de-duplication

There are two very different types of de-duplication: source & target, and they work completely differently.

(This blog entry is in a series. The previous entry is What is De-duplication, and the next entry is De-duplication & remote restores.)

They differ both in how they work and in how you use them. Let's take a look at each.

Target de-duplication is what's found in Intelligent Disk Targets (IDTs), most of which are virtual tape libraries (VTLs). You continue using whatever backup software floats your boat (as long as the IDT/VTL supports it), send your backups to your de-dupe IDT/VTL, and it de-dupes them for you. This reduces the amount of disk needed to store your data, but it does not change the amount of bandwidth needed to get the backups to the backup server. The de-dupe can reduce bandwidth in one place: if the de-dupe IDT/VTL replicates the de-duped data to another IDT/VTL in another location, only the de-duped data crosses that link, and you have an onsite and an offsite copy without making an actual tape. (If you want to make an actual tape, you can make it from the onsite or offsite IDT/VTL.)
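
To make the disk-versus-bandwidth point concrete, here's a minimal sketch, assuming fixed-size chunks and SHA-1 hashes purely for illustration (real IDT/VTLs use their own chunking and hashing, and DedupeTarget, ingest, and restore are made-up names, not any vendor's API), of what a target device does after the full backup stream has already crossed the network:

```python
import hashlib

class DedupeTarget:
    """Toy target-side de-dupe store: the whole backup still arrives
    over the wire; only the disk footprint shrinks."""
    CHUNK_SIZE = 8192  # fixed-size chunking, purely for illustration

    def __init__(self):
        self.chunks = {}    # hash -> chunk data ("the disk")
        self.backups = {}   # backup name -> ordered list of chunk hashes

    def ingest(self, name, data):
        recipe = []
        for i in range(0, len(data), self.CHUNK_SIZE):
            chunk = data[i:i + self.CHUNK_SIZE]
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.chunks:   # store only chunks never seen before
                self.chunks[digest] = chunk
            recipe.append(digest)
        self.backups[name] = recipe

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.backups[name])

target = DedupeTarget()
payload = b"same old full backup" * 5_000
target.ingest("monday_full", payload)    # full stream arrives over the network
target.ingest("tuesday_full", payload)   # arrives in full again...
print(len(target.chunks))                # ...but far fewer than two copies are stored
assert target.restore("tuesday_full") == payload
```

Note that ingest() receives every byte of every backup; the savings show up only in how little of it ends up in the chunk store.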


Source de-duplication requires you to use different backup software on the client(s) where you want to use it. Those clients may be in your data center, in a remote data center, or they can even be laptops. This client software talks to the backup server (which is also running the de-dupe backup software) and says, "Hey, I've got this piece of data here with this hash. Have you seen that hash before?" (The piece of data is a piece of a file, not the whole file.) If the server has seen that hash before, the client doesn't send the data again; the server just notes that there's another copy of that block of data on that client. That way, if a file has already been backed up before (such as the same file being stored by multiple people), it won't be transferred across the LAN/WAN again. In addition, if a previous version of a file has been backed up before, de-dupe will notice the parts of the file it has seen (and not back them up again) and the parts it hasn't seen (and back them up). This reduces both the amount of disk required to store your data AND the amount of bandwidth necessary to send the data.
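
Here's a similarly minimal sketch of that conversation, again assuming fixed-size chunks and SHA-1 for simplicity; the in-memory dict stands in for the backup server's hash index (in a real product that lookup is a network round trip, and the function name is made up for illustration):

```python
import hashlib
import os

CHUNK_SIZE = 8192  # fixed-size chunks, purely for illustration

# Stands in for the backup server's index of every chunk hash it has seen.
server_index = {}   # hash -> chunk data

def source_dedupe_backup(data):
    """Sketch of the client side: send a hash first, send data only if it's new."""
    bytes_sent = 0
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in server_index:    # "have you seen this hash?" -- yes
            continue                  # then nothing crosses the LAN/WAN
        server_index[digest] = chunk  # no -- so send the chunk itself
        bytes_sent += len(chunk)
    return bytes_sent

# Backing up the same file from two different clients: the second client
# sends nothing but hashes.
payload = os.urandom(100_000)
print(source_dedupe_backup(payload))   # 100,000 bytes sent
print(source_dedupe_backup(payload))   # 0 bytes sent
```

The difference from the target sketch above is where the hash lookup happens: before the data ever leaves the client, which is why bandwidth drops along with disk.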


Source De-duplication

Advantages

Reduces bandwidth usage all over

Can protect a remote office without any hardware installed there (up to a certain amount of data)

Designed to use disk

Design incorporates automated onsite & offsite (and even really offsite) copies

Disadvantages

Requires change of backup software

Typically slower than target de-dupe on large volumes of data (many TB)


Target De-duplication 

Advantages

Some implementations are very fast (100s of MB/s to 1000s of MB/s)

Does not require a change in backup software

Disadvantages 

Considered a "band-aid" by some to help backup software that was designed to use disk

Requires hardware at each remote site that is to be protected via de-dupe

Onsite & offsite copies may be outside the knowledge of the backup software


Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

11 comments
  • I think NBU 6.5 will use source de-duplication, as it will have PureDisk technology built in. I wonder how many more resources this will require from the master and/or media servers.

    It is also the only way to do efficient remote backups, but nobody ever talks about (remote) restores. Prepping a server at the main site and then shipping it somewhere on the other side of the globe is not always a good or preferred way to do it.

    For me, the restore part is still one piece of the puzzle that needs to be solved.

  • You can already buy the Puredisk product now, and yes, it is source de-dupe. NBU 6.5 will offer greater integration between the two products.

    _I_ talk about remote restores! Check out my next blog entry: “De-duplication & remote restores.”

  • You seem to suggest, rather explicitly, that dedupes are functioning on the file level. Not only am I pretty sure that’s not true (including not of Puredisk, at least, that’s not what they said at the EDPF in Minnesota in 2005), but I think it’d be pretty boneheaded to hash files rather than blocks.

    Was that just for ease of understanding for the reader?

  • I just re-read the article, and I could see where you think I’m saying that de-dupe is file-level, but that’s not what I’m saying. I just gave a duplicated file as an example. I’ll re-edit the blog entry and give another example.

    BTW, there is file-level de-dupe and sub-file-level de-dupe. File-level de-dupe is also called CAS, or content-addressable storage, and yes, they do the hash at the file level.

    But what we’re talking about here is sub-file-level de-dupe. This catches not only duplicated files, but also duplicated pieces of files (i.e. blocks) that have already been seen. So when you back up a spreadsheet every day because it gets updated every day, each day you should back up only the new blocks in that spreadsheet.
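
    To picture that, here's a tiny sketch, assuming fixed-size blocks and SHA-1 purely for simplicity (real products use their own chunking and hashing; the file contents are invented), of how few blocks of a lightly edited file are actually new to the server:

    ```python
    import hashlib

    CHUNK = 4096  # fixed-size blocks, purely to illustrate sub-file de-dupe

    def block_hashes(data):
        return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
                for i in range(0, len(data), CHUNK)]

    # Yesterday's spreadsheet vs. today's, where only the tail changed.
    yesterday = b"A" * 40_000
    today     = b"A" * 36_000 + b"B" * 4_000

    seen = set(block_hashes(yesterday))   # already on the backup server
    new_blocks = [h for h in block_hashes(today) if h not in seen]
    print(len(new_blocks), "of", len(block_hashes(today)), "blocks need backing up")
    # -> 2 of 10 blocks need backing up
    ```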

  • So, which way will be the most cost-effective route to take for data de-dupe? By using NBU 6.0 and purchasing the PureDisk option, OR by upgrading to NBU 6.5 with PureDisk already built in?

    I have been looking at options other than NBU for de-dupe, and there are some good ones and bad ones in comparison to PureDisk.

  • As far as I can tell, Puredisk will not be built into the base product. It’ll be an option just like it is now. It’s just that it will be more integrated with the base product.

    Nice try, though. 😉

  • In a recent podcast (sorry, I forget which one), Curtis said “inline or delayed de-dupe doesn’t matter when selecting a backup appliance” (my words and estimation of Curtis’ intent). Although he went on at length to discuss some effects, I thought he left out a few important considerations.

    First, some or all delayed de-dupe systems seem to require scheduling a de-dupe batch process, and the data or its location may not have full functionality while that process runs.

    Second, the de-dupe process is extremely resource-hungry (especially on CPU). With the process running, the appliance may not perform its other functions reasonably. Of course this will change for the better over time. Inline de-dupe appliances are built to withstand high resource consumption during backup … there really are no significant back-end processes.

    Third, if the de-duped data is sent “off site”, such as with a proprietary system sending the de-dupe information to a similar box, … the movement off site is necessarily delayed until the de-dupe can be accomplished. No such delay is required if the de-dupe is inline.

    Fourth, if de-dupe is done “inline”, the process of getting data from its source to off site is simpler. Simple is good.

    For these reasons, I see the vendors with “inline” de-dupe as having a significant advantage … one that shouldn’t be waved off as unimportant.

    Of course, it’s possible that I didn’t realize my hearing aid batteries needed changing at the time I listened to Curtis’ podcast. 😉 … and my generalizations may well be worse than I think Curtis’ was. Cheers, Wayne

  • I did say that none of the features that people typically debate matter (inline vs. post-process, MD5 vs. SHA-1 vs. custom, reverse vs. forward referencing, etc.). What matters is:

    1. How big is it? (i.e. how much disk do you give me and what de-dupe ratio do I get with my data?)

    2. How fast is it? (i.e. how fast are backups, restores, and the overall de-dupe process?)

    3. How much does it cost?

    All of your arguments against post-process above are aimed at #2. My opinion is that neither an inline nor a post-process de-dupe system can claim any kind of victory. Both have advantages and disadvantages that have to be tested out with your data and your servers. Then, when all that testing is done, you get to compare how big, fast, and expensive the systems are. THAT’s all that matters.

    Having said that, I’d like to comment on some of your statements, as I think they represent common misunderstandings about the process. Instead of doing it here, I’m going to do it in another blog post.

  • Curtis,

    you missed a point.

    Target/inline de-duplication doesn’t need a custom/modified backup agent, i.e. the backup agent sends the incremental data without bothering about the duplicates. This saves storage but not bandwidth.

    Source de-duplication has a modified agent and sends hashes before sending the data. This saves time, bandwidth and storage.

    And BTW, the simple checksum-matching technique used in PureDisk is of no good use. A simple byte insertion can shift all the blocks.

    An old post from my blog – http://blog.druvaa.com/2008/06/15/data-de-duplication/

  • I said "Source de-duplication requires you to use different backup software on the client(s) where you want to use it."

    Also, you cannot completely dismiss Symantec like that. I realize you’re a competitor and you need to position against them, but they are absolutely NOT "worthless." I know of several very large installations that are very happy. Also, how you slam a very successful product with a beta product is beyond me.

    I fixed the link to your blog.

  • When you say about the disadvantages of target de-duplication:

    ‘Considered a “band-aid” by some to help backup software that was designed to use disk’

    do you mean “…designed to use tape”? I.e., legacy apps such as TPM? Or have I misunderstood something? 🙂