In-line or post-process de-duplication?
Written by W. Curtis Preston   
Friday, 24 August 2007

In-line and post-process de-duplication are features -- not benefits.  And I think that arguing about which one is better is like trying to argue which is better: synchronous or asynchronous replication.  They both have benefits and drawbacks.  What matters is whether or not the one you buy meets your requirements, right?  I'll try my best to present both sides of the argument, and dispel a lot of what I believe to be misconceptions about this issue.

This blog is in a series.  The previous entry is  De-duplication & remote restores and the next entry is  De-dupe targets in TSM environments?

 

First, let me describe both processes.

In-line (or as I like to call it, synchronous) de-duplication

A block of data comes into the appliance (as part of a larger stream of data from a backup).  While it's in RAM, the appliance does its magic to figure out whether it's seen that block before.  If it has seen it before, it writes a pointer somewhere saying that it's seen it again.  If it hasn't seen the block of data before, it writes the new block.  Job complete.

This method allows most of the hard work to be done in RAM, which minimizes I/O overhead.  The only disk operation that's done 100% of the time is the hash lookup (except for one in-line vendor that can keep their hash table in RAM). (Some products must also read the existing block that the new block appears to match, in order to do a bit-level verification that the two blocks are the same before discarding the new block.)  95% of the time (when the block matches a block already seen -- assuming a 20:1 de-dupe ratio), it requires one write to the disk to update the hash table, and that's it.  It just throws away the redundant block and never writes it to disk.  5% of the time (when it's a new, unique block), it requires one write to disk for the block and one write to the hash table.
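To make the flow concrete, here is a minimal Python sketch of an in-line target. It is a toy under stated assumptions, not how any particular vendor does it: the class name, the choice of SHA-256 as the hash, and the dictionaries standing in for the disk and the hash table are all mine, and the optional bit-level verification is left out.

```python
import hashlib


class InlineDedupeStore:
    """Toy in-line de-dupe target: the decision is made before a block is written."""

    def __init__(self):
        self.blocks = {}     # hash -> block data (stands in for the disk)
        self.pointers = []   # one hash per ingested block, in arrival order
        self.disk_ops = 0    # hash lookups, block writes, and hash table updates

    def ingest(self, block: bytes) -> None:
        # The hash is computed while the block is still in RAM.
        digest = hashlib.sha256(block).hexdigest()  # SHA-256 is an assumption
        self.disk_ops += 1  # hash lookup (a disk read unless the table lives in RAM)
        if digest not in self.blocks:
            self.blocks[digest] = block  # new, unique block: write it
            self.disk_ops += 1
        # Redundant blocks are simply dropped; only the pointer is recorded.
        self.pointers.append(digest)
        self.disk_ops += 1  # hash table update


store = InlineDedupeStore()
for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]:
    store.ingest(block)
print(len(store.blocks), store.disk_ops)  # prints: 2 10
```

Two unique blocks cost three disk operations each, and the two repeats cost two each, which is where the total of 10 comes from.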

Post-process (or as I like to call it, asynchronous) de-duplication

A block of data comes into the appliance (as part of a larger stream of data from a backup), and it is written to disk in its entirety.  Then a separate process (running asynchronously and possibly from another appliance that's accessing the same disk) reads the block of data and does its magic to see if it's ever seen that block of data before.  If it has, it deletes the block of data and replaces it with a pointer.  If it hasn't seen it, it leaves it alone.

This method requires a lot more I/O than the in-line method.  It writes 100% of all new blocks to disk.  It then reads 100% of all new blocks from disk so that it can check them for commonality, which requires another disk read to check the hash table.  (Some vendors require an additional read here for bit-level verification of commonality.)  If a block matches (95% of the time), it requires another write to delete the duplicate copy and another write to update the hash table.  If it doesn't match (5% of the time), it requires a write to the hash table.
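The same toy treatment for post-process looks like this. Again, this is only a sketch: the class and method names and the use of SHA-256 are assumptions of mine, the de-dupe pass is called by hand rather than running asynchronously, and the optional bit-level verification read is not counted.

```python
import hashlib


class PostProcessStore:
    """Toy post-process target: land everything first, de-dupe later."""

    def __init__(self):
        self.landing = []    # blocks written in full as they arrive
        self.blocks = {}     # hash -> unique block kept after de-dupe
        self.pointers = []   # one hash per de-duped block, in order
        self.disk_ops = 0

    def ingest(self, block: bytes) -> None:
        self.landing.append(block)  # every block is written, redundant or not
        self.disk_ops += 1

    def dedupe_pass(self) -> None:
        """Separate pass over the landed blocks (asynchronous in a real product)."""
        for block in self.landing:
            self.disk_ops += 1  # read the block back from disk
            digest = hashlib.sha256(block).hexdigest()  # SHA-256 is an assumption
            self.disk_ops += 1  # hash lookup
            if digest in self.blocks:
                self.disk_ops += 1  # delete the duplicate copy
            else:
                self.blocks[digest] = block  # unique block stays where it is
            self.pointers.append(digest)
            self.disk_ops += 1  # hash table update
        self.landing.clear()


pp = PostProcessStore()
for block in [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]:
    pp.ingest(block)
pp.dedupe_pass()
print(len(pp.blocks), pp.disk_ops)  # prints: 2 18
```

The result on disk is the same two unique blocks, but each unique block costs four disk operations here and each repeat costs five.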

The following is a table summarizing the operations of both methods.

Operation | In-line: redundant block | In-line: new block | Post-process: redundant block | Post-process: new block
--- | --- | --- | --- | ---
Disk write | N/A | N/A | Write block | Write block
Disk read | N/A | N/A | Read block | Read block
CPU | Redundancy check | Redundancy check | Redundancy check | Redundancy check
Disk read* | Hash lookup | Hash lookup | Hash lookup | Hash lookup
Disk read (either bit comparison or second hash calc & lookup) | Possibly perform second redundancy check | Possibly perform second redundancy check | Possibly perform second redundancy check | Possibly perform second redundancy check
Disk write | N/A | N/A | Delete block | N/A
Disk write | N/A | Write block | N/A | N/A
Disk write* | Hash table update | Hash table update | Hash table update | Hash table update
Total disk ops | 2-3 | 3-4 | 6 | 5


*One vendor keeps their hash table in RAM, thus not requiring a disk read to do a hash lookup 
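To turn the totals row into a single number per method, you can weight it by the 95%/5% mix of redundant and new blocks used above. The sketch below copies the totals exactly as the table gives them; the function name and the (min, max) layout are my own. Note that the post-process totals in the table appear to include the optional second check, so they have no range.

```python
# (min, max) disk operations per block, copied from the "Total disk ops" row.
TOTAL_DISK_OPS = {
    ("in-line", "redundant"): (2, 3),
    ("in-line", "new"): (3, 4),
    ("post-process", "redundant"): (6, 6),
    ("post-process", "new"): (5, 5),
}


def average_disk_ops(method, redundant_fraction):
    """Weighted (min, max) disk ops per incoming block for one method."""
    red_lo, red_hi = TOTAL_DISK_OPS[(method, "redundant")]
    new_lo, new_hi = TOTAL_DISK_OPS[(method, "new")]
    new_fraction = 1.0 - redundant_fraction
    low = redundant_fraction * red_lo + new_fraction * new_lo
    high = redundant_fraction * red_hi + new_fraction * new_hi
    return low, high


for method in ("in-line", "post-process"):
    low, high = average_disk_ops(method, 0.95)
    print(f"{method}: {low:.2f} to {high:.2f} disk ops per block")
# in-line: 2.05 to 3.05 disk ops per block
# post-process: 5.95 to 5.95 disk ops per block
```

By this rough count, post-process does about two to three times the disk operations per block, which is the I/O gap the table is meant to show.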

Pros and Cons

Everyone agrees that the job isn't done until the de-dupe's done. 

The advantage of an in-line appliance is that it has a lot less I/O work to perform. The table above illustrates this.  In-line vendors believe that post-process vendors will run out of hours in the day to de-dupe.  The disadvantage of the in-line method is that it must all be done within a single host, and the work must be performed very quickly in order to "get out of the way" of the incoming data stream, so as not to slow it down.

The advantage of the post-process method is that it can subdivide the data to be de-duped, and delegate it to as many de-duping processors/hosts as it needs in order to get the job done. The disadvantage is that it does have a lot more I/O work to do, as can be seen in the above table.
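As an illustration of what "subdivide and delegate" could look like, here is a small sketch that splits a night's virtual tapes across several de-dupe workers. The greedy largest-first assignment, the function name, and the tape labels and sizes are all hypothetical; I'm not describing any vendor's actual scheduler.

```python
import heapq


def assign_tapes(tape_sizes_gb, worker_count):
    """Greedy split: give each tape, largest first, to the least-loaded worker."""
    heap = [(0, worker) for worker in range(worker_count)]  # (assigned GB, worker id)
    heapq.heapify(heap)
    plan = {worker: [] for worker in range(worker_count)}
    for tape, size in sorted(tape_sizes_gb.items(), key=lambda item: -item[1]):
        load, worker = heapq.heappop(heap)
        plan[worker].append(tape)
        heapq.heappush(heap, (load + size, worker))
    return plan


# Hypothetical tape labels and sizes in GB.
tapes = {"VT001": 400, "VT002": 300, "VT003": 300, "VT004": 200, "VT005": 100}
plan = assign_tapes(tapes, worker_count=2)
print(plan)  # {0: ['VT001', 'VT004', 'VT005'], 1: ['VT002', 'VT003']}
```

Adding a worker just means calling the function with a larger `worker_count`, which is the scaling argument the post-process vendors make.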

The proof is in the pudding

If your backups aren't slowed down, and you don't run out of hours in the day, does it matter which method you choose?  I don't think so.  Which is why I think this argument is kind of pointless.  What matters is whether or not it works for you. Test anything with your biggest workload and see how it does.
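Those two questions reduce to simple arithmetic once you have measured rates from your own testing. The sketch below is a back-of-the-envelope check only: the function name and every number in the example are made up, and it assumes steady throughput and that all de-dupe work must finish within 24 hours.

```python
def fits_in_a_day(nightly_backup_tb, backup_rate_mb_s, dedupe_rate_mb_s, window_hours):
    """Return (backup_hours, dedupe_hours, ok) for one night's workload."""
    total_mb = nightly_backup_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    backup_hours = total_mb / backup_rate_mb_s / 3600
    dedupe_hours = total_mb / dedupe_rate_mb_s / 3600
    # The backup must fit its window, and de-dupe must not run out of hours in the day.
    ok = backup_hours <= window_hours and dedupe_hours <= 24
    return backup_hours, dedupe_hours, ok


# Hypothetical workload: 20 TB a night, 8-hour backup window.
backup_h, dedupe_h, ok = fits_in_a_day(20, backup_rate_mb_s=800,
                                       dedupe_rate_mb_s=300, window_hours=8)
print(f"backup {backup_h:.1f} h, de-dupe {dedupe_h:.1f} h, fits: {ok}")
# backup 6.9 h, de-dupe 18.5 h, fits: True
```

For an in-line device the two rates are the same number, since the de-dupe happens as the data arrives. Drop the de-dupe rate in this example to 200 MB/s and the de-dupe time climbs past 24 hours, which is the "run out of hours in the day" case.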

If the device you buy meets your requirements, who cares what's under the hood?  I don't think you should have to think about in-line vs post-process.  I think you should care about how big it is, how fast it is, and how much it costs.   (Remember that cost comes from many areas well beyond acquisition/depreciation cost.  You need to factor in how easy the product is to install and configure with your backup software, as well as how easy it is to manage things like lost drives, growth, etc.)

A device won't necessarily be faster or slower because it's in-line or post-process.  It will be faster or slower based on how the vendor implemented it.  I've seen slow and fast devices on both sides.  So quit listening to vendor FUD and go try these things out already, then buy whichever one is right for you.   (And, yes, I think this is one area where third-party help from GlassHouse can help.  We're the one company that doesn't want to sell you any of this stuff, as we don't sell hardware or software.  We just want you to get equipment that works for you.)

Misconceptions/FUD

Vendors are fun, the way they throw FUD around about each other.  Some of them don't care if it's true; they only care if it gets them the sale.  (Some vendors are better/worse about this than others, of course.)  Here are the misconceptions/FUD that I've seen floating around about this issue.
 
An in-line de-dupe process will definitely slow down your incoming data stream.
I've seen many tests where backups sent to an in-line device were faster than sending those backups to tape, and comparable with the speed of going to non-de-duped disk.  I have seen some tests on some in-line appliances where the speed of an individual backup would be capped at 40-50 MB/s, but I've rarely seen any backup systems that can generate a 40-50 MB/s stream of data with a single job (no multiplexing).
Slowing down the incoming stream of data is bad.
Is it so bad if you slow the backup down a little bit?  Is it still faster than what you used to do (tape)?  Is it still fast enough to meet your backup window and RTO requirements?  The latter is the only thing that matters.  In addition, the in-line vendors want to make sure you remember that the backup's not really done until the de-dupe process is complete.  And if you have a poorly architected post-process machine, you may run out of hours in the day and not get fully de-duped.  That would, of course, significantly reduce your de-dupe ratio if it happened on a regular basis.
Post-process de-dupe process happens after all backups have been completed.
This is one of the most common misconceptions about post-process, and it comes from the name -- which is why I prefer the term asynchronous over post-process.  Post-process systems typically wait until a given virtual tape is not being used before de-duping it. (Some wait until it's not being written to; others wait until it's not being used at all.)  This typically happens when the tape is full or the backup is complete. Depending on how you set up your system, this may cause an initial delay of a few minutes to a few hours.  This delay is measured from the start time of the first set of backups in a given night. Once backups get going, and the first tape is full/done, there should be no further delay.  By the time it de-dupes the first tape, the next tape will be ready for de-duping.
A post-process de-dupe process running at the same time as your backup will slow down your backup.
This should not be an issue for a properly sized system.  If it becomes a problem for any given environment, the de-dupe process can be offloaded to one or more completely separate servers/heads.  Although they will be accessing the same disk arrays as the backups, they will not be accessing the same areas of disk, as they'll be reading from backups that have already been written.  Incoming backups will go to their own disks, so there are no disk contention issues either.
Post-process de-dupe process must be done before you copy backups to another device.
Like a lot of FUD, this one is an exaggeration of the truth.  One post-process VTL does stop de-duping a given tape if it is put into a drive (for read or write), whereas another one stops de-duping a given tape only if it is being written to.  The former believes that they should not "compete" with the backup by reading from the same disks that the restore/copy could be reading from.  The latter believes that their testing has shown that this is not a problem. 
 
The reason why neither is really an issue is that you typically write to more tapes in a night than you have drives.  As long as that's the case, there will always be tapes to de-dupe while others are being copied from, and the post-process de-dupe process is smart enough to de-dupe the tapes that aren't being used while it's waiting for the tapes that are being used.  (Again, this is only a potential concern with a post-process vendor that halts de-dupe if the tape is being read from, and not all of them do that. And, at least one vendor will even be removing the concurrent write issue soon.  They'll de-dupe the part of the tape that's already written while the backup process is writing to the other part of the tape.)
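The two vendor behaviors described above come down to a different rule for which tapes are eligible for de-dupe at a given moment. Here is a minimal sketch of that rule; the `VirtualTape` fields, the `pause_on_read` flag, and the tape labels are hypothetical names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class VirtualTape:
    label: str
    being_written: bool = False
    being_read: bool = False
    deduped: bool = False


def eligible_tapes(tapes, pause_on_read):
    """Labels of tapes a post-process engine could work on right now.

    pause_on_read=True models the product that skips any tape loaded in a drive;
    pause_on_read=False models the one that only skips tapes being written.
    """
    result = []
    for tape in tapes:
        if tape.deduped or tape.being_written:
            continue
        if pause_on_read and tape.being_read:
            continue
        result.append(tape.label)
    return result


tapes = [
    VirtualTape("VT001", being_read=True),     # being copied to another device
    VirtualTape("VT002", being_written=True),  # backup still writing
    VirtualTape("VT003"),                      # full and idle
    VirtualTape("VT004", deduped=True),        # already processed
]
print(eligible_tapes(tapes, pause_on_read=True))   # ['VT003']
print(eligible_tapes(tapes, pause_on_read=False))  # ['VT001', 'VT003']
```

Either way there is still a tape to work on, which is the point: as long as you write more tapes in a night than you have drives, the de-dupe engine has something to do.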

As you can see, this is not as easy as "one is good, the other is bad."  Again, my summary statement is that in-line and post-process are features/inner workings -- not benefits.  I think you should make your purchase decisions based on how fast, big, and expensive the product is -- not what its de-dupe engine does underneath.

Comments
ddierickx - and the other way around   | Registered | 2007-08-29 05:38:32
i guess there are systems that do the same thing for restore or not? (i don't have any experience with de-duping yet).

i mean, don't restore times suffer when the software tries to assemble all the parts from all over the place to build a 'true' image of the system?

on traditional backup, you restore from your image and everything is in that one image, all bits and bytes neatly after each other and the restore goes fa-a-a-ast.

restoring one file in a de-dup setup will probably be fast enough not to notice, but i can't imagine it being just as fast when restoring a whole filesystem for example.

ack, maybe i shouldn't talk about things i have never used before
cpreston - Actually, you may be surprised   | Super Administrator | 2007-08-29 09:22:16
Because it's disk, and reading from more disks is faster than reading from fewer disks, AND because they know they can't help the backup and hurt the restore.

This should be a question for whatever vendor(s) you're considering, but most of the vendors have an answer that says restore is either the same or faster than it was before de-dupe.