In-line or post-process de-duplication? (updated 6-08)

In-line and post-process de-duplication are features, not benefits. And I think that arguing about which one is better is like arguing about which is better: synchronous or asynchronous replication. They both have benefits and drawbacks. What matters is whether or not the one you buy meets your requirements, right? I’ll try my best to present both sides of the argument and dispel a lot of what I believe to be misconceptions about this issue.

This blog is in a series. The previous entry is De-duplication & remote restores and the next entry is De-dupe targets in TSM environments?

 

First, let me describe both processes.

In-line (or as I like to call it, synchronous) de-duplication

A block of data comes into the appliance (as part of a larger stream of data from a backup). While it’s in RAM, the appliance does its magic to figure out whether it’s seen that block before. If it has seen it before, it writes a pointer somewhere saying that it’s seen it again. If it hasn’t seen the block of data before, it writes the new block. Job complete.
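
To make the flow concrete, here’s a minimal sketch of the synchronous approach in Python. It assumes a toy content-addressed store keyed by SHA-256 fingerprints; the file layout and function names are invented for illustration, and real appliances chunk the stream into variable-length blocks and keep far more elaborate indexes.

```python
import hashlib

hash_index = set()   # fingerprints of blocks already stored (often held in RAM)
recipe = []          # ordered list of fingerprints that reconstructs the backup stream

def write_block_to_disk(fp: str, block: bytes) -> None:
    # stand-in for the appliance's block store (invented path, for illustration only)
    with open(f"/tmp/dedupe-store-{fp}", "wb") as f:
        f.write(block)

def ingest_block(block: bytes) -> None:
    fp = hashlib.sha256(block).hexdigest()
    if fp not in hash_index:
        write_block_to_disk(fp, block)   # new, unique block: one disk write
        hash_index.add(fp)
    # redundant block: never written to the store; only the pointer below is recorded
    recipe.append(fp)
```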

This method allows most of the hard work to be done in RAM, which minimizes I/O overhead. The only disk operation that’s done 100% of the time is the hash lookup (except for in-line vendors that can keep their hash table in RAM). (Some products must also do a read of the block the new block apparently matches in order to do a bit-level verification that the two blocks are the same before discarding the new block.) 90% of the time (when the block matches a block already seen, assuming a 10:1 de-dupe ratio), it requires one write to the disk to update the hash table, and that’s it. It just throws away the redundant block and never writes it to disk. 10% of the time (when it’s a new, unique block), it requires one write to disk and one write to the hash table.

Post-process (or as I like to call it, asynchronous) de-duplication

A block of data comes into the appliance (as part of a larger stream of data from a backup), and it is written to disk in its entirety. Then a separate process (running asynchronously and possibly from another appliance that’s accessing the same disk) reads the block of data and does its magic to see if it’s ever seen that block of data before. If it has, it deletes one of the redundant blocks of data and replaces it with a pointer. If it hasn’t seen it, it makes no changes.
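
And here’s an equally minimal sketch of the asynchronous approach, assuming each block first lands on disk as its own file and a later pass rewrites duplicates as pointers (symlinks in this toy example). Again, the layout is invented purely for illustration.

```python
import hashlib
import os

def dedupe_pass(landing_dir: str) -> None:
    seen = {}                                  # fingerprint -> path of the copy we keep
    for name in sorted(os.listdir(landing_dir)):
        path = os.path.join(landing_dir, name)
        if os.path.islink(path):
            continue                           # already replaced by a pointer on an earlier pass
        with open(path, "rb") as f:            # read back the block that already landed on disk
            fp = hashlib.sha256(f.read()).hexdigest()
        if fp in seen:
            os.remove(path)                    # delete the redundant copy...
            os.symlink(seen[fp], path)         # ...and leave a pointer in its place
        else:
            seen[fp] = path                    # unique block: leave it where it landed
```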

This method requires a lot more I/O than the in-line method. It writes 100% of all incoming blocks to disk. It then reads 100% of those blocks back from disk so that it can check them for commonality, which requires another disk read to check the hash table (or whatever database is tracking previously-seen data). (Some vendors require an additional read here for bit-level verification of commonality.) If a block matches one already seen (90% of the time, using the same 10:1 ratio), it requires another write to delete the duplicate copy and another write to update the hash table. If it doesn’t match (10% of the time), it requires a write to the hash table (or whatever database is tracking previously-seen data).

The following is a table summarizing the operations of both methods.

 

 

| Operation | In-line: redundant block | In-line: new block | Post-process: redundant block | Post-process: new block |
|---|---|---|---|---|
| Disk write | N/A | N/A | Write block | Write block |
| Disk read | N/A | N/A | Read block | Read block |
| CPU | Redundancy check | Redundancy check | Redundancy check | Redundancy check |
| Disk read* | Hash lookup | Hash lookup | Hash lookup | Hash lookup |
| Disk read (either bit comparison or second hash calc & lookup) | Possibly perform second redundancy check | Possibly perform second redundancy check | Possibly perform second redundancy check | Possibly perform second redundancy check |
| Disk write | N/A | N/A | Delete block | N/A |
| Disk write | N/A | Write block | N/A | N/A |
| Disk write* | Hash table update | Hash table update | Hash table update | Hash table update |
| Total disk ops | 2-3 | 3-4 | 6 | 5 |

 

*Some vendors keep their hash table/database in RAM, thus not requiring a disk read to do a hash lookup.
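
As a rough sanity check on the table, here’s the per-block average you get if you assume the same 10:1 de-dupe ratio used above (90% of blocks redundant) and plug in the counts from the table; the arithmetic is mine, not a vendor’s.

```python
REDUNDANT = 0.90   # assumed fraction of redundant blocks (a 10:1 de-dupe ratio)

# per-block disk-op counts taken straight from the table above
inline_low   = REDUNDANT * 2 + (1 - REDUNDANT) * 3   # no second redundancy check needed
inline_high  = REDUNDANT * 3 + (1 - REDUNDANT) * 4   # with the extra verification read
post_process = REDUNDANT * 6 + (1 - REDUNDANT) * 5

print(f"in-line:      ~{inline_low:.1f}-{inline_high:.1f} disk ops per block")
print(f"post-process: ~{post_process:.1f} disk ops per block")
```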

Pros and Cons

Everyone agrees that the job isn’t done until the de-dupe’s done.

The advantage to the in-line appliance is that it has a lot less I/O work to perform, as the table above illustrates. In-line vendors believe that post-process vendors will run out of hours in the day to de-dupe due to this difference in workload.

Since in-line appliances dedupe the data the second it arrives at the appliance, they can also replicate it as soon as it arrives. Post-process vendors must wait until their element of work is complete. For example, some start deduping when a virtual tape is full, or when each backup job completes. (Note that this does not mean waiting until all jobs are complete.) They must start deduping that tape or job before they can replicate it. Some can start replicating it as soon as they start deduping it, and others have to wait until that element of work is fully deduped before they can replicate it. This will add a delay of minutes to hours to the replication process for a post-process device; whether a delay of minutes or hours is important is a decision left to the reader.
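
To put a rough number on that delay, here’s a back-of-the-envelope calculation; the virtual tape size, backup speed, and de-dupe speed below are all assumptions, so substitute your own.

```python
TAPE_GB     = 400    # assumed virtual tape capacity
BACKUP_MBPS = 100    # assumed incoming backup speed
DEDUPE_MBPS = 200    # assumed de-dupe engine speed

fill_min   = TAPE_GB * 1024 / BACKUP_MBPS / 60   # time until the first tape is full
dedupe_min = TAPE_GB * 1024 / DEDUPE_MBPS / 60   # time to de-dupe that tape

print(f"first tape full:       ~{fill_min:.0f} minutes after backups start")
print(f"replication can begin: ~{fill_min + dedupe_min:.0f} minutes in, "
      f"if the device must finish de-duping the tape first")
```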

In-line devices also do not need a landing zone for the original data. Post-process devices typically have a landing zone big enough to hold one night’s backups. You can get by with a smaller landing zone if you have enough processing power to keep up with the backups, but most people will opt for a landing zone big enough for one night’s backups. Depending on your environment, this may be a significant amount of disk.
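
Here’s a quick, illustrative sizing sketch for that landing zone, assuming you follow the common practice of sizing it for one night’s backups; all of the figures are made up.

```python
LARGEST_NIGHTLY_TB = 20     # assumed size of your biggest night (e.g., weekend fulls)
HEADROOM           = 1.25   # assumed cushion for growth and schedule overlap

landing_zone_tb = LARGEST_NIGHTLY_TB * HEADROOM
print(f"landing zone: ~{landing_zone_tb:.0f} TB of un-deduped disk")
```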

The big concern about inline devices is that because they are in the data path, they can slow down the incoming backup if they are unable to keep up with its speed. Some inline dedupe devices absolutely slow down some jobs — particularly large, single stream backups. Only testing will prove whether or not this is a concern for your environments.

The advantage of the post-process method is that it should NEVER get in the way of any incoming backup speed. (As of this writing, the fastest published ingest speeds are provided by post-process systems, so there must be something to that claim.) It can also subdivide the data to be de-duped and delegate it to as many de-duping processors/hosts as it needs in order to get the job done. This can be a real advantage on large, very fast, single-stream backups.
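
Here’s a minimal sketch of that kind of delegation, assuming the engine can hand each virtual tape to a separate worker; the paths, worker count, and placeholder dedupe_tape() function are all invented for illustration.

```python
from multiprocessing import Pool

def dedupe_tape(tape_dir: str) -> str:
    # placeholder: a real engine would read the tape back, hash its blocks, and
    # replace duplicates with pointers (see the post-process sketch earlier)
    return f"deduped {tape_dir}"

TAPE_DIRS = [f"/backups/vtape{i:03d}" for i in range(1, 9)]   # assumed tape layout

if __name__ == "__main__":
    with Pool(processes=4) as pool:                # assumed: four de-dupe processes/heads
        for result in pool.map(dedupe_tape, TAPE_DIRS):
            print(result)
```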

Another advantage is that it allows for a staggered implementation of dedupe. If you purchase an inline device, dedupe is always on. With a post-process device, you could have it off in production and on in the lab until you really feel comfortable with it. Then when you turn it on in production, you can dedupe all the data you’ve already written. That’s not possible with an inline device.

If you have any concerns about the integrity of your dedupe system, those concerns would be largely removed if you could keep a copy on tape that’s never been touched by dedupe. If you had a post-process system, you could copy last night’s backup to tape before it’s deduped. This isn’t possible with an inline system.

If the dedupe vendor uses forward referencing (only possible with post process), it also leaves last night’s backups in their entirety, even after dedupe. It also leaves more recent data more contiguous than older data. This can have a big advantage during large restores of recent data. Consider also the previous paragraph about integrity concerns.  Since forward-referencing vendors leave last night’s backup in its original, un-deduped state even after it has been deduped, this would also allow you to copy last night’s un-deduped backup to tape before, during, or after it has been deduped.
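
Here’s a toy illustration of the difference, assuming each backup is just a list whose entries are either real block data or a pointer into another backup; the structures and names are mine, not any vendor’s.

```python
def record_duplicate(old_backup: list, old_idx: int,
                     new_backup: list, new_idx: int,
                     block: bytes, forward: bool) -> None:
    if forward:
        # forward referencing: the newest backup keeps the real block and the OLD
        # copy becomes a pointer to it, so last night's backup stays whole and the
        # most recent data stays the most contiguous on disk
        new_backup[new_idx] = block
        old_backup[old_idx] = ("ptr", "new", new_idx)
    else:
        # reverse (conventional) referencing: the older backup keeps the real block
        # and the NEW copy is stored only as a pointer back to it
        new_backup[new_idx] = ("ptr", "old", old_idx)
```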

Another advantage of the inline approach is simplicity.  With inline, you get what you get.  You get a certain amount of disk and throughput, and that’s it.  With post process vendors, you have to make some decisions that you don’t have to make with inline vendors.  How much staging area do you want?  How fast do you want the dedupe process to go?  Do you want to do the dedupe in the same head as the ingest?  These are all questions that must be answered when configuring a post process dedupe system.

Finally, the post process approach allows for some “funny numbers” on marketing sheets.  Some vendors advertise ingest rates significantly faster than their dedupe engine can handle.  One vendor, for example, advertises that a system can ingest data at 4 TB/hr, when the dedupe engine can only run at 500 MB/s.  That means you can only use the system for three hours a day and still keep up with the incoming data at the ingest rate they’re advertising.  But someone is likely to compare the 4 TB/hr number with an inline vendor’s 1 TB/hr number, and think that the other vendor is faster when they’re really not.

My summary of the advantages and disadvantages of both approaches is as follows:

The advantages of inline are:

  • Less I/O work to perform
  • When you’re done, you’re done
  • Data can be replicated the second it shows up
  • Simpler configuration
  • No landing zone required

The disadvantages of inline are:

  • Possibly slow down the incoming backup speed
  • Does not allow for forward-referencing approach

The advantages of postprocess are:

  • No concerns about slowing down incoming backup speed
  • Allows for staggered implementation of dedupe
  • Allows you to copy last night’s backups in their original format
  • Allows forward-referencing approach (if desired)

The disadvantages of post process are as follows:

  • It does have a lot more I/O work to do, as can be seen in the above table
  • It requires the landing zone disk
  • It requires more configuration than an inline approach
  • It allows the vendor to advertise numbers that aren’t quite real
  • It will delay replication to a remote site by minutes or hours, depending on which product we’re talking about.

The proof is in the pudding

If your backups aren’t slowed down, and you don’t run out of hours in the day, does it matter which method you chose? I don’t think so, which is why I think this argument is kind of pointless. What matters is whether or not it works for you. Test any device you’re considering with your biggest workload and see how it does.

If the device you buy meets your requirements, who cares what’s under the hood? I don’t think you should have to think about in-line vs post-process. I think you should care about how big it is, how fast it is, and how much it costs. (Remember that cost includes much more than the acquisition/depreciation cost. You need to factor in how easy the product is to install and configure with your backup software, as well as how easy it is to manage things like lost drives, growth, etc.)

A device won’t necessarily be faster or slower because it’s in-line or post-process. It will be faster or slower based on how the vendor implemented it. I’ve seen slow and fast devices on both sides. So quit listening to vendor FUD and go try these things out already, then buy the one that’s right for you. (And, yes, I think this is one area where third-party help from GlassHouse can help. We’re the one company that doesn’t want to sell you any of this stuff, as we don’t sell hardware or software. We just want you to get equipment that works for you.)

Misconceptions/FUD

Vendors are fun in the way they throw FUD around about each other. Some of them don’t care if it’s true; they only care if it gets them the sale. (Some vendors are better/worse about this than others, of course.) Here are the misconceptions/FUD that I’ve seen floating around about this issue.

An in-line de-dupe process will definitely slow down your incoming data stream.
I’ve seen many tests where backups sent to an in-line device were faster than sending those backups to tape, and comparable with the speed of going to non-de-duped disk. I have seen some tests on some in-line appliances where the speed of an individual backup would be capped at 40-50 MB/s, but I’ve rarely seen any backup systems that can generate a 40-50 MB/s stream of data with a single job (no multiplexing).
Slowing down the incoming stream of data is bad.
Is it so bad if you slow the backup down a little bit? Is it still faster than what you used to do (tape)? Is it still fast enough to meet your backup window and RTO requirements? The latter is the only thing that matters. In addition, the in-line vendors want to make sure you remember that the backup’s not really done until the de-dupe process is complete. And if you have a poorly architected post-process machine, you may run out of hours in the day and not get fully de-duped. That would, of course, significantly reduce your de-dupe ratio if it were happening on a regular basis.
Post-process de-dupe process happens after all backups have been completed.
This is one of the most common misconceptions about post-process, and it comes from the name, which is why I prefer the term asynchronous over post-process. Post-process systems typically wait until a given virtual tape is not being used before de-duping it. (Some wait until it’s not being written to; others wait until it’s not being used at all.) This typically happens when the tape is full or the backup is complete. Depending on how you set up your system, this may cause an initial delay of a few minutes to a few hours. This delay is from the start time of the first set of backups in a given night. Once backups get going, and the first tape is full/done, there should be no delay after that. By the time it de-dupes the first tape, the next tape will be ready for de-duping.
A post-process de-dupe process running at the same time as your backup will slow down your backup.
This should not be an issue for a properly sized system. If it becomes a problem for any given environment, the de-dupe process can be offloaded to one or more completely separate servers/heads. Although they will be accessing the same disk arrays as the backups, they will not be accessing the same areas of disk, as they’ll be reading from backups that have already been written. Incoming backups will go to their own disks, so there are no disk contention issues either.
Post-process de-dupe process must be done before you copy backups to another device.
Like a lot of FUD, this one is an exaggeration of the truth. One post-process VTL does stop de-duping a given tape if it is put into a drive (for read or write), whereas another one stops de-duping a given tape only if it is being written to. The former believes that they should not “compete” with the backup by reading from the same disks that the restore/copy could be reading from. The latter believes that their testing has shown that this is not a problem.

The reason why neither is really an issue is that you typically write to more tapes in a night than you have drives. As long as that’s the case, there will always be tapes to de-dupe while others are being copied from, and the post-process de-dupe process is smart enough to de-dupe the tapes that aren’t being used while it’s waiting for the tapes that are being used. (Again, this is only a potential concern with a post-process vendor that halts de-dupe if the tape is being read from, and not all of them do that. And, at least one vendor will even be removing the concurrent write issue soon. They’ll de-dupe the part of the tape that’s already written while the backup process is writing to the other part of the tape.)

As you can see, this is not as easy as “one is good, the other is bad.” Again, my summary statement is that in-line and post-process are features and inner workings, not benefits. I think you should make your purchase decisions based on how fast, big, and expensive the product is, not on what its de-dupe engine does underneath.

This blog is in a series. The previous entry is De-duplication & remote restores and the next entry is De-dupe targets in TSM environments?

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

7 comments
  • i guess there are systems that do the same thing for restore or not? (i don’t have any experience with de-duping yet).

    i mean, don’t restore times suffer when the software tries to assemble all the parts from all over the place to build a ‘true’ image of the system?

    on traditional backup, you restore from your image and everything is in that one image, all bits and bytes neatly after each other and the restore goes fa-a-a-ast.

    restoring one file in a de-dup setup will probably be fast enough not to notice, but i can’t imagine it being just as fast when restoring a whole filesystem for example.

    ack, maybe i shouldn’t talk about things i have never used before 😆

  • Because it’s disk, and reading from more disks is faster than reading from fewer disks, AND because they know they can’t help the backup and hurt the restore.

    This should be a question for whatever vendor(s) you’re considering, but most of the vendors have an answer that says restore is either the same or faster than it was before de-dupe.

  • I’m not clear on one of your points here. If you agree that most companies that buy VTL’s to speed up their backup and recovery operations (and don’t want to get off of their tape backup software) then the distinction between inline vs. post process is a critical issue.

    If the goal is faster backup, inline deduplication (which adds a process into the backup path) typically will not be faster than writing to a VTL, right?

    I agree with you that inline dedupe is faster than writing to tape, but I don’t believe it is faster than writing to a VTL.

  • Inline and VTL are not mutually exclusive. As I said in my comments to that OTHER article, there are inline VTLs (diligent & data domain) and there are post process filesystem-based devices (Quantum when run at advertised speeds).

    I’ve tested both types of systems, but I have not done any large side-by-side comparison of inline vs post-process ingest speeds. Maybe someone else has and can comment.

    BTW, which comment are you talking about that you are unclear on? You’re unclear as to which one you’re unclear on. 😉

  • Sorry… should have quoted… I was talking about “An in-line de-dupe process will definitely slow down your incoming data stream.”

    BTW… I completely agree that “Inline and VTL are not mutually exclusive”.

  • The quote you’re quoting is under the commonly held misconceptions section. That means that the statement is NOT true. I then go on to explain WHY it’s not true.

  • Great article thanks for the information. Does data deduplication affect this?