I remember when I first started talking to Quantum about dedupe, they were trying to call their “immediate” deduplication “inline” because it happens at the same time as the backup. They eventually stopped referring to it as inline, because it does not meet the definition of inline dedupe that was around long before they came out with their product. Unfortunately, now that EMC is selling their Quantum-based product, they’re apparently trying to do the same thing, or at least one of their bloggers is. As usual, I’m drawing a very thick line between inline and post-process.
The definition of inline deduplication (which was decided at least five years before EMC entered the target dedupe market by OEMing Quantum) is dedupe that is done in such a way that the native, non-deduped data is never written to disk – ever. If your product ever writes the data in its native format, then it’s doing post-process. The EMC product always writes data in its native format (according to the blog post referenced above), so it’s a post-process product. (I have been told that the Quantum 7500 can do true inline dedupe up to about 150 MB/s, but I haven’t verified that. See my other blog post about that. But it appears that the EMC products based on Quantum are not configured to work that way.)
Since I have been quoted many times as saying that I don’t care whether you use post-process or inline, why do I care if EMC calls what they do inline? The first reason is that it is confusing when a term has been used one way for many years and a newcomer comes along and starts using it to mean something else. I’m trying to help people understand the market, and when someone redefines a term like that (particularly a large vendor), it makes that job much harder.
The second reason that the differentiation between inline and post-process is important is that post-processing systems can get “behind” and inline systems cannot. It’s analogous to asynchronous replication. Since asynchronous replication acknowledges the write to the application as soon as the local write is done (whether it’s been replicated or not), the replicated copy can get “behind” the primary copy by anywhere from seconds to hours. In fact, asynchronously replicated systems sometimes get so out of sync that they cannot catch up.
Just like asynchronous replication, a post-processing dedupe system lets native data land on disk faster than the system can dedupe it, leaving a backlog of dedupe work at the end of the backup window that doesn’t exist in an inline system. It is possible (depending on the environment and the dedupe system) that the system could get so far behind that it would never be able to catch up. This is especially true of systems whose ingest rate is significantly faster than their dedupe rate. If a system can ingest data at 4 TB/hr, but can only dedupe it at 1 TB/hr, you could get it impossibly behind by backing up 4 TB/hr for 10 hours. It would have 40 TB to dedupe in 24 hours, and it can only dedupe 24 TB in 24 hours, so 16 TB would be left over every day. This is not to say that this is BAD. It just means it’s something you have to plan for in a post-processing system that you don’t have to plan for in an inline system.
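The arithmetic in that example can be sketched in a few lines of Python. This is a rough sketch and not a sizing tool: the function names are hypothetical, and it assumes (as the example does) that the system dedupes around the clock at a constant rate and that every day’s backup is the same size.

```python
def daily_backlog_tb(ingest_tb_per_hr, dedupe_tb_per_hr, backup_hours, cycle_hours=24):
    """Native data (TB) still waiting to be deduped after one backup cycle.

    Assumes constant rates and that dedupe runs for the whole cycle.
    """
    ingested = ingest_tb_per_hr * backup_hours      # e.g. 4 TB/hr * 10 hr = 40 TB
    capacity = dedupe_tb_per_hr * cycle_hours       # e.g. 1 TB/hr * 24 hr = 24 TB
    return max(0, ingested - capacity)


def backlog_after_days(ingest_tb_per_hr, dedupe_tb_per_hr, backup_hours, days, cycle_hours=24):
    """Backlog (TB) after several identical daily cycles; leftovers carry over."""
    ingested = ingest_tb_per_hr * backup_hours
    capacity = dedupe_tb_per_hr * cycle_hours
    backlog = 0
    for _ in range(days):
        # Yesterday's leftover work plus today's backup, minus what can be deduped today.
        backlog = max(0, backlog + ingested - capacity)
    return backlog


if __name__ == "__main__":
    print(daily_backlog_tb(4, 1, 10))          # leftover after one day
    print(backlog_after_days(4, 1, 10, 7))     # leftover after a week
```

If the daily leftover is greater than zero, the backlog grows every day and the system never catches up. If it is zero, the post-processing system finishes its dedupe work before the next cycle, which is the case you want to plan for.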
The final reason that it is important to differentiate between these two types of dedupe is the difference in the amount of I/O each system must perform. Post-processing systems actually have to perform about 300% more I/O than inline systems. (I blogged about this here.) This extra I/O has an infrastructure cost that will be reflected in the price of the system.
This is not to say that I prefer inline systems, or that I think post-processing systems are poorly designed. I actually really like some of the post-processing systems. There are advantages to both methods, but a casual reader who is told that a system is inline might assume that it has the advantages of an inline system. If it’s truly an inline system, then it will. If it’s actually a post-processing system, it won’t.
EMC can call what they do “immediate” or “concurrent” (the way SEPATON does), but they can’t call what they do “inline.” It’s misleading. “That’s all I’ve got to say about that…”
----- Signature and Disclaimer -----
Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technologist at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.