What is deduplication and how does it work? (Backup to Basics series)

In our latest episode of the Backup to Basics series, we talk about what I think is the most important invention in my career: deduplication. Without dedupe, much of what we do in backup and recovery, and disaster recovery, would simply not be possible. Without dedupe there really is no disk backup market; there is no cloud backup market. I’d be out of a job! What is dedupe, anyway, and how does it work? What are the different kinds of dedupe and does that matter? You should learn a lot about this important topic.



On this episode of Restore it All we talk about what I think is the biggest advancement in backup and recovery technology during my career, and that’s deduplication. I hope you enjoy the episode. Hi, and welcome to Backup Central’s Restore it All podcast. I’m your host, W. Curtis Preston, aka Mr. Backup, and I have with me my network re-architect engineer, Prasanna Malaiyandi.

[00:00:45] Prasanna Malaiyandi: Hey, Curtis, whatever I could do to keep you safe, you know?

[00:00:49] W. Curtis Preston: You know what’s really funny is, like, I consider myself a pretty tech-savvy guy, and when we were talking today about, you know, how I’ve replaced a bunch of gear and I’m swapping out some stuff and moving some cables around, you were yelling at me. You were like, you can’t do that.

You can’t put the switch on the thing. And I was like, yeah, I can, like, what are you talking about? And it took me like a couple of seconds and I was like, oh, wait. You’re right. I can’t do that. I can’t put the router that’s gonna be my firewall on the same switch as my home LAN.

[00:01:35] Prasanna Malaiyandi: Yeah,

[00:01:37] W. Curtis Preston: I dunno what I was thinking. Yeah.

[00:01:42] Prasanna Malaiyandi: Just another topic that I know just a little bit about.

[00:01:48] W. Curtis Preston: I’m a little, I feel a little ashamed that that was. But I’m glad I talked to you about my, you know, as, as is the case with many subjects. I’m glad I talked to you about, you know, what I’m up to. Um,

[00:02:03] Prasanna Malaiyandi: Glad I could help.

[00:02:05] W. Curtis Preston: I have successfully purchased and configured... For the video watchers, let’s see if it makes it into the camera before the cable runs.

There it is, the ASUS AX6600, which is a mesh router. And I gotta say it’s much more better than what I had before, and it’s able, I’ve got two. It’s supposed to provide 5,500 square feet, but of course that’s, that doesn’t include drywall and two by fours, right?

[00:02:37] Prasanna Malaiyandi: It’s crazy how much signal degrades going through drywall. And the other thing people don’t realize is five gigahertz, like, degrades like no tomorrow.

[00:02:48] W. Curtis Preston: Right. Remind me, remind me why five gigahertz is better again.

[00:02:54] Prasanna Malaiyandi: It’s faster because it can handle more bandwidth, and also the channel is wider, so you can have more things talking at the same time. It’s just that as your frequency goes up, the distance goes down for the same power levels.

[00:03:08] W. Curtis Preston: So is this like DC versus AC?

[00:03:11] Prasanna Malaiyandi: Not quite DC versus AC. It’s more about, you need to pump as many things as possible into it, because with high frequency, right, it’s more per cycle than 2.4, which is less airtime, if you will.

[00:03:25] W. Curtis Preston: Right.

[00:03:26] Prasanna Malaiyandi: And so every sort of peak, you can send more out with the five gigahertz because you’re doing it more often.

[00:03:33] W. Curtis Preston: right.
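To put a rough number on the point that distance drops as frequency rises at the same power level, here is a minimal sketch using the standard free-space path loss formula. It ignores drywall, two-by-fours, antennas, and everything else in a real house, and the 10-meter distance and the 2.4 GHz and 5.0 GHz center frequencies are illustrative assumptions, not measurements from the episode.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def free_space_path_loss_db(distance_m: float, frequency_hz: float) -> float:
    """Free-space path loss in dB: 20 * log10(4 * pi * d * f / c)."""
    ratio = 4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S
    return 20.0 * math.log10(ratio)


# Same assumed distance, two different bands.
loss_24 = free_space_path_loss_db(10.0, 2.4e9)
loss_5 = free_space_path_loss_db(10.0, 5.0e9)
extra_loss_db = loss_5 - loss_24

print(f"2.4 GHz: {loss_24:.1f} dB")
print(f"5 GHz:   {loss_5:.1f} dB")
print(f"Extra loss at 5 GHz: {extra_loss_db:.1f} dB")
```

In free space the gap between the two bands depends only on the frequency ratio, so it stays the same at any distance; walls typically widen it further.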

[00:03:35] Prasanna Malaiyandi: And so it works a lot better. It’s just the distance isn’t as great. Now, I will tell people, so this is one of my, I’m gonna get up on my soapbox now, right? One of my rare soapbox events and tell people, a lot of times people think they need more wifi access points in their house to get coverage.

And to those people, I will say, plan out your network carefully. Put your devices where they matter. And also don’t put too many devices and don’t crank up the power all the way to high, because I know Curtis, you and I were talking about this when you’re looking at mesh, and it was like, imagine that your router can overpower your phone, your laptop, your iPad, so it’s screaming at the top of its lungs and your phone can barely even scream back at it.

And so that’s actually worse for your network and for airtime than actually sort of balancing out power.

[00:04:26] W. Curtis Preston: I just don’t know if, like, the stuff you’re talking about, like is. is that even, is that configuration option even on consumer class routers?

[00:04:36] Prasanna Malaiyandi: You’ll have sort of the low, medium, high power levels, but it takes time to fine-tune and tweak these, right? You have to walk around with a wifi analyzer on your phone, right? So Apple, with their iPhones, they ship, what is it, AirPort Utility, which has a wifi scan option, which will show you all the wifi networks and sort of the signal strength, and you basically have to walk around your house with that and be like, okay, where is it strong?

Where is it weak?

Right, to figure out the placement. That’s the ideal way, because what you want is you want coverage in the right places, because what you see is in a lot of high density housing areas, or even homes next to each other is most people end up with crummy wifi because their power is turned up so high, it bleeds into everyone else’s area such that everyone has a crappy time.

because then you get interference and then everyone sort of slows down and then it

[00:05:26] W. Curtis Preston: Right. Yeah. I got a lot of wifi. I got a lot of networks. Um, you know, um, yeah,

[00:05:35] Prasanna Malaiyandi: And for, and for the last bit, last bit of my soapbox is please, please, please do not use 40 megahertz channel widths on your 2.4 gigahertz channels. You do not need to use 40 megahertz and ruin everyone else’s connectivity. Please only use 20 megahertz bands for 2.4 gigahertz.

[00:05:55] W. Curtis Preston: Uh, I’ll see what I can do. But I have this new, you know, and again, I feel like a wireless newbie, but I have this new fancy router, right, where it automatically selects the right... Um, that’s

pretty cool.

[00:06:12] Prasanna Malaiyandi: Point to go. Yeah.

[00:06:13] W. Curtis Preston: Yeah. Well, not just that, but also

2.4 versus five. Yeah.

[00:06:17] Prasanna Malaiyandi: So actually all of this is part of the wifi standard, so the figuring out which access point, that’s part of the 802.11r standard. And I think that the band steering is also part of the standard as well.

[00:06:29] W. Curtis Preston: Yeah.

[00:06:29] Prasanna Malaiyandi: a lot of folks are implementing now.

Some devices don’t do well with band steering. It basically looks at sort of the difference between the five gigahertz and the 2.4 gigahertz and says, okay, which one should I pick? And most devices, if it’s seven decibels difference or more, then it’ll pick, uh, the higher the faster speed. And so that’s kind of how it tricks your devices into picking the right band.

[00:06:54] W. Curtis Preston: Interesting. Yeah, it’s kind of cool. Um, all I know is that I finally have a mesh that covers the two. Cuz my problem is that I have things in the garage, things embedded inside walls in the garage that need wifi, not just, not just inside walls. , I have a device that’s inside a wall, inside an electrical cabinet, inside a wall.

Right? I have a Sense, uh, app, or a device, and that’s deep inside my electrical, my circuit breaker box. Um, and this reached it, no problem. It had like two bars. Right. And the thing is, it’s only like 20 feet from...

[00:07:42] Prasanna Malaiyandi: yep.

[00:07:43] W. Curtis Preston: Right. But it’s, you know, a couple of drywall walls and some two by fours and some metal.

Uh, but it worked. That’s the important part, is that it worked. Um, yeah, so I think I might be in wifi heaven for a while. Um, and you too can be there for the low, low price of $350. That’s a two-node system. Um, yeah, I’m pretty happy. But, uh, that’s not what we’re talking about today.

[00:08:18] Prasanna Malaiyandi: really. We can talk about wifi all day if you want.

[00:08:21] W. Curtis Preston: Yeah. Well, you could talk about wifi all day. I feel really stupid when you’re talking about wifi, because I’m like, this is not my bailiwick. That’s a cool word, by the way, bailiwick. So I thought we’d talk about backups instead because that’s my world, and I feel comfortable knowing them.

Most people don’t know crap about this space, because, you know, they get the job as a junior person and then next thing you know, they become a real sysadmin or a network admin or, you know, a security admin or a DBA.

[00:08:57] Prasanna Malaiyandi: Yeah, well, except our listeners who are all awesome and probably experts in the backup field and know all about this.

[00:09:02] W. Curtis Preston: Well, certainly Daniel.

[00:09:05] Prasanna Malaiyandi: Hi Daniel.

[00:09:06] W. Curtis Preston: Hi Daniel. The backup anorak. Um, he better still be listening to the show since we call out to him every once in a while. Him and Stuart, although Stuart’s retired. I don’t think Stuart’s listening to our show. I only tell ’em when we talk about ’em. But, um, so we’re continuing in our Backup to Basics series. It’s been a couple of weeks, uh, as the kids say, it’s been a minute. I remember the first time I heard that thing, I was like, what are you talking about, a minute? Anyway. But yeah, it’s been a minute since we’ve done an episode of our Backup to Basics series, but I am looking down at the book and of course, for those of you that don’t know, basically we’re doing a podcast version of my book, Modern Data Protection.

Make sure it gets in camera here, from O’Reilly. Uh, you can purchase the print version from your favorite bookseller. Perhaps it’s one based in the Amazon, perhaps not. But if you would like an ebook version of it, you can get your own by going to druva.com/ebook. That’s d-r-u-v-a dot com slash ebook.

They will, of course, ask for your contact information and then email the crap out of you until you tell ’em to stop. But that is the price that you pay. Um, let’s talk about... oh yeah, and while we’re at it, I’ll throw out the disclaimer that this is an independent podcast. I work for Druva, Prasanna works for Zoom, but the opinions that you hear are ours, et cetera.

Please rate us by going to your, you know, most of you are on iTunes. Just scroll down to the bottom there, give us five or six stars and a comment. We love comments. And if you’d like to join the conversation, just contact me, W. Curtis Preston at Gmail, or wcpreston on Twitter.

[00:11:17] Prasanna Malaiyandi: What about LinkedIn?

[00:11:19] W. Curtis Preston: Oh yeah, LinkedIn. Uh, it’s linkedin.com, what is it, slash in slash Mr. Backup. Um, and by the way, my Twitter account already has multifactor authentication configured, not using SMS, as should you, especially now that they’re disabling that. So weird, the way they did that. What’s funny is I support the decision.

That’s just the way


[00:11:44] Prasanna Malaiyandi: way it came out. Yeah.

[00:11:47] W. Curtis Preston: Oh, Elon. Okay. So in our Backup to Basics series, we’re continuing on, and today we are talking about using disk and deduplication. You know, a couple weeks ago I hit 30 years in the backup industry, and I got interviewed by Chris Mellor

[00:12:09] Prasanna Malaiyandi: The Register and Blocks and Files.

[00:12:12] W. Curtis Preston: Yeah.

It’s for his Blocks and Files blog, and one of the questions was what I thought was the most important development in the backup industry since I joined. And to me, hands down, there’s not even a close second, and that is the invention of deduplication

[00:12:41] Prasanna Malaiyandi: Yep.

[00:12:42] W. Curtis Preston: and because. I, I can’t think of another technology in the backup space that has changed backup architecture more than deduplication, and I can think of many other things that we do that are only possible because deduplication is underneath them,

[00:13:09] Prasanna Malaiyandi: Oh yeah, definitely. Yeah. I don’t think we would be able to get, especially with the data growth and the size of these applications.

[00:13:19] W. Curtis Preston: Is data growing? Is

[00:13:20] Prasanna Malaiyandi: No, not at all. Right. I don’t think it would be possible to do. Like, I know, Curtis, you’ve talked previously about your early days, right, about trying to do a backup and being like, oh my God, how am I gonna do this full backup in a weekend?

[00:13:33] W. Curtis Preston: Yeah.

[00:13:34] Prasanna Malaiyandi: And just with the fact, and I know we’ll go and talk about more about deduplication, but yeah, just being able to now do that in a cost effective way, using new ways of actually doing the backups as well, which is enabled with deduplication.

[00:13:48] W. Curtis Preston: Yeah. So it’s like disk. You could argue that using disk in backups is the bigger advancement. But first off, it’s not really an advancement. It’s just, instead of tape, we’re gonna use disk,


[00:14:03] Prasanna Malaiyandi: was there to start with anyway. It was just that the cost was so high, especially given the type of workload you see with backups, where you’re doing periodic fulls or other things like that and keeping them for long periods of time. Are you going to spend, what, 40x or 30x on storage for your backup system versus your production?


That’s a hard sell.

[00:14:27] W. Curtis Preston: Yeah, cuz that’s a problem. One of the things that I remember from back in the day, I don’t remember really thinking about this lately, but back in the day I would say that for every gigabyte of primary storage, you had 20 gigabytes of backup storage. And so if you’re gonna do that with disk, even, you know, even once, many years ago.

Wow. At this point, it’s like 20 years ago. But even once they came out with this idea of, uh, SATA disk instead


[00:15:03] Prasanna Malaiyandi: nearline


[00:15:05] W. Curtis Preston: Right. Um, that helped bring the cost down significantly, but not as much as deduplication.

[00:15:13] Prasanna Malaiyandi: Yeah. Because even with those price differences, right? Maybe it was half the price or a third of the price, but once you add in that 20x that you talked about, right, Curtis, then that adds up. And it’s not only just the storage cost, it’s also you have to account for the power, the cooling, the floor space, right?

All the things that go into that system.
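The arithmetic behind that exchange can be sketched in a few lines. The 20-to-1 backup-to-primary multiplier and the "a third of the price" figure come from the conversation; the 1,000 GB of primary data, the arbitrary cost unit, and especially the 10:1 deduplication ratio are assumptions picked only to make the numbers concrete.

```python
def backup_disk_cost(primary_gb, backup_multiplier, price_per_gb, dedupe_ratio=1.0):
    """Cost of the disk needed to hold backups for a given amount of primary data."""
    logical_gb = primary_gb * backup_multiplier   # what the backups add up to
    physical_gb = logical_gb / dedupe_ratio       # what actually lands on disk
    return physical_gb * price_per_gb


PRIMARY_GB = 1_000
PRIMARY_PRICE = 1.0  # arbitrary cost unit per GB of primary disk (assumption)

primary_cost = PRIMARY_GB * PRIMARY_PRICE

# Cheaper nearline disk at a third of the price, but still 20x the capacity.
cheap_disk = backup_disk_cost(PRIMARY_GB, 20, PRIMARY_PRICE / 3)

# Same cheap disk with a hypothetical 10:1 dedupe ratio.
with_dedupe = backup_disk_cost(PRIMARY_GB, 20, PRIMARY_PRICE / 3, dedupe_ratio=10.0)

print(f"Cheap disk only:  {cheap_disk / primary_cost:.2f}x the cost of primary")
print(f"Cheap disk + dedupe: {with_dedupe / primary_cost:.2f}x the cost of primary")
```

Under these assumptions, cheaper disk alone still leaves backup storage costing several times what primary does, while the same disk with dedupe comes in below the cost of primary. Power, cooling, and floor space would scale with the physical capacity too.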

[00:15:33] W. Curtis Preston: Yeah. Yeah. Um, it’s funny, just sort of an afterthought about that post that Chris Mellor did about the 30 years. The one group that jumped on the article and just started retweeting all kinds of pieces of the article was the tape group, because I said really good things about tape.

And the thing is that, you know, I believe in all of those things, but all of the advancements that I’ve seen in backup in the last 20-plus years have been disk and deduplication. Right. So not everybody really understands what deduplication is.

Some people used to describe it like, well, it’s like compression. The way I remember it, it’s like macro compression. It’s like compression over time.

What do you think of that?

[00:16:44] Prasanna Malaiyandi: uh, I don’t quite like that, so, so, right.

[00:16:49] W. Curtis Preston: may be some old blog posts that I might have said that phrase, but go ahead.

[00:16:53] Prasanna Malaiyandi: So in my mind, right, deduplication is finding two identical segments and tossing one away, keeping only one copy, but still keeping a reference to it, so you still know you have two virtual copies, but one physical copy,

[00:17:10] W. Curtis Preston: Mm-hmm.

[00:17:11] Prasanna Malaiyandi: right? At a high level, that’s what I, and now

[00:17:13] W. Curtis Preston: you?

[00:17:14] Prasanna Malaiyandi: what is compression is taking an object, a singular object, and squeezing it into a smaller space.

[00:17:24] W. Curtis Preston: Right. But how do you understand how compression works? Cuz I Sure as hell don’t

[00:17:28] Prasanna Malaiyandi: Yeah, so typically you would run it through different types of algorithms, like LZ compression and all the rest, in order to look for patterns and throw away bits and compress it down. Now, the difference I would say between dedupe and compression, because they do sound the same,

[00:17:43] W. Curtis Preston: Yeah.

[00:17:44] Prasanna Malaiyandi: right? I would say one of the differences is with deduplication.

It’s more like a file system level compression, if you want to think of it that way, because it’s not just I’m taking this block. Yeah. It’s not just I’m taking this and I’m squeezing it down such that it could be, I just need to look at this and figure it out. Right. It’s a lot more complex than that.
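A small sketch can make the distinction concrete. It uses Python’s standard `zlib` module for compression and a plain dictionary as a stand-in for a deduplicating store; the two "backups" are deliberately built from poorly compressible bytes, and all the names here are hypothetical. It is an illustration of the idea, not how any particular product works.

```python
import hashlib
import zlib


def pseudo_random_bytes(seed: bytes, length: int) -> bytes:
    """Deterministic, poorly compressible bytes built from chained SHA-256 digests."""
    out = bytearray()
    block = seed
    while len(out) < length:
        block = hashlib.sha256(block).digest()
        out += block
    return bytes(out[:length])


backup_monday = pseudo_random_bytes(b"demo", 64 * 1024)
backup_tuesday = backup_monday  # nothing changed overnight

# Compression: each object is squeezed on its own, with no knowledge of the other.
compressed_total = len(zlib.compress(backup_monday)) + len(zlib.compress(backup_tuesday))

# Deduplication: identical segments are stored once; each backup keeps a reference.
store = {}        # fingerprint -> the single physical copy
references = []   # one entry per virtual copy
for backup in (backup_monday, backup_tuesday):
    fingerprint = hashlib.sha256(backup).hexdigest()
    store.setdefault(fingerprint, backup)
    references.append(fingerprint)

deduped_total = sum(len(data) for data in store.values())

print(f"Two backups, compressed separately: {compressed_total} bytes")
print(f"Two backups, deduplicated:          {deduped_total} bytes")
```

Compression gains nothing here because the data has no internal patterns, while dedupe halves the footprint because the second backup repeats the first. The two techniques are complementary, and real systems commonly apply both.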

[00:18:07] W. Curtis Preston: It is definitely a lot more complex than compression. Right. Um, I’ve just honestly never really dug into the code of how traditional compression works. So the idea is that I’m looking for duplicate segments of data across many places, both from different sources as well as different time periods, right? I’m comparing this chunk of data that’s coming in right now in tonight’s backup.

I’m comparing it literally with every chunk of data that I’ve ever received

from anywhere else.

[00:18:42] Prasanna Malaiyandi: I would say that’s an ideal system, but not everyone builds their deduplication that way. ,

[00:18:47] W. Curtis Preston: So,

[00:18:48] Prasanna Malaiyandi: where


[00:18:49] W. Curtis Preston: there are, yeah, go ahead.

[00:18:52] Prasanna Malaiyandi: Yeah. So it all goes down to sort of what is your deduplication domain is another term that some people talk about, right? Which is, is it limited to a system? Is it limited to a cluster which might be formed to multiple systems, or is it limited to sort of a single backup stream coming in?



[00:19:10] W. Curtis Preston: that the question is what is your data domain? Uh,

[00:19:13] Prasanna Malaiyandi: Yeah. D Domain.

[00:19:16] W. Curtis Preston: So let’s back up. So, as I understand it, basically we’re taking the data that’s coming in, or that’s going to come in, and we’re slicing it up into, I like the term chunks, right? We run those chunks through a cryptographic hashing algorithm.

SHA-1, SHA-256, whatever you’re using. On the other side of that, we get an alphanumeric value. The size of that value is going to be based on which algorithm you use.

In the case of SHA-1, it’s 160 bits, right? And you can’t reverse engineer it. You can’t take the 160 bits and turn it back into the chunk, but you can use that value to uniquely identify that chunk. And so if you have another chunk of data, regardless of where it came from, if its 160-bit value, again, that’s SHA-1, and other algorithms’ values are different sizes,

if its fingerprint is the same, you can say that this chunk is identical to that other chunk that had the same fingerprint, and you can then discard the other chunk, right?


[00:20:44] Prasanna Malaiyandi: Yeah, you can, you can discard the actual data, but you should still keep track of it somewhere in a file system, just because you need, still need

[00:20:52] W. Curtis Preston: Yeah. You’re gonna keep track. Oh, we found another one of these,


[00:20:57] Prasanna Malaiyandi: And so usually that lookup is in a deduplication index is what they called them. Usually a dedupe index, which keeps a list of, Hey, here are all the fingerprints that I have.
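The chunk, fingerprint, and index flow described above can be sketched as follows. This is a minimal illustration, not any vendor’s design: the `DedupeStore` class, the fixed 4 KB chunk size, and the in-memory dictionary standing in for the dedupe index are all assumptions. SHA-1 is used only because it is the example from the conversation.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupeStore:
    """Toy store: one dictionary acts as both dedupe index and chunk storage."""

    def __init__(self):
        self.index = {}  # fingerprint -> chunk bytes

    def write(self, data: bytes) -> list:
        """Store data and return its recipe: the ordered list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha1(chunk).hexdigest()
            if fingerprint not in self.index:
                self.index[fingerprint] = chunk   # first time seen: keep the data
            recipe.append(fingerprint)            # always keep the reference
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.index[fingerprint] for fingerprint in recipe)


store = DedupeStore()
first = bytes(CHUNK_SIZE) + b"A" * CHUNK_SIZE
second = bytes(CHUNK_SIZE) + b"B" * CHUNK_SIZE   # shares its first chunk with `first`

recipe_one = store.write(first)
recipe_two = store.write(second)

print(f"Chunks written: {len(recipe_one) + len(recipe_two)}")
print(f"Chunks stored:  {len(store.index)}")
print(f"Fingerprint bits: {len(recipe_one[0]) * 4}")
```

The recipe is the "keep track of it somewhere" part: the duplicate data is discarded, but the reference is what lets `read` rebuild either backup. A real index would live on disk and would have to handle deletes and reference counts, which this sketch skips.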

[00:21:06] W. Curtis Preston: Right. As we, we were alluding to before, one of the things that determines sort of your effectiveness of, of dedupe is the dedupe domain, right? So I’ve seen it file system level, meaning it only looks for duplicate data within each volume. I’ve seen it host level, I’ve seen it backup level, meaning literally backup configuration wise.

Right? So if I have a Windows server and I’m backing up the host and I’m backing up SQL Server, I only look for duplicates within SQL Server backups, right, against each other. Then, if we’re backing up several systems to a box, maybe the dedupe domain is only within that box.

It’s only looking for duplicates between all of that. And then there’s what I would call truly global dedupe, which is, we’re looking for duplicates from everything that’s coming in from multiple sources. Right?

[00:22:09] Prasanna Malaiyandi: Mm-hmm.
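One way to picture the effect of the dedupe domain is to give each domain its own index and count how many chunks end up stored. The sketch below does that for three of the scopes mentioned (per backup configuration, per host, and global). The host names, the tiny made-up "chunks", and the `stored_chunks` helper are all hypothetical.

```python
import hashlib


def stored_chunks(backups, domain_of):
    """Count physical chunks when duplicates are only detected inside each domain."""
    indexes = {}  # domain name -> set of fingerprints seen in that domain
    for source, chunks in backups.items():
        index = indexes.setdefault(domain_of(source), set())
        for chunk in chunks:
            index.add(hashlib.sha1(chunk).hexdigest())
    return sum(len(index) for index in indexes.values())


# Made-up data: two hosts, each with a file system backup and a SQL Server backup.
backups = {
    "host1:filesystem": [b"os-image", b"app-binaries", b"db-export-1"],
    "host1:sqlserver": [b"db-export-1", b"db-pages-2"],
    "host2:filesystem": [b"os-image", b"app-binaries", b"logs-host2"],
    "host2:sqlserver": [b"db-pages-2", b"db-pages-3"],
}

per_backup = stored_chunks(backups, lambda source: source)
per_host = stored_chunks(backups, lambda source: source.split(":")[0])
global_domain = stored_chunks(backups, lambda source: "global")

print(f"Per backup configuration: {per_backup} chunks stored")
print(f"Per host:                 {per_host} chunks stored")
print(f"Global:                   {global_domain} chunks stored")
```

The wider the domain, the more duplicates it can see. How much that matters in practice depends on how much data is really shared across sources, which is the decreasing-returns argument that comes up next.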

[00:22:10] W. Curtis Preston: There is a point of decreasing marginal returns, right? You can argue, and certainly if you’re a company that only does dedupe within, like earlier, where we only looked for dupes within SQL Server backups,

you could make an argument that, well, there’s not a lot of duplicate data between SQL Server and Windows, right? So even though we’re not comparing the two, there’s not gonna be a lot of duplicate data there, and there’s not gonna be a lot of duplicate data between the SQL Server database on this host and the SQL Server database on that host.

So that’s another argument that some

[00:22:47] Prasanna Malaiyandi: but, but I think a lot of that was because of architectural limitations of the products themselves rather than, that is really what you wanted to do. Right? Because

that’s more of a management issue.

[00:22:58] W. Curtis Preston: they didn’t, It was like, it was like, well, if we’re gonna do it, if we’re gonna do it that way, it’s gonna be much harder. to, to, to design a product to do it that way. And we don’t think, we don’t think that there’s going to be that much more benefit, um,

[00:23:17] Prasanna Malaiyandi: But on the other hand, if you look at things like VMware, right? If I have a bunch of VMs, right, and they all came from a single golden image, there’s a good chance that as you’re backing them up, 80, 90% of that stuff is all gonna be deduplicated, right?

[00:23:32] W. Curtis Preston: Absolutely. Yeah. There’s also a lot of duplicate data even within like a large filer, right? There’s gonna be lots of duplicate data there, right? So if you’re only doing it volume to volume or backup configuration to backup configuration, you, there’s a lot of duplicate data that I think you would, you would miss.


[00:23:52] Prasanna Malaiyandi: I know you talked about the domains, but I think another thing to also mention is, Some products do different types of chunking, if you will. Some do it at the file level, others do it at sort of a smaller level, right? And some do sort of fixed segment where each one is sort of a fixed length.

Others do sort of variable segments where they try to figure out what is optimal, because depending on how you’re doing your fingerprinting, right, you want to find the most number of matches, right? So you can save on storage.

[00:24:22] W. Curtis Preston: right. I,

[00:24:23] Prasanna Malaiyandi: another thing that also comes up.
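The difference between fixed and variable segments shows up when data shifts. In the sketch below, one byte is inserted at the front of some data: fixed-size chunks all move and stop matching, while content-defined chunks realign after the first boundary. The boundary rule here (a CRC32 over a sliding 16-byte window) is a simple stand-in chosen for readability; production systems generally use a true rolling hash for speed, plus minimum and maximum chunk sizes, and the window and divisor values are assumptions.

```python
import hashlib
import random
import zlib

WINDOW = 16      # bytes examined when looking for a boundary (assumed)
DIVISOR = 64     # average variable chunk is roughly this many bytes (assumed)
FIXED_SIZE = 64  # fixed chunk size used for comparison (assumed)


def fixed_chunks(data):
    """Cut every FIXED_SIZE bytes, regardless of content."""
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]


def variable_chunks(data):
    """Cut wherever a checksum of the last WINDOW bytes hits a chosen value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last boundary
    return chunks


def fingerprints(chunks):
    return {hashlib.sha1(chunk).hexdigest() for chunk in chunks}


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(16 * 1024))
edited = b"!" + original  # one byte inserted at the front

shared_fixed = len(fingerprints(fixed_chunks(original)) & fingerprints(fixed_chunks(edited)))
original_variable = fingerprints(variable_chunks(original))
shared_variable = len(original_variable & fingerprints(variable_chunks(edited)))

print(f"Fixed chunks still matching:    {shared_fixed}")
print(f"Variable chunks still matching: {shared_variable} of {len(original_variable)}")
```

Because the cut points depend on the content rather than on the offset, an insertion only disturbs the chunk it lands in. That is why variable segmentation tends to find more matches, at the cost of more computation per byte.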

[00:24:25] W. Curtis Preston: I would argue that file-level dedupe isn’t really dedupe, it’s more single instance. Right. That’s like single-instance storage of a file, you

know? Okay. But I’m always thinking subfile when I think about what I think of as actual dedupe. There is a very big other way that we divide up the dedupe industry, and that is source versus target.

[00:24:56] Prasanna Malaiyandi: Yep.

[00:24:58] W. Curtis Preston: Um, the first dedupe product I ever saw,

which was, uh, no, that was not the first. The first one I saw, the product at the time was called Undoo.

Have we talked

about this?

[00:25:18] Prasanna Malaiyandi: Mm.

[00:25:19] W. Curtis Preston: Undoo with two Os. It was really funny that the name of a dedupe vendor had duplicate data in their company name. It was Undoo with two Os. You know this product, you just don’t know that that’s what it used to be called.

[00:25:35] Prasanna Malaiyandi: What is it?


[00:25:38] W. Curtis Preston: I’ll give you a hint. The name comes from the fact that it would be a sea of availability. I’m gonna put the Jeopardy theme in here.

[00:25:57] Prasanna Malaiyandi: What would be a sea of availability?

[00:26:01] W. Curtis Preston: That’s where the name for the company comes from, or if I want to put it in the right order, an availability sea.

[00:26:11] Prasanna Malaiyandi: I don’t know what this is.

[00:26:14] W. Curtis Preston: Avamar

[00:26:15] Prasanna Malaiyandi: Oh, oh, that makes sense.

[00:26:18] W. Curtis Preston: Yeah. So that’s, that’s where the name Avamar came from. So the, the first

[00:26:23] Prasanna Malaiyandi: I should know that

[00:26:25] W. Curtis Preston: you shouldn’t know

[00:26:25] Prasanna Malaiyandi: it having been, uh, part of my former employer. Yes.

[00:26:30] W. Curtis Preston: Yeah. Well, I mean, you know, I, I have a bit of an inside track because that they’re, They were right up the road from me, right? They were up there. They were up in Irvine. Um, and that was, uh, the first dedupe product. They were a source dedupe . So what’s the difference between source dedupe and target dedupe Prasanna?

[00:26:50] Prasanna Malaiyandi: So the biggest one is, so let’s first talk about target dedupe, right? So target dedupe is: data comes into the system, and then a deduplication algorithm runs and tosses away the duplicate data. It can support any type of client as long as the client supports whatever protocol it has. So if it’s NFS or SMB, right?

Whatever can write to it, the data gets deduped.

[00:27:12] W. Curtis Preston: Hang on, hang on. Before you go on to that. I don’t disagree with what you said. I just, I think there could be a little bit more clarification. It’s a box that I send whatever I want to.

[00:27:23] Prasanna Malaiyandi: Yep.

[00:27:24] W. Curtis Preston: Typically, the thing about target dedupe was that you didn’t have to do a lot of re-engineering of

your backup system.

[00:27:32] Prasanna Malaiyandi: it’s like a VTL system, right? That came.

[00:27:34] W. Curtis Preston: plug in a box. Yeah. And you would send you, and basically you stopped using tape and you sent your backups to this box. Maybe the box might even be pretending to be a tape library, the virtual tape library. Right. Um, and then it did all the dedupe magic over there. Um,

[00:27:51] Prasanna Malaiyandi: Which was great because you can just plug in your box and go. Now the other side is called source-side dedupe: instead of sending all the data and tossing it away, why don’t we do something smart and actually figure out the duplicates on the client itself, on the source, right, dedupe on the source, and only send the unique data.

And this has the advantage of actually not sending the data over the wire, which is a huge benefit that people don’t always understand, right? Not sending the data can actually make it a lot faster, even though you think, oh, I’m now putting additional load on my server itself. But it ends up being better than trying to send all the data and just tossing it away like target-side dedupe does.
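The bandwidth difference can be sketched by counting the bytes that cross the wire in each model. Everything here is hypothetical: the `BackupServer` class, its `missing` and `store` methods, and the 4 KB fixed chunks are stand-ins for whatever protocol a real product uses, and the small amount of traffic needed to exchange fingerprints is ignored.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size


def split(data):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]


class BackupServer:
    """Toy server that stores each unique chunk once, keyed by fingerprint."""

    def __init__(self):
        self.chunks = {}

    def missing(self, fingerprints):
        """Return the fingerprints the server does not hold yet."""
        return {fp for fp in fingerprints if fp not in self.chunks}

    def store(self, chunk):
        self.chunks[hashlib.sha1(chunk).hexdigest()] = chunk


def target_side_backup(server, data):
    """Ship everything; the server discards duplicates after receiving them."""
    sent = 0
    for chunk in split(data):
        sent += len(chunk)
        server.store(chunk)
    return sent


def source_side_backup(server, data):
    """Hash on the client, ask what is missing, and ship only those chunks."""
    by_fingerprint = {hashlib.sha1(chunk).hexdigest(): chunk for chunk in split(data)}
    sent = 0
    for fingerprint in server.missing(by_fingerprint):
        sent += len(by_fingerprint[fingerprint])
        server.store(by_fingerprint[fingerprint])
    return sent


# Ten distinct chunks on Monday; on Tuesday only the last chunk has changed.
monday = b"".join(bytes([n]) * CHUNK_SIZE for n in range(10))
tuesday = monday[:-CHUNK_SIZE] + b"\xff" * CHUNK_SIZE

target = BackupServer()
target_sent = target_side_backup(target, monday) + target_side_backup(target, tuesday)

source = BackupServer()
source_sent = source_side_backup(source, monday) + source_side_backup(source, tuesday)

print(f"Target-side: {target_sent} bytes sent, {len(target.chunks)} chunks stored")
print(f"Source-side: {source_sent} bytes sent, {len(source.chunks)} chunks stored")
```

Both servers end up storing the same chunks; the difference is how much had to travel to get there. The hashing work moves to the client in the source-side case, which is the extra load mentioned above.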

[00:28:35] W. Curtis Preston: I would say it theoretically should be better

right? Because you, I’m just saying I’ve seen some crappy source dedupe systems, right?

[00:28:43] Prasanna Malaiyandi: Okay. Sorry. I’ve seen some, I’ve seen some good ones, or the ones that I’ve interacted with have been good. And so I’ve seen the performance numbers around

[00:28:52] W. Curtis Preston: Yeah. I do think it makes more sense to me. It always made more sense to me. The only reason why we had target dedupe was because to do source dedupe, you have to redesign the backup product, right? Basically you had to stop using NetBackup, NetWorker, or TSM, whatever it was back in the day, and you had to replace it.

Like in this case with Avamar, Avamar was a source dedupe product. You had to do what we call a forklift upgrade. You had to throw out the baby with the bathwater, whatever phrase, whatever, you know,

uh, analogy you want to use there. That was the main problem as I saw it with source dedupe, right, is that you had to change your backup product to get it,

[00:29:38] Prasanna Malaiyandi: and that was in the beginning, right? At the very early

[00:29:41] W. Curtis Preston: Well, yeah. Now you just have to upgrade your backup product, right? Because many modern backup technologies now support source dedupe, although even some newer backup technologies don’t. I don’t know if that came out in English, there were some double negatives in there. Some newer, very new backup technologies don’t do source dedupe.

[00:30:14] Prasanna Malaiyandi: Which seems bonkers.

[00:30:15] W. Curtis Preston: Which does seem bonkers. Um, you know, I’m talking about the likes of Rubrik and Cohesity, right? These are, you know, next-gen backup products that were designed in less than the last 10 years.

Right. And it’s based on an appliance model. and they do all the dedupe inside that box, is my understanding, right?

[00:30:43] Prasanna Malaiyandi: And I just wanna challenge that, Curtis, because I thought in some cases, They do do source side deduplication, but I think because they’ve tried to be open and act as a target device, in those cases, you can’t, like, you don’t really have another option.

[00:31:00] W. Curtis Preston: Yeah, I, I don’t, well, again, I’m not,

[00:31:03] Prasanna Malaiyandi: I, but I don’t know

[00:31:04] W. Curtis Preston: work at, I work at Druva, not at Rubrik or Cohesity. But it is my understanding that they do target-side dedupe, and one of the challenges of target-side dedupe is you need an appliance at each location. Now I know that they can do virtual appliances, right?

So they have a, they have a VM level appliance. Uh, but you need a box or something pretending to be a box at each location, because if you’re not eliminating the duplicates before you send it to the box, um, then you need, you need something that’s on-prem, right?

[00:31:40] Prasanna Malaiyandi: Because you definitely don’t wanna send all that over the WAN.

[00:31:43] W. Curtis Preston: No, no. To me, that’s the biggest advantage of a source dedupe system, is that it’s ultimately scalable, right?

Assuming it doesn’t slow things down, assuming all these things, assuming that the product actually works, you could back up a laptop, right? You can back up a mobile phone, and the duplicate data will be eliminated before it’s sent over the WAN, which is what you need to do if you’re backing up something over the internet.

[00:32:15] Prasanna Malaiyandi: Mm-hmm.

[00:32:15] W. Curtis Preston: Right. Um, and, um, so the, the downside that some, you know, again, you, you, you talked about it already, is that it does put additional compute requirement on the client. The argument is that it’s offset by the,

um, the savings of the network bandwidth. Right. Um,

[00:32:42] Prasanna Malaiyandi: There is also one more downside,

[00:32:44] W. Curtis Preston: okay.

[00:32:45] Prasanna Malaiyandi: which is that not all applications can do source side deduplication. So if you do have an application which only supports writing to, like, an NFS mount point or an SMB mount point, or something that doesn’t allow the integration of this source side deduplication logic, then you are going to need to be able to support target side dedupe.


[00:33:09] W. Curtis Preston: Yep. Uh, agreed. Um, and an example of that would be like, um, uh, Oracle, right?

[00:33:17] Prasanna Malaiyandi: Yep. Incremental merge.

[00:33:21] W. Curtis Preston: yeah. Um, although I would think that you should be able, I don’t know, we could, we

[00:33:27] Prasanna Malaiyandi: No, you can’t. You can’t. You can’t.

[00:33:29] W. Curtis Preston: You can’t take the Oracle stream and slice it and dice it. I don’t know.

[00:33:37] Prasanna Malaiyandi: Did you what? Sorry? You could, um, there are companies out there which give, which provide a virtual file system interface

that lives

[00:33:46] W. Curtis Preston: So you you fake it. You fake it out. Yeah.

Okay. All right. And then I’ve got something called hybrid dedupe and this, this was invented by your former employer.

[00:33:58] Prasanna Malaiyandi: I don’t even know what a hybrid dedupe is.

[00:34:01] W. Curtis Preston: It’s, it’s target dedupe pretending to be source dedupe.


[00:34:08] Prasanna Malaiyandi: Oh, see, here’s my, okay, so here’s my problem is I think Boost

[00:34:18] W. Curtis Preston: Uhhuh.

[00:34:19] Prasanna Malaiyandi: is source side deduplication. I don’t know if I would call it hybrid, because it is very similar to what Avamar did, right? It’s moving the deduplication logic to the client

such that you could do all of the computation. The same thing that we have talked about with source side deduplication,

[00:34:41] W. Curtis Preston: I’ll tell you why I put it in a different category. To me, hybrid dedupe is redoing the backup software. I’m sorry, source dedupe. True source side dedupe, it’s done at the backup software level,

[00:34:55] Prasanna Malaiyandi: Okay, then. I

[00:34:56] W. Curtis Preston: With hybrid dedupe, I’m still dumb, sending everything to this source dedupe thing that’s gonna redo it, right?

Um, it doesn’t matter in the end, you get, you get roughly the same benefits, right? Um, that’s what, uh,

[00:35:13] Prasanna Malaiyandi: Okay. So with hybrid, yeah. You get the benefits of source without having to upgrade and, or sorry, throw away your backup software.

[00:35:21] W. Curtis Preston: Right, right, right. Um, so we spent most of this time talking about dedupe. Um, there are a bunch of different ways to use disk in your backup system, some of which don’t really require dedupe, right? We used to do what we call disk caching, where you just had enough disk for last night’s backup. You would back up to disk and then you would copy that to tape, and then you would hand that to a man in a van.

Uh, then we got a bunch of different things. I got D to D to T, D to D to D, D to D, D to C, and D to D to C. Did I do all that? So disk to disk to tape, disk to disk to disk, direct to cloud, and disk to disk to cloud, right? So these are all ways that people use disk in current backup systems. Um, to me, D to D to C, or disk to disk to cloud, is really disk to disk to disk. It’s just that the disk is being run by the cloud, right? And I will say, by the way, that without dedupe, the whole thing of using the cloud the way we use the cloud just wouldn’t work. I mean, you can’t send full backups to the cloud. I mean, you could, with unlimited bandwidth.

[00:36:34] Prasanna Malaiyandi: Well, and yeah, with unlimited bandwidth it would just be expensive. Right. Just going back to the conversation we had earlier about the WAN, right? You don’t wanna send full copies out over the WAN.

[00:36:44] W. Curtis Preston: right.

[00:36:45] Prasanna Malaiyandi: Um, because that gets expensive and very slow. Um, the other one I was going to comment on was, uh, oh, I know we’ve been talking about disk, but I think it’s also important to acknowledge that now it’s no longer spinning disk.

It could also be flash. Right. We’ve seen

[00:37:06] W. Curtis Preston: yeah,

but that’s a whole other thing

[00:37:08] Prasanna Malaiyandi: I, I know, but I’m just saying that when it comes to deduplication and backup storage, or protection storage, right? It could be flash, it could be disk, it could be object storage, right? So I think it’s important to differentiate that, like what we’re talking about with deduplication, when we mention disk, right?

The media layer itself. Yeah, the media layer. Yes. The media layer is not tape.

[00:37:33] W. Curtis Preston: Right, right. Hang on one second. Um, I need to, didn’t realize I had a, I had a, um, Meeting

[00:37:48] Prasanna Malaiyandi: Meaning a.

[00:37:50] W. Curtis Preston: Yeah. Four. Well, four 15, which is an odd, um, all right. It’s a, it’s a pre-meeting with a podcast thing. It’s, um, anyway, um, so, uh, yeah, so, okay, you know, I hate the idea of flash

[00:38:24] Prasanna Malaiyandi: I know, I know, I know. I’m just saying that people will bring it up. So I just wanna clarify that when we talk about disk, we’re just talking about not tape.

[00:38:34] W. Curtis Preston: The only place. Yeah. Correct. The only place where I think maybe Flash has a place in the backup system is, and you know, you know, the folks over at Pierre and Neil, they’re all mad at me now. Right. But, uh, the only place that I, where I think Flash has a place in the backup system is with like live recovery. If you’re gonna do, if you’re gonna do instant recovery and you’re actually gonna run VMs off of your backups, that better be some really nice performing disk. But the thing is, it doesn’t need to be your whole system. It just needs to be like the most

[00:39:15] Prasanna Malaiyandi: A part, part of, and it needs to, you don’t need your entire system to be flash,

[00:39:20] W. Curtis Preston: Yeah.

[00:39:20] Prasanna Malaiyandi: You just

need enough to be able to support that use case.

[00:39:24] W. Curtis Preston: I just think that where flash does really, really well is in random access, right? Backup isn’t a random access application. Backup is a streaming application. Even if what we’re talking about is large dedupe chunks. I don’t

know. I, I,

[00:39:41] Prasanna Malaiyandi: I,


[00:39:42] W. Curtis Preston: say, let’s just say the jury is out for me. I, I am in Missouri.

Missouri. Is that, is that the show me state? That’s the show me state. Right?

[00:39:50] Prasanna Malaiyandi: yeah.

[00:39:51] W. Curtis Preston: So I’ll tell you what, I’ll tell you what. If there’s anybody that’s listening to this that just got pissed off,

[00:39:59] Prasanna Malaiyandi: what’s his

name? I’ll come back

[00:40:00] W. Curtis Preston: to, I welcome you to, come on and tell me why I’m wrong. I, I just,

[00:40:04] Prasanna Malaiyandi: I, I, I, I know who will come back on, you

[00:40:07] W. Curtis Preston: who, who,

will come back on,

[00:40:08] Prasanna Malaiyandi: what’s his name? VAST Data guy.

[00:40:11] W. Curtis Preston: uh oh. Oh, are they flash

[00:40:15] Prasanna Malaiyandi: Yeah,

[00:40:16] W. Curtis Preston: mark? Um, No, sorry, Howard. Uh, Howard. Yeah.

[00:40:22] Prasanna Malaiyandi: VAST is pure flash. Yeah.

[00:40:25] W. Curtis Preston: Yeah. Um, all right. All right. Well, yeah, Howard, uh, you wanna tell me why I’m wrong? Um, I’m more than happy to have you back. We can duke it out. It wouldn’t be the first time Howard and I have disagreed on something.

I don’t know. It’s just, it’s just there are so many area, there are so many other places where I would wanna spend money in the backup system.

[00:40:48] Prasanna Malaiyandi: Yep.

[00:40:48] W. Curtis Preston: Um, and, um,

[00:40:52] Prasanna Malaiyandi: comes down to what the cost is. Right. If you could get flash down to a low enough point,

[00:40:57] W. Curtis Preston: which is the point of VAST Data, right? Their architecture allows using flash in a, um, you know, a significant way,

[00:41:06] Prasanna Malaiyandi: That’s, that’s why I brought

[00:41:07] W. Curtis Preston: uh, close to cost. Okay. All right.

Okay. All right. All right. All right. Um, and then I got this whole other thing. I’m not gonna go into that other thing. Um, but yeah, so dedupe makes disk and cloud-based products both physically feasible as well as economically feasible. Right. Um,

[00:41:35] Prasanna Malaiyandi: is.

[00:41:37] W. Curtis Preston: hmm.

[00:41:37] Prasanna Malaiyandi: Is there something that a person shopping for a dedupe system should be asking?

Like what are the important things that they should be asking in order to determine

[00:41:50] W. Curtis Preston: Yeah, that’s a great question. I think the question would be things about what’s the restore performance? Because in the end, that’s the only thing that matters. I remember a product. Now, this product is still on the market, but I believe they have addressed this issue. I remember a dedupe product.

It was a target dedupe product that had, uh, I remember that had 400 megabytes a second throughput into an appliance.

[00:42:25] Prasanna Malaiyandi: And like 10 megabits out

[00:42:27] W. Curtis Preston: It was 40, it was 40, it was 40, uh, megabytes out. It had a 90%, what we call dedupe tax. Right. That the, because the problem with dedupe, depending on how you store it, is that you’ve got everything you need all over the

[00:42:42] Prasanna Malaiyandi: All over the place.
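The dedupe tax arithmetic above can be sketched in a few lines of Python. The 400 and 40 megabytes a second come from the conversation; the 10 terabyte data set and the function names `dedupe_tax` and `hours_to_transfer` are assumptions added only to show what a 90% tax means for restore time.

```python
def dedupe_tax(ingest_mb_s: float, restore_mb_s: float) -> float:
    """Fraction of ingest throughput that is lost when reading data back out."""
    return 1 - restore_mb_s / ingest_mb_s


def hours_to_transfer(terabytes: float, mb_per_second: float) -> float:
    """Hours needed to move a data set at a given rate (decimal units)."""
    megabytes = terabytes * 1_000_000
    return megabytes / mb_per_second / 3600


tax = dedupe_tax(400, 40)                   # figures quoted in the episode
backup_hours = hours_to_transfer(10, 400)   # 10 TB is a hypothetical data set
restore_hours = hours_to_transfer(10, 40)

print(f"dedupe tax: {tax:.0%}")
print(f"10 TB backup: {backup_hours:.1f} h, restore: {restore_hours:.1f} h")
```

Under these assumptions the restore takes ten times as long as the backup, which is why restore throughput is the number to ask about.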

[00:42:43] W. Curtis Preston: Yeah. And so this was just a really, really, really bad design. And, um, I believe that they addressed it, because that product is still on the market today. But version one of that product was terrible. Um, so yeah, it’s about restore performance, right? So one thing, oh, I’m. Uh, dedupe ratio is crap.

Don’t look at dedupe ratio. Dedupe ratio is a made up number. Um, I’ll go back to, I’ll pick on Avamar. Avamar, back in the day, they used to say they had a 400 to one dedupe ratio. Do you remember this? Because

they basically considered every backup as a full backup. They’re like, the way we store backups, which is the same way Druva stores backups, the way we store backups.

It’s like, even though they’re incremental, it’s like they’re a full, right? Because they behave like a full during a restore. And so they considered every backup a full. And so they said, well then therefore the dedupe ratio is 400 to one. Well, that was always complete nonsense. Um, the other would be, I remember, uh, again, I’m gonna pick on people equally.

I remember sales reps of a certain large target dedupe company where you might’ve worked, where they would tell customers to go and do full backups more frequently because it made their dedupe ratio better, which is just, again, nonsense. What matters, in my opinion, what matters is how big is a full backup versus how big are all the backups, right?

So, let me explain what I’m saying. If one full backup of my environment is a hundred terabytes, then after three months, or whatever number you want, but three months seems like a nice, long, um, what do you call it? Uh, POC thing,

right? Um, after, you know, three months, how much stuff is stored over there? That’s what I’m saying. Don’t, dedupe ratios is nonsense. That didn’t come out in English. Dedupe ratios are nonsense, but if I have a hundred terabyte environment and then a series of incremental backups, my question is: how much data did I write to disk? And let’s say it’s 200 terabytes after 90 days. And then compare that with another product that writes a hundred terabytes. You backed up the same data, but you used half as much storage on the back end. That’s what I’m saying. The problem is, and again, I’m a little extra sensitive to this because I work for Druva.
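A rough Python sketch of why the back-end number is more useful than the advertised ratio. The hundred terabyte full, the 90 days, and the 200 versus 100 terabytes stored come from the example above; the product names and the function `ratio_if_every_backup_is_a_full` are hypothetical, and the "every backup counts as a full" math is the funny math being criticized, not a recommended formula.

```python
FULL_TB = 100   # size of one full backup, from the example
DAYS = 90       # length of the comparison period, from the example

# Terabytes actually written to the back end after 90 days (from the example).
stored_tb = {"product_a": 200, "product_b": 100}


def ratio_if_every_backup_is_a_full(stored: float) -> float:
    """The 'funny math': pretend each daily backup was a full, then divide by storage."""
    return FULL_TB * DAYS / stored


for name, stored in stored_tb.items():
    claimed = ratio_if_every_backup_is_a_full(stored)
    print(f"{name}: {stored} TB stored, claimed ratio {claimed:.0f}:1")

# The comparison that matters: how much less storage one product uses than the other.
savings = 1 - stored_tb["product_b"] / stored_tb["product_a"]
print(f"product_b uses {savings:.0%} less back-end storage than product_a")
```

Both claimed ratios look impressive, yet the only figure that changes what gets paid for is the terabytes stored, where the second product uses half as much.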

People ask us, what’s our dedupe ratio? We’re like, well, the thing is we’re like the opposite of Avamar. Well, we’re actually similar to Avamar in that we’re source side dedupe, but we don’t use that funny math. So we could say 400 to one, but that’s nonsense. So, you know, we say, well, because we also do incremental forever backups.

That’s, that’s the problem. Right. So, um, but I know that on average, if we have a hundred terabyte customer, we store, you know, roughly a year’s worth of backups in less than a hundred terabytes of disk.

[00:46:16] Prasanna Malaiyandi: Yeah. And I think it’s important there to also account for that increment, like how I look at these numbers. I totally get what you said, Curtis, like you should just do an apples to apples. But if you don’t have that ability, you should also look to say, okay, I have a hundred terabyte full. And then say, my daily change rate is 2%.

Right? So if I do 2% for a month, right? That’s, what is that, 60? 60 more terabytes, right? So it should be 160 terabytes worth of data that I sent over, right? For 160 terabytes worth of data, how much should I actually store?
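That estimate can be written out as a short Python sketch. The hundred terabyte full, the 2% daily change rate, and the month come from the conversation; the 30-day month, the 80 terabytes stored, and the function names are assumptions for illustration only.

```python
def data_sent_tb(full_tb: float, daily_change_rate: float, days: int) -> float:
    """One full backup plus a daily incremental sized by the change rate."""
    return full_tb + full_tb * daily_change_rate * days


def reduction_ratio(sent_tb: float, stored_tb: float) -> float:
    """How many terabytes were sent for every terabyte actually stored."""
    return sent_tb / stored_tb


sent = data_sent_tb(100, 0.02, 30)   # 100 TB full + 2 TB a day for 30 days
stored = 80                          # hypothetical back-end usage reported by a product

print(round(sent, 2), round(reduction_ratio(sent, stored), 2))
```

With these inputs the data sent comes to 160 terabytes, matching the figure above, and a product that stored a hypothetical 80 terabytes would be at two to one against data actually sent, rather than against a pile of imaginary fulls.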

Right? Which will give you similar things to what you’re saying, right? But because what you’re saying is if you had the two products, then you could do a direct comparison.

But I’m saying if you don’t have the two products, then here’s another way you could

[00:47:04] W. Curtis Preston: Well, I, well, I would argue that there’s no way to compare them if you don’t have two pro, if you, if you’re not, if you’re not doing a true comparison. Right.

[00:47:14] Prasanna Malaiyandi: A

[00:47:14] W. Curtis Preston: it’s just that dedupe math is funny, right? So different products charge differently, right? You look at, um, like when you look at Metallic, which competes with Druva, they have a front end price and we have a back end price.

They have, they actually have the front end price, and then you also need to pay for the backend storage. Right? So you’re paying, so how do you, how do you compare that? Um, it’s, it’s just, it’s difficult

[00:47:41] Prasanna Malaiyandi: hard. Yeah.

[00:47:42] W. Curtis Preston: it’s hard. Uh, but all I’m saying is dedup ratio is crap and doesn’t mean anything. Um, but what does matter is how much data are you storing on that backend because you will be paying for that one way or the other.

All right. I don’t know if this is clear as mud or what, but, uh, I hope that was helpful and, uh, maybe we ticked off Howard and Howard’s gonna come on next week’s episode. I dunno.

[00:48:15] Prasanna Malaiyandi: Come join us,


[00:48:16] W. Curtis Preston: Thanks for, thanks for, uh, thanks for helping me with my network as well, so,

[00:48:21] Prasanna Malaiyandi: anytime, Curtis. Just remember I am not tech support.

[00:48:26] W. Curtis Preston: Yeah. Yeah. All right. Well, uh, and thanks to the listeners and remember to subscribe so that you can restore it all.
