<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Why global dedupe matters</title>
		<description>Discuss Why global dedupe matters</description>
		<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html</link>
		<lastBuildDate>Fri, 10 Feb 2012 11:00:27 +0000</lastBuildDate>
		<generator>JComments</generator>
		<atom:link href="http://www.backupcentral.com/component/jcomments/feed/com_content/231/10.html" rel="self" type="application/rss+xml" />
		<item>
			<title>Brian Doyle says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-570</link>
			<description><![CDATA[While I do agree with you that global dedup is important, its importance really depends on your priorities. I know for us replication and the expense of a larger WAN pipe were huge factors in the decision. At the time, several of the companies which globally deduped had no good way to replicate that data offsite. So I don't care how good my dedup is; I need to get that data offsite, which is where other vendors excel. Today I am very happy with our decision to go with Data Domain, but you are right: we are a shop that moves 7-10 TB a night. Great article, just wanted to get my 2 cents in.]]></description>
			<dc:creator>Brian Doyle</dc:creator>
			<pubDate>Sun, 05 Apr 2009 21:12:12 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-570</guid>
		</item>
		<item>
			<title>Aaron E. Kristoff says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-567</link>
			<description><![CDATA[The DBA's pessimism may at times be warranted, but it should not be driven by paranoia, and too many times that is exactly the case. That is why .bak's are sitting all over file systems. Strings of incrementals and dedupe conceptually do not give you a complete snapshot. That is why dedupe is only a part of an overall data protection strategy. I believe that dedupe increases the value of tape. In this market, many vendors will push that dedupe is a tape replacement. I stand on the opposite side of that. After all, I feel much better about deduping and storing my backup data on disk knowing that it is all reduped and stored as normal on tape.]]></description>
			<dc:creator>Aaron E. Kristoff</dc:creator>
			<pubDate>Wed, 01 Apr 2009 11:23:32 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-567</guid>
		</item>
		<item>
			<title>Jeremy says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-566</link>
			<description><![CDATA[Aaron, "And now with DeDupe added, the pessimism from these folks has multiplied. Now, not only does everything have to be a full backup to be able to successfully recover from but they do not want their backup data being sent to this black magic voodoo disk device" In some ways, that pessimism is probably warranted. Conceptually, incrementals and dedupe don't give you a complete snapshot of the database at any given time. Transactional consistency can be very important for some applications, and having a month of incrementals is probably enough to make a lot of folks squeamish. It's not like there aren't stories of recovery operations failing, leaving the DBA manually editing rows to reconstruct a database. Having a history of completely standalone full backups scattered all over the planet provides a safer feeling than having a month-old full backup in the vault and the rest of the data sitting on the "black magic voodoo disk device". The other side of the coin is, once your database gets to 100T, hopefully the DBA has a better plan than 100's of full backups scattered everywhere. At that point they probably should be testing recovery scenarios too. It's not like a configuration error in the backup software has ever caused a full backup to be unrecoverable.]]></description>
			<dc:creator>Jeremy</dc:creator>
			<pubDate>Wed, 01 Apr 2009 10:54:55 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-566</guid>
		</item>
		<item>
			<title>Jeremy says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-565</link>
			<description><![CDATA[I never said they didn't do incrementals. But when it's time for a full, you need the kind of throughput I talked about there. None of the local dedupe systems listed in the table in the performance post can handle 1100 MB/s for 12 hours and still dedupe it in 24 hours. I guess I wasn't clear when I said: "If that full backup is only run once a month, then the machine has a significantly larger window to perform the dedupe operation in an off-line manner." I tried to point out that it may not be necessary to do the dedupe operation in 24 hours. The full backups could be staggered to give the machine sufficient time to finish the dedupe operation. That 200T might be a once-a-week or once-a-month operation. Or, in the case of multiple 200T backups, they could be scheduled a day or two apart on a larger schedule to give the dedupe operation sufficient time to complete.]]></description>
			<dc:creator>Jeremy</dc:creator>
			<pubDate>Wed, 01 Apr 2009 10:33:54 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-565</guid>
		</item>
		<item>
			<title>LMAO....this really hit home.</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-564</link>
			<description><![CDATA[I am cracking up over the comment about the DBAs "don't want no stinking incrementals." Have you been speaking to my DBAs? Or my Unix admins? Or my Exchange admins? Or my SysDev people? It has been such a struggle to change the mindset that incremental backups are not evil, especially now that all the data is on disk (VTL) and can be recovered at the drop of a hat. No more waiting for offsite tapes. No more mount/seek/unmount times. The data is on disk. The disk does not care if the backup is a single full or the last full and 6 incrementals that it is restoring. And now with DeDupe added, the pessimism from these folks has multiplied. Now, not only does everything have to be a full backup to be able to successfully recover from but they do not want their backup data being sent to this black magic voodoo disk device that warps the data (dedupes it) into some X-Files-like hybrid creature that must be contained underground. I know the DBAs still think like this. That is why I have a trillion .bak files in my SQL file system backups even though we are using an SQL agent in our backup software. I could filter them out, but I can just see it now when they ask for some .bak's to be restored. I pick my battles with 'em.]]></description>
			<dc:creator>Aaron E. Kristoff</dc:creator>
			<pubDate>Wed, 01 Apr 2009 09:21:16 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-564</guid>
		</item>
		<item>
			<title>Never said they didn't do incrementals.</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-562</link>
			<description><![CDATA[I never said they didn't do incrementals. But when it's time for a full, you need the kind of throughput I talked about there. I specifically tell users NOT to do things on purpose (like more fulls) just to get better dedupe ratios. It wastes time and resources to gain nothing. But... You'd be surprised at how many DBAs "don't want no stinking incrementals." They're still carrying resentment from being burnt on them from 10-15 years ago or something.]]></description>
			<dc:creator>W. Curtis Preston</dc:creator>
			<pubDate>Wed, 01 Apr 2009 02:31:27 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-562</guid>
		</item>
		<item>
			<title>Deduping large backups.</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-559</link>
			<description><![CDATA[I know of several companies where they have databases that are 40, 50, 100, even 200 TB. If you want to do a full backup of a 40 TB database in 12 hours, you need over 1100 MB/s! Why aren't those customers configuring their DBMS/backup plug-in to do incremental backups? How much of that 40T really changed in 24 hours? 1GB, 5GB, 10GB? With physical tape, 6 weeks of daily incrementals might be impossible to restore in a small window due to mount/dismount/seek times. With a VTL, the incremental restores should be quite fast. If that full backup is only run once a month, then the machine has a significantly larger window to perform the dedupe operation in an off-line manner. It seems to me that many of the dedupe vendors recommend full backups simply to make their numbers look better. If you back up ten 1T incrementals and they get a 2:1 compression ratio, you have stored 5T. If you back up ten 10T fulls and they get 20:1, you have still stored 5T of data. I see some vendors requiring certain backup strategies to assure large dedupe ratios, and it screams of snake oil. Are those full backups being done to gain extra redundancy? Do full backups to a dedupe system really give the same level of security that full backups to multiple physical tapes provide? The dedupe system removes all that explicit redundancy down to a single point of failure. If the disk array containing a piece of data common to half the backups fails, then you have lost half of your backups (ignoring heroics).]]></description>
			<dc:creator>Jeremy</dc:creator>
			<pubDate>Tue, 31 Mar 2009 19:09:43 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-559</guid>
		</item>
		<item>
			<title>Thought exercise?</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-539</link>
			<description><![CDATA[You call this a &#34;thought exercise,&#34; as if it's not based in reality. It's completely the opposite. I'm trying to use numbers and logic to help people understand what I see every day in the field: customers who back up more than 10-20 TB a night want and NEED global dedupe. Not having it costs them money in disk; it costs them money in opex; and it just makes things harder. You can ignore that FACT if you'd like, but it IS a fact. I see it EVERY day. (Just today, I talked to a customer that needs 4500 MB/s to back up ONE SERVER in a twelve-hour period. Yes, that's an EXTREME case, but I'm just making the point that this all comes from real people and real datacenters, not &#34;thought exercises.&#34;) Second, this is about throughput, NOT capacity, because it's throughput that causes dedupe customers to buy more boxes -- not capacity. In the real world, very few local-dedupe boxes (regardless of vendor) ever scale to the advertised capacities before their customers say they're out of throughput. If customers buy the max config out of the box, they actually end up wasting a lot of disk as a result. (Hear that, customers? Don't max out your box with storage before you max it out with throughput, OK?) As to who this post is aimed at, I said it right in the post, &#34;if you're backing up less than 10-20 TB per day (including weekends) from your entire data center, global dedupe is not a problem you need to worry about.&#34; If a customer has more than that, though, the lack of global dedupe can cost them real money. So since this has to do with throughput and not capacity, using slots and tapes to make your argument is a complete non sequitur. So let me make your argument with tape drives, then show you why it STILL doesn't apply. If I took your argument and used tape drives instead of tapes, I'd say &#34;OK, so let's follow this logic through just a little bit. A DL3000 from EMC has 400 MB/s of throughput. 
That's roughly 4-5 LTO-4 tape drives (no compression) or 2-3 at 2:1 compression. So, more accurately, not having global dedup is like having a tape library that doesn't scale past 2-5 tape drives, which is pretty much NOBODY.&#34; So the first problem with that argument is that all tape library manufacturers make tape libraries with more than 2-5 tape drives. In fact, that's where most tape libraries START. The largest tape libraries in the world hold HUNDREDS of tape drives. (The STK SL8500 holds 448; the IBM TS3500 holds 192; the QTM Scalar 10k holds 324.) So it's OBVIOUS (to me, at least) that this kind of throughput is needed by SOME companies. The second problem with that argument is that having 100s of tape drives is FINE; but having several dedupe ISLANDS is NOT. Buying another tape drive for additional throughput costs you nothing but that tape drive. Buying an additional local-dedupe system for scaling purposes means buying an additional head AND enough disk for another full copy of your data. (Because you WILL store two full copies of your backups if you have two local-dedupe systems and load-balance across them.) And don't tell me this extra disk only costs an extra 10%. That's the &#34;there's only a 5% difference between 10:1 dedupe and 20:1 dedupe&#34; math again; you know I think that's bogus. Let me explain. Suppose a customer needs 2500 MB/s of throughput, and as a result buys five 500 MB/s local-dedupe boxes, and then follows your logic (of &#34;it's only 10%&#34;) and load balances across all five of them. (I realize this is a big stinking customer, but only big stinking customers need global dedupe; I've already said that.) At 2500 MB/s, this customer is backing up a little over 100 TB a night (2500 MB/s * 3600 s/hr * 12 hr = 108 TB). My experience shows me that in this configuration (distributed weekly full backup with daily incremental), about half of each night's backup comes from fulls, and half comes from incrementals. 
If that's the case, then this fictitious datacenter is a 350 TB datacenter (100 TB * 50% * 7 days = 350 TB). While a datacenter this size was unheard of 10 years ago, it is way too commonplace today. Let's back up this 350 TB environment with a 90 day retention, shall we? 90 days at 100 TB a night means 9000 TB, or 9 PB of backups. (That fits right into what I've seen, that for every 1 TB of live disk, I generally see at least 20 TB on tape.) At 20:1 dedupe, that means I need 450 TB of raw disk to hold 90 days of backups for my 350 TB environment. (Hopefully that's 450 TB of affordable SATA disk, which should be a lot cheaper than my 350 TB of DMX or USP disk.) But wait. In order to meet my throughput requirements, I needed to buy five heads and I load balanced across them (because you told me that it would only make a 10% difference). That means that I need to add four additional full copies in there. (Because the first time I back up my 350 TB datacenter to a new member of this local-dedupe array, I'm going to write it in full.) That's 1400 TB of extra stuff I have to store because I don't have global dedupe. You're saying that the difference between 9000 TB and 10400 TB is about 14%. While that is correct, it doesn't speak to the issue at hand. The reality is that I will need to add 350 TB of ADDITIONAL RAW DISK to EACH of the other four nodes in my &#34;array&#34; because I have to start each node with a base copy. I therefore started out needing 450 TB of disk (9000 TB/20); now I need 1850 TB of disk (450 TB + 4 * 350 TB). THAT'S the real difference, and it's a 400% difference, not 10% or 14%! Saying that global dedupe doesn't make a significant difference to those that need it is just silly.]]></description>
			<dc:creator>W. Curtis Preston</dc:creator>
			<pubDate>Fri, 20 Mar 2009 14:32:57 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-539</guid>
		</item>
		<item>
			<title>Thanks, Aaron</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-538</link>
			<description><![CDATA[@Aaron That means a lot coming from you (a real actual customer). Real comments from real customers always have more weight in my courtroom.]]></description>
			<dc:creator>W. Curtis Preston</dc:creator>
			<pubDate>Fri, 20 Mar 2009 11:55:28 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-538</guid>
		</item>
		<item>
			<title>Aaron E. Kristoff says:</title>
			<link>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-537</link>
			<description><![CDATA[Mr. Waterhouse, the whole point of global dedupe and why it is important has nothing to do with 20:1 ratios, PB's of retained data, LTO3's, or tape libraries. The heart of the matter is how much data you are backing up tonight and how much you will back up tomorrow night but also, and more importantly, how much easier it makes my life. As a storage admin, I do not want the added administrative burden of deciding which node/head/appliance my Exchange data will go to tonight and then having to assure it gets to that same node/head/appliance the next night for the sake of dedupe just because the 2 nodes/heads/appliances can't sit down for coffee and reconcile their differences. I would much rather be able to send my Exchange data to the backup target and know that dedupe will handle all of that data equally regardless of ingest point. It also makes load balancing easier. Not having to calculate how much data is going through one head and then trying to balance that number against how much is going through the other saves me time and headaches. So does it matter? Maybe not to you right now, but it certainly matters to me.]]></description>
			<dc:creator>Aaron E. Kristoff</dc:creator>
			<pubDate>Fri, 20 Mar 2009 09:50:28 +0000</pubDate>
			<guid>http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/231-global-dedupe.html#comment-537</guid>
		</item>
	</channel>
</rss>

