Tapeless backup environments?
Post Tapeless backup environments? 
All of those Paul said and Data Domain too. They have both a NAS and a virtual tape interface. And yes, all of these do de-dupe.

I keep a directory of de-dupe vendors at Backup Central Wiki:
http://www.backupcentral.com/components/com_mambowiki/index.php/Disk_Targets%2C_currently_shipping

Here's a tinyurl version in case that one gets truncated:
http://tinyurl.com/2dtvh2

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu [mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Paul Keating
Sent: Monday, September 24, 2007 12:46 PM
To: Jim Horalek; veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

There are several.
FalconStor, Diligent, Quantum and Sepaton I believe will all present a
"tape" to an NDMP device, and provide de-dupe on the backend.

Paul

--


-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf
Of Jim Horalek
Sent: September 24, 2007 12:43 PM
To: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?


On a similar note, how does NDMP play with disk de-dup? All of the
de-dups I've seen are NAS devices. NDMP only talks to tape or VTL.
Are there VTLs with de-dup that would solve the NDMP problem?

Jim
_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

Post Tapeless backup environments? 
Dave,

Dude, you've got to get out more. ;) I'd recommend continually perusing
some of these sites to stay current on what's going on in the industry.
De-dupe is kind of the most-mentioned topic in the storage industry
since I don't know what.

http://www.searchstorage.com
http://www.byteandswitch.com
http://www.infostoremag.com
http://www.isit.com/IndexSTO.cfm
http://www.backupcentral.com (My blog)

On my blog I've got a series of entries that talks about De-duplication,
starting with this one, "What is De-duplication?" I tried to link all
the de-dupe entries together, so that each entry has a forwarding link
to the next blog entry in the series:
http://www.backupcentral.com/content/view/58/47/

Your question about where de-dupe resides is answered in this entry "Two
different types of de-dupe:"

http://www.backupcentral.com/content/view/129/47/

We've got directories of both types:
Hardware/Target: http://tinyurl.com/384528
Software/Source: http://tinyurl.com/2dtvh2

(I use TinyUrl.com because the URLs are very long and tend to get
truncated in email. BTW, tinyurl uses de-duplication-like techniques,
as they run an algorithm against the string to give you a smaller
string. Then when you click on that string, they "restore" the original
URL to your browser. Kind of cool.)

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Dave
Markham
Sent: Monday, September 24, 2007 11:35 AM
To: Jeff Lightner
Cc: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

Guys, I've just read this thread and can say I'm very interested in it.
The first thing is that I learned a new term, deduplication, which I
didn't know existed.

Question: I gather deduplication uses other software. I think I saw
Data Domain mentioned. Where does this fit in with NetBackup, and does
the software reside on every client or just on a server somewhere?

OK, so I'm trying to do a kit refresh of a backup environment for a
customer with 2 sites, Production and DR, about 200 miles apart. There
is a link between the sites, but the customer will probably frown on
increased bandwidth charges to transfer backup data across for disaster
recovery purposes.

Data is probably only 1 TB for the site, with perhaps 70% needing to be
transferred daily to offsite media.

Currently I use tape, and I was just speccing a new tape system because
I thought that with disk-based backups, and retentions of weekly/monthly
backups lasting say 6 weeks, I'm going to need a LOT of disk, plus the
bandwidth transfer costs to the DR site.

LTO3 tapes are storing 200 GB a tape, which is pretty good compared to
disk, I thought.

I guess in my setup it's a trade-off between:

Initial cost of a disk array vs. initial cost of a tape library, drives
and media.

Time taken to back up (the network will be the bottleneck here: we are
still on a 100 Mbit LAN, with just 2 DB servers using Gigabit LAN to the
backup server).

Offsite transfer of tapes daily to an offsite location vs. cost of
increased bandwidth between sites to transfer backup data.


I'm now confused about what to propose :)
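
A rough way to weigh the network side of this trade-off is to work out
how long the daily transfer would take at line rate. The sketch below is
only an illustration: it assumes decimal units, takes the "70% of 1 TB"
figure from the message above, and treats the links as ideal (no
protocol overhead, no contention), so real transfers will be slower. The
function name is made up for the example.

```python
def transfer_hours(data_gb: float, link_mbit_per_s: float) -> float:
    """Hours needed to move data_gb gigabytes over an ideal link."""
    total_bytes = data_gb * 1e9
    bytes_per_second = link_mbit_per_s * 1e6 / 8  # 8 bits per byte
    return total_bytes / bytes_per_second / 3600


daily_gb = 0.70 * 1000  # 70% of roughly 1 TB, as stated in the message

print(f"100 Mbit LAN: {transfer_hours(daily_gb, 100):.1f} hours")   # 15.6 hours
print(f"Gigabit LAN:  {transfer_hours(daily_gb, 1000):.1f} hours")  # 1.6 hours
```

The same function can be pointed at the inter-site link speed (not given
in the thread) to see whether sending un-deduplicated data to the DR
site is even feasible inside a backup window.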



Post Tapeless backup environments? 
Dave Markham wrote:
Question: I gather deduplication uses other software. I think I saw
Data Domain mentioned. Where does this fit in with NetBackup, and does
the software reside on every client or just on a server somewhere?

In the technologies I'm familiar with--one of them is old, another new,
it's conceptually simple. "The system," whether that's a standalone
system or a box of disk with some smarts or an agent on the backup
client, receives data and examines it in blocks of some size (AFAIK,
always way larger than a 512-byte disk block). Simplistically, it
checksums the "block" and looks in a table of
checksums-of-"blocks"-that-it-already-stores to see if the identical
<ahem, anyone see a hole here?> data already lives there. If so, the
data can be tossed away and the checksum kept. The "file" is stored as
a collection of these checksums (imprecise term, but works for the
example) or a list of pointers to the single instance (hence the SIS
term can be overloaded here) of the data represented by that checksum.
A simplistic example would be storing a TB of zeros. Deduplicating
devices would store the first "block" of zeros, then find that all the
rest of them had the same checksum, same data, and just store one more
pointer for each. That 1TB file becomes, say, one real instance of 512KB of
zeros (if that is the "block" size) plus the space for a few million
pointers to the same 512KB of data. Obviously, even this could be
compressed but that's another story.
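
The description above can be sketched in a few lines. This is a toy
model under stated assumptions, not how any particular product works:
the class and method names are invented, SHA-1 stands in for the
"checksum," the block size is the 512KB from the example, and a few MB
of zeros stand in for the 1TB file.

```python
import hashlib

BLOCK_SIZE = 512 * 1024  # the 512KB "block" size used in the example


class DedupStore:
    """Toy single-instance store: each unique block is kept once."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, bytes] = {}      # fingerprint -> block data
        self.files: dict[str, list[bytes]] = {}   # file name -> list of pointers

    def write(self, name: str, data: bytes) -> None:
        pointers = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha1(block).digest()
            # Store the block only if this fingerprint has not been seen.
            self.blocks.setdefault(fingerprint, block)
            pointers.append(fingerprint)
        self.files[name] = pointers

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = DedupStore()
zeros = bytes(16 * BLOCK_SIZE)        # 8 MB of zeros stands in for the 1TB
store.write("zeros.img", zeros)
store.write("zeros-again.img", zeros)  # a second "full backup" of the same data

print(len(store.blocks))               # 1 unique block actually stored
print(len(store.files["zeros.img"]))   # 16 pointers to that one block
```

Note that this toy trusts the fingerprint alone when deciding that a
block is already stored, which is exactly the "hole" hinted at in the
text.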

Backing up the same system with few changes would be a very small full
backup. Backing up many instances of, say, the C drive of w2k3 systems
will deduplicate like crazy. Backing up a million different JPEGs
wouldn't save any appreciable space, but backing them up twice, or
multiple instances of the same JPEG, would.

Dave Markham wrote:
LTO3 tapes are storing 200 GB a tape, which is pretty good compared to
disk, I thought.

But that's a horrible number for LTO3. Either your tapes aren't full or
something is broken. Look at the available_media report to get a good
idea of the range of data stored on your FULL tapes.



Post Tapeless backup environments? 
bob944 wrote:
Simplistically, it checksums the "block" and looks in a table of
checksums-of-"blocks"-that-it-already-stores to see if the identical
<ahem, anyone see a hole here?> data already lives there.

To what hole do you refer? I see one in your simplistic example, but
not in what actually happens (which requires a much longer technical
explanation).

Post Tapeless backup environments? 
On Mon, Sep 24, 2007 at 05:08:31PM -0400, bob944 wrote:
In the technologies I'm familiar with--one of them is old, another new,
it's conceptually simple. "The system," whether that's a standalone
system or a box of disk with some smarts or an agent on the backup
client, receives data and examines it in blocks of some size (AFAIK,
always way larger than a 512-byte disk block). Simplistically, it
checksums the "block" and looks in a table of
checksums-of-"blocks"-that-it-already-stores to see if the identical
<ahem, anyone see a hole here?> data already lives there.

Yes, there's a hole there if that's all you're relying on. Not all of
them do that.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Post Tapeless backup environments? 
There are no products in the market that rely solely on a checksum to
identify redundant data. There are a few that rely solely on a 160-bit
hash, which is significantly larger than a checksum (typically 12-16
bits). There are some who are concerned about hash collisions in this
scenario. I am not one of those people. Here is a quote from an
article I wrote. The entire article is available here:

http://tinyurl.com/2j7r52

<quote>
Hash collisions occur when two different chunks produce the same hash.
It's widely acknowledged in cryptographic circles that a determined
hacker could create two blocks of data that would have the same MD5
hash. If a hacker could do that, they might be able to create a fake
cryptographic signature. That's why many security experts are turning to
SHA-1. Its bigger key space makes it much more difficult for a hacker to
crack. However, at least one group has already been credited with
creating a hash collision with SHA-1.

The ability to forcibly create a hash collision means absolutely nothing
in the context of deduplication. What matters is the chance that two
random chunks would have a hash collision. With a 128-bit and 160-bit
key space, the odds of that happening are 1 in 2^128 with MD5, and 1 in
2^160 with SHA-1. That's 10^38 and 10^48, respectively. If you assume that
there's less than a yottabyte (1 billion petabytes) of data on the
planet Earth, then the odds of a hash collision with two random chunks
are roughly 1,461,501,637,330,900,000,000,000,000 times greater than the
number of bytes in the known computing universe.

Let's compare those odds with the odds of an unrecoverable read error on
a typical disk--approximately 1 in 100 trillion or 10^14. Even worse odds
are data miscorrection, where error-correcting codes step in and believe
they have corrected an error, but miscorrect it instead. Those odds are
approximately 1 in 10^21. So you have a 1 in 10^21 chance of writing data
to disk, having the data written incorrectly and not even knowing it.
Everybody's OK with these numbers, so there's little reason to worry
about the 1 in 10^48 chance of a SHA-1 hash collision.

If you want to talk about the odds of something bad happening and not
knowing it, keep using tape. Everyone who has worked with tape for any
length of time has experienced a tape drive writing something that it
then couldn't read. Compare that to successful deduplication disk
restores. According to Avamar Technologies Inc. (recently acquired by
EMC Corp.), none of its customers has ever had a failed restore. Hash
collisions are a nonissue.
</quote>
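
For readers who want to check the orders of magnitude quoted above, the
key-space sizes are easy to reproduce. This is plain arithmetic on the
hash widths named in the quote; the variable names are just for the
example.

```python
md5_keyspace = 2 ** 128    # MD5 produces a 128-bit hash
sha1_keyspace = 2 ** 160   # SHA-1 produces a 160-bit hash

print(f"{md5_keyspace:.3e}")   # 3.403e+38
print(f"{sha1_keyspace:.3e}")  # 1.462e+48

# Number of decimal digits, as a sanity check on the exponents.
print(len(str(md5_keyspace)), len(str(sha1_keyspace)))  # 39 49
```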

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Post Tapeless backup environments? 
I'm not convinced that writing to a DataDomain is going to be faster than
writing to multiple LTO-3 drives over a SAN. The DD is limited to about
90MB/sec which is on par with 1-2 LTO-3 drives and not much more than that.
Unless, of course, you consider adding extra DD units for every 2 LTO-3
drives you currently have and that's going to bump your costs up even higher
(which might be offset by the requirement for a Decru FC520 encrypting
appliance for every 2-3 LTO-3 drives today).

I don't think that NetBackup 6.5 includes de-duplication. It's provided by
PureDisk which is a separately licensed product. With 6.5.1, you'll be able
to use PureDisk as a storage unit, something that's not there yet today.

.../Ed

--
Ed Wilts, RHCE, BCFP, BCSD
Mounds View, MN, USA
mailto:ewilts < at > ewilts.org


-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu [mailto:veritas-bu-
bounces < at > mailman.eng.auburn.edu] On Behalf Of Clem Kruger
Sent: Monday, September 24, 2007 11:32 AM
To: dave.markham < at > fjserv.net; Jeff Lightner
Cc: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

Hi Dave,

Yes, it is a difficult decision. I have looked at DataDomain with
NetBackup, and I have found that the backups are faster and there is a
vast amount of disk being saved.

NetBackup 6.5 includes de-duplication and I have become a great friend
of it. To use the words of a supplier, "Saving me Time, Saving me Space
and Saving me Money" :)


Kind Regards,
Clem Kruger

Post Tapeless backup environments? 
I'm not convinced either. Although our numbers are a little different,
you and I end up roughly at the same place. There are a number of
vendors whose de-dupe targets top out at about 200-300 MB/s, which is
roughly the speed of 2-3 LTO-3 drives, depending on how well you use
them. If you need more than that, you need to buy another box. (BTW,
Data Domain's numbers have increased to about 200 MB/s.)

These numbers work just fine when we're talking backups via the LAN to
LAN-based backup servers. You're going to need at least two, possibly
three network-based backup servers to generate 200 MB/s. Assuming 70
MB/s or so per master/media server, you buy one de-dupe unit per three
master/media servers or so. You can scale pretty far that way. You
will need to make sure that backup A is always sent to de-dupe unit A,
and backup B is always sent to de-dupe unit B, and so on. (If you send
backup B to de-dupe unit A after initially sending it to de-dupe unit B,
its first backup there will not get de-duped against anything, resulting
in a significant decrease in overall de-duplication ratio.) While you won't
get as big of a de-dupe ratio as you would if you could have a single
device that could do 1000s of MB/s, there is an argument to be made that
you won't get much de-dupe when de-duping the backups of server A
against those of server B -- unless they have similar data. So a very
large setup like this will require a bit of planning, but I think the
benefits outweigh the extra planning required.
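
The "backup A always goes to unit A" rule amounts to a fixed,
repeatable mapping from each backup to one de-dupe unit. In practice
you would pin this in the backup product's own policy or storage unit
configuration; the sketch below only illustrates the idea of a
deterministic assignment. The unit names, client names, and function
are hypothetical.

```python
import hashlib

# Hypothetical unit names; substitute whatever your environment uses.
DEDUPE_UNITS = ["dedupe-unit-A", "dedupe-unit-B", "dedupe-unit-C"]


def unit_for_client(client: str, units: list[str] = DEDUPE_UNITS) -> str:
    """Map a client name to one unit, the same one on every run."""
    digest = hashlib.sha1(client.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


for client in ("dbserver1", "fileserver1", "mailserver1"):
    print(client, "->", unit_for_client(client))
```

One caveat with this particular approach: changing the list of units
re-maps some clients, and each re-mapped client pays the
"first backup is not de-duped" penalty described above, so an explicit
table may be preferable in a real environment.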

Now, if you happen to have a SINGLE SAN media server that needs MORE
than 200 MB/s, then you're going to want a device that can handle that
level of throughput. This is going to be a pretty big server, BTW, as a
200 MB/s device can back up about 6 TB in 8 hours. And notice I said
SAN media server, not a regular media server, as a regular media server
isn't going to be able to generate more than 200 MB/s, as it's getting
its backups via IP. But a SAN media server is backing up its own data
locally, so it can go much faster. This also means you're really
looking at a SAN/block device, which means you're really looking at a
VTL. (Yes, I'm aware of the Puredisk storage unit around the corner. I
think you'll find it's not going after this part of the market.)
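
The sizing figures in the last two paragraphs are simple arithmetic and
can be checked as follows. The 200 MB/s, 70 MB/s, and 8-hour numbers
come from the text; decimal units and the function names are
assumptions made for the example.

```python
import math


def tb_per_window(mb_per_s: float, hours: float) -> float:
    """Terabytes moved in a backup window at a sustained rate."""
    return mb_per_s * 3600 * hours / 1e6  # MB -> TB (decimal)


def media_servers_needed(target_mb_per_s: float, per_server_mb_per_s: float = 70.0) -> int:
    """LAN-based media servers needed to keep one de-dupe unit busy."""
    return math.ceil(target_mb_per_s / per_server_mb_per_s)


print(f"{tb_per_window(200, 8):.2f} TB")  # 5.76 TB, i.e. "about 6 TB in 8 hours"
print(media_servers_needed(200))          # 3
```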

If you need this kind of throughput, there are a few products that are
advertising several hundred or thousands of MB/s within a single de-dupe
setup. These are the newer kids on the de-dupe block, of course, so
they're not going to have as many customer references as the vendors
that have been selling de-dupe as long. But from what I've seen,
they're worth a look.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Post Tapeless backup environments? 
cpreston:
Simplistically, it checksums the "block" and looks in a table of
checksums-of-"blocks"-that-it-already-stores to see if the identical
<ahem, anyone see a hole here?> data already lives there.

To what hole do you refer?

The idea that N bits of data can unambiguously be represented by fewer
than N bits. Anyone who claims to the contrary might as well knock out
perpetual motion, antigravity and faster-than-light travel while they're
on a roll.

I see one in your simplistic example, but
not in what actually happens (which requires a much longer technical
explanation).

Hence my introduction that began with "[s]implistically." But throw in
all the "much longer technical explanation" you like, any process which
compares a reduction-of-data to another reduction-of-data will sooner or
later return "foo" when what was originally stored was "bar."
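
The "foo comes back as bar" failure is easy to demonstrate once the
fingerprint is made small enough for a collision to be found quickly.
The sketch below is an illustration only: it deliberately truncates
SHA-1 to 16 bits (no product is claimed to do this), and the function
names and block contents are invented.

```python
import hashlib
from itertools import count


def tiny_fingerprint(block: bytes, bits: int = 16) -> int:
    """Deliberately weak fingerprint: the top `bits` bits of SHA-1."""
    digest = int.from_bytes(hashlib.sha1(block).digest(), "big")
    return digest >> (160 - bits)


def find_collision(bits: int = 16) -> tuple[bytes, bytes]:
    """Return two different blocks sharing a fingerprint (pigeonhole)."""
    seen: dict[int, bytes] = {}
    for i in count():
        block = f"block-{i}".encode()
        fp = tiny_fingerprint(block, bits)
        if fp in seen:
            return seen[fp], block
        seen[fp] = block


first, second = find_collision()

# A store that trusts the fingerprint alone keeps only the first block.
store: dict[int, bytes] = {}
for block in (first, second):
    store.setdefault(tiny_fingerprint(block), block)

restored = store[tiny_fingerprint(second)]
print(restored == second)  # False: the wrong block comes back, with no error
```

With 16 bits a collision is guaranteed within 65,537 distinct blocks. A
wider fingerprint pushes that point far out but, as argued above, does
not remove it.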


cpreston:
There are no products in the market that rely solely on a checksum to
identify redundant data. There are a few that rely solely on a 160-bit
hash, which is significantly larger than a checksum (typically 12-16

No importa. The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.

[...] The ability to forcibly create a hash collision means
absolutely nothing in the context of deduplication.

Of course it does. Most examples in the literature concern storing
crafted-data-pattern-A ("pay me one dollar") in order for the data to be
read later as something different ("pay me one million dollars"). It
can't have escaped your attention that every day, some yahoo crafts
another buffer-or-stack overflow exploit; some of them are brilliant.
The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.

What matters is the chance that two
random chunks would have a hash collision. With a 128-bit and 160-bit
key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
2160 with SHA-1. That's 1038 and 1048, respectively. If you

Grasshopper, the wisdom is not in the numbers, it is in remembering that
HTML will not paste into ASCII well. But I suspect you mean "one in
2^128" or similar.

Those are impressive, and dare I guess, vendor-supplied, numbers. And
they're meaningless. We do not care about the odds that a particular
block "the quick brown fox jumps over the lazy dog"
checksums/hashes/fingerprints to the same value as another particular
block "now is the time for all good men to come to the aid of their
party." Of _course_ that will be astronomically unlikely, and with
sufficient hand-waving (to quote your article: "the odds of a hash
collision with two random chunks are roughly
1,461,501,637,330,900,000,000,000,000 times greater than the number of
bytes in the known computing universe") these totally meaningless
numbers can seem important.

They're not. What _is_ important? To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint. I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database. "Probably, it won't happen" is not acceptable.
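
The question raised here, the chance that any block in a whole store
collides with any other, is usually estimated with the birthday bound,
P ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n chunks and a b-bit fingerprint.
The sketch below applies that standard approximation. The 1 PiB store
size and 8 KiB chunk size are assumptions picked for illustration, not
figures from this thread, and the estimate assumes fingerprints behave
like uniformly random values.

```python
import math


def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound estimate that at least two chunks share a fingerprint."""
    pairs = num_chunks * (num_chunks - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2 ** hash_bits)


# Assumed workload: 1 PiB of unique data cut into 8 KiB chunks.
num_chunks = 2 ** 50 // (8 * 1024)

p_sha1 = collision_probability(num_chunks, 160)
p_16bit = collision_probability(num_chunks, 16)

print(f"160-bit hash:    {p_sha1:.1e}")  # about 6.5e-27
print(f"16-bit checksum: {p_16bit}")     # 1.0, a collision is certain
```

Under these assumptions the number is tiny but not zero, which is the
crux of the disagreement in this thread: one side considers it
negligible next to other failure modes, the other objects that the
failure would be silent.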

Let's compare those odds with the odds of an unrecoverable
read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison. In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

If you want to talk about the odds of something bad happening and not
knowing it, keep using tape. Everyone who has worked with tape for any
length of time has experienced a tape drive writing something that it
then couldn't read.

That's not news, and why we've been making copies of data for, oh, 50
years or so.

Compare that to successful deduplication disk
restores. According to Avamar Technologies Inc. (recently acquired by
EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.



Post Tapeless backup environments? 
Just a teensy point - LTO3 tapes should store 400 GB natively. They're
marketed as having a capacity of up to 800 GB, but that's with 2:1
compression. We normally get about 550 GB for MRI data.

LTO4 tapes are available with 800 GB native capacity. The drives can also
encrypt data.

Dave Markham wrote:
Guys, I've just read this thread and can say I'm very interested in it.
The first thing is I learned a new term, deduplication, which I
didn't know existed.

Question: I gather deduplication uses other software. Data Domain, I
think, I saw mentioned. Where does this fit in with NetBackup, and does
the software reside on every client or just on a server somewhere?

OK, so I'm trying to kit-refresh a backup environment for a customer
which has 2 sites, Production and DR, about 200 miles apart. There is a
link between the sites, but the customer will probably frown on increased
bandwidth charges to transfer backup data across for disaster recovery
purposes.

Data is probably only 1 TB for the site, with perhaps 70% being required
to be transferred daily to offsite media.

Currently I use tape, and I was just speccing a new tape system, as I
thought that by using disk-based backups, with retentions of weekly/monthly
backups lasting say 6 weeks, I'm going to need a LOT of disk, plus the
bandwidth transfer costs to the DR site.

LTO3 tapes are storing 200 GB a tape, which is pretty good compared to
disk, I thought.

I guess in my setup it's a trade-off between:

Initial cost of disk array vs initial cost of tape library, drives and media

Time taken to back up (the network will be the bottleneck here: still on a
100 Mbit LAN, with just 2 DB servers using a Gigabit LAN to the backup server)

Offsite transfer of tapes daily to offsite location vs cost of increased
bandwidth between sites to transfer backup data


I'm now confused about what to propose :)
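
For a rough sense of the network side of that trade-off, here is a minimal sketch of the transfer-time arithmetic. The 700 GB figure is 70% of the 1 TB mentioned above. Decimal units, a link running at full line rate with no protocol overhead, and the function name `transfer_hours` are all assumptions made for illustration. The speed of the inter-site link is not stated in the thread, so only the LAN speeds are used.

```python
def transfer_hours(data_gb: float, link_mbit_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb (decimal gigabytes) over a link.

    efficiency is the assumed fraction of line rate actually achieved
    (1.0 = ideal; real backup streams are usually lower).
    """
    bits_to_move = data_gb * 1e9 * 8
    usable_bits_per_second = link_mbit_per_s * 1e6 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


daily_gb = 1000 * 0.70  # 70% of 1 TB, per the figures in the post

for label, mbit in (("100 Mbit LAN", 100), ("Gigabit LAN", 1000)):
    print(f"{label}: {transfer_hours(daily_gb, mbit):.1f} hours at line rate")
```

Under those assumptions the daily 700 GB takes about 15.6 hours on a 100 Mbit link and about 1.6 hours on Gigabit, so any slower inter-site link would not fit a nightly window without reducing the data sent.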





--
Do you want a picture of your brain - volunteer for a brain scan!
http://www.fil.ion.ucl.ac.uk/Volunteers/

Computer systems go wrong - even backup systems
Be paranoid!

Chris Freemantle, Data Manager
Wellcome Trust Centre for Neuroimaging
+44 (0)207 833 7496
www.fil.ion.ucl.ac.uk

Post Tapeless backup environments? 
Most of this, while well documented, seems to boil down to the same
alarmist notion that had people trying to ban cell phones in gas
stations. The possibility that something untoward COULD happen does NOT
mean it WILL happen. To date I don't know of a single gas pump
explosion or car fire that was traced to cell phone usage at the pump.
Oddly enough though no one monitors gas pumps to be sure users aren't
re-entering their vehicles and fires HAVE been traced to static
electricity caused by that.

If odds are so important it seems it would be important to worry about
the odds that your data center, your offsite storage location and your
Disaster Recovery site will all be taken out at the same time.

I also suggest the argument is flawed because it seems to imply that
only the cksum is stored and not the actual data. It is the original
compressed data AND the cksum that result in the restore, not the cksum
alone.

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of bob944
Sent: Wednesday, September 26, 2007 4:03 AM
To: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

cpreston:
Simplistically, it checksums the "block" and looks in a table of
checksums-of-"blocks"-that-it-already-stores to see if the identical
<ahem, anyone see a hole here?> data already lives there.

To what hole do you refer?

The idea that N bits of data can unambiguously be represented by fewer
than N bits. Anyone who claims to the contrary might as well knock out
perpetual motion, antigravity and faster-than-light travel while they're
on a roll.

I see one in your simplistic example, but
not in what actually happens (which requires a much longer technical
explanation).

Hence my introduction that began with "[s]implistically." But throw in
all the "much longer technical explanation" you like, any process which
compares a reduction-of-data to another reduction-of-data will sooner or
later return "foo" when what was originally stored was "bar."
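
The counting argument above is the pigeonhole principle: a fingerprint with b bits has only 2^b possible values, so among 2^b + 1 distinct blocks at least two must share one. A minimal sketch, assuming a fingerprint made by truncating SHA-1 to 16 bits. That key space is deliberately tiny so a collision shows up quickly, and nothing in the thread suggests a product works this way. The helper names are hypothetical.

```python
import hashlib


def truncated_fingerprint(block: bytes, hash_bits: int) -> int:
    """Top hash_bits bits of the SHA-1 digest of block, as an integer."""
    digest = hashlib.sha1(block).digest()
    return int.from_bytes(digest, "big") >> (160 - hash_bits)


def first_collision(hash_bits: int = 16):
    """Feed distinct 8-byte blocks until two share a fingerprint.

    The pigeonhole principle guarantees this returns within
    2**hash_bits + 1 blocks.
    """
    seen = {}  # fingerprint -> first block that produced it
    counter = 0
    while True:
        block = counter.to_bytes(8, "big")
        fingerprint = truncated_fingerprint(block, hash_bits)
        if fingerprint in seen:
            return seen[fingerprint], block, fingerprint
        seen[fingerprint] = block
        counter += 1


block_a, block_b, shared = first_collision(16)
print(block_a.hex(), block_b.hex(), "share fingerprint", hex(shared))
```

With 160 bits the same guarantee still holds in principle. The dispute in the thread is over how many blocks it takes before it matters in practice.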


cpreston:
There are no products in the market that rely solely on a checksum to
identify redundant data. There are a few that rely solely on a 160-bit
hash, which is significantly larger than a checksum (typically 12-16

No importa. The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.

[...] The ability to forcibly create a hash collision means
absolutely nothing in the context of deduplication.

Of course it does. Most examples in the literature concern storing
crafted-data-pattern-A ("pay me one dollar") in order for the data to be
read later as something different ("pay me one million dollars"). It
can't have escaped your attention that every day, some yahoo crafts
another buffer-or-stack overflow exploit; some of them are brilliant.
The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.

What matters is the chance that two
random chunks would have a hash collision. With a 128-bit and 160-bit
key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
2160 with SHA-1. That's 1038 and 1048, respectively. If you

Grasshopper, the wisdom is not in the numbers, it is in remembering that
HTML will not paste into ASCII well. But I suspect you mean "one in
2^128" or similar.

Those are impressive, and dare I guess, vendor-supplied, numbers. And
they're meaningless. We do not care about the odds that a particular
block "the quick brown fox jumps over the lazy dog"
checksums/hashes/fingerprints to the same value as another particular
block "now is the time for all good men to come to the aid of their
party." Of _course_ that will be astronomically unlikely, and with
sufficient hand-waving (to quote your article: "the odds of a hash
collision with two random chunks are roughly
1,461,501,637,330,900,000,000,000,000 times greater than the number of
bytes in the known computing universe") these totally meaningless
numbers can seem important.

They're not. What _is_ important? To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint. I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database. "Probably, it won't happen" is not acceptable.

Let's compare those odds with the odds of an unrecoverable
read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison. In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

If you want to talk about the odds of something bad happening and not
knowing it, keep using tape. Everyone who has worked with tape for any
length of time has experienced a tape drive writing something that it
then couldn't read.

That's not news, and why we've been making copies of data for, oh, 50
years or so.

Compare that to successful deduplication disk
restores. According to Avamar Technologies Inc. (recently acquired by
EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.



Post Tapeless backup environments? 
Please read my other post about the odds of this happening. With a
160-bit key space, the odds of a hash collision are so small that any
statistician would call them zero: 1 in 2^160. Do you know how big that
number is? It's a whole lot bigger than it looks. And those odds are
significantly better than the odds that you would write a bad block of
data to a regular disk drive and never know it.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Post Tapeless backup environments? 
It's interesting that the probability of any 2 randomly selected hashes
being the same is quoted, rather than the probability that at least 2
out of a whole group are the same. That's probably because the minutely
small chance becomes rather bigger when you consider many hashes. This
will still be small, but I suspect not as reassuringly small.

To illustrate this, consider the 'birthday paradox'. How many people do
you need in a room to have at least a 50% chance that 2 of them have the
same birthday? The chance of any 2 randomly chosen people sharing the
same birthday is 1/365 (neglecting leap years). That's quite small, so we
need a lot of people to get a 50% chance, right? Wrong. You need 23
people. Google for 'birthday paradox' for the simple maths.
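
The birthday figure, and the same calculation applied to hashes, can be checked with a short sketch. It assumes values are spread uniformly at random, and for hashes it uses the usual small-probability approximation p ~ n^2 / 2^(bits+1). The store size of 2^40 chunks is a made-up example, not a figure from this thread.

```python
import math


def birthday_collision_probability(n_items: int, n_slots: int) -> float:
    """Exact chance that at least two of n_items land in the same slot,
    assuming each item picks one of n_slots uniformly at random."""
    if n_items > n_slots:
        return 1.0
    # Sum logs of the "no collision so far" factors to avoid underflow.
    log_no_collision = sum(math.log1p(-k / n_slots) for k in range(n_items))
    return -math.expm1(log_no_collision)


def approx_hash_collision_probability(n_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation, valid only while the result is small."""
    return n_chunks * n_chunks / 2 ** (hash_bits + 1)


print(birthday_collision_probability(22, 365))  # just under 0.5
print(birthday_collision_probability(23, 365))  # just over 0.5

# Hypothetical store holding 2**40 unique chunks, fingerprinted with SHA-1.
print(approx_hash_collision_probability(2 ** 40, 160))
```

For that hypothetical store the result is 2^-81, roughly 4e-25. That is far larger than the 1 in 2^160 quoted for a single pair, because the chance grows with the square of the number of chunks, yet it is still very small.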

For our data I would certainly not use de-duping, even if it did work
well on image data.






--
Do you want a picture of your brain - volunteer for a brain scan!
http://www.fil.ion.ucl.ac.uk/Volunteers/

Computer systems go wrong - even backup systems
Be paranoid!

Chris Freemantle, Data Manager
Wellcome Trust Centre for Neuroimaging
+44 (0)207 833 7496
www.fil.ion.ucl.ac.uk

Post Tapeless backup environments? 
On Wed, Sep 26, 2007 at 04:02:49AM -0400, bob944 wrote:
Bogus comparison. In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

While I find the "compare only based on hash" a bit annoying for other
reasons, the argument above doesn't convince me.

Disks, controllers, and yes RAID arrays can fail silently in all sorts
of ways by either acknowledging a write that is not done, writing to the
wrong location, reading from the wrong location, or reading blocks where
only some of the data came from the correct location. Most RAID systems
do not verify data on read to protect against silent data errors on the
storage, only against obvious failures.
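
One common defence against the silent failures listed above is an end-to-end checksum verified on every read, with the block address mixed in so that misdirected reads and writes are caught as well. A minimal sketch, assuming an in-memory dictionary standing in for the disk. The class name and layout are hypothetical and do not describe any particular RAID product or filesystem.

```python
import zlib


class ChecksummedBlockStore:
    """Toy block store that verifies a per-block CRC-32 on every read."""

    def __init__(self):
        self.blocks = {}  # address -> (payload, crc of address + payload)

    @staticmethod
    def _crc(address: int, payload: bytes) -> int:
        # Including the address means a block that ends up at the wrong
        # location fails verification even if its payload is intact.
        return zlib.crc32(address.to_bytes(8, "big") + payload)

    def write(self, address: int, payload: bytes) -> None:
        self.blocks[address] = (payload, self._crc(address, payload))

    def read(self, address: int) -> bytes:
        payload, stored_crc = self.blocks[address]
        if self._crc(address, payload) != stored_crc:
            raise IOError(f"checksum mismatch at block {address}")
        return payload


store = ChecksummedBlockStore()
store.write(5, b"foo")
store.write(7, b"bar")

# Simulate a misdirected write: block 7's contents land on block 5.
store.blocks[5] = store.blocks[7]
try:
    store.read(5)
except IOError as err:
    print("detected:", err)
```

This turns a silent error into a loud one. It does not repair the block, so a second copy is still needed.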

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Post Tapeless backup environments? 
On Wed, Sep 26, 2007 at 09:58:12AM -0400, Jeff Lightner wrote:
I also suggest the argument is flawed because it seems to imply that
only the cksum is stored and not the actual data. It is the original
compressed data AND the cksum that result in the restore, not the cksum
alone.

It's not that the actual data isn't stored, it's whether or not the
actual data is checked. Some algorithms search through the hash space,
and if a hit comes up, they assume that the previously stored data is a
match without a comparison.

The original data must always be stored. Even if it were possible to
run a hash algorithm in reverse quickly, there would be no way to
determine which of various possible input strings was the original.
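
The difference between trusting a hash hit and checking the stored data can be sketched with a toy chunk store. Everything here is hypothetical: the class, the reference format, and especially the weak hash (chunk length), which is used only to force a collision on demand. The default is SHA-1, matching the 160-bit hash discussed in the thread.

```python
import hashlib


class DedupStore:
    """Toy content-addressed chunk store, for illustration only."""

    def __init__(self, verify_on_hit: bool, hash_fn=None):
        self.verify_on_hit = verify_on_hit
        self.hash_fn = hash_fn or (lambda data: hashlib.sha1(data).digest())
        self.chunks = {}  # key -> list of distinct chunks sharing that key

    def put(self, data: bytes):
        """Store a chunk and return a (key, index) reference for reads."""
        key = self.hash_fn(data)
        bucket = self.chunks.setdefault(key, [])
        if bucket and not self.verify_on_hit:
            # Hash-only mode: a hit is assumed to be the same data.
            return (key, 0)
        for index, existing in enumerate(bucket):
            if existing == data:  # byte-for-byte comparison
                return (key, index)
        bucket.append(data)
        return (key, len(bucket) - 1)

    def get(self, ref) -> bytes:
        key, index = ref
        return self.chunks[key][index]


def weak_hash(data: bytes) -> bytes:
    """Deliberately bad fingerprint: just the chunk length."""
    return bytes([len(data)])


for verify in (False, True):
    store = DedupStore(verify_on_hit=verify, hash_fn=weak_hash)
    store.put(b"foo")
    ref = store.put(b"bar")
    print("verify_on_hit =", verify, "-> wrote b'bar', read back", store.get(ref))
```

In hash-only mode the second write is silently replaced by the first chunk. With the comparison enabled, the colliding chunk is stored separately at the cost of reading the existing chunk on every hit.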

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
