Tapeless backup environments?
Post Tapeless backup environments? 
On Wed, Sep 26, 2007 at 04:22:01PM +0100, Chris Freemantle wrote:
For our data I would certainly not use de-duping, even if it did work
well on image data.

There are different ways of doing deduplication. Not all of them rely
on hash signature matching to find redundant data. You should talk with
a particular vendor and see how they accomplish it.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

Post Tapeless backup environments? 
Most of this, while well documented, seems to boil down to the same
alarmist notion that had people trying to ban cell phones in gas
stations. The possibility that something untoward COULD happen does NOT
mean it WILL happen. To date I don't know of a single gas pump

I can't speak for car fires, but I can speak for
checksums/hashes/fingerprints mapping to more than one set of data.
It's been demonstrated. It happens. It _has_ to happen. It's the way
these data reductions work, and the reason why it's more convenient to
refer to small hashes of data rather than the full data for many
uses--this has been a programming commonplace since the '50s. But
programmers know it's not a two-way street: a set of data generates
only one checksum/hash/fingerprint, but one checksum/hash/fingerprint
maps to more than one set of data. And that's fine, for a program that
takes this into account (either because it doesn't matter to the
program's logic or a secondary step checks the data). As a trivial
example, reducing three-bit data to a two-bit checksum means that trying
to go backwards will retrieve the wrong three-bit data 50% of the time.
Bigger hashes and more sophisticated algorithms reduce the number of
times you get the wrong data; they don't eliminate it.
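
The three-bit example can be checked directly. A minimal Python sketch, assuming a toy checksum that simply keeps the low two bits (the names checksum2 and reverse_lookup are made up for illustration, not taken from any product):

```python
from collections import defaultdict

def checksum2(value: int) -> int:
    """Toy 2-bit checksum: keep the low two bits of a 3-bit value."""
    return value & 0b11

# Group every possible 3-bit input by the checksum it produces.
preimages = defaultdict(list)
for value in range(8):
    preimages[checksum2(value)].append(value)

def reverse_lookup(checksum: int) -> int:
    """Going 'backwards' can only guess; here it returns the first candidate."""
    return preimages[checksum][0]

# Count how often the guess differs from the value that was checksummed.
wrong = sum(1 for v in range(8) if reverse_lookup(checksum2(v)) != v)
print(dict(preimages))
print(f"wrong answers: {wrong} of 8")
```

Each 2-bit checksum has two 3-bit inputs behind it, so any fixed rule for going backwards is wrong for half of the inputs.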

If odds are so important it seems it would be important to worry about
the odds that your data center, your offsite storage location and your
Disaster Recovery site will all be taken out at the same time.

And if it's not important that the data you read may not be what was
written, don't let me stop you. _The odds are_ that it'll be okay.

I also suggest the argument is flawed because it seems to imply that
only the cksum is stored and not the actual data. It is the original
compressed data AND the cksum that result in the restore,
not the cksum alone.

If I get your meaning, you have an incorrect understanding of the
argument--nobody is talking about generating the original data from a
checksum. As I said in what you quoted (trimmed here), every unique (as
determined by the implementation) "block" of data gets stored, once. A
data stream is stored as a list of pointers or
checksums/hashes/fingerprints which refer to those common-storage
"blocks". Any number of data streams will point to the same "block"
when they have it in common, and as many times as that "block" occurs in
their data stream. To read the data stream later, the list of pointers
tells the implementation what "blocks" to retrieve and send back to the
file reader. Now, if "foo" and "bar" both reduced to the same
checksum/hash/fingerprint when stored, somebody is going to receive the
wrong data when the stream(s) that had those data are read. So sorry
about that corrupted payroll master file...
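
The failure mode described above can be sketched in a few lines of Python. This is a hypothetical store written only for illustration (HashOnlyStore and find_collision are invented names, and no vendor's design is implied). It assumes SHA-1 as the 160-bit fingerprint and deliberately truncates it to 8 bits so that a collision can be found by brute force:

```python
import hashlib

class HashOnlyStore:
    """Hypothetical dedupe store that trusts the fingerprint alone."""

    def __init__(self, digest_bytes: int = 20):
        self.digest_bytes = digest_bytes   # 20 bytes = 160 bits (full SHA-1)
        self.blocks = {}                   # fingerprint -> stored block

    def fingerprint(self, block: bytes) -> bytes:
        return hashlib.sha1(block).digest()[: self.digest_bytes]

    def write_stream(self, blocks):
        """Store each new block once; return the stream as a list of fingerprints."""
        pointers = []
        for block in blocks:
            fp = self.fingerprint(block)
            self.blocks.setdefault(fp, block)  # a colliding block is silently dropped
            pointers.append(fp)
        return pointers

    def read_stream(self, pointers):
        return [self.blocks[fp] for fp in pointers]


def find_collision(store):
    """Brute-force two different blocks with the same (truncated) fingerprint."""
    seen = {}
    i = 0
    while True:
        block = f"block-{i}".encode()
        fp = store.fingerprint(block)
        if fp in seen:
            return seen[fp], block
        seen[fp] = block
        i += 1


store = HashOnlyStore(digest_bytes=1)   # 8-bit fingerprint: collisions are easy to find
foo, bar = find_collision(store)
pointers = store.write_stream([foo, bar])
restored = store.read_stream(pointers)
print(restored[1] == bar)   # False: "bar" comes back as "foo", and no error is raised
```

With the full 20-byte digest the same code would need an actual SHA-1 collision to misbehave; the sketch shows what happens if one occurs, not how likely it is.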



Post Tapeless backup environments? 
Please read my other post about the odds of this happening. With a decent
key space, the odds of a hash collision with a 160-bit key space are so
small that any statistician would call them zero. 1 in 2^160. Do you
know how big that number is? It's a whole lot bigger than it looks.
And those odds are significantly better than the odds that you would
write a bad block of data to a regular disk drive and never know it.

I did read your other post, and addressed your numbers. C Freemantle
makes the same point I do, perhaps more clearly, in his "birthday
paradox" posting.



Post Tapeless backup environments? 
On Wed, Sep 26, 2007 at 04:02:49AM -0400, bob944 wrote:
Bogus comparison. In this straw man, that
1/100,000,000,000,000 read error a) probably doesn't
affect anything because of the higher-level RAID array
it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop,
fail, get-it-from-another-source error--NOT a silent
changing of the data from foo to bar on every read
with no indication that it isn't the data that
was written.

While I find the "compare only based on hash" a bit annoying
for other reasons, the argument above doesn't convince me.

Disks, controllers, and yes RAID arrays can fail silently in
all sorts of ways by either acknowledging a write that is not
done, writing to the wrong location, reading from the wrong
location, or reading blocks where only some of the data came
from the correct location. Most RAID systems do not verify
data on read to protect against silent data errors on the
storage, only against obvious failures.

Perhaps anything can have a failure mode where it doesn't alert--but in
a previous lifetime in hardware and some design, I saw only one
undetected data transformation that did not crash or in some way cause
obvious problems (intermittent gate in a mainframe adder that didn't
affect any instructions used by the OS).

I don't remember a disk that didn't maintain, compare and _use for error
detection_, the cylinder, head and sector numbers in the format.

The write frailties mentioned, if they occur, will fail on read. And
the read frailties mentioned will generally (homage paid to the
mainframe example I cited as the _only_ one I ever saw that didn't)
cause enough mayhem that apps or data or systems go belly-up in a big
way, fast.

These events, like double-bit parity errors or EDAC failures, involve
1. that something breaks in the first place
2. that it not be reported
3. that the effects are so subtle that they are unnoticed (the app or
system doesn't crash, the data aren't wildly corrupted, ...)

The problem with checksumming/hashing/fingerprinting is that the
methodology has unavoidable errors designed in, and an implementation
with no add-on logic to prevent or detect them will silently corrupt
data. That's totally different, IMO.



Post Tapeless backup environments? 
On Wed, Sep 26, 2007 at 05:15:08PM -0400, bob944 wrote:
Perhaps anything can have a failure mode where it doesn't alert--but in
a previous lifetime in hardware and some design, I saw only one
undetected data transformation that did not crash or in some way cause
obvious problems (intermittent gate in a mainframe adder that didn't
affect any instructions used by the OS).

There's a lot more data out there now (more chances for problems).
Disk firmware has become much more complex.

I don't remember a disk that didn't maintain, compare and _use for error
detection_, the cylinder, head and sector numbers in the format.

Disks may (usually) do that, but they don't report it back to you so you
can verify, and they're not perfect.

One of the ZFS developers wrote about a disk firmware bug they
uncovered. Every once in a while the disk would return the data not
from the requested block but from a block with some odd calculated
offset from that one. Unless the array/controller/system is checking
the data, you'll never know until it hits something critical.

NetApp also talks about the stuff they had to add because of silently
dropped writes and corrupted reads.

Everything has an error rate.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Post Tapeless backup environments? 
Chris Freemantle said:
It's interesting that the probability of any 2 randomly selected hashes
being the same is quoted, rather than the probability that at least 2
out of a whole group are the same. That's probably because the minutely
small chance becomes rather bigger when you consider many hashes. This
will still be small, but I suspect not as reassuringly small.
To illustrate this consider the 'birthday paradox'.

I'm really glad you point this out. The way I interpret this is that
the odds of there being a hash collision in your environment increase
with every new block of data you submit to the de-duplication system.
I've talked to somebody who has researched this mathematically, and he
says he's going to share with me his calculations. I'll share them
if/when he shares them with me. As a proponent of these systems, I
certainly don't want to misrepresent the odds they represent.

For our data I would certainly not use de-duping, even if it did work
well on image data.

I think you're under the misconception that all de-dupe systems use ONLY
hashes to identify redundant data. While there are products that do
this (and I still trust them more than you do), there are also products
that do a full block comparison of the supposedly matching blocks before
throwing one of them away.

In addition, there are ways to completely remove the risk you're worried
about. If you back up to a de-dupe backup system, regardless of its
design, and then use your backup software to copy from it to tape (or
anything), you verify the de-duped data, as any good backup software
will check all data it copies against its own stored checksums.

Post Tapeless backup environments? 
Bob,

I'll try to respond as best as I can.

It doesn't matter. The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.

You and I don't disagree on this. The only thing we differ on is the
odds of the event. I think the odds are small enough to not be
concerned with, and you think they're larger than that.

(I also think it's important to state what I stated in my other reply.
Most de-dupe systems do not rely only on hashes. So if you can't get
past this whole hashing thing, there's no reason to reject de-dupe
altogether. Just make sure your vendor uses an alternate method.)

The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.

So someone is going to exploit the hash collision possibilities in my
backup system to do what, exactly? As much as I've spoken and written
about storage security, I can't for the life of me figure out what
someone would hope to gain or how they would gain it this way.

Those are impressive, and dare I guess, vendor-supplied, numbers. And
they're meaningless.

These are odds based on the size of the key space. If you have 2^160
odds, you have a 1:2^160 chance of a collision.

What _is_ important? To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint.

This is referring to the birthday paradox. As I stated in another post,
I haven't thought about this before, and am looking into what the real
odds are. I'm trying to translate it into actual numbers.

I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database. "Probably, it won't happen" is not acceptable.

Couldn't agree more.

Let's compare those odds with the odds of an unrecoverable
read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison. In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything

I thought probably wasn't acceptable? I'm sorry, that was just too
close to your previous use of "probably" in a very different context.

probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.

I think Darren's other posts about this point are sufficient. It
happens. It happens all the time, and is well documented. And yet the
industry's ok with this. On the other hand, the odds of what we're
talking about are significantly smaller and people are freaking out.

If you want to talk about the odds of something bad happening and not
knowing it, keep using tape. Everyone who has worked with tape for any
length of time has experienced a tape drive writing something that it
then couldn't read.

That's not news, and why we've been making copies of data for, oh, 50
years or so.

I'm just saying that a hash collision, however possible, would basically
translate into a failed backup that looks good. Do you have any idea
how many failed backups that look good happen every single day with
tape? And, as long as you bring up making copies, making copies of your
de-duped data removes any concerns, as it verifies the original.

Compare that to successful deduplication disk
restores. According to Avamar Technologies Inc. (recently acquired by
EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.

Touché. Anyone who has actually experienced a hash collision in their
de-duplication backup system please stand up. Given the hype that
de-dupe has made, don't you think that anyone who had experienced such a
thing would have reported it and such a report would have been given big
press? I sure do. And yet there has been nothing.


Post Tapeless backup environments? 
-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf
Of Curtis Preston
Sent: 01 October 2007 06:35
To: bob944 < at > attglobal.net; veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?
...

These are odds based on the size of the key space. If you have 2^160
odds, you have a 1:2^160 chance of a collision.

By saying that, the implication is that the keyspace is uniform. It's
not. The probability of a hash collision is a function of the uniformity
of the keyspace as well as the number of items you've hashed and the
size of the key. There's lots of research in the crypto field that's
relevant to de-dupe.

You also should consider the characteristics of the de-dupe software
when it encounters a hash collision. Backups are the last line of
defence for many, when all else (personal copies, replication, snapshots
etc.) has failed. The 'acceptable risk' of a hash collision is of
little comfort when you've got one. Does it fail silently, throw its
hands in the air and core dump, or handle the situation gracefully and
carry on without missing a beat? Ask them what they do. As Curtis
mentioned, not all de-dupe s/ware relies purely on hashes.

Balance this with the /fact/ that there's already a chance of undetected
corruption in the components you buy today, which is why most
technologies that survive impose their own data validation checks
instead of relying purely on the underlying technology in the stack to
have checked it for them. The multi-layered checks that go on improve
your overall confidence.

At least one design in the SiS field also accepts that hashing
algorithms will improve over time and they've had the foresight to be
able to drop in new hashing schemes in future.

When picking de-dupe software you should also care about Intellectual
Property. Who's got what isn't necessarily clear in this space, and the
patent lawyers won't be far away. Picking the big boys helps here, but
also look at people with a mature view to the marketplace (e.g. some
companies are prepared to talk about licensing deals rather than court
cases when they encounter infringement).

There are lots of other things to consider in picking an algorithm,
including how well it handles patterns that don't fall naturally on block
boundaries (think of the challenges involved in de-duping 'the quick
brown fox' and 'the quicker brown fox') that will affect de-dupe ratios,
and how that affects performance. And the solution's not just about the
algorithm.

De-dupe is a great advance, and a disruptive technology not just for
backup but also for primary storage. Look forward to it, but go in with
your eyes open.
--------------------------------------------------------

NOTICE: If received in error, please destroy and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error.


Post  
As promised, I looked into applying the Birthday Paradox logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of data, I think you'll be OK.

Post Tapeless backup environments? 
cpreston <netbackup-forum < at > backupcentral.com>:
As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.

One of us still doesn't understand this. :-)

Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox. The number of possible values in
BP is 366; there is no data reduction in it, no key values. An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity. An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.

What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.

Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that. What _would_ be a
useful exercise for the reader is to discover how many unique values of
8KB are, on average, represented by a given 160-bit
checksum/hash/fingerprint.
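
The exercise posed above is a counting argument: an 8KB block is 65,536 bits, so there are 2^65536 possible blocks spread across 2^160 possible fingerprints. A small Python sketch, assuming 8KB means 8 x 1024 bytes and that every block value is possible (the variable names are illustrative):

```python
import math

BLOCK_BITS = 8 * 1024 * 8      # an 8 KB block is 65,536 bits
HASH_BITS = 160

# With 2**BLOCK_BITS possible blocks spread over 2**HASH_BITS fingerprints,
# the average number of blocks per fingerprint is 2**(BLOCK_BITS - HASH_BITS).
avg_preimages_log2 = BLOCK_BITS - HASH_BITS

# Size of that average as a count of decimal digits, computed from the
# exponent so the huge integer never has to be converted to a string.
decimal_digits = math.floor(avg_preimages_log2 * math.log10(2)) + 1

print(f"average blocks per fingerprint: 2**{avg_preimages_log2}")
print(f"that number has about {decimal_digits} decimal digits")
```

This counts how many possible blocks share each fingerprint. It says nothing about how many of those blocks will ever be written to a given system, which is the question the probability arguments in this thread address.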



Post Tapeless backup environments? 
On Tue, Oct 16, 2007 at 12:09:30AM -0400, bob944 wrote:
One of us still doesn't understand this. :-)

Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox. The number of possible values in
BP is 366; there is no data reduction in it, no key values.

The 366 isn't the data space, it's the keyspace. When we look at a
person's birthday, we're hashing them into that space. The "paradox"
then is how many people can we hash before the chance of a "collision"
is significant.

Obviously if 400 people are in a room, the number of values exceeds the
keyspace and the probability of a collision is 1.
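
The view of birthdays as hashing people into a small keyspace can be tried out numerically. A minimal sketch in Python, assuming every key is equally likely (real birthdays and real hash outputs are only approximately uniform); collision_probability is an illustrative name:

```python
def collision_probability(n_items: int, keyspace: int) -> float:
    """Exact chance that at least two of n_items share a key, assuming uniform keys."""
    if n_items > keyspace:
        return 1.0            # pigeonhole: more items than keys forces a collision
    p_all_distinct = 1.0
    for i in range(n_items):
        # The (i+1)-th item must avoid the i keys already taken.
        p_all_distinct *= (keyspace - i) / keyspace
    return 1.0 - p_all_distinct

for people in (2, 23, 70, 400):
    print(people, round(collision_probability(people, 366), 4))
```

The loop is fine for a keyspace of 366 but is not practical for 2^160; for keyspaces that large an approximation is normally used instead.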

An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity.

I'm afraid I don't understand what you mean with that sentence.

An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.

What is stopping at 8 bits? Hash collisions can always occur. The
question is what is the probability.

What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.

I think everyone acknowledges that as a fact.

Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values.

The argument is that a process does not have to be infallible to be
valuable, much like the electrical and mechanical processes we currently
use. If the chance of failure in the algorithm is much less than
the chance of other parts of the system introducing silent data
corruption, then the overall amount of data loss is not significantly
changed.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >

Post Tapeless backup environments? 
At the risk of chasing windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind. I again say that no one is saying that hash collisions can't
happen. We are simply saying that the odds of them happening are
astronomically less than having an undetected/uncorrected bit error on
tape. And I believe that the math that I use in my blog post
illustrates this.

I said:
As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.

Bob944 said:
One of us still doesn't understand this. :-)

Got that right. :-)

Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox.

I completely disagree. If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case. All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:

* The odds of two people in the same room having the same birthday
increase with the number of people in the room. If there are only
two people in the room, those odds will be roughly 1 in 365, or .27%
(leap year aside). If there are 23 people in the room,
the odds are 50%.

* The odds of two DIFFERENT blocks having the same hash (i.e. a
hash collision) increase with the number of blocks in the data set.
If there are two blocks in the set, the odds are 1 in 2^160.
If there are fewer than 12.7 quintillion blocks in the data set,
the odds don't show up in a percentage calculated out to 50 decimal
places. As soon as you have more than 12.7 quintillion blocks, the
odds at least register in 50 decimal places, but are still really
small. And to get 12.7 quintillion blocks, you need to store at
least 95 Exabytes of data.
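
For readers who want to run their own numbers, the usual birthday-bound approximation is p ~ 1 - exp(-n(n-1) / (2 * 2^bits)). The Python sketch below is not the calculation from the blog post, and the figures it prints are not claimed to match the ones quoted above. It assumes uniformly distributed 160-bit hashes and 8KB blocks, and the function name is illustrative:

```python
import math

def approx_collision_probability(n_blocks: float, hash_bits: int) -> float:
    """Birthday-bound approximation p ~ 1 - exp(-n(n-1) / (2 * 2**hash_bits)).

    Assumes hash outputs are uniformly distributed, which real data only approximates.
    """
    exponent = n_blocks * (n_blocks - 1) / (2.0 * 2.0 ** hash_bits)
    return -math.expm1(-exponent)   # expm1 keeps precision when the result is tiny

BLOCK_SIZE = 8 * 1024               # 8 KB blocks, the size discussed in this thread
for stored_bytes in (1e12, 1e15, 1e18, 1e21):
    n = stored_bytes / BLOCK_SIZE
    p = approx_collision_probability(n, 160)
    print(f"{stored_bytes:.0e} bytes -> {n:.2e} blocks -> p ~ {p:.3e}")
```

Changing BLOCK_SIZE or hash_bits shows how sensitive the result is to each assumption; a non-uniform hash would make the real odds worse than this estimate.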

The number of possible values in
BP is 366; there is no data reduction in it, no key values. An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity.

Yeah, IMHO, we are talking apples and oranges. Let me try to put the
hash collision into the birthday world. Let's say that we want a wall
of photos of everyone who came to our party. When you show up, we check
your birthday, and we check it off on a list. (We'll call your BD the
"hash.") If we've never seen your birthday before, we take your photo
and put it on the wall. If your birthday has already been checked off
on the list, though, we don't take your photo. We assume that since you
have the same birthday, you must be the same person. So you don't get
your photo taken. We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back). Now, if he is indeed the same guy, that's not a hash/BD
collision. If he is indeed a different person, and we said he was the
same person simply because he had the same birthday, then that would be
a hash/BD collision.

And THIS would be an absurdity to think you can represent n number of
people in a party with an array of photos selected solely on their
birthday (a key space of only 366). But it's not out of the realm of
possibility to say that we could represent n number of bits in our data
center with an array of bits selected solely on a 160-bit hash (a
keyspace of 2^160). Cryptographers have been doing it for years. We're
just adding another application on it.

An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.

Again, I hope you read what I wrote above. In the analogy, we're not
de-duping birthdays; we're de-duping people BASED on their birthdays.
(Which would be a dumb idea because the key space is too small: 366)

What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.

I concede, I concede! The only point I'm trying to make is what are the
odds that two different blocks of data will have the same hash (i.e. a
hash collision) in a given data center.

Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that.

This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead. Some use a weak hint +
bit-level verify, some use delta-differencing technologies, which are
bit-to-bit comparisons as well.

Post Tapeless backup environments? 
What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.

that is why i have turned off all hardware and software compression on
my tape drives. imagine trying to store more than 400GB of data onto a
single lto3 tape! they "say" that you can store up to and even more
than 800GB, but i don't believe a word of it. there is no way 1 nibble
of data can represent 1 byte! once i have the time to study lzr
compression and understand it, and see whether or not it is
"data-loss-less", then i may turn compression back on. until then,
tapes are cheap and i'll buy 2.5 times as many as i need. :-)

thanks,
jerald

p.s.
our de-dupe vtl does the hash and then a bit by bit comparison of the
data block to ensure the data really is the same in order to eliminate
the duplicate block. i think some of the confusion may be in not
understanding how the de-dupe process works. once you create a hash for
a block of data, you are storing the hash AND the block of data. you
are never having to re-create a big block of data from a smaller hash.
the backup stream of data gets re-written from a "string" of 8k blocks,
into a "string" of 160-bit pointers which point to the unique 8k blocks
of data via the hash table. or something like that...
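
The hash-then-compare approach described in the p.s. can be sketched as follows. This is a hypothetical Python illustration, not the VTL's actual implementation: VerifyingStore is an invented name, SHA-1 is assumed as the 160-bit hash, and the pointer is assumed to be a (fingerprint, index) pair so that two different blocks with the same fingerprint can both be kept. The fingerprint is truncated to 8 bits here purely to force collisions:

```python
import hashlib

class VerifyingStore:
    """Hypothetical dedupe store: hash to find a candidate, then compare bytes."""

    def __init__(self, digest_bytes: int = 20):
        self.digest_bytes = digest_bytes
        self.buckets = {}                      # fingerprint -> list of distinct blocks

    def fingerprint(self, block: bytes) -> bytes:
        return hashlib.sha1(block).digest()[: self.digest_bytes]

    def write_block(self, block: bytes):
        """Return a pointer (fingerprint, index) to the stored copy of block."""
        fp = self.fingerprint(block)
        bucket = self.buckets.setdefault(fp, [])
        for index, existing in enumerate(bucket):
            if existing == block:              # full comparison, not just the hash
                return fp, index               # true duplicate: reuse it
        bucket.append(block)                   # new data, or a collision: keep both
        return fp, len(bucket) - 1

    def read_block(self, pointer):
        fp, index = pointer
        return self.buckets[fp][index]


store = VerifyingStore(digest_bytes=1)         # tiny fingerprint to force collisions
stream = [f"block-{i}".encode() for i in range(1000)] + [b"block-7"]
pointers = [store.write_block(b) for b in stream]
print(all(store.read_block(p) == b for p, b in zip(pointers, stream)))  # True
print(pointers[7] == pointers[-1])             # True: the repeated block is stored once
```

The cost of this design is an extra read and compare of the stored block on every apparent duplicate, which is the trade-off between the hash-only and verifying products discussed in the thread.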
****************************************************************
Confidentiality Note: The information contained in this
message, and any attachments, may contain confidential
and/or privileged material. It is intended solely for the
person(s) or entity to which it is addressed. Any review,
retransmission, dissemination, or taking of any action in
reliance upon this information by persons or entities other
than the intended recipient(s) is prohibited. If you received
this in error, please contact the sender and delete the
material from any computer.
****************************************************************


Post Tapeless backup environments? 
Hardware compression on your tape drives buys more than saved tapes - it
buys reduced backup times. I found that out way back when on DDS
tapes. We do compression on our stuff (and I have at many jobs) and
have yet to see a restore fail that wasn't due to an issue traced to the
original backup job that wasn't noticed at the time rather than some
mystical bit change that occurred during the restore.

While it is theoretically possible you'll get killed during the next
Leonid meteor shower I doubt you're reinforcing your roof with steel to
ensure it doesn't happen.

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Iverson,
Jerald
Sent: Thursday, October 18, 2007 11:52 AM
To: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?


What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more
than you can represent three bits of data with two.
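
The three-bits-into-two point can be checked by brute force. A minimal sketch, assuming a made-up 2-bit checksum that simply keeps the low two bits of the value (checksum2 and collisions are hypothetical names for this example):

```python
from collections import defaultdict


def checksum2(value):
    """Hypothetical 2-bit checksum: keep the low two bits of a 3-bit value."""
    return value & 0b11


def collisions():
    """Group all eight 3-bit inputs by their 2-bit checksum."""
    groups = defaultdict(list)
    for value in range(8):  # every possible 3-bit input
        groups[checksum2(value)].append(value)
    return dict(groups)
```

Eight inputs cannot fit into four checksums one-to-one, so every checksum here ends up standing for two different inputs. Any other 2-bit function would show the same kind of overlap somewhere.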

that is why i have turned off all hardware and software compression on
my tape drives. imagine trying to store more than 400GB of data onto a
single lto3 tape! they "say" that you can store up to and even more
than 800GB, but i don't believe a word of it. there is no way 1 nibble
of data can represent 1 byte! once i have the time to study lzr
compression and understand it, and see whether or not it is
"data-loss-less", then i may turn compression back on. until then,
tapes are cheap and i'll buy 2.5 times as many as i need. :)

thanks,
jerald

p.s.
our de-dupe vtl does the hash and then a bit by bit comparison of the
data block to ensure the data really is the same in order to eliminate
the duplicate block. i think some of the confusion may be in not
understanding how the de-dupe process works. once you create a hash for
a block of data, you are storing the hash AND the block of data. you
are never having to re-create a big block of data from a smaller hash.
the backup stream of data gets re-written from a "string" of 8k blocks,
into a "string" of 160-bit pointers which point to the unique 8k blocks
of data via the hash table. or something like that...
----------------------------------
CONFIDENTIALITY NOTICE: This e-mail may contain privileged or confidential information and is for the sole use of the intended recipient(s). If you are not the intended recipient, any disclosure, copying, distribution, or use of the contents of this information is prohibited and may be unlawful. If you have received this electronic transmission in error, please reply immediately to the sender that you have received the message in error, and delete it. Thank you.
----------------------------------


Post Tapeless backup environments? 
So you're OK with hash-based de-dupe, which everyone acknowledges has a
chance (although quite small) that you could have a hash-collision and
potentially corrupt a block of data somewhere, sometime, when you least
expect it...

But you're NOT ok with the long-running industry standard of loss-less
compression algorithms? (All compression algorithms for tape are
loss-less algorithms.) Lossy algorithms are only used in things like
video compression, where it's ok to lose blocks along the way as long as
the human eye can't detect them, or as long as you can fit it on
YouTube.
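
To see what loss-less means in practice, here is a small round-trip sketch. It uses zlib's DEFLATE from the Python standard library purely as a stand-in; tape drives use their own hardware algorithm, which this does not model. The function name round_trip is made up for the example.

```python
import zlib


def round_trip(data):
    """Compress, then decompress; return both the packed and restored bytes."""
    packed = zlib.compress(data)
    restored = zlib.decompress(packed)
    return packed, restored


sample = b"backup " * 1000        # highly repetitive, so it compresses well
packed, restored = round_trip(sample)
same = restored == sample         # True for any input with a loss-less algorithm
smaller = len(packed) < len(sample)
```

The restored bytes match the input exactly no matter what the input is. How much smaller the packed form gets depends entirely on the data; already-compressed or random data may not shrink at all.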

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies


