
Tapeless backup environments?

Posted by Anonymous 
Tapeless backup environments?
September 26, 2007 09:16AM
On Wed, Sep 26, 2007 at 04:22:01PM +0100, Chris Freemantle wrote:
[quote]For our data I would certainly not use de-duping, even if it did work
well on image data.
[/quote]
There are different ways of doing deduplication. Not all of them rely
on hash signature matching to find redundant data. You should talk with
a particular vendor and see how they accomplish it.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
Tapeless backup environments?
September 26, 2007 12:55PM
[quote]Most of this, while well documented, seems to boil down to the same
alarmist notion that had people trying to ban cell phones in gas
stations. The possibility that something untoward COULD
happen does NOT
mean it WILL happen. To date I don't know of a single gas pump
[/quote]
I can't speak for car fires, but I can speak for
checksums/hashes/fingerprints mapping to more than one set of data.
It's been demonstrated. It happens. It _has_ to happen. It's the way
these data reductions work, and the reason why it's more convenient to
refer to small hashes of data rather than the full data for many
uses--this has been a programming commonplace since the '50s. But
programmers know it's not a two-way street: a set of data generates
only one checksum/hash/fingerprint, but one checksum/hash/fingerprint
maps to more than one set of data. And that's fine, for a program that
takes this into account (either because it doesn't matter to the
program's logic or a secondary step checks the data). As a trivial
example, reducing three-bit data to a two-bit checksum means that trying
to go backwards will retrieve the wrong three-bit data 50% of the time.
Bigger hashes and more sophisticated algorithms reduce the number of
times you get the wrong data; they don't eliminate it.
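That trivial example can be made concrete with a few lines of Python (an illustrative sketch; the particular two-bit reduction is assumed, not any product's algorithm):

```python
from collections import defaultdict

# Sketch of the trivial example above: reduce each 3-bit value (0-7) to a
# 2-bit "checksum" by keeping the low two bits (an illustrative reduction).
def checksum(value):
    return value & 0b11

buckets = defaultdict(list)
for value in range(8):          # all possible 3-bit data
    buckets[checksum(value)].append(value)

# Every 2-bit checksum stands for exactly two different 3-bit values, so a
# reverse lookup that guesses a preimage is wrong 50% of the time.
for c in sorted(buckets):
    print(f"checksum {c:02b} <- data {buckets[c]}")
```

Whatever reduction you pick, the pigeonhole principle forces each checksum to cover two of the eight values.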

[quote]If odds are so important it seems it would be important to worry about
the odds that your data center, your offsite storage location and your
Disaster Recovery site will all be taken out at the same time.
[/quote]
And if it's not important that the data you read may not be what was
written, don't let me stop you. _The odds are_ that it'll be okay.

[quote]I also suggest the argument is flawed because it seems to imply that
only the cksum is stored and not the actual data - it is the original
compressed data AND the cksum that result in the restore -
not the cksum alone.
[/quote]
If I get your meaning, you have an incorrect understanding of the
argument--nobody is talking about generating the original data from a
checksum. As I said in what you quoted (trimmed here), every unique (as
determined by the implementation) "block" of data gets stored, once. A
data stream is stored as a list of pointers or
checksums/hashes/fingerprints which refer to those common-storage
"blocks". Any number of data streams will point to the same "block"
when they have it in common, and as many times as that "block" occurs in
their data stream. To read the data stream later, the list of pointers
tells the implementation what "blocks" to retrieve and send back to the
file reader. Now, if "foo" and "bar" both reduced to the same
checksum/hash/fingerprint when stored, somebody is going to receive the
wrong data when the stream(s) that had those data are read. So sorry
about that corrupted payroll master file...
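The storage scheme described above can be sketched in a few lines (illustrative only; SHA-1 stands in for the 160-bit fingerprint, and the toy blocks are assumed, not any vendor's design):

```python
import hashlib

# Minimal sketch of hash-indexed dedup storage as described above.
store = {}  # fingerprint -> the single stored copy of a "block"

def write_stream(blocks):
    """Store a data stream as a list of fingerprints referring to common blocks."""
    recipe = []
    for block in blocks:
        fp = hashlib.sha1(block).digest()   # 160-bit checksum/hash/fingerprint
        store.setdefault(fp, block)  # first block with this fingerprint wins;
                                     # a colliding-but-different block would
                                     # silently alias it on later reads
        recipe.append(fp)
    return recipe

def read_stream(recipe):
    """Rebuild the stream by retrieving each referenced block."""
    return b"".join(store[fp] for fp in recipe)

recipe = write_stream([b"foo", b"bar", b"foo"])
print(len(store), read_stream(recipe))  # two unique blocks back three logical blocks
```

In a hash-only implementation, the `setdefault` line is exactly where a collision silently substitutes one block's data for another's.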

Tapeless backup environments?
September 26, 2007 12:57PM
[quote]Pls read my other post about the odds of this happening.
With a decent
key space, the odds of a hash collision with a 160-bit key
space are so
small that any statistician would call them zero. 1 in 2^160. Do you
know how big that number is? It's a whole lot bigger than it looks.
And those odds are significantly better than the odds that you would
write a bad block of data to a regular disk drive and never know it.
[/quote]
I did read your other post, and addressed your numbers. C Freemantle
makes the same point I do, perhaps more clearly, in his "birthday
paradox" posting.

Tapeless backup environments?
September 26, 2007 02:20PM
[quote]On Wed, Sep 26, 2007 at 04:02:49AM -0400, bob944 wrote:
[quote]Bogus comparison. In this straw man, that
1/100,000,000,000,000 read error a) probably doesn't
affect anything because of the higher-level RAID array
it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop,
fail, get-it-from-another-source error--NOT a silent
changing of the data from foo to bar on every read
with no indication that it isn't the data that
was written.
[/quote]
While I find the "compare only based on hash" a bit annoying
for other reasons, the argument above doesn't convince me.

Disks, controllers, and yes RAID arrays can fail silently in
all sorts of ways by either acknowledging a write that is not
done, writing to the wrong location, reading from the wrong
location, or reading blocks where only some of the data came
from the correct location. Most RAID systems do not verify
data on read to protect against silent data errors on the
storage, only against obvious failures.
[/quote]
Perhaps anything can have a failure mode where it doesn't alert--but in
a previous lifetime in hardware and some design, I saw only one
undetected data transformation that did not crash or in some way cause
obvious problems (intermittent gate in a mainframe adder that didn't
affect any instructions used by the OS).

I don't remember a disk that didn't maintain, compare and _use for error
detection_, the cylinder, head and sector numbers in the format.

The write frailties mentioned, if they occur, will fail on read. And
the read frailties mentioned will generally (homage paid to the
mainframe example I cited as the _only_ one I ever saw that didn't)
cause enough mayhem that apps or data or systems go belly-up in a big
way, fast.

These events, like double-bit parity errors or EDAC failures, involve
1. that something breaks in the first place
2. that it not be reported
3. that the effects are so subtle that they are unnoticed (the app or
system doesn't crash, the data aren't wildly corrupted, ...)

The problem with checksumming/hashing/fingerprinting is that the
methodology has unavoidable errors designed in, and an implementation
with no add-on logic to prevent or detect them will silently corrupt
data. That's totally different, IMO.

Tapeless backup environments?
September 26, 2007 10:46PM
On Wed, Sep 26, 2007 at 05:15:08PM -0400, bob944 wrote:
[quote]Perhaps anything can have a failure mode where it doesn't alert--but in
a previous lifetime in hardware and some design, I saw only one
undetected data transformation that did not crash or in some way cause
obvious problems (intermittent gate in a mainframe adder that didn't
affect any instructions used by the OS).
[/quote]
There's a lot more data out there now (more chances for problems).
Disk firmware has become much more complex.

[quote]I don't remember a disk that didn't maintain, compare and _use for error
detection_, the cylinder, head and sector numbers in the format.
[/quote]
Disks may (usually) do that, but they don't report it back to you so you
can verify, and they're not perfect.

One of the ZFS developers wrote about a disk firmware bug they
uncovered. Every once in a while the disk would return the data not
from the requested block but from a block with some odd calculated
offset from that one. Unless the array/controller/system is checking
the data, you'll never know until it hits something critical.

Netapp also talks about the stuff they had to add because of silently
dropped writes and corrupted reads.

Everything has an error rate.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Tapeless backup environments?
September 30, 2007 09:56PM
Chris Freemantle said:
[quote]It's interesting that the probability of any 2 randomly selected hashes
being the same is quoted, rather than the probability that at least 2
out of a whole group are the same. That's probably because the minutely
[/quote]
[quote]small chance becomes rather bigger when you consider many hashes. This
will still be small, but I suspect not as reassuringly small.
To illustrate this consider the 'birthday paradox'.
[/quote]
I'm really glad you point this out. The way I interpret this is that
the odds of there being a hash collision in your environment increase
with every new block of data you submit to the de-duplication system.
I've talked to somebody who has researched this mathematically, and he
says he's going to share with me his calculations. I'll share them
if/when he shares them with me. As a proponent of these systems, I
certainly don't want to misrepresent the odds they represent.

[quote]For our data I would certainly not use de-duping, even if it did work
well on image data.
[/quote]
I think you're under the misconception that all de-dupe systems use ONLY
hashes to identify redundant data. While there are products that do
this (and I still trust them more than you do), there are also products
that do a full block comparison of the supposedly matching blocks before
throwing one of them away.

In addition, there are ways to completely remove the risk you're worried
about. If you back up to a de-dupe backup system, regardless of its
design, and then use your backup software to copy from it to tape (or
anything), you verify the de-duped data, as any good backup software
will check all data it copies against its own stored checksums.

Tapeless backup environments?
September 30, 2007 10:37PM
Bob,

I'll try to respond as best as I can.

[quote]No matter. The length of the checksum/hash/fingerprint and the
sophistication of its algorithm only affect how frequently--not
whether--the incorrect answer is generated.
[/quote]
You and I don't disagree on this. The only thing we differ with is the
odds of the event. I think the odds are small enough to not be
concerned with, and you think they're larger than that.

(I also think it's important to state what I stated in my other reply.
Most de-dupe systems do not rely only on hashes. So if you can't get
past this whole hashing thing, there's no reason to reject de-dupe
altogether. Just make sure your vendor uses an alternate method.)

[quote]The notion that the bad guys will never figure out a way to plant a
silent data-change based on checksum/hash/fingerprint collisions is,
IMO, naive.
[/quote]
So someone is going to exploit the hash collision possibilities in my
backup system to do what, exactly? As much as I've spoken and written
about storage security, I can't for the life of me figure out what
someone would hope to gain or how they would gain it this way.

[quote]Those are impressive, and dare I guess, vendor-supplied, numbers. And
they're meaningless.
[/quote]
These are odds based on the size of the key space. If the key space is
2^160, you have a 1 in 2^160 chance of a collision.

[quote]What _is_ important? To me, it's important that if I read
back any of the N terabytes of data I might store this week, I get the
same data that was written, not a silently changed version because the
checksum/hash/fingerprint of one block that I wrote collides with
another checksum/hash/fingerprint.
[/quote]
This is referring to the birthday paradox. As I stated in another post,
I haven't thought about this before, and am looking into what the real
odds are. I'm trying to translate it into actual numbers.

[quote]I can NOT have that happen to any
block--in a file clerk's .pst, a directory inode or the finance
database. "Probably, it won't happen" is not acceptable.
[/quote]
Couldn't agree more.

[quote][quote]Let's compare those odds with the odds of an unrecoverable
read error on a typical disk--approximately 1 in 100 trillion
[/quote][/quote]
[quote]Bogus comparison. In this straw man, that 1/100,000,000,000,000 read
error a) probably doesn't affect anything
[/quote]
I thought probably wasn't acceptable? I'm sorry, that was just too
close to your previous use of "probably" in a very different context.

[quote]probably doesn't affect anything because of the higher-level
RAID array it's in and b) if it does, there's an error, a
we-could-not-read-this-data, you-can't-proceed, stop, fail,
get-it-from-another-source error--NOT a silent changing of the data from
foo to bar on every read with no indication that it isn't the data that
was written.
[/quote]
I think Darren's other posts about this point are sufficient. It
happens. It happens all the time, and is well documented. And yet the
industry's ok with this. On the other hand, the odds of what we're
talking about are significantly smaller and people are freaking out.

[quote][quote]If you want to talk about the odds of something bad happening and not
knowing it, keep using tape. Everyone who has worked with tape for any
length of time has experienced a tape drive writing something that it
then couldn't read.
[/quote]
That's not news, and why we've been making copies of data for, oh, 50
years or so.
[/quote]
I'm just saying that a hash collision, however unlikely, would basically
translate into a failed backup that looks good. Do you have any idea
how many failed backups that look good happen every single day with
tape? And, as long as you bring up making copies, making copies of your
de-duped data removes any concerns, as it verifies the original.

[quote][quote]Compare that to successful deduplication disk
restores. According to Avamar Technologies Inc. (recently acquired by
EMC Corp.), none of its customers has ever had a failed restore.
[/quote]
Now _there's_ an unbiased source.
[/quote]
Touché. Anyone who has actually experienced a hash collision in their
de-duplication backup system please stand up. Given the hype that
de-dupe has made, don't you think that anyone who had experienced such a
thing would have reported it and such a report would have been given big
press? I sure do. And yet there has been nothing.

Tapeless backup environments?
October 01, 2007 02:30AM
[quote]-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf
Of Curtis Preston
Sent: 01 October 2007 06:35
To: bob944 < at > attglobal.net; veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?
[/quote]...
[quote]
These are odds based on the size of the key space. If you have 2^160
odds, you have a 1:2^160 chance of a collision.
[/quote]
By saying that, the implication is that the keyspace is uniform. It's
not. The probability of a hash collision is a function of the uniformity
of the keyspace as well as the number of items you've hashed and the
size of the key. There's lots of research in the crypto field that's
relevant to de-dupe.
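A toy simulation makes the uniformity point concrete (illustrative numbers only; the "biased" hash, which crowds its output into 1/256th of the keyspace, is an assumption standing in for a non-uniform algorithm):

```python
import random
from collections import Counter

# Toy illustration: 1,000 items hashed into a 16-bit keyspace, once with a
# uniform hash and once with a biased one. Same item count, same key size;
# only the uniformity differs, and the collision count jumps accordingly.
random.seed(1)
KEYSPACE = 2 ** 16
N = 1000

uniform_keys = [random.randrange(KEYSPACE) for _ in range(N)]
biased_keys = [random.randrange(KEYSPACE // 256) for _ in range(N)]

def collisions(keys):
    """Number of items that landed on an already-used key."""
    return sum(count - 1 for count in Counter(keys).values() if count > 1)

print("uniform:", collisions(uniform_keys))
print("biased: ", collisions(biased_keys))
```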

You also should consider the characteristics of the de-dupe software
when it encounters a hash collision. Backups are the last line of
defence for many, when all else (personal copies, replication, snapshots
etc.) has failed. The 'acceptable risk' of a hash collision is of
little comfort when you've got one. Does it fail silently, throw its
hands in the air and core dump, or handle the situation gracefully and
carry on without missing a beat? Ask them what they do. As Curtis
mentioned, not all de-dupe s/ware relies purely on hashes.

Balance this with the /fact/ that there's already a chance of undetected
corruption in the components you buy today, which is why most
technologies that survive impose their own data validation checks
instead of relying purely on the underlying technology in the stack to
have checked it for them. The multi-layered checks that go on improve
your overall confidence.

At least one design in the SiS field also accepts that hashing
algorithms will improve over time and they've had the foresight to be
able to drop in new hashing schemes in future.

When picking de-dupe software you should also care about Intellectual
Property. Who's got what isn't necessarily clear in this space, and the
patent lawyers won't be far away. Picking the big boys helps here, but
also look at people with a mature view to the marketplace (eg. some
companies are prepared to talk about licensing deals rather than court
cases when they encounter infringement).

There's lots of other things to consider in picking an algorithm,
including how well it handles patterns that don't fall naturally on block
boundaries (think of the challenges involved in de-duping 'the quick
brown fox' and 'the quicker brown fox') that will affect de-dupe ratios,
and how that affects performance. And the solution's not just about the
algorithm.

De-dupe is a great advance, and a disruptive technology not just for
backup but also for primary storage. Look forward to it, but go in with
your eyes open.
Tapeless backup environments?
October 14, 2007 10:34PM
As promised, I looked into applying the Birthday Paradox logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of data, I think you'll be OK.
Tapeless backup environments?
October 15, 2007 09:15PM
cpreston <netbackup-forum < at > backupcentral.com>:
[quote]As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.
[/quote]
One of us still doesn't understand this. :-)

Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox. The number of possible values in
BP is 366; there is no data reduction in it, no key values. An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity. An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.

What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.

Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that. What _would_ be a
useful exercise for the reader is to discover how many unique values of
8KB are, on average, represented by a given 160-bit
checksum/hash/fingerprint.
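That exercise for the reader is straightforward pigeonhole arithmetic, which Python's big integers can check directly (a sketch; the 8KB/160-bit figures are the ones used in this thread):

```python
# Average number of distinct 8 KB blocks per 160-bit checksum/hash/fingerprint,
# by the pigeonhole principle: 2^65536 possible blocks / 2^160 possible hashes.
DATA_BITS = 8 * 1024 * 8   # 65,536 bits in an 8 KB block
HASH_BITS = 160

preimages_per_hash = 2 ** (DATA_BITS - HASH_BITS)   # 2^65376

# The quotient is astronomically large: 65,376 binary orders of magnitude,
# i.e. roughly 19,700 decimal digits.
print(preimages_per_hash.bit_length() - 1, "binary orders of magnitude")
```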

Tapeless backup environments?
October 16, 2007 03:42PM
On Tue, Oct 16, 2007 at 12:09:30AM -0400, bob944 wrote:
[quote]One of us still doesn't understand this. :-)

Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox. The number of possible values in
BP is 366; there is no data reduction in it, no key values.
[/quote]
The 366 isn't the data space, it's the keyspace. When we look at a
person's birthday, we're hashing them into that space. The "paradox"
then is how many people can we hash before the chance of a "collision"
is significant.

Obviously if 400 people are in a room, the number of values exceeds the
keyspace and a collision is certain.

[quote]An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity.
[/quote]
I'm afraid I don't understand what you mean by that sentence.

[quote]An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.
[/quote]
What is stopping at 8 bits? Hash collisions can always occur. The
question is what is the probability.

[quote]What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.
[/quote]
I think everyone acknowledges that as a fact.

[quote]Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values.
[/quote]
The argument is that a process does not have to be infallible to be
valuable, much like the electrical and mechanical processes we currently
use: if the chance of failure in the algorithm is much less than
the chance of other parts of the system introducing silent data
corruption, then the overall amount of data loss is not significantly
changed.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Tapeless backup environments?
October 18, 2007 02:28AM
At the risk of tilting at windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind. I again say that no one is saying that hash collisions can't
happen. We are simply saying that the odds of them happening are
astronomically less than having an undetected/uncorrected bit error on
tape. And I believe that the math that I use in my blog post
illustrates this.

I said:
[quote]As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.
[/quote]
Bob944 said:
[quote][quote]One of us still doesn't understand this. :-)
[/quote][/quote]
Got that right. :-)

[quote][quote]Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox.
[/quote][/quote]
I completely disagree. If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case. All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:

* The odds of two people in the same room having the same birthday
increase with the number of people in the room. If there are only
two people in the room, those odds will be roughly 1 in 365, or .27%
(leap year aside). If there are 23 people in the room,
the odds are 50%.

* The odds of two DIFFERENT blocks having the same hash (i.e. a
hash collision) increase with the number of blocks in the data set.
If there are two blocks in the set, the odds are 1 in 2^160.
If there are less than 12.7 quintillion blocks in the data set,
the odds don't show up in a percentage calculated out to 50 decimal
places. As soon as you have more than 12.7 quintillion blocks, the
odds at least register in 50 decimal places, but are still really
small. And to get 12.7 quintillion blocks, you need to store at
least 95 Exabytes of data.
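Both bullets follow from the standard birthday-paradox approximation, which is easy to compute (a sketch; the 2^80-block figure below is my own illustrative data point, not one from the blog):

```python
import math

# Birthday-paradox approximation: chance that at least two of n random
# keys collide in a keyspace of size d. Accurate when n is much smaller
# than d: P ~= 1 - exp(-n*(n-1) / (2*d)).
def collision_probability(n, d):
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * d))

# 23 people, 365 possible birthdays: roughly a coin flip.
print(round(collision_probability(23, 365), 2))

# Blocks hashed into a 2^160 keyspace: even 2^80 blocks reach only ~39%.
print(round(collision_probability(2 ** 80, 2 ** 160), 2))
```

The characteristic behavior is that the odds stay negligible until the number of items approaches the square root of the keyspace, then climb quickly.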

[quote]The number of possible values in
BP is 366; there is no data reduction in it, no key values. An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity.
[/quote]
Yeah, IMHO, we are talking apples and oranges. Let me try to put the
hash collision into the birthday world. Let's say that we want a wall
of photos of everyone who came to our party. When you show up, we check
your birthday, and we check it off on a list. (We'll call your BD the
"hash.") If we've never seen your birthday before, we take your photo
and put it on the wall. If your birthday has already been checked off
on the list, though, we don't take your photo. We assume that since you
have the same birthday, you must be the same person. So you don't get
your photo taken. We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back). Now, if he is indeed the same guy, that's not a hash/BD
collision. If he is indeed a different person, and we said he was the
same person simply because he had the same birthday, then that would be
a hash/BD collision.

And it WOULD be an absurdity to think you can represent n number of
people at a party with an array of photos selected solely on their
birthday (a key space of only 366). But it's not out of the realm of
possibility to say that we could represent n number of bits in our data
center with an array of bits selected solely on a 160-bit hash (a
keyspace of 2^160). Cryptographers have been doing it for years. We're
just adding another application on it.

[quote][quote]An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.
[/quote][/quote]
Again, I hope you read what I wrote above. In the analogy, we're not
de-duping birthdays; we're de-duping people BASED on their birthdays.
(Which would be a dumb idea because the key space is too small: 366.)

[quote][quote]What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.
[/quote][/quote]
I concede, I concede! The only point I'm trying to make concerns the
odds that two different blocks of data will have the same hash (i.e. a
hash collision) in a given data center.

[quote][quote]Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that.
[/quote][/quote]
This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead. Some use a weak hint +
bit-level verify, some use delta-differencing technologies, which are
bit-to-bit comparisons as well.

Tapeless backup environments?
October 18, 2007 08:55AM
[quote]What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.
[/quote]
that is why i have turned off all hardware and software compression on
my tape drives. imagine trying to store more than 400GB of data onto a
single lto3 tape! they "say" that you can store up to and even more
than 800GB, but i don't believe a word of it. there is no way 1 nibble
of data can represent 1 byte! once i have the time to study lzr
compression and understand it, and see whether or not it is
"data-loss-less", then i may turn compression back on. until then,
tapes are cheap and i'll buy 2.5 times as many as i need. :-)

thanks,
jerald

p.s.
our de-dupe vtl does the hash and then a bit by bit comparison of the
data block to ensure the data really is the same in order to eliminate
the duplicate block. i think some of the confusion may be in not
understanding how the de-dupe process works. once you create a hash for
a block of data, you are storing the hash AND the block of data. you
are never having to re-create a big block of data from a smaller hash.
the backup stream of data gets re-written from a "string" of 8k blocks,
into a "string" of 160-bit pointers which point to the unique 8k blocks
of data via the hash table. or something like that...
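jerald's hash-then-verify description can be sketched like this (illustrative code, not the VTL's actual implementation; the block contents are assumed):

```python
import hashlib

# Sketch of the P.S. above: fingerprint first, then a bit-by-bit comparison
# before declaring a duplicate. Colliding-but-different blocks are kept.
store = {}  # fingerprint -> list of distinct blocks sharing that fingerprint

def dedupe_write(block):
    """Return (fingerprint, index) locating the single stored copy of block."""
    fp = hashlib.sha1(block).digest()
    bucket = store.setdefault(fp, [])
    for i, stored in enumerate(bucket):
        if stored == block:       # full comparison: only a true match is deduped
            return fp, i
    bucket.append(block)          # fingerprint is new, or the hash matched but
    return fp, len(bucket) - 1    # the data differed (a collision): store it

first = dedupe_write(b"8k block of data")
second = dedupe_write(b"8k block of data")   # duplicate: same location returned
print(first == second, sum(len(v) for v in store.values()))
```

The full comparison is what removes the collision risk the thread argues about, at the cost of reading the stored block on every fingerprint match.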

Tapeless backup environments?
October 18, 2007 10:16AM
Hardware compression on your tape drives buys more than saved tapes; it
buys reduced backup times. I found that out way back when on DDS
tapes. We do compression on our stuff (and I have at many jobs) and
have yet to see a restore fail for any reason other than an issue in
the original backup job that wasn't noticed at the time, never some
mystical bit change that occurred during the restore.

While it is theoretically possible you'll get killed during the next
Leonid meteor shower, I doubt you're reinforcing your roof with steel to
ensure it doesn't happen.

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of Iverson,
Jerald
Sent: Thursday, October 18, 2007 11:52 AM
To: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

<snip>
----------------------------------
CONFIDENTIALITY NOTICE: This e-mail may contain privileged or confidential information and is for the sole use of the intended recipient(s). If you are not the intended recipient, any disclosure, copying, distribution, or use of the contents of this information is prohibited and may be unlawful. If you have received this electronic transmission in error, please reply immediately to the sender that you have received the message in error, and delete it. Thank you.
----------------------------------

Tapeless backup environments?
October 18, 2007 10:46AM
So you're OK with hash-based de-dupe, which everyone acknowledges has a
chance (although quite small) that you could have a hash-collision and
potentially corrupt a block of data somewhere, sometime, when you least
expect it...

But you're NOT ok with the long-running industry standard of lossless
compression algorithms? (All compression algorithms for tape are
lossless.) Lossy algorithms are only used in things like
video compression, where it's ok to lose detail along the way as long as
the human eye can't detect it, or as long as you can fit it on
youtube.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of Iverson,
Jerald
Sent: Thursday, October 18, 2007 8:52 AM
To: veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

<snip>

Tapeless backup environments?
October 18, 2007 12:01PM
Sorry, but I just can't keep from jumping in at this point.
Not taking either side, but...

Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research? I could place a posting on there that
either concurs with, or totally rejects the position of that posting;
and someone else would come along and claim it as gospel.

I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!

Saying
" This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead."
is pretty gutsy, since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that. Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."?

If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.

I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.

I hope I'm wrong. I love to learn. I'm actually signed up for one of
your classes next week. But if quoting everyone else's
posts/blogs/Wikipedia entries, etc., without backing them up with
empirical evidence or firsthand testing, is your program agenda, I
will skip the engagement...

BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them. ;-)

Take care,

Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS

------------------------------------------------------------------------
---
Message: 1
Date: Thu, 18 Oct 2007 04:06:52 -0400
From: "Curtis Preston" <cpreston < at > glasshouse.com>
Subject: Re: [Veritas-bu] Tapeless backup environments?
To: <bob944 < at > attglobal.net>, <veritas-bu < at > mailman.eng.auburn.edu>
Message-ID:

<4FBA0941CF3D9347889AA5FF23A809BEF3C673 < at > ghmail02.glasshousetech.com>
Content-Type: text/plain; charset="US-ASCII"

At the risk of chasing windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind. I again say that no one is saying that hash collisions can't
happen. We are simply saying that the odds of them happening are
astronomically less than the odds of an undetected/uncorrected bit error
on tape. And I believe that the math I use in my blog post
illustrates this.

I said:
[quote]As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:

http://www.backupcentral.com/content/view/145/47/

Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.
[/quote]
Bob944 said:
[quote][quote]One of us still doesn't understand this. :-)
[/quote][/quote]
Got that right. :-)

[quote][quote]Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox.
[/quote][/quote]
I completely disagree. If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case. All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:

* The odds of two people in the same room having the same birthday
increase with the number of people in the room. If there are only
two people in the room, those odds will be roughly 1 in 365, or .27%
(leap year aside). If there are 23 people in the room,
the odds are 50%.

* The odds of two DIFFERENT blocks having the same hash (i.e. a
hash collision) increase with the number of blocks in the data set.
If there are two blocks in the set, the odds are 1 in 2^160.
If there are fewer than 12.7 quintillion blocks in the data set,
the odds don't show up in a percentage calculated out to 50 decimal
places. As soon as you have more than 12.7 quintillion blocks, the
odds at least register in 50 decimal places, but are still really
small. And to get 12.7 quintillion blocks, you need to store at
least 95 Exabytes of data.
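The approximation behind odds like these is easy to check; here is a back-of-the-envelope sketch using the standard birthday bound of n(n-1)/(2d), which is my arithmetic rather than the blog's exact figures:

```python
from fractions import Fraction

HASH_SPACE = 2 ** 160            # keyspace of a 160-bit hash

def collision_odds(n_blocks: int) -> Fraction:
    # Birthday-paradox upper bound: P(any collision) <= n(n-1) / (2 * d)
    return Fraction(n_blocks * (n_blocks - 1), 2 * HASH_SPACE)

print(float(collision_odds(2)))              # ~6.8e-49, i.e. 1 in 2^160
print(float(collision_odds(127 * 10**17)))   # 12.7 quintillion blocks: still tiny
```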

[quote]The number of possible values in
BP is 366; there is no data reduction in it, no key values. An
algorithm which reduced the 366 possibilities the same way that hashing
8KB down to 160 bits would yield infinitesimal keys smaller than one
bit, an absurdity.
[/quote]
Yeah, IMHO, we are talking apples and oranges. Let me try to put the
hash collision into the birthday world. Let's say that we want a wall
of photos of everyone who came to our party. When you show up, we check
your birthday, and we check it off on a list. (We'll call your BD the
"hash.") If we've never seen your birthday before, we take your photo
and put it on the wall. If your birthday has already been checked off
on the list, though, we don't take your photo. We assume that since you
have the same birthday, you must be the same person. So you don't get
your photo taken. We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back). Now, if he is indeed the same guy, that's not a hash/BD
collision. If he is indeed a different person, and we said he was the
same person simply because he had the same birthday, then that would be
a hash/BD collision.

And it WOULD be an absurdity to think you can represent n number of
people at a party with an array of photos selected solely on their
birthday (a key space of only 366). But it's not out of the realm of
possibility to say that we could represent n number of bits in our data
center with an array of bits selected solely on a 160-bit hash (a
keyspace of 2^160). Cryptographers have been doing it for years. We're
just adding another application of it.

[quote][quote]An absurdity which should show that even if it
stopped at eight bits, one short of the bits required to hold 1-366,
there would still be fatal hash collisions--say, Feb 7, Feb 11 and Jun
30 all represented by the same code, in which case you can't figure out
if people in the room have the same birthday.
[/quote][/quote]
Again, I hope you read what I wrote above. In the analogy, we're not
de-duping birthdays; we're de-duping people BASED on their birthdays.
(Which would be a dumb idea, because the key space is too small: 366.)

[quote][quote]What you must grasp is that it is *impossible* to
represent/re-create/look up the values of 2^65536 bits in fewer than
2^65536 bits--unless you concede that each checksum/hash/fingerprint
will represent many different values of the original data--any more than
you can represent three bits of data with two.
[/quote][/quote]
I concede, I concede! The only point I'm trying to make is: what are the
odds that two different blocks of data will have the same hash (i.e. a
hash collision) in a given data center?

[quote][quote]Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that.
[/quote][/quote]
This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead. Some use a weak hint +
bit-level verify, some use delta-differencing technologies, which are
bit-to-bit comparisons as well.

Visit our website at www.wilmingtontrust.com

Investment products are not insured by the FDIC or any other governmental agency, are not deposits of or other obligations of or guaranteed by Wilmington Trust or any other bank or entity, and are subject to risks, including a possible loss of the principal amount invested. This e-mail and any files transmitted with it may contain confidential and/or proprietary information. It is intended solely for the use of the individual or entity who is the intended recipient. Unauthorized use of this information is prohibited. If you have received this in error, please contact the sender by replying to this message and delete this material from any system it may be on.

Tapeless backup environments?
October 18, 2007 12:53PM
I would say no; Wikipedia is like an encyclopedia, a good spot
to start, but it isn't peer-reviewed published material, so in research
it would not be considered a valid source.

Dustin D'Amour
Wireless Switching
Plateau Wireless

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of Eagle,
Kent
Sent: Thursday, October 18, 2007 12:59 PM
To: cpreston < at > glasshouse.com; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: Re: [Veritas-bu] Tapeless backup environments

<snip>
Tapeless backup environments?
October 18, 2007 01:43PM
Glad to have another person in the party. What's your birthday? ;)

[quote]Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research?
[/quote]
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP the same way I'm using it, specifically in
regard to hash collisions; the concept is not new to deduplication.
It has been applied to cryptographic uses of hashing for years.

I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So..

As to whether or not what I'm doing is empirical scientific research:
it's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of hash-caused data corruption, but that doesn't
address the question. IMHO, it's the equivalent of saying a meteor has
never hit my house, so meteors must never hit houses. The discussion is
about the statistical probability of a meteor hitting your house, and
you have to address that with math, not empirical scientific research.

[quote]I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
[/quote]
And you're saying that my half-dozen or so blog postings on the
subject, and all my responses in this thread, don't make you think?
I was fine until I quoted Wikipedia, is that it? ;)

[quote]Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that.
[/quote]
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.

[quote]Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
[/quote]
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear,
uncertainty, and doubt (FUD). I felt there was a little Henny-Penny
business going on.

[quote]If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.
[/quote]
You won't get any argument from me. I think you'll find almost that
exact sentence in the first few paragraphs of any of my books. Having
said that, we all use technologies as part of our backup system that
have a failure rate percentage (like tape). And to the best of my
understanding, the odds of a single hash collision in 95 Exabytes of
data are significantly lower than the odds of having corrupted data on an
LTO tape and not even knowing it, based on the odds they publish. Even
if you make two copies, the copy could be corrupted, and you could have
a failed restore. Yet we're all ok with that, but we're freaking out
about hash collisions, which statistically speaking have a MUCH lower
probability of happening.

[quote]I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.
[/quote]
I sure as heck hope so, but I don't think it addresses this issue. So
you test it and you don't get any hash collisions. What does that prove?
It proves that a meteor has never hit your house.

What I recommend (especially if you're using a hash-only de-dupe system)
is constant verification of the system. Use a product like NBU that
can do CRC checks against the bytes it's copying or reading, and either
copy all de-duped data to tape or run an NBU verify on every backup. If
you have a hash collision, your copy or verify will fail, and you'll at
least know when it happens.
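Conceptually, that kind of end-to-end check amounts to carrying an independent checksum alongside the de-duped data. An illustrative sketch, not NBU's actual mechanism:

```python
import zlib

def checksum_backup(blocks):
    """Record an independent CRC32 per block at backup time."""
    return [zlib.crc32(b) for b in blocks]

def verify_restore(restored, crcs):
    """Re-read the de-duped data; a hash collision that substituted the
    wrong block would surface here as a CRC mismatch."""
    return len(restored) == len(crcs) and all(
        zlib.crc32(b) == c for b, c in zip(restored, crcs))
```

Because the CRC is computed from the original bytes rather than the de-dupe hash, it catches exactly the substitution failure mode being debated.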

[quote]I hope I'm wrong.
[/quote]
About what? That I'm an idiot? ;) I think judging me solely on this
long, protracted, difficult to follow discussion (with over 70 posts) is
probably unfair. Remember also that these posts are often done on my
own time late at night, etc. I never claimed to be perfect.

[quote]I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing up re-posting them
with empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...
[/quote]
I don't think you'll find that to be a problem. I'm an in-the-trenches
guy, who has sat in front of many a tape drive, tape library, and backup
GUI in my 14 years in this space. I actually cut my teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.) Don't skip out on the school just because I
quoted Wikipedia once.

[quote]TW - You "Tilt at Windmills" (Don Quixote), you don't chase them. ;-)
[/quote]
You are right. I stand corrected again. Even Wikipedia backs you up:
http://en.wikipedia.org/wiki/Don_Quixote

(Sorry, just couldn't resist.) ;)

Tapeless backup environments?
October 18, 2007 01:53PM
On 10/18/07, Iverson, Jerald <Jerald.Iverson < at > aiminvestments.com> wrote:
[quote]...
that is why i have turned off all hardware and software compression on
my tape drives. imagine trying to store more than 400GB of data onto a
single lto3 tape! they "say" that you can store up to and even more
than 800GB, but i don't believe a word of it. there is no way 1 nibble
of data can represent 1 byte! once i have the time to study lzr
compression and understand it,
[/quote]<snip>

Hi Jerald,

Data compression exploits the non-randomness of "normal" data.
Compression algorithms have variable compression rates because their
performance is dependent on the data being compressed. Truly random
data does NOT compress at all. "Typical" data is not truly random.
Once data has been compressed, it is close to random, so compressing it
a second time yields little or no further reduction. Many encryption
algorithms also produce near-random output that does not compress.

A formal definition of a data set's "randomness" is its Kolmogorov
complexity. http://en.wikipedia.org/wiki/Kolmogorov_complexity

Compression is just an alternate means of data representation.
Several others are at work on your LTO tapes too!
http://en.wikipedia.org/wiki/Forward_error_correction
http://en.wikipedia.org/wiki/Run_Length_Limited
http://en.wikipedia.org/wiki/PRML

Don't get too paranoid...these are good things.

Austin
Tapeless backup environments?
October 18, 2007 02:10PM
On Thu, Oct 18, 2007 at 01:44:03PM -0400, Curtis Preston wrote:
[quote]So you're OK with hash-based de-dupe, which everyone acknowledges has a
chance (although quite small) that you could have a hash-collision and
potentially corrupt a block of data somewhere, sometime, when you least
expect it...

But you're NOT ok with the long-running industry standard of loss-less
compression algorithms? [...]
[/quote]
I think the smiley on the end indicated that it was a humorous comment.
At least that's how I took it.

--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
Tapeless backup environments?
October 18, 2007 10:19PM
[quote]discussion, although it appears to me that you've already made up your
mind.
[/quote]
I'd prefer to say I have little interest in a technology which, by
design, will retrieve a completely different chunk of data than what was
written, with no notice whatsoever. BTW, before you bring out tape
errors again, I posted long ago why this argument was not comparable.

No point in beating the poor Birthday Paradox to death; you've
completely missed the point there. It doesn't matter that the same
values come up more often than our intuition suggests--which is the
_only_ lesson of BP--what matters is that if you use a shorthand to
track the values which can't tell that Feb 7 and Dec 28 are different
values, because you put them in the same hash bucket and therefore think
that everything in that bucket is Feb 7, you retrieve the wrong data.

Here's all a thinking person responsible for data needs to consider:

An 8KB chunk of data can have 2^65536 possible values. Representing
that 8KB of data in 160 bits means that each of the 2^160 possible
checksum/hash/fingerprint values MUST represent, on average, 2^65376
*different* 8KB chunks of data.
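Scaled down, the same pigeonhole arithmetic, and the collision behavior it implies, can be checked directly (a toy 8-bit fingerprint, purely illustrative; no real product uses a digest this short):

```python
import hashlib
import os

# Pigeonhole arithmetic at full scale: 2^65536 possible 8KB blocks spread
# across 2^160 fingerprints leaves 2^65376 blocks per fingerprint value.
assert 65536 - 160 == 65376

# Toy scale: truncate SHA-1 to one byte (256 buckets) and two *different*
# blocks land in the same bucket after only a handful of random blocks.
seen = {}
for count in range(1, 1001):
    block = os.urandom(16)
    fingerprint = hashlib.sha1(block).digest()[:1]   # 8-bit "fingerprint"
    if fingerprint in seen and seen[fingerprint] != block:
        print("collision after", count, "blocks")
        break
    seen[fingerprint] = block
```

With only 256 buckets the birthday effect produces a collision after roughly 20 random blocks; the argument in the thread is about whether 2^160 buckets pushes that point far enough out to stop worrying.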

If that doesn't concern you, well, it's safe to say I won't be hiring
you as my backup admin. Or as my technology consultant, since you
should know from earlier postings that spoofing your favorite 160-bit
hashing algorithm with reasonable-looking fake data is now old hat. The
exploit itself should concern us, not to mention that it also
illustrates that similar data which yields the same hash is not the
once-in-the-lifetime-of-the-universe oddity you portray.

Everything mentioned here was covered in the original postings a month
ago. Unless there's something new, I'm done with this.

Tapeless backup environments?
October 19, 2007 12:40AM
I wish we had a white board and could sit in front of each other to
finish the discussion, but it's obvious that it's not going to be
resolved here.

You believe I'm missing your point, and I believe you're missing my
point.

[quote]what matters is if you use a shorthand to track the
values which can't tell that Feb 7 and Dec 28 are different values
because you put them in the same hash bucket and therefore think that
everything in that bucket is Feb 7, you retrieve the wrong data.
[/quote]
Not sure how many times I (or others) have to keep saying, the dates are
not the data that are being deduped. The dates are the hashes. The
data is the person.

[quote]An 8KB chunk of data can have 2^65536 possible values. Representing
that 8KB of data in 160 bits means that each of the 2^160 possible
checksum/hash/fingerprint values MUST represent, on average, 2^65376
*different* 8KB chunks of data.
[/quote]
This, again, only makes sense if you are using the hash to
store/reconstruct the data, not to ID the data. The fingerprint (like a
real fingerprint) is not used to reconstruct a block, it's only used to
give it a unique ID that distinguishes it from other blocks. You still
have to store the block with the key. And with 2^160 different
fingerprints, that means we can calculate unique fingerprints for 2^160
blocks. That means we can calculate a unique fingerprint for
1,461,501,637,330,900,000,000,000,000,000,000,000,000,000,000,000
blocks, which is
11,972,621,413,015,000,000,000,000,000,000,000,000,000,000,000,000,000
bytes of data. That's a lot of stinking data.
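The distinction being drawn here (the fingerprint is a lookup key, not the data) looks roughly like this in a hash-indexed store; a generic sketch, not any vendor's implementation:

```python
import hashlib

class DedupeStore:
    # Generic sketch of a hash-indexed dedupe store: the fingerprint only
    # identifies a block; the block's bytes are still stored with the key.
    def __init__(self, verify=False):
        self.blocks = {}      # fingerprint -> block bytes
        self.verify = verify  # byte-compare on match (guards against collisions)

    def write(self, block):
        fp = hashlib.sha1(block).hexdigest()
        if fp in self.blocks:
            if self.verify and self.blocks[fp] != block:
                raise ValueError("hash collision: same fingerprint, different block")
            return fp                  # duplicate block: nothing new stored
        self.blocks[fp] = block
        return fp

    def read(self, fp):
        return self.blocks[fp]         # retrieval is by fingerprint

store = DedupeStore(verify=True)
k1 = store.write(b"A" * 8192)
k2 = store.write(b"A" * 8192)          # identical block dedupes away
assert k1 == k2 and len(store.blocks) == 1
assert store.read(k1) == b"A" * 8192
```

A hash-only product skips the byte-compare in write(); whether that shortcut is acceptable is exactly what this thread is arguing about.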

[quote]If that doesn't concern you, well, it's safe to say I won't be hiring
you as my backup admin. Or as my technology consultant, since you
[/quote]
I really don't think you need to make it personal, and suggest that I
don't know what I'm doing simply because we have been unable to
successfully communicate with each other in this medium. This medium can
be a very difficult one in which to discuss such a complex subject. I
think things would be very different in person with a whiteboard.

[quote]should know from earlier postings that spoofing your favorite 160-bit
hashing algorithm with reasonable-looking fake data is now old hat.
The exploit itself should concern us, not to mention that it also
illustrates that similar data which yields the same hash is not the
once-in-the-lifetime-of-the-universe oddity you portray.
[/quote]
They worked really hard to figure out how to take one block that
calculates to a particular hash and create another block that calculates
to the same hash. It's used to fake a signature. I get it. I just
don't see how or why somebody would use this to do I don't know what
with my backups. And if we were having this discussion over a few
drinks we could try to come up with some ideas. Right now, I'm as tired
as you are of this discussion.

[quote]Everything mentioned here was covered in the original postings a month
ago. Unless there's something new, I'm done with this.
[/quote]
You're right. IN THIS MEDIUM, you don't understand me, and I don't
understand you. Let's agree to disagree and move on.

For anyone who's still reading, I just want to say this:

I was only trying to bring some sanity to what I felt was an undue
amount of FUD against the hash-only products. I'm not necessarily trying
to talk anyone into them. I just want you to understand what I THINK
the real odds are. If after understanding how it works and what the
odds are, you're still uncomfortable, don't dismiss dedupe. Just
consider a non-hash-based de-dupe product.

Curtis out.

Tapeless backup environments?
October 19, 2007 01:03AM
How about setting up a white board / aka NetMeeting !

I think this thread has gone on for some time now, and yet there still
appears to be 2 different opinions.

Not going to please everyone.....! :-) personally, I would not be worried
about it and will just step out of the debate and move on.

Right or wrong, I really don't care that much :-)

But anyhow, something like DIGG Whiteboard might help - I think it's still
free if those wishing to continue the debate want to continue offline :-)

Bye !

Regards

Simon Weaver
3rd Line Technical Support
Windows Domain Administrator

EADS Astrium Limited, B23AA IM (DCS)
Anchorage Road, Portsmouth, PO3 5PU

Email: Simon.Weaver < at > Astrium.eads.net

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of Curtis Preston
Sent: Friday, October 19, 2007 8:38 AM
To: bob944 < at > attglobal.net; veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

<snip>

This email (including any attachments) may contain confidential and/or
privileged information or information otherwise protected from disclosure.
If you are not the intended recipient, please notify the sender
immediately, do not copy this message or any attachments and do not use it
for any purpose or disclose its content to any person, but delete this
message and any attachments from your system. Astrium disclaims any and all
liability if this email transmission was virus corrupted, altered or
falsified.
---------------------------------------------------------------------
Astrium Limited, Registered in England and Wales No. 2449259
REGISTERED OFFICE:-
Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2AS, England
Tapeless backup environments?
October 19, 2007 07:54AM
"since you should know from earlier postings that spoofing your favorite
160-bit hashing algorithm with reasonable-looking fake data is now old
hat. The exploit itself should concern us"

This I don't get. In addition to lamenting possibility over
probability, is bob944 now suggesting that dedupe vendors are evil
hackers who INTEND to destroy our data?

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of WEAVER, Simon (external)
Sent: Friday, October 19, 2007 4:01 AM
To: 'Curtis Preston'; bob944 < at > attglobal.net;
veritas-bu < at > mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

<snip>
----------------------------------
CONFIDENTIALITY NOTICE: This e-mail may contain privileged or confidential information and is for the sole use of the intended recipient(s). If you are not the intended recipient, any disclosure, copying, distribution, or use of the contents of this information is prohibited and may be unlawful. If you have received this electronic transmission in error, please reply immediately to the sender that you have received the message in error, and delete it. Thank you.
----------------------------------

Tapeless backup environments?
October 19, 2007 09:10AM
O.k., at the risk of seeming like "I wrote more than you, therefore I
must be right"...

2nd. (and last) post on this -

My first point was that you quoted a "Wikipedia" article as a source.
For me, it really had nothing to do with the subject matter. They have a
disclaimer as to the validity of anything on there, and for good reason:
Anyone can post anything on there, about anything, containing anything.
It might be right, it might be wrong. I would be far more inclined to
trust or quote an industry consortium, or even a vendor's test results
page, than "Wikipedia".

As long as we're throwing credentials around, I might as well mention: As
a former scientist, and statistician, and current engineer, I fully
understand what empirical research is. It INCLUDES math. It is the
actual testing and the statistics of that testing. FWIW: I was trained
in this and FMEA (Failure Modes Effects Analysis) by the gentleman who
ran the Reliability and Maintainability program for Boeing's Saturn and
Apollo space programs, as well as their VERTOL and fixed wing programs.

I can see where my second point could have easily been misinterpreted.
Apologies to anyone led astray. What I meant was that the posts made by
"Bob944" seemed to me to be supported by cited facts, and denoted
personal experiences. He's not pointing to something he previously
authored as proof that information is fact. I've only seen him reference
previous posts for the purposes of levelset. To be fair, I haven't read
any of your blog postings, only your posts in this forum. More on that
below. And yes; an "Industry Pundit, Author, SME", or whomever, quoting
"Wikipedia" as a source does tend to dilute credibility, in my mind.
It's not a personal attack, just my personal position on the issue.

The part below has me confused, where you say "No, because I never said
those words or anything like them in my article." Since I never
mentioned anything about any articles... All my comments are in regard
to your posts on this forum, in which you did say that:
[quote]Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
[/quote]
This was your text, no?

Obviously there's nothing wrong with admitting you're wrong. What I was
pointing out was that it appears duplicitous to make the comment above
and then state you're probably going to post a retraction in your blog
based on one user's experience. I'm referring to the 10 GbE thread where one
user reported stellar throughput, which contradicted a contrived
theoretical maximum, and several reports of ho-hum throughput.
" 7500 MB/s! That's the most impressive numbers I've ever seen by FAR.
I may have to take back my "10 GbE is a Lie!" blog post, and I'd be
happy to do so."
This was your text, no?
So one could easily conclude that a position was taken (and published)
on this topic without sufficient testing or research (the related
SunSolve and other articles were already out there before these posts
were made).

You said: "Remember also that these posts are often done on my own time
late at night, etc. I never claimed to be perfect."
True, but you do cite that you are an author of books on the subject,
author of a blog on the subject, and work for one of the largest
industry resources. Indeed, the "VP Data Protection". You can see how
maybe a newbie might assume a post as gospel with the barrage of
credentials? Would they not be disappointed to learn they need to check
the timestamp of a post before lending any credence to its contents?
;-)

You said: " I don't think you'll find that to be a problem. I'm an
in-the-trenches guy, who has sat in front of many a tape drive, tape
library, and backup GUI in my 14 years in this space. I actually cut my
teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.)"

I'm not sure what you meant to imply by all this? If tenure with backup
is an issue, then I would suggest you really don't have all that much
time "in this space", relative to my experience anyway. I had been
working with various forms of backup for that long before MBNA even had
a Data Center in DE. Why would it be necessary to point out that you
were in the same geographic locale, or used the services of my employer?
I've never made mention of my employer, or even implied that any of my
statements represented any opinion or position of theirs. I find this
statement, well, bizarre...

Maybe I will attend the class after all. I'm beginning to think I'll be
entertained.

End transmission.

Regards,
Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS

-----Original Message-----
From: Curtis Preston [mailto]
Sent: Thursday, October 18, 2007 4:41 PM
To: Eagle, Kent; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: RE: Tapeless backup environments

Glad to have another person in the party. What's your birthday? ;)

[quote]Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research?
[/quote]
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP in the same way I'm using it, specifically in
regards to hash collisions, as the concept is not new to deduplication.
It has applied to cryptographic uses of hashing for years.

I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So..

As to whether or not what I'm doing is empirical scientific research,
It's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of hash-induced data corruption, but that doesn't address
the question. IMHO, it's the equivalent of saying a meteor has never
hit my house so meteors must never hit houses. The discussion is about
the statistical probabilities of a meteor hitting your house, and you
have to do that with math, not empirical scientific research.

[quote]I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
[/quote]
And you're saying that my half-a-dozen or so blog postings on the
subject, and none of my responses in this thread don't make you think?
I was fine until I quoted Wikipedia, is that it? ;)

[quote]Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that.
[/quote]
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.

[quote]Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no
[/quote]matter
[quote]what the numbers are, you're not going to accept..."
[/quote]
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear
uncertainty and doubt (FUD). I felt there was a little Henny-Penny
business going on.

[quote]If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were.
[/quote]I
[quote]either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.
[/quote]
You won't get any argument from me. I think you'll find almost that
exact sentence in the first few paragraphs of any of my books. Having
said that, we all use technologies as part of our backup system that
have a failure rate percentage (like tape). And to the best of my
understanding, the odds of a single hash collision in 95 Exabytes of
data is significantly lower than the odds of having corrupted data on an
LTO tape and not even knowing it, based on the odds they publish. Even
if you make two copies, the copy could be corrupted, and you could have
a failed restore. Yet we're all ok with that, but we're freaking out
about hash collisions, which statistically speaking have a MUCH lower
probability of happening.

[quote]I would thing that almost everyone on this forum does some kind of
[/quote]pilot
[quote]before rolling something out into production.
[/quote]
I sure as heck hope so, but I don't think it addresses this issue. So
you test it and you don't get any hash collisions. What does that prove?
It proves that a meteor has never hit your house.

What I recommend (especially if you're using a hash-only de-dupe system)
is a constant verification of the system. Use a product like NBU that
can do CRC checks against the bytes it's copying or reading, and either
copy all de-duped data to tape or run an NBU verify on every backup. If
you have a hash collision, your copy or verify will fail, and you'll at
least know when it happens.
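The idea is that the verify pass compares something other than the dedupe fingerprint, so a collision can't fool it. A toy sketch (my own illustration, not how NBU is implemented; the function names are made up):

```python
import hashlib
import zlib

# Toy dedupe store keyed by hash, plus an independent verify step.
# A hash collision would make dedupe_write() silently keep the wrong
# bytes -- but verify(), which checksums the bytes actually stored,
# would then fail loudly instead of handing back a corrupt restore.
store = {}  # fingerprint -> stored block

def dedupe_write(block: bytes) -> tuple:
    key = hashlib.sha256(block).hexdigest()
    store.setdefault(key, block)      # on a collision, wrong block kept
    return key, zlib.crc32(block)     # CRC recorded at backup time

def verify(key: str, crc_at_backup: int) -> bool:
    # Recompute a CRC over the bytes actually read back from the store.
    return zlib.crc32(store[key]) == crc_at_backup

key, crc = dedupe_write(b"payroll database dump")
assert verify(key, crc)   # no collision occurred, so the verify passes
```

If a collision had mapped two different blocks to one fingerprint, the stored bytes would no longer match the backup-time CRC and the verify would fail rather than stay silent.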

[quote]I hope I'm wrong.
[/quote]
About what? That I'm an idiot? ;) I think judging me solely on this
long, protracted, difficult to follow discussion (with over 70 posts) is
probably unfair. Remember also that these posts are often done on my
own time late at night, etc. I never claimed to be perfect.

[quote]I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing them up with
empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...
[/quote]
I don't think you'll find that to be a problem. I'm an in-the-trenches
guy, who has sat in front of many a tape drive, tape library, and backup
GUI in my 14 years in this space. I actually cut my teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.) Don't skip out on the school just because I
quoted Wikipedia once.

[quote]BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them. ;-)
[/quote]
You are right. I stand corrected again. Even Wikipedia backs you up:
http://en.wikipedia.org/wiki/Don_Quixote

(Sorry, just couldn't resist.) ;)

Visit our website at www.wilmingtontrust.com

Investment products are not insured by the FDIC or any other governmental agency, are not deposits of or other obligations of or guaranteed by Wilmington Trust or any other bank or entity, and are subject to risks, including a possible loss of the principal amount invested. This e-mail and any files transmitted with it may contain confidential and/or proprietary information. It is intended solely for the use of the individual or entity who is the intended recipient. Unauthorized use of this information is prohibited. If you have received this in error, please contact the sender by replying to this message and delete this material from any system it may be on.

_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
Tapeless backup environments?
October 19, 2007 09:39AM
Not an attack - just a question: Did someone in this thread say they
HAD experienced data loss due to deduplication? If so I missed it.

You mixed comments about another thread in here and I *think* you're
saying something about someone's experience with 10GigE rather than
deduplication. Your post could be misread to say someone had in fact
had such a data loss and posted it here.

-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto] On Behalf Of Eagle,
Kent
Sent: Friday, October 19, 2007 12:08 PM
To: Curtis Preston; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: Re: [Veritas-bu] Tapeless backup environments

O.k., at the risk of seeming like "I wrote more than you, therefore I
must be right"...

2nd. (and last) post on this -

My first point was that you quoted a "Wikipedia" article as a source.
For me, it really had nothing to do with the subject matter. They have a
disclaimer as to the validity of anything on there, and for good reason:
Anyone can post anything on there, about anything, containing anything.
It might be right, it might be wrong. I would be far more inclined to
trust or quote an industry consortium, or even a vendor's test-results
page, than "Wikipedia".

As long as we're throwing credentials around, I might as well mention: as
a former scientist and statistician, and current engineer, I fully
understand what empirical research is. It INCLUDES math. It is the
actual testing and the statistics of that testing. FWIW: I was trained
in this and FMEA (Failure Modes Effects Analysis) by the gentleman who
ran the Reliability and Maintainability program for Boeing's Saturn and
Apollo space programs, as well as their VERTOL and fixed wing programs.

I can see where my second point could have easily been misinterpreted.
Apologies to anyone led astray. What I meant was that the posts made by
"Bob944" seemed to me to be supported by cited facts, and denoted
personal experiences. He's not pointing to something he previously
authored as proof that information is fact. I've only seen him reference
previous posts for the purposes of levelset. To be fair, I haven't read
any of your blog postings, only your posts in this forum. More on that
below. And yes; an "Industry Pundit, Author, SME", or whomever, quoting
"Wikipedia" as a source does tend to dilute credibility, in my mind.
It's not a personal attack, just my personal position on the issue.

The part below has me confused, where you say "No, because I never said
those words or anything like them in my article." Since I never
mentioned anything about any articles... All my comments are in regard
to your posts on this forum, in which you did say that:
[quote]Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
[/quote]
This was your text, no?

Obviously there's nothing wrong with admitting you're wrong. What I was
pointing out was that it appears duplicitous to make the comment above
and then state you're probably going to post a retraction in your blog
based on one user's experience. I'm referring to the 10 GbE thread, where
one user reported stellar throughput, which contradicted a contrived
theoretical maximum and several reports of ho-hum throughput.
" 7500 MB/s! That's the most impressive numbers I've ever seen by FAR.
I may have to take back my "10 GbE is a Lie!" blog post, and I'd be
happy to do so."
This was your text, no?
So one could easily conclude that a position was taken (and published)
on this topic without sufficient testing or research (the related
SunSolve and other articles were already out there before these posts
were made).

You said: "Remember also that these posts are often done on my own time
late at night, etc. I never claimed to be perfect."
True, but you do cite that you are an author of books on the subject,
author of a blog on the subject, and work for one of the largest
industry resources. Indeed, the "VP Data Protection". You can see how
maybe a newbie might take a post as gospel, given the barrage of
credentials? Would they not be disappointed to learn they need to check
the timestamp of a post before lending any credence to its contents?
;-)

You said: "I don't think you'll find that to be a problem. I'm an
in-the-trenches guy, who has sat in front of many a tape drive, tape
library, and backup GUI in my 14 years in this space. I actually cut my
teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.)"

I'm not sure what you meant to imply by all this. If tenure with backup
is an issue, then I would suggest you really don't have all that much
time "in this space", relative to my experience anyway. I had been
working with various forms of backup for that long before MBNA even had
a Data Center in DE. Why would it be necessary to point out that you
were in the same geographic locale, or used the services of my employer?
I've never made mention of my employer, or even implied that any of my
statements represented any opinion or position of theirs. I find this
statement, well, bizarre...

Maybe I will attend the class after all. I'm beginning to think I'll be
entertained.

End transmission.

Regards,
Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS

-----Original Message-----
From: Curtis Preston [mailto]
Sent: Thursday, October 18, 2007 4:41 PM
To: Eagle, Kent; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: RE: Tapeless backup environments

Glad to have another person in the party. What's your birthday? ;)

[quote]Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research?
[/quote]
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP in the same way I'm using it, specifically in
regard to hash collisions; the concept is not new to deduplication.
It has been applied to cryptographic uses of hashing for years.
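The square-root behavior the Birthday Paradox predicts is easy to see at toy scale. This little simulation (my illustration; the 16-bit "hash" space is deliberately tiny) finds first collisions after roughly sqrt(N) draws, not N:

```python
import random

# Birthday-paradox demo: with a 2^16-value "hash" space, the first
# collision shows up after roughly sqrt(pi/2 * 2^16) ~= 320 draws on
# average -- vastly sooner than the naive intuition of 65,536 suggests.
random.seed(1)
SPACE = 2**16

def draws_until_collision() -> int:
    seen = set()
    while True:
        h = random.randrange(SPACE)
        if h in seen:           # a value repeated: that's the collision
            return len(seen) + 1
        seen.add(h)

trials = [draws_until_collision() for _ in range(200)]
print(f"average draws to first collision: {sum(trials) / len(trials):.0f}")
```

The same square-root scaling is why collision odds for a 160-bit hash only become interesting around 2^80 fingerprints, not 2^160.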

I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So...

As to whether or not what I'm doing is empirical scientific research:
it's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of hash-induced data corruption, but that doesn't
address the question. IMHO, it's the equivalent of saying a meteor has
never hit my house, so meteors must never hit houses. The discussion is
about the statistical probability of a meteor hitting your house, and
you have to do that with math, not empirical scientific research.

[quote]I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
[/quote]
And you're saying that my half-a-dozen or so blog postings on the
subject, and none of my responses in this thread don't make you think?
I was fine until I quoted Wikipedia, is that it? ;)

[quote]Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that.
[/quote]
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.

[quote]Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
[/quote]
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear,
uncertainty, and doubt (FUD). I felt there was a little Henny-Penny
business going on.

----------------------------------
CONFIDENTIALITY NOTICE: This e-mail may contain privileged or confidential information and is for the sole use of the intended recipient(s). If you are not the intended recipient, any disclosure, copying, distribution, or use of the contents of this information is prohibited and may be unlawful. If you have received this electronic transmission in error, please reply immediately to the sender that you have received the message in error, and delete it. Thank you.
----------------------------------

Tapeless backup environments?
October 19, 2007 10:10AM
Jeff,

The mix was deliberate. Please re-read my post & it should become
evident as to why. There was no implication that someone stated they had
experienced data loss.

In fact, nothing in my post is really speaking to dedupe or data loss.
It's about the posts themselves...

- Kent
