Eagle, Kent
Guest
|
 Tapeless backup environments
Sorry, but I just can't keep from jumping in at this point.
Not taking either side, but...
Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research? I could place a posting on there that
either concurs with, or totally rejects the position of that posting;
and someone else would come along and claim it as gospel.
I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
Saying
" This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead."
Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that. Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.
I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.
I hope I'm wrong. I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing them up
with empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...
BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them.
Take care,
Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS
---------------------------------------------------------------------------
Message: 1
Date: Thu, 18 Oct 2007 04:06:52 -0400
From: "Curtis Preston" <cpreston < at > glasshouse.com>
Subject: Re: [Veritas-bu] Tapeless backup environments?
To: <bob944 < at > attglobal.net>, <veritas-bu < at > mailman.eng.auburn.edu>
Message-ID:
<4FBA0941CF3D9347889AA5FF23A809BEF3C673 < at > ghmail02.glasshousetech.com>
Content-Type: text/plain; charset="US-ASCII"
At the risk of chasing windmills, I will continue to try to have this
discussion, although it appears to me that you've already made up your
mind. I again say that no one is saying that hash collisions can't
happen. We are simply saying that the odds of them happening are
astronomically lower than the odds of an undetected/uncorrected bit
error on tape. And I believe that the math that I use in my blog post
illustrates this.
I said:
As promised, I looked into applying the Birthday Paradox
logic to de-duplication. I blogged about my results here:
http://www.backupcentral.com/content/view/145/47/
Long and short of it: If you've got less than 95 Exabytes of
data, I think you'll be OK.
Bob944 said:
One of us still doesn't understand this.
Got that right.
Your blog raises a red herring in misunderstanding or misrepresenting
the applicability of Birthday Paradox.
I completely disagree. If you read the Birthday Paradox entry on
Wikipedia, it specifically explains how the Birthday Paradox applies in
this case. All the BP says is that the odds of a "clash" (i.e. a
birthday match or a hash collision) in an environment increase with the
number of elements in the set, and that the odds increase faster than
you think:
* The odds of two people in the same room having the same birthday
increase with the number of people in the room. If there are only
two people in the room, those odds will be roughly 1 in 365, or .27%
(leap year aside). If there are 23 people in the room,
the odds are 50%.
* The odds of two DIFFERENT blocks having the same hash (i.e. a
hash collision) increase with the number of blocks in the data set.
If there are two blocks in the set, the odds are 1 in 2^160.
If there are fewer than 12.7 quintillion blocks in the data set,
the odds don't show up in a percentage calculated out to 50 decimal
places. As soon as you have more than 12.7 quintillion blocks, the
odds at least register in 50 decimal places, but are still really
small. And to get 12.7 quintillion blocks, you need to store at
least 95 Exabytes of data.
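For anyone who wants to check this kind of arithmetic themselves, here is a minimal Python sketch of the standard birthday-bound calculation. It is not the calculation from the blog post, and its results need not match the figures quoted above. The function names are made up for illustration, and the hash case uses the usual approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)) for n blocks, which assumes hash values are uniformly distributed.

```python
import math

def birthday_collision_probability(n_items: int, n_slots: int) -> float:
    """Exact chance that at least two of n_items share a slot (fine for small n)."""
    if n_items > n_slots:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_items):
        p_all_distinct *= (n_slots - i) / n_slots
    return 1.0 - p_all_distinct

def approx_collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation for n_items uniformly distributed hashes."""
    exponent = -(n_items * (n_items - 1)) / (2 * 2**hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

print(birthday_collision_probability(2, 365))    # two people in the room
print(birthday_collision_probability(23, 365))   # 23 people in the room
print(approx_collision_probability(2, 160))      # two blocks, 160-bit hash
print(approx_collision_probability(12_700_000_000_000_000_000, 160))
```

The exact loop is only practical for small rooms; for block counts in the quintillions the approximation is the only workable form.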
The number of possible values in BP is 366; there is no data reduction
in it, no key values. An algorithm which reduced the 366 possibilities
the same way that hashing 8KB down to 160 bits does would yield
infinitesimal keys smaller than one bit, an absurdity.
Yeah, IMHO, we are talking apples and oranges. Let me try to put the
hash collision into the birthday world. Let's say that we want a wall
of photos of everyone who came to our party. When you show up, we check
your birthday, and we check it off on a list. (We'll call your BD the
"hash.") If we've never seen your birthday before, we take your photo
and put it on the wall. If your birthday has already been checked off
on the list, though, we don't take your photo. We assume that since you
have the same birthday, you must be the same person. So you don't get
your photo taken. We just write on the photo of the first guy whose
picture we took that he came to the party twice (he must have left and
come back). Now, if he is indeed the same guy, that's not a hash/BD
collision. If he is indeed a different person, and we said he was the
same person simply because he had the same birthday, then that would be
a hash/BD collision.
And THIS would be an absurdity: thinking you can represent n number of
people at a party with an array of photos selected solely on their
birthday (a key space of only 366). But it's not out of the realm of
possibility to say that we could represent n number of bits in our data
center with an array of bits selected solely on a 160-bit hash (a
key space of 2^160). Cryptographers have been doing it for years. We're
just adding another application of it.
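The party analogy can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the guest list, the `dedupe` helper, and the choice of SHA-1 as the 160-bit hash are all hypothetical, and no dedupe product is claimed to work exactly this way.

```python
import hashlib

def dedupe(items, key_func):
    """Keep the first item seen per key; count different items sharing a key."""
    store = {}
    false_matches = 0
    for item in items:
        key = key_func(item)
        if key not in store:
            store[key] = item            # first time we see this key: take the photo
        elif store[key] != item:
            false_matches += 1           # different guest, same key: a collision
    return store, false_matches

# Hypothetical guests as (name, birthday); Ann shows up twice.
guests = [("Ann", "02-07"), ("Bob", "06-30"), ("Cy", "02-07"), ("Ann", "02-07")]

def by_birthday(guest):
    return guest[1]                      # key space of only 366

def by_sha1(guest):
    return hashlib.sha1(repr(guest).encode()).hexdigest()   # key space of 2^160

wall_bd, clashes_bd = dedupe(guests, by_birthday)
wall_sha, clashes_sha = dedupe(guests, by_sha1)
print(len(wall_bd), clashes_bd)    # Cy is mistaken for Ann
print(len(wall_sha), clashes_sha)  # every distinct guest gets a photo
```

The same loop runs in both cases; only the size of the key space changes.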
An absurdity which should show that even if it stopped at eight bits,
one short of the bits required to hold 1-366, there would still be fatal
hash collisions--say, Feb 7, Feb 11 and Jun 30 all represented by the
same code, in which case you can't figure out if people in the room have
the same birthday.
Again, I hope you read what I wrote above. In the analogy, we're not
de-duping birthdays; we're de-duping people BASED on their birthdays.
(Which would be a dumb idea because the key space is too small: 366)
What you must grasp is that it is *impossible* to
represent/re-create/look up the 2^65536 possible values of an 8KB block
in fewer than 65536 bits--unless you concede that each
checksum/hash/fingerprint will represent many different values of the
original data--any more than you can represent three bits of data with
two.
I concede, I concede! The only question I'm trying to answer is: what
are the odds that two different blocks of data will have the same hash
(i.e. a hash collision) in a given data center?
Hashing is a technique for saving time in certain circumstances. It is
valueless in re-creating (and a lookup is a re-creation) original data
when those data can have unlimited arbitrary values. All the blog
hand-waving about decimal places, Zettabytes and the specious comparison
to undetected write errors will not change that.
This is the part where I believe you've made your mind up already.
You're saying that no matter what the entire world is saying -- no
matter what the numbers are, you're not going to accept hash-based
de-dupe. Fine! That's why there are vendors that don't use hashes to
de-dupe data. Buy one of those instead. Some use a weak hint +
bit-level verify, some use delta-differencing technologies, which are
bit-to-bit comparisons as well.
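A rough sketch of the "weak hint + bit-level verify" idea, for illustration only. The class name, the `hint_func` parameter, and the use of CRC-32 as the hint are my assumptions, not any vendor's actual design. The point is that a hint match is never trusted by itself: the bytes are compared before a block is treated as a duplicate.

```python
import zlib

class VerifyingStore:
    """Dedupe store that treats the hash only as a hint and verifies the bytes."""

    def __init__(self, hint_func=zlib.crc32):
        self._hint_func = hint_func
        self._buckets = {}               # hint -> list of distinct blocks

    def put(self, block: bytes):
        hint = self._hint_func(block)
        bucket = self._buckets.setdefault(hint, [])
        for index, stored in enumerate(bucket):
            if stored == block:          # full byte-for-byte comparison
                return (hint, index)     # true duplicate: reuse the stored copy
        bucket.append(block)             # same hint, different bytes: keep both
        return (hint, len(bucket) - 1)

    def get(self, ref):
        hint, index = ref
        return self._buckets[hint][index]

    def unique_blocks(self) -> int:
        return sum(len(bucket) for bucket in self._buckets.values())

# Force collisions with a deliberately terrible hint: the block length.
store = VerifyingStore(hint_func=len)
ref_a = store.put(b"aaaa")
ref_b = store.put(b"bbbb")               # same hint as b"aaaa", different data
ref_a2 = store.put(b"aaaa")              # genuine duplicate
print(store.unique_blocks(), store.get(ref_b))
```

The cost of this design is the extra read and compare on every hint match, which is the trade-off against a hash-only product.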
Visit our website at www.wilmingtontrust.com
Investment products are not insured by the FDIC or any other governmental agency, are not deposits of or other obligations of or guaranteed by Wilmington Trust or any other bank or entity, and are subject to risks, including a possible loss of the principal amount invested. This e-mail and any files transmitted with it may contain confidential and/or proprietary information. It is intended solely for the use of the individual or entity who is the intended recipient. Unauthorized use of this information is prohibited. If you have received this in error, please contact the sender by replying to this message and delete this material from any system it may be on.
_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
|
| Thu Oct 18, 2007 11:01 am |
|
 |
Dustin Damour
Guest
|
 Tapeless backup environments
I would say no. Wikipedia is like an encyclopedia and is a good place
to start, but it isn't a peer-reviewed publication, so in research it
would not be considered a valid source.
Dustin D'Amour
Wireless Switching
Plateau Wireless
-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Eagle,
Kent
Sent: Thursday, October 18, 2007 12:59 PM
To: cpreston < at > glasshouse.com; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: Re: [Veritas-bu] Tapeless backup environments
<snip>
|
| Thu Oct 18, 2007 11:53 am |
|
 |
cpjlboss
Site Admin
Joined: 04 May 2007
Posts: 802
|
 Tapeless backup environments
Glad to have another person in the party. What's your birthday?
Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research?
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP in the same way I'm using it, specifically in
regards to hash collisions, as the concept is not new to deduplication.
It has applied to cryptographic uses of hashing for years.
I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So..
As to whether or not what I'm doing is empirical scientific research,
it's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of data corruption, but that doesn't address
the question. IMHO, it's the equivalent of saying a meteor has never
hit my house so meteors must never hit houses. The discussion is about
the statistical probabilities of a meteor hitting your house, and you
have to do that with math, not empirical scientific research.
I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
And you're saying that none of my half-a-dozen or so blog postings on
the subject, and none of my responses in this thread, made you think?
I was fine until I quoted Wikipedia, is that it?
Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that.
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.
Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear,
uncertainty, and doubt (FUD). I felt there was a little Henny-Penny
business going on.
If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.
You won't get any argument from me. I think you'll find almost that
exact sentence in the first few paragraphs of any of my books. Having
said that, we all use technologies as part of our backup system that
have a failure rate percentage (like tape). And to the best of my
understanding, the odds of a single hash collision in 95 Exabytes of
data are significantly lower than the odds of having corrupted data on an
LTO tape and not even knowing it, based on the odds they publish. Even
if you make two copies, the copy could be corrupted, and you could have
a failed restore. Yet we're all ok with that, but we're freaking out
about hash collisions, which statistically speaking have a MUCH lower
probability of happening.
I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.
I sure as heck hope so, but I don't think it addresses this issue. So
you test it and you don't get any hash collisions. What does that prove?
It proves that a meteor has never hit your house.
What I recommend (especially if you're using a hash-only de-dupe system)
is constant verification of the system. Use a product like NBU that
can do CRC checks against the bytes it's copying or reading, and either
copy all de-duped data to tape or run an NBU verify on every backup. If
you have a hash collision, your copy or verify will fail, and at least
you'll know when it happens.
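The general idea of checksum verification can be sketched like this. It is not the NBU API or how NBU implements its verify; the function names and the sample data are hypothetical. A CRC is recorded over the original bytes at backup time, independent of the dedupe hash, and checked again after the data is read back.

```python
import zlib

def record_checksum(data: bytes) -> int:
    """Compute a CRC-32 over the original bytes at backup time."""
    return zlib.crc32(data)

def verify_restore(restored: bytes, expected_crc: int) -> bool:
    """Return True only if the restored bytes match the recorded CRC."""
    return zlib.crc32(restored) == expected_crc

original = b"quarterly-report contents"
crc_at_backup = record_checksum(original)

# A good restore returns the same bytes.
good_restore = bytes(original)

# Simulate the dedupe store handing back the wrong block: one byte differs.
bad_restore = b"quarterly-report contentz"

print(verify_restore(good_restore, crc_at_backup))  # True
print(verify_restore(bad_restore, crc_at_backup))   # False
```

A CRC-32 can itself miss some corruptions, so this lowers the odds of a silent failure rather than eliminating them.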
I hope I'm wrong.
About what? That I'm an idiot?  I think judging me solely on this
long, protracted, difficult to follow discussion (with over 70 posts) is
probably unfair. Remember also that these posts are often done on my
own time late at night, etc. I never claimed to be perfect.
I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing them up
with empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...
I don't think you'll find that to be a problem. I'm an in-the-trenches
guy, who has sat in front of many a tape drive, tape library, and backup
GUI in my 14 years in this space. I actually cut my teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.) Don't skip out on the school just because I
quoted Wikipedia once.
BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them.
You are right. I stand corrected again. Even Wikipedia backs you up:
http://en.wikipedia.org/wiki/Don_Quixote
(Sorry, just couldn't resist.)
|
| Thu Oct 18, 2007 12:43 pm |
|
 |
Austin Murphy
Guest
|
 Tapeless backup environments?
On 10/18/07, Iverson, Jerald <Jerald.Iverson < at > aiminvestments.com> wrote:
...
that is why i have turned off all hardware and software compression on
my tape drives. imagine trying to store more than 400GB of data onto a
single lto3 tape! they "say" that you can store up to and even more
than 800GB, but i don't believe a word of it. there is no way 1 nibble
of data can represent 1 byte! once i have the time to study lzr
compression and understand it,
<snip>
Hi jerald,
Data compression exploits the non-randomness of "normal" data.
Compression algorithms have variable compression rates because their
performance is dependent on the data being compressed. Truly random
data does NOT compress at all. "Typical" data is not truly random.
Once data has been compressed, it is close to random, so compression
cannot usefully be applied again. Many encryption algorithms also result in
near-random data that does not compress.
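This is easy to try at home. Here is a small Python sketch using zlib (not the compression scheme the tape drives use); the word list standing in for "typical" data is made up, and the exact ratios will vary from run to run because the random bytes differ each time.

```python
import os
import random
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)

# Hypothetical "typical" data: words drawn from a small vocabulary.
rng = random.Random(0)
words = [b"backup", b"restore", b"tape", b"disk",
         b"verify", b"policy", b"client", b"media"]
typical = b" ".join(rng.choice(words) for _ in range(50_000))

# Random bytes of the same length stand in for "truly random" data.
random_data = os.urandom(len(typical))

# Output of a first compression pass, to be compressed a second time.
compressed_once = zlib.compress(typical, 9)

print(f"typical data:       {ratio(typical):.2f}")
print(f"random data:        {ratio(random_data):.2f}")
print(f"already compressed: {ratio(compressed_once):.2f}")
```

The first ratio should come out well below 1, while the other two should sit at or near 1.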
A formal definition of a data set's "randomness" is its Kolmogorov
complexity. http://en.wikipedia.org/wiki/Kolmogorov_complexity
Compression is just an alternate means of data representation.
Several others are at work on your LTO tapes too!
http://en.wikipedia.org/wiki/Forward_error_correction
http://en.wikipedia.org/wiki/Run_Length_Limited
http://en.wikipedia.org/wiki/PRML
Don't get too paranoid...these are good things.
Austin
|
| Thu Oct 18, 2007 12:53 pm |
|
 |
A Darren Dunham
Guest
|
 Tapeless backup environments?
On Thu, Oct 18, 2007 at 01:44:03PM -0400, Curtis Preston wrote:
So you're OK with hash-based de-dupe, which everyone acknowledges has a
chance (although quite small) that you could have a hash-collision and
potentially corrupt a block of data somewhere, sometime, when you least
expect it...
But you're NOT ok with the long-running industry standard of loss-less
compression algorithms? [...]
I think the smiley on the end indicated that it was a humorous comment.
At least that's how I took it.
--
Darren Dunham ddunham < at > taos.com
Senior Technical Consultant TAOS http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
|
| Thu Oct 18, 2007 1:10 pm |
|
 |
bob944
Guest
|
 Tapeless backup environments?
discussion, although it appears to me that you've already made up your
mind.
I'd prefer to say I have little interest in a technology which, by
design, will retrieve a completely different chunk of data than what was
written, with no notice whatsoever. BTW, before you bring out tape
errors again, I posted long ago why this argument was not comparable.
No point in beating the poor Birthday Paradox to death; you've
completely missed the point there. It doesn't matter that the same
values come up more often than our intuition suggests--which is the
_only_ lesson of BP. What matters is that if you track the values with a
shorthand that can't tell that Feb 7 and Dec 28 are different values,
because you put them in the same hash bucket and therefore think that
everything in that bucket is Feb 7, you retrieve the wrong data.
Here's all a thinking person responsible for data needs to consider:
An 8KB chunk of data can have 2^65536 possible values. Representing
that 8KB of data in 160 bits means that each of the 2^160 possible
checksum/hash/fingerprint values MUST represent, on average, 2^65376
*different* 8KB chunks of data.
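The counting here is simple enough to check in Python. This sketch only verifies the pigeonhole arithmetic and says nothing about how likely it is that two blocks actually stored share a hash. The helper name is made up, and the toy "hash" (value mod 4) is just the three-bits-into-two example in miniature.

```python
def average_preimage_exponent(block_bytes: int, hash_bits: int) -> int:
    """Return k such that each hash value covers 2**k possible blocks on average."""
    block_bits = block_bytes * 8
    return block_bits - hash_bits

# 8 KiB block, 160-bit hash: 2**65536 possible blocks over 2**160 hash values.
exponent = average_preimage_exponent(8 * 1024, 160)
print(exponent)  # 65376

# Toy version: every 3-bit value mapped to a 2-bit "hash" (value mod 4).
buckets = {}
for value in range(2 ** 3):
    buckets.setdefault(value % 4, []).append(value)
print(buckets)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```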
If that doesn't concern you, well, it's safe to say I won't be hiring
you as my backup admin. Or as my technology consultant, since you
should know from earlier postings that spoofing your favorite 160-bit
hashing algorithm with reasonable-looking fake data is now old hat. The
exploit itself should concern us, not to mention that it also
illustrates that similar data which yields the same hash is not the
once-in-the-lifetime-of-the-universe oddity you portray.
Everything mentioned here was covered in the original postings a month
ago. Unless there's something new, I'm done with this.
|
| Thu Oct 18, 2007 9:19 pm |
|
 |
cpjlboss
Site Admin
Joined: 04 May 2007
Posts: 802
|
 Tapeless backup environments?
I wish we had a white board and could sit in front of each other to
finish the discussion, but it's obvious that it's not going to be
resolved here.
You believe I'm missing your point, and I believe you're missing my
point.
what matters is if you use a shorthand to track the
values which can't tell that Feb 7 and Dec 28 are different values
because you put them in the same hash bucket and therefore think that
everything in that bucket is Feb 7, you retrieve the wrong data.
Not sure how many times I (or others) have to keep saying it: the dates
are not the data being deduped. The dates are the hashes. The
data is the person.
An 8KB chunk of data can have 2^65536 possible values. Representing
that 8KB of data in 160 bits means that each of the 2^160 possible
checksum/hash/fingerprint values MUST represent, on average, 2^65376
*different* 8KB chunks of data.
This, again, only makes sense if you are using the hash to
store/reconstruct the data, not to ID the data. The fingerprint (like a
real fingerprint) is not used to reconstruct a block, it's only used to
give it a unique ID that distinguishes it from other blocks. You still
have to store the block with the key. And with 2^160 different
fingerprints, that means we can calculate unique fingerprints for 2^160
blocks. That means we can calculate a unique fingerprint for
1,461,501,637,330,900,000,000,000,000,000,000,000,000,000,000,000
blocks, which is
11,972,621,413,015,000,000,000,000,000,000,000,000,000,000,000,000,000
bytes of data. That's a lot of stinking data.
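Python's arbitrary-precision integers make it easy to reproduce numbers of this size. A quick sketch, assuming 8 KiB (8192-byte) blocks to stay consistent with the 2^65536 figure used earlier in the thread; the variable names are mine.

```python
HASH_BITS = 160
BLOCK_BYTES = 8 * 1024          # assumption: 8 KiB blocks

fingerprints = 2 ** HASH_BITS   # number of distinct 160-bit values
total_bytes = fingerprints * BLOCK_BYTES

print(f"{fingerprints:,}")      # exact count of possible fingerprints
print(f"{total_bytes:.4e}")     # bytes if every fingerprint mapped to one block
```

Note that this is the size of the key space, not the point at which collisions become likely; the birthday bound puts that far lower.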
If that doesn't concern you, well, it's safe to say I won't be hiring
you as my backup admin. Or as my technology consultant, since you
I really don't think you need to make it personal and suggest that I
don't know what I'm doing simply because we have been unable to
communicate successfully with each other in this medium. It can be a
very difficult medium in which to discuss such a difficult subject. I
think things would be very different in person with a whiteboard.
should know from earlier postings that spoofing your favorite 160-bit
hashing algorithm with reasonable-looking fake data is now old hat.
The exploit itself should concern us, not to mention that it also
illustrates that similar data which yields the same hash is not the
once-in-the-lifetime-of-the-universe oddity you portray.
They worked really hard to figure out how to take one block that
calculates to a particular hash and create another block that calculates
to the same hash. It's used to fake a signature. I get it. I just
don't see how or why somebody would use this to do I don't know what
with my backups. And if we were having this discussion over a few
drinks we could try to come up with some ideas. Right now, I'm as tired
as you are of this discussion.
> Everything mentioned here was covered in the original postings a month
> ago. Unless there's something new, I'm done with this.
You're right. IN THIS MEDIUM, you don't understand me, and I don't
understand you. Let's agree to disagree and move on.
For anyone who's still reading, I just want to say this:
I was only trying to bring some sanity to what I felt was an undue
amount of FUD against the hash-only products. I'm not necessarily trying
to talk anyone into them. I just want you to understand what I THINK
the real odds are. If after understanding how it works and what the
odds are, you're still uncomfortable, don't dismiss dedupe. Just
consider a non-hash-based de-dupe product.
Curtis out.
_______________________________________________
Veritas-bu maillist - Veritas-bu < at > mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
|
| Thu Oct 18, 2007 11:40 pm |
|
 |
WEAVER, Simon (external)
Guest
|
 Tapeless backup environments?
How about setting up a white board, aka NetMeeting!
I think this thread has gone on for some time now, and yet there still
appear to be 2 different opinions.
Not going to please everyone! Personally, I would not be worried
about it and will just step out of the debate and move on.
Right or wrong, I really don't care that much.
But anyhow, something like DIGG Whiteboard might help (I think it's still free)
if those wishing to continue the debate want to continue offline.
Bye!
Regards
Simon Weaver
3rd Line Technical Support
Windows Domain Administrator
EADS Astrium Limited, B23AA IM (DCS)
Anchorage Road, Portsmouth, PO3 5PU
Email: Simon.Weaver < at > Astrium.eads.net
|
| Fri Oct 19, 2007 12:03 am |
|
 |
Jeff Lightner
Guest
|
 Tapeless backup environments?
"since you should know from earlier postings that spoofing your favorite
160-bit hashing algorithm with reasonable-looking fake data is now old
hat. The exploit itself should concern us"
This I don't get. In addition to lamenting about possibility over
probability, is bob944 now suggesting that dedupe vendors are evil
hackers who INTEND to destroy our data?
|
| Fri Oct 19, 2007 6:54 am |
|
 |
Eagle, Kent
Guest
|
 Tapeless backup environments
O.k., at the risk of seeming like "I wrote more than you, therefore I
must be right"...
2nd. (and last) post on this -
My first point was that you quoted a "Wikipedia" article as a source.
For me, it really had nothing to do with the subject matter. They have a
disclaimer as to the validity of anything on there, and for good reason:
Anyone can post anything on there, about anything, containing anything.
It might be right, it might be wrong. I would be far more inclined to
trust, or quote, an industry consortium, or even a vendor's test results
page than "Wikipedia".
As long as we're throwing credentials around, I might as well mention: As
a former scientist, and statistician, and current engineer, I fully
understand what empirical research is. It INCLUDES math. It is the
actual testing and the statistics of that testing. FWIW: I was trained
in this and FMEA (Failure Modes Effects Analysis) by the gentleman who
ran the Reliability and Maintainability program for Boeing's Saturn and
Apollo space programs, as well as their VERTOL and fixed wing programs.
I can see where my second point could have easily been misinterpreted.
Apologies to anyone led astray. What I meant was that the posts made by
"Bob944" seemed to me to be supported by cited facts, and denoted
personal experiences. He's not pointing to something he previously
authored as proof that information is fact. I've only seen him reference
previous posts for the purposes of levelset. To be fair, I haven't read
any of your blog postings, only your posts in this forum. More on that
below. And yes; an "Industry Pundit, Author, SME", or whomever, quoting
"Wikipedia" as a source does tend to dilute credibility, in my mind.
It's not a personal attack, just my personal position on the issue.
The part below has me confused where you say " No, because I never said
those words or anything like them in my article." Since I never
mentioned anything about any articles... All my comments are in regard
to your posts on this forum, in which you did say that.
> Wouldn't THAT be saying that up until that point, YOU
> WERE SAYING "that no matter what the entire world is saying -- no
> matter what the numbers are, you're not going to accept..."
This was your text, no?
Obviously there's nothing wrong with admitting you're wrong. What I was
pointing out was that it appears duplicitous to make the comment above
and then state you're probably going to post a retraction in your blog
based on one user's experience. I'm referring to the 10 GbE thread where one
user reported stellar throughput, which contradicted a contrived
theoretical maximum, and several reports of ho-hum throughput.
" 7500 MB/s! That's the most impressive numbers I've ever seen by FAR.
I may have to take back my "10 GbE is a Lie!" blog post, and I'd be
happy to do so."
This was your text, no?
So one could easily conclude that a position was taken (and published)
on this topic without sufficient testing or research (the related
SunSolve and other articles were already out there before these posts
were made).
You said: "Remember also that these posts are often done on my own time
late at night, etc. I never claimed to be perfect."
True, but you do cite that you are an author of books on the subject,
author of a blog on the subject, and work for one of the largest
industry resources. Indeed the " VP Data Protection". You can see how
maybe a newbie might assume a post as gospel with the barrage of
credentials? Would they not be disappointed to learn they need to check
the timestamp of a post before lending any credence to its contents?
You said: "I don't think you'll find that to be a problem. I'm an
in-the-trenches guy, who has sat in front of many a tape drive, tape
library, and backup GUI in my 14 years in this space. I actually cut my
teeth right down the road from you as the backup guy at MBNA. (I lived
in Newark, DE, and you were my bank.)"
I'm not sure what you meant to imply by all this. If tenure with backup
is an issue, then I would suggest you really don't have all that much
time "in this space", relative to my experience anyway. I had been
working with various forms of backup for that long before MBNA even had
a Data Center in DE. Why would it be necessary to point out that you
were in the same geographic locale, or used the services of my employer?
I've never made mention of my employer, or even implied that any of my
statements represented any opinion or position of theirs? I find this
statement, well, bizarre...
Maybe I will attend the class after all. I'm beginning to think I'll be
entertained.
End transmission.
Regards,
Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS
-----Original Message-----
From: Curtis Preston [mailto:cpreston < at > glasshouse.com]
Sent: Thursday, October 18, 2007 4:41 PM
To: Eagle, Kent; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: RE: Tapeless backup environments
Glad to have another person in the party. What's your birthday?
> Are you seriously suggesting that a quote from "Wikipedia" constitutes
> empirical scientific research?
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP in the same way I'm using it, specifically in
regards to hash collisions, as the concept is not new to deduplication.
It has applied to cryptographic uses of hashing for years.
I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So..
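For readers who have not met the Birthday Paradox, here is a minimal sketch of the classic calculation, not taken from any post in this thread. The function name and the 365-day year are assumptions for illustration. It computes the chance that at least two people in a group share a birthday; swapping "days" for "possible hash values" and "people" for "stored blocks" gives the hash-collision form of the same question.

```python
def shared_birthday_probability(people: int, days: int = 365) -> float:
    """Exact probability that at least two of `people` share a birthday,
    assuming birthdays are independent and uniform over `days`."""
    p_all_distinct = 1.0
    for k in range(people):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


for group in (10, 22, 23, 50):
    print(group, round(shared_birthday_probability(group), 4))
```

With 23 people the probability passes 50%, which is why collisions show up far sooner than the raw count of possible values suggests.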
As to whether or not what I'm doing is empirical scientific research,
it's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of data corruption, but that doesn't address
the question. IMHO, it's the equivalent of saying a meteor has never
hit my house so meteors must never hit houses. The discussion is about
the statistical probabilities of a meteor hitting your house, and you
have to do that with math, not empirical scientific research.
> I would be the first to admit that "bob944" has made more than a few
> posts that have "pushed my chair back a couple inches", but at least
> they made me THINK!
And you're saying that my half-a-dozen or so blog postings on the
subject, and my responses in this thread, don't make you think?
I was fine until I quoted Wikipedia, is that it?
> Is pretty gutsy since you have another post within the past few days
> stating you're ready to RETRACT what you already blogged on this, or
> blogged on that.
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.
> Wouldn't THAT be saying that up until that point, YOU
> WERE SAYING "that no matter what the entire world is saying -- no
> matter what the numbers are, you're not going to accept..."
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear,
uncertainty, and doubt (FUD). I felt there was a little Henny-Penny
business going on.
> If I am asked to restore something for the CEO, and can't, it won't
> matter a hill of beans what all the theory was and what the odds were. I
> either can, or I can't. I'll be accountable for that result, and why I
> got it. As someone so accurately posted recently: We're in the recovery
> business, not the restore business.
You won't get any argument from me. I think you'll find almost that
exact sentence in the first few paragraphs of any of my books. Having
said that, we all use technologies as part of our backup system that
have a failure rate percentage (like tape). And to the best of my
understanding, the odds of a single hash collision in 95 Exabytes of
data is significantly lower than the odds of having corrupted data on an
LTO tape and not even knowing it, based on the odds they publish. Even
if you make two copies, the copy could be corrupted, and you could have
a failed restore. Yet we're all ok with that, but we're freaking out
about hash collisions, which statistically speaking have a MUCH lower
probability of happening.
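The kind of estimate being described can be sketched with the standard birthday-bound approximation. This is an illustration under stated assumptions, not a reconstruction of the exact calculation behind the figure above: it assumes a 160-bit hash whose outputs are uniformly distributed, 8 KB blocks, "95 Exabytes" read as 95 x 10^18 bytes, and every block being unique. It says nothing about deliberately constructed collisions.

```python
import math

HASH_BITS = 160            # assumption: 160-bit fingerprint
BLOCK_BYTES = 8 * 1024     # assumption: 8 KB blocks
DATA_BYTES = 95 * 10**18   # assumption: decimal exabytes


def collision_probability(n_blocks: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one collision among n_blocks
    uniformly distributed hash values: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_blocks * (n_blocks - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


n_blocks = DATA_BYTES // BLOCK_BYTES
p = collision_probability(n_blocks, HASH_BITS)
print(f"{n_blocks:,} blocks -> collision probability about {p:.1e}")
```

Under these assumptions the estimate is on the order of 10^-17 for the whole data set. Changing the block size or the amount of data moves the result, so the inputs matter more than the final digit.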
> I would think that almost everyone on this forum does some kind of pilot
> before rolling something out into production.
I sure as heck hope so, but I don't think it addresses this issue. So
you test it and you don't get any hash collisions. What does that prove?
It proves that a meteor has never hit your house.
What I recommend (especially if you're using a hash-only de-dupe system)
is a constant verification of the system. Use a product like NBU that
can do CRC checks against the bytes it's copying or reading, and either
copy all de-duped data to tape or run a NBU verify on every backup. If
you have a hash collision, your copy or verify will fail, and you'll at
least know when it happens.
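The idea behind that recommendation can be sketched generically: compare a checksum of what is read back against one taken independently from the source data. The code below is a toy illustration only. `ToyDedupeStore`, `stream_crc`, and `backup_and_verify` are hypothetical names, and nothing here describes how NBU or any dedupe product works internally. The one-byte fingerprint exists purely to force collisions so the check has something to catch.

```python
import hashlib
import zlib


class ToyDedupeStore:
    """Hypothetical hash-only dedupe store: a block whose fingerprint is
    already known is assumed to be a duplicate and is not stored again."""

    def __init__(self, fingerprint_bytes: int = 20):
        self.fingerprint_bytes = fingerprint_bytes  # 20 bytes = full SHA-1
        self.blocks = {}                            # fingerprint -> block

    def _fingerprint(self, block: bytes) -> bytes:
        return hashlib.sha1(block).digest()[: self.fingerprint_bytes]

    def backup(self, blocks):
        """Store the blocks and return the list of fingerprints (recipe)."""
        recipe = []
        for block in blocks:
            fp = self._fingerprint(block)
            self.blocks.setdefault(fp, block)  # first block with this fp wins
            recipe.append(fp)
        return recipe

    def restore(self, recipe):
        return [self.blocks[fp] for fp in recipe]


def stream_crc(blocks) -> int:
    """CRC-32 over a sequence of blocks, independent of the fingerprints."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def backup_and_verify(store, blocks) -> bool:
    recipe = store.backup(blocks)
    expected = stream_crc(blocks)               # taken from the source data
    actual = stream_crc(store.restore(recipe))  # taken from what comes back
    return actual == expected


blocks = [f"block-{i:04d}".encode() * 64 for i in range(300)]

full = ToyDedupeStore(fingerprint_bytes=20)
print("full SHA-1 verify ok:", backup_and_verify(full, blocks + blocks))

tiny = ToyDedupeStore(fingerprint_bytes=1)  # only 256 possible fingerprints
print("1-byte fingerprint verify ok:", backup_and_verify(tiny, blocks))
```

With 300 distinct blocks and only 256 possible one-byte fingerprints, some blocks must collide, so the restored stream differs from the source and the independent checksum exposes it. The check does not prevent a collision; it only means you find out at backup time rather than at restore time.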
> I hope I'm wrong.
About what? That I'm an idiot? I think judging me solely on this
long, protracted, difficult to follow discussion (with over 70 posts) is
probably unfair. Remember also that these posts are often done on my
own time late at night, etc. I never claimed to be perfect.
> I love to learn. I'm actually signed up for one of
> your classes next week. But, if quoting everyone else's
> posts/blogs/Wikipedia entries, etc. without backing up re-posting them
> with empirical evidence or firsthand testing is your program agenda, I
> will skip the engagement...
I don't think you'll find that to be a problem. I'm an in-the-trenches
guy, who has sat in front of many a tape drive, tape library, and backup
GUI in my 14 years in this space. I actually cut my teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.) Don't skip out on the school just because I
quoted Wikipedia once.
> BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them.
You are right. I stand corrected again. Even Wikipedia backs you up:
http://en.wikipedia.org/wiki/Don_Quixote
(Sorry, just couldn't resist.)
|
| Fri Oct 19, 2007 8:10 am |
|
 |
Jeff Lightner
Guest
|
 Tapeless backup environments
Not an attack - just a question: Did someone in this thread say they
HAD experienced data loss due to deduplication? If so I missed it.
You mixed comments about another thread in here and I *think* you're
saying something about someone's experience with 10GigE rather than
deduplication. Your post could be misread to say someone had in fact
had such a data loss and posted it here.
-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Eagle,
Kent
Sent: Friday, October 19, 2007 12:08 PM
To: Curtis Preston; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: Re: [Veritas-bu] Tapeless backup environments
O.k., at the risk of seeming like "I wrote more than you, therefore I
must be right"...
2nd. (and last) post on this -
My first point was that you quoted a "Wikipedia" article as a source.
For me, it really had nothing to do with the subject matter. They have a
disclaimer as to the validity of anything on there, and for good reason:
Anyone can post anything on there, about anything, containing anything.
It might be right, it might be wrong. I would be far more inclined to
trust, or quote an industry consortium, or even a vendors test results
page than "Wikipedia".
As long as were throwing credentials around, I might as well mention: As
a former scientist, and statistician, and current engineer, I fully
understand what empirical research is. It INCLUDES math. It is the
actual testing and the statistics of that testing. FWIW: I was trained
in this and FMEA (Failure Modes Effects Analysis) by the gentleman who
ran the Reliability and Maintainability program for Boeing's Saturn and
Apollo space programs, as well as their VERTOL and fixed wing programs.
I can see where my second point could have easily been misinterpreted.
Apologies to anyone led astray. What I meant was that the posts made by
"Bob944" seemed to me to be supported by cited facts, and denoted
personal experiences. He's not pointing to something he previously
authored as proof that information is fact. I've only seen him reference
previous posts for the purposes of levelset. To be fair, I haven't read
any of your blog postings, only your posts in this forum. More on that
below. And yes; an "Industry Pundit, Author, SME", or whomever, quoting
"Wikipedia" as a source does tend to dilute credibility, in my mind.
It's not a personal attack, just my personal position on the issue.
The part below has me confused where you say " No, because I never said
those words or anything like them in my article." Since I never
mentioned anything about any articles... All my comments are in regard
you your posts on this forum, in which you did say that. ">Wouldn't THAT
be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no
matter
what the numbers are, you're not going to accept..."
This was your text, no?
Obviously there's nothing wrong with admitting you're wrong. What I was
pointing out was that it appears duplicitous to make the comment above
and then state you're probably going to post a retraction in your blog
based one users experience. I'm referring to the 10 GbE thread where one
user reported stellar throughput, which contradicted a contrived
theoretical maximum, and several reports of ho-hum throughput.
" 7500 MB/s! That's the most impressive numbers I've ever seen by FAR.
I may have to take back my "10 GbE is a Lie!" blog post, and I'd be
happy to do so."
This was your text, no?
So one could easily conclude that a position was taken (and published)
on this topic without sufficient testing or research (the related
SunSolve and other articles were already out there before these posts
were made).
You said: "Remember also that these posts are often done on my own time
late at night, etc. I never claimed to be perfect."
True, but you do cite that you are an author of books on the subject,
author of a blog on the subject, and work for one of the largest
industry resources. Indeed the " VP Data Protection". You can see how
maybe a newbie might assume a post as gospel with the barrage of
credentials? Would they not be disappointed to learn they need to check
the timestamp of a post before lending any credence to it's contents?
You said: " I don't think you'll find that to be a problem. I'm an
in-the-trenches guy, who has sat in front of many a tape drive, tape
library, and backup GUI in my 14 years in this space. I actually cut my
teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.)"
I'm not sure what you meant to imply by all this? If tenure with backup
is an issue, than I would suggest you really don't have all that much
time "in this space", relative to my experience anyway. I had been
working with various forms of backup for that long before MBNA even had
a Data Center in DE. Why would it be necessary to point out that you
were in the same geographic locale, or used the services of my employer?
I've never made mention of my employer, or even implied that any of my
statements represented any opinion or position of theirs? I find this
statement, well, bizarre...
Maybe I will attend the class after all. I'm beginning to think I'll be
entertained.
End transmission.
Regards,
Kent Eagle
MTS Infrastructure Engineer II, MCP, MCSE
Tech Services / SMSS
-----Original Message-----
From: Curtis Preston [mailto:cpreston < at > glasshouse.com]
Sent: Thursday, October 18, 2007 4:41 PM
To: Eagle, Kent; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: RE: Tapeless backup environments
Glad to have another person in the party. What's your birthday?
Are you seriously suggesting that a quote from "Wikipedia" constitutes
empirical scientific research?
NO. He said that I was misusing the Birthday Paradox, and I merely
pointed to the Wikipedia article that uses it the same way. If you
search on Birthday Paradox on Google, you'll also find a number of other
articles that use the BP in the same way I'm using it, specifically in
regards to hash collisions, as the concept is not new to deduplication.
It has applied to cryptographic uses of hashing for years.
I then went further to explain WHY the BP applies, and I gave a reverse
analogy that I believe completed my argument that the BP applies in this
situation. So..
As to whether or not what I'm doing is empirical scientific research,
it's not. Empirical research requires testing, observation, and
repeatability. For the record, I have done repeated testing of many
hash-based dedupe systems using hundreds of backups and restores without
a single occurrence of data corruption, but that doesn't address
the question. IMHO, it's the equivalent of saying a meteor has never
hit my house so meteors must never hit houses. The discussion is about
the statistical probabilities of a meteor hitting your house, and you
have to do that with math, not empirical scientific research.
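As an illustrative aside on the Birthday Paradox argument: the sketch below computes the exact probability that at least two of n uniformly random values, drawn from a space of N possibilities, coincide. The function name and the sample inputs are hypothetical and chosen only for illustration; nothing here is taken from a particular dedupe product.

```python
def exact_collision_probability(items: int, space: int) -> float:
    """Chance that at least two of `items` uniformly random values
    drawn from `space` possibilities are equal (the birthday problem)."""
    if items > space:
        return 1.0  # pigeonhole: a repeat is guaranteed
    p_all_distinct = 1.0
    for i in range(items):
        # The (i+1)-th value must avoid the i values already drawn.
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    # Classic case: 23 people, 365 possible birthdays.
    print(f"23 people, 365 days: {exact_collision_probability(23, 365):.4f}")
    # Hypothetical toy case: 1000 chunks hashed into a 32-bit space.
    print(f"1000 items, 2**32 values: {exact_collision_probability(1000, 2**32):.6f}")
```

The loop is exact but only practical for small counts; for the chunk counts in a real backup store, the usual closed-form approximation is needed instead.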
I would be the first to admit that "bob944" has made more than a few
posts that have "pushed my chair back a couple inches", but at least
they made me THINK!
And you're saying that neither my half-a-dozen or so blog postings on
the subject nor any of my responses in this thread make you think?
I was fine until I quoted Wikipedia, is that it?
Is pretty gutsy since you have another post within the past few days
stating you're ready to RETRACT what you already blogged on this, or
blogged on that.
I am admitting that I am not a math or statistics specialist and that I
misunderstood the odds before. What's wrong with that? That I was
wrong before, or that I'm stating it publicly that I was wrong before?
I was wrong. I was told I was wrong because I didn't apply the birthday
paradox. So I applied the Birthday Paradox in the same way I see
everyone else applying it, and the way that makes sense according to the
problem, and the numbers still come out OK.
Wouldn't THAT be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
No, because I never said those words or anything like them in my
article. I said, "some people say this, but I say that." Then I even
elicited feedback from the audience. The point of that portion of the
article was that some are talking about hash collisions as if they're
going to happen to everybody and happen a lot, and I wanted to add some
actual math to the discussion, rather than just talk about fear,
uncertainty, and doubt (FUD). I felt there was a little Henny-Penny
business going on.
If I am asked to restore something for the CEO, and can't, it won't
matter a hill of beans what all the theory was and what the odds were. I
either can, or I can't. I'll be accountable for that result, and why I
got it. As someone so accurately posted recently: We're in the recovery
business, not the restore business.
You won't get any argument from me. I think you'll find almost that
exact sentence in the first few paragraphs of any of my books. Having
said that, we all use technologies as part of our backup system that
have a failure rate percentage (like tape). And to the best of my
understanding, the odds of a single hash collision in 95 Exabytes of
data are significantly lower than the odds of having corrupted data on an
LTO tape and not even knowing it, based on the odds they publish. Even
if you make two copies, the copy could be corrupted, and you could have
a failed restore. Yet we're all ok with that, but we're freaking out
about hash collisions, which statistically speaking have a MUCH lower
probability of happening.
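A rough sketch of how a figure like this can be estimated, using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n chunks and N possible hash values. The 95 exabyte volume comes from the paragraph above; the 160-bit hash width and the 8 KiB chunk size are assumptions made for this example and are not stated in this message. The tape side of the comparison is not computed, since no published LTO error rate is quoted here.

```python
import math


def approx_collision_probability(data_bytes: int, chunk_bytes: int, hash_bits: int) -> float:
    """Birthday-bound estimate of at least one hash collision among all
    chunks of a data set, assuming uniformly distributed hash values."""
    chunks = data_bytes // chunk_bytes
    pairs = chunks * (chunks - 1) // 2   # number of chunk pairs that could collide
    exponent = pairs / 2**hash_bits      # expected number of colliding pairs
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    ASSUMED_HASH_BITS = 160        # assumption: a SHA-1 sized hash
    ASSUMED_CHUNK_BYTES = 8 * 1024  # assumption: 8 KiB chunks
    data_bytes = 95 * 10**18        # 95 exabytes, the figure quoted in the text
    p = approx_collision_probability(data_bytes, ASSUMED_CHUNK_BYTES, ASSUMED_HASH_BITS)
    print(f"estimated collision probability: {p:.3e}")
```

Changing the assumed hash width or chunk size shifts the result by many orders of magnitude, so the inputs matter more than the formula.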
I would think that almost everyone on this forum does some kind of pilot
before rolling something out into production.
I sure as heck hope so, but I don't think it addresses this issue. So
you test it and you don't get any hash collisions. What does that prove?
It proves that a meteor has never hit your house.
What I recommend (especially if you're using a hash-only de-dupe system)
is a constant verification of the system. Use a product like NBU that
can do CRC checks against the bytes it's copying or reading, and either
copy all de-duped data to tape or run an NBU verify on every backup. If
you have a hash collision, your copy or verify will fail, and you will at
least know when it happens.
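The idea behind that kind of verification can be sketched generically: record a checksum of the bytes as they are backed up, then recompute it over the bytes read back and compare. This is not NBU's interface; the function names are hypothetical, and Python's zlib.crc32 stands in for whatever check the backup product performs.

```python
import zlib
from typing import Iterable


def crc32_of(chunks: Iterable[bytes]) -> int:
    """Running CRC-32 over a stream of byte chunks."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)  # feed the previous value back in
    return crc & 0xFFFFFFFF


def verify_restore(original_crc: int, restored_chunks: Iterable[bytes]) -> bool:
    """True if the restored stream matches the checksum taken at backup time."""
    return crc32_of(restored_chunks) == original_crc


if __name__ == "__main__":
    backed_up = [b"payroll-", b"2007-10"]
    crc_at_backup = crc32_of(backed_up)
    # Simulate a dedupe store handing back the wrong block for the second chunk.
    restored = [b"payroll-", b"2007-11"]
    print("restore verified:", verify_restore(crc_at_backup, restored))
```

Because the check is computed independently of the dedupe hash, a wrong block returned by the store would show up as a mismatch rather than pass silently.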
I hope I'm wrong.
About what? That I'm an idiot? I think judging me solely on this
long, protracted, difficult-to-follow discussion (with over 70 posts) is
probably unfair. Remember also that these posts are often done on my
own time late at night, etc. I never claimed to be perfect.
I love to learn. I'm actually signed up for one of
your classes next week. But, if quoting everyone else's
posts/blogs/Wikipedia entries, etc. without backing up re-posting them
with empirical evidence or firsthand testing is your program agenda, I
will skip the engagement...
I don't think you'll find that to be a problem. I'm an in-the-trenches
guy, who has sat in front of many a tape drive, tape library, and backup
GUI in my 14 years in this space. I actually cut my teeth right down
the road from you as the backup guy at MBNA. (I lived in Newark, DE,
and you were my bank.) Don't skip out on the school just because I
quoted Wikipedia once.
BTW - You "Tilt at Windmills" (Don Quixote), you don't chase them.
You are right. I stand corrected again. Even Wikipedia backs you up:
http://en.wikipedia.org/wiki/Don_Quixote
(Sorry, just couldn't resist.)
|
| Fri Oct 19, 2007 8:39 am |
|
 |
Eagle, Kent
Guest
|
 Tapeless backup environments
Jeff,
The mix was deliberate. Please re-read my post & it should become
evident as to why. There was no implication that someone stated they had
experienced data loss.
In fact, nothing in my post is really speaking to dedupe or data loss.
It's about the posts themselves...
- Kent
-----Original Message-----
From: Jeff Lightner [mailto:jlightner < at > water.com]
Sent: Friday, October 19, 2007 12:36 PM
To: Eagle, Kent; Curtis Preston; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: RE: [Veritas-bu] Tapeless backup environments
Not an attack - just a question: Did someone in this thread say they
HAD experienced data loss due to deduplication? If so I missed it.
You mixed comments about another thread in here and I *think* you're
saying something about someone's experience with 10GigE rather than
deduplication. Your post could be misread to say someone had in fact
had such a data loss and posted it here.
-----Original Message-----
From: veritas-bu-bounces < at > mailman.eng.auburn.edu
[mailto:veritas-bu-bounces < at > mailman.eng.auburn.edu] On Behalf Of Eagle,
Kent
Sent: Friday, October 19, 2007 12:08 PM
To: Curtis Preston; veritas-bu < at > mailman.eng.auburn.edu
Cc: bob944 < at > attglobal.net
Subject: Re: [Veritas-bu] Tapeless backup environments
O.k., at the risk of seeming like "I wrote more than you, therefore I
must be right"...
2nd. (and last) post on this -
My first point was that you quoted a "Wikipedia" article as a source.
For me, it really had nothing to do with the subject matter. They have a
disclaimer as to the validity of anything on there, and for good reason:
Anyone can post anything on there, about anything, containing anything.
It might be right, it might be wrong. I would be far more inclined to
trust, or quote an industry consortium, or even a vendor's test results
page than "Wikipedia".
As long as we're throwing credentials around, I might as well mention: As
a former scientist, and statistician, and current engineer, I fully
understand what empirical research is. It INCLUDES math. It is the
actual testing and the statistics of that testing. FWIW: I was trained
in this and FMEA (Failure Modes Effects Analysis) by the gentleman who
ran the Reliability and Maintainability program for Boeing's Saturn and
Apollo space programs, as well as their VERTOL and fixed wing programs.
I can see where my second point could have easily been misinterpreted.
Apologies to anyone led astray. What I meant was that the posts made by
"Bob944" seemed to me to be supported by cited facts, and denoted
personal experiences. He's not pointing to something he previously
authored as proof that information is fact. I've only seen him reference
previous posts for the purposes of levelset. To be fair, I haven't read
any of your blog postings, only your posts in this forum. More on that
below. And yes; an "Industry Pundit, Author, SME", or whomever, quoting
"Wikipedia" as a source does tend to dilute credibility, in my mind.
It's not a personal attack, just my personal position on the issue.
The part below has me confused where you say "No, because I never said
those words or anything like them in my article." Since I never
mentioned anything about any articles... All my comments are in regard
to your posts on this forum, in which you did say that. ">Wouldn't THAT
be saying that up until that point, YOU
WERE SAYING "that no matter what the entire world is saying -- no matter
what the numbers are, you're not going to accept..."
This was your text, no?
Obviously there's nothing wrong with admitting you're wrong. What I was
pointing out was that it appears duplicitous to make the comment above
and then state you're probably going to post a retraction in your blog
based on one user's experience. I'm referring to the 10 GbE thread where one
user reported stellar throughput, which contradicted a contrived
theoretical maximum, and several reports of ho-hum throughput.
" 7500 MB/s! That's the most impressive numbers I've ever seen by FAR.
I may have to take back my "10 GbE is a Lie!" blog post, and I'd be
happy to do so."
This was your text, no?
So one could easily conclude that a position was taken (and published)
on this topic without sufficient testing or research (the related
SunSolve and other articles were already out there before these posts
were made).
You said: "Remember also that these posts are often done on my own time
late at night, etc. I never claimed to be perfect."
True, but you do cite that you are an author of books on the subject,
author of a blog on the subject, and work for one of the largest
industry resources. Indeed, the "VP Data Protection". You can see how
maybe a newbie might assume a post as gospel with the barrage of
credentials? Would they not be disappointed to learn they need to check
the timestamp of a post before lending any credence to its contents?
|
| Fri Oct 19, 2007 9:10 am |
|
 |