
Data Deduplication

Posted by Anonymous 
Data Deduplication
August 26, 2007 02:16AM
Is TSM planning on adding data deduplication similar to Avamar? I
understand that TSM does not duplicate data now, but minor edits to files
or simple file name changes would result in additional copies of the
entire file with TSM today.

We recently had a pitch from EMC on Avamar. I can think of some reasons
to pass on it (having two separate backup/restore solutions is a big
one, cost is another), but some persuasive arguments were made supporting
their solution. If TSM is going to be adding similar functionality soon,
it may be another reason to focus on other efforts.

George Hughes

Senior UNIX Engineer

Children's National Medical Center

12211 Plum Orchard Dr.

Silver Spring, MD 20904

(301) 572-3693

Data Deduplication
August 26, 2007 05:24AM
On Aug 26, 2007, at 4:58 AM, Hughes, George wrote:

[quote]Is TSM planning on adding data deduplication similar to avamar? I
understand how TSM does not duplicate data now but minor edits in
files
or simple file name changes would result in additional copies of the
entire file using TSM today.
[/quote]
Except in Windows, where Adaptive Subfile Backup may be employed.
That's as far as it has gone in the product thus far.

Richard Sims
Data Deduplication
August 26, 2007 08:25AM
But it is being pursued for future release - after the conversion of the DB to DB2.

Data Deduplication
August 26, 2007 04:06PM
[quote]Is TSM planning on adding data deduplication similar to avamar?
[/quote]
As mentioned by Richard, the closest thing TSM has to this now is
subfile backup. It is related to de-duplication: once it has a backup of
a given file, it backs up only the changed bytes of that file. This is
also referred to as a delta incremental.

True de-duplication takes this much farther, as it would recognize a
file or email that's duplicated on two or three different systems, such
as an attachment/email that's sent to users on several different
Exchange servers. The "compression" ratios it can achieve are therefore
much higher than those of delta incrementals.
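
As a rough illustration of that idea, here is a minimal sketch of
hash-based block de-duplication (the fixed 4KB chunk size and SHA-256
fingerprints are assumptions for the example, not any particular vendor's
algorithm):

[code]
import hashlib

CHUNK_SIZE = 4096   # assumed fixed chunk size; real products vary
store = {}          # fingerprint -> chunk bytes (the single-instance store)

def dedupe_ingest(data: bytes):
    """Split a backup stream into chunks and keep each unique chunk once.
    Returns the list of fingerprints needed to rebuild the stream."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in store:          # only never-seen chunks consume space
            store[fp] = chunk
        recipe.append(fp)
    return recipe

def rebuild(recipe):
    """Reassemble the original stream from the stored chunks."""
    return b"".join(store[fp] for fp in recipe)

# The same attachment backed up from three Exchange servers is stored once:
attachment = b"A" * (5 * CHUNK_SIZE)
for server in ("exch1", "exch2", "exch3"):
    dedupe_ingest(attachment)
print(len(store))   # 1 unique chunk stored, although 15 were ingested
[/code]

The engineering challenge in a real product is doing this at scale: the
fingerprint index has to be consulted for every incoming chunk, which is
where the vendors differ.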

[quote]I understand how TSM does not duplicate data now but minor edits in files
or simple file name changes would result in additional copies of the
entire file using TSM today.
[/quote]
Instead of switching from TSM to something like Avamar (EMC) or Puredisk
(Symantec), a TSM user can benefit from de-dupe today by using a
de-duplication backup target, such as a de-dupe VTL or NAS device. Just
make sure you realize that you won't get the same de-dupe ratios as
non-TSM users. (TSM customers who switch to a de-dupe target are seeing
approximately 10:1 de-dupe ratios, whereas non-TSM customers are seeing
20:1.)

Most TSM users don't do repeated full backups of their filesystems, and
with other products a lot of the duplicated data comes from exactly those
full backups. But TSM users still have duplicated data: multiple versions
of the same file, and database backups. You already mentioned edited
versions of the same file. It is also common for a file to be present in
multiple places. In addition, TSM users do perform periodic full backups
of their databases.

[quote]We recently had a pitch from EMC on avamar. I can think of some reasons
to pass on it (Having two separate backup/restore solutions is a big
one, cost etc) but some persuasive arguments were made supporting their
solution.
[/quote]
If you like the idea of using de-dupe to back up your remote offices
(which is what Avamar and Puredisk are designed for), but want to stay
with TSM, again de-dupe targets can help. Buy a small de-dupe target to
place at your remote site, perform TSM backups to it, then replicate the
new/unique blocks to a central location as your offsite mechanism.

[quote]If TSM is going to be adding similar functionality soon it may
be another reason to focus on other efforts.
[/quote]
Writing a de-dupe backup product isn't easy. EMC bought Avamar and
Symantec bought Data Center Technologies to get their respective
products. I don't know of any other de-dupe companies for IBM to
acquire, so they'll have to write their own. That may take them a bit
longer.
Data Deduplication
August 27, 2007 12:31AM
Hi,

[quote]Writing a de-dupe backup product isn't easy. EMC bought Avamar and
Symantec bought Data Center Technologies to get their respective
products. I don't know of any other de-dupe companies for IBM to
acquire, so they'll have to write their own. That may take them a bit
longer.
[/quote]
We're just testing a deduplication disk array from Data Domain with TSM.
The compression ratio is much less than promised by the sales people.
During the last 10 days of incremental backups we only achieved a ratio
of 2.6:1. The disk array is very expensive, and for the money you can buy
more plain disk than you would need without compression.

--
Regards,

Dirk Kastens
Universitaet Osnabrueck, Rechenzentrum (Computer Center)
Albrechtstr. 28, 49069 Osnabrueck, Germany
Tel.: +49-541-969-2347, FAX: -2470
Data Deduplication
August 27, 2007 05:27AM
How are you using it? As your disk cache? You have to store backups on
it long term in order to get de-duplication.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 27, 2007 06:24AM
Because TSM does incrementals, your de-dupe ratio will be lower than
with other full/incremental backup products. Here are a few lessons
learned with TSM and a Diligent ProtecTier:

1) Do the best you can to put like data together (i.e., all Oracle
DB backups go to the same de-dupe virtual tape head/repository).

2) Turn off all compression (client and databases).

3) Oracle specific: do not use RMAN's multiplexing. RMAN will combine 4
channels together, so the backup data will be unique every time and will
not de-dupe. Use the File Seq=1 setting and then run multiple channels
instead. (A toy illustration of why interleaving hurts de-dupe follows
below.)

4) Do not mix Windows and Unix data; it won't de-dupe well.
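
A toy simulation of the interleaving problem described in point 3 (the
piece and chunk sizes are made up, and it models a generic fixed-chunk
de-dupe target rather than RMAN's actual on-tape format or ProtecTier's
algorithm): the same unchanged data, multiplexed in a different order on
each run, produces almost no repeating chunks.

[code]
import hashlib
import random

def make_file(tag: str, size: int = 64 * 1024) -> bytes:
    """Deterministic but non-repeating content, so every piece is distinct."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{tag}-{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])

def multiplex(files: dict, piece: int = 1024, seed: int = 0) -> bytes:
    """Interleave pieces of several files in a pseudo-random order --
    a rough stand-in for a multiplexed backup stream."""
    rng = random.Random(seed)
    cursors = {name: 0 for name in files}
    out = bytearray()
    while cursors:
        name = rng.choice(sorted(cursors))
        out += files[name][cursors[name]:cursors[name] + piece]
        cursors[name] += piece
        if cursors[name] >= len(files[name]):
            del cursors[name]
    return bytes(out)

def chunk_hashes(stream: bytes, size: int = 4096) -> set:
    """Fingerprints of fixed-size chunks -- a toy model of a de-dupe target."""
    return {hashlib.sha256(stream[i:i + size]).hexdigest()
            for i in range(0, len(stream), size)}

files = {f"datafile{i}": make_file(f"datafile{i}") for i in range(4)}
night1 = chunk_hashes(multiplex(files, seed=1))
night2 = chunk_hashes(multiplex(files, seed=2))  # same unchanged data, new order
print(f"{len(night1 & night2)} of {len(night1)} chunks repeat between runs")
[/code]

Whether a real de-dupe product is actually defeated by this depends on
how it finds duplicate data, which is debated further down the thread.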

We are seeing 10:1 to 15:1 on our Oracle and DB2 data, and 12:1 on
Exchange (the databases and Exchange all do daily full backups). In the
regular Windows environment we see 1.45:1 to 2:1. Ick...

Hope this helps.

Regards,

Charles

Data Deduplication
August 27, 2007 07:46AM
[quote]On Sun, 26 Aug 2007 04:58:45 -0400, "Hughes, George" <GHughes < at > CNMC.ORG> said:
[/quote]

[quote]Is TSM planning on adding data deduplication similar to avamar? I
understand how TSM does not duplicate data now but minor edits in
files or simple file name changes would result in additional copies
of the entire file using TSM today.
[/quote]
The 'Rename a top level directory' problem is the best case I've seen
for something like this in TSM.

Every pitch I've yet seen on de-dupe has glossed over where the
metadata goes, and how it's defended. If you see a data deduplication
solution which doesn't take at -least- as much care over the DB as we
do in TSM land, then my opinion is "Flee at flank speed".

In TSM-land we're very sensitive to the fact that our TSM database is
both the key to our featureset and the most delicate part of our
infrastructure, so we take neurotic degrees of care with it. I think
perhaps the new products have yet to blood themselves. Be careful it
doesn't splash on you when they do. :)

- Allen S. Rout
Data Deduplication
August 27, 2007 08:04AM
At 09:24 AM 8/27/2007, Charles A Hart wrote:
[quote]We are seeing 10 and 15:1 on our Oracle and DB2, Exchange 12:1 (The
DB's and Exchange all do Daily Full Backups) In the regular Win
env, we see 1.45 and 2:1 ick...
[/quote]
Charles, how many Windows clients was this with? I've been thinking
about this, and targeting specific data to a smaller deduping VTL might
make more sense than just putting everything there. Specifically,
Windows System Objects might be a good candidate, as well as e-mail
attachments.

Thanks for the Oracle hints.

..Paul

--
Paul Zarnowski Ph: 607-255-4757
Manager, Storage Services Fx: 607-255-8521
719 Rhodes Hall, Ithaca, NY 14853-3801 Em: psz1 < at > cornell.edu
Data Deduplication
August 27, 2007 08:32AM
This is GREAT information, thanks much!

Data Deduplication
August 27, 2007 09:40AM
[quote]3) Oracle Specific
Do not use RMAN's Multiplexing in RMAN will combine 4
Channels together and the backup data then will be unique every time thus
not allowing for de-duping)
Use the File Seq=1 (Then run multiple channels)
[/quote]
I don't see how this would affect de-duplication if your de-dupe product
knows what it's doing. Every block coming into the device should be
compared to every other block ever seen by the device. So combining
multiple files together using Oracle multiplexing shouldn't affect
de-dupe.

Did you test this, or see it in the docs somewhere? Was this true for
multiple de-dupe vendors, or just the one you chose?
Data Deduplication
August 27, 2007 09:48AM
[quote]Every pitch I've yet seen on de-dupe has glossed over where the
metadata goes, and how it's defended. If you see a data deduplication
solution which doesn't take at -least- as much care over the DB as we
do in TSM land, then my opinion is "Flee at flank speed".
[/quote]
I have seen that, but my experience is that if you ask the right
questions, you'll get the right answers. If they DON'T give you the
right answers, then go on to the next vendor. ;)
Data Deduplication
August 27, 2007 09:56AM
Preston, I believe it depends on the de-dupe technology being used. We
have started to play with the NetApp iSIS (dedupe product), and at least
in their case they don't look at every block coming into the host.

Their documentation is lacking, but from what we have been able
to deduce, it seems to take a hash of the first chunk of every file,
then compares the hashes and tries to de-dupe where they match. We saw
that 400GB of 5GB files took about 3 minutes to de-dupe, while 400GB of
1MB files took over 23 hours. The number of files seems to dictate how
long a de-dupe will take, which to me doesn't sound like it is looking
at every block, because the number of blocks with data on the filer was
actually the same between the two attempts.

Like I said, this is my interpretation of the results of my
testing, not anything I saw documented.

Ben
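
For illustration, here is a rough sketch of the kind of per-file,
first-chunk fingerprinting being inferred above (it is only an
interpretation of the observed timings, not NetApp's documented
algorithm; the 4KB chunk size, the byte-for-byte verification step, and
the directory path are assumptions for the example):

[code]
import filecmp
import hashlib
import os

FIRST_CHUNK = 4096   # assumed size of the leading chunk that gets fingerprinted

def first_chunk_fingerprint(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(FIRST_CHUNK)).hexdigest()

def candidate_duplicates(paths):
    """Group files whose leading chunks hash the same, then verify candidates
    byte-for-byte. The fingerprint pass scales with the number of files, not
    the number of blocks stored, which would explain why many small files
    take far longer than a few large ones."""
    groups = {}
    for p in paths:
        groups.setdefault(first_chunk_fingerprint(p), []).append(p)
    dupes = []
    for candidates in groups.values():
        for other in candidates[1:]:
            if filecmp.cmp(candidates[0], other, shallow=False):
                dupes.append((candidates[0], other))
    return dupes

# Hypothetical example: scan an export for files that could share one copy.
paths = [os.path.join(root, name)
         for root, _, names in os.walk("/some/export") for name in names]
print(candidate_duplicates(paths))
[/code]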

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto] On Behalf Of
Curtis Preston
Sent: Monday, August 27, 2007 10:40 AM
To: ADSM-L < at > VM.MARIST.EDU
Subject: Re: Data Deduplication

[quote]3) Oracle Specific
Do not use RMAN's Multiplexing in RMAN will combine 4
Channels together and the backup data then will be unique every time
[/quote]thus
[quote]not allowing for de-duping)
Use the File Seq=1 (Then run multiple channels)
[/quote]
I don't see how this would affect de-duplication if your de-dupe product
knows what it's doing. Every block coming into the device should be
compared to every other block ever seen by the device. So combining
multiple files together using Oracle multiplexing shouldn't affect
de-dupe.

Did you test this, or see it in the docs somewhere? Was this true for
multiple de-dupe vendors, or just the one you chose?
Data Deduplication
August 27, 2007 10:53AM
According to Diligent, when RMAN uses multiplexing it intermingles the
data from each channel, so the data blocks will be different every time,
similar to multiplexing with NetBackup... I'm not an RMAN expert, just
trusting what the vendor is stating.

The following link seems to match with what we are being told

http://download.oracle.com/docs/cd/B19306_01/backup.102/b14191/rcmconc1002.htm
(Look for the Multiplex Section)

Is there an RMAN expert in the house? Can someone confirm this info?

Charles Hart
UHT - Data Protection
(763)744-2263
Sharepoint:
http://unitedteams.uhc.com/uht/EnterpriseStorage/DataProtection/default.aspx

Data Deduplication
August 27, 2007 11:23AM
I agree that this is what Oracle does. What I'm not sure of is whether
or not this de-dupe issue applies to de-dupe vendors other than Diligent.
I've fired off a few emails and I'll post back when they reply.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 27, 2007 12:10PM
Thanks Curtis!

Charles Hart
UHT - Data Protection

Data Deduplication
August 27, 2007 12:27PM
At 12:40 PM 8/27/2007, Curtis Preston wrote:
[quote]Every block coming into the device should be compared to every other
block ever seen by the device.
[/quote]

As others have noted, different vendors dedup at different levels of
granularity. When I spoke to Diligent at the Gartner conference over
a year ago, they were very tight-lipped about their actual
algorithm. They would, however, state that they were able to dedup
parts of two files that had similar data, but were not
identical. I.e., if data was inserted at the beginning of the file,
some parts of the end of the file could still be deduped. Neat trick
if it's true. Other vendors dedup at the file or block (or chunk) level.

I've not been able to gather much more detail about the specific
dedup algorithms, but hope to get some more info this fall as I take a
closer look at these products. If anyone has more details, I'd love
to hear them.

..Paul

--
Paul Zarnowski Ph: 607-255-4757
Manager, Storage Services Fx: 607-255-8521
719 Rhodes Hall, Ithaca, NY 14853-3801 Em: psz1 < at > cornell.edu
Data Deduplication
August 27, 2007 01:15PM
[quote]As others have noted, different vendors dedup at different levels of
granularity.
[/quote]
I think I'd put it slightly differently. I'd say that they each
approach it differently. Those different approaches may have advantages
and disadvantages with different data types.

[quote]When I spoke to Diligent at the Gartner conference over
a year ago, they were very tight-lipped about their actual
algorithm.
[/quote]
The patent was filed. It's not that secret. ;) They are quite
different in their approach, and it's a little difficult to grok. But
based on what I know about their approach, the scenario that started the
discussion may indeed be a limitation. (Or all the vendors may have
this limitation; I have some questions out to them.)

[quote]The[y] would, however, state that they were able to dedup
parts of two files that had similar data, but were not
identical. I.e., if data was inserted at the beginning of the file,
some parts of the end of the file could still be deduped. Neat trick
if it's true.
[/quote]
Any de-dupe vendor is able to claim that. If it wasn't true, they
wouldn't see the de-dupe rates they're seeing. They can also identify
blocks that are common between a file in the file system and the same
file emailed via Exchange.

[quote]Other vendors dedup at the file or block (or chunk) level.
[/quote]
If a vendor doesn't do subfile de-dupe, then they're not a de-dupe
vendor; they're a CAS vendor. File-level de-dupe is CAS (e.g. Centera,
Archivas), and the de-dupe is not really pitched as the main feature.
It's about using the signature as a way to provide immutability of data
stored in the CAS array.

[quote]I've not been able to gather much more detail about the specific
dedup algorithms, but hope to get some more info this fall, as take a
closer look at these products. If anyone has more details, I'd love
to hear them.
[/quote]
I wrote this article that may help: http://tinyurl.com/3588fb . I also
blog about de-dupe quite a bit at www.backupcentral.com.
Data Deduplication
August 27, 2007 02:14PM
Curtis - I'm unclear on your terminology. Are you equating "subfile"
to "block" level deduping? To me, block level means block
boundaries, whereas subfile doesn't have the boundary
restriction. Perhaps I interpret these words this way because of my
history. To me, a block is a 4K chunk (or 1K or some fixed
amount). But I am suspecting that this is not what you mean.

In fact, my impression was that some vendors deduped at a block level
(my definition) and others at a subfile level, which to me is probably
more valuable but also probably more performance-costly to implement.

I've read lots of articles about this and talked with many
vendors. I'll take a look at your article. Thanks.

--
Paul Zarnowski Ph: 607-255-4757
Manager, Storage Services Fx: 607-255-8521
719 Rhodes Hall, Ithaca, NY 14853-3801 Em: psz1 < at > cornell.edu
Data Deduplication
August 27, 2007 08:38PM
I use "subfile" to differentiate from file-level de-dupe, which is
really only CAS. (A subfile de-dupe product will, of course, notice two
files that are exactly the same as well -- just like a file-level CAS
product will.)

Subfile to me means that it looks inside the file, and looks for
duplicated information inside that file. Consider two versions of a
file stored inside TSM, for example. A subfile de-dupe product would
notice that most of the information between those two files is the same
and store that info once. Then it would also store any info that is
unique to each file.
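
One common way to get that behavior is content-defined chunking: instead
of cutting the stream at fixed offsets, you cut wherever a rolling hash
over the last few bytes hits a chosen pattern, so the boundaries follow
the content and an insertion near the front of a file only disturbs the
chunks around the edit. A minimal sketch of the general technique (the
window size, mask, and synthetic data are assumptions; this is not a
description of any particular vendor's patented method):

[code]
import hashlib
import random

WINDOW, MASK, BASE = 48, 0x1FFF, 263     # assumed parameters, ~8KB average chunk
TOP = pow(BASE, WINDOW, 1 << 32)

def cdc_chunks(data: bytes, min_size: int = 1024, max_size: int = 65536):
    """Cut data into variable-size chunks wherever a rolling hash of the last
    WINDOW bytes hits the mask, so boundaries follow content, not offsets."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) & 0xFFFFFFFF
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * TOP) & 0xFFFFFFFF   # slide the window
        size = i - start + 1
        if (size >= min_size and (h & MASK) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fingerprints(data: bytes) -> set:
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}

rng = random.Random(42)
v1 = bytes(rng.randrange(256) for _ in range(256 * 1024))   # synthetic "file"
v2 = b"a few bytes inserted near the front of the file" + v1
f1, f2 = fingerprints(v1), fingerprints(v2)
print(f"{len(f1 & f2)} of {len(f1)} chunks of v1 reappear unchanged in v2")
[/code]

With fixed 4KB blocks, the same insertion would shift every later block
boundary, and almost nothing would match.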

I stay away from terms like block, chunk, and fragment in this context
because they mean different things to different people, and mean other
things historically outside of de-dupe.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 28, 2007 05:26AM
Dirk

I also tried Data Domain and was not impressed. I now use Diligent's
ProtecTier, and it's far more impressive. It's scalable, reasonably
priced, achieves throughput of 200MB per second and better, and gets
factoring ratios of over 10 to 1.

Regards

Jon

Data Deduplication
August 28, 2007 07:56AM
I have now verified with two other de-dupe vendors (Falconstor &
SEPATON) that Oracle multiplexing is not an issue for them. (I also
have a question in to Diligent to verify that this accurately reflects
the way their product works. They said they would get back to me
today.)

Having said that, I think this does present something to test with any
de-dupe product you are considering. I knew that multiplexing in NW and
NBU might be an issue for some de-dupe products, but I never thought
that multiplexing in Oracle might be one. I wonder what other
apps might munge data together like this in their backup streams...

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 28, 2007 09:18AM
Hi,

Jon Evans wrote:
[quote]Dirk

I also tried Data Domain and was not impressed. I now use Diligent's
Protectier and its far more impressive. Its scalable, reasonably priced,
achieves throughput of 200mb per second and better and factoring ratio's
of
Over 10 to 1
[/quote]
We mainly back up normal files and only keep 3 backup versions, so the
compression will not be more than 3:1 or 5:1. The best results can be
achieved with databases and application data like Exchange; that's what
the people from Data Domain said. I'm just running another test with
MySQL and Domino data. Let's wait and see :-)

--
Regards,

Dirk Kastens
Universitaet Osnabrueck, Rechenzentrum (Computer Center)
Albrechtstr. 28, 49069 Osnabrueck, Germany
Tel.: +49-541-969-2347, FAX: -2470
Data Deduplication
August 28, 2007 03:29PM
That sounds about right. Data Domain's a good product with a lot of
happy customers, but TSM customers who are only backing up files and
only keeping 3 versions aren't going to be among them. ;) You've got to
back up database/app/email-type data that does recurring full backups,
and/or keep a whole lot more than 3 versions, to have de-dupe make sense
for you. That's not a Data Domain thing. That's just how de-dupe
works.

In addition, it won't work if you use it as you would normally use a
disk pool (1-2 days of backups and then move to tape). There won't be
anything to de-dupe against, and you'll get close to nothing. You need
to leave your onsite backups permanently on it for de-dupe to work.
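
A rough way to see why retention drives the ratio (an illustrative
back-of-envelope model with assumed numbers, not a measurement): the
ratio is roughly the logical data written divided by the unique data
kept, so with only a few versions retained there is simply not much to
de-dupe against.

[code]
def dedupe_ratio(versions_retained: int, daily_change_rate: float) -> float:
    """Toy model: each retained version counts as one full-sized logical copy,
    but only the changed fraction of each later version is actually unique."""
    logical = versions_retained
    unique = 1 + (versions_retained - 1) * daily_change_rate
    return logical / unique

print(round(dedupe_ratio(3, 0.05), 1))    # ~2.7:1 with only 3 versions retained
print(round(dedupe_ratio(30, 0.05), 1))   # ~12.2:1 with a month of daily fulls kept
[/code]

The assumed 5% daily change rate is arbitrary, but the shape of the
result lines up with the 2.6:1 and 10-15:1 figures reported earlier in
the thread.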

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 29, 2007 08:47AM
Any idea why Diligent's dedup ratio is better? What's different
about the dedup algorithm that makes it work better?


--
Paul Zarnowski Ph: 607-255-4757
Manager, Storage Services Fx: 607-255-8521
719 Rhodes Hall, Ithaca, NY 14853-3801 Em: psz1 < at > cornell.edu
Data Deduplication
August 29, 2007 09:07AM
I have been hearing bits and pieces about this de-dup thing.

Several things have me wondering, as folks on this list also
testify.

One thing I haven't heard about is performance. Even with TSM
clients, there is the advice not to do "compression" on the client
due to performance issues, and that is just for individual files
or data streams.

As de-dup, from what I have read, compares across all files
on a "system" (server, disk storage or whatever), it seems
to me that this will be an enormous resource hog of CPU, memory
and disk I/O. I am not just talking about using it as some part of TSM
disk, but, for instance, on a file server.

Any experiences/comments?

David Longo

Data Deduplication
August 29, 2007 09:35AM
The compression challenge is more that compression creates unique backup
objects that cannot be de-duped; it also causes a CPU performance hit on
the client. As for performance-related experience: with Diligent
ProtecTier running on a Sun V40 with 4 dual-core processors and 32GB of
memory, we see a maximum of 250MB/s writes per ProtecTier head, and up to
800MB/s reads.

[quote]As de-dup, from what I have read, compares across all files
on a "system" (server, disk storage or whatever), it seems
to me that this will be an enormous resource hog
[/quote]
The de-dup technology only compares/looks at the files within its
specific repository. Example: we have 8 ProtecTier nodes in one data
center, which equates to 8 virtual tape libraries and 8 repositories.
The data that gets compared is only within 1 of the 8 repositories. This
is why, to get the best bang for the buck, you want to match like data
up. What we've been trying to do is run two TSM instances on an LPAR,
one that backs up Unix prod and the other Unix non-prod; we register a
prod DB client on the prod TSM instance and the dev DB client on the dev
TSM instance, and the two instances share a ProtecTier library, so in
theory your prod and non-prod backups should factor very well.

Hope this helps!

Regards,

Charles Hart

Data Deduplication
August 29, 2007 09:36AM
First, I would say the only thing that this post shows is that Diligent
had a better de-dupe ratio with this customer's data -- not that
Diligent's de-dupe is better overall. The different vendors use VERY
DIFFERENT ways to scan the incoming data and identify redundant pieces
of data. Those different ways will work better or worse for different
environments and different types of backup software and backed-up data.

I've tested a number of these products in a number of environments and
TRUST ME: your mileage will vary. The best product for one is NOT the
best product for another.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 29, 2007 11:10AM
De-dupe comes in two flavors:
1. Target de-dupe
2. Source de-dupe

Target de-dupe is de-dupe inside a VTL/IDT (intelligent disk target).
You send it regular TSM backups and it finds the duplicate data within
it. A good vendor of this type should give you all the benefits of
de-dupe without any performance issues during backup or restore -- even
in a very large environment.

Source de-dupe is backup software (e.g. EMC Avamar, Symantec Puredisk,
Asigra Televaulting) written to de-dupe data before it ever leaves the
client. This software definitely requires significant amounts of CPU on
the client, but the amount of bandwidth it saves is worth the trouble.
These products are therefore best for backing up remote office data, not
large datacenters.

---
W. Curtis Preston
Backup Blog < at > www.backupcentral.com
VP Data Protection, GlassHouse Technologies

Data Deduplication
August 29, 2007 12:08PM
[quote]As de-dup, from what I have read, compares across all files
on a "system" (server, disk storage or whatever), it seems
to me that this will be an enormous resource hog
[/quote]
Exactly. To make sure everyone understands, the "system" is the
intelligent disk target, not a host you're backing up. A de-dupe
IDT/VTL is able to de-dupe anything against anything else that's been
sent to it. This can include, for example, a file in a filesystem and
the same file inside an Exchange Sent Items folder.
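
To put a rough number on the resource question raised above
(back-of-envelope arithmetic with assumed chunk and index-entry sizes,
not any vendor's actual design): the main cost on the target is the
fingerprint index it must consult for every incoming chunk.

[code]
def index_size_gb(unique_data_tb: float, avg_chunk_kb: float = 8.0,
                  bytes_per_entry: int = 40) -> float:
    """Approximate fingerprint-index size: one entry (hash plus location)
    per unique chunk stored on the de-dupe target."""
    chunks = unique_data_tb * 1e12 / (avg_chunk_kb * 1024)
    return chunks * bytes_per_entry / 1e9

print(round(index_size_gb(10), 1))    # roughly 48.8 GB of index for 10 TB unique
print(round(index_size_gb(100), 1))   # roughly 488.3 GB for 100 TB unique
[/code]

Which is one reason the appliances either keep the comparison per
repository (as the quote below describes), page the index in from disk,
or scale out across multiple heads.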

[quote]The de-dup technology only compares/looks at the files within its
specific repository. Example: we have 8 ProtecTier nodes in one data
center, which equates to 8 virtual tape libraries and 8 repositories.
[/quote]

There are VTL/IDT vendors that offer a multi-head approach to
de-duplication. As you need more throughput, you buy more heads, and
all heads are part of one large appliance that uses a single global
de-dupe database. That way you don't have to worry about which
backups go to which heads. Diligent's VTL Open is a multi-headed VTL,
but ProtecTier is not -- yet. I would ask them their plans for that.

While this feature is not required for many shops, I think it's a very
important feature for large shops.