DataDomain and dedup per node
Post DataDomain and dedup per node 
Hi Everyone,

As we have been implementing our two new DD boxes we have been
setting them up like our existing two DD boxes - file devices
with the pool NOT collocated. This is what DD recommends and
it seems to work very well this way.

But, I've been thinking about collocating anyway!

I was poking around the DD command line and found that you
can get the dedup/compression information for any individual
directory or file. For example, below are the dedup/comp
factors for a file volume in a pool with one node I'm testing with:

rsbkup:/tsmdata/tsm_scripts==>./run_cmd.ksh tsm2 "q nodedata WVLOGS01P" | grep isdd2260
WVLOGS01p  /isdd2260/tsm2/test/0002267E.BFS  TEST-PRI-ISDD2260  30,551.83
WVLOGS01P  /isdd2260/tsm2/test/0002267F.BFS  TEST-PRI-ISDD2260  30,621.15
WVLOGS01P  /isdd2260/tsm2/test/00022680.BFS  TEST-PRI-ISDD2260  30,601.55
WVLOGS01P  /isdd2260/tsm2/test/00022682.BFS  TEST-PRI-ISDD2260  30,604.08
WVLOGS01P  /isdd2260/tsm2/test/00022683.BFS  TEST-PRI-ISDD2260  30,620.86
WVLOGS01P  /isdd2260/tsm2/test/00022684.BFS  TEST-PRI-ISDD2260   4,731.24

rsbkup:/tsmdata/tsm_scripts==>./run_cmd.ksh tsm2 "q vol /isdd2260/tsm2/test/0002267E.BFS"
/isdd2260/tsm2/test/0002267E.BFS  TEST-PRI-ISDD2260  TEST  30.6 G  100.0  Full

sysadmin@isdd2260# filesys show compression /data/col1/tsm2/test/0002267e.bfs
Total files: 1; bytes/storage_used: 4.6
       Original Bytes: 32,332,636,620
  Globally Compressed: 30,695,597,675
   Locally Compressed:  6,930,888,022
            Meta-data:     98,615,480

In this case, this vol is getting a 4.6x overall dedup/comp factor.
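
The 4.6 figure looks consistent with Original Bytes divided by (Locally Compressed + Meta-data). A minimal Python sketch of that arithmetic follows. It assumes that formula (inferred from the numbers above, not taken from DD documentation) and the output layout shown above. The names parse_compression and overall_factor are made up for illustration.

```python
import re

# Output of "filesys show compression" as quoted in the post above.
SAMPLE = """\
Total files: 1; bytes/storage_used: 4.6
Original Bytes: 32,332,636,620
Globally Compressed: 30,695,597,675
Locally Compressed: 6,930,888,022
Meta-data: 98,615,480
"""

def parse_compression(text):
    """Return {label: int} for every 'Label: 1,234' line in the output."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z][A-Za-z -]*):\s*([\d,]+)\s*$", line)
        if m:
            stats[m.group(1).strip()] = int(m.group(2).replace(",", ""))
    return stats

def overall_factor(stats):
    """Assumed formula: original bytes / (locally compressed + metadata)."""
    stored = stats["Locally Compressed"] + stats["Meta-data"]
    return stats["Original Bytes"] / stored

stats = parse_compression(SAMPLE)
print(round(overall_factor(stats), 1))  # 4.6
```

The "Total files" summary line is skipped on purpose, so the factor is recomputed from the byte counts rather than read back from the DD.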

So, if I collocate the pool in TSM I should be able to use "q nodedata
<node>" to get a list of vols used by a node, then I can query the DD to
get the dedup/comp stats for that node. A little scripting and I can
generate a report of dedup/comp ratios by TSM node. This would help us
decide which nodes make sense to put/keep on the DD.
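
A rough Python sketch of that scripting idea is below. It is an illustration only: the path mapping (/isdd2260/... on the TSM side to lower-cased /data/col1/... on the DD) is inferred from the single example above, and all function names are hypothetical. How the DD command actually gets run (ssh or otherwise) is left out.

```python
def volumes_from_nodedata(output, marker="/isdd2260/"):
    """Pull unique volume names out of 'q nodedata' output, in order."""
    vols = []
    for line in output.splitlines():
        for field in line.split():
            if field.startswith(marker) and field not in vols:
                vols.append(field)
    return vols

def tsm_to_dd_path(volume, tsm_prefix="/isdd2260", dd_prefix="/data/col1"):
    """Map a TSM volume name to its DD path (mapping assumed from the example)."""
    if not volume.startswith(tsm_prefix + "/"):
        raise ValueError("unexpected volume path: " + volume)
    return dd_prefix + volume[len(tsm_prefix):].lower()

def node_factor(per_volume_stats):
    """Byte-weighted factor over (original, locally_compressed, metadata) tuples."""
    rows = list(per_volume_stats)
    original = sum(r[0] for r in rows)
    stored = sum(r[1] + r[2] for r in rows)
    return original / stored

nodedata = "WVLOGS01P /isdd2260/tsm2/test/0002267E.BFS TEST-PRI-ISDD2260 30,551.83"
vols = volumes_from_nodedata(nodedata)
print([tsm_to_dd_path(v) for v in vols])
# Only one volume's DD stats appear in the post, so only that one is used here.
print(round(node_factor([(32332636620, 6930888022, 98615480)]), 1))
```

Summing bytes across a node's volumes and dividing once weights each volume by its size. Averaging the per-volume ratios would let a small filling volume skew the result.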

Just curious if anyone is using collocation for a DD file pool? To do so
would use more volumes and more filling volumes, but I can't think of any
real reason to not collocate.

Rick




-----------------------------------------
The information contained in this message is intended only for the
personal and confidential use of the recipient(s) named above. If
the reader of this message is not the intended recipient or an
agent responsible for delivering it to the intended recipient, you
are hereby notified that you have received this document in error
and that any review, dissemination, distribution, or copying of
this message is strictly prohibited. If you have received this
communication in error, please notify us immediately, and delete
the original message.

Post DataDomain and dedup per node 
The most serious problem we have encountered is the effect of
reclamation on backup throughput. We have a 1 GB backup network that is
used for servers to write directly to the Data Domain, bypassing TSM.
When only one client is writing directly to the DD we see backup network
utilization around 85%. When reclamation is running the client gets 5%
backup network utilization. I've cancelled reclamation and watched the
client throughput increase, then drop again when autoreclamation
restarts. We now only run reclamation when the client's 1 TB daily
backup is complete (a 4- to 5-hour process).

Collocation will increase the number of files used to store data and to
be reclaimed. If you can run reclamation when no other processing is
running there should be no impact, but watch your network stats for
verification.

Side Comment: We run NDMP backups across fiber to a VTL on the Data
Domain. There is no effect on backup network utilization when the NDMP
backups are running. I'm still puzzled by this. Reclamation seems to
use enough DD resources to slow backup network data ingestion, but NDMP
backups running with a higher throughput don't use enough DD processing
power to slow (or even affect) a direct write by a client over Ethernet.

Jim Schneider

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L < at > vm.marist.edu] On Behalf Of
Richard Rhodes
Sent: Thursday, April 19, 2012 8:28 AM
To: ADSM-L < at > vm.marist.edu
Subject: [ADSM-L] DataDomain and dedup per node


**********************************************************************
Information contained in this e-mail message and in any attachments thereto is confidential. If you are not the intended recipient, please destroy this message, delete any copies held on your systems, notify the sender immediately, and refrain from using or disclosing all or any part of its content to any other person.

Post DataDomain and dedup per node 
I misspoke in my previous email. The server on the backup network
writes to TSM using the TDP for SQL client, not directly to the Data
Domain.

Jim Schneider


Post DataDomain and dedup per node 
I was told the only reason EMC recommends turning off collocation is that
collocation shoots up the individual volume count, and they also generally
recommend a relatively high reclamation threshold. I think these 2
factors together might end up in a lot of wasted unreclaimed space. I
think it would be ok if you were more aggressive with your reclamation.
Something to keep an eye on at the least.

On another note, I've always been suspicious of whether or not granular
analysis like this is accurate. The deduplication of a single file would
vary depending on the other data that is on the system, which is
constantly changing. If you delete all the other files that share data
with this one, will the deduplication factor of this file shoot up?
If so, then the deduplication ratio means nothing for a single file the way
a compression ratio would. I think it really only applies to the storage
pool as a whole.

Using collocation to identify "bad dedupe citizens" sounds reasonable, but
only if the values being returned by the "filesys show compression"
command are accurate. Is that data dynamically updated? Are the
individual file deduplication ratios updated immediately and automatically
as data is written or cleaned? I remember Falconstor only recorded the
deduplication ratio of a virtual tape at the time the data was written, and
it was not updated afterwards. I find it hard to believe this is dynamically
maintained by the Data Domain, but I'd definitely want to know before
switching to collocation for this purpose.

Deduplication adds an abstraction layer between the file metadata and the
actual storage. I don't see how you could really get an accurate picture
of the true storage an individual file is occupying since it is sharing
space. Say there are 10x 100MB files sharing 50 percent of their data
with each other. How much space is one of those files occupying?
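
To put numbers on that question, here is a small Python sketch under one possible reading of the example: every file consists of the same 50 MB shared by all ten files plus 50 MB unique to itself. That reading is an assumption, and local compression is ignored. The answer depends on which accounting convention you pick.

```python
FILES = 10
FILE_MB = 100
SHARED_MB = 50  # assumption: the same 50 MB is common to all ten files

unique_mb = FILE_MB - SHARED_MB                # data only this file holds
physical_mb = SHARED_MB + FILES * unique_mb    # shared part is stored once
logical_mb = FILES * FILE_MB
pool_ratio = logical_mb / physical_mb

# Convention 1: space you would get back by deleting just this one file.
freed_if_deleted_mb = unique_mb
# Convention 2: unique data plus an even split of the shared data.
even_share_mb = unique_mb + SHARED_MB / FILES

print(f"physical space for the pool: {physical_mb} MB")
print(f"pool-wide dedup ratio:       {pool_ratio:.2f}x")
print(f"freed by deleting one file:  {freed_if_deleted_mb} MB")
print(f"even-share attribution:      {even_share_mb} MB")
```

Under these assumptions the pool as a whole has one well-defined number, while a single file is "occupying" 50 MB or 55 MB depending on the convention, which is the ambiguity being pointed at.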



Regards,
Shawn
________________________________________________
Shawn Drew








This message and any attachments (the "message") is intended solely for
the addressees and is confidential. If you receive this message in error,
please delete it and immediately notify the sender. Any use not in accord
with its purpose, any dissemination or disclosure, either whole or partial,
is prohibited except formal approval. The internet can not guarantee the
integrity of this message. BNP PARIBAS (and its subsidiaries) shall (will)
not therefore be liable for the message if modified. Please note that certain
functions and services for BNP Paribas may be performed by BNP Paribas RCC, Inc.

Post DataDomain and dedup per node 
I am suspicious of dedup ratios in general. What I found is that I can divide my data by 4 and be fairly accurate as to how much storage the DD will need. This formula has worked for 2 TSM (12-14:1) and 2 BE (20-25:1) sites, so I would not call it proven, except in my little world.
BRMS seems to be different.

Andy Huebner

Perhaps this conversation should be at:
The Data Domain Admins List
http://lists.ufl.edu/cgi-bin/wa?A0=DD-ADMINS-L

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L < at > VM.MARIST.EDU] On Behalf Of Shawn Drew
Sent: Thursday, April 19, 2012 11:12 AM
To: ADSM-L < at > VM.MARIST.EDU
Subject: Re: [ADSM-L] DataDomain and dedup per node


This e-mail (including any attachments) is confidential and may be legally privileged. If you are not an intended recipient or an authorized representative of an intended recipient, you are prohibited from using, copying or distributing the information in this e-mail or its attachments. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete all copies of this message and any attachments.

Thank you.

Post DataDomain and dedup per node 
On 04/19/2012 10:04 AM, Schneider, Jim wrote:
> The most serious problem we have encountered is the effect of
> reclamation on backup throughput. We have a 1 GB backup network that is
> used for servers to write directly to the Data Domain, bypassing TSM.
> When only one client is writing directly to the DD we see backup network
> utilization around 85%. When reclamation is running the client gets 5%
> backup network utilization. I've cancelled reclamation and watched the
> client throughput increase, then drop again when autoreclamation
> restarts. We now only run reclamation when the client's 1 TB daily
> backup is complete (a 4- to 5-hour process).


Another axis on which reclamation will affect your DD behavior is the
amount of dead space due to uncleaned snapshots. You can think of this
as analogous to PENDING physical volumes. Below I've included a
terribly contrived example, which I hope is at least a little clear.

The key is that there's a gap of formally unused, but not-yet-available
space on the DD. The size of this gap is related to the churn rate of
the files. If you reclaim madly, you'll eventually make that gap grow,
a lot.

OTOH, some of that space will dedupe (I'd guess that copying 'most of' a
file from A to B on the DD ought to dedupe pretty darn well). :)

----

Say you run snapshots daily and keep 7. You write one volume a day,
keep two weeks of them. You've got REUSEDELAY set to 5, and we'll start
on a Monday, Jan 1.

So, at day 14 two Sundays downrange, you've got files

/ddfiles/vol01 -> vol14 present in TSM, on the filesystem

On Monday the 15th, vol01 is PENDING to TSM, but still present on the
filesystem.

On Saturday the 20th, vol01 is deleted from the filesystem, but
referenced by the snapshot you took on the 19th. Gotta keep it around.

On Friday the 26th, this is _still_ true. vol01 is deleted but still
referenced by an active snapshot. It's now in the company of
vol02-vol06. Sometime that day, though, the snapshot from the 19th will
expire. BUT WAIT THERE'S MORE...

You don't get the space back from the expired snapshot until the cleanup
on (by default) the following Tuesday. So on January 30th, if all goes
well, you'll run the administrative process which returns to you the
space occupied by vol01.


We're accustomed to thinking in terms of volumes and reusedelays: this
is really just another few iterations of that, but it's important to
keep in mind that they sum.
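
The way those delays add up can be sketched in a few lines of Python. This is only an illustration of the walkthrough above: the year 2018 is a hypothetical choice made so that the 15th lands on a Monday and the 30th on a Tuesday as in the example, and the function name and parameters are made up. It also assumes a snapshot that expires on a cleaning day waits for the following week's cleaning.

```python
from datetime import date, timedelta

def space_timeline(written, keep_days=14, reuse_delay=5,
                   snapshot_keep=7, cleaning_weekday=1):
    """Dates at which one volume passes each stage (Monday=0, Tuesday=1)."""
    pending = written + timedelta(days=keep_days)        # volume goes PENDING
    deleted = pending + timedelta(days=reuse_delay)      # REUSEDELAY is over
    last_snapshot = deleted - timedelta(days=1)          # last snapshot holding it
    snapshot_expires = last_snapshot + timedelta(days=snapshot_keep)
    # Days until the next weekly cleaning strictly after the expiry (assumption).
    wait = (cleaning_weekday - snapshot_expires.weekday()) % 7 or 7
    cleaned = snapshot_expires + timedelta(days=wait)
    return {
        "written": written,
        "pending": pending,
        "deleted": deleted,
        "snapshot_expires": snapshot_expires,
        "cleaned": cleaned,
    }

t = space_timeline(date(2018, 1, 1))  # hypothetical year, see note above
for stage, day in t.items():
    print(f"{stage:17} {day:%a %b %d}")
print("days from PENDING to space returned:", (t["cleaned"] - t["pending"]).days)
```

With the example's settings the space comes back roughly two weeks after the volume goes PENDING, even though REUSEDELAY alone is only 5 days.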


- Allen S. Rout

Post DataDomain and dedup per node 
We have already encountered this situation although TSM was not part of
the picture. Oracle data that was written directly to the DD was not
deleted by the RMAN cleanup process. Our DD utilization at the
replication client was above 95% and we needed to reclaim space. We
deleted several TB of old data and the DBAs implemented RMAN cleanup.
We had to wait for the snapshots to expire, then had to wait for the
Tuesday DD cleaning process. With daily snapshots/one week retention it
took about two weeks for deleted files to finally stop using space.

Jim Schneider

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:ADSM-L < at > vm.marist.edu] On Behalf Of
Allen S. Rout
Sent: Monday, April 23, 2012 9:39 AM
To: ADSM-L < at > vm.marist.edu
Subject: Re: [ADSM-L] DataDomain and dedup per node


Post DataDomain and dedup per node 
Hi Everyone,

Thanks for the thoughts/comments.

> I was told the only reason EMC recommends turning off collocation is
> that collocation shoots up the individual volume count

yup, that's my understanding - more vols, more reclamation

> also recommend a relatively high reclamation threshold. I think these 2
> factors together might end up in a lot of wasted unreclaimed space. I
> think it would be ok if you were more aggressive with your reclamation.
> Something to keep an eye on at the least.

They suggested a reclamation threshold of 30%. We are using 35%. Fairly
aggressive reclamation.

I'm not too worried about throughput issues with reclamation.
We are implementing 10g ethernet connections from the TSM
servers into the DD. We currently are not replicating between
these new DD's since we do not have enough network between
our datacenters. Once a higher speed (dedicated) net is
in place for DD replication we can get rid of our tape copy
pools which will greatly lower our traffic from the DD's.

> On another note, I've always been suspicious of whether or not granular
> analysis like this is accurate.
> Using collocation to identify "bad dedupe citizens" sounds reasonable,
> but only if the values being returned by the "filesys show compression"
> command are accurate. Is that data dynamically updated? Are the
> individual file deduplication ratios updated immediately and
> automatically as data is written or cleaned?

So far, one of our new DD's has 215tb on 48tb of disk.
I've been testing nodes whose occupancy is greater than
10tb to see what kind of dedup they are getting. I
created a DD based file pool, and made it a next pool
behind our existing tape pool while set to never
migrate. I then run move nodedata from the tape pool
into the file pool (and back). I let about 500gb get moved, then
see what dedup the file pool is getting. So far the DD
is responding with stats (filesys show compression /path/to/dir)
that reflect the immediate status of the dir. It seems to
gather the stats right when the cmd is run. The manual
indicates that this cmd can take quite some time to run
if there are many files to be looked at under a dir.

> I find it hard to believe this is dynamically
> maintained by the Data Domain, but I'd definitely want to know before
> switching to collocation for this purpose.

Dynamically maintained - no, I don't think so for an
arbitrary dir or file. I think it computes it when
the cmd is run. Thus the warning in the manual
about how long it can run.

Deduplication adds an abstraction layer between the file metadata and
the
actual storage. I don't see how you could really get an accurate
picture
of the true storage an individual file is occupying since it is sharing
space. Say there are 10x 100MB files sharing 50 percent of their data
with each other. How much space is one of those files occupying?

Our concern is that the DD is a limited resource - how much
disk it has. If some node dumps a bunch of backups that
don't dedup that could greatly affect the disk available.
A couple examples.
1) audio files - we have a node that records audio, and the recordings are
kept for 2 years. It currently has a TSM occupancy of 48tb. We expected it
wouldn't be a good fit for a DD. Sure enough, when I put some of its
backups onto the DD I got almost no dedup/compression.
2) Notes mail backups - We use Notes for email. It's our understanding
that the mail boxes are already compressed somehow by Notes. If you take
one and gz it, it doesn't shrink much. We figured it was another bad fit
for the DD. To our surprise, when I put part of the backups on the DD it
deduped at a 5x ratio. It turns out this is a good fit for DD.

What we are worried about is a client that starts sending
backups that shouldn't be on the DD. Also, we can't test every
node for a DD fit. If TSM occupancy stays static but DD disk
starts to grow, how will we tell which node or nodes are the problem?
We would see it as the DD filling disk, but this may have little
relation to the TSM occupancy of a node. We can easily see
having to dig for nodes that should really be on tape. (Our tape system
isn't going away any time soon.)

So. After reading the replies by everyone, we are going to
collocate our DD file pool and I'm going to try and create a
report that will list our nodes, occupancy, DD disk, and a computed ratio.


We'll see . . .this may crash and burn . . .


Thanks!

Rick




-----------------------------------------
The information contained in this message is intended only for the
personal and confidential use of the recipient(s) named above. If
the reader of this message is not the intended recipient or an
agent responsible for delivering it to the intended recipient, you
are hereby notified that you have received this document in error
and that any review, dissemination, distribution, or copying of
this message is strictly prohibited. If you have received this
communication in error, please notify us immediately, and delete
the original message.

Post DataDomain and dedup per node 
On 04/23/2012 03:15 PM, Richard Rhodes wrote:
Hi Everyone,


Our concern is that the DD is a limited resource - how much
disk it has. If some node dumps a bunch of backups that
don't dedup that could greatly affect the disk available.

Amen, brother. Preach it!

A couple examples.
1) audio files - we have a node that records audio, and the recordings are
kept for 2 years. It currently has a TSM occupancy of 48tb. We expected it
wouldn't be a good fit for a DD. Sure enough, when I put some of its
backups onto the DD I got almost no dedup/compression.
2) Notes mail backups - We use Notes for email. It's our understanding
that the mail boxes are already compressed somehow by Notes. If you take
one and gz it, it doesn't shrink much. We figured it was another bad fit
for the DD. To our surprise, when I put part of the backups on the DD it
deduped at a 5x ratio. It turns out this is a good fit for DD.

5x is a good deal?

I don't know how good you are at negotiating, but my DD disk is way more
than 5x as expensive as raw.



So. After reading the replies by everyone, we are going to
collocate our DD file pool and I'm going to try and create a
report that will list our nodes, occupancy, DD disk, and a computed ratio.

Remember, you can do show compression down to the file level..


sysadmin@cns-prod-dd01# filesys show compression /data/col1/remedy/prod/data/df_REMEDY_P_1002_1_779409006
Assuming default global compression is type 1.
Total files: 1; bytes/storage_used: 3.9
Original Bytes: 6,995,806,860
Globally Compressed: 6,453,299,125
Locally Compressed: 1,774,055,288
Meta-data: 19,701,904

so that equates to the volume level if you use a collocated FILE devclass.
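
For anyone scripting against this output, here is a minimal parsing sketch. It assumes the output layout shown above and guesses that "bytes/storage_used" is Original Bytes divided by (Locally Compressed + Meta-data). That formula is an inference from the numbers, not something taken from the DD documentation, though it does reproduce the 3.9 reported above. The function and field names are made up for illustration.

```python
import re

# Sample text copied from the "filesys show compression" output above.
SAMPLE = """\
Total files: 1; bytes/storage_used: 3.9
Original Bytes: 6,995,806,860
Globally Compressed: 6,453,299,125
Locally Compressed: 1,774,055,288
Meta-data: 19,701,904
"""

# Label text as it appears in the output; the dict keys are our own names.
FIELDS = {
    "original": r"Original Bytes:\s*([\d,]+)",
    "global": r"Globally Compressed:\s*([\d,]+)",
    "local": r"Locally Compressed:\s*([\d,]+)",
    "meta": r"Meta-data:\s*([\d,]+)",
}


def parse_compression(text):
    """Pull the byte counts out of the output and derive a ratio."""
    stats = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"missing field: {name}")
        stats[name] = int(match.group(1).replace(",", ""))
    # Assumption: space used on disk = locally compressed data + metadata.
    stats["stored"] = stats["local"] + stats["meta"]
    stats["ratio"] = stats["original"] / stats["stored"]
    return stats


stats = parse_compression(SAMPLE)
print(f"{stats['original']} -> {stats['stored']} bytes, {stats['ratio']:.1f}x")
```

Summing "original" and "stored" across all volumes of a node, rather than averaging the per-volume ratios, should give a ratio weighted by volume size.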



- Allen S. Rout

Post DataDomain and dedup per node 
Remember, you can do show compression down to the file level..

Exactly!

What I plan to script goes something like this:
(assumes collocation is in good shape - vols dedicated to individual
nodes)

get list of nodes using the DD
for each node
    get list of vols (q nodedata <node> stgpool=<ddpool>)
    for each vol
        ssh to DD and get dedup info (filesys show compression <file>)

With some arithmetic I can get totals per node.
I'm thinking of a report with something like this . . .

<node> <node_occupancy> <num_vols> <dd_pre_dedup_size> <dd_dedup_size>
<dd_dedup_ratio> <date_time_stamp>

(the date stamp at the end allows grep'ing through multiple files to get
a history for a node in date order)
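
The loop and report line above could be sketched roughly as follows. This is only an outline under stated assumptions: the two lookup callables stand in for the "q nodedata" and ssh steps (here they are plain dict lookups on canned data), "stored" is taken to be Locally Compressed + Meta-data, and all function names are hypothetical. The byte counts are the ones from the 0002267E.BFS example in the first post; the occupancy figure and time stamp are placeholders.

```python
from datetime import datetime


def report_line(node, occupancy_mb, vol_stats, stamp):
    """vol_stats is a list of (original_bytes, stored_bytes), one per volume."""
    pre = sum(original for original, _ in vol_stats)
    post = sum(stored for _, stored in vol_stats)
    ratio = pre / post if post else 0.0
    return (f"{node} {occupancy_mb:.2f} {len(vol_stats)} "
            f"{pre} {post} {ratio:.2f} {stamp.strftime('%Y-%m-%d_%H:%M:%S')}")


def build_report(nodes, list_volumes, volume_stats, stamp):
    """nodes is a list of (node_name, occupancy_mb).

    list_volumes(node) and volume_stats(vol) are supplied by the caller;
    in a real script they would wrap the TSM query and the ssh to the DD.
    """
    lines = []
    for node, occupancy_mb in nodes:
        stats = [volume_stats(vol) for vol in list_volumes(node)]
        lines.append(report_line(node, occupancy_mb, stats, stamp))
    return lines


# Canned data standing in for the TSM and DD queries.
VOLS = {"WVLOGS01P": ["/isdd2260/tsm2/test/0002267E.BFS"]}
STATS = {
    "/isdd2260/tsm2/test/0002267E.BFS":
        (32_332_636_620, 6_930_888_022 + 98_615_480),
}

lines = build_report(
    [("WVLOGS01P", 30551.83)],      # placeholder occupancy in MB
    VOLS.__getitem__,
    STATS.__getitem__,
    datetime(2012, 4, 23, 15, 15),  # placeholder time stamp
)
for line in lines:
    print(line)
```

Keeping the queries behind callables means the arithmetic can be checked on canned data before pointing it at a live TSM server or DD.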

Rick



