TSM 6.1, dedupe & DB2

IBM announced TSM 6.1 a few days ago, and it’s supposed to be GA in March.  Two long-awaited features are a full DB2 database and deduplication.  What do I think about them?  Read on.

TSM 6.1 will be the first TSM version to ship (Mar. 27) with a full-fledged DB2 database and deduplication. The former is one of the most anticipated TSM features in a long time, and the latter is the biggest thing in backup since the incremental.  So, let’s talk about these, shall we?

Let’s talk about dedupe first.  Remember that dedupe (all dedupe, not just TSM’s) eliminates redundant data in three ways: eliminating redundant files in repeated full backups, eliminating redundant blocks between incremental backups, and compressing data once it’s deduped.  In TSM environments, the first of these three should only happen in application backups (e.g. Oracle, Exchange), as TSM is designed not to do repeated full backups on filesystems.  The second will only happen if a customer leaves many days of backups in the disk pool.  Generally speaking, the more days of backups they leave in the disk pool, the more dedupe they will get.  Conversely, if there aren’t multiple versions of a given file in the disk pool, TSM will not be able to find redundant blocks between them, and you’ll get no dedupe. Therefore, TSM’s deduplication will most help those who store data in their disk pools longer than is typical.  From what I’ve seen, most TSM customers will need to alter their use of the disk pool to get the full benefit of dedupe, since they typically only store a few days’ worth of backups there.
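To see why retention in the disk pool drives the dedupe ratio, here’s a toy illustration (this is not TSM’s actual algorithm; the chunk size, hashing scheme, and sample data are all arbitrary choices for the sketch):

```python
# Toy sketch of block-level dedupe across versions of a file.
# Not TSM's algorithm: real products use KB-sized (often variable) chunks.
import hashlib

CHUNK = 4  # bytes per chunk, deliberately tiny for the example

def chunks(data: bytes):
    """Split data into fixed-size chunks."""
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

def dedupe_ratio(versions):
    """Logical bytes stored divided by unique bytes kept after dedupe."""
    logical = sum(len(v) for v in versions)
    unique = {}
    for v in versions:
        for c in chunks(v):
            unique[hashlib.sha256(c).hexdigest()] = len(c)
    return logical / sum(unique.values())

v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCCCEEEE"   # one chunk changed between backups

print(dedupe_ratio([v1]))        # one version in the pool: no redundancy to find
print(dedupe_ratio([v1, v2]))    # two versions: the shared chunks dedupe
```

With only one version in the pool the ratio is 1:1; add a second, mostly-identical version and the shared chunks start paying off, which is exactly why longer disk-pool retention yields better ratios.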

Another challenge with TSM’s dedupe is that it is storage pool based.  That is, data in one storage pool is only compared to other data in that same storage pool.  It is not compared to data in other storage pools.  This means that TSM customers wanting to get the best dedupe ratio will want to minimize the number of storage pools they use.  Since storage pools are not shared between instances, they will also want to minimize the number of TSM instances they create.  This is a perfect segue to the next topic of DB2, as it will increase TSM’s ability to handle larger instances.

Now let’s talk about DB2.  While TSM’s pre-6.0 database is a relational database (which is more than you can say for many backup products), it isn’t a “full-fledged” database like DB2.  To understand the importance of DB2 to TSM shops, we need to discuss the difference between the way traditional backup products record their backup history and the way TSM records it.

Traditional backup products perform a full or incremental backup of a filesystem and store that as a unit on disk or tape.  (We’ll call this unit an “image.”)  They record in their database that image-n came from system-n, contains filesystem-n, was made on date-n, and expires in n days; this is one record in the image table of the database. They then make a record in the database for each file that is in that image, and part of that record specifies that this file is stored in image-n.  Every day, the database looks in the image table for images that are past their expiration date, which is a relatively simple query.  Then, for each image that has passed its expiration date, they expire all file-level records that are attached to that image.  This daily process can take from seconds to a few minutes, depending on the size of the environment and the efficiency of the backup product in question.
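To make the contrast concrete, here is a toy sketch of that image-level expiration in SQLite.  The table and column names are invented for illustration and don’t correspond to any real product’s schema:

```python
# Illustrative only: image-level expiration as two cheap operations.
# Schema is invented for this sketch, not taken from any backup product.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE images (image_id INTEGER PRIMARY KEY, system TEXT,
                     filesystem TEXT, made_on TEXT, expires_on TEXT);
CREATE TABLE files  (file_id INTEGER PRIMARY KEY, path TEXT,
                     image_id INTEGER REFERENCES images(image_id));
INSERT INTO images VALUES (1, 'host1', '/home', '2009-01-01', '2009-02-01'),
                          (2, 'host1', '/home', '2009-03-01', '2009-04-01');
INSERT INTO files VALUES (10, '/home/a.txt', 1), (11, '/home/a.txt', 2);
""")

today = '2009-03-15'
# Step 1: find expired images -- a simple query against one small table.
expired = [r[0] for r in db.execute(
    "SELECT image_id FROM images WHERE expires_on < ?", (today,))]
# Step 2: drop every file record attached to those images, then the images.
db.executemany("DELETE FROM files WHERE image_id = ?", [(i,) for i in expired])
db.executemany("DELETE FROM images WHERE image_id = ?", [(i,) for i in expired])

print(expired)  # image 1 is past its expiration date
```

Because the expensive decision (which images are expired) is made against the small image table, and file records are only deleted in bulk afterward, the daily run stays fast.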

TSM is very different.  It does store files in aggregates, which could be considered analogous to images (both are chunks of data containing one or more files), but aggregates are very different from images in at least one very important way: files in an aggregate expire independently of each other.  (There are more differences, but I’m focusing on the difference that is germane to the topic at hand.)  For example, suppose you told TSM to store 7 versions of all files in a given filespace (e.g. the E: drive).  (There is also a time-based element as well, but I’m ignoring that right now for simplicity’s sake.)  Once the eighth version of a given file in that filespace has been backed up by TSM, the oldest version of that file will be expired the next time TSM runs its expiration process.  When TSM runs the expiration process, it runs a query that looks at every file in its database to see whether extra versions are stored, or versions have been kept longer than they’re supposed to be; any such versions are then expired.  This expiration process is significantly more complex than the process outlined above for traditional products.  This is why expiration takes significantly longer with TSM than it does with other products, with times reported on the TSM mailing list ranging from hours to more than 24 hours.  (The latter is the exception and is indicative of a poorly-configured TSM instance.)
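A toy sketch of that version-based logic (greatly simplified: it keeps a fixed number of versions per file and ignores TSM’s time-based retention rules entirely):

```python
# Simplified version-based expiration: keep the newest N versions of each
# file, expire the rest. Time-based retention is deliberately ignored.
from collections import defaultdict

VERSIONS_TO_KEEP = 7

def expire(backups):
    """backups: list of (path, backup_time) tuples, one per stored version.
    Returns the versions to expire: everything beyond the newest N per file."""
    by_file = defaultdict(list)
    for path, when in backups:
        by_file[path].append(when)
    expired = []
    for path, times in by_file.items():
        times.sort(reverse=True)               # newest first
        for when in times[VERSIONS_TO_KEEP:]:  # extras beyond the limit
            expired.append((path, when))
    return expired

# Eight versions of one file: the oldest version gets expired.
backups = [("E:/report.doc", day) for day in range(1, 9)]
print(expire(backups))
```

Note that the decision has to be made per file, across every file the server knows about, which is exactly what makes the real process so database-intensive compared to the image-level query above.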

Another TSM administrative task that is pertinent is reclamation.  Since files expire independently of one another, this creates “holes” on tape, where data that is still needed (retained data) is intermingled with expired data.  At some point, there are more “holes” than retained data, and the TSM customer must move the retained data from a large set of tapes to a smaller set of tapes, after which the original tapes are made available for reuse.  For example, consider 21 tapes that are 70% reclaimable (i.e. 70% empty); only 30% of the data on each tape is needed for restores.  The reclamation process would consolidate that data to 7 tapes and return the original 21 tapes to circulation.  Since TSM must record the movement of all files from one tape to another, reclamation is also a very database-intensive process.  The same is true for the migration and copy processes where data is moved from one pool to another for various reasons.
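The arithmetic in that example works out like this (a back-of-the-envelope calculation, not a TSM formula):

```python
# Back-of-the-envelope reclamation math for the example above.
import math

tapes = 21
reclaimable = 0.70                     # 70% of each tape is expired "holes"
retained = tapes * (1 - reclaimable)   # tape-equivalents of live data (6.3)
target = math.ceil(retained)           # full tapes needed after consolidation

print(target)  # 7 tapes hold the retained data; all 21 originals are freed
```

Every one of those 6.3 tapes’ worth of files gets a new location that must be recorded in the database, which is why reclamation is so database-intensive.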

While TSM’s progressive incremental method does reduce the amount of time it takes to get the initial backup done, the processes of expiration, migration, and reclamation can take up so much time that some users run out of hours in the day to complete them.  They are then presented with two options: scale back the processes in question or split their TSM instance.  An example of scaling back would be to increase the reclamation threshold so that only tapes that are 80% or 90% reclaimable will be reclaimed.  While this saves time, it decreases media utilization.  A TSM customer may also consider not copying some of their backups to a copy pool.  While this also saves time, it decreases their operational readiness.  If a customer is unable to scale back enough (or at all), their only option is to split the TSM instance into multiple instances.  The data for some nodes is moved to another TSM instance that will then manage backups for those nodes.  The customer often opts to run this additional instance on the same physical server on which the existing instance is running.  I’m no database expert, but what this says to me is that the server wasn’t out of gas; the database was.

Finally, there is the TSM database audit.  It is not run on a regular basis; it is only run when support says to run it.  (If they suspect something is wrong with your TSM database, they may tell you to run an audit.)  The problem is that in order to run a full database audit, TSM must be down.  That means no backups, no expiration, no reclamation, no nothing.  Since the length of time it takes an audit to run is directly related to the size of your TSM database, the possibility of ever running an audit was another reason that TSM customers kept their databases to a “reasonable” size.  While there is no hard-and-fast rule, the sizes I’ve seen mentioned most often on the TSM mailing list have been under 200 GB, and usually under 150 GB. After that, most TSM customers end up splitting their TSM instance into multiple instances.

Now we finally arrive at why having a full-fledged version of DB2 on a TSM server is a big deal.  First, DB2 should be able to handle database-intensive activities much better (and faster) than the previous TSM database. (IBM is claiming up to a 50% reduction in the time it takes expiration to run; YMMV, of course.)  Second, DB2 allows customers to perform a database audit without shutting down TSM. Both of these should allow TSM customers to manage much larger instances of TSM without having to split them into multiple instances.

I’d really be curious to hear from TSM customers about their experiences with 6.1.  What kind of improvement have you seen in your administrative tasks?  What kind of dedupe ratios are you getting, and how long are you storing data in the disk pool?  I can’t wait to hear.

5 thoughts on “TSM 6.1, dedupe & DB2”

  1. jaspreetis says:

    Looks like a target based dedup.

    I wonder why most of the backup vendors have only target-based dedup. Is it by choice? Or by restrictions from their earlier design?

    thoughts ?

  2. cpjlboss says:

    Symantec and EMC have source-based, CommVault and IBM have target-based. One difference in both cases is that Symantec and EMC acquired, whereas IBM & CommVault rolled their own. The only reason I can guess is that it’s easier, but that’s really just a guess.

  3. jaspreetis says:

    There are a few startups doing source based.

    But a whole bunch of others trying to do target based (in backup) – CA announced new beta of ArcServe with target dedup, so did some of the others like HP (beta for DP).

    No one develops …
    IBM, Symantec, CA, EMC have almost closed down R&D centers. They only acquire startups.

    IBM acquired Diligent to get dedup.

  4. cpjlboss says:

    Perhaps (and I’m saying that only because I don’t know) the R&D houses are smaller than before, but they’re not non-existent. All the companies who you have named have released new (and sometimes very big) functionality in recent releases — functionality that was not acquired. I haven’t seen the CA or HP announcements. (It’s hard to follow all of these products.)

    You mentioned that IBM acquired Diligent to “get dedup.” Diligent’s technology is completely unrelated to the dedupe coming out in TSM 6.1. TSM’s dedupe was already in development before they bought Diligent. Diligent’s technology was designed in a particular way that makes sense if you’re going to be handed a whole bunch of different formats, but it would be silly to use it if you were designing for just one format — and controlled that format.

    I’m back to the only reason I can think of. It must be easier to start at the top.

  5. jaspreetis says:

    Curtis,

    Although I agree with the points you made, I know for sure about Veritas. They did not invest much in their products (including Backup Exec and NetBackup), and most of the features Symantec adds come from small acquisitions.

    Almost the complete storage stack that ships with HP (including the JFS file system) was made by Veritas.

    disclosure: I have spent a significant part of my career in Veritas R&D.
