Backups all in one night

I know it may sound like an obvious statement, but I think all your backups should fit in one night.  Not just 24 hours, but 12 hours or less.  Besides the obvious RTO/RPO problems, it just messes up so many other parts of the design when backups go over 24 hours.  A related issue is that I think it's a bad idea to push all your full backups to the weekend.

What's that, you say?  You have backups that go over 12 or 24 hours?  Your product does full backups and they run all weekend long, and sometimes into Monday?  I think you should look at your design.

  1. You'll never have time to copy those full backups if they don't finish until Sunday night/Monday morning. (I know, you're not doing that either, but that's another blog entry.)
  2. It masks problems. Usually nothing is watching to see whether any single backup runs all weekend long, so some of your backups might stop you from meeting any reasonable recovery time objective (RTO).
  3. It usually means no incrementals when your full backups are running for multiple days, so you've got recovery point objective (RPO) issues as well.
  4. For TSM customers, instead of full backups, think about expiration and reclamation activities.  They'd better finish completely before your next backup cycle, or you've got problems too.

If you do full backups, I'd rather see you spread your full backups out across the week or even across the month.  Every night, do a full backup of 1/28th of your environment, do a cumulative incremental or level 1 backup of 1/7th of your environment, and an incremental of the rest.  That creates a predictable amount of data backed up each night that you can engineer for — including making copies.
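
One way to picture the rotation described above is the sketch below. It is only an illustration under stated assumptions: the function names, the 56 made-up server names, and the rule that a server's cumulative night falls on the same weekday as its full are all hypothetical, not something a particular backup product provides. Each server still gets a complete full of its own; only the night it lands on is staggered, so every night carries roughly 1/28th of the servers as fulls and about 1/7th as fulls plus cumulatives.

```python
from collections import Counter

FULL_CYCLE_DAYS = 28        # each server gets one full per 28-day cycle
CUMULATIVE_CYCLE_DAYS = 7   # and a cumulative incremental once a week


def backup_level(client_index, day):
    """Return the backup level for one server on a given day number.

    Assumption: servers are spread evenly by index, and a server's weekly
    cumulative falls on the same weekday as its monthly full.
    """
    full_slot = client_index % FULL_CYCLE_DAYS
    if day % FULL_CYCLE_DAYS == full_slot:
        return "full"
    if day % CUMULATIVE_CYCLE_DAYS == full_slot % CUMULATIVE_CYCLE_DAYS:
        return "cumulative"
    return "incremental"


def nightly_plan(clients, day):
    """Map every server name to tonight's backup level."""
    return {name: backup_level(i, day) for i, name in enumerate(clients)}


# Hypothetical environment of 56 servers.
clients = [f"server{i:02d}" for i in range(56)]
for day in range(3):
    print(day, dict(Counter(nightly_plan(clients, day).values())))
```

Because the mix of levels is the same every night, the nightly data volume stays close to constant, which is what makes it possible to size drives, disk, and copy time against a fixed window.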

And if you STILL can't get your backups done within 12/24 hours, then we've got more to talk about.  Chances are my answer will not be "you need more drives."

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

17 thoughts on “Backups all in one night”

  1. cuyler says:

    When splitting your full backups over multiple nights you do have to manage the client’s interpretation of what an RTO means. If you rely 100% on tape in the case of a disaster and you lose all (or most) of the systems you back up, can you still meet your RTO overall?

    If you have 10 systems with an RTO of 24 hours and it takes three days of 12 hour full backups with the tape infrastructure you have then in the case of a disaster where all 10 servers have been lost there is no way you can get all that data off the tapes fast enough. Sure, you could recover a few systems in that RTO but not all of them.

    I’ve had a few clients that outline a data recovery RTO for day-to-day lost data (e.g. single server failure) and then another for the recovery of the data centre. Some clients assumed that the RTO applied for both situations and that your tape infrastructure was endless.

  2. cpreston says:

    First we have to make sure we can meet the RTO of individual systems. Then we need to talk about how to meet that RTO if we lost the whole datacenter. If we can meet the former, we can usually meet the latter with enough hardware (translate: $$$). I’m not saying it’s easy, but it’s within the realm of possibility. BUT if you can’t meet the RTO of a single system, then we definitely can’t meet the RTO of the entire datacenter.

    While we’re at it, I’m not talking about taking the backups of an individual server and spreading those full backups out across the month. I’m only talking about doing a full on one server one night, and a full on another server the next night.

  3. cdevidal says:

    Our backups are currently running into 3 days, and with our predicted growth it’ll swamp us in no time. I think we’ll be slurping 40TB in 6 years and so we’ll need 400MB/sec.

    We’re considering D2D2T: a block-level asynchronous copy (rsync/robocopy or some other proprietary copy software) from all file servers to a large, fast drive array, then from that to tape (2 x LTO 4 drives). The tape drives get 120MB/sec each, which with compression, minus overhead, comes to about 400MB/sec. So far, so good.
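
A rough way to sanity-check figures like these against a backup window is sketched below. It assumes decimal units (1 TB = 1,000,000 MB), a single pass over all of the data, and a sustained rate with no pauses; the function names are made up for illustration.

```python
def hours_to_copy(data_tb, rate_mb_per_sec):
    """Hours needed to move data_tb terabytes at a sustained rate in MB/sec."""
    return data_tb * 1_000_000 / rate_mb_per_sec / 3600


def required_rate_mb_per_sec(data_tb, window_hours):
    """Sustained MB/sec needed to move data_tb terabytes within window_hours."""
    return data_tb * 1_000_000 / (window_hours * 3600)


# Figures taken from the comment above; the 12-hour window is the post's target.
print(f"40 TB at 400 MB/sec: {hours_to_copy(40, 400):.1f} hours")
print(f"40 TB in 12 hours needs {required_rate_mb_per_sec(40, 12):.0f} MB/sec")
print(f"4 TB at 400 MB/sec: {hours_to_copy(4, 400):.1f} hours")
```

The same two functions can be pointed at a nightly slice of the environment instead of the whole thing, which is where spreading fulls across the month changes the answer.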

    Trouble is our budget for an upgraded library, the arrays, a new server and possibly synchronization software (if not hand-rolled) is only about $45k (USD). I’m not finding good, fast, reliable RAID arrays that will meet that budget. Promise has an iSCSI array that — in theory — can offer 200MB/sec so I thought I’d get two of those hooked to a quad Gigabit card. But there aren’t many reviews of this chassis and one blogger recommends against it due to bugs and problems.

    I’d considered Coraid’s AoE but my supervisor doesn’t feel comfortable with them (too new). Besides, I’m not seeing the required performance with them, either:
    http://www.coraid.com/support/linux/perf/throughput.pdf

    W. Curtis Preston, I have a question for you: where can I find out more about how to configure the 1/28th and 1/7th setup you talk about? More details, please. That *may* be the answer.

  4. cdevidal says:

    I don’t know if D2D2T alone will cut it. I think we will still be shoe-shining on small files, even if they are local to the tape drive.

    One alternative is to multiplex streams of data, and that’s tricky too. Plus you still don’t get around the problem of backing up our largest server, currently just over 2TB. Even with multiplexed streams we’d be running up to the same 3 days. So multiplexing doesn’t help us.

    Another alternative, which NetVault (our backup solution) prefers, is to set up a virtual library, which is another form of D2D2T. Same problem as with multiplexing, though; that 2TB server takes a LONG time to back up, no matter if the target is tape or disk.

    I’m thinking we set up both a large file server and a VL. That is, we set up something like rsync to copy ONLY recent changes over the network to our backup server, which is only a Windows file server. Then use NetVault to copy *that* data to a VL. Then copy the data on the VL to tape.

    Pros:
    * No shoe-shining
    * Create a weekly full backup in a few hours
    * Instant file restores, even by users
    * “Poor man’s cluster,” with DFS we can point people at the slightly slower backup server if we have an outage
    * Saves NetVault license costs, thousands…
    * Rsync scripts can be written to stop databases and other services, do an rsync, start it again
    * Rapid restores of downed servers
    * Hourly, even up-to-the-minute backups are possible

    Cons:
    * Expensive, needs LOTS of FAST disks; would need to buy two times what we have on all of our servers, plus twice that for growth. Today we back up approximately 4TB of data, so that means we’d be buying 16TB of FAST storage, fast enough to keep up with the tape library. At 120MB/sec and with two drives plus compression minus overhead I’m estimating 400MB/sec. That swamps dual bonded gigabit; I’m thinking 10GbE or 4G FC. And how are we going to do all of this under $45k??

    I’m also pondering using NetVault’s consolidated full backup and just getting perpetual incrementals to a large virtual library. It would mean losing some of the benefits of the large file cache/mirror but on the up side we could meet our requirements for weekly fulls and backup windows and it would cost much less.

    Why not just a 1/28th full and 1/7th incremental solution like you suggested? I think our corporation will *require* weekly fulls…

    Whew, I’m weighed down with data!

  5. hga says:

    cdevidal wrote: “Can anyone think of any downside to this?”
    Only the one you noted: complexity makes it a bit more dangerous than some other more $$$ solutions. E.g. if your multiple configurations of rsync miss anything important….

    And if it were me, I’d prefer tape stored off-site with every night’s incrementals, but only you can answer what level of recovery robustness/business continuity is right for you. E.g. absent the potential for an off-site facility to run your stuff (which sounds like it would be expensive, and doesn’t match a lot of my previous situations where in an emergency a backup site could be cobbled together), off site incrementals would do you little good….

    But if you get the complexity right, it sounds sweet.

    – Harold

  6. cdevidal says:

    hga wrote: “Only the one you noted: complexity makes it a bit more dangerous than some other more $$$ solutions. E.g. if your multiple configurations of rsync miss anything important….”

    Yep.

    I intend to run the rsync on each client, not on the server, so rather than some huge “master” script that pulls in everything it’ll just be a simple script, probably just a couple of lines with logging and error detection. An advantage to this is I can customize the stopping of the services on a server-by-server basis, I can stop the script from running if I need to do a restore, etc. Seems more simple, more scalable.

    As W. Curtis Preston recommended in “Unix Backup and Recovery,” it’ll be an all inclusive script, where I have to deliberately exclude folders. Missing something should be rare.
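
A per-server script along those lines might look roughly like the sketch below. It is not the commenter's actual script: the function names and the `pre`/`post` service commands are hypothetical, and it assumes an `rsync` binary is available on the client. It mirrors everything by default and only skips what is listed explicitly, logs both success and failure, and always attempts the restart commands even if the copy fails.

```python
import logging
import subprocess

log = logging.getLogger("mirror")


def build_rsync_command(source, dest, excludes=()):
    """Build an include-everything rsync command; exclusions must be explicit."""
    cmd = ["rsync", "-a", "--delete"]
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    return cmd + [source, dest]


def mirror(source, dest, excludes=(), pre=(), post=(), runner=subprocess.run):
    """Run the stop commands, the rsync, then the start commands.

    pre and post are lists of argument lists (for example, commands that stop
    and start a database); they are placeholders supplied by the caller.
    Returns True only if every step succeeded.
    """
    ok = True
    try:
        for cmd in pre:
            runner(cmd, check=True)
        runner(build_rsync_command(source, dest, excludes), check=True)
        log.info("mirror of %s succeeded", source)
    except (subprocess.CalledProcessError, OSError) as exc:
        log.error("mirror of %s failed: %s", source, exc)
        ok = False
    finally:
        # Restart services whether or not the copy worked.
        for cmd in post:
            try:
                runner(cmd, check=True)
            except (subprocess.CalledProcessError, OSError) as exc:
                log.error("restart command %s failed: %s", cmd, exc)
                ok = False
    return ok
```

The `runner` parameter exists only so the logic can be exercised without touching real services; in normal use it stays at its default of `subprocess.run`, and the return value can drive the email to the admins.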

    I hate NetVault’s logging, it’s not flexible and it’s awkward and too verbose and not easy to trim out unnecessary errors such as locked files. With this setup there will only be one NetVault job per night and with no locked files I don’t anticipate seeing many unnecessary errors (whee!). When every server is scripted I’ll be able to trim errors with more flexibility, even inserting them into a database (should I choose to do that). It requires some programmatic forethought, such as ensuring I record when things go right AND wrong, emails to the admins, etc. But I think it’ll work great. Shoot, there’s probably a script already written out there.

    hga wrote: “And if it were me, I’d prefer tape stored off-site with every night’s incrementals, but only you can answer what level of recovery robustness/business continuity is right for you. E.g. absent the potential for an off-site facility to run your stuff (which sounds like it would be expensive, and doesn’t match a lot of my previous situations where in an emergency a backup site could be cobbled together), off site incrementals would do you little good….”

    I suppose with my proposed setup I could even copy from the VL to tape every morning. NetVault gives us that option. But in our company we have accepted the risk of weekly offsite fulls.

    Thanks!!

  7. cdevidal says:

    I think I’ve got it. By God, I think I’ve got it! It took me 2 weeks to think it all through, but I think I’ve got it!

    I think we’re going to use three technologies:
    * NetVault’s virtual library
    * NetVault’s consolidated full
    * The block-level asynchronous file mirror I was talking about above

    First, we set up the block-level asynchronous file mirror, or a "‘super’ filer" as W. Curtis Preston calls it (in his book "Using SANs and NAS"). This is an online mirror of every file on every server, copied with something like rsync. It’ll be on a medium-speed array attached to our backup server. Probably 8TB in size, connected with dual Gigabit ethernet bonded to the server. Speed isn’t all that important as it won’t be streaming directly into the tape library.

    Then we’ll run nightly incrementals from this array into a virtual library on very fast disks, probably inside the backup server itself. Since we want an HP DL380 G5 it has 8 x 2.5" drive bays, so this should work out great.

    We would do this instead of direct to tape because backing up many little files will cause "shoe-shining." I think "shoe-shining" has caused us to lose 3 drives in the last year.

    We keep our incrementals on this virtual library until I can use NetVault’s consolidation feature to consolidate the last full (on tape) plus incrementals into the next full (again, on tape). The last full on tape will always stay in the tape library.

    Weekly, I will consolidate the incrementals with the most recent full to create a new full set of tapes. Then I pull out the previous full and put the new full in its place.
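
Conceptually, a consolidated full is the last full with each later incremental applied on top of it, oldest first. The sketch below shows only that idea on a toy catalog of file versions; it is not NetVault's implementation, and the data layout (a dict of path to version, plus a set of deleted paths per incremental) is an assumption made for illustration.

```python
def consolidate(full, incrementals):
    """Build a new full from the last full plus incrementals, oldest first.

    full:          dict mapping file path -> version stored in the last full
    incrementals:  list of (changed, deleted) pairs, where changed is a dict
                   of path -> newer version and deleted is a set of paths
    The original full is left untouched, just as the old tape set would be.
    """
    new_full = dict(full)
    for changed, deleted in incrementals:
        new_full.update(changed)
        for path in deleted:
            new_full.pop(path, None)
    return new_full


last_full = {"a.txt": 1, "b.txt": 1}
week = [
    ({"a.txt": 2, "c.txt": 1}, set()),   # Monday: a.txt changed, c.txt created
    ({}, {"b.txt"}),                     # Tuesday: b.txt deleted
]
print(consolidate(last_full, week))
```

Note that the new full is only as good as the full and incrementals it was built from, which is why verifying the starting full matters.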

    Pros:
    * Nightly incrementals are all that we will ever do, in the space of about an hour
    * Block-level rsync is VERY efficient for LAN and CPU on primary server
    * No shoe-shining
    * Instant file restores, even by users
    * "Poor man’s cluster," with DFS we can point people at the slightly slower backup server if we have an outage
    * Saves NetVault license costs, thousands… EXCEPT the virtual library license
    * Rsync scripts can be written to stop databases and other services, do an rsync, start it again (we’re currently not backing up databases with NetVault)
    * Rapid restores of downed servers
    * Hourly, even up-to-the-minute backups are possible
    * Should be under $45k because I don’t need HIGH speed connections from the RAID array to the tape library (I was looking at 10GbE), just need high speed from the virtual library, which is only a fraction of the total storage needed
    * Very scalable, cannot foresee future limitations (can you?)
    * Every file encrypted when using LTO 4 drives
    * Not too much rack space or power
    * Users can restore their own files; we’ll make shares on the ‘super’ filer to match the originals except they’ll be read-only

    Cons:
    * More complex; consolidated backups are more complicated to run. I could instead create a very large virtual library that can hold both the most recent full and incrementals, but that gets expensive — needs both performance and expandability.

    Can anyone think of any downside to this?

  8. ddierickx says:

    cdevidal wrote: “I don’t know if D2D2T alone will cut it. I think we will still be shoe-shining on small files, even if they are local to the tape drive.”
    We’re using NBU; AFAIK it transfers the ‘tar’ files from disk to tape whole, so there are no small files in this case.

  9. cpreston says:

    When any backup product (that I’m aware of) copies backups from one device to another, it basically does a restore piped into a backup. NetBackup is no exception.

    Therefore, the number of files in the backup can very much affect the performance of the copy.

    Some D2D2T systems, though, can do the copy from disk to tape outside of the backup software. (Integrated VTLs can do this.) When they do it, they just copy bits and bytes and the number of files doesn’t affect performance. This is why a friend of mine switched from NBU Vault-based copying to VTL-based copying (in his case, using an EMC CDL) when he was backing up data with many, many files. His copy performance increased several hundred percent.

  10. wts says:

    TSM reclamation is not like backups and duplications where either you get it done or don’t. It’s tunable. You can set reclamation to a very high level (“I want my tapes very full”) and do reclamation 24×7, or cut it back to allow significant free space on full tapes.

    Can’t get your reclamations done? Simply allow more free space on a tape before it is reclaimed. The down sides are, of course, more tape media required and longer restores.

    Nota Bene: I haven’t used TSM in a few years and so am ignorant on “new” features.

    cheers, wayne

  11. aaaaarrrgghhh says:

    Why not tar or zip all those small files on the client and then back up the tarball? That way your backup throughput goes up! Of course, restore would be a two-step process.
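
The bundling step suggested here could be sketched with Python's standard `tarfile` module as below. The function name and the paths in the demo are made up; the archive is assumed to be written outside the directory being bundled so it does not try to include itself.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_small_files(source_dir, archive_path):
    """Pack every regular file under source_dir into one gzip-compressed tarball.

    Returns the number of files added. Paths inside the archive are stored
    relative to source_dir.
    """
    source = Path(source_dir)
    count = 0
    with tarfile.open(str(archive_path), "w:gz") as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                archive.add(str(path), arcname=path.relative_to(source).as_posix())
                count += 1
    return count


if __name__ == "__main__":
    # Demo on throwaway data: 1000 tiny files become one archive.
    with tempfile.TemporaryDirectory() as work:
        source = Path(work, "data")
        source.mkdir()
        for i in range(1000):
            (source / f"file{i:04d}.txt").write_text("x")
        target = Path(work, "data.tar.gz")
        print(bundle_small_files(source, target), "files bundled into", target.name)
```

The backup product then sees one large file instead of many small ones; the cost, as noted, is that a restore means pulling back the tarball first and extracting from it second.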

  12. cpreston says:

    I suppose that you could read what I wrote: “They’d better finish completely before your next backup cycle, or you’ve got problems too,” and believe that I meant that something bad would happen if this happened a single night. That’s not what I meant though.

    What I meant was that your reclamation thresholds were set for a reason, and your reclamation needs to finish for those reclamation thresholds to have their desired effect: minimizing the number of tape mounts for a restore and maximizing media utilization. While reclamation not finishing one night is no big deal, reclamations not finishing on a regular basis is a big deal:

    1. Your restores will end up asking for a lot more tapes.

    2. Media utilization will hit record low levels.

    3. You’ll need a bigger tape library or a whole lot of manual tape swapping to continue normal operations.

  13. cdevidal says:

    ddierickx wrote: “We’re using NBU; AFAIK it transfers the ‘tar’ files from disk to tape whole, so there are no small files in this case.”

    According to Wikipedia, NBU uses multiplexing to stream data to tape, which comes with its own set of problems.

    cpreston wrote: “When any backup product (that I’m aware of) copies backups from one device to another, it basically does a restore piped into a backup. NetBackup is no exception.”

    I don’t believe NetVault does this; I’ve been chatting with one of their gurus about VL-to-tape and he said that since it copies in tape format from the VL to tape it streams at maximum speed. This lets you stream as with something like multiplexing. You just need to buy a VL and a VL license. You don’t even need a very big VL, just big enough for your biggest incremental. We’ll be running them more frequently just-in-case.

    But your VL drive array should be fast; LTO-4 drives have a minimum speed of 40MB/sec. We’ll use RAID 0 since the data will just be on the drive for mere moments before being dumped to tape, and if there’s a failure I can run the job again.

    What’s everyone’s opinion on NetVault’s consolidated fulls? They can eliminate fulls entirely.

    My initial concern was if there’d be a corruption in the initial full that would then be perpetually copied, but I could do a complete verify the first time to rule that out.

    Going to try to talk to one of their customers who is using it in production.

    So that there isn’t any confusion, the original set of tapes is not perpetually left in the tape library. You are supposed to consolidate off of it with your incrementals to a new, fresh set of tapes which then gets left in the library for next time. The old ones are pulled and sent offsite. So the tapes themselves are always fresh.

    So what do you think?

  14. cpreston says:

    I know that NetBackup does not transfer the data en masse, as the performance of copies varies based on what’s being copied. Backups with lots of files copy slower than backups with few files.

    This isn’t to say it’s not copying in tar format (or mpx’d tar format). I’m just saying that it unpacks it and repacks it along the way.

    I find it hard to believe that NetVault doesn’t do that.

    cdevidal wrote: “I don’t believe NetVault does this; I’ve been chatting with one of their gurus about VL-to-tape and he said that since it copies in tape format from the VL to tape it streams at maximum speed.”

    Like I said, I find it hard to believe that it’s not verifying the copy that it makes, and the only way to do that is to unpack and repack.

    cdevidal wrote: “What’s everyone’s opinion on NetVault’s consolidated fulls? They can eliminate fulls entirely.”

    I’m a big fan of the concept. It’s also available in NetBackup, NetWorker, & CommVault.
