Should circumstances change the RTO & RPO?

Disaster recovery experts do not agree whether you should have one-and-only-one recovery time objective (RTO) and recovery point objective (RPO) for each application, or two of them.  What am I talking about?  Let me explain.

In case you’re not familiar with RTO & RPO, I’ll define them. RTO is the amount of time it should take to restore your data and return the application to a ready state (e.g. “This server must be up within four hours”).  RPO is the amount of data you can afford to lose (e.g. “You must restore this app to within one hour of when the outage occurred”).

Please note that no one is suggesting you have one RTO/RPO for your entire site. What we’re talking about is whether or not each application should have one RTO/RPO or two.  We’re also not talking about whether or not to have different values for RTO and RPO (e.g. 12-hour RPO and 4-hour RTO).  Most people do that.

In defense of two RTOs/RPOs (for each app)

If you lose a building (e.g. via a bomb blast or major fire) or a campus (e.g. via an earthquake or tsunami) it’s going to take a lot longer to get up and running than if you just have a triple-disk failure in a RAID6 array.  In addition, you might have an onsite solution that gets you a nice RPO or RTO as long as the building is still intact.  But when the building ceases to exist, most people are left with only the latest backup tape they sent to Iron Mountain.  This is why most people feel it’s acceptable to have two RTOs/RPOs: one for onsite “disasters” and another for true, site-wide disasters.

In defense of one RTO/RPO (for each app)

It is an absolute fact that RTOs and RPOs should be based on the needs of the business unit that is using any given application.  Those who feel that there can only be one RTO/RPO say that the business can either be down for a day or it can’t (24-hour RTO).  It can either lose a day of data or it can’t (24-hour RPO). If they can only afford to be down for one hour (1-hour RTO), it shouldn’t matter what the cause of the outage is — they can’t afford one longer than an hour.

I’m with the first team

While I agree with the second team that the business can either afford (or not) a certain amount of downtime and/or data loss, I also understand that backup and disaster recovery solutions come with a cost.  The shorter the RTO & RPO, the greater the cost.  In addition, solutions that are built to survive the loss of a datacenter or campus are more expensive than those that are built to survive a simple disk or server outage.  They cost more in terms of the software and hardware to make it possible — and especially in terms of the bandwidth required to satisfy an aggressive RTO or RPO.  You can’t do an RPO of less than 24-36 hours with trucks; you have to do it with replication.

This is how it plays out in my head.  Let’s say a given business unit says that one hour of downtime costs $1M.  This is after considering all of the factors, including loss of revenue and damage to the brand, etc.  So they decide that they can’t afford more than one hour of downtime.  No problem.  Now we go and design a solution to meet a 1-hour RTO.  Now suppose that the solution to satisfy that one-hour RTO costs $10M.  After hearing this, the IT department looks at alternatives, and it finds out that we can do a 12-hour RTO for $100K and a 6-hour RTO for $2M.

So for $10M, we are assured that we will lose only $1M in an outage.  For $2M we can have a 6-hour RTO, and for $100K we can have a 12-hour RTO.  That means that a severe outage would cost me $10M-$11M ($10M + 1 hour of downtime at $1M), or $2M-$8M ($2M + 6 hours of downtime), or $100K-$12.1M ($100K + 12 hours of downtime).
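The arithmetic for the three options can be sketched in a few lines of Python. The dollar figures and RTOs come from the example; the function and variable names are made up for illustration, and the sketch assumes downtime costs a flat $1M per hour for the full length of the RTO.

```python
# Worst-case cost of each option = price of the solution + downtime up to its RTO.
# Figures are from the example; names are illustrative, not from any real tool.
DOWNTIME_COST_PER_HOUR = 1_000_000  # the business unit's estimate

# (RTO in hours, solution cost in dollars)
OPTIONS = [(1, 10_000_000), (6, 2_000_000), (12, 100_000)]


def worst_case_cost(rto_hours, solution_cost, hourly=DOWNTIME_COST_PER_HOUR):
    """Solution cost plus a full RTO's worth of downtime (linear assumption)."""
    return solution_cost + rto_hours * hourly


for rto, cost in OPTIONS:
    total = worst_case_cost(rto, cost)
    print(f"{rto:>2}-hour RTO: spend ${cost:,}, worst case ${total:,}")
```

The low end of each range is what you spend whether or not the outage ever happens; the high end is what you are out if it does.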

A gambler would say that you’re looking at definitely losing (spending) $10M, $2M, or $100K and possibly losing $1M, $6M or $12M.  I would probably take option two or three, and most likely three.  I’d then take the $9.9M I saved and put it to work, and hopefully I’ll make more for the company with that $9.9M than the amount we will lose ($12M) if we have a major outage.

Now what if I told you that I could also give you an onsite 1-hour RTO for another $10K?  Wouldn’t you want to spend another $10K to prevent a loss greater than $1M, knowing full well that this solution will only work if the datacenter remains intact?  Of course you would.

So we’ll have a 12-hour RTO for a true disaster that takes out my datacenter, but we’ll have a 1-hour RTO as long as the outage is local and doesn’t take out the entire datacenter.
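One way to write that outcome down is a per-application table keyed by the scope of the outage. This is only a sketch using the figures from the example; the scope labels, the `Tier` type, and the `rto_for` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    rto_hours: int
    solution_cost: int  # dollars spent up front on this tier


# Two RTOs for one application, keyed by outage scope (labels are illustrative).
APP_RTO = {
    "local": Tier(rto_hours=1, solution_cost=10_000),    # datacenter still intact
    "site": Tier(rto_hours=12, solution_cost=100_000),   # datacenter lost
}


def rto_for(scope: str) -> int:
    """Return the RTO, in hours, that applies to the given outage scope."""
    return APP_RTO[scope].rto_hours


total_spend = sum(tier.solution_cost for tier in APP_RTO.values())
print(f"local: {rto_for('local')}h, site: {rto_for('site')}h, spend: ${total_spend:,}")
```

The total up-front spend in this sketch is the $100K offsite solution plus the $10K onsite one.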

Guess what.  You just agreed to have two RTOs.  (All the same logic applies to RPOs, by the way.)

If everything cost the same, then I’d agree that each application should have one, and only one, RTO and RPO.  However, things do not cost the same.  That’s why I’m a firm believer in having two completely different sets of RTOs and RPOs.  You have one that you will live up to in most situations (e.g. dead disk array) and another that you hope you never have to live up to (loss of an entire building or campus).

What do you think?  Weigh in on this in the comments section.

Written by W. Curtis Preston (@wcpreston), four-time O'Reilly author, and host of The Backup Wrap-up podcast. I am now the Technology Evangelist at Sullivan Strickler, which helps companies manage their legacy data.

3 comments
  • There’s a big factor you didn’t include: the financial one. Business insurance policies don’t typically cover a failed server power supply frying the PDU and all the disks. So the cost of downtime for a sub-disaster inside the data center is a cost to the business.

    BUT that same policy includes business interruption coverage for the covered threats. So if the data center building burns down the $10 million in biz interruption costs is covered by a check from The Hartford Insurance Co.

    I find it all too rare that people include whose money we’re spending in the analysis. In general, costs incurred to prepare for a disaster (that is dollars, euros or shekels spent before the excrement strikes the air motion device) are your costs, but the costs of RECOVERING from the disaster are paid out of insurance proceeds.

    So if I accept, and I do, that a real sitewide disaster can have a longer RTO (and a longer RPO, but just a bit longer, say going from 1 second to 30 minutes), don’t just look at the costs but also at which of those costs are covered by insurance and how to structure your recovery plan to take advantage of that “free money”.

    An example would be to choose a DRaaS service where the cloud VMs cost more than AWS AMIs but the monthly cost of streaming your data to the cloud is lower.

    • I thought I put in a lot of financial stuff! It’s all about the costs of the solution vs the potential business loss of a disaster. I think what you’re saying is that I didn’t include the insurance aspect, and that’s a solid point.

  • Most companies think they do this today. They spend money on local resilience to avoid single points of failure. Some companies definitely do this via dedicated technologies or application design. They balance this spend with the cost of downtime. For DR, they spend on different technologies to provide wide-area resilience.

    I would call this “protecting from what will happen” versus “protecting from what might happen”. This gives us your point: two different sets of SLOs.