One RTO/RPO or two? (Onsite & Offsite)

Disaster recovery experts do not agree on whether you should have one and only one recovery time objective (RTO) and recovery point objective (RPO) for each application, or two of them.  What am I talking about?  Let me explain.

What are RTO and RPO, you ask? RTO is the amount of time it should take to restore your data and return the application to a ready state (e.g. "This server must be up within four hours").  RPO is the amount of data you can afford to lose (e.g. "You must restore this app to within one hour of when the outage occurred").

Please note that no one is suggesting you have one RTO/RPO for your entire site. What we're talking about is whether each application should have one RTO/RPO or two.  We're also not talking about whether to have different values for RTO and RPO (e.g. a 12-hour RPO and a 4-hour RTO).  Most people do that.

In defense of two RTOs/RPOs (for each app)

If you lose a building (e.g. via a bomb blast or major fire) or a campus (e.g. via an earthquake or tsunami), it's going to take a lot longer to get up and running than if you just have a triple-disk failure in a RAID 6 array.  In addition, you might have an onsite solution that gets you a nice RPO or RTO as long as the building is still intact.  But when the building ceases to exist, most people are left with the latest backup tape they sent to Iron Mountain.  This is why most people feel it's acceptable to have two RTOs/RPOs: one for onsite "disasters" and another for true, site-wide disasters.

In defense of one RTO/RPO (for each app)

It is an absolute fact that RTOs and RPOs should be based on the needs of the business unit that is using any given application.  Those who feel that there can only be one RTO/RPO say that the business can either be down for a day or it can't (24-hour RTO).  It can either lose a day of data or it can't (24-hour RPO). If they can only afford to be down for one hour (1-hour RTO), it shouldn't matter what the cause of the outage is — they can't afford one longer than an hour.

I'm with the first team

While I agree with the second team that the business can either afford (or not) a certain amount of downtime and/or data loss, I also understand that backup and disaster recovery solutions come with a cost.  The shorter the RTO & RPO, the greater the cost.  In addition, solutions that are built to survive the loss of a datacenter or campus are more expensive than those that are built to survive a simple disk or server outage.  They cost more in terms of the software and hardware to make it possible — and especially in terms of the bandwidth required to satisfy an aggressive RTO or RPO.  You can't do an RPO of less than 24-36 hours with trucks; you have to do it with replication.

This is how it plays out in my head.  Let's say a given business unit says that one hour of downtime costs $1M, after considering all of the factors, including loss of revenue, damage to the brand, and so on.  So they decide that they can't afford more than one hour of downtime.  No problem.  We go and design a solution to meet a 1-hour RTO.  Now suppose that the solution to satisfy that one-hour RTO costs $10M.  After hearing this, the IT department looks at alternatives, and it finds that we can do a 12-hour RTO for $100K and a 6-hour RTO for $2M.

So for $10M, we are assured that we will lose only $1M in an outage.  For $2M we can have a 6-hour RTO, and for $100K we can have a 12-hour RTO.  That means that a severe outage would cost me $10M-$11M ($10M + 1 hour of downtime at $1M), $2M-$8M ($2M + $6M in downtime), or $100K-$12.1M ($100K + 12 hours of downtime).  A gambler would say that you're looking at definitely losing (spending) $10M, $2M, or $100K, and possibly losing another $1M, $6M, or $12M.  I would probably take option two or three — probably three.  I'd then take the $9.9M I saved and put it to work for me, and hopefully I'll make more for the company with that $9.9M than the $12M we would lose if we have a major outage.
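The gambler's comparison above can be sketched as a quick expected-cost calculation. The dollar figures come from the hypothetical example in the post; the outage probability is an assumed parameter you would have to estimate for your own environment.

```python
# Expected-cost sketch of the RTO trade-off described above.
# All figures are the post's hypothetical numbers; the outage
# probability (10%) is an assumption for illustration only.

HOURLY_DOWNTIME_COST = 1_000_000  # $1M per hour of downtime

# (solution cost, RTO in hours) for each candidate design
options = {
    "1-hour RTO": (10_000_000, 1),
    "6-hour RTO": (2_000_000, 6),
    "12-hour RTO": (100_000, 12),
}

def expected_total_cost(solution_cost, rto_hours, p_outage):
    """Certain spend plus the probability-weighted cost of an outage."""
    return solution_cost + p_outage * rto_hours * HOURLY_DOWNTIME_COST

for name, (cost, rto) in options.items():
    worst = cost + rto * HOURLY_DOWNTIME_COST
    expected = expected_total_cost(cost, rto, 0.10)
    print(f"{name}: worst case ${worst:,}, expected ${expected:,.0f}")
```

With these assumed numbers the cheap option wins on expected cost, which is why the post leans toward option three; a higher outage probability or downtime cost would shift the answer.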

Now what if I told you that I could also give you an onsite 1-hour RTO for another $10K?  Wouldn't you want to spend another $10K to prevent a loss greater than $1M, knowing full well that this solution will only work if the datacenter remains intact?  Of course you would.

So we'll have a 12-hour RTO for a true disaster that takes out the datacenter, but a 1-hour RTO as long as the outage is local and doesn't take out the entire datacenter.

Guess what.  You just agreed to have two RTOs.  (All the same logic applies to RPOs, by the way.)

If everything cost the same, then I'd agree that each application should have one — and only one — RTO and RPO.  However, things do not cost the same.  That's why I'm a firm believer in having two completely different sets of RTOs and RPOs: one that you will live up to in most situations (e.g. a dead disk array) and another that you hope you never have to live up to (loss of an entire building or campus).

What do you think?  Weigh in on this in the comments section.

----- Signature and Disclaimer -----

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.

3 thoughts on “One RTO/RPO or two? (Onsite & Offsite)”

  1. nickpc says:

    Your DR Strategy is really an insurance policy, and insurance is priced based on the odds of having to pay out, and what the payout would be. Everything else being equal, a bad driver in an expensive car pays a lot more than a good driver in a cheap car. You pick what type of events you need to protect yourself from, the likelihood of the events, what they would cost you, and as you demonstrate, you compare that to the cost of avoiding/mitigating/recovering from the events.

    One point around this that is often missed is that this equation changes as time goes on. The disk array that was your Tier 1 storage four years ago may be Tier 2 (or lower) today, so its value to your business has changed. Keeping the Tier 1 RTOs/RPOs in place no longer makes sense, and could be costing the business much more than is required.

    That’s part of the fun of backup/recovery – there is an effect felt here from every decision the business makes, whether the decision makers see it or not.

  2. jm7640 says:

    I do not mean this to sound like an advertisement for my company, but I am one of the architects who designed part of these standards, so I have a good understanding of AT&T's present approach to RPO/RTO.
    In short, we use multiple RPx/RTx values. We create an RPE/RTE (Estimate instead of Objective) for each type of failure mode, and we estimate the likelihood and frequency of that failure based on historical data from our problem management database.

    Steps we take as an organization for internal business units:
    -Identify and prioritize our business services and map out the business processes to deliver them. (performed by Business Continuity Architect)
    -Identify and prioritize the IT services (aka corporate applications) that support those business processes. (performed by IT Application Architect for that Business Unit)
    -Because this is so complex, we give the corporate application the same rank as the most important business process it supports.
    -Then we group together all underlying constituent elements of the corporate application, such as the servers, file systems, databases, network connection points. (performed by IT Service Continuity Architect)
    That grouping is stored in a multi-site replicated database (CADB). Every corporate application has an ID with its availability requirements, all elements that are required for it to deliver its application services, and how each component depends on the other components (a joint effort between several roles).
    Upon any new designs, enhancements, or CRs for that corporate application, the CADB is reviewed to ensure the change maintains the required availability and updated with those changes.
    One of the many benefits of this is that when a failure does occur on a particular server, storage frame, or device, we immediately know what corporate applications are affected and what resiliency/recovery solutions are in place for them.
    We have some proprietary applications which continually ensure that we are meeting the RPO and RTO by getting a live feed from the data protection infrastructure and calculating the RPE/RTE (Estimate) and comparing to the Objective.
    This application provides a dashboard that rolls up all underlying systems that are part of that corporate application and reports the application’s RPE/RTE if a failure mode should occur.
    Since AT&T has over a hundred years of experience designing and building technical systems, it has mapped out the majority of the failure modes that impact availability of an application or IT system.
    For each failure mode we have a portfolio of resiliency solutions to recover from a particular failure along with a corresponding estimated RPE/RTE.
    Depending on the desired RPO/RTO we have different resiliency solutions with different costs.
    Over time you gain experience in which failures (risks) to mitigate and which ones not to.
    Some of the RPO/RTO are in seconds and some are in hours depending on the failure.
    For instance, if the failure mode is a server crash and the resiliency solution is Oracle RAC, the RPE=0 sec and RTE=15 sec.
    If the failure mode is a storage frame outage and the resiliency solution is an Oracle Data Guard standby, the RPE=1 min (adjustable to 0 sec) and RTE=5 min.
    If the failure mode is a SAN fabric outage and the resiliency solution is a dual fabric, the RPE=0 sec and RTE=0 sec.
    If the failure mode is a site-wide outage and the resiliency solution is active/active geo-bi-asynchronous replication, the RPE=5 min and RTE=5 min.
    If the failure mode is intentional deletion and the resiliency solution is backup to tape, the RPE=24 hr and RTE=8 hr.
    The numbers are not exact, for non-disclosure reasons.
    Instead of allowing the customer (an internal business unit) to pick and choose each individual solution, we group the resiliency solutions together into platinum, gold, silver, and bronze packages. For simplicity we advertise just one RPE/RTE for each package in our solution portfolio, but if a customer asks, we delve into the specifics and the multiple achievable RPEs/RTEs depending on which failure mode is realized.
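The per-failure-mode table this comment describes could be sketched as a simple mapping. The failure modes and values mirror the (deliberately inexact) examples above; the package names and the roll-up rule, advertising the worst RPE/RTE across the failure modes a package covers, are assumptions about how such a roll-up might be done.

```python
# Sketch of a per-failure-mode RPE/RTE table, per the comment above.
# Values mirror the comment's illustrative (non-exact) numbers.

# failure mode -> (resiliency solution, RPE seconds, RTE seconds)
failure_modes = {
    "server crash": ("Oracle RAC", 0, 15),
    "storage frame outage": ("Oracle Data Guard standby", 60, 300),
    "SAN fabric outage": ("dual fabric", 0, 0),
    "site-wide outage": ("active/active geo-async replication", 300, 300),
    "intentional deletion": ("backup to tape", 24 * 3600, 8 * 3600),
}

def advertised_rpe_rte(covered):
    """Roll up one conservative RPE/RTE for a package: the worst case
    across every failure mode the package claims to cover. (The actual
    roll-up rule is an assumption for this sketch.)"""
    rpe = max(failure_modes[f][1] for f in covered)
    rte = max(failure_modes[f][2] for f in covered)
    return rpe, rte

# A hypothetical "gold" package covering everything except deletion:
gold = ["server crash", "storage frame outage",
        "SAN fabric outage", "site-wide outage"]
print(advertised_rpe_rte(gold))  # worst-case RPE/RTE in seconds
```

A lookup keyed on failure mode is also what makes the dashboard behavior described above possible: when a component fails, the affected applications and their estimated recovery numbers fall out of the table directly.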

  3. Guest says:

    Curtis, I think your spend for 6 hours was meant to be $2M.
    “A gambler would say that you’re looking at definitely losing (spending) $10M, $6M, or $100K” should change to “$10M, $2M, or $100K”. Having worked with BIAs, RTOs/RPOs, and SLAs/SLOs, I concur with your thoughts.
