Do you need to back up Office365?

The answer is absolutely yes, and anyone who thinks you don’t need to do so should not be put in charge of your data. Also, anyone who thinks I’m saying this just because I work for a company that backs up Office365 should read this blog post from seven years ago when I basically said exactly the same thing:  Cloud services need to be backed up.

I was reading a Spiceworks thread on this topic and was shocked at some of the anti-backup recommendations I saw there. One person pointed to a TechEd article that talks about how redundant the storage is for Office365. That has absolutely nothing to do with this topic. That’s the equivalent of saying “I have RAID, so I don’t need backups.”

I saw another post where someone explained that the recycle bin is sufficient for “oops” recovery needs, and that vendors just try to scare people with things like rogue admins to get them to buy their products. He/she went on to say nothing like that had ever happened to them, so… It’s not just rogue admins, people. There are all sorts of things that can corrupt your entire datastore that can only be addressed via a good third-party backup solution.

Backups aren’t included

Take a look at the feature page for Office365.  You will find that backups aren’t included. The references to data protection features are more about loss prevention and things like that.  They have nothing to do with recovering corrupted data.

MCSE Brien Posey points out that “the Office 365 service-level agreement addresses availability, not recoverability.” So if you or someone else messes up your Office365 data, Microsoft is under no obligation to help you.

MCSE Experts think so

Microsoft MVP Brien Posey says that “you might not have as many options for restoring your data as you might think. As such, it is critically important to understand your options for disaster recovery in an Office 365 environment.”

“Microsoft says they also perform traditional backups of Office 365 servers. However, those backups are used for internal purposes only if they experienced a catastrophic event that wiped out large volumes of customer data…”

He also points out that there is no “provision for reverting a mailbox server to an earlier point in time (such as might be necessary if a virus corrupted all the mailboxes on a server).”

You can delete your primary & secondary recycle bin

A lot of people talk about using the recycle bin to recover accidentally deleted or corrupted folders. It is true that it can keep such items for up to 90 days, depending on your settings. However, it is also true that a well-meaning or malicious person can easily clean out both the primary and secondary recycle bins. And a malicious person would indeed do just that.

Litigation hold doesn’t protect public folders

Some say that litigation hold protects you from such things. It keeps a copy of most messages forever; however, it does not protect public folders. Someone could easily delete everything in a public folder and then empty the recycle bin, and you would have no recourse if you did not have a third-party tool.

Litigation hold has no separation of powers

An important concept in many environments is the separation of powers between a person like the Exchange admin, and a backup person.  That protects the organization from rogue admins doing very bad things and then covering them up by deleting the backups as well.

But litigation hold has no such protection. Office 365 administrators could (rightly or wrongly) assign themselves eDiscovery Manager rights and have full access to search and export from Exchange mailboxes, SharePoint folders, and OneDrive locations. They could even modify the Litigation Hold policies.  One way to describe this is that it helps a good person to do the right thing, but it does not stop a bad or incompetent person from doing the wrong thing.

The OneDrive restore feature is all or nothing

The OneDrive restore feature is a bit puzzling. It can only restore things that are in the recycle bin, and it is all or nothing.  Meaning you have to restore the entire OneDrive system to a single point in time; you cannot just restore parts of it.  That has to be the most worthless restore I’ve ever heard of.

You need to back up Office365

You need to back up Exchange, OneDrive, and SharePoint. Microsoft isn’t doing it for you, and the features that protect you against accidents do not go far enough. Look into a third-party solution, such as what my employer (Druva) provides.

Should circumstances change the RTO & RPO?

Disaster recovery experts do not agree whether you should have one-and-only-one recovery time objective (RTO) and recovery point objective (RPO) for each application, or two of them.  What am I talking about?  Let me explain.

In case you’re not familiar with RTO & RPO, I’ll define them. RTO is the amount of time it should take to restore your data and return the application to a ready state (e.g. “This server must be up within four hours”).  RPO is the amount of data you can afford to lose (e.g. “You must restore this app to within one hour of when the outage occurred”).
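To make the two definitions concrete, here is a minimal sketch that checks a single recovery against both objectives. The function name, the timestamps, and the 4-hour RTO and 1-hour RPO are illustrative assumptions, not values from any particular tool.

```python
from datetime import datetime, timedelta

def meets_objectives(outage_at, last_good_copy_at, restored_at, rto, rpo):
    """Return (rto_met, rpo_met) for one recovery (hypothetical helper)."""
    downtime = restored_at - outage_at         # how long the app was unavailable
    data_loss = outage_at - last_good_copy_at  # how much recent data was lost
    return downtime <= rto, data_loss <= rpo

# Illustrative numbers only: the last good copy was taken 45 minutes
# before the outage, and the app came back 5 hours after it.
outage = datetime(2024, 1, 1, 12, 0)
result = meets_objectives(
    outage_at=outage,
    last_good_copy_at=outage - timedelta(minutes=45),
    restored_at=outage + timedelta(hours=5),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result)  # (False, True): the RTO was missed, the RPO was met
```

The point of separating the two checks is that a recovery can pass one and fail the other, which is why the two objectives are set independently.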

Please note that no one is suggesting you have one RTO/RPO for your entire site. What we’re talking about is whether or not each application should have one RTO/RPO or two. We’re also not talking about whether or not to have different values for RTO and RPO (e.g. 12-hour RPO and 4-hour RTO). Most people do that.

In defense of two RTOs/RPOs (for each app)

If you lose a building (e.g. via a bomb blast or major fire) or a campus (e.g. via an earthquake or tsunami) it’s going to take a lot longer to get up and running than if you just have a triple-disk failure in a RAID6 array. In addition, you might have an onsite solution that gets you a nice RPO or RTO as long as the building is still intact. But when the building ceases to exist, most people are left with the latest backup tape they sent to Iron Mountain. This is why most people feel it’s acceptable to have two RTOs/RPOs: one for onsite “disasters” and another for true, site-wide disasters.

In defense of one RTO/RPO (for each app)

It is an absolute fact that RTOs and RPOs should be based on the needs of the business unit that is using any given application.  Those who feel that there can only be one RTO/RPO say that the business can either be down for a day or it can’t (24-hour RTO).  It can either lose a day of data or it can’t (24-hour RPO). If they can only afford to be down for one hour (1-hour RTO), it shouldn’t matter what the cause of the outage is — they can’t afford one longer than an hour.

I’m with the first team

While I agree with the second team that the business can either afford (or not) a certain amount of downtime and/or data loss, I also understand that backup and disaster recovery solutions come with a cost.  The shorter the RTO & RPO, the greater the cost.  In addition, solutions that are built to survive the loss of a datacenter or campus are more expensive than those that are built to survive a simple disk or server outage.  They cost more in terms of the software and hardware to make it possible — and especially in terms of the bandwidth required to satisfy an aggressive RTO or RPO.  You can’t do an RPO of less than 24-36 hours with trucks; you have to do it with replication.

This is how it plays out in my head. Let’s say a given business unit says that one hour of downtime costs $1M. This is after considering all of the factors, including loss of revenue and damage to the brand, etc. So they decide that they can’t afford more than one hour of downtime. No problem. Now we go and design a solution to meet a 1-hour RTO. Now suppose that the solution to satisfy that one-hour RTO costs $10M. After hearing this, the IT department looks at alternatives, and it finds out that we can do a 12-hour RTO for $100K and a 6-hour RTO for $2M.

So for $10M, we are assured that we will lose only $1M in an outage. For $2M we can have a 6-hour RTO, and for $100K we can have a 12-hour RTO. That means that a severe outage would cost me $10M-$11M ($10M + 1 hour of downtime at $1M), or $2M-$8M ($2M + $6M in downtime), or $100K-$12.1M ($100K + $12M for 12 hours of downtime).

A gambler would say that you’re looking at definitely losing (spending) $10M, $2M, or $100K and possibly losing $1M, $6M or $12M. I would probably take option two or three, probably three. I’d then take the $9.9M I saved and make it work for me, and hopefully I’ll make more for the company with that $9.9M than the amount we will lose ($12M) if we have a major outage.
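The arithmetic is easy to check with a short script. This is only a sketch using the figures quoted in the scenario ($1M per hour of downtime; $10M, $2M, and $100K for the 1-, 6-, and 12-hour RTO solutions). It assumes that an outage lasts exactly as long as the RTO and that downtime cost grows linearly with time; the variable and function names are made up for illustration.

```python
DOWNTIME_COST_PER_HOUR = 1_000_000  # from the scenario: $1M per hour of downtime

# (RTO in hours, up-front solution cost in dollars), as quoted in the scenario
options = {
    "1-hour RTO": (1, 10_000_000),
    "6-hour RTO": (6, 2_000_000),
    "12-hour RTO": (12, 100_000),
}

def worst_case_total(rto_hours, solution_cost):
    """Guaranteed spend plus downtime loss if one outage runs the full RTO."""
    return solution_cost + rto_hours * DOWNTIME_COST_PER_HOUR

totals = {name: worst_case_total(*opt) for name, opt in options.items()}

for name, (rto_hours, cost) in options.items():
    print(f"{name}: spend ${cost:,}, worst case ${totals[name]:,}")
```

Under those assumptions the worst-case totals come to $11M, $8M, and $12.1M, while the guaranteed spend differs by $9.9M between the first and third options.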

Now what if I told you that I could also give you an onsite 1-hour RTO for another $10K.  Wouldn’t you want to spend another $10K to prevent a loss greater than $1M, knowing full well that this solution will only work if the datacenter remains intact?  Of course you would.

So we’ll have a 12-hour RTO for a true disaster that takes out my datacenter, but we’ll have a 1-hour RTO as long as the outage is local and doesn’t take out the entire datacenter.

Guess what.  You just agreed to have two RTOs.  (All the same logic applies to RPOs, by the way.)

If everything cost the same, then I’d agree that each application should have one (and only one) RTO and RPO. However, things do not cost the same. That’s why I’m a firm believer in having two completely different sets of RTOs and RPOs. You have one that you will live up to in most situations (e.g. dead disk array) and another that you hope you never have to live up to (loss of an entire building or campus).

What do you think?  Weigh in on this in the comments section.


Salesforce recycle bin contains only deleted records

One of the most valuable resources your company has is probably not being backed up properly, if at all. Like a lot of cloud services, the ability of Salesforce customers to recover from big mistakes or a malicious attack is a bit overstated. Let’s take a look at that.

Big, bad update

Say, for example, that someone wants to change how phone numbers are stored in Salesforce.  (I know this because I wanted to do this once with a large number of records.) Let’s say they are tired of the inconsistent way phone numbers are stored and want to go to a standard format. They have chosen to get rid of all parentheses and spaces, and just use dashes.  (800) 555-1212 becomes 800-555-1212.

They download a CSV of all the Salesforce IDs and accompanying phone numbers. They do their magic on the phone numbers and change everything to dashes. But they accidentally sort one column, completely disassociating numbers from Salesforce IDs. They then update every single one of your leads with incorrect phone numbers. Little by little, salespeople notice that some phone numbers are wrong and fix them. But it’s days before they realize that it was this update that broke everything.
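For contrast, here is a minimal sketch of how the same reformatting could be done without ever separating a phone number from its ID: each row is read, changed, and written back as a unit, so no column can be sorted on its own. The column names `Id` and `Phone` and the sample IDs are hypothetical, and the script assumes 10-digit US numbers; anything else is left untouched for manual review.

```python
import csv
import io
import re

def normalize_phone(raw):
    """Convert a 10-digit US number to 800-555-1212 style; else return as-is."""
    digits = re.sub(r"\D", "", raw)  # drop parentheses, spaces, dashes
    if len(digits) != 10:
        return raw  # unexpected format: leave it alone for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def normalize_export(csv_text):
    """Rewrite the Phone field row by row so each Id keeps its own number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"Id": row["Id"], "Phone": normalize_phone(row["Phone"])}
        for row in reader
    ]

# Hypothetical export with made-up record IDs
sample = "Id,Phone\n00Q000000000001,(800) 555-1212\n00Q000000000002,800 555 1313\n"
updated = normalize_export(sample)
for row in updated:
    print(row["Id"], row["Phone"])
```

Even with a careful script like this, comparing a handful of IDs against the original export before uploading is cheap insurance, and it is no substitute for having a backup to roll back to.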

This would also be a great way for a salesperson to get even with your company for not giving him the bonus he wanted. Download a bunch of records, do a quick sort on only one column, then use Data Loader to upload nonsense back to Salesforce.

Recycle bin cannot fix updated records

The recycle bin contains deleted records, not updated records. So fixing even a few mistakenly (or maliciously) updated records is not possible with the recycle bin. It can only fix things if you accidentally delete records, as long as it’s not more records than what can fit in your recycle bin. (That limit is the number of megabytes of storage you have, multiplied by 25.)

You really need to back up Salesforce

Without an external Salesforce backup, you are literally one bad update away from being forced to use their “recovery service,” which may be the worst service ever. It’s so bad they don’t want you to use it. They call it a “last resort,” and tell you it’s going to take 6-8 weeks and cost $10,000. And after six weeks, all you have is a bunch of CSV files that represent your Salesforce instance at a particular point in time. It will be your job to determine what needs to be uploaded, updated, replaced, etc. That process will be complicated and likely take a long time as well.

Please look into an automated way to back up your Salesforce data.