March 25, 2022

Restore test fails due to bad documentation

Gary Williams tells a great story from earlier in his career that taught him the value of testing backups and updating documentation. He explains how he thought his backups were fine, until a "new guy" came onto the scene and dared to ask the question, "When was the last time you tested your backups?" As Gary explains, sometimes new people have the best perspective. They let him do the first test, and it failed spectacularly! It all came down to the documentation they were so proud of. Hear Gary's story and learn from his mistake, one that defined his career. (Mr. Backup also tells the story that defined his own career!)

Transcript
Gary Williams:

The thing is, if you want to do a team building exercise,

Gary Williams:

forget all the assault courses and other things they have, you do get

Gary Williams:

the team together and do a restore.

Gary Williams:

Some of the best...

Gary Williams:

seriously.

W. Curtis Preston:

It's a bit like, the trust exercises where you lean

W. Curtis Preston:

backwards and someone catches you. It's like that.

W. Curtis Preston:

Hi, and welcome to Backup Central's Restore it All podcast.

W. Curtis Preston:

I'm your host, W. Curtis Preston, AKA Mr. Backup, and I have with me my table saw safety enthusiast, Prasanna Malaiyandi.

W. Curtis Preston:

How's it going Prasanna?

Prasanna Malaiyandi:

I'm good, Curtis.

Prasanna Malaiyandi:

I don't know if I'd call myself a safety enthusiast, but

W. Curtis Preston:

You don't believe in safety.

Prasanna Malaiyandi:

no, not at all.

Prasanna Malaiyandi:

Plus I think you could say I'm a bad influence on you, seeing how much

Prasanna Malaiyandi:

equipment you've now started to accrue.

W. Curtis Preston:

Yeah, last night I watched, I don't know.

W. Curtis Preston:

I'm going to say two solid hours of just table saw safety videos.

Prasanna Malaiyandi:

Yeah, but it is good for you to refresh your

Prasanna Malaiyandi:

mind on what table saw safety means.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

You do recall that a table saw is the reason this finger is missing

W. Curtis Preston:

its end, or this hand is missing an end.

W. Curtis Preston:

I'm missing the end of the middle finger on my left hand

W. Curtis Preston:

for those of you listening.

W. Curtis Preston:

so it's actually really hard for me to watch some of those videos.

Prasanna Malaiyandi:

Is it like, when you're doing like driver's

Prasanna Malaiyandi:

education, learning to drive, and they show, what is that, Red Asphalt?

Prasanna Malaiyandi:

Was that the name of the movie where it's like accidents happen and.

W. Curtis Preston:

Blood on the asphalt, I think is what that one's called.

W. Curtis Preston:

I do remember that one, but this one, there's one where a guy actually

W. Curtis Preston:

shows in the video, he doesn't have the board completely clear the blade

W. Curtis Preston:

when he takes his hand off of it.

W. Curtis Preston:

And the blade grabs the board and tosses it essentially at his groin area.

W. Curtis Preston:

And the thing is when you watch it, he looks at it one frame at a time.

W. Curtis Preston:

And the board goes from being on the other side of the blade to his groin

W. Curtis Preston:

in less than a frame of the video.

W. Curtis Preston:

And so that's one-thirtieth of a second, probably.

W. Curtis Preston:

yeah.

W. Curtis Preston:

And he's like, don't do that.

W. Curtis Preston:

but yeah, it's been interesting, but the thing that's got me super

W. Curtis Preston:

excited right now, has been this new video editing or just editing tool.

W. Curtis Preston:

It's both video and audio and, and it's this thing called,

Prasanna Malaiyandi:

Descript,

W. Curtis Preston:

Descript.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And it's just.

Prasanna Malaiyandi:

you sounded so excited when you texted me.

W. Curtis Preston:

Oh, my God.

W. Curtis Preston:

It's hard to describe how amazing this tool is. You input the audio or video; in my

W. Curtis Preston:

case, because we're using video clips of these episodes,

W. Curtis Preston:

I'm inputting the video and I edit the video and then I excerpt

W. Curtis Preston:

the audio for the audio excerpts.

W. Curtis Preston:

It's made mainly for talking head videos like these, right?

W. Curtis Preston:

Or audio and you input the audio or video, it does, automated transcription,

W. Curtis Preston:

which is about 95% accurate.

W. Curtis Preston:

And then you go through and you obviously, you can correct the things that it

W. Curtis Preston:

got wrong, but the really amazing part is if you start a sentence and you

W. Curtis Preston:

change your mind, or you have a lot of words leading up to that sentence,

W. Curtis Preston:

All you have to do is highlight those words in the document and

W. Curtis Preston:

it cuts them out of the video.

Prasanna Malaiyandi:

It's like magic

W. Curtis Preston:

It's like magic.

W. Curtis Preston:

And then if that's not enough magic, the part that I'm super excited

W. Curtis Preston:

about trying is sometimes you say one word when you meant to say another.

Prasanna Malaiyandi:

that never happens to you, Curtis.

W. Curtis Preston:

Like the podcast I was editing yesterday...

W. Curtis Preston:

It was you and I talking about 365 and you don't want this

W. Curtis Preston:

to happen on your worst day.

W. Curtis Preston:

That's what I meant to say.

W. Curtis Preston:

But for some reason I said last day, so with this tool, first

W. Curtis Preston:

off, I train it with my voice.

W. Curtis Preston:

I literally speak into the microphone, a bunch of stuff.

W. Curtis Preston:

It can then synthesize my voice.

W. Curtis Preston:

And I can select that word and change the word last to worst,

W. Curtis Preston:

and it will put my voice there, a synthesized version of my voice.

Prasanna Malaiyandi:

So here's a question, Curtis, do we actually

Prasanna Malaiyandi:

need to have this podcast anymore?

Prasanna Malaiyandi:

Or can we just have not even just type it out.

Prasanna Malaiyandi:

Can we just have something auto-generate based on all of our past podcasts and

Prasanna Malaiyandi:

just have it start creating new podcasts.

W. Curtis Preston:

It'll just be a recording that says

W. Curtis Preston:

3-2-1 rule over and over.

Prasanna Malaiyandi:

No, but you know how they've trained AI

Prasanna Malaiyandi:

to now do paintings and things like that.

Prasanna Malaiyandi:

I wonder if we could basically have,

W. Curtis Preston:

AI based.

W. Curtis Preston:

yeah.

W. Curtis Preston:

first I get it.

W. Curtis Preston:

I have to get all the audio and then feed that into a thing.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

We don't need you and me anymore.

Prasanna Malaiyandi:

Exactly.

W. Curtis Preston:

How hard is it to just say backup your stuff, backup all the

W. Curtis Preston:

stuff and make sure you test your backups?

Prasanna Malaiyandi:

And then you just do it based off of whatever's

Prasanna Malaiyandi:

trending on Twitter and the data protection, security space.

Prasanna Malaiyandi:

And it comes up with a new podcast episode for us.

W. Curtis Preston:

That may have already happened.

W. Curtis Preston:

Who knows?

W. Curtis Preston:

You don't know this is an auto-generated video and auto-generated audio, who

W. Curtis Preston:

knows, but speaking of testing backups, I was thinking about this concept, as

W. Curtis Preston:

long as you don't test your backups, your backup is both a complete success

W. Curtis Preston:

and a complete failure, which reminds me of, the concept of Schrodinger's cat.

Prasanna Malaiyandi:

I like the former, rather than thinking

Prasanna Malaiyandi:

about the latter, but that's

W. Curtis Preston:

Yeah, but,

Prasanna Malaiyandi:

rather than the realist.

W. Curtis Preston:

So you're familiar with the concept of Schrodinger's cat, right?

Prasanna Malaiyandi:

Based on TV shows, movies,

W. Curtis Preston:

Okay.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

So it's just a concept, as I understand the concept that you have this cat in

W. Curtis Preston:

a box, and as long as you don't look in the box, the cat is both alive and dead.

W. Curtis Preston:

But once you look in the box, you will know that the cat is alive or dead.

W. Curtis Preston:

That's the concept of Schrodinger's cat.

W. Curtis Preston:

And the reason why this is relevant today is that we have the author of

W. Curtis Preston:

a blog called Schrodinger's Backup: when good documentation goes bad.

W. Curtis Preston:

He's been in the IT industry almost as long as I have.

W. Curtis Preston:

He comes to us from the UK.

W. Curtis Preston:

Welcome to the podcast, Gary Williams.

Gary Williams:

Thank you and thank you for the invite.

W. Curtis Preston:

I saw that title.

W. Curtis Preston:

And I was like, I gotta get this guy on the podcast.

Prasanna Malaiyandi:

Curtis was so excited.

Prasanna Malaiyandi:

Gary, you have no idea.

Prasanna Malaiyandi:

This is like one of his favorite topics.

Gary Williams:

Thank you.

Gary Williams:

I don't know if I coined the term.

Gary Williams:

I have seen it used since. I'd like to think I coined the term,

Gary Williams:

but I don't know for certain,

W. Curtis Preston:

why not?

Gary Williams:

it might be something that I heard and I just copied because

Gary Williams:

it just sounds really cool and it's perfectly accurate, I think.

Gary Williams:

It was all three or four companies ago.

Gary Williams:

The lessons we learned still definitely apply today, but this

Gary Williams:

happened about three companies back.

Gary Williams:

So about 10 years ago.

W. Curtis Preston:

So what was your role at the time?

Gary Williams:

So my role at the time was a senior network engineer or senior

Gary Williams:

support engineer, something like that.

W. Curtis Preston:

OK, and you had the gall to ask about backups.

Gary Williams:

No, I didn't.

Gary Williams:

I was overconfident with our backups, let's say. So we had the backup

Gary Williams:

software, I think it was Backup Exec.

Gary Williams:

And, we had all the servers being backed up.

Gary Williams:

We had everything going to dual tapes.

Gary Williams:

The tapes were going off site.

Gary Williams:

Everything was working.

W. Curtis Preston:

Jewel, jewel tapes?

Gary Williams:

Dual tapes.

Gary Williams:

We actually had the backups, the software was writing

Gary Williams:

effectively RAID-1 backups.

Gary Williams:

So it was writing to two tapes.

W. Curtis Preston:

Oh, dual tapes.

W. Curtis Preston:

Okay.

W. Curtis Preston:

I heard, for some reason I heard Jewel.

W. Curtis Preston:

I don't know why.

Gary Williams:

It's the English accent.

Gary Williams:

And yeah.

Gary Williams:

So it's going to two tapes simultaneously.

Gary Williams:

So the idea was that even if a tape broke, or if something happened to the

Gary Williams:

backup that we weren't entirely sure of, or you couldn't restore from one of the

Gary Williams:

tapes, you could then get the other tape and use that tape to do the restore.

Gary Williams:

So we had all that stuff going on.

Gary Williams:

And we got all the emails and of course we're getting the emails

Gary Williams:

saying all the backups are good, everything must be absolutely fine.

Gary Williams:

Why would we test them?

Gary Williams:

We're busy enough with other tickets and other stuff going on and projects.

Gary Williams:

We haven't got time to test them.

Gary Williams:

What's the point?

Gary Williams:

We know they work.

Prasanna Malaiyandi:

And so it looks like you were doing all the

Prasanna Malaiyandi:

right things in terms of setting up backups, following the 3-2-1 rule.

Prasanna Malaiyandi:

Right?

Prasanna Malaiyandi:

Making sure your copies were offsite and.

W. Curtis Preston:

Yeah.

Prasanna Malaiyandi:

I think that's probably better than maybe

Prasanna Malaiyandi:

like 70% of the people out there.

Prasanna Malaiyandi:

Who try to do backups.

Prasanna Malaiyandi:

You're like doing the right things.

Prasanna Malaiyandi:

You're like, oh, I'm good to go.

Gary Williams:

Yeah, absolutely.

Gary Williams:

As I say, we had the emails, we even checked the emails.

Gary Williams:

I think we even had a shared folder or something like that, where all

Gary Williams:

the backup emails went, and if one of us saw that the folder had

Gary Williams:

an unread one, we'd check it.

Gary Williams:

If there was an error, someone would get a ticket, it would get sorted out.

Gary Williams:

If the error went on for several days, there would be a conversation.

Gary Williams:

We will get these things fixed.

Gary Williams:

Where's the problem?

Gary Williams:

We know our backups are good.

W. Curtis Preston:

So you, you were a.

W. Curtis Preston:

You were, I don't know.

W. Curtis Preston:

I don't know what to call it, but so instead of being a proponent

W. Curtis Preston:

of testing the backups, you were a proponent of oh, everything's fine.

Gary Williams:

Unfortunately at that time.

Gary Williams:

Yes, I was sitting there quite fat, dumb and happy, going,

Gary Williams:

We've got the emails, the backups work.

Gary Williams:

We know they work.

Gary Williams:

Where's the problem.

Gary Williams:

I didn't see any issue here at all.

W. Curtis Preston:

For what it's worth.

W. Curtis Preston:

I had a similar point in my career and there was a time.

W. Curtis Preston:

I remember when I was at a company, I won't give the actual name of the

W. Curtis Preston:

company, but I will just say it's a very well-known electronics manufacturer.

W. Curtis Preston:

And I had helped them set up their backup system and I wasn't

W. Curtis Preston:

there just to do the backups.

W. Curtis Preston:

I was there to do sysadmin stuff.

W. Curtis Preston:

And they were a mess.

W. Curtis Preston:

This was a small department in this bigger electronics company.

W. Curtis Preston:

It was an interesting department.

W. Curtis Preston:

They called it.

W. Curtis Preston:

Simulation modeling and research.

W. Curtis Preston:

So it was a revolutionary idea at the time of the idea of modeling,

W. Curtis Preston:

like in a computer, what would happen if you drop this device?

W. Curtis Preston:

And so they were doing this in a computer.

W. Curtis Preston:

It was a fascinating, new at the time, field of science.

W. Curtis Preston:

So I was there to fix a whole bunch of problems.

W. Curtis Preston:

One of which, for example, was that every workstation, it was all

W. Curtis Preston:

Unix workstations, and every person had root on their workstation.

W. Curtis Preston:

And that was the first thing I was going to fix.

W. Curtis Preston:

But I also set up their backup system and, the backups worked.

W. Curtis Preston:

So I assumed the restores would work and it was some time.

W. Curtis Preston:

I was there long enough that I went, I actually, at some

W. Curtis Preston:

point needed to do a restore.

W. Curtis Preston:

And I found out that those tape drives were really good at writing data.

W. Curtis Preston:

And they were completely incapable of reading data.

W. Curtis Preston:

Again, I don't want.

W. Curtis Preston:

I'm sure there was something wrong with these drives, but

W. Curtis Preston:

they were IBM 3590 drives.

W. Curtis Preston:

Normally IBM drives are top of the line or whatever, but there was something wrong

W. Curtis Preston:

with these drives that I was completely.

W. Curtis Preston:

So I guess what I'm saying is you're not alone.

W. Curtis Preston:

Even me, and I've spent my career in this. Although honestly,

W. Curtis Preston:

that event is on the list of things that I think back to

Prasanna Malaiyandi:

Yeah.

W. Curtis Preston:

when I try to get other people to do it.

Gary Williams:

Absolutely same with me.

Gary Williams:

the backups that we were taking, as I say, we were only a small

Gary Williams:

team and we had all the emails.

Gary Williams:

We had everything in place.

Gary Williams:

We had the two tape libraries doing the backups.

Gary Williams:

So we thought we were in a really good position because we had

Gary Williams:

everything working the way it should.

Gary Williams:

We even had documentation for how all this stuff was put together.

Gary Williams:

We actually had a consultancy come in and help us put all this stuff together.

Gary Williams:

Because at the time I was working for a financial institution, we

Gary Williams:

had to have certain boxes ticked, and we had those boxes ticked

Gary Williams:

because we had the documentation.

Gary Williams:

We had the backups, they were going off site.

Gary Williams:

They were being looked after for us.

Gary Williams:

We even recalled tapes to make sure we could do the process

Gary Williams:

and no tapes were getting lost.

Gary Williams:

So we did that level of testing, but what we never actually tested was

Gary Williams:

actually restoring the data itself.

Gary Williams:

And it was a bit of an epiphany when we actually had someone come

Gary Williams:

into the team who was brand new to IT.

Gary Williams:

Had never worked in IT before.

Gary Williams:

Always wanted to work in IT.

Gary Williams:

Was actually employed in the business in a completely different role.

Gary Williams:

And then he actually said to me, one day, I'd like to move into IT.

Gary Williams:

I thought he was joking.

Gary Williams:

It turns out no, he was actually serious.

Gary Williams:

He was an ex-finance person wanting to move into IT.

Gary Williams:

So he applied internally, he got the job and he started with us and he started

Gary Williams:

looking through some old tickets and he was saying things like, why did you

Gary Williams:

do such and such a change this way?

Gary Williams:

So there's a whole education thing going on there.

Gary Williams:

And that's when he asked the question.

Gary Williams:

When did you test the backups?

Gary Williams:

What do you mean test them.

Gary Williams:

We've got the emails.

Gary Williams:

Look, here, you can see the servers.

Gary Williams:

Here's the tape drives.

Gary Williams:

Here's the tapes.

Gary Williams:

We recall the tapes.

Gary Williams:

Yeah, sure.

Gary Williams:

But when did you restore something?

Gary Williams:

And it's something I won't actually forget, because there was this look.

Gary Williams:

there's only four of us in the IT team.

Gary Williams:

We were a really small team for a company of about 300 and there's

Gary Williams:

this look going around the whole office and everyone's going well, we

Gary Williams:

haven't actually tested them have we?

Prasanna Malaiyandi:

It's like a light bulb goes off and it's yeah.

Prasanna Malaiyandi:

It's Ooh.

Gary Williams:

We're like, hang on.

Gary Williams:

Yeah, we should probably test one of them, shouldn't we?

Gary Williams:

Okay.

Gary Williams:

What should we test? And looking back on it, it was a really insane moment

Gary Williams:

just to think that we'd had it easy.

Gary Williams:

I think we'd actually had the emails coming in for over a year.

Gary Williams:

And yes, we'd had the odd backup failure where something timed out there, or there

Gary Williams:

was a fault with one of the tape drives.

Gary Williams:

These tape drives were quite old.

Gary Williams:

So they actually had physical SCSI cables that would sometimes play up.

Gary Williams:

So you had to make sure the SCSI cables were all firmly

Gary Williams:

attached, the terminator was in.

Gary Williams:

The good old days.

Gary Williams:

And.

Prasanna Malaiyandi:

And you never had to deal with restores?

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And of course you had both active and passive terminators as well.

Gary Williams:

Yeah, exactly.

Gary Williams:

we did actually have to do some restores, but we had, a storage array and the

Gary Williams:

storage provider let us do snapshots.

Gary Williams:

So 99% of the restores that we needed.

Gary Williams:

Just copy and paste from the snapshot.

Gary Williams:

Not a problem.

Gary Williams:

You deleted that file? Not a problem.

Gary Williams:

There it is.

Gary Williams:

If something was deleted from a desktop, the common response was, we

Gary Williams:

don't back things up on your desktop.

Gary Williams:

Sorry.

Gary Williams:

That's tough.

Gary Williams:

If you want it backed up, put it onto the server, put it into your

Gary Williams:

home drive or something like that.

Gary Williams:

It will get backed up.

Gary Williams:

So that was the general understood consensus because it was a small company.

Gary Williams:

Most of the time, this wasn't an issue, and as I say, people deleted a file.

Gary Williams:

I remember one time we had an Excel file.

Gary Williams:

That was a real pain because of all these financial macros.

Gary Williams:

And we restored that from a snapshot.

Gary Williams:

And it was still corrupt and we had to go back a week or so, we

Gary Williams:

managed to get the file back and it was working and we actually said

Gary Williams:

it and I remember it quite well.

Gary Williams:

We said it within the team.

Gary Williams:

that was lucky.

Gary Williams:

We might have actually had to get the tapes on site and do a restore from

Gary Williams:

the tapes, but the snapshot worked.

Gary Williams:

Everything's fine, you know yeah.

Prasanna Malaiyandi:

Now you've decided, okay, we haven't tested.

Prasanna Malaiyandi:

Maybe we should actually try doing the test.

Prasanna Malaiyandi:

How did you decide what to test?

Gary Williams:

Funnily enough, it was the new guy.

Gary Williams:

the discussion was actually, okay, you're the person sitting

Gary Williams:

there looking through the tickets.

Gary Williams:

You're looking through the documentation.

Gary Williams:

You're new to it all.

Gary Williams:

You want us to prove to you that the restore process works.

Gary Williams:

We know it does.

Gary Williams:

Pick something.

Gary Williams:

And then he sat there and he went, "How about the Exchange server?"

Prasanna Malaiyandi:

Swinging for the fences!

Gary Williams:

Fine.

Gary Williams:

So we thought, okay, fine.

Gary Williams:

we'll get the tapes back on site.

Gary Williams:

We'll do the restore.

Gary Williams:

We'll prove that the backups work and we can go back to what we're normally doing.

Gary Williams:

all the project work, that kind of thing.

Gary Williams:

We could spend a day on this.

Gary Williams:

It will be good for us.

Gary Williams:

Not a problem.

Gary Williams:

We even went to the documentation and got the documentation out and said,

Gary Williams:

look, we've got the documentation.

Gary Williams:

The tapes are coming in.

Gary Williams:

This is going to be easy.

Gary Williams:

And it wasn't.

Prasanna Malaiyandi:

Of course not.

Prasanna Malaiyandi:

So when you decided to do the restore.

Prasanna Malaiyandi:

Did you bring down your production or were you like, I'm going to

Prasanna Malaiyandi:

restore this into a safe spot and

Gary Williams:

Yeah, we couldn't bring down production because the nature

Gary Williams:

of the business was that we needed to keep the server up and running.

Gary Williams:

We actually had a spare server, and I think we maybe had two spare servers.

Gary Williams:

VMs were just starting to come on the scene and we actually

Gary Williams:

had a spare server racked.

Gary Williams:

And the idea was that if we had a server failure, we could take the

Gary Williams:

physical discs out of one server.

Gary Williams:

Put them into another server, power it on, be back running.

Gary Williams:

This is also before the days of replicas.

Gary Williams:

They were, again, just coming out, and a lot of software was super expensive and

W. Curtis Preston:

You're giving me flashbacks, Gary.

Gary Williams:

the good old days.

W. Curtis Preston:

Yeah.

Gary Williams:

We had this physical server and it had plenty

Gary Williams:

of disc space to handle this.

Gary Williams:

So we said, okay, we've not actually even powered this server on.

Gary Williams:

I think maybe it was powered on when

Gary Williams:

we bought it and that was it.

Gary Williams:

So we said we should test that server out anyway.

Gary Williams:

Yeah.

Gary Williams:

Let's power it on.

Gary Williams:

Let's get the data restored to that server and bring Exchange up.

Gary Williams:

We can bring it up in an isolated network.

Gary Williams:

Do some very basic tests on it, because it was a small team.

Gary Williams:

We had access to the networking guys.

Gary Williams:

I'll say networking guys.

Gary Williams:

We did a little bit of networking each and there was one guy who did a lot

Gary Williams:

of the really key networking, tasks.

Gary Williams:

So none of that was a problem.

Gary Williams:

We didn't have to wait months for tickets to get done or anything like that.

Gary Williams:

So we set up this isolated network, we got the tapes on site and we

Gary Williams:

started doing the restore and that's when it all went horribly wrong.

Prasanna Malaiyandi:

So who was doing the restore?

Gary Williams:

If I recall, it was actually our help desk guy.

Gary Williams:

We said to him, look, you came up with this.

W. Curtis Preston:

You put a lot on this guy.

W. Curtis Preston:

It was his idea.

W. Curtis Preston:

And you're like, what, if you think testing backups is so

W. Curtis Preston:

important, why don't you do it?

Gary Williams:

Pretty much. We did put it on him,

Gary Williams:

cause it was his idea.

Gary Williams:

And we said, look, this is a really good exercise for you to do again.

Gary Williams:

Unfortunately, I'm going to put my hands up to this.

Gary Williams:

It's a bad thing to have done, but we said, I'm a senior IT person.

Gary Williams:

I know the backups are good.

Gary Williams:

here you go.

Gary Williams:

Here's the tapes.

Gary Williams:

Here's the documentation.

Gary Williams:

See you later and off he goes and he comes back.

Gary Williams:

I think it was about two, three hours later, something like that.

Gary Williams:

And he went, I can't get this working.

W. Curtis Preston:

Yeah.

Gary Williams:

What do you mean you can't get it working.

Gary Williams:

What's the problem.

Gary Williams:

And I don't actually recall what all the problems were, but

Gary Williams:

I know that the server itself didn't have enough disc space, even though

Gary Williams:

it was supposed to have the disc space, because the documentation said,

Gary Williams:

you need partition sizes like this.

Gary Williams:

And it had actually changed since then.

Gary Williams:

And we didn't realize, and that was really the start of a lot of problems.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

First off, I will say that

W. Curtis Preston:

I like the way you did it.

W. Curtis Preston:

That is to say, even though the way you got there was wrong, the fact that

W. Curtis Preston:

you had this person

W. Curtis Preston:

do it, who wasn't the person that made the documentation.

W. Curtis Preston:

That's actually something I push pretty heavily.

W. Curtis Preston:

And it's an idea that came from back in my days when I was at a bank

W. Curtis Preston:

and we very much did test restores.

W. Curtis Preston:

first off we didn't have snapshots.

W. Curtis Preston:

We didn't have any of that stuff.

W. Curtis Preston:

And we had 10,000 employees and any one of them was allowed to

W. Curtis Preston:

call into the help desk and ask for a restore on any given day.

W. Curtis Preston:

And, so we would get 10 to 15 restores a day.

W. Curtis Preston:

So we tested pretty regularly to that degree, but

W. Curtis Preston:

the thing that we had to test in the way that you did were these large

W. Curtis Preston:

server restores. We did a DR test and it was an absolute imperative

W. Curtis Preston:

from the powers that be that,

W. Curtis Preston:

Curtis wrote the documentation.

W. Curtis Preston:

Curtis cannot be the person actually doing the test.

W. Curtis Preston:

Curtis needs to be standing back there, listening closely to the problems that

W. Curtis Preston:

are happening, which was actually kind of nice, although

W. Curtis Preston:

it's nerve wracking to be the person who wrote the documentation and then

W. Curtis Preston:

sitting there watching someone, you think you've answered all the questions,

W. Curtis Preston:

but it's not like in this case, you.

W. Curtis Preston:

you had the classic example of the documentation might've been

W. Curtis Preston:

correct, but it was out of date.

Gary Williams:

It was correct at the time, the irony is very similar with you.

Gary Williams:

I didn't actually write the documentation.

Gary Williams:

It was written by the contractors and consultants that came in.

Gary Williams:

I actually signed off on the documentation saying, yes, all

Gary Williams:

the version numbers are correct.

Gary Williams:

And I think I'd done a couple of updates.

Gary Williams:

And then we'd had other changes and the other people

Gary Williams:

had forgotten or I'd forgotten.

Gary Williams:

Probably I'd forgotten to update the documentation because we

Gary Williams:

were busy, only a small team.

Gary Williams:

And so things very slowly, not just on that document, but on every other

Gary Williams:

document that we had about the environment, became out of date, and it was this

Gary Williams:

snowball of errors that had crept in.

Gary Williams:

And the thing that we realized is actually having no documentation

Gary Williams:

would have been better because the documentation was lying to us.

Gary Williams:

this poor guy is sitting there going, I followed steps three, four,

Gary Williams:

and five, but I can't do step six because step five doesn't work.

Gary Williams:

What do you mean?

Gary Williams:

It doesn't work.

Gary Williams:

And that's when we found that there was a service pack that

Gary Williams:

was missing from Exchange.

Gary Williams:

So it couldn't go any further and it just kept on building and building like this.

Prasanna Malaiyandi:

That is an interesting problem.

Prasanna Malaiyandi:

How do you keep your documentation up to date as you're making these

Prasanna Malaiyandi:

changes and making sure everyone across the environment knows like where the

Prasanna Malaiyandi:

documentation is and all the rest of that.

Gary Williams:

today, we use a Wiki solution for all of our documentation.

Gary Williams:

The idea behind that of course, is the Wiki is so easy to edit.

Gary Williams:

But you still don't or sometimes you still don't.

Gary Williams:

You make a note, I'll do that tomorrow or next week.

Gary Williams:

So there is still the exact same risk.

Gary Williams:

And even in my current place, we've seen this with certain, we do testing as well.

Gary Williams:

We do a lot more testing now than, anywhere I've ever worked before.

Gary Williams:

And even with a lot of the modern systems with Amazon.

Gary Williams:

Backups to S3 and all this kind of stuff.

Gary Williams:

We still test to make sure that everything's correct,

Gary Williams:

that we know what we're doing.

Gary Williams:

That those Wiki pages are fully up to date.

Gary Williams:

we did some AD restore testing not so long ago and we found, not major errors,

Gary Williams:

but there were a couple of little issues there with the restore process, which

Gary Williams:

just needed a few corrections in the documentation, just, as like a permissions

Gary Williams:

error type of thing where we couldn't actually get access to the bucket.

Gary Williams:

So we had to make some changes there.

Gary Williams:

So even with all the modern backup software.

Gary Williams:

It's still so important.

W. Curtis Preston:

I talked about those DR tests that we did back in the day and.

W. Curtis Preston:

The, and the fact that we always had someone who wasn't me doing the

W. Curtis Preston:

tests, and frequent listeners to the podcast will have heard this before.

W. Curtis Preston:

But if we define a successful restore, as we got from A to Z without having to ask

W. Curtis Preston:

Curtis, what does this line mean, not a single one of the restores was successful.

W. Curtis Preston:

so if Curtis ever got, blown up and, whatever, the chances of a restore

W. Curtis Preston:

going completely without a hitch was, zero, which is why you talked

W. Curtis Preston:

about updating, there's always little things that you have to update.

W. Curtis Preston:

I would suggest that original documentation.

W. Curtis Preston:

and again, take this for what it's worth to anybody who's listening.

W. Curtis Preston:

the first mistake was writing the documentation in a way that

W. Curtis Preston:

it can easily get outdated.

W. Curtis Preston:

Our Exchange server is 75

W. Curtis Preston:

tera...

W. Curtis Preston:

right.

W. Curtis Preston:

that's a problem.

W. Curtis Preston:

So if you're going to hand that to a restore documentation, what it should say

W. Curtis Preston:

is before beginning the restore, go look at the size of the backups, And figure

W. Curtis Preston:

out how big the current exchange server is, and then size the volume accordingly.

W. Curtis Preston:

yeah, that, that line wouldn't have gone out of date as quickly.
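
Editor's sketch: the instruction Curtis describes (look at the current backups, then size the volume) can be turned into a small calculation instead of a number frozen in a document. The following is a minimal Python sketch under stated assumptions: the function name, the 25% headroom, and the idea of using the largest recent full backup are illustrative and not from the episode. If the backup product compresses or deduplicates, the reported backup size may understate the restored size.

```python
import math


def recommended_volume_gib(recent_full_backup_sizes_gib, headroom=0.25):
    """Size a restore volume from measured backup sizes.

    recent_full_backup_sizes_gib: sizes (in GiB) of recent full backups,
        read from the backup catalog at restore time (hypothetical input).
    headroom: extra fraction added on top of the largest backup (assumption).
    """
    if not recent_full_backup_sizes_gib:
        raise ValueError("need at least one measured backup size")
    largest = max(recent_full_backup_sizes_gib)
    # Round up so the volume is never smaller than the computed requirement.
    return math.ceil(largest * (1 + headroom))


if __name__ == "__main__":
    # Example figures only; real values come from the backup system.
    print(recommended_volume_gib([700, 750, 820]))
```

The point of the design is that the document records the procedure, and the number is recomputed every time the procedure is followed.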

W. Curtis Preston:

it is a real challenge by the way.

W. Curtis Preston:

this idea of what it's like to update documentation, by the way, back in

W. Curtis Preston:

the day we were using WordPerfect.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

And I remember the official company standard was WordPerfect,

W. Curtis Preston:

because we could use it on, we had Unix versions of WordPerfect.

W. Curtis Preston:

By the way, curses-based WordPerfect.

W. Curtis Preston:

Not this fancy Windows.

W. Curtis Preston:

what-you-see-is-what-you-get editing stuff.

W. Curtis Preston:

This was text on a screen.

W. Curtis Preston:

and I remember getting in a fight over.

W. Curtis Preston:

There was this one guy that was new and he wanted to use Word

W. Curtis Preston:

because nobody used WordPerfect.

W. Curtis Preston:

And we were like, we don't care.

W. Curtis Preston:

We use WordPerfect here for our documentation.

W. Curtis Preston:

And if you want your documentation to fit into our documentation,

W. Curtis Preston:

you will use WordPerfect.

W. Curtis Preston:

And you will like it.

Gary Williams:

I remember our first days of moving across

Gary Williams:

to Word, where you had the...

Gary Williams:

Word had the ability to mimic WordPerfect key presses.

Gary Williams:

So you could transition easily.

Gary Williams:

Good old days.

W. Curtis Preston:

Good old days, but I think what you're doing now with the Wiki,

W. Curtis Preston:

I think that's a much better approach.

Gary Williams:

It is.

Gary Williams:

There's permissions list behind it, obviously, so that not everyone

Gary Williams:

can get access to it, but it's the right people can get access.

Gary Williams:

but what it means is everyone in the team can get access.

Gary Williams:

They can all update.

Gary Williams:

It.

Gary Williams:

There's a history as well.

Gary Williams:

So the other thing that we didn't have is the backup of the documentation

Gary Williams:

was on the server we were backing up.

Prasanna Malaiyandi:

Oh,

Gary Williams:

Exactly.

Gary Williams:

So we, all we had was that documentation and looking back on it, we made

Gary Williams:

quite a few mistakes like this.

Gary Williams:

We had the, let's say we had the documentation on the file server.

Gary Williams:

So if the file server was lost.

Gary Williams:

How did you get your documentation?

Gary Williams:

And it was, again, something that the helpdesk guy pointed out to us.

Gary Williams:

How did you get your documentation?

Gary Williams:

That's fine, actually.

Gary Williams:

How would we.

Prasanna Malaiyandi:

Sometimes it's an outside perspective or

Prasanna Malaiyandi:

someone's Hey, how are you actually going to get this stuff done?

Gary Williams:

Something I think it's really important to know is at

Gary Williams:

the time I was a senior IT person.

Gary Williams:

There's a colleague of mine who was senior and we had a network guy.

Gary Williams:

All of us, were reasonably senior.

Gary Williams:

This guy was a junior.

Gary Williams:

He'd been working in finance for three or four years beforehand.

Gary Williams:

And then he'd just moved into IT.

Gary Williams:

And he had such a fresh perspective on everything that it really opened our eyes.

Gary Williams:

And that was the day I learned that it doesn't matter if you got 50

Gary Williams:

years IT experience or five minutes.

Gary Williams:

There's always something you can learn from someone.

Gary Williams:

And sometimes the most valuable thing you can learn is from someone

Gary Williams:

who is very new to the team, fresh eyes, fresh perspective.

Gary Williams:

It's invaluable.

Prasanna Malaiyandi:

100% agree.

W. Curtis Preston:

There, there is a perspective that you can only

W. Curtis Preston:

gain by being completely ignorant.

W. Curtis Preston:

He could have been not junior to IT in this case.

W. Curtis Preston:

He was, but even if he's a senior IT person, but he's joining your organization

W. Curtis Preston:

for the first time, another way, you look at this person when they ask for things

W. Curtis Preston:

of like, when they ask stupid questions, so how often do we, test our backups here?

W. Curtis Preston:

And you're like, we don't do that.

Gary Williams:

with my current place, any new person we get into our IT team, we

Gary Williams:

literally do that sort of thing with them.

Gary Williams:

Now where we say, have a look through the tickets.

Gary Williams:

You've got any questions.

Gary Williams:

Ask, have a look through the Wiki again.

Gary Williams:

You've got any questions ask because.

Gary Williams:

There's so many things in there.

Gary Williams:

There's like the whole corporate culture and there's corporate acronyms.

Gary Williams:

And if they don't know what they are, we've just found a problem

Gary Williams:

because if there's one acronym we have this, I, my brain's gone.

Gary Williams:

Sorry.

Gary Williams:

there's one acronym that we have, that's very similar to an IT acronym.

Gary Williams:

I can't remember what it is off the top of my head.

Gary Williams:

Yeah.

Gary Williams:

But when you look at it, you think the, IT term because you're an IT person,

Gary Williams:

but it actually means the corporate.

Gary Williams:

so there's that kind of thing.

Gary Williams:

it's always important to spell out these acronyms at the start of any

Gary Williams:

documentation so that everyone knows this is what you are referring to.

Prasanna Malaiyandi:

Especially

W. Curtis Preston:

it's Prasanna's job on, on the podcast.

W. Curtis Preston:

If anybody ever brings up, an acronym that, they don't spell out,

W. Curtis Preston:

Prasanna's, always making them spell it

Prasanna Malaiyandi:

out.

Prasanna Malaiyandi:

Yep.

Prasanna Malaiyandi:

I'm like, what does that really mean?

Prasanna Malaiyandi:

Please tell me.

Gary Williams:

And this is the thing.

Gary Williams:

You can walk into a meeting with all the IT acronyms and every IT

Gary Williams:

person sitting there will probably think it's something different.

Gary Williams:

I think DC is a good one because DC's direct current, data center.

Gary Williams:

Things like that.

Gary Williams:

And this is the sort of thing that we've experienced several times, a few different

Gary Williams:

companies I've worked for, and it's always valuable to get that new person's insight.

Gary Williams:

Because they don't know the corporate terminology, they don't

Gary Williams:

know the corporate acronyms.

Gary Williams:

So it's worth getting them on board and going through all this stuff because

Gary Williams:

they've got this fresh insight before they learn that stuff and they can spot these

Gary Williams:

problems before they become a problem.

W. Curtis Preston:

I just realized I haven't thrown out our

W. Curtis Preston:

usual disclaimer, Prasanna and I work for different companies.

W. Curtis Preston:

I work for Druva and he works for Zoom.

W. Curtis Preston:

And this is not a podcast of either company.

W. Curtis Preston:

And the opinions that you hear are ours.

W. Curtis Preston:

Please rate this podcast at ratethispodcast.com/restore.

W. Curtis Preston:

And if you, are like our guest here today, Gary who, just you're an IT person

W. Curtis Preston:

out there, and you want to talk about your favorite subject to, or if you know

W. Curtis Preston:

what, maybe if you don't understand why

Prasanna Malaiyandi:

Come challenge, Mr.

Prasanna Malaiyandi:

Backup.

W. Curtis Preston:

Some crazy person would actually like them then, come on

W. Curtis Preston:

here related topics, cybersecurity, data privacy, a number of related topics.

W. Curtis Preston:

We'd love to have you on as a guest and, and reach out

W. Curtis Preston:

to me at wcurtispreston@gmail or at @wcpreston on Twitter.

W. Curtis Preston:

And we'll get you on here.

W. Curtis Preston:

So, um, how did it turn out

W. Curtis Preston:

with your restore?

Gary Williams:

So eventually we got there, we actually got the

Gary Williams:

Exchange server fully restored, with corrected documentation.

Gary Williams:

and I think it took three or four days, something like that.

Gary Williams:

And the thing is, if you want to do a team building exercise, forget all the

Gary Williams:

assault courses and other things they have, you do get the team together and

Gary Williams:

do a restore. Some of the best... seriously.

W. Curtis Preston:

It's a bit like, the trust exercises where you lean

W. Curtis Preston:

backwards and someone catches you. It's like that.

Gary Williams:

I've also never seen so many whiteboards being used to

Gary Williams:

describe issues and draw diagrams of how things hung together.

Gary Williams:

And it was actually really good.

Gary Williams:

And I will admit we ended up putting some projects, not exactly on pause,

Gary Williams:

but we put them to one side as all of us started getting involved in

Gary Williams:

this restore, because we realized we actually had a very serious problem.

Gary Williams:

I'll be honest.

Gary Williams:

We gave the help desk guy, this guy who was junior to IT, the documentation.

Gary Williams:

And we did expect him to trip over a few things.

Gary Williams:

He's a new person, some of the terminology is new, fine, not a problem.

Gary Williams:

We know we're there to help.

Gary Williams:

What we didn't expect was us to trip over the same issues.

Gary Williams:

We honestly thought that, like you were saying earlier, Curtis, that he

Gary Williams:

was going to ask us some questions.

Gary Williams:

We could do some updates to the documentation, do it again,

Gary Williams:

and everything would be fine.

Gary Williams:

But we didn't expect to get stumped by our own documentation.

Gary Williams:

And unfortunately we actually did, we're sitting there going

Gary Williams:

through the documentation going well, hang on a minute.

Gary Williams:

we know the, the password is in this password safe and that password

Gary Williams:

should work, but something had changed or I think at one point we'd

Gary Williams:

actually, changed the security model.

Gary Williams:

So it was requiring stronger passwords.

Gary Williams:

So you couldn't actually use a password that was on the backup.

Gary Williams:

You had to go and reset an account.

Gary Williams:

And it was lots of.

Gary Williams:

There was nothing seriously wrong with the backup as such.

Gary Williams:

And there's nothing seriously wrong with the documentation,

Gary Williams:

but it was lots of little things that just piled up and piled up.

Gary Williams:

And every time we took a couple of steps forward, we thought, that's it.

Gary Williams:

We've got this solved, we'll get this restored.

Gary Williams:

And then we got it all up and running and got the server running and

Gary Williams:

the Exchange server service wouldn't start.

Gary Williams:

couldn't figure out why.

Gary Williams:

I think that one took us a day to go through and we ended up having

Gary Williams:

to run some additional commands.

Gary Williams:

And finally, we got there, we got it all up and running.

Gary Williams:

And I still remember, I think it was actually like a Friday or something

Gary Williams:

we're sitting there in the office and went, yeah, that was a really good

Gary Williams:

question, you know, can we restore the data?

Gary Williams:

Thank you for asking it.

Gary Williams:

we had a bit of a celebration over that one.

W. Curtis Preston:

I would say that, I like what you were saying

W. Curtis Preston:

about, it sounded like there was a lot of collaboration.

W. Curtis Preston:

It sounds like there's a lot of whiteboards going on

W. Curtis Preston:

and you were learning a lot.

W. Curtis Preston:

I would argue that the reason that was the case is that you

W. Curtis Preston:

weren't doing it under duress.

W. Curtis Preston:

You were doing this as a test.

W. Curtis Preston:

if your exchange had been down for three or four days, that would have

W. Curtis Preston:

been a very different experience.

Gary Williams:

Completely.

Gary Williams:

It's something that we actually discussed, that Friday afternoon, we've got the

Gary Williams:

exchange server up and running and the conversation was what happens if

Gary Williams:

this happens for real, because sure.

Gary Williams:

We got the backup restored.

Gary Williams:

We know that the backup is good.

Gary Williams:

I mean, it was tested and it was good, but the restore process wasn't good.

Gary Williams:

And I think we focused way too much on the backup itself and

Gary Williams:

not the restore at that point.

Gary Williams:

As I said, we had that conversation and it was a matter of what would happen.

Gary Williams:

And we knew that we were a small company.

Gary Williams:

We knew we would have the CEO down in the office, screaming at us.

Gary Williams:

I need this back.

Gary Williams:

We can't conduct business and I'll be honest that day.

Gary Williams:

We got a healthy lot of respect, both for the backups, for documentation

Gary Williams:

and the accuracy of documentation and for the server itself.

Gary Williams:

Because we knew that the company at that point, the company relied on email so

Gary Williams:

much that if that server did disappear, and we took that long to get back up and

Gary Williams:

running the loss, the financial loss to the company and the reputational loss

Gary Williams:

to the company would have been huge.

Gary Williams:

And that also actually helped form some push forward for additional resilience

Gary Williams:

in, like, the servers and moving more towards things like virtual machines,

Gary Williams:

so that we had the ability to clone and do other bits and pieces, because

Gary Williams:

we could use that as an experience.

Gary Williams:

It's look, this is how long potentially worst case scenario it will take.

Gary Williams:

It shouldn't because we're learning and we need to do this a lot more often.

Gary Williams:

We need to allocate time to do this.

Gary Williams:

And the beauty was again, being such a small company.

Gary Williams:

We actually had the ear of a couple of directors, so you could

Gary Williams:

put this case forward and they were really receptive to it.

W. Curtis Preston:

I want to tack on something you said there.

W. Curtis Preston:

the fact that you and I have been in that timeframe.

W. Curtis Preston:

young kids today, they don't understand what it was like back then, when you had

W. Curtis Preston:

no resiliency, you had no redundancy.

W. Curtis Preston:

You had nothing.

W. Curtis Preston:

So we had a server, and that server had a disk drive.

W. Curtis Preston:

We didn't have mirroring.

W. Curtis Preston:

We didn't have

Gary Williams:

RAID.

Gary Williams:

Although we had very, we had rightful life.

Gary Williams:

We got, we were really market.

W. Curtis Preston:

we didn't, when I was back in the day, we

W. Curtis Preston:

literally were installing data directly on individual disk drives.

W. Curtis Preston:

I think we might've had redundant power supplies on the servers that

W. Curtis Preston:

we were using, and that was it.

W. Curtis Preston:

And so the loss of any one of those components could take down the server.

W. Curtis Preston:

Right.

W. Curtis Preston:

And, and now nowadays we move forward to the days of virtualization and

W. Curtis Preston:

that you can just, if there's a little problem with this server, you just

W. Curtis Preston:

move your VM over to another server.

W. Curtis Preston:

In fact, you can vMotion it and Storage vMotion it, and you can

W. Curtis Preston:

move it while it's running, which continues to boggle my brain.

Gary Williams:

likewise.

W. Curtis Preston:

And also the devices that would, that so

W. Curtis Preston:

many of us have grown used to.

W. Curtis Preston:

I, at home I pretty much live a solid state life.

W. Curtis Preston:

My TiVo has a solid state hard drive.

W. Curtis Preston:

and so those are so much more reliable than the moving part

W. Curtis Preston:

drives that you and I grew up on.

W. Curtis Preston:

and I think as a result, they don't have

W. Curtis Preston:

the respect that you need to test backups the way you should.

W. Curtis Preston:

I don't know.

W. Curtis Preston:

Just a quick editor's note.

W. Curtis Preston:

In the next section, Gary is going to mention something called iLO and iDRAC.

W. Curtis Preston:

And he, we forgot to have him define it.

W. Curtis Preston:

So I'm doing that now.

W. Curtis Preston:

They are systems from Dell and HP, the Integrated Dell Remote Access

W. Curtis Preston:

Controller and HP Integrated Lights-Out.

W. Curtis Preston:

They're both systems that help increase the uptime of the server by notifying

W. Curtis Preston:

you of potential failures or issues.

W. Curtis Preston:

Back to your podcast.

Gary Williams:

One of the things that we still do today, and this is

Gary Williams:

probably me being paranoid coming from that environment, we didn't

Gary Williams:

get alerts on a server if a disk failed, because it didn't really know.

Gary Williams:

The iLOs and iDRACs were way too expensive for us to have at that point.

Gary Williams:

So daily server room checks go around.

Gary Williams:

Is there any flashing lights that shouldn't be flashing?

Gary Williams:

And we still do that in our data centers today.

Gary Williams:

And we still do that with some of our machines.

Gary Williams:

We've actually got this philosophy in place now where if a machine is up for

Gary Williams:

more than 30 days, it needs to be rebooted because we don't know if it's reboot safe.

Gary Williams:

So we're starting to put uptime alarms in.

Gary Williams:

Certainly on Windows.

Gary Williams:

Linux is a bit different, but with Windows, when it hits a 30-day point.

Gary Williams:

If we get an uptime alarm, it means that there's possibly

Gary Williams:

a patching issue with that.

Gary Williams:

We should get an alarm from the patching system as well.

Gary Williams:

So we go off and we check.
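
Editor's sketch: the 30-day uptime alarm Gary describes amounts to comparing each machine's last boot time with a threshold. A minimal Python sketch, assuming boot times have already been collected by some monitoring tool; the function name, host names, and data shape are hypothetical and not from the episode.

```python
from datetime import datetime, timedelta

# Threshold mentioned in the episode: machines up longer than 30 days get flagged.
MAX_UPTIME = timedelta(days=30)


def uptime_alarms(boot_times, now, max_uptime=MAX_UPTIME):
    """Return hostnames whose uptime exceeds max_uptime, longest uptime first.

    boot_times: dict mapping hostname -> datetime of last boot (hypothetical input).
    now: the current time, passed in so the check is easy to test.
    """
    overdue = {
        host: now - booted
        for host, booted in boot_times.items()
        if now - booted > max_uptime
    }
    return sorted(overdue, key=overdue.get, reverse=True)


if __name__ == "__main__":
    example = {
        "exch01": datetime(2022, 2, 1),
        "file01": datetime(2022, 3, 10),
        "dc01": datetime(2021, 12, 1),
    }
    print(uptime_alarms(example, now=datetime(2022, 3, 25)))
```

An alarm here is a prompt to investigate (was patching skipped, is a reboot window needed), not proof that something is broken.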

Gary Williams:

but the other thing we do something similar with Linux as well.

Gary Williams:

We're trying to get all the Linux servers rebooted because generally

Gary Williams:

with those, we can patch them hot, but we still want to get them rebooted.

Gary Williams:

Are they reboot safe?

Gary Williams:

Because if we do lose power or machine crashes, it's great having all that stuff

Gary Williams:

there, but if it doesn't reboot, we've got a problem and we may have a backup if

Gary Williams:

that backup is inherited that corruption or that problem we're in a bad place.

Gary Williams:

So we do try to make sure that we've got, these servers rebooted on a fairly regular basis.

Prasanna Malaiyandi:

That's actually very interesting.

Prasanna Malaiyandi:

I never thought about that, about the fact that you need to reboot the systems

Prasanna Malaiyandi:

and just make sure the hardware and OS and everything else come back up.

Gary Williams:

Absolutely.

Gary Williams:

The other thing that we've done is we've actually turned uptime

Gary Williams:

on its head. Now, in the old days,

Gary Williams:

these uptime figures of two years, three years, were put on the

Gary Williams:

internet, and it's, look at my uptime.

Gary Williams:

Now it's the other way around.

Gary Williams:

It's like, yeah, there's an uptime of 45 days.

Gary Williams:

Oh, look at my uptime.

Gary Williams:

That's bad.

Gary Williams:

We need to get this rebooted.

Gary Williams:

And check it is reboot safe.

Gary Williams:

Trying to find reboot windows sometimes is a bit difficult, even with all the

Gary Williams:

resilience, to just take systems down.

Gary Williams:

but we do have some sort of bargaining going on with various teams where

Gary Williams:

we do try and reboot the systems at least once a month, just to make

Gary Williams:

sure that they are reboot safe.

W. Curtis Preston:

So help me understand that phrase.

W. Curtis Preston:

What do you mean when you say reboot safe?

Gary Williams:

So reboot safe is simply that if it's potentially a change can be

Gary Williams:

made to a machine, that means a machine when it reboots is going to crash, or

Gary Williams:

there's going to be a problem where it can't complete the boot corrupted

Gary Williams:

boot loader or something like that.

Gary Williams:

We've seen issues in the past where.

Gary Williams:

a Microsoft update has corrupted the bootloader.

Gary Williams:

So when you go to reboot, it doesn't restart properly.

Gary Williams:

So we've actually got the term reboot safe, which just means I know

Gary Williams:

if I have to reboot that server, I don't have to worry about it.

Gary Williams:

It will come up.

Gary Williams:

The operating system will start, all the services that need to start will start,

Gary Williams:

because we've had issues in the past where certain key services don't start.

Gary Williams:

So we get a ticket.

Gary Williams:

Can you please reboot this machine?

Gary Williams:

Sure.

Gary Williams:

Reboot it, you walk off.

Gary Williams:

You think it's done, but the services don't start.

Gary Williams:

Now, the alerting will alert on that.

Gary Williams:

But in the meantime, you're potentially still down for a bit longer than you need to be.

Gary Williams:

So we do these tests where we just want to make sure all the services

Gary Williams:

that need to start actually start.

Gary Williams:

And it comes up completely clean and working exactly how it should.
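
Editor's sketch: the "reboot safe" check Gary describes boils down to comparing the services that should be running after a reboot with what is actually running. A minimal Python sketch, assuming the service states have been gathered by some other tool; the function name, service names, and state strings are hypothetical and not from the episode.

```python
def services_not_started(expected_services, observed_states):
    """Return expected services that are absent or not running, sorted by name.

    expected_services: names that must be running after a reboot (hypothetical list,
        ideally kept in the same documentation as the restore procedure).
    observed_states: dict mapping service name -> state string collected after
        the reboot; anything other than "running" counts as a failure.
    """
    return sorted(
        name
        for name in expected_services
        if observed_states.get(name, "missing") != "running"
    )


if __name__ == "__main__":
    expected = ["mail-store", "web-frontend", "monitoring-agent"]
    observed = {"mail-store": "stopped", "web-frontend": "running"}
    # An empty list would mean the machine came back "reboot safe".
    print(services_not_started(expected, observed))
```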

W. Curtis Preston:

you're giving me, Yeah.

W. Curtis Preston:

And by the way, I agree with you with this idea of, the occasional

W. Curtis Preston:

reboots and I agree that it's, that it's a practice that has gone by

W. Curtis Preston:

the wayside by a lot of people.

W. Curtis Preston:

And I remember, I can remember the first time I left my,

W. Curtis Preston:

this is before I got the Mr.

W. Curtis Preston:

Backup.

W. Curtis Preston:

moniker and I got a different moniker and I'll explain it in a minute.

W. Curtis Preston:

I was at a large oil and gas company and no one had administered the data

W. Curtis Preston:

center, like a real sysadmin in years.

W. Curtis Preston:

And so I was going in there and I was doing crazy things like

W. Curtis Preston:

installing the latest patch set.

W. Curtis Preston:

And this was, these were Solaris systems and, it required a reboot

W. Curtis Preston:

in order to, to install the patches.

W. Curtis Preston:

And what was happening was I was like, 0 for 10, in terms

W. Curtis Preston:

of I would install a patch.

W. Curtis Preston:

I would reboot the server and it wouldn't come back.

W. Curtis Preston:

And so I picked up the nickname crash, because that's what I was

W. Curtis Preston:

just, I was literally, it's like the cure is worse than the disease.

W. Curtis Preston:

So it's we need to do this, but I was doing, I was proactively

W. Curtis Preston:

doing damage to the environment.

W. Curtis Preston:

By doing the things I was doing, what I did get really good at though is restoring

W. Curtis Preston:

their environment because it kept,

W. Curtis Preston:

so what it turned out, the things that were really.

W. Curtis Preston:

I don't know uh in trouble were the disks themselves, because we actually

W. Curtis Preston:

powered down the servers for some of them.

W. Curtis Preston:

And that's when things really went awry because the disk

W. Curtis Preston:

drives had never been turned off.

W. Curtis Preston:

And then, yeah.

W. Curtis Preston:

And then, they wouldn't come back on.

W. Curtis Preston:

So I had to get all new disk drives and then, and then do the restore, but yeah.

Prasanna Malaiyandi:

Yeah.

Gary Williams:

Yeah, but even with the virtual machines, we still like to

Gary Williams:

reboot them and to make sure all the services that should come up do come up.

Gary Williams:

We've even in some cases taken that paranoia to the next level where we'll

Gary Williams:

do a reboot test before we install a patch or before we do something,

Gary Williams:

just to make sure that it's not, that patch that has caused a problem.

Gary Williams:

Now, we generally don't do that for the Microsoft patches, but we do

Gary Williams:

that for certain application patches.

Gary Williams:

And it's almost a sanity check.

Gary Williams:

Because that way, if there is a problem, we know that it is that patch

Gary Williams:

that has caused a problem and not something lurking from beforehand.

Prasanna Malaiyandi:

Going back to the article you wrote, Gary, one of the

Prasanna Malaiyandi:

things I liked in it was you talked about this spreadsheet, if you will, that

Prasanna Malaiyandi:

tracked sort of assets that were backed up and you had a methodology that you

Prasanna Malaiyandi:

called out in the article in terms of how long you would wait before something

Prasanna Malaiyandi:

had to be tested, Or how the longest something could go without being tested.

Prasanna Malaiyandi:

And there were certain things that were critical in your environment that sort

Prasanna Malaiyandi:

of had to be done more periodically.

Gary Williams:

Yeah.

Gary Williams:

So what we did is.

Gary Williams:

we had a spreadsheet, the list of all the backups anyway, and one of the

Gary Williams:

things we tried to do was make sure that there were no clashing backups.

Gary Williams:

So the Exchange server would get backed up at, say, 10:00 PM.

Gary Williams:

The file server would get backed up at 11:00 PM.

Gary Williams:

That kind of thing, because otherwise we found there was a

Gary Williams:

lot of issues on the network and latency and all this kind of thing.

Gary Williams:

So we wanted to stagger the backups as much as possible.
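
Editor's sketch: checking a schedule for clashing backups is an interval-overlap test on each pair of jobs. A minimal Python sketch, assuming each job has a known start time and an estimated duration; the function name, job names, and durations are hypothetical (the episode only gives the 10:00 PM and 11:00 PM start times).

```python
from datetime import datetime, timedelta
from itertools import combinations


def find_clashes(jobs):
    """Return pairs of job names whose backup windows overlap.

    jobs: dict mapping job name -> (start datetime, estimated duration timedelta).
    Pairs are returned in alphabetical order of job name.
    """
    clashes = []
    for (a, (a_start, a_len)), (b, (b_start, b_len)) in combinations(sorted(jobs.items()), 2):
        # Two windows overlap when each one starts before the other one ends.
        if a_start < b_start + b_len and b_start < a_start + a_len:
            clashes.append((a, b))
    return clashes


if __name__ == "__main__":
    night = datetime(2022, 3, 25, 22, 0)  # 10:00 PM
    schedule = {
        "exchange": (night, timedelta(minutes=90)),
        "fileserver": (night + timedelta(hours=1), timedelta(hours=2)),
        "ad": (night + timedelta(hours=4), timedelta(minutes=30)),
    }
    print(find_clashes(schedule))
```

With the example durations, the Exchange job runs past 11:00 PM and collides with the file server job, which is the kind of clash the spreadsheet was meant to expose.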

Gary Williams:

But what we did was we actually added a column to that spreadsheet that said.

Gary Williams:

Restore last tested, documentation last updated, that kind of thing.

Gary Williams:

So that we knew when the backups were tested and we knew when that

Gary Williams:

documentation was last updated.

Gary Williams:

And what we did is we actually had, there was a formula in

Gary Williams:

it that would color the cells.

Gary Williams:

And if it was all green, everything's fine.

Gary Williams:

We've done a recent test.

Gary Williams:

I think recent was like six months, 12 months, something like that.

Gary Williams:

and if anything was outside of that window, it would go red.

Gary Williams:

So I think the exchange was every six months, the active

Gary Williams:

directory was once a year.

Gary Williams:

The file server was I think we would restore a folder or a file

Gary Williams:

every month, something like that.
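
Editor's sketch: the green/red coloring Gary describes is a comparison between the age of the last restore test and a per-system maximum. A minimal Python sketch, assuming the intervals he recalls (Exchange every six months, Active Directory yearly, file server monthly); the day counts, names, and function are illustrative and not the original spreadsheet formula.

```python
from datetime import date

# Maximum allowed age of the last restore test, in days (approximations of
# "six months", "once a year", and "every month" from the episode).
MAX_AGE_DAYS = {
    "exchange": 182,
    "active_directory": 365,
    "file_server": 31,
}


def restore_test_status(asset, last_tested, today):
    """Return "green" if the last restore test is within its window, else "red"."""
    age_days = (today - last_tested).days
    return "green" if age_days <= MAX_AGE_DAYS[asset] else "red"


if __name__ == "__main__":
    today = date(2022, 3, 25)
    print(restore_test_status("exchange", date(2021, 12, 1), today))
    print(restore_test_status("file_server", date(2022, 2, 1), today))
```

The same comparison works for a "documentation last updated" column with its own thresholds.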

Gary Williams:

and we did this quite a lot and we actually slowed down some

Gary Williams:

of the tapes going off site for things like the file server.

Gary Williams:

So we could do a backup a couple of days later, you do a restore test,

Gary Williams:

update the date in the documentation.

Gary Williams:

We know that's good.

Gary Williams:

Send the tape off-site and that's actually funny enough.

Gary Williams:

That was a financial reason as well because of the cost

Gary Williams:

of sending the tapes offsite.

Gary Williams:

but yeah, we started to do that and we started to get quite good

Gary Williams:

at being able to do these restores.

Gary Williams:

We were even able to get some additional hardware and we even

Gary Williams:

started to do some tests where we were restoring to virtual machines.

Gary Williams:

Because doing that process.

Gary Williams:

We found we could get them up and running a lot quicker.

Gary Williams:

We had a bit more room to breathe and we could have a

Gary Williams:

much better virtual environment.

Gary Williams:

And then we got into some other really clever stuff where we had a physical

Gary Williams:

domain controller and a virtual domain controller, and we tested fail-over

Gary Williams:

and all this, we got really advanced

W. Curtis Preston:

So you're saying that, the green column was

W. Curtis Preston:

actually the color of that column was automatically determined by

Gary Williams:

the age of the last test.

W. Curtis Preston:

That's pretty cool

Prasanna Malaiyandi:

Conditional formatting, Curtis in Excel.

Gary Williams:

that's it.

W. Curtis Preston:

you're probably better at Excel than I am,

W. Curtis Preston:

but, Gary, this has been great.

W. Curtis Preston:

I, I love this story.

W. Curtis Preston:

I love that it, like the other story we had, where, I don't know if you

W. Curtis Preston:

listen to the podcast at all, Gary, but we had an episode where someone, they

W. Curtis Preston:

tested their backups by essentially deleting their entire data center.

Prasanna Malaiyandi:

Paul van Dyke episode 135.

Gary Williams:

Wow.

Gary Williams:

I haven't heard that one.

Gary Williams:

I have heard some of the others and I have to say I'm a fan.

W. Curtis Preston:

And it was that one that would just, it

W. Curtis Preston:

hurt to, to listen to his story.

W. Curtis Preston:

And it was, he agrees that it was a really dumb idea.

W. Curtis Preston:

It did eventually work out, but it took him a while.

Gary Williams:

I can imagine.

Gary Williams:

I just remember the pain of the Exchange server and whilst I've not had

Gary Williams:

a repeat of that pain since, because.

Gary Williams:

The software is better these days, the restores are a lot quicker and you do

Gary Williams:

have a lot more options to play with.

Gary Williams:

we still have that pain from time to time when trying to do certain restores

Gary Williams:

and testing the environment out.

Gary Williams:

So I am still not that brave to do something like that, but, yeah,

Gary Williams:

I think we're getting there and.

W. Curtis Preston:

"Brave" would not be the word I would use, but.

Gary Williams:

Now we have talks about bringing in things like Chaos

Gary Williams:

Monkey and taking down things, but yeah, that's a test for another day.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

thanks Prasanna for your usual great questions

Prasanna Malaiyandi:

Always and nice chatting with you, Gary.

Prasanna Malaiyandi:

That was fun.

Gary Williams:

Thank you.

W. Curtis Preston:

and, thanks to the listeners again.

W. Curtis Preston:

You're why we're here.

W. Curtis Preston:

You're why we sit here and talk to us.

Prasanna Malaiyandi:

Curtis.

Prasanna Malaiyandi:

And we'll talk to each other anyway.

Prasanna Malaiyandi:

It doesn't matter.

W. Curtis Preston:

Yeah.

W. Curtis Preston:

Yeah, exactly.

W. Curtis Preston:

We'll probably be talking about table saws or video editing tools,

W. Curtis Preston:

but, anyway, remember to subscribe so that you can restore it all.