SearchFAQMemberlist Log in
Reply to topic Page 1 of 2
Goto page 1, 2  Next
critical error -- tape labels get corrupted, previous backup
Author Message
Post critical error -- tape labels get corrupted, previous backup 
I'm experiencing a critical problem where tape labels on volumes with data
get corrupted, leaving all data on the tape inaccessible to bacula.

I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
x86_64).

This problem has happened with approximately 15 tapes over approximately 6
months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
The problem is sporadic, appearing in approximately 1 out of 60 tapes
per week.

I do not think the issue is related to the physical media or the tape
drives. One tape was last written successfully when in drive 0, then appears
corrupt when a later job tries to use is in drive 1. Another tape was last
written successfully when in drive 1, then appears corrupt when a later job
tries to use it in drive 0.

I'm not sure what combination of circumstances trigger the problem,
but it seems to show up after:

a tape is used for backups
the tape is unloaded
the tape is later reloaded for use in another job

Data from uncorrupted tapes can be read and restored without problems.

Purging the corrupt volume from the bacula database and relabeling media
allows me to reuse the tape successfully (at the loss of TB of backups!).

A tape will work successfully for many jobs, then later bacula will mount the
tape and be unable to read the label.

Here are the log records for a particular volume. It was labeled about
Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
(350.49GB), then the label was corrupted.

------------------------------
04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on device "ml6000-drv1" (/dev/tape1-ml6000)
04-Jan 06:25 sbia-infr-vbacula JobId 42676: New volume "004090" mounted on device "ml6000-drv1" (/dev/tape1-ml6000) at 04-Jan-2012 06:25.
04-Jan 08:23 sbia-infr-vbacula JobId 42676: Committing spooled data to Volume "004090". Despooling 37,003,975,390 bytes ...
05-Jan 06:47 sbia-infr-vbacula JobId 42724: Volume "004090" previously written, moving to end of data.
05-Jan 06:49 sbia-infr-vbacula JobId 42724: Ready to append to end of Volume "004090" at file=69.
05-Jan 11:05 sbia-infr-vbacula JobId 42724: Committing spooled data to Volume "004090". Despooling 495 bytes ...
06-Jan 06:51 sbia-infr-vbacula JobId 42746: Volume "004090" previously written, moving to end of data.
06-Jan 06:52 sbia-infr-vbacula JobId 42746: Ready to append to end of Volume "004090" at file=70.
06-Jan 12:08 sbia-infr-vbacula JobId 42746: Committing spooled data to Volume "004090". Despooling 495 bytes ...
07-Jan 06:48 sbia-infr-vbacula JobId 42768: Volume "004090" previously written, moving to end of data.
07-Jan 06:50 sbia-infr-vbacula JobId 42768: Ready to append to end of Volume "004090" at file=71.
07-Jan 12:07 sbia-infr-vbacula JobId 42768: Committing spooled data to Volume "004090". Despooling 495 bytes ...
08-Jan 06:48 sbia-infr-vbacula JobId 42790: Volume "004090" previously written, moving to end of data.
08-Jan 06:50 sbia-infr-vbacula JobId 42790: Ready to append to end of Volume "004090" at file=72.
08-Jan 11:40 sbia-infr-vbacula JobId 42790: Committing spooled data to Volume "004090". Despooling 495 bytes ...
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Committing spooled data to Volume "004090". Despooling 495 bytes ...
10-Jan 05:47 sbia-infr-vbacula JobId 42831: Volume "004090" previously written, moving to end of data.
10-Jan 05:49 sbia-infr-vbacula JobId 42831: Ready to append to end of Volume "004090" at file=74.
10-Jan 23:52 sbia-infr-vbacula JobId 42831: Committing spooled data to Volume "004090". Despooling 35,276,606,182 bytes ...
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Committing spooled data to Volume "004090". Despooling 495 bytes ...
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Volume "004090" previously written, moving to end of data.
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Ready to append to end of Volume "004090" at file=83.
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume "004090". Despooling 5,914 bytes ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume "004090". Despooling 404 bytes ...
15-Jan 16:54 sbia-infr-vbacula JobId 42924: Please mount Volume "004090" or label a new one for:
------------------------------




At this point, the volume 004090 is unusable. Running 'btape' on that volume reports
----------------------------
[root < at > sbia-infr1 working]# ../bin/btape -v ml6000-drv0
Tape block granularity is 1024 bytes.
btape: butil.c:290 Using device: "ml6000-drv0" for writing.
23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0"
command.
23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result is Slot
9.
btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
*readlabel
btape: btape.c:526 Volume has no label.

Volume Label:
Id : **error**VerNo : 0
VolName :
PrevVolName :
VolFile : 0
LabelType : Unknown 0
LabelSize : 0
PoolName :
MediaType :
PoolType :
HostName :
Date label written: -4712-01-01 at 00:00
----------------------------




However, there _is_ data on the tape. I'm able to read the tape via dd
(ibs=64k). The ASCII data at the beginning of the tape shows fragments of the
Bacula label and data that corresponds to some of the backups:

-----------------------------------------
strings /tmp/vol4090.header |head -20
BB02
Bacula 1.0 immortal
ge$o
Archive
Backup
archive
sbia-infr-vnfs2
archive.2012-01-13_06.45.00_01
archive
ItnIpRhN5JVFXhw/+0/sAsD
Bacula 1.0 immortal
Archive
Backup
archive
sbia-infr-vnfs2
archive.2012-01-13_06.45.00_01
archive
ItnIpRhN5JVFXhw/+0/sAsD
-----------------------------------------



I am happy to provide additional debugging information, etc.

Any assistance is welcome.

Thanks,

Mark

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:

I'm experiencing a critical problem where tape labels on volumes with data
get corrupted, leaving all data on the tape inaccessible to bacula.

I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
x86_64).

This problem has happened with approximately 15 tapes over approximately 6
months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
The problem is sporadic, appearing in approximately 1 out of 60 tapes
per week.

I do not think the issue is related to the physical media or the tape
drives. One tape was last written successfully when in drive 0, then appears
corrupt when a later job tries to use is in drive 1. Another tape was last
written successfully when in drive 1, then appears corrupt when a later job
tries to use it in drive 0.

Why do think it isn't a hardware problem?

Bacula only looks at the label when a volume is mounted, so it could be
written unsuccessfully but you wouldn't know that until later.


Here are the log records for a particular volume. It was labeled about
Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
(350.49GB), then the label was corrupted.

------------------------------
04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on device "ml6000-drv1" (/dev/tape1-ml6000)
04-Jan 06:25 sbia-infr-vbacula JobId 42676: New volume "004090" mounted on device "ml6000-drv1" (/dev/tape1-ml6000) at 04-Jan-2012 06:25.

Is /dev/tape1-ml6000 a non-rewinding device (like /dev/nst0)?


At this point, the volume 004090 is unusable. Running 'btape' on that volume reports
----------------------------
[root < at > sbia-infr1 working]# ../bin/btape -v ml6000-drv0
Tape block granularity is 1024 bytes.
btape: butil.c:290 Using device: "ml6000-drv0" for writing.
23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0"
command.
23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result is Slot
9.
btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
*readlabel
btape: btape.c:526 Volume has no label.

Volume Label:
Id : **error**VerNo : 0
VolName :
PrevVolName :
VolFile : 0
LabelType : Unknown 0
LabelSize : 0
PoolName :
MediaType :
PoolType :
HostName :
Date label written: -4712-01-01 at 00:00
----------------------------




However, there _is_ data on the tape. I'm able to read the tape via dd
(ibs=64k). The ASCII data at the beginning of the tape shows fragments of the
Bacula label and data that corresponds to some of the backups:

The output of

od -tx1 /tmp/vol4090.header | head -n 40

might be useful, to see why Bacula rejects it.

__Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
The pithy ruminations from Martin Simmons on


Thanks for replying.


<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> >
=> > I'm experiencing a critical problem where tape labels on volumes with data
=> > get corrupted, leaving all data on the tape inaccessible to bacula.
=> >
=> > I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > x86_64).
=> >
=> > This problem has happened with approximately 15 tapes over approximately 6
=> > months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
=> > The problem is sporadic, appearing in approximately 1 out of 60 tapes
=> > per week.
=> >
=> > I do not think the issue is related to the physical media or the tape
=> > drives. One tape was last written successfully when in drive 0, then appears
=> > corrupt when a later job tries to use is in drive 1. Another tape was last
=> > written successfully when in drive 1, then appears corrupt when a later job
=> > tries to use it in drive 0.
=>
=> Why do think it isn't a hardware problem?
=>

I don't think it's a hardware problem because:

the vast majority of tape access (read or write) doesn't result
in corrupted labels

there aren't SCSI, tape, or bacula errors reported during backups
(within Bacula, the OS, or the tape library console)

the tapes are readable--though the data is not usable by bacula

the problem occurs on tapes that have been written and read in
both drives (this doesn't rule out some common element in the
tape library)

=> Bacula only looks at the label when a volume is mounted, so it could be
=> written unsuccessfully but you wouldn't know that until later.

Interesting... thanks for bringing this up...I'm checking the logs
(which only go back to early Dec) to see if any of the corrupted tapes
got unloaded, reloaded and then written to successfully....

[pause]

Some of the tapes that are not corrupt have gone through multiple
load/unload cycles for different jobs....so the act of reloading a tape in
order to append new jobs does not always cause corruption in the label.

The initial label (via "label barcodes", assigning the tape to the
Scratch pool) must be valid, or bacula would detect the corruption
when it later loads the tape and uses is for its first job.

Does bacula actually relabel a tape with a working pool (Full,
Incremental, or Archive), or will it continue to have the pool name
(Scratch) that was assigned during "label barcodes"?

I'm also going to load all the other tapes in the changer and
check their labels with btape and dump.

[pause]

Done. I loaded, dumped the first ~640MB from the tape, and used 'btape' to
read the label from each tape in the changer (35). There were no I/O errors,
all tapes were 'readable', though 4 have corrupt labels.

=>
=>
=> > Here are the log records for a particular volume. It was labeled about
=> > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
=> > (350.49GB), then the label was corrupted.
=> >
=> > ------------------------------
=> > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
=> > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on devic


What does "Wrote label to prelabeled Volume" mean, exactly? Should the label
be changed from when it was first written with "label barcodes"?

The labels on new tapes that have been used but are not corrupted show a pool
of "Scratch". However, the database shows that those tapes have been used for
jobs in the working pools (Full, Incremental, Archive). For example, the
database shows that volume 004056 is 'full', has ~1513GB of data from 4 backups, and
is in the "Full" pool, but the label shows the pool as "Scratch".

It appears that:

label barcodes successfully labels a tape and puts it into the Scratch
pool

when the tape is first used, bacula logs the message "Wrote
label to prelabed Volume", but the label is not updated ('btape
readlabel' reports Scratch), and jobs can be written to the tape

at some time the label on some tapes gets corrupted

when the tape is reloaded, bacula detects the corruption


=> e "ml6000-drv1" (/dev/tape1-ml6000)
=> > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: New volume "004090" mounted on device "ml6000-drv1
=> " (/dev/tape1-ml6000) at 04-Jan-2012 06:25.
=>
=> Is /dev/tape1-ml6000 a non-rewinding device (like /dev/nst0)?

Yes, so successive writes to the same tape without unloading should
append.

The tape hardware consists of 2x LTO-4 drives in a 40 slot Dell ML6000
(rebranded Adic) changer.

=>
=>
=> > At this point, the volume 004090 is unusable. Running 'btape' on that volume reports
=> > ----------------------------
=> > [root < at > sbia-infr1 working]# ../bin/btape -v ml6000-drv0
=> > Tape block granularity is 1024 bytes.
=> > btape: butil.c:290 Using device: "ml6000-drv0" for writing.
=> > 23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0"
=> > command.
=> > 23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result is Slot
=> > 9.
=> > btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
=> > *readlabel
=> > btape: btape.c:526 Volume has no label.
=> >
=> > Volume Label:
=> > Id : **error**VerNo : 0
=> > VolName :
=> > PrevVolName :
=> > VolFile : 0
=> > LabelType : Unknown 0
=> > LabelSize : 0
=> > PoolName :
=> > MediaType :
=> > PoolType :
=> > HostName :
=> > Date label written: -4712-01-01 at 00:00
=> > ----------------------------
=> >
=> >
=> >
=> >
=> > However, there _is_ data on the tape. I'm able to read the tape via dd
=> > (ibs=64k). The ASCII data at the beginning of the tape shows fragments of the
=> > Bacula label and data that corresponds to some of the backups:
=>
=> The output of
=>
=> od -tx1 /tmp/vol4090.header | head -n 40

Sure.

0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32
0000020 00 00 00 2b 4f 0e 1c 72 ff ff ff fc ff ff 58 6b
0000040 00 00 00 9a 42 61 63 75 6c 61 20 31 2e 30 20 69
0000060 6d 6d 6f 72 74 61 6c 0a 00 00 00 00 0b 00 00 a7
0000100 95 00 04 b6 67 65 24 6f 0e 00 00 00 00 00 00 00
0000120 00 41 72 63 68 69 76 65 00 42 61 63 6b 75 70 00
0000140 61 72 63 68 69 76 65 00 73 62 69 61 2d 69 6e 66
0000160 72 2d 76 6e 66 73 32 00 61 72 63 68 69 76 65 2e
0000200 32 30 31 32 2d 30 31 2d 31 33 5f 30 36 2e 34 35
0000220 2e 30 30 5f 30 31 00 61 72 63 68 69 76 65 00 00
0000240 00 00 42 00 00 00 49 74 6e 49 70 52 68 4e 35 4a
0000260 56 46 58 68 77 2f 2b 30 2f 73 41 73 44 00 ff ff
0000300 ff fb ff ff 58 6b 00 00 00 be 42 61 63 75 6c 61
0000320 20 31 2e 30 20 69 6d 6d 6f 72 74 61 6c 0a 00 00
0000340 00 00 0b 00 00 a7 95 00 04 b6 67 65 26 18 80 00
0000360 00 00 00 00 00 00 00 41 72 63 68 69 76 65 00 42
0000400 61 63 6b 75 70 00 61 72 63 68 69 76 65 00 73 62
0000420 69 61 2d 69 6e 66 72 2d 76 6e 66 73 32 00 61 72
0000440 63 68 69 76 65 2e 32 30 31 32 2d 30 31 2d 31 33
0000460 5f 30 36 2e 34 35 2e 30 30 5f 30 31 00 61 72 63
0000500 68 69 76 65 00 00 00 00 42 00 00 00 49 74 6e 49
0000520 70 52 68 4e 35 4a 56 46 58 68 77 2f 2b 30 2f 73
0000540 41 73 44 00 00 00 00 00 00 00 00 00 00 00 00 00
0000560 00 00 00 00 00 00 00 00 00 00 00 54 00 00 00 53
0000600 00 00 00 00 00 00 00 54 00 00 00 00 00 00 00 00
0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0176000

=>
=> might be useful, to see why Bacula rejects it.

I hope that's more useful to you than me! Smile

I can provide dumps from headers of 3 other tapes with corrupted labels.

Thanks,

Mark

=>
=> __Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
On 1/24/12 2:22 PM, mark.bergman < at > uphs.upenn.edu wrote:
In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
The pithy ruminations from Martin Simmons on


Thanks for replying.


<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> >
=> > I'm experiencing a critical problem where tape labels on volumes with data
=> > get corrupted, leaving all data on the tape inaccessible to bacula.
=> >
=> > I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > x86_64).
=> >
=> > This problem has happened with approximately 15 tapes over approximately 6
=> > months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
=> > The problem is sporadic, appearing in approximately 1 out of 60 tapes
=> > per week.
=> >
=> > I do not think the issue is related to the physical media or the tape
=> > drives. One tape was last written successfully when in drive 0, then appears
=> > corrupt when a later job tries to use is in drive 1. Another tape was last
=> > written successfully when in drive 1, then appears corrupt when a later job
=> > tries to use it in drive 0.
=>
=> Why do think it isn't a hardware problem?
=>

I don't think it's a hardware problem because:

the vast majority of tape access (read or write) doesn't result
in corrupted labels

there aren't SCSI, tape, or bacula errors reported during backups
(within Bacula, the OS, or the tape library console)

the tapes are readable--though the data is not usable by bacula

the problem occurs on tapes that have been written and read in
both drives (this doesn't rule out some common element in the
tape library)

Perhaps someone else already suggested this and I missed it--this looks
like somehow the tapes were rewound behind bacula's back--could that
explain the behavior you are seeing?

-se

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
In the message dated: Tue, 24 Jan 2012 14:30:44 PST,
The pithy ruminations from Steve Ellis on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups
unreadable> were:
=> On 1/24/12 2:22 PM, mark.bergman < at > uphs.upenn.edu wrote:
=> > In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
=> > The pithy ruminations from Martin Simmons on
=> >
=> >
=> > Thanks for replying.
=> >
=> >
=> > <Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
=> > ps unreadable> were:
=> > => >>>>> On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> > => >
=> > => > I'm experiencing a critical problem where tape labels on volumes with data
=> > => > get corrupted, leaving all data on the tape inaccessible to bacula.
=> > => >
=> > => > I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > => > x86_64).
=> > => >
=> > => > This problem has happened with approximately 15 tapes over approximately 6
=> > => > months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
=> > => > The problem is sporadic, appearing in approximately 1 out of 60 tapes
=> > => > per week.
=> > => >
=> > => > I do not think the issue is related to the physical media or the tape
=> > => > drives. One tape was last written successfully when in drive 0, then appears
=> > => > corrupt when a later job tries to use is in drive 1. Another tape was last
=> > => > written successfully when in drive 1, then appears corrupt when a later job
=> > => > tries to use it in drive 0.
=> > =>
=> > => Why do think it isn't a hardware problem?
=> > =>
=> >
=> > I don't think it's a hardware problem because:
=> >
=> > the vast majority of tape access (read or write) doesn't result
=> > in corrupted labels
=> >
=> > there aren't SCSI, tape, or bacula errors reported during backups
=> > (within Bacula, the OS, or the tape library console)
=> >
=> > the tapes are readable--though the data is not usable by bacula
=> >
=> > the problem occurs on tapes that have been written and read in
=> > both drives (this doesn't rule out some common element in the
=> > tape library)
=> >
=> Perhaps someone else already suggested this and I missed it--this looks
=> like somehow the tapes were rewound behind bacula's back--could that
=> explain the behavior you are seeing?

Thanks for suggesting this. I appreciate the feedback.

Yeah, it would explain the symptom, but if I understand it correctly,
this would require:

bacula loads a tape with a valid label

writes N backup jobs to the tape

"something" rewinds the tape

bacula writes to the beginning of the tape, corrupting the label (but
believing the job to be successful)

bacula unloads the tape

at some later point, bacula loads the tape for another
job and cannot read the label

It is difficult to think of a scenario where "something rewinds" but
does not unload the tape.

We don't have any software other than bacula that reads/writes from tape.

Attempts to access the tape drives (not the autochanger) manually with 'mt'
while bacula-sd is running are blocked as bacula-sd has a lock on the tape
devices.

It is possible to use "mtx" to unload tapes from the drives while bacula is
running, and I believe that unloading an LTO tape implies that it is rewound.

However, I can't think of any scenario where a tape is unloaded without
updating the "in changer" flag in the database, and where "update slots" is
not called after the tape is unloaded, and where bacula tries to append to the
same tape, and where the tape is loaded without triggering an attempt to read
the label, and the 'append' therefore overwrites the beginning of the
tape...but maybe that's possible.

I may just change the bacula-dir and bacula-sd init scripts to call
mtx-changer and unload all drives before starting either daemon. This would
help ensure consistency, regardless of which daemon starts first or is later
restarted.

Thanks,

Mark

=> -se
=>

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
I just want to say that I have seen this recently too. I have a set of
LTO-3 tapes that I am now relabeling as I use them because the labels
have become unreadable. All I can add right now is that that particular
server is running Bacula v5.0.3 on Ubuntu.




On 1/24/2012 6:25 PM, mark.bergman < at > uphs.upenn.edu wrote:
In the message dated: Tue, 24 Jan 2012 14:30:44 PST,
The pithy ruminations from Steve Ellis on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backups
unreadable> were:
=> On 1/24/12 2:22 PM, mark.bergman < at > uphs.upenn.edu wrote:
=> > In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
=> > The pithy ruminations from Martin Simmons on
=> >
=> >
=> > Thanks for replying.
=> >
=> >
=> > <Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
=> > ps unreadable> were:
=> > => >>>>> On Mon, 23 Jan 2012 18:47:31 -0500, mark bergman said:
=> > => >
=> > => > I'm experiencing a critical problem where tape labels on volumes with data
=> > => > get corrupted, leaving all data on the tape inaccessible to bacula.
=> > => >
=> > => > I'm running bacula 5.2.2 built from source, under Linux (CentOS 5.7
=> > => > x86_64).
=> > => >
=> > => > This problem has happened with approximately 15 tapes over approximately 6
=> > => > months, mostly new LTO-4 media, but some LTO-3 media that's being reused.
=> > => > The problem is sporadic, appearing in approximately 1 out of 60 tapes
=> > => > per week.
=> > => >
=> > => > I do not think the issue is related to the physical media or the tape
=> > => > drives. One tape was last written successfully when in drive 0, then appears
=> > => > corrupt when a later job tries to use is in drive 1. Another tape was last
=> > => > written successfully when in drive 1, then appears corrupt when a later job
=> > => > tries to use it in drive 0.
=> > =>
=> > => Why do think it isn't a hardware problem?
=> > =>
=> >
=> > I don't think it's a hardware problem because:
=> >
=> > the vast majority of tape access (read or write) doesn't result
=> > in corrupted labels
=> >
=> > there aren't SCSI, tape, or bacula errors reported during backups
=> > (within Bacula, the OS, or the tape library console)
=> >
=> > the tapes are readable--though the data is not usable by bacula
=> >
=> > the problem occurs on tapes that have been written and read in
=> > both drives (this doesn't rule out some common element in the
=> > tape library)
=> >
=> Perhaps someone else already suggested this and I missed it--this looks
=> like somehow the tapes were rewound behind bacula's back--could that
=> explain the behavior you are seeing?

Thanks for suggesting this. I appreciate the feedback.

Yeah, it would explain the symptom, but if I understand it correctly,
this would require:

bacula loads a tape with a valid label

writes N backup jobs to the tape

"something" rewinds the tape

bacula writes to the beginning of the tape, corrupting the label (but
believing the job to be successful)

bacula unloads the tape

at some later point, bacula loads the tape for another
job and cannot read the label

It is difficult to think of a scenario where "something rewinds" but
does not unload the tape.

We don't have any software other than bacula that reads/writes from tape.

Attempts to access the tape drives (not the autochanger) manually with 'mt'
while bacula-sd is running are blocked as bacula-sd has a lock on the tape
devices.

It is possible to use "mtx" to unload tapes from the drives while bacula is
running, and I believe that unloading an LTO tape implies that it is rewound.

However, I can't think of any scenario where a tape is unloaded without
updating the "in changer" flag in the database, and where "update slots" is
not called after the tape is unloaded, and where bacula tries to append to the
same tape, and where the tape is loaded without triggering an attempt to read
the label, and the 'append' therefore overwrites the beginning of the
tape...but maybe that's possible.

I may just change the bacula-dir and bacula-sd init scripts to call
mtx-changer and unload all drives before starting either daemon. This would
help ensure consistency, regardless of which daemon starts first or is later
restarted.

Thanks,

Mark

=> -se
=>

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
On Wed, Jan 25, 2012 at 12:47 PM, Brian Debelius
<bdebelius < at > intelesyscorp.com> wrote:
I just want to say that I have seen this recently too.  I have a set of
LTO-3 tapes that I am now relabeling as I use them because the labels
have become unreadable.  All I can add right now is that that particular
server is running Bacula v5.0.3 on Ubuntu.


I have seen examples of this on the list when bacula was using a
rewinding tape device /dev/st0 instead of /dev/nst0. Could you be
doing that?

John

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

View user's profile Send private message
Post critical error -- tape labels get corrupted, previous backup 
On Tue, 24 Jan 2012 17:22:21 -0500, mark bergman said:

In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
The pithy ruminations from Martin Simmons on

I don't think it's a hardware problem because:

the vast majority of tape access (read or write) doesn't result
in corrupted labels

there aren't SCSI, tape, or bacula errors reported during backups
(within Bacula, the OS, or the tape library console)

the tapes are readable--though the data is not usable by bacula

the problem occurs on tapes that have been written and read in
both drives (this doesn't rule out some common element in the
tape library)

OK.


=> Bacula only looks at the label when a volume is mounted, so it could be
=> written unsuccessfully but you wouldn't know that until later.

Interesting... thanks for bringing this up...I'm checking the logs
(which only go back to early Dec) to see if any of the corrupted tapes
got unloaded, reloaded and then written to successfully....

[pause]

Some of the tapes that are not corrupt have gone through multiple
load/unload cycles for different jobs....so the act of reloading a tape in
order to append new jobs does not always cause corruption in the label.

The initial label (via "label barcodes", assigning the tape to the
Scratch pool) must be valid, or bacula would detect the corruption
when it later loads the tape and uses is for its first job.

Does bacula actually relabel a tape with a working pool (Full,
Incremental, or Archive), or will it continue to have the pool name
(Scratch) that was assigned during "label barcodes"?

I'm also going to load all the other tapes in the changer and
check their labels with btape and dump.

[pause]

Done. I loaded, dumped the first ~640MB from the tape, and used 'btape' to
read the label from each tape in the changer (35). There were no I/O errors,
all tapes were 'readable', though 4 have corrupt labels.

=>
=>
=> > Here are the log records for a particular volume. It was labeled about
=> > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
=> > (350.49GB), then the label was corrupted.
=> >
=> > ------------------------------
=> > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
=> > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on devic


What does "Wrote label to prelabeled Volume" mean, exactly?

Bacula always rewrites the label when a prelabeled volume is used for the
first time (but not when appending). This is necessary because the Bacula
label is stored in the same logical tape file as the backup data, so there is
no way to start writing after the existing label.


Should the label
be changed from when it was first written with "label barcodes"?

The labels on new tapes that have been used but are not corrupted show a pool
of "Scratch". However, the database shows that those tapes have been used for
jobs in the working pools (Full, Incremental, Archive). For example, the
database shows that volume 004056 is 'full', has ~1513GB of data from 4 backups, and
is in the "Full" pool, but the label shows the pool as "Scratch".

Leaving Scratch in the tape label sounds like a bug.


=>
=>
=> > At this point, the volume 004090 is unusable. Running 'btape' on that volume reports
=> > ----------------------------
=> > [root < at > sbia-infr1 working]# ../bin/btape -v ml6000-drv0
=> > Tape block granularity is 1024 bytes.
=> > btape: butil.c:290 Using device: "ml6000-drv0" for writing.
=> > 23-Jan 18:14 btape JobId 0: 3301 Issuing autochanger "loaded? drive 0"
=> > command.
=> > 23-Jan 18:14 btape JobId 0: 3302 Autochanger "loaded? drive 0", result is Slot
=> > 9.
=> > btape: btape.c:477 open device "ml6000-drv0" (/dev/tape0-ml6000): OK
=> > *readlabel
=> > btape: btape.c:526 Volume has no label.
=> >
=> > Volume Label:
=> > Id : **error**VerNo : 0
=> > VolName :
=> > PrevVolName :
=> > VolFile : 0
=> > LabelType : Unknown 0
=> > LabelSize : 0
=> > PoolName :
=> > MediaType :
=> > PoolType :
=> > HostName :
=> > Date label written: -4712-01-01 at 00:00
=> > ----------------------------
=> >
=> >
=> >
=> >
=> > However, there _is_ data on the tape. I'm able to read the tape via dd
=> > (ibs=64k). The ASCII data at the beginning of the tape shows fragments of the
=> > Bacula label and data that corresponds to some of the backups:
=>
=> The output of
=>
=> od -tx1 /tmp/vol4090.header | head -n 40

Sure.

0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32
0000020 00 00 00 2b 4f 0e 1c 72 ff ff ff fc ff ff 58 6b
0000040 00 00 00 9a 42 61 63 75 6c 61 20 31 2e 30 20 69
0000060 6d 6d 6f 72 74 61 6c 0a 00 00 00 00 0b 00 00 a7
0000100 95 00 04 b6 67 65 24 6f 0e 00 00 00 00 00 00 00
0000120 00 41 72 63 68 69 76 65 00 42 61 63 6b 75 70 00
0000140 61 72 63 68 69 76 65 00 73 62 69 61 2d 69 6e 66
0000160 72 2d 76 6e 66 73 32 00 61 72 63 68 69 76 65 2e
0000200 32 30 31 32 2d 30 31 2d 31 33 5f 30 36 2e 34 35
0000220 2e 30 30 5f 30 31 00 61 72 63 68 69 76 65 00 00
0000240 00 00 42 00 00 00 49 74 6e 49 70 52 68 4e 35 4a
0000260 56 46 58 68 77 2f 2b 30 2f 73 41 73 44 00 ff ff
0000300 ff fb ff ff 58 6b 00 00 00 be 42 61 63 75 6c 61
0000320 20 31 2e 30 20 69 6d 6d 6f 72 74 61 6c 0a 00 00
0000340 00 00 0b 00 00 a7 95 00 04 b6 67 65 26 18 80 00
0000360 00 00 00 00 00 00 00 41 72 63 68 69 76 65 00 42
0000400 61 63 6b 75 70 00 61 72 63 68 69 76 65 00 73 62
0000420 69 61 2d 69 6e 66 72 2d 76 6e 66 73 32 00 61 72
0000440 63 68 69 76 65 2e 32 30 31 32 2d 30 31 2d 31 33
0000460 5f 30 36 2e 34 35 2e 30 30 5f 30 31 00 61 72 63
0000500 68 69 76 65 00 00 00 00 42 00 00 00 49 74 6e 49
0000520 70 52 68 4e 35 4a 56 46 58 68 77 2f 2b 30 2f 73
0000540 41 73 44 00 00 00 00 00 00 00 00 00 00 00 00 00
0000560 00 00 00 00 00 00 00 00 00 00 00 54 00 00 00 53
0000600 00 00 00 00 00 00 00 54 00 00 00 00 00 00 00 00
0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0176000

=>
=> might be useful, to see why Bacula rejects it.

I hope that's more useful to you than me! Smile

That contains the following:

Begin Job Session Record:
JobId : 42901
VerNum : 11
PoolName : Archive
PoolType : Backup
JobName : archive
ClientName : sbia-infr-vnfs2
Job (unique name) : archive.2012-01-13_06.45.00_01
FileSet : archive
JobType : B
JobLevel : I
Date written : 13-Jan-2012 11:45

End Job Session Record:
JobId : 42901
VerNum : 11
PoolName : Archive
PoolType : Backup
JobName : archive
ClientName : sbia-infr-vnfs2
Job (unique name) : archive.2012-01-13_06.45.00_01
FileSet : archive
JobType : B
JobLevel : I
JobFiles : 0
JobBytes : 0
StartBlock : 0
EndBlock : 0
StartFile : 84
EndFile : 83
JobErrors : 0
JobStatus : T
Date written : 13-Jan-2012 11:45

The records are also marked as a "continuation" but I don't know if that has
any significance.

According to your previous email, that job was indeed written to the tape:

12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume "004090". Despooling 5,914 bytes ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume "004090". Despooling 404 bytes ...

Did anything unusual happen around (or between) those lines in the log?

Was the tape in the same drive for both jobs?

It looks very much like the tape was rewound between them.

__Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
I just checked the config on that server and I am using /dev/nst0. Its
just odd. I have never encountered this before in my ~5 years of
running bacula.


On 1/25/2012 1:36 PM, John Drescher wrote:
On Wed, Jan 25, 2012 at 12:47 PM, Brian Debelius
<bdebelius < at > intelesyscorp.com> wrote:
I just want to say that I have seen this recently too. I have a set of
LTO-3 tapes that I am now relabeling as I use them because the labels
have become unreadable. All I can add right now is that that particular
server is running Bacula v5.0.3 on Ubuntu.

I have seen examples of this on the list when bacula was using a
rewinding tape device /dev/st0 instead of /dev/nst0. Could you be
doing that?

John


------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
I just checked the config on that server and I am using /dev/nst0.  Its just
odd.  I have never encountered this before in my ~5 years of running bacula.


I have not seen it on my side yet.

John

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

View user's profile Send private message
Post critical error -- tape labels get corrupted, previous backup 
In the message dated: Wed, 25 Jan 2012 19:05:44 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Tue, 24 Jan 2012 17:22:21 -0500, mark bergman said:
=> >
=> > In the message dated: Tue, 24 Jan 2012 19:09:15 GMT,
=> > The pithy ruminations from Martin Simmons on
=>
=> > I don't think it's a hardware problem because:
=> >
=> > the vast majority of tape access (read or write) doesn't result
=> > in corrupted labels
=> >
=> > there aren't SCSI, tape, or bacula errors reported during backups
=> > (within Bacula, the OS, or the tape library console)
=> >
=> > the tapes are readable--though the data is not usable by bacula
=> >
=> > the problem occurs on tapes that have been written and read in
=> > both drives (this doesn't rule out some common element in the
=> > tape library)
=>
=> OK.
=>
=>

[SNIP!]

=> > =>
=> > => > Here are the log records for a particular volume. It was labeled about
=> > => > Dec 22, 2011. First used on Jan 4 2012. Used successfully for 10 jobs
=> > => > (350.49GB), then the label was corrupted.
=> > => >
=> > => > ------------------------------
=> > => > 04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
=> > => > 04-Jan 06:25 sbia-infr-vbacula JobId 42676: Wrote label to prelabeled Volume "004090" on
=> devic
=> >
=> >
=> > What does "Wrote label to prelabeled Volume" mean, exactly?
=>
=> Bacula always rewrites the label when a prelabeled volume is used for the
=> first time (but not when appending). This is necessary because the Bacula
=> label is stored in the same logical tape file as the backup data, so there is
=> no way to start writing after the existing label.
=>
=>
=> > Should the label
=> > be changed from when it was first written with "label barcodes"?
=> >
=> > The labels on new tapes that have been used but are not corrupted show a pool
=> > of "Scratch". However, the database shows that those tapes have been used for
=> > jobs in the working pools (Full, Incremental, Archive). For example, the
=> > database shows that volume 004056 is 'full', has ~1513GB of data from 4 backups, and
=> > is in the "Full" pool, but the label shows the pool as "Scratch".
=>
=> Leaving Scratch in the tape label sounds like a bug.


All of the uncorrupted tapes have the label Scratch, even those that
have been used for many jobs.

I will report this as a bug.

=>
=>
=> > =>
=> > =>
=> > => > At this point, the volume 004090 is unusable. Running 'btape' on that volume reports

[SNIP!]

=> > =>
=> > => The output of
=> > =>
=> > => od -tx1 /tmp/vol4090.header | head -n 40
=> >
=> > Sure.
=> >
=> > 0000000 00 6a 49 7b 00 00 01 88 00 00 00 00 42 42 30 32

[SNIP!]

=> > 0000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=> > *
=> > 0176000
=> >
=> > =>
=> > => might be useful, to see why Bacula rejects it.
=> >
=> > I hope that's more useful to you than me! Smile
=>
=> That contains the following:
=>
=> Begin Job Session Record:
=> JobId : 42901
=> VerNum : 11
=> PoolName : Archive
=> PoolType : Backup
=> JobName : archive
=> ClientName : sbia-infr-vnfs2
=> Job (unique name) : archive.2012-01-13_06.45.00_01
=> FileSet : archive
=> JobType : B
=> JobLevel : I
=> Date written : 13-Jan-2012 11:45
=>
=> End Job Session Record:
=> JobId : 42901
=> VerNum : 11
=> PoolName : Archive
=> PoolType : Backup
=> JobName : archive
=> ClientName : sbia-infr-vnfs2
=> Job (unique name) : archive.2012-01-13_06.45.00_01
=> FileSet : archive
=> JobType : B
=> JobLevel : I
=> JobFiles : 0
=> JobBytes : 0
=> StartBlock : 0
=> EndBlock : 0
=> StartFile : 84
=> EndFile : 83
=> JobErrors : 0
=> JobStatus : T
=> Date written : 13-Jan-2012 11:45
=>
=> The records are also marked as a "continuation" but I don't know if that has
=> any significance.
=>
=> According to your previous email, that job was indeed written to the tape:
=>
=> > 12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume "004090". Despoo
=> ling 5,914 bytes ...
=> > 13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume "004090". Despoo
=> ling 404 bytes ...
=>
=> Did anything unusual happen around (or between) those lines in the log?

Not that I can see. Many other backups ran (some failed due to concurrency
issues, and the fd seemed to be down on one client) in that interval.

No other jobs accessed ml600-drv1 (which had volume 004090 at 12-Jan 12:56).

=>
=> Was the tape in the same drive for both jobs?

Yes, and not unloaded between those jobs.

Luckily I already had mtx logging enabled. As far as I can tell, volume 004090 was
loaded into drive1 at 20120110-05:46:40. There were no other mtx commands
referring to drive1 until 20120113-19:15:04 when the tape was unloaded.

The tape from the same slot was loaded again on 20120115-16:53:52, when the
corruption was detected.

=>
=> It looks very much like the tape was rewound between them.

Hmmm....I don't know what could do that, besides bacula itself. We don't
use any other software that access tapes.

According to the database (query 14), media 004090 contains:

+--------+---------+---------------------+------+-------+-----------+-------------------+--------+
| JobId | Name | StartTime | Type | Level | Files | Bytes | Status |
+--------+---------+---------------------+------+-------+-----------+-------------------+--------+
| 42,676 | archive | 2012-01-03 04:45:48 | B | F | 3,431,911 | 1,752,538,705,339 | T |
| 42,724 | archive | 2012-01-05 06:46:58 | B | I | 1 | 0 | T |
| 42,746 | archive | 2012-01-06 06:50:18 | B | I | 1 | 0 | T |
| 42,768 | archive | 2012-01-07 06:47:35 | B | I | 1 | 0 | T |
| 42,790 | archive | 2012-01-08 06:47:38 | B | I | 1 | 0 | T |
| 42,812 | archive | 2012-01-09 06:45:04 | B | I | 1 | 0 | T |
| 42,831 | archive | 2012-01-10 05:46:30 | B | D | 88,332 | 35,222,612,986 | T |
| 42,856 | archive | 2012-01-11 06:45:04 | B | I | 1 | 0 | T |
| 42,879 | archive | 2012-01-12 06:45:04 | B | I | 36 | 0 | T |
| 42,901 | archive | 2012-01-13 06:45:01 | B | I | 0 | 0 | T |
+--------+---------+---------------------+------+-------+-----------+-------------------+--------+

The zero-byte backups are normal--this is an archive fileset, and it changes
very rarely.


Here is an analysis of the logs of all activity involving tape 004090 and simultaneous system events:
-----------------------------------------------------------------------------------------------------

04-Jan 06:24 sbia-infr-vbacula JobId 42676: Using Volume "004090" from 'Scratch' pool.
04-Jan 06:25 sbia-infr-vbacula JobId 42676: 3305 Autochanger "load slot 9, drive 1", status is OK.
initial load of volume to continue job 42676


04-Jan 08:23 sbia-infr-vbacula JobId 42676: Committing spooled data to Volume "004090". Despooling 37,003,975,390 bytes ...
04-Jan 08:28 sbia-infr-vbacula JobId 42676: Despooling elapsed time = 00:05:16, Transfer rate = 117.1 M Bytes/second
04-Jan 08:28 sbia-infr-vbacula JobId 42676: Sending spooled attrs to the Director. Despooling 1,255,585,818 bytes ...
job 42676 completed

04-Jan 20:11 sbia-infr-vbacula JobId 42705: 3307 Issuing autochanger "unload slot 9, drive 1" command.
volume 004090 unloaded

05-Jan 06:47 sbia-infr-vbacula JobId 42724: 3305 Autochanger "load slot 9, drive 1", status is OK.
05-Jan 06:47 sbia-infr-vbacula JobId 42724: Volume "004090" previously written, moving to end of data.
05-Jan 06:49 sbia-infr-vbacula JobId 42724: Ready to append to end of Volume "004090" at file=69.
tape loaded to run job 42724, label read correctly

05-Jan 11:05 sbia-infr-vbacula JobId 42724: Sending spooled attrs to the Director. Despooling 160 bytes ...
job 42724 complete

05-Jan 11:05 sbia-infr-vbacula JobId 42723: 3307 Issuing autochanger "unload slot 9, drive 1" command.
tape 004090 unloaded

06-Jan 06:50 sbia-infr-vbacula JobId 42746: 3305 Autochanger "load slot 9, drive 1", status is OK.
06-Jan 06:51 sbia-infr-vbacula JobId 42746: Volume "004090" previously written, moving to end of data.
06-Jan 06:52 sbia-infr-vbacula JobId 42746: Ready to append to end of Volume "004090" at file=70.
tape loaded to run job 42746, label read correctly

06-Jan 12:08 sbia-infr-vbacula JobId 42746: Sending spooled attrs to the Director. Despooling 160 bytes ...
job 42746 complete

06-Jan 20:11 sbia-infr-vbacula JobId 42748: 3307 Issuing autochanger "unload slot 9, drive 1" command.
tape 004090 unloaded

07-Jan 06:48 sbia-infr-vbacula JobId 42768: 3305 Autochanger "load slot 9, drive 1", status is OK.
07-Jan 06:48 sbia-infr-vbacula JobId 42768: Volume "004090" previously written, moving to end of data.
07-Jan 06:50 sbia-infr-vbacula JobId 42768: Ready to append to end of Volume "004090" at file=71.
tape loaded to run job 42768, label read correctly

07-Jan 12:07 sbia-infr-vbacula JobId 42768: Sending spooled attrs to the Director. Despooling 160 bytes ...
job 42768 complete

07-Jan 20:11 sbia-infr-vbacula JobId 42769: 3307 Issuing autochanger "unload slot 9, drive 1" command.
tape 004090 unloaded

08-Jan 06:48 sbia-infr-vbacula JobId 42790: 3305 Autochanger "load slot 9, drive 1", status is OK.
08-Jan 06:48 sbia-infr-vbacula JobId 42790: Volume "004090" previously written, moving to end of data.
08-Jan 06:50 sbia-infr-vbacula JobId 42790: Ready to append to end of Volume "004090" at file=72.
tape loaded to run job 42790, label read correctly

08-Jan 11:40 sbia-infr-vbacula JobId 42790: Sending spooled attrs to the Director. Despooling 160 bytes ...
job 42790 complete

09-Jan 06:45 sbia-infr-vbacula JobId 42812: Start Backup JobId 42812, Job=archive.2012-01-09_06.45.00_08
09-Jan 06:45 sbia-infr-vbacula JobId 42812: Using Device "ml6000-drv1"
09-Jan 06:45 sbia-infr-vbacula JobId 42812: Spooling data ...
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Job write elapsed time = 04:27:31, Transfer rate = 0 Bytes/second
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Committing spooled data to Volume "004090". Despooling 495 bytes ...
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Despooling elapsed time = 00:00:01, Transfer rate = 495 Bytes/second
09-Jan 11:12 sbia-infr-vbacula JobId 42812: Sending spooled attrs to the Director. Despooling 160 bytes ...
tape remained in drive 1, used for job 42812


09-Jan 22:07 sbia-infr-vbacula JobId 42818: 3307 Issuing autochanger "unload slot 9, drive 1" command.
tape 004090 unloaded

10-Jan 05:47 sbia-infr-vbacula JobId 42831: 3305 Autochanger "load slot 9, drive 1", status is OK.
10-Jan 05:47 sbia-infr-vbacula JobId 42831: Volume "004090" previously written, moving to end of data.
10-Jan 05:49 sbia-infr-vbacula JobId 42831: Ready to append to end of Volume "004090" at file=74.
tape loaded to run job 42831, label read correctly

10-Jan 23:58 sbia-infr-vbacula JobId 42831: Sending spooled attrs to the Director. Despooling 32,744,074 bytes ...
job 42831 complete

11-Jan 06:45 sbia-infr-vbacula JobId 42856: Start Backup JobId 42856, Job=archive.2012-01-11_06.45.00_11
11-Jan 06:45 sbia-infr-vbacula JobId 42856: Using Device "ml6000-drv1"
11-Jan 06:45 sbia-infr-vbacula JobId 42856: Spooling data ...
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Job write elapsed time = 06:13:44, Transfer rate = 0 Bytes/second
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Committing spooled data to Volume "004090". Despooling 495 bytes ...
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Despooling elapsed time = 00:00:01, Transfer rate = 495 Bytes/second
11-Jan 12:58 sbia-infr-vbacula JobId 42856: Sending spooled attrs to the Director. Despooling 160 bytes ...
tape remained in drive 1, used for job 42856

** Jan 11 17:45 server rebooted **

12-Jan 06:45 sbia-infr-vbacula JobId 42879: Start Backup JobId 42879, Job=archive.2012-01-12_06.45.00_35
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Using Device "ml6000-drv1"
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Volume "004090" previously written, moving to end of data.
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Ready to append to end of Volume "004090" at file=83.
12-Jan 06:45 sbia-infr-vbacula JobId 42879: Spooling data ...
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Job write elapsed time = 06:10:45, Transfer rate = 0 Bytes/second
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Committing spooled data to Volume "004090". Despooling 5,914 bytes ...
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Despooling elapsed time = 00:00:01, Transfer rate = 5.914 K Bytes/second
12-Jan 12:56 sbia-infr-vbacula JobId 42879: Sending spooled attrs to the Director. Despooling 10,271 bytes ...
tape remained in drive 1, used for job 42876


13-Jan 06:45 sbia-infr-vbacula JobId 42901: Start Backup JobId 42901, Job=archive.2012-01-13_06.45.00_01
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Using Device "ml6000-drv1"
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Spooling data ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Job write elapsed time = 00:00:01, Transfer rate = 0 Bytes/second
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Committing spooled data to Volume "004090". Despooling 404 bytes ...
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Despooling elapsed time = 00:00:01, Transfer rate = 404 Bytes/second
13-Jan 06:45 sbia-infr-vbacula JobId 42901: Sending spooled attrs to the Director. Despooling 0 bytes ...
tape remained in drive 1, used for job 42901


13-Jan 19:15 sbia-infr-vbacula JobId 42902: 3307 Issuing autochanger "unload slot 9, drive 1" command.
tape 004090 unloaded


14-Jan 06:45 sbia-infr-vbacula JobId 42924: Start Backup JobId 42924, Job=archive.2012-01-14_06.45.00_32
15-Jan 16:52 sbia-infr-vbacula JobId 42924: 3307 Issuing autochanger "unload slot 13, drive 1" command.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: Using Device "ml6000-drv1"
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3301 Issuing autochanger "loaded? drive 1" command.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3302 Autochanger "loaded? drive 1", result: nothing loaded.
15-Jan 16:53 sbia-infr-vbacula JobId 42924: 3304 Issuing autochanger "load slot 9, drive 1" command.
15-Jan 16:54 sbia-infr-vbacula JobId 42924: 3305 Autochanger "load slot 9, drive 1", status is OK.
15-Jan 16:54 sbia-infr-vbacula JobId 42924: Please mount Volume "004090" or label a new one for:
attempt to use tape for job 42924, label corrupt

-----------------------------------------------------------------------------------------------------


The fact that the server rebooted between job 42856 and job 42879 makes
me wonder if something in the fibre/SCSI initialization or bacula-sd
startup causes tapes to be rewound but not ejected.

I find it interesting that the label was overwritten by job 42901. If the
server reboot (and SCSI bus reset, fibre loop initialization, bacula-sd
restart, bacula-dir restart) caused the tape[s] in the drive[s] to rewind,
wouldn't the label on tape 004090 have been overwritten by job 42879?

I don't see any indication that the tape was ejected or rewound between
jobs 42879 and 42901, yet that's where the corruption seems to have
happened.

Is there any other information that I can provide? Do you suggest running
bacula-sd or bacula-dir at any particular debugging level to try to get
more detail if this happens again?

I plan to do the following:

change the daemon startup order, so that bacula-sd starts BEFORE bacula-dir

change /etc/init.d/bacula-sd to unload all tape drives before starting the
daemon

Thanks for helping look into this,

Mark


=>
=> __Martin

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
On Wed, 25 Jan 2012 16:55:24 -0500, mark bergman said:

The fact that the server rebooted between job 42856 and job 42879 makes
me wonder if something in the fibre/SCSI initialization or bacula-sd
startup causes tapes to be rewound but not ejected.

The bacula-sd startup will rewind the tape, but Bacula should know about that.

If there is a fibre/SCSI reset without Bacula knowing about it, then that
could cause it.


Is there any other information that I can provide? Do you suggest running
bacula-sd or bacula-dir at any particular debugging level to try to get
more detail if this happens again?

I suspect that are no debug messages that would show anything useful here,
because bacula-sd is probably unaware of the problem.

If you have some way to query the actual tape position periodically then that
would be useful (I don't know if Linux has something similar, but FreeBSD's mt
status command reports this and it can be used on the control mode device even
if Bacula has the drive open).

__Martin

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
In the message dated: Mon, 30 Jan 2012 19:12:11 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:
=> >>>>> On Wed, 25 Jan 2012 16:55:24 -0500, mark bergman said:
=> >
=> > The fact that the server rebooted between job 42856 and job 42879 makes
=> > me wonder if something in the fibre/SCSI initialization or bacula-sd
=> > startup causes tapes to be rewound but not ejected.
=>
=> The bacula-sd startup will rewind the tape, but Bacula should know about that.
=>
=> If there is a fibre/SCSI reset without Bacula knowing about it, then that
=> could cause it.

The problem happened again. Two tapes seem to be corrupted. (I say
"seem to be" because bacula loads the correct tape, but then asks for
another tape or to have that media labeled. I cannot check the suspect
tapes as a backup is running, so 'btape' is unusable.)

Here's a timetable of the events from the bacula-dir and mtx logs,
with comments {in curly brackets}:

{bacula-sd/bacula-dir server working successfully for ~8 days}
{mtx log shows which volumes are loaded}
20120129-06:46:20 Doing mtx -f /dev/changer-ml6000 load 4 0
{VolumeTag=003231}
20120130-00:37:40 Doing mtx -f /dev/changer-ml6000 load 7 1
{VolumeTag=000312}
20120130-00:38:08 Device /dev/tape1-ml6000 - not ready, retrying...
20120130-00:38:09 Device /dev/tape1-ml6000 - not ready, retrying...
20120131-15:59:55 Doing mtx -f /dev/changer-ml6000 -- to get count of slots
20120131-15:59:55 Parms: /dev/changer-ml6000 slots 0 /dev/tape0-ml6000 0
20120131-16:00:05 Doing mtx -f /dev/changer-ml6000 -- to list volumes
20120131-16:00:05 Parms: /dev/changer-ml6000 list 0 /dev/tape0-ml6000 0
{no mtx log entries until 20120201-19:41:10}


01-Feb 06:45 sbia-infr-vbacula JobId 43324: Start Backup JobId 43324, Job=corfu-full.2012-02-01_06.45.00_00
01-Feb 06:45 sbia-infr-vbacula JobId 43324: Using Device "ml6000-drv1"
01-Feb 06:45 sbia-infr-vbacula JobId 43324: Committing spooled data to Volume "000312". Despooling 180,926,597 bytes ...
{last successful backup to volume 000312, drive 1}

01-Feb 06:45 sbia-infr-vbacula JobId 43325: Start Backup JobId 43325, Job=archive.2012-02-01_06.45.00_01
01-Feb 06:45 sbia-infr-vbacula JobId 43325: Using Device "ml6000-drv0"
01-Feb 06:45 sbia-infr-vbacula JobId 43325: Committing spooled data to Volume "003231". Despooling 404 bytes ...
{last backup before infrastructure1 shutdown}


Feb 1 18:44 infrastructure1 server running bacula-sd & bacula-dir crashes

Feb 1 18:47 bacula-sd starts on infrastructure 3 (HA-initiated startup)
the /etc/init.d/bacula-sd script was previously
modified to do the following before starting the daemon
and after shutting down the daemon:

unload all tapes

release any SCSI reservations on each tape drive
sg_prevent --allow --verbose /dev/tape0-ml6000
sg_prevent --allow --verbose /dev/tape1-ml6000


Feb 1 18:47 bacula-dir starts on infrastructure 3 (HA-initiated startup)

Feb 1 18:49 infrastructure1 server starts

Feb 1 18:52:25 bacula-dir stopped on infrastructure 3 (manual shutdown)

Feb 1 18:52:25 bacula-sd stopped on infrastructure 3 (manual shutdown)

Feb 1 18:53 bacula-sd daemon started on infrastructure1 (normal startup), PID 8083

{mtx logs show query of devices in changer}
20120201-18:53:47 Doing mtx -f /dev/changer-ml6000 0 -- to find what is loaded
20120201-18:53:47 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0
20120201-18:53:57 Doing mtx -f /dev/changer-ml6000 0 -- to find what is loaded
20120201-18:53:57 Parms: /dev/changer-ml6000 loaded 0 /dev/tape0-ml6000 0

Feb 1 18:54 bacula-dir daemon started on infrastructure1 (normal startup), PID 8118

{first backup begins on bacula-sd/bacula-dir server}

01-Feb 19:41 sbia-infr-vbacula JobId 43326: Start Backup JobId 43326, Job=agora-full.2012-02-01_19.41.00_02
01-Feb 19:41 sbia-infr-vbacula JobId 43326: Using Device "ml6000-drv0"

{this should be empty from /etc/init.d/bacula-sd}

01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3301 Issuing autochanger "loaded? drive 0" command.
01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3302 Autochanger "loaded? drive 0", result: nothing loaded.
01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3304 Issuing autochanger "load slot 27, drive 0" command.

20120201-19:41:10 Doing mtx -f /dev/changer-ml6000 load 27 0
{there was no unload command issued by bacula...implying that the
modified /etc/init.d/bacula-sd startup script unloaded the tape from
slot 0}

01-Feb 19:41 sbia-infr-vbacula JobId 43326: 3305 Autochanger "load slot 27, drive 0", status is OK.
01-Feb 19:41 sbia-infr-vbacula JobId 43326: Volume "000335" previously written, moving to end of data.
01-Feb 19:42 sbia-infr-vbacula JobId 43326: Ready to append to end of Volume "000335" at file=63.
01-Feb 19:42 sbia-infr-vbacula JobId 43326: Spooling data ...

{backup to volume 000335 in slot 0 is OK}

{next backup job starts, uses drive 1}

01-Feb 20:11 sbia-infr-vbacula JobId 43328: Start Backup JobId 43328, Job=home-2.2012-02-01_20.11.00_05
01-Feb 20:11 sbia-infr-vbacula JobId 43328: Using Device "ml6000-drv1"
01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3301 Issuing autochanger "loaded? drive 1" command.
01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3302 Autochanger "loaded? drive 1", result: nothing loaded.

{as expected, nothing is loaded in drive 1}

01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load slot 7, drive 1" command.

20120201-20:11:10 Doing mtx -f /dev/changer-ml6000 load 7 1
{load required as the tape was unloaded by /etc/init.d/bacula-sd}

01-Feb 20:11 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 7, drive 1", status is OK.
01-Feb 20:11 sbia-infr-vbacula JobId 43328: Please mount Volume "000312" or label a new one for:

{bacula cannot read the label on tape 000312}

{used bconsole to change status of volume 000312 to "Disabled", bacula
requests, loads, and uses a different tape}

01-Feb 21:00 sbia-infr-vbacula JobId 43328: Recycled volume "000313"
01-Feb 21:00 sbia-infr-vbacula JobId 43328: Using Volume "000313" from 'Scratch' pool.
01-Feb 21:00 sbia-infr-vbacula JobId 43328: 3307 Issuing autochanger "unload slot 7, drive 1" command.
01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3304 Issuing autochanger "load slot 14, drive 1" command.
01-Feb 21:01 sbia-infr-vbacula JobId 43328: 3305 Autochanger "load slot 14, drive 1", status is OK.
01-Feb 21:01 sbia-infr-vbacula JobId 43328: Recycled volume "000313" on device "ml6000-drv1" (/dev/tape1-ml6000), all previous data lost.
01-Feb 21:01 sbia-infr-vbacula JobId 43328: Spooling data ...


{after several backups are written to tape 00313 in drive 1, bacula
attempts to use 003231 for a job in a different pool}

02-Feb 06:45 sbia-infr-vbacula JobId 43347: Start Backup JobId 43347, Job=archive.2012-02-02_06.45.00_34
02-Feb 06:48 sbia-infr-vbacula JobId 43347: 3307 Issuing autochanger "unload slot 14, drive 1" command.
02-Feb 06:50 sbia-infr-vbacula JobId 43347: Using Device "ml6000-drv1"
02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3301 Issuing autochanger "loaded? drive 1" command.
02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3302 Autochanger "loaded? drive 1", result: nothing loaded.
02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3304 Issuing autochanger "load slot 4, drive 1" command.
02-Feb 06:50 sbia-infr-vbacula JobId 43347: 3305 Autochanger "load slot 4, drive 1", status is OK.
02-Feb 06:51 sbia-infr-vbacula JobId 43347: Please mount Volume "003231" or label a new one for:


Based on this, it looks like the corruption is somehow associated with
some combination of the bacula-sd and bacula-dir stopping and starting
or with the SCSI release command....but that was as close as I was to
finding a solution earlier.

I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
ejected and to timestamp the SCSI release commands.

Is it possible that bacula flagged tapes 003231 and 000312 as being in
the drives because they were loaded when the server crashed, even though
they were later ejected (outside of bacula's control)? Could this cause
bacula to believe that the tapes were at EOT when they do get loaded, and
bacula then immediately begins writing (corrupting the label)? [Unlikely
that bacula would try to write before reading the label, and would then
read the label after corrupting the tapes.]

When the current backup is finished, I'll extract the beginning data
on each of 003231 and 000312. Is there anything you recommend in terms
of checking the data on tape to determine whether the tape begins with
random garbage (possibly caused by the shutdown, startup, scsi reset,
etc.) or if it begins with valid bacula data that happened to overwrite
the label instead of being appended?

Does anyone have suggestions of how to troubleshoot this further,
or how to make the daemon startup process more resistant to causing
any corruption?

Any help is greatly appreciated.

Thanks,

Mark

=>
=>
=> > Is there any other information that I can provide? Do you suggest running
=> > bacula-sd or bacula-dir at any particular debugging level to try to get
=> > more detail if this happens again?
=>
=> I suspect that are no debug messages that would show anything useful here,
=> because bacula-sd is probably unaware of the problem.
=>
=> If you have some way to query the actual tape position periodically then that
=> would be useful (I don't know if Linux has something similar, but FreeBSD's mt
=> status command reports this and it can be used on the control mode device even
=> if Bacula has the drive open).
=>
=> __Martin

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
On Fri, 03 Feb 2012 20:04:44 -0500, mark bergman said:

I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
ejected and to timestamp the SCSI release commands.

Is it possible that bacula flagged tapes 003231 and 000312 as being in
the drives because they were loaded when the server crashed, even though
they were later ejected (outside of bacula's control)? Could this cause
bacula to believe that the tapes were at EOT when they do get loaded, and
bacula then immediately begins writing (corrupting the label)? [Unlikely
that bacula would try to write before reading the label, and would then
read the label after corrupting the tapes.]

I don't see how this could happen. Bacula issues a rewind command when it
mounts a tape and should then know that the tape is at the start.


When the current backup is finished, I'll extract the beginning data
on each of 003231 and 000312. Is there anything you recommend in terms
of checking the data on tape to determine whether the tape begins with
random garbage (possibly caused by the shutdown, startup, scsi reset,
etc.) or if it begins with valid bacula data that happened to overwrite
the label instead of being appended?

Do you have a File device defined in the SD? If so, label a new File volume
and then append the data from the start of the tape to the end of the file
volume using dd and cat. You can then examine the file volume using bls -v -j
(the File label will allow bls to read it).


Does anyone have suggestions of how to troubleshoot this further,
or how to make the daemon startup process more resistant to causing
any corruption?

The important information missing is whether 000312 was already corrupted at
01-Feb 20:11. You could add some commands to the startup part of
/etc/init.d/bacula-sd script before it unloads all tapes. E.g. do mt status,
mt rewind and grab a copy of the first few blocks on any loaded tapes.

Also, you say that infrastructure1 server crashes. Maybe the crash caused the
tape to be rewound and some buffer flushed to start of the tape?

__Martin

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Post critical error -- tape labels get corrupted, previous backup 
In the message dated: Mon, 06 Feb 2012 12:43:41 GMT,
The pithy ruminations from Martin Simmons on
<Re: [Bacula-users] critical error -- tape labels get corrupted, previous backu
ps unreadable> were:


Martin,

Thanks again for continuing to respond...I appreciate the feedback and
troubleshooting help.


=> >>>>> On Fri, 03 Feb 2012 20:04:44 -0500, mark bergman said:
=> >
=> > I've added more logging to /etc/init.d/bacula-sd to confirm when tapes are
=> > ejected and to timestamp the SCSI release commands.
=> >
=> > Is it possible that bacula flagged tapes 003231 and 000312 as being in
=> > the drives because they were loaded when the server crashed, even though
=> > they were later ejected (outside of bacula's control)? Could this cause
=> > bacula to believe that the tapes were at EOT when they do get loaded, and
=> > bacula then immediately begins writing (corrupting the label)? [Unlikely
=> > that bacula would try to write before reading the label, and would then
=> > read the label after corrupting the tapes.]
=>
=> I don't see how this could happen. Bacula issues a rewind command when it

I don't see how it could happen either....but I'm searching for any
explanation.

=> mounts a tape and should then know that the tape is at the start.

That's what I'd expect too.


=>
=>
=> > When the current backup is finished, I'll extract the beginning data
=> > on each of 003231 and 000312. Is there anything you recommend in terms
=> > of checking the data on tape to determine whether the tape begins with
=> > random garbage (possibly caused by the shutdown, startup, scsi reset,
=> > etc.) or if it begins with valid bacula data that happened to overwrite
=> > the label instead of being appended?
=>
=> Do you have a File device defined in the SD? If so, label a new File volume

No.

=> and then append the data from the start of the tape to the end of the file
=> volume using dd and cat. You can then examine the file volume using bls -v -j
=> (the File label will allow bls to read it).


Can I do this against a tape directly?

=>
=>
=> > Does anyone have suggestions of how to troubleshoot this further,
=> > or how to make the daemon startup process more resistant to causing
=> > any corruption?
=>
=> The important information missing is whether 000312 was already corrupted at
=> 01-Feb 20:11. You could add some commands to the startup part of


Hmmm....The only way that I could imagine that happening is if:

bacula loads the tape as needed

bacula reads the volume label

{somehow the tape is rewound, either when the tape is first loaded, or
after some backups are written}

bacula writes to tape

The only thing outside of bacula that touches the tape drive in any way is the
/etc/init.d/bacula-sd script, which unloads any tapes before starting the
daemon & after shutting down the daemon.

=> /etc/init.d/bacula-sd script before it unloads all tapes. E.g. do mt status,
=> mt rewind and grab a copy of the first few blocks on any loaded tapes.

Sure. I'm thinking that I may modify /opt/bacula/scripts/mtx-changer to
replace the "unload" operation with:

mt rewind
dd if=$TAPE of=/opt/bacula/working/dump_$VOLUMEID.`date '+%Y-%m-%d_%T'` ibs=64k count=1024
mtx -f $ctl load $slot $drive

Is that a suitable number of blocks to dump? I've got the dumps from 5
corrupted tapes, and I'm trying to see if they have anything in common (for
example, maybe the first 128k is corrupted, followed by valid data from dumps
that should have been appended to the tape).

=>
=> Also, you say that infrastructure1 server crashes. Maybe the crash caused the
=> tape to be rewound and some buffer flushed to start of the tape?

I can't see how...

if there was unwritten data in a buffer within the memory of the
server infrastructure1, then when the server crashes it wouldn't
get written to tape. The 'infrastucture' machines are part of
an HA cluster...in this crash, the other nodes determined that
infrastructure1 had lost communication with the quorum disk,
and they powered off the node...even if that action reset the
fibre loop and caused the tape library to rewind both tapes
(unlikely), I don't know how any buffers on the infrastructure1
server could be written when the power was out.

if there was unwritten data in a buffer within the memory of
the tape library, then I believe it must be written before any
rewind command will be honored. If infrastructure1 sends
data to the tape drive, that data is buffered, infrastructure1 then
crashes, infrastructure2 runs /etc/init.d/bacula-sd (which ejects tapes,
thereby rewinding them)...the data within the buffer in the tape
drive would still be written before the rewind/eject command was executed.

Thanks again for your help,

Mark

=>
=> __Martin
=>

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users < at > lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Display posts from previous:
Reply to topic Page 1 of 2
Goto page 1, 2  Next
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum
  


Magic SEO URL for phpBB