BackupPC recovery from unreliable disk
Post BackupPC recovery from unreliable disk 
I'm running Debian Squeeze stock backuppc-3.1.0-9 on a server and I'm
getting kernel messages [1] and SMART errors [2] about the WD 2TB SATA
disk. Fine, I RMA'd it and have the new one... Now what? I know I can
either 'dd' or start fresh. But...


If I start fresh, I know everything will work and be valid, but I
lose my historical backups when I wipe the bad disk and RMA it.


If I 'ddrescue' BAD --> GOOD, I'll worry about the integrity of the
BackupPC store. As I understand it, the incoming files are hashed and
stored, but the store itself is never checked (true?). So when I do
backups, if an incoming file hash matches a file already in the store,
the incoming file is "de-duped" and dropped. But what if the file
actually in the store is corrupt due to the bad disk?

Am I correct? If so, is there a way to have BackupPC validate that the
files in the pool actually match their hash and weren't mangled by the disk?


Any other solution I'm missing?

Thanks,
JP
___________________________________________
[1] Example kernel errors:

Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
kernel: [4020993.728571] end_request: I/O error, dev sda, sector 81203507
kernel: [4021009.712952] end_request: I/O error, dev sda, sector 81203507

System Events
=-=-=-=-=-=-=
kernel: [4020983.471256] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
kernel: [4020983.471290] ata3.00: BMDMA stat 0x25
kernel: [4020983.471315] ata3.00: failed command: READ DMA
kernel: [4020983.471347] ata3.00: cmd c8/00:18:33:11:d7/00:00:00:00:00/e4 tag 0 dma 12288 in
kernel: [4020983.471351]          res 51/40:07:33:11:d7/40:00:28:00:00/e4 Emask 0x9 (media error)
kernel: [4020983.471424] ata3.00: status: { DRDY ERR }
kernel: [4020983.471446] ata3.00: error: { UNC }
kernel: [4020983.501157] ata3.00: configured for UDMA/133


[2] Example SMART error:

Error 1704 occurred at disk power-on lifetime: 10149 hours (422 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 40 45 66 01 e0  Error: UNC 64 sectors at LBA = 0x00016645 = 91717

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 40 3f 66 01 e0 08  46d+13:36:50.242  READ DMA
  ec 00 00 00 00 00 a0 08  46d+13:36:50.233  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 08  46d+13:36:50.225  SET FEATURES [Set transfer mode]

----------------------------|:::======|-------------------------------
JP Vossen, CISSP            |:::======|      http://bashcookbook.com/
My Account, My Opinions     |=========|      http://www.jpsdomain.org/
----------------------------|=========|-------------------------------
"Microsoft Tax" = the additional hardware & yearly fees for the add-on
software required to protect Windows from its own poorly designed and
implemented self, while the overhead incidentally flattens Moore's Law.

_______________________________________________
BackupPC-users mailing list
BackupPC-users < at > lists.sourceforge.net
List: https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki: http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

Post BackupPC recovery from unreliable disk 
I know this doesn't help right now, but next time make sure your storage
platform doesn't depend on hardware reliability. Long term, there is no
such thing.

On the low end I recommend LVM over RAID1 for small systems and RAID6 for
bigger ones. High-end environments obviously have their SANs.

Just for future reference...

Post BackupPC recovery from unreliable disk 
On Wed, Dec 21, 2011 at 8:50 PM, JP Vossen <jp < at > jpsdomain.org> wrote:

> If I 'ddrescue' BAD --> GOOD, I'll worry about the integrity of the
> BackupPC store.  As I understand it, the incoming files are hashed and
> stored, but the store itself is never checked (true?).  So when I do
> backups, if an incoming file hash matches a file already in the store,
> the incoming file is "de-duped" and dropped.  But what if the file
> actually in the store is corrupt due to the bad disk?
>
> Am I correct?  If so, is there a way to have BackupPC validate that the
> files in the pool actually match their hash and weren't mangled by the disk?


I'm not 100% sure, but I think the file contents are compared in order
to detect hash collisions. Also, if you do not have rsync checksum
caching enabled, an rsync full will compare the local and remote
copies and make a new local file if there are differences. Even when
you do have caching enabled a small percentage of the local files are
read for the comparison on each run to detect corruption. If you want
to verify the data, you could restore all or part of a host to some
other location, then run rsync with the -anv options against the
original. The -n will make it a 'dry run' and not actually transfer
anything, and the -v option will make it list files where differences
are found.

--
Les Mikesell
lesmikesell < at > gmail.com


Post BackupPC recovery from unreliable disk 
JP Vossen wrote at about 21:50:29 -0500 on Wednesday, December 21, 2011:
> If I 'ddrescue' BAD --> GOOD, I'll worry about the integrity of the
> BackupPC store. As I understand it, the incoming files are hashed and
> stored, but the store itself is never checked (true?). So when I do
> backups, if an incoming file hash matches a file already in the store,
> the incoming file is "de-duped" and dropped. But what if the file
> actually in the store is corrupt due to the bad disk?

If the hash of a new file matches the hash of an existing pool file,
the contents are then compared, because a hash collision is always
possible: the file hash is a partial file md5sum based on the first
and last 128K slices plus the file size.


> Am I correct? If so, is there a way to have BackupPC validate that the
> files in the pool actually match their hash and weren't mangled by the disk?

Of course, there is no guarantee that the pool files themselves are
not corrupt. Checking the files against their pool file name hash can
rule out some file corruption but if the file size is unchanged and
the corruption is not in the first or last 128K slice then the hash
will be unchanged so any corruption won't be detectable.
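
To make that limitation concrete, here is a small sketch. It is a
simplified stand-in, not BackupPC's actual hashing code: the function
names partial_digest and corrupt_byte are made up, and the digest is
simply an MD5 over the file size plus the first and last 128K of the
file. It assumes a Linux system with GNU coreutils (md5sum, head, tail,
dd).

```shell
#!/bin/sh
# Simplified partial-file digest: MD5 over the file size, the first 128K
# and the last 128K. An illustration only, NOT BackupPC's exact algorithm.
SLICE=131072

partial_digest() {
    size=$(wc -c < "$1" | tr -d ' ')
    { printf '%s' "$size"; head -c "$SLICE" "$1"; tail -c "$SLICE" "$1"; } \
        | md5sum | cut -d' ' -f1
}

# Overwrite one byte of FILE at byte OFFSET without changing the file size.
corrupt_byte() {
    printf 'X' | dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}

work=$(mktemp -d)
# 1 MiB of zero bytes as a stand-in for a pool file.
dd if=/dev/zero of="$work/good" bs=1024 count=1024 2>/dev/null
# Damage one copy in the middle and another inside the first 128K.
cp "$work/good" "$work/mid"  && corrupt_byte "$work/mid" 500000
cp "$work/good" "$work/head" && corrupt_byte "$work/head" 100

echo "good: $(partial_digest "$work/good")"
echo "mid:  $(partial_digest "$work/mid")  (same digest, damage missed)"
echo "head: $(partial_digest "$work/head")  (different digest, damage caught)"
```

The copy damaged at offset 500000 keeps the same digest as the intact
file because that byte lies outside both 128K slices, while the copy
damaged at offset 100 gets a different digest.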

That being said, I have written several routines to both check and fix
the pool for corruption relative to the partial file md5sum pool file
name hash. Please search the archives where I have discussed and
posted the code...

Note that there have been bugs in BackupPC itself and also in various
pool libraries (specifically on arm5 processors) that cause relatively
innocuous errors in the pool file names relative to the actual
intended partial file md5sum hash.


Post BackupPC recovery from unreliable disk 
On 12/21/2011 09:50 PM, JP Vossen wrote:
> I'm running Debian Squeeze stock backuppc-3.1.0-9 on a server and I'm
> getting kernel messages [1] and SMART errors [2] about the WD 2TB SATA
> disk. Fine, I RMA'd it and have the new one... Now what? I know I can
> either 'dd' or start fresh. But...

I'm having some problem with getting list messages to my account, though
oddly password reset messages work fine and nothing jumps out at me from
my mail server log. Yes, I checked my spam folder too. I dunno... So
anyway, I read the replies on the ML archive site.

So, thanks to everyone who responded! For the record, while it sounds
like it would have been OK to use the suspect pool, I ended up doing this:

1) Move drives into a temp server and boot Ubuntu 10.10 LiveUSB
2) apt-get install gddrescue
3) time ddrescue --force -d /dev/sda /dev/sdb /tmp/ddrescue.log
   (took about 17 hours to copy 2TB via local SATA)
4) Rename the old BackupPC dir (for the history)
5) Create a new BackupPC dir and copy over host configs

It turned out that my BackupPC pool was smaller than I expected and I
had more free disk space than I thought. So starting fresh, while keeping
the ability to restore old files by changing symlinks around if I have
to, seems ideal.

As for using fault-tolerant hardware/RAID, etc, in this case that's not
an option. This is my personal backup server, running on the previous
generation http://www.system76.com/desktops/model/meerkat, and it fits
only a single 3.5" drive. I didn't use LVM since the OS is on a USB
drive and the hard drive in question is just one big /data/ partition.

Thanks again,
JP