Unocorrectable errors with RAID1

* Unocorrectable errors with RAID1
@ 2017-01-16 11:10 Christoph Groth
  2017-01-16 13:24 ` Austin S. Hemmelgarn
  2017-01-16 22:45 ` Goldwyn Rodrigues
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Groth @ 2017-01-16 11:10 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7876 bytes --]

Hi,

I’ve been using a btrfs RAID1 of two hard disks since early 2012 
on my home server.  The machine has been working well overall, but 
recently some problems with the file system surfaced.  Since I do 
have backups, I do not worry about the data, but I post here to 
better understand what happened.  Also I cannot exclude that my 
case is useful in some way to btrfs development.

First some information about the system:

root@mim:~# uname -a
Linux mim 4.6.0-1-amd64 #1 SMP Debian 4.6.3-1 (2016-07-04) x86_64 
GNU/Linux
root@mim:~# btrfs --version
btrfs-progs v4.7.3
root@mim:~# btrfs fi show
Label: none  uuid: 2da00153-f9ea-4d6c-a6cc-10c913d22686
	Total devices 2 FS bytes used 345.97GiB
	devid    1 size 465.29GiB used 420.06GiB path /dev/sda2
	devid    2 size 465.29GiB used 420.04GiB path /dev/sdb2

root@mim:~# btrfs fi df /
Data, RAID1: total=417.00GiB, used=344.62GiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=40.00MiB, used=68.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=3.00GiB, used=1.35GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=464.00MiB, used=0.00B
root@mim:~# dmesg | grep -i btrfs
[    4.165859] Btrfs loaded
[    4.481712] BTRFS: device fsid 
2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 1 transid 2075354 
/dev/sda2
[    4.482025] BTRFS: device fsid 
2da00153-f9ea-4d6c-a6cc-10c913d22686 devid 2 transid 2075354 
/dev/sdb2
[    4.521090] BTRFS info (device sdb2): disk space caching is 
enabled
[    4.628506] BTRFS info (device sdb2): bdev /dev/sdb2 errs: wr 
0, rd 0, flush 0, corrupt 3, gen 0
[    4.628521] BTRFS info (device sdb2): bdev /dev/sda2 errs: wr 
0, rd 0, flush 0, corrupt 3, gen 0
[   18.315694] BTRFS info (device sdb2): disk space caching is 
enabled

The disks themselves have been turning for almost 5 years by now, 
but their SMART health is still fully satisfactory.

I noticed that something was wrong because printing stopped to 
work.  So I did a scrub that detected 0 "correctable errors" and 6 
"uncorrectable" errors.  The relevant bits from kern.log are:

Jan 11 11:05:56 mim kernel: [159873.938579] BTRFS warning (device 
sdb2): checksum error at logical 180829634560 on dev /dev/sdb2, 
sector 353143968, root 5, inode 10014144, offset 221184, length 
4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:05:57 mim kernel: [159874.857132] BTRFS warning (device 
sdb2): checksum error at logical 180829634560 on dev /dev/sda2, 
sector 353182880, root 5, inode 10014144, offset 221184, length 
4096, links 1 (path: usr/lib/x86_64-linux-gnu/libcups.so.2)
Jan 11 11:28:42 mim kernel: [161240.083721] BTRFS warning (device 
sdb2): checksum error at logical 260254629888 on dev /dev/sda2, 
sector 508309824, root 5, inode 9990924, offset 6676480, length 
4096, links 1 (path: 
var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:28:42 mim kernel: [161240.235837] BTRFS warning (device 
sdb2): checksum error at logical 260254638080 on dev /dev/sda2, 
sector 508309840, root 5, inode 9990924, offset 6684672, length 
4096, links 1 (path: 
var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.725120] BTRFS warning (device 
sdb2): checksum error at logical 260254629888 on dev /dev/sdb2, 
sector 508270912, root 5, inode 9990924, offset 6676480, length 
4096, links 1 (path: 
var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)
Jan 11 11:37:21 mim kernel: [161759.750251] BTRFS warning (device 
sdb2): checksum error at logical 260254638080 on dev /dev/sdb2, 
sector 508270928, root 5, inode 9990924, offset 6684672, length 
4096, links 1 (path: 
var/lib/apt/lists/ftp.fr.debian.org_debian_dists_unstable_main_binary-amd64_Packages)

As you can see each disk has the same three errors, and there are 
no other errors.  Random bad blocks cannot explain this situation. 
I asked on #btrfs and someone suggested that these errors are 
likely due to RAM problems.  This may indeed be the case, since 
the machine has no ECC.  I managed to fix these errors by 
replacing the broken files with good copies.  Scrubbing shows no 
errors now:

root@mim:~# btrfs scrub status /
scrub status for 2da00153-f9ea-4d6c-a6cc-10c913d22686
	scrub started at Sat Jan 14 12:52:03 2017 and finished 
	after 01:49:10
	total bytes scrubbed: 699.17GiB with 0 errors

However, there are further problems.  When trying to archive the 
full filesystem I noticed that some files/directories cannot be 
read.  (The problem is localized to some ".git" directory that I 
don’t need.)  Any attempt to read the broken files (or to delete 
them) does not work:

$ du -sh .git
du: cannot access 
'.git/objects/28/ea2aae3fe57ab4328adaa8b79f3c1cf005dd8d': No such 
file or directory
du: cannot access 
'.git/objects/28/fd95a5e9d08b6684819ce6e3d39d99e2ecccd5': Stale 
file handle
du: cannot access 
'.git/objects/28/52e887ed436ed2c549b20d4f389589b7b58e09': Stale 
file handle
du: cannot access '.git/objects/info': Stale file handle
du: cannot access '.git/objects/pack': Stale file handle

During the above command the following lines were added to 
kern.log:

Jan 16 09:41:34 mim kernel: [132206.957566] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.957924] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.958505] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.958971] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.959534] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.959874] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.960523] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15
Jan 16 09:41:34 mim kernel: [132206.960943] BTRFS critical (device 
sda2): corrupt leaf, slot offset bad: block=192561152,root=1, 
slot=15

So I tried to repair the file system by running "btrfs check 
--repair", but this doesn’t work:

(initramfs) btrfs --version
btrfs-progs v4.7.3
(initramfs) btrfs check --repair /dev/sda2
UUID: ...
checking extents
incorrect offsets 2527 2543
items overlap, can't fix
cmds-check.c:4297: fix_item_offset: Assertion `ret` failed.
btrfs[0x41a8b4]
btrfs[0x41a8db]
btrfs[0x42428b]
btrfs[0x424f83]
btrfs[0x4259cd]
btrfs(cmd_check+0x1111)[0x427d6d]
btrfs(main+0x12f)[0x40a341]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fd98859d2b1]
btrfs(_start+0x2a)[0x40a37a]

I now have the following questions:

* So scrubbing is not enough to check the health of a btrfs file 
  system?  It’s also necessary to read all the files?

* Any ideas what coud have caused the "stale file handle" errors? 
  Is there any way to fix them?  Of course RAM errors can in 
  principle have _any_ consequences, but I would have hoped that 
  even without ECC RAM it’s practically inpossible to end up with 
  an unrepairable file system.  Perhaps I simply had very bad 
  luck.

* I believe that btrfs RAID1 is considered reasonably safe for 
  production use by now.  I want to replace that home server with 
  a new machine (still without ECC).  Is it a good idea to use 
  btrfs for the main file system?  I would certainly hope so! :-)

Thanks for your time,
Christoph

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread