* Repair broken btrfs raid6?
@ 2015-02-09 22:45 Tobias Holst
  2015-02-10  3:36 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tobias Holst @ 2015-02-09 22:45 UTC
  To: linux-btrfs

Hi

I'm having some trouble with my six-drive btrfs raid6 (each drive
encrypted with LUKS). First of all: yes, I do have backups, but it may
take at least days, maybe weeks or even some months to restore
everything from the (offsite) backups. So it is not essential to
recover the data, but it would be great ;-)

OS: Ubuntu 14.04
Kernel: 3.19.0
btrfs-progs: 3.19-rc2

When booting my server I am getting this in the syslog:
> [    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
> [    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
> [    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
> [    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
> [    8.555570] BTRFS info (device dm-3): force lzo compression
> [    8.555574] BTRFS info (device dm-3): disk space caching is enabled
> [    8.556310] BTRFS: failed to read the system array on dm-3
> [    8.592135] BTRFS: open_ctree failed
> [    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
> [    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
Looks like there is something wrong on drive 3, giving me "open_ctree
failed". I have to press "S" to skip mounting of the btrfs volume. It
boots and with "sudo mount --all" I can successfully mount the btrfs
volume. Sometimes it takes one or two minutes but it will mount.
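
One way to read the boot log above is to pair each device-mapper node with the btrfs devid reported in its scan line. The following is only a rough sketch: the helper name `parse_boot_log` and the regular expressions are made up for illustration, and the sample text is simply the scan and failure lines from the log above. Note that the `dm-N` number is the device-mapper node, not the btrfs devid, and that node numbering can differ between boots.

```python
import re

# Device-scan and mount-failure lines copied from the boot log above.
BOOT_LOG = """\
[    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
[    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
[    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
[    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
[    8.556310] BTRFS: failed to read the system array on dm-3
[    8.592135] BTRFS: open_ctree failed
[    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
[    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
"""

# Hypothetical patterns, written to match the lines quoted in this thread.
SCAN_RE = re.compile(
    r"\[\s*(\d+\.\d+)\] BTRFS: device label \S+ devid (\d+) transid \d+ /dev/(dm-\d+)"
)
FAIL_RE = re.compile(r"\[\s*(\d+\.\d+)\] BTRFS: open_ctree failed")


def parse_boot_log(text):
    """Return (dm node -> devid, dm node -> scan time, time of first failed mount)."""
    devid_of = {}
    scanned_at = {}
    failed_at = None
    for line in text.splitlines():
        scan = SCAN_RE.search(line)
        if scan:
            timestamp, devid, node = scan.groups()
            devid_of[node] = int(devid)
            scanned_at[node] = float(timestamp)
            continue
        fail = FAIL_RE.search(line)
        if fail and failed_at is None:
            failed_at = float(fail.group(1))
    return devid_of, scanned_at, failed_at


devid_of, scanned_at, failed_at = parse_boot_log(BOOT_LOG)

# Nodes whose scan line appears only after the mount attempt had failed.
late_nodes = sorted(
    node for node, t in scanned_at.items()
    if failed_at is not None and t > failed_at
)

for node in sorted(devid_of):
    print(f"{node}: devid {devid_of[node]}")
print("scanned after open_ctree failed:", late_nodes)
```

With these lines, dm-3 carries devid 4 rather than devid 3, and the scan lines for dm-4 and dm-5 come after the "open_ctree failed" message. Whether that ordering is what makes the boot-time mount fail is an assumption, not something the log proves.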

After a while I am sometimes/randomly getting this in the syslog:
> [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
Looks like something else is broken on dm-5... But shouldn't this be
repaired with the new raid56-repair-features of kernel 3.19?

After some more time I am getting this:
> [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
Then it is not possible to access the mounted volume anymore. I have
to "umount -l" to unmount it and then I can remount it. Until it
happens again (after some time)...

I also tried a balance and a scrub but they "crash". Syslog is full of
messages like the following examples:
> [ 3355.523157] csum_tree_block: 53 callbacks suppressed
> [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
> [ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
and "btrfs scrub status /[device]" gives me the following output:
> "scrub status for [UUID]
>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>        total bytes scrubbed: 113.04GiB with 0 errors"

So a short summary:
- btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
- does not mount at boot up, "open_ctree failed" (disk 3)
- mounts successfully after bootup
- randomly "checksum verify failed" (disk 5)
- balance and scrub crash after some time
- after a while the volume gets unreadable, saying "parent transid
verify failed" (disk 4 or 5)
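
The "parent transid verify failed" and "checksum verify failed" lines quoted above can be tallied with a few lines of Python. This is a sketch only: the function name `summarize` and both patterns are invented here and match just the message formats shown in this mail. In the transid messages, "wanted" is the generation the parent pointer expects and "found" is the generation stored in the block that was read, so the difference shows how far behind the on-disk block is.

```python
import re
from collections import Counter

# Lines copied from the syslog excerpts above.
SYSLOG = """\
[ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
[637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
[ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
[ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
"""

# Hypothetical patterns for the two message types seen in this thread.
TRANSID_RE = re.compile(
    r"BTRFS \(device (dm-\d+)\): parent transid verify failed on (\d+) "
    r"wanted (\d+) found (\d+)"
)
CSUM_RE = re.compile(r"BTRFS: (dm-\d+) checksum verify failed on (\d+)")


def summarize(text):
    """Return (list of (node, bytenr, generations behind), checksum failures per node)."""
    stale = []
    csum_failures = Counter()
    for line in text.splitlines():
        m = TRANSID_RE.search(line)
        if m:
            node, bytenr, wanted, found = m.groups()
            stale.append((node, int(bytenr), int(wanted) - int(found)))
            continue
        m = CSUM_RE.search(line)
        if m:
            csum_failures[m.group(1)] += 1
    return stale, csum_failures


stale_blocks, csum_failures = summarize(SYSLOG)
for node, bytenr, behind in stale_blocks:
    print(f"{node}: block {bytenr} is {behind} generation(s) behind")
print(dict(csum_failures))
```

A positive difference means the block on disk is older than the tree expects, which is commonly read as a write that never reached the device; that reading is an interpretation, not a diagnosis.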

And it looks like there still is no way to btrfsck a raid6.

Any ideas how to repair this filesystem?

Regards,
Tobias


* Re: Repair broken btrfs raid6?
  2015-02-09 22:45 Repair broken btrfs raid6? Tobias Holst
@ 2015-02-10  3:36 ` Duncan
  2015-02-10  7:17 ` Kai Krakow
  2015-02-11 14:46 ` Tobias Holst
  2 siblings, 0 replies; 14+ messages in thread
From: Duncan @ 2015-02-10  3:36 UTC
  To: linux-btrfs

Tobias Holst posted on Mon, 09 Feb 2015 23:45:21 +0100 as excerpted:

> So a short summary:
> - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
> - does not mount at boot up, "open_ctree failed" (disk 3)
> - mounts successfully after bootup
> - randomly "checksum verify failed" (disk 5)
> - balance and scrub crash after some time
> - after a while the volume gets unreadable, saying "parent transid
> verify failed" (disk 4 or 5)
> 
> And it looks like there still is no way to btrfsck a raid6.
> 
> Any ideas how to repair this filesystem?

(As a btrfs user/sysadmin and a list regular, not a dev, and not yet 
brave enough to try raid5/6 modes here...)

Btrfs raid6 should indeed be generally working in 3.19, including repair, 
yes.  Certainly, it's much closer to working than anything previous.

However, that code, while it actually exists now and is I believe in 
theory complete, is still very VERY new, and thus, it can be expected to 
be still quite buggy.  I've been telling people not to expect it to 
actually work for another kernel cycle (3.20), and even then, don't 
expect it to be as stable as the raid0/1/10 code, which after all has 
been in actual use for (well) over a year now, and thus has had a chance 
to have even many of the not immediately obvious bugs show up and get
worked out.  That'll take several more kernel cycles -- I've been 
suggesting that people not consider the raid56 code as stable as the 
earlier raid forms for another two cycles (3.22) at least.

HOWEVER, without claiming to speak for the devs working on it themselves, 
now that the code is actually there and it's time to start exterminating 
bugs in it, I expect they'll be very interested in your bug report, and 
if you're prepared to spend the time working thru it with them, applying 
patches, etc, you could well find your bugs fixed and be back operational 
before 3.20 or whatever. =:^)

Meanwhile, there's actually an integration branch with even newer code 
that hasn't hit release yet.  Given the still very new state of the 
btrfs raid56 mode code, if you're already brave enough to be running raid6
mode and are having problems, your chances with integration are likely to 
be even better than with current release.  Of course it could break 
things worse too, but if you're already running raid56 mode I guess 
you're already prepared for that, and are either testing with throw-away 
data or data that's already well backed up, such that you're prepared to 
lose the btrfs raid6 copy of it in any case, so you might as well try 
integration...

See the wiki or other posts for the integration branch repos.  (As I said 
above I'm not brave enough to try raid56 yet, nor have I tried 
integration, so I don't have the links handy.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Repair broken btrfs raid6?
  2015-02-09 22:45 Repair broken btrfs raid6? Tobias Holst
  2015-02-10  3:36 ` Duncan
@ 2015-02-10  7:17 ` Kai Krakow
  2015-02-10 13:15   ` Ed Tomlinson
  2015-02-10 18:18   ` Tobias Holst
  2015-02-11 14:46 ` Tobias Holst
  2 siblings, 2 replies; 14+ messages in thread
From: Kai Krakow @ 2015-02-10  7:17 UTC
  To: linux-btrfs

Tobias Holst <tobby@tobby.eu> wrote:

> and "btrfs scrub status /[device]" gives me the following output:
>> "scrub status for [UUID]
>>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>>        total bytes scrubbed: 113.04GiB with 0 errors"

That does not look right to me:

Why should a scrub of a six-drive btrfs array which is probably multiple
terabytes big (as you state a restore from backup would take days) take only
~2000 seconds, and scrub only ~120 GB worth of data? Either your 6 devices
are really small (then why RAID-6), or your data is very sparse (then why
does it take so long), or scrub aborts prematurely and never checks the
complete devices (I guess this is it).

And that's what it actually says: "aborted after 2008 seconds". I'd expect
"finished after XXXX seconds" if I remember my scrub runs correctly (I
currently don't run them regularly because they take a long time and IO
performance sucks while one is running).

-- 
Replies to list only preferred.



* Re: Repair broken btrfs raid6?
  2015-02-10  7:17 ` Kai Krakow
@ 2015-02-10 13:15   ` Ed Tomlinson
  2015-02-13  1:12     ` Kai Krakow
  2015-02-10 18:18   ` Tobias Holst
  1 sibling, 1 reply; 14+ messages in thread
From: Ed Tomlinson @ 2015-02-10 13:15 UTC
  To: Kai Krakow; +Cc: linux-btrfs

On Tuesday, February 10, 2015 2:17:43 AM EST, Kai Krakow wrote:
> Tobias Holst <tobby@tobby.eu> wrote:
>
>> and "btrfs scrub status /[device]" gives me the following output:
>>> "scrub status for [UUID]
>>>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>>>        total bytes scrubbed: 113.04GiB with 0 errors"
>
> That does not look right to me:
>
> Why should a scrub of a six-drive btrfs array which is probably multiple
> terabytes big (as you state a restore from backup would take days) take only
> ~2000 seconds, and scrub only ~120 GB worth of data? Either your 6 devices
> are really small (then why RAID-6), or your data is very sparse (then why
> does it take so long), or scrub aborts prematurely and never checks the
> complete devices (I guess this is it).
>
> And that's what it actually says: "aborted after 2008 seconds". I'd expect
> "finished after XXXX seconds" if I remember my scrub runs correctly (I
> currently don't run them regularly because they take a long time and IO
> performance sucks while one is running).

IO performance does suffer during a scrub.  I use the following:

ionice -c 3 btrfs scrub start -Bd -n 19 /<target>

The combo of -n19 and ionice makes it workable here.  

Tobias, why do you think btrfsck does not work on raid6?  It runs fine
here on raid5.





* Re: Repair broken btrfs raid6?
  2015-02-10  7:17 ` Kai Krakow
  2015-02-10 13:15   ` Ed Tomlinson
@ 2015-02-10 18:18   ` Tobias Holst
  1 sibling, 0 replies; 14+ messages in thread
From: Tobias Holst @ 2015-02-10 18:18 UTC
  To: Kai Krakow; +Cc: linux-btrfs

2015-02-10 8:17 GMT+01:00 Kai Krakow <hurikhan77@gmail.com>:
> Tobias Holst <tobby@tobby.eu> wrote:
>
>> and "btrfs scrub status /[device]" gives me the following output:
>>> "scrub status for [UUID]
>>>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>>>        total bytes scrubbed: 113.04GiB with 0 errors"
>
> That does not look right to me:
>
> Why should a scrub of a six-drive btrfs array which is probably multiple
> terabytes big (as you state a restore from backup would take days) take only
> ~2000 seconds, and scrub only ~120 GB worth of data? Either your 6 devices
> are really small (then why RAID-6), or your data is very sparse (then why
> does it take so long), or scrub aborts prematurely and never checks the
> complete devices (I guess this is it).

Yes, sorry, I didn't post an output of "btrfs filesystem show" - but here it is:

Label: 'tobby-btrfs'  uuid: b689ab76-7ff5-434c-a2c6-03efb45faa46
        Total devices 6 FS bytes used 13.13TiB
        devid    1 size 3.64TiB used 3.28TiB path /dev/mapper/sde_crypt
        devid    2 size 3.64TiB used 3.28TiB path /dev/mapper/sdd_crypt
        devid    3 size 3.64TiB used 3.28TiB path /dev/mapper/sdf_crypt
        devid    4 size 3.64TiB used 3.28TiB path /dev/mapper/sda_crypt
        devid    5 size 3.64TiB used 3.28TiB path /dev/mapper/sdb_crypt
        devid    6 size 3.64TiB used 3.28TiB path /dev/mapper/sdc_crypt
btrfs-progs v3.19-rc2

So there are ~13TiB of data on this raid6 - but like it says it was
"aborted" after 2008 seconds (about half an hour) and ~120GB of data.
Then a "parent transid verify failed" happened, the volume got
unreadable and the scrub was aborted. Until a remount of the btrfs -
and until it happens again...
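
To put those numbers side by side, here is a back-of-the-envelope sketch. It assumes that "total bytes scrubbed" can be compared directly with "FS bytes used" and that the scrub rate would have stayed constant, neither of which is guaranteed; the function name `scrub_progress` is made up for illustration.

```python
GIB_PER_TIB = 1024
MIB_PER_GIB = 1024


def scrub_progress(scrubbed_gib, elapsed_s, used_tib):
    """Return (fraction of used data scrubbed, rate in MiB/s, projected full-scrub hours)."""
    fraction = scrubbed_gib / (used_tib * GIB_PER_TIB)
    rate_mib_s = scrubbed_gib * MIB_PER_GIB / elapsed_s
    # Projection assumes the observed rate holds for the whole filesystem.
    projected_hours = elapsed_s / fraction / 3600
    return fraction, rate_mib_s, projected_hours


# 113.04 GiB scrubbed in 2008 s ("btrfs scrub status"),
# 13.13 TiB used ("btrfs filesystem show").
fraction, rate_mib_s, projected_hours = scrub_progress(113.04, 2008, 13.13)
print(f"scrubbed {fraction:.2%} of used data at {rate_mib_s:.1f} MiB/s")
print(f"a full pass at that rate would take roughly {projected_hours:.0f} hours")
```

With the figures from this thread that is less than 1% of the used data at somewhat under 60 MiB/s, so a complete pass would be on the order of days rather than half an hour.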

>
> And that's what it actually says: "aborted after 2008" seconds. I'd expect
> "finished after XXXX seconds" if I remember my scrub runs correctly (which I
> currently don't do regularly because it takes long and IO performance sucks
> during running it).


* Re: Repair broken btrfs raid6?
  2015-02-09 22:45 Repair broken btrfs raid6? Tobias Holst
  2015-02-10  3:36 ` Duncan
  2015-02-10  7:17 ` Kai Krakow
@ 2015-02-11 14:46 ` Tobias Holst
  2015-02-12  9:16   ` Liu Bo
  2 siblings, 1 reply; 14+ messages in thread
From: Tobias Holst @ 2015-02-11 14:46 UTC
  To: linux-btrfs

Hmm, it looks like it is getting worse... Here are some parts of my
syslog, including two crashed btrfs-threads:

So I am still getting many of these:
> BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
> BTRFS warning (device dm-5): page private not zero on page 25033166798848
> BTRFS warning (device dm-5): page private not zero on page 25033166802944
> BTRFS warning (device dm-5): page private not zero on page 25033166807040
> BTRFS warning (device dm-5): page private not zero on page 25033166811136
> BTRFS info (device dm-5): force lzo compression
> BTRFS info (device dm-5): disk space caching is enabled
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0

Then there is this crash of "super"/btrfs_abort_transaction:
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 30526 at /home/kernel/COD/linux/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x5f/0x140 [btrfs]()
> BTRFS: Transaction aborted (error -5)
> Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
> CPU: 0 PID: 30526 Comm: kworker/u16:6 Tainted: G        W   E  3.19.0-031900-generic #201502091451
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> 0000000000000104 ffff880002743c18 ffffffff817c4c00 0000000000000007
> ffff880002743c68 ffff880002743c58 ffffffff81076e87 ffff880002743c58
> ffff88020a8694d0 ffff8801fb715800 00000000fffffffb 0000000000000ae8
> Call Trace:
> [<ffffffff817c4c00>] dump_stack+0x45/0x57
> [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
> [<ffffffff81076f86>] warn_slowpath_fmt+0x46/0x50
> [<ffffffffc06375cf>] __btrfs_abort_transaction+0x5f/0x140 [btrfs]
> [<ffffffffc0655105>] btrfs_run_delayed_refs.part.82+0x175/0x290 [btrfs]
> [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
> [<ffffffffc0655507>] delayed_ref_async_start+0x37/0x90 [btrfs]
> [<ffffffffc069720e>] normal_work_helper+0x7e/0x1b0 [btrfs]
> [<ffffffffc0697572>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
> [<ffffffff8108f76d>] process_one_work+0x14d/0x460
> [<ffffffff8109014b>] worker_thread+0x11b/0x3f0
> [<ffffffff81090030>] ? create_worker+0x1e0/0x1e0
> [<ffffffff81095d59>] kthread+0xc9/0xe0
> [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> ---[ end trace dd65465954546462 ]---
> BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
> BTRFS info (device dm-5): forced readonly

and this crash of "delayed-ref"/btrfs_select_ref_head:
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 3159 at /home/kernel/COD/linux/fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0x120/0x130 [btrfs]()
> Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
> CPU: 7 PID: 3159 Comm: btrfs-transacti Tainted: G        W   E  3.19.0-031900-generic #201502091451
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> 00000000000001b6 ffff8801cb687c48 ffffffff817c4c00 0000000000000007
> 0000000000000000 ffff8801cb687c88 ffffffff81076e87 0000000000000001
> ffff8801fe80bf00 0000000000000000 ffff8801fe80bfc8 ffff8802345d8280
> Call Trace:
> [<ffffffff817c4c00>] dump_stack+0x45/0x57
> [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
> [<ffffffff81076eea>] warn_slowpath_null+0x1a/0x20
> [<ffffffffc06b2d40>] btrfs_select_ref_head+0x120/0x130 [btrfs]
> [<ffffffffc0652cd1>] __btrfs_run_delayed_refs+0x1e1/0x5f0 [btrfs]
> [<ffffffffc0654ffa>] btrfs_run_delayed_refs.part.82+0x6a/0x290 [btrfs]
> [<ffffffffc0664e5c>] ? join_transaction.isra.31+0x13c/0x380 [btrfs]
> [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
> [<ffffffffc0665e50>] btrfs_commit_transaction+0xb0/0xa70 [btrfs]
> [<ffffffffc0663d95>] transaction_kthread+0x1d5/0x250 [btrfs]
> [<ffffffffc0663bc0>] ? open_ctree+0x1f40/0x1f40 [btrfs]
> [<ffffffff81095d59>] kthread+0xc9/0xe0
> [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> ---[ end trace dd65465954546463 ]---
> BTRFS warning (device dm-5): Skipping commit of aborted transaction.
> BTRFS: error (device dm-5) in cleanup_transaction:1670: errno=-5 IO failure


Any thoughts? Would it help to unplug the "dm-5" device, which seems to
be causing these errors, and then balance the array?

Regards,
Tobias



* Re: Repair broken btrfs raid6?
  2015-02-11 14:46 ` Tobias Holst
@ 2015-02-12  9:16   ` Liu Bo
  2015-02-12 23:22     ` Tobias Holst
  0 siblings, 1 reply; 14+ messages in thread
From: Liu Bo @ 2015-02-12  9:16 UTC
  To: Tobias Holst; +Cc: linux-btrfs

On Wed, Feb 11, 2015 at 03:46:33PM +0100, Tobias Holst wrote:
> Hmm, it looks like it is getting worse... Here are some parts of my
> syslog, including two crashed btrfs-threads:
> 
> So I am still getting many of these:
> > BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
> > BTRFS warning (device dm-5): page private not zero on page 25033166798848
> > BTRFS warning (device dm-5): page private not zero on page 25033166802944
> > BTRFS warning (device dm-5): page private not zero on page 25033166807040
> > BTRFS warning (device dm-5): page private not zero on page 25033166811136

First, we should probably make sure that your device is set up properly, since
these messages usually occur after a drive is removed (i.e. the device is somehow
dropping writes). The -EIO in your log also implies that btrfs cannot read data
from or write data to that drive.

And in theory, RAID6 can tolerate two drive failures, so what were your mkfs.btrfs options?

Thanks,

-liubo


* Re: Repair broken btrfs raid6?
  2015-02-12  9:16   ` Liu Bo
@ 2015-02-12 23:22     ` Tobias Holst
  2015-02-13  8:06       ` Liu Bo
  0 siblings, 1 reply; 14+ messages in thread
From: Tobias Holst @ 2015-02-12 23:22 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

Hi

I don't remember the exact mkfs.btrfs options anymore but
> ls /sys/fs/btrfs/[UUID]/features/
shows the following output:
> big_metadata  compress_lzo  extended_iref  mixed_backref  raid56

I also tested my device with a short
> hdparm -tT /dev/dm5
and got
> /dev/mapper/sdc_crypt:
>  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
>  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec

Looks ok to me. Should I test more?

I bought a few new hard drives so currently I am copying all my data
to a second (faster) backup, so I can maybe overwrite the current file
system, if it's not repairable.

Regards,
Tobias



* Re: Repair broken btrfs raid6?
  2015-02-10 13:15   ` Ed Tomlinson
@ 2015-02-13  1:12     ` Kai Krakow
  0 siblings, 0 replies; 14+ messages in thread
From: Kai Krakow @ 2015-02-13  1:12 UTC (permalink / raw)
  To: linux-btrfs

Ed Tomlinson <edt@aei.ca> wrote:

> On Tuesday, February 10, 2015 2:17:43 AM EST, Kai Krakow wrote:
>> Tobias Holst <tobby@tobby.eu> wrote:
>>
>>> and "btrfs scrub status /[device]" gives me the following output:
>>>> "scrub status for [UUID]
>>>> scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008
>>>> seconds total bytes scrubbed: 113.04GiB with 0 errors"
>>
>> That does not look right to me:
>>
>> Why should a scrub of a six-drive btrfs array, which is probably
>> multiple terabytes big (as you state a restore from backup would take
>> days), take only ~2000 seconds? And scrub only ~120 GB worth of data?
>> Either your 6 devices are really small (then why RAID-6), or your data
>> is very sparse (then why does it take so long), or scrub prematurely
>> aborts and never checks the complete devices (I guess this is it).
>>
>> And that's what it actually says: "aborted after 2008 seconds". I'd
>> expect "finished after XXXX seconds" if I remember my scrub runs
>> correctly (which I currently don't do regularly because it takes long
>> and IO performance sucks while it is running).
> 
> IO performance does suffer during a scrub.  I use the following:
> 
> ionice -c 3 btrfs scrub start -Bd -n 19 /<target>

Doesn't work for deadline scheduler... Although, when my btrfs was still 
fresh (and already had a lot of data), I hardly noticed a running scrub in 
the background. But since I did one balance, everything sucks IO 
performance-wise.

Off-topic but maybe interesting in this regard:

Meanwhile, I switched away from deadline (which served me better than CFQ at
that time) and am running with the BFQ scheduler. It works really nicely,
though booting is slower and application startup is a little less snappy. But
it has coped with background IO much better since the "balance incident".

I went one step further and deployed bcache into the setup and everything is 
really snappy now. So I'm playing with the thought of re-enabling a 
regularly running scrub. But I still need to figure out if it would or 
wouldn't destroy the bcache hit ratio and fill bcache with non-relevant 
data.

And thinking further about it: I'm not sure if btrfs RAID protection and
scrub make much sense at all with bcache in between... Due to the nature of
bcache, errors may slip through undetected until the bcache LRU forces
cached good copies out of the cache. If this data isn't dirty, it won't be
written back to the HDD. In that case there are three possible outcomes: the
associated blocks on HDD are in perfect shape, one copy is rotten and one is
good, or both are rotten. In the last case, btrfs can no longer help me
there... Scrub may not have caught those as the good copies were in bcache
until shortly before. I wonder if bcache should have a policy for writing
back even non-dirty blocks if they are evicted from the cache...

> The combo of -n19 and ionice makes it workable here.

Yeah, should work here, too, now that I'm using BFQ. But then again, I am
not sure: the bcache frontend runs on an SSD whose block device uses the
deadline scheduler. My bcache backends run on HDDs with the BFQ
scheduler. The virtual bcache partitions sitting in between both
magically set themselves to the noop scheduler (or maybe it even shows
"none", I'm not sure) - which is intended, I guess.

So kernel access probably goes like this:

---> bcache0-2[noop] <---> phys. SSD [deadline] **
         |
          `---> phys. HDD 1-3 [bfq], mraid-1, draid-0 
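
To check which scheduler each device in a stack like the one sketched above actually uses, something like the following might help. It is only a minimal sketch: it assumes the usual sysfs layout, where /sys/block/<dev>/queue/scheduler lists the available schedulers with the active one in [brackets], and the helper name active_sched is made up for this example.

```shell
# Extract the active scheduler from the contents of a queue/scheduler file.
# The kernel marks the active one with brackets, e.g. "noop deadline [cfq]".
# (active_sched is a hypothetical helper name, not an existing tool.)
active_sched() {
    case "$1" in
        *\[*\]*) printf '%s\n' "$1" | sed 's/.*\[\([^]]*\)].*/\1/' ;;
        *)       printf '%s\n' "$1" ;;   # no brackets, e.g. a plain "none"
    esac
}

# Walk all block devices and print "<device>: <active scheduler>".
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue              # skip if the glob matched nothing
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_sched "$(cat "$f")")"
done
```

As far as I know, ionice classes are only honoured by schedulers that implement I/O priorities (CFQ, and BFQ as well), so this also shows at a glance where "ionice -c 3" can have any effect.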

So, I guess some block accesses pass through two schedulers if access to
both devices is needed (frontend and backend), with bcache acting as a huge
block-sorting scheduler itself (which is what makes its performance). But
for scrub, the deadline scheduler may become the dominating scheduler,
which brings me back to the situation I had at the start while running
scrub.

** --> maybe I should put noop here, too

Does my thought experiment make sense?

-- 
Replies to list only preferred.


* Re: Repair broken btrfs raid6?
  2015-02-12 23:22     ` Tobias Holst
@ 2015-02-13  8:06       ` Liu Bo
  2015-02-13 18:26         ` Tobias Holst
  0 siblings, 1 reply; 14+ messages in thread
From: Liu Bo @ 2015-02-13  8:06 UTC (permalink / raw)
  To: Tobias Holst; +Cc: linux-btrfs

On Fri, Feb 13, 2015 at 12:22:16AM +0100, Tobias Holst wrote:
> Hi
> 
> I don't remember the exact mkfs.btrfs options anymore but
> > ls /sys/fs/btrfs/[UUID]/features/
> shows the following output:
> > big_metadata  compress_lzo  extended_iref  mixed_backref  raid56

Well... mkfs.btrfs takes '-m' for the metadata profile and '-d' for the
data profile, and the default profile for metadata is RAID1, so we can't
tell yet whether your metadata is RAID1 or RAID6. If it is RAID1 and both
copies are corrupted, then please use your backup.
> 
> I also tested my device with a short
> > hdparm -tT /dev/dm5
> and got
> > /dev/mapper/sdc_crypt:
> >  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
> >  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec
> 
> Looks ok to me. Should I test more?

Okay, looks good.

> 
> I bought a few new hard drives so currently I am copying all my data
> to a second (faster) backup, so I can maybe overwrite the current file
> system, if it's not repairable.

Another question: have you tried "mount -o recovery"? Did it work?

Thanks,

-liubo

* Re: Repair broken btrfs raid6?
  2015-02-13  8:06       ` Liu Bo
@ 2015-02-13 18:26         ` Tobias Holst
  2015-02-13 21:54           ` Tobias Holst
  0 siblings, 1 reply; 14+ messages in thread
From: Tobias Holst @ 2015-02-13 18:26 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

2015-02-13 9:06 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
> On Fri, Feb 13, 2015 at 12:22:16AM +0100, Tobias Holst wrote:
>> Hi
>>
>> I don't remember the exact mkfs.btrfs options anymore but
>> > ls /sys/fs/btrfs/[UUID]/features/
>> shows the following output:
>> > big_metadata  compress_lzo  extended_iref  mixed_backref  raid56
>
> Well... mkfs.btrfs takes '-m' for the metadata profile and '-d' for the
> data profile, and the default profile for metadata is RAID1, so we can't
> tell yet whether your metadata is RAID1 or RAID6. If it is RAID1 and both
> copies are corrupted, then please use your backup.

Ah, I used RAID6 for both, so "btrfs fi df /[mountpoint]" looks like this:
Data, RAID6: total=13.11TiB, used=13.10TiB
System, RAID6: total=64.00MiB, used=928.00KiB
Metadata, RAID6: total=25.00GiB, used=23.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
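
To answer the RAID1-versus-RAID6 question at a glance, the output above can be reduced to one profile per chunk type. This is only a minimal sketch, assuming the "btrfs filesystem df" output format shown above; the function name profiles and the mountpoint /mnt/btrfs are made up for the example.

```shell
# Reduce `btrfs filesystem df` output to "<chunk type> <profile>" pairs.
# Input lines look like: "Metadata, RAID6: total=25.00GiB, used=23.29GiB"
# (profiles is a hypothetical helper name, not part of btrfs-progs.)
profiles() {
    awk -F'[,:]' 'NF >= 3 {
        gsub(/^ +| +$/, "", $2)   # trim the spaces around the profile name
        print $1, $2
    }'
}

# Usage (hypothetical mountpoint):
#   btrfs filesystem df /mnt/btrfs | profiles
```

If Metadata and System show RAID6 here, as in the output above, then the metadata was not left at the RAID1 default.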


>>
>> I also tested my device with a short
>> > hdparm -tT /dev/dm5
>> and got
>> > /dev/mapper/sdc_crypt:
>> >  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
>> >  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec
>>
>> Looks ok to me. Should I test more?
>
> Okay, looks good.
>
>>
>> I bought a few new hard drives so currently I am copying all my data
>> to a second (faster) backup, so I can maybe overwrite the current file
>> system, if it's not repairable.
>
> Another question: have you tried "mount -o recovery"? Did it work?

Yes and no. At the moment I have it mounted with
"defaults,recovery,ro,compress-force=lzo,nospace_cache,clear_cache". I
am still getting some errors in the syslog, but fewer than before. Also,
it no longer becomes unreadable after a while like it did before. But it
seems to be a little slow sometimes, and twice the whole system froze
until I did a hard reset.

>
> Thanks,
>
> -liubo
>>
>> Regards,
>> Tobias
>>
>>
>> 2015-02-12 10:16 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
>> > On Wed, Feb 11, 2015 at 03:46:33PM +0100, Tobias Holst wrote:
>> >> Hmm, it looks like it is getting worse... Here are some parts of my
>> >> syslog, including two crashed btrfs-threads:
>> >>
>> >> So I am still getting many of these:
>> >> > BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166798848
>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166802944
>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166807040
>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166811136
>> >
>> > First we probably make sure that your device is well setup, since these
>> > messages usually occur after a drive is removed (the device is somehow dropping
>> > writes), the below -EIO also implies btrfs cannot read/write data from or to that drive.
>> >
>> > And in theory, RAID6 can tolerate two drive failures, so what's your mkfs.btrfs option?
>> >
>> > Thanks,
>> >
>> > -liubo
>> >
>> >> > BTRFS info (device dm-5): force lzo compression
>> >> > BTRFS info (device dm-5): disk space caching is enabled
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>
>> >> Then there is this crash of "super"/btrfs_abort_transaction:
>> >> > ------------[ cut here ]------------
>> >> > WARNING: CPU: 0 PID: 30526 at /home/kernel/COD/linux/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x5f/0x140 [btrfs]()
>> >> > BTRFS: Transaction aborted (error -5)
>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>> >> > CPU: 0 PID: 30526 Comm: kworker/u16:6 Tainted: G        W   E  3.19.0-031900-generic #201502091451
>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >> > Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>> >> > 0000000000000104 ffff880002743c18 ffffffff817c4c00 0000000000000007
>> >> > ffff880002743c68 ffff880002743c58 ffffffff81076e87 ffff880002743c58
>> >> > ffff88020a8694d0 ffff8801fb715800 00000000fffffffb 0000000000000ae8
>> >> > Call Trace:
>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>> >> > [<ffffffff81076f86>] warn_slowpath_fmt+0x46/0x50
>> >> > [<ffffffffc06375cf>] __btrfs_abort_transaction+0x5f/0x140 [btrfs]
>> >> > [<ffffffffc0655105>] btrfs_run_delayed_refs.part.82+0x175/0x290 [btrfs]
>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>> >> > [<ffffffffc0655507>] delayed_ref_async_start+0x37/0x90 [btrfs]
>> >> > [<ffffffffc069720e>] normal_work_helper+0x7e/0x1b0 [btrfs]
>> >> > [<ffffffffc0697572>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
>> >> > [<ffffffff8108f76d>] process_one_work+0x14d/0x460
>> >> > [<ffffffff8109014b>] worker_thread+0x11b/0x3f0
>> >> > [<ffffffff81090030>] ? create_worker+0x1e0/0x1e0
>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >> > ---[ end trace dd65465954546462 ]---
>> >> > BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
>> >> > BTRFS info (device dm-5): forced readonly
>> >>
>> >> and this crash of "delayed-ref"/btrfs_select_ref_head:
>> >> > ------------[ cut here ]------------
>> >> > WARNING: CPU: 7 PID: 3159 at /home/kernel/COD/linux/fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0x120/0x130 [btrfs]()
>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>> >> > CPU: 7 PID: 3159 Comm: btrfs-transacti Tainted: G        W   E  3.19.0-031900-generic #201502091451
>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >> > 00000000000001b6 ffff8801cb687c48 ffffffff817c4c00 0000000000000007
>> >> > 0000000000000000 ffff8801cb687c88 ffffffff81076e87 0000000000000001
>> >> > ffff8801fe80bf00 0000000000000000 ffff8801fe80bfc8 ffff8802345d8280
>> >> > Call Trace:
>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>> >> > [<ffffffff81076eea>] warn_slowpath_null+0x1a/0x20
>> >> > [<ffffffffc06b2d40>] btrfs_select_ref_head+0x120/0x130 [btrfs]
>> >> > [<ffffffffc0652cd1>] __btrfs_run_delayed_refs+0x1e1/0x5f0 [btrfs]
>> >> > [<ffffffffc0654ffa>] btrfs_run_delayed_refs.part.82+0x6a/0x290 [btrfs]
>> >> > [<ffffffffc0664e5c>] ? join_transaction.isra.31+0x13c/0x380 [btrfs]
>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>> >> > [<ffffffffc0665e50>] btrfs_commit_transaction+0xb0/0xa70 [btrfs]
>> >> > [<ffffffffc0663d95>] transaction_kthread+0x1d5/0x250 [btrfs]
>> >> > [<ffffffffc0663bc0>] ? open_ctree+0x1f40/0x1f40 [btrfs]
>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >> > ---[ end trace dd65465954546463 ]---
>> >> > BTRFS warning (device dm-5): Skipping commit of aborted transaction.
>> >> > BTRFS: error (device dm-5) in cleanup_transaction:1670: errno=-5 IO failure
>> >>
>> >>
>> >> Any thoughts? Would it help to unplug the "dm5"-device which seems to
>> >> be causing these errors and then balance the array?
>> >>
>> >> Regards,
>> >> Tobias
>> >>
>> >> 2015-02-09 23:45 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
>> >> > Hi
>> >> >
>> >> > I'm having some trouble with my six-drive btrfs raid6 (each drive
>> >> > encrypted with LUKS). First of all: Yes, I do have backups, but it may
>> >> > take at least days, maybe weeks or even a few months to restore
>> >> > everything from the (offsite) backups. So it is not essential to
>> >> > recover the data, but it would be great ;-)
>> >> >
>> >> > OS: Ubuntu 14.04
>> >> > Kernel: 3.19.0
>> >> > btrfs-progs: 3.19-rc2
>> >> >
>> >> > When booting my server I am getting this in the syslog:
>> >> >> [    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
>> >> >> [    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
>> >> >> [    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
>> >> >> [    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
>> >> >> [    8.555570] BTRFS info (device dm-3): force lzo compression
>> >> >> [    8.555574] BTRFS info (device dm-3): disk space caching is enabled
>> >> >> [    8.556310] BTRFS: failed to read the system array on dm-3
>> >> >> [    8.592135] BTRFS: open_ctree failed
>> >> >> [    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
>> >> >> [    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
>> >> > Looks like there is something wrong on drive 3, giving me "open_ctree
>> >> > failed". I have to press "S" to skip mounting of the btrfs volume. It
>> >> > boots and with "sudo mount --all" I can successfully mount the btrfs
>> >> > volume. Sometimes it takes one or two minutes but it will mount.
>> >> >
>> >> > After a while I am sometimes/randomly getting this in the syslog:
>> >> >> [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
>> >> > Looks like something else is broken on dm-5... But shouldn't this be
>> >> > repaired with the new raid56-repair-features of kernel 3.19?
>> >> >
>> >> > After some more time I am getting this:
>> >> >> [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
>> >> > Then it is not possible to access the mounted volume anymore. I have
>> >> > to "umount -l" to unmount it and then I can remount it. Until it
>> >> > happens again (after some time)...
>> >> >
>> >> > I also tried a balance and a scrub but they "crash". Syslog is full of
>> >> > messages like the following examples:
>> >> >> [ 3355.523157] csum_tree_block: 53 callbacks suppressed
>> >> >> [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
>> >> >> [ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
>> >> > and "btrfs scrub status /[device]" gives me the following output:
>> >> >> "scrub status for [UUID]
>> >> >>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>> >> >>        total bytes scrubbed: 113.04GiB with 0 errors"
>> >> >
>> >> > So a short summary:
>> >> > - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
>> >> > - does not mount at boot up, "open_ctree failed" (disk 3)
>> >> > - mounts successfully after bootup
>> >> > - randomly "checksum verify failed" (disk 5)
>> >> > - balance and scrub crash after some time
>> >> > - after a while the volume gets unreadable, saying "parent transid
>> >> > verify failed" (disk 4 or 5)
>> >> >
>> >> > And it looks like there still is no way to btrfsck a raid6.
>> >> >
>> >> > Any ideas how to repair this filesystem?
>> >> >
>> >> > Regards,
>> >> > Tobias
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Repair broken btrfs raid6?
  2015-02-13 18:26         ` Tobias Holst
@ 2015-02-13 21:54           ` Tobias Holst
  2015-02-15  3:30             ` Liu Bo
  0 siblings, 1 reply; 14+ messages in thread
From: Tobias Holst @ 2015-02-13 21:54 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

It's me again. I just found out why my system crashed during the backup.

I don't know what it means, but maybe it helps you?

WARNING: CPU: 7 PID: 22878 at
/home/kernel/COD/linux/fs/btrfs/extent_io.c:5203
read_extent_buffer+0xe3/0x120 [btrfs]()
Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E)
raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E) ghash_clmulni_intel(E)
aesni_intel(E) aes_x86_64(E) virtio_rng(E) lrw(E) gf128mul(E)
glue_helper(E) ablk_helper(E) cryptd(E) serio_raw(E) 8250_fintek(E)
parport_pc(E) pvpanic(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E)
cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) mpt2sas(E) ttm(E)
drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E)
scsi_transport_sas(E) drm(E)
 [<ffffffffc05089f3>] read_extent_buffer+0xe3/0x120 [btrfs]
 [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
 [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
 [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
 [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
 [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
 [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
 [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
 [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
 [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
 hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E)
 [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
 [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
 [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
 [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
 [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
 [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
 [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
 [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
 [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
 [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
 [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
 [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
 [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
 [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
 [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
 [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
 [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
 [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
 [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
 [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
 [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
 [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
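
Since several warnings appear to be interleaved in the paste above, it may help to pull out only the confirmed frames. In kernel call traces a "?" in front of a symbol marks an address that was found on the stack but could not be verified as part of the call chain. The following is a minimal sketch, not an official tool; the helper name reliable_frames is made up for illustration, and the regular expression is modelled on the frame format shown in this thread.

```python
import re

# Matches frames such as
#   " [<ffffffffc05089f3>] read_extent_buffer+0xe3/0x120 [btrfs]"
# The optional "? " group captures the marker for unverified frames.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(lines):
    """Return the symbols of all frames that are not marked with '?'."""
    frames = []
    for line in lines:
        match = FRAME.search(line)
        if match and not match.group("unreliable"):
            frames.append(match.group("symbol"))
    return frames


# A few frames copied from the trace above.
trace = [
    " [<ffffffffc05089f3>] read_extent_buffer+0xe3/0x120 [btrfs]",
    " [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]",
    " [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]",
    " [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]",
]

# Innermost call first, as in the kernel's own output.
print(" <- ".join(reliable_frames(trace)))
```

Lines that carry no frame at all, such as the stray "Modules linked in" fragments, are simply skipped.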

Regards,
Tobias


2015-02-13 19:26 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
> 2015-02-13 9:06 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
>> On Fri, Feb 13, 2015 at 12:22:16AM +0100, Tobias Holst wrote:
>>> Hi
>>>
>>> I don't remember the exact mkfs.btrfs options anymore but
>>> > ls /sys/fs/btrfs/[UUID]/features/
>>> shows the following output:
>>> > big_metadata  compress_lzo  extended_iref  mixed_backref  raid56
>>
>> Well... mkfs.btrfs can specify a '-m' for metadata profile and a '-d'
>> for data profile, the default profile for metadata is RAID1,
>> so we're not sure if your metadata is RAID1 or RAID6, if raid1 and both
>> copies are corrupted, then please use your backup.
>
> Ah, I used RAID6 for both, so "btrfs fi df /[mountpoint]" looks like this:
> Data, RAID6: total=13.11TiB, used=13.10TiB
> System, RAID6: total=64.00MiB, used=928.00KiB
> Metadata, RAID6: total=25.00GiB, used=23.29GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
>>>
>>> I also tested my device with a short
>>> > hdparm -tT /dev/dm5
>>> and got
>>> > /dev/mapper/sdc_crypt:
>>> >  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
>>> >  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec
>>>
>>> Looks ok to me. Should I test more?
>>
>> Okay, looks good.
>>
>>>
>>> I bought a few new hard drives so currently I am copying all my data
>>> to a second (faster) backup, so I can maybe overwrite the current file
>>> system, if it's not repairable.
>>
>> Another question, have you tried "mount -o recovery", did it work?
>
> Yes and no. At the moment I have it mounted with
> "defaults,recovery,ro,compress-force=lzo,nospace_cache,clear_cache". I
> am still getting some errors in the syslog, but fewer than before. It
> also no longer becomes unreadable after a while. But it seems to be a
> little slow sometimes, and twice the whole system froze until I did a
> hard reset.
>
>>
>> Thanks,
>>
>> -liubo
>>>
>>> Regards,
>>> Tobias
>>>
>>>
>>> 2015-02-12 10:16 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
>>> > On Wed, Feb 11, 2015 at 03:46:33PM +0100, Tobias Holst wrote:
>>> >> Hmm, it looks like it is getting worse... Here are some parts of my
>>> >> syslog, including two crashed btrfs-threads:
>>> >>
>>> >> So I am still getting many of these:
>>> >> > BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
>>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166798848
>>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166802944
>>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166807040
>>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166811136
>>> >
>>> > First we should probably make sure that your device is set up properly, since these
>>> > messages usually occur after a drive is removed (the device is somehow dropping
>>> > writes). The -EIO below also implies that btrfs cannot read from or write to that drive.
>>> >
>>> > And in theory, RAID6 can tolerate two drive failures, so what were your mkfs.btrfs options?
>>> >
>>> > Thanks,
>>> >
>>> > -liubo
>>> >
>>> >> > BTRFS info (device dm-5): force lzo compression
>>> >> > BTRFS info (device dm-5): disk space caching is enabled
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>>> >>
>>> >> Then there is this crash of "super"/btrfs_abort_transaction:
>>> >> > ------------[ cut here ]------------
>>> >> > WARNING: CPU: 0 PID: 30526 at /home/kernel/COD/linux/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x5f/0x140 [btrfs]()
>>> >> > BTRFS: Transaction aborted (error -5)
>>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>>> >> > CPU: 0 PID: 30526 Comm: kworker/u16:6 Tainted: G        W   E  3.19.0-031900-generic #201502091451
>>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> >> > Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> >> > 0000000000000104 ffff880002743c18 ffffffff817c4c00 0000000000000007
>>> >> > ffff880002743c68 ffff880002743c58 ffffffff81076e87 ffff880002743c58
>>> >> > ffff88020a8694d0 ffff8801fb715800 00000000fffffffb 0000000000000ae8
>>> >> > Call Trace:
>>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>>> >> > [<ffffffff81076f86>] warn_slowpath_fmt+0x46/0x50
>>> >> > [<ffffffffc06375cf>] __btrfs_abort_transaction+0x5f/0x140 [btrfs]
>>> >> > [<ffffffffc0655105>] btrfs_run_delayed_refs.part.82+0x175/0x290 [btrfs]
>>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>>> >> > [<ffffffffc0655507>] delayed_ref_async_start+0x37/0x90 [btrfs]
>>> >> > [<ffffffffc069720e>] normal_work_helper+0x7e/0x1b0 [btrfs]
>>> >> > [<ffffffffc0697572>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
>>> >> > [<ffffffff8108f76d>] process_one_work+0x14d/0x460
>>> >> > [<ffffffff8109014b>] worker_thread+0x11b/0x3f0
>>> >> > [<ffffffff81090030>] ? create_worker+0x1e0/0x1e0
>>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>>> >> > ---[ end trace dd65465954546462 ]---
>>> >> > BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
>>> >> > BTRFS info (device dm-5): forced readonly
>>> >>
>>> >> and this crash of "delayed-ref"/btrfs_select_ref_head:
>>> >> > ------------[ cut here ]------------
>>> >> > WARNING: CPU: 7 PID: 3159 at /home/kernel/COD/linux/fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0x120/0x130 [btrfs]()
>>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>>> >> > CPU: 7 PID: 3159 Comm: btrfs-transacti Tainted: G        W   E  3.19.0-031900-generic #201502091451
>>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> >> > 00000000000001b6 ffff8801cb687c48 ffffffff817c4c00 0000000000000007
>>> >> > 0000000000000000 ffff8801cb687c88 ffffffff81076e87 0000000000000001
>>> >> > ffff8801fe80bf00 0000000000000000 ffff8801fe80bfc8 ffff8802345d8280
>>> >> > Call Trace:
>>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>>> >> > [<ffffffff81076eea>] warn_slowpath_null+0x1a/0x20
>>> >> > [<ffffffffc06b2d40>] btrfs_select_ref_head+0x120/0x130 [btrfs]
>>> >> > [<ffffffffc0652cd1>] __btrfs_run_delayed_refs+0x1e1/0x5f0 [btrfs]
>>> >> > [<ffffffffc0654ffa>] btrfs_run_delayed_refs.part.82+0x6a/0x290 [btrfs]
>>> >> > [<ffffffffc0664e5c>] ? join_transaction.isra.31+0x13c/0x380 [btrfs]
>>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>>> >> > [<ffffffffc0665e50>] btrfs_commit_transaction+0xb0/0xa70 [btrfs]
>>> >> > [<ffffffffc0663d95>] transaction_kthread+0x1d5/0x250 [btrfs]
>>> >> > [<ffffffffc0663bc0>] ? open_ctree+0x1f40/0x1f40 [btrfs]
>>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>>> >> > ---[ end trace dd65465954546463 ]---
>>> >> > BTRFS warning (device dm-5): Skipping commit of aborted transaction.
>>> >> > BTRFS: error (device dm-5) in cleanup_transaction:1670: errno=-5 IO failure
>>> >>
>>> >>
>>> >> Any thoughts? Would it help to unplug the "dm5"-device which seems to
>>> >> be causing these errors and then balance the array?
>>> >>
>>> >> Regards,
>>> >> Tobias
>>> >>
>>> >> 2015-02-09 23:45 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
>>> >> > Hi
>>> >> >
>>> >> > I'm having some trouble with my six-drive btrfs raid6 (each drive
>>> >> > encrypted with LUKS). First of all: Yes, I do have backups, but it may
>>> >> > take at least days, maybe weeks or even a few months to restore
>>> >> > everything from the (offsite) backups. So it is not essential to
>>> >> > recover the data, but it would be great ;-)
>>> >> >
>>> >> > OS: Ubuntu 14.04
>>> >> > Kernel: 3.19.0
>>> >> > btrfs-progs: 3.19-rc2
>>> >> >
>>> >> > When booting my server I am getting this in the syslog:
>>> >> >> [    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
>>> >> >> [    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
>>> >> >> [    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
>>> >> >> [    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
>>> >> >> [    8.555570] BTRFS info (device dm-3): force lzo compression
>>> >> >> [    8.555574] BTRFS info (device dm-3): disk space caching is enabled
>>> >> >> [    8.556310] BTRFS: failed to read the system array on dm-3
>>> >> >> [    8.592135] BTRFS: open_ctree failed
>>> >> >> [    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
>>> >> >> [    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
>>> >> > Looks like there is something wrong on drive 3, giving me "open_ctree
>>> >> > failed". I have to press "S" to skip mounting of the btrfs volume. It
>>> >> > boots and with "sudo mount --all" I can successfully mount the btrfs
>>> >> > volume. Sometimes it takes one or two minutes but it will mount.
>>> >> >
>>> >> > After a while I am sometimes/randomly getting this in the syslog:
>>> >> >> [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
>>> >> > Looks like something else is broken on dm-5... But shouldn't this be
>>> >> > repaired with the new raid56-repair-features of kernel 3.19?
>>> >> >
>>> >> > After some more time I am getting this:
>>> >> >> [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
>>> >> > Then it is not possible to access the mounted volume anymore. I have
>>> >> > to "umount -l" to unmount it and then I can remount it. Until it
>>> >> > happens again (after some time)...
>>> >> >
>>> >> > I also tried a balance and a scrub but they "crash". Syslog is full of
>>> >> > messages like the following examples:
>>> >> >> [ 3355.523157] csum_tree_block: 53 callbacks suppressed
>>> >> >> [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
>>> >> >> [ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
>>> >> > and "btrfs scrub status /[device]" gives me the following output:
>>> >> >> "scrub status for [UUID]
>>> >> >>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>>> >> >>        total bytes scrubbed: 113.04GiB with 0 errors"
>>> >> >
>>> >> > So a short summary:
>>> >> > - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
>>> >> > - does not mount at boot up, "open_ctree failed" (disk 3)
>>> >> > - mounts successfully after bootup
>>> >> > - randomly "checksum verify failed" (disk 5)
>>> >> > - balance and scrub crash after some time
>>> >> > - after a while the volume gets unreadable, saying "parent transid
>>> >> > verify failed" (disk 4 or 5)
>>> >> >
>>> >> > And it looks like there still is no way to btrfsck a raid6.
>>> >> >
>>> >> > Any ideas how to repair this filesystem?
>>> >> >
>>> >> > Regards,
>>> >> > Tobias


* Re: Repair broken btrfs raid6?
  2015-02-13 21:54           ` Tobias Holst
@ 2015-02-15  3:30             ` Liu Bo
  2015-02-15 20:45               ` Tobias Holst
  0 siblings, 1 reply; 14+ messages in thread
From: Liu Bo @ 2015-02-15  3:30 UTC (permalink / raw)
  To: Tobias Holst; +Cc: linux-btrfs

On Fri, Feb 13, 2015 at 10:54:22PM +0100, Tobias Holst wrote:
> It's me again. I just found out why my system crashed during the backup.
> 
> I don't know what it means, but maybe it helps you?

The warning means that the checksums have somehow become inconsistent with
the file extents, but there are no clear clues about the cause :-(

Thanks,

-liubo

> 
> WARNING: CPU: 7 PID: 22878 at
> /home/kernel/COD/linux/fs/btrfs/extent_io.c:5203
> read_extent_buffer+0xe3/0x120 [btrfs]()
> Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
> ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E)
> raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
> crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E) ghash_clmulni_intel(E)
> aesni_intel(E) aes_x86_64(E) virtio_rng(E) lrw(E) gf128mul(E)
> glue_helper(E) ablk_helper(E) cryptd(E) serio_raw(E) 8250_fintek(E)
> parport_pc(E) pvpanic(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E)
> cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) mpt2sas(E) ttm(E)
> drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E)
> scsi_transport_sas(E) drm(E)
>  [<ffffffffc05089f3>] read_extent_buffer+0xe3/0x120 [btrfs]
>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
>  hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
> xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
> crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E)
>  [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
> Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
> ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
>  [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
> 
> Regards,
> Tobias
> 
> 
> 2015-02-13 19:26 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
> > 2015-02-13 9:06 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
> >> On Fri, Feb 13, 2015 at 12:22:16AM +0100, Tobias Holst wrote:
> >>> Hi
> >>>
> >>> I don't remember the exact mkfs.btrfs options anymore but
> >>> > ls /sys/fs/btrfs/[UUID]/features/
> >>> shows the following output:
> >>> > big_metadata  compress_lzo  extended_iref  mixed_backref  raid56
> >>
> >> Well... mkfs.btrfs takes '-m' for the metadata profile and '-d' for the
> >> data profile. The default metadata profile is RAID1, so we're not sure
> >> whether your metadata is RAID1 or RAID6. If it is RAID1 and both copies
> >> are corrupted, then please use your backup.
> >
> > Ah, I used RAID6 for both, so "btrfs fi df /[mountpoint]" looks like this:
> > Data, RAID6: total=13.11TiB, used=13.10TiB
> > System, RAID6: total=64.00MiB, used=928.00KiB
> > Metadata, RAID6: total=25.00GiB, used=23.29GiB
> > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> >
> >>>
> >>> I also tested my device with a short
> >>> > hdparm -tT /dev/dm5
> >>> and got
> >>> > /dev/mapper/sdc_crypt:
> >>> >  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
> >>> >  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec
> >>>
> >>> Looks ok to me. Should I test more?
> >>
> >> Okay, looks good.
> >>
> >>>
> >>> I bought a few new hard drives so currently I am copying all my data
> >>> to a second (faster) backup, so I can maybe overwrite the current file
> >>> system, if it's not repairable.
> >>
> >> Another question, have you tried "mount -o recovery", did it work?
> >
> > Yes and no. At the moment I mounted it with
> > "defaults,recovery,ro,compress-force=lzo,nospace_cache,clear_cache". I
> > am still getting some errors in the syslog, but less than before. Also
> > it doesn't get unreadable after a while like before. But it seems to
> > be a little bit slow sometimes, and twice the whole system froze
> > until I did a hard reset.
> >
> >>
> >> Thanks,
> >>
> >> -liubo
> >>>
> >>> Regards,
> >>> Tobias
> >>>
> >>>
> >>> 2015-02-12 10:16 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
> >>> > On Wed, Feb 11, 2015 at 03:46:33PM +0100, Tobias Holst wrote:
> >>> >> Hmm, it looks like it is getting worse... Here are some parts of my
> >>> >> syslog, including two crashed btrfs-threads:
> >>> >>
> >>> >> So I am still getting many of these:
> >>> >> > BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166798848
> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166802944
> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166807040
> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166811136
> >>> >
> >>> > First we should probably make sure that your device is set up properly, since these
> >>> > messages usually occur after a drive is removed (the device is somehow dropping
> >>> > writes). The -EIO below also implies btrfs cannot read/write data from or to that drive.
> >>> >
> >>> > And in theory, RAID6 can tolerate two drive failures, so what's your mkfs.btrfs option?
> >>> >
> >>> > Thanks,
> >>> >
> >>> > -liubo
> >>> >
> >>> >> > BTRFS info (device dm-5): force lzo compression
> >>> >> > BTRFS info (device dm-5): disk space caching is enabled
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
> >>> >>
> >>> >> Then there is this crash of "super"/btrfs_abort_transaction:
> >>> >> > ------------[ cut here ]------------
> >>> >> > WARNING: CPU: 0 PID: 30526 at /home/kernel/COD/linux/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x5f/0x140 [btrfs]()
> >>> >> > BTRFS: Transaction aborted (error -5)
> >>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
> >>> >> > CPU: 0 PID: 30526 Comm: kworker/u16:6 Tainted: G        W   E  3.19.0-031900-generic #201502091451
> >>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >>> >> > Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> >>> >> > 0000000000000104 ffff880002743c18 ffffffff817c4c00 0000000000000007
> >>> >> > ffff880002743c68 ffff880002743c58 ffffffff81076e87 ffff880002743c58
> >>> >> > ffff88020a8694d0 ffff8801fb715800 00000000fffffffb 0000000000000ae8
> >>> >> > Call Trace:
> >>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
> >>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
> >>> >> > [<ffffffff81076f86>] warn_slowpath_fmt+0x46/0x50
> >>> >> > [<ffffffffc06375cf>] __btrfs_abort_transaction+0x5f/0x140 [btrfs]
> >>> >> > [<ffffffffc0655105>] btrfs_run_delayed_refs.part.82+0x175/0x290 [btrfs]
> >>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
> >>> >> > [<ffffffffc0655507>] delayed_ref_async_start+0x37/0x90 [btrfs]
> >>> >> > [<ffffffffc069720e>] normal_work_helper+0x7e/0x1b0 [btrfs]
> >>> >> > [<ffffffffc0697572>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
> >>> >> > [<ffffffff8108f76d>] process_one_work+0x14d/0x460
> >>> >> > [<ffffffff8109014b>] worker_thread+0x11b/0x3f0
> >>> >> > [<ffffffff81090030>] ? create_worker+0x1e0/0x1e0
> >>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> >>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> >>> >> > ---[ end trace dd65465954546462 ]---
> >>> >> > BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
> >>> >> > BTRFS info (device dm-5): forced readonly
> >>> >>
> >>> >> and this crash of "delayed-ref"/btrfs_select_ref_head:
> >>> >> > ------------[ cut here ]------------
> >>> >> > WARNING: CPU: 7 PID: 3159 at /home/kernel/COD/linux/fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0x120/0x130 [btrfs]()
> >>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
> >>> >> > CPU: 7 PID: 3159 Comm: btrfs-transacti Tainted: G        W   E  3.19.0-031900-generic #201502091451
> >>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> >>> >> > 00000000000001b6 ffff8801cb687c48 ffffffff817c4c00 0000000000000007
> >>> >> > 0000000000000000 ffff8801cb687c88 ffffffff81076e87 0000000000000001
> >>> >> > ffff8801fe80bf00 0000000000000000 ffff8801fe80bfc8 ffff8802345d8280
> >>> >> > Call Trace:
> >>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
> >>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
> >>> >> > [<ffffffff81076eea>] warn_slowpath_null+0x1a/0x20
> >>> >> > [<ffffffffc06b2d40>] btrfs_select_ref_head+0x120/0x130 [btrfs]
> >>> >> > [<ffffffffc0652cd1>] __btrfs_run_delayed_refs+0x1e1/0x5f0 [btrfs]
> >>> >> > [<ffffffffc0654ffa>] btrfs_run_delayed_refs.part.82+0x6a/0x290 [btrfs]
> >>> >> > [<ffffffffc0664e5c>] ? join_transaction.isra.31+0x13c/0x380 [btrfs]
> >>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
> >>> >> > [<ffffffffc0665e50>] btrfs_commit_transaction+0xb0/0xa70 [btrfs]
> >>> >> > [<ffffffffc0663d95>] transaction_kthread+0x1d5/0x250 [btrfs]
> >>> >> > [<ffffffffc0663bc0>] ? open_ctree+0x1f40/0x1f40 [btrfs]
> >>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> >>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
> >>> >> > ---[ end trace dd65465954546463 ]---
> >>> >> > BTRFS warning (device dm-5): Skipping commit of aborted transaction.
> >>> >> > BTRFS: error (device dm-5) in cleanup_transaction:1670: errno=-5 IO failure
> >>> >>
> >>> >>
> >>> >> Any thoughts? Would it help to unplug the "dm5"-device which seems to
> >>> >> be causing these errors and then balance the array?
> >>> >>
> >>> >> Regards,
> >>> >> Tobias
> >>> >>
> >>> >> 2015-02-09 23:45 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
> >>> >> > Hi
> >>> >> >
> >>> >> > I'm having some trouble with my six-drive btrfs raid6 (each drive
> >>> >> > encrypted with LUKS). First of all: yes, I do have backups, but it may
> >>> >> > take at least days, maybe weeks or even some months to restore
> >>> >> > everything from the (offsite) backups. So it is not essential to
> >>> >> > recover the data, but it would be great ;-)
> >>> >> >
> >>> >> > OS: Ubuntu 14.04
> >>> >> > Kernel: 3.19.0
> >>> >> > btrfs-progs: 3.19-rc2
> >>> >> >
> >>> >> > When booting my server I am getting this in the syslog:
> >>> >> >> [    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
> >>> >> >> [    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
> >>> >> >> [    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
> >>> >> >> [    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
> >>> >> >> [    8.555570] BTRFS info (device dm-3): force lzo compression
> >>> >> >> [    8.555574] BTRFS info (device dm-3): disk space caching is enabled
> >>> >> >> [    8.556310] BTRFS: failed to read the system array on dm-3
> >>> >> >> [    8.592135] BTRFS: open_ctree failed
> >>> >> >> [    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
> >>> >> >> [    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
> >>> >> > Looks like there is something wrong on drive 3, giving me "open_ctree
> >>> >> > failed". I have to press "S" to skip mounting of the btrfs volume. It
> >>> >> > boots and with "sudo mount --all" I can successfully mount the btrfs
> >>> >> > volume. Sometimes it takes one or two minutes but it will mount.
> >>> >> >
> >>> >> > After a while I am sometimes/randomly getting this in the syslog:
> >>> >> >> [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
> >>> >> > Looks like something else is broken on dm-5... But shouldn't this be
> >>> >> > repaired with the new raid56-repair-features of kernel 3.19?
> >>> >> >
> >>> >> > After some more time I am getting this:
> >>> >> >> [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
> >>> >> > Then it is not possible to access the mounted volume anymore. I have
> >>> >> > to "umount -l" to unmount it and then I can remount it. Until it
> >>> >> > happens again (after some time)...
> >>> >> >
> >>> >> > I also tried a balance and a scrub but they "crash". Syslog is full of
> >>> >> > messages like the following examples:
> >>> >> >> [ 3355.523157] csum_tree_block: 53 callbacks suppressed
> >>> >> >> [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
> >>> >> >> [ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
> >>> >> > and "btrfs scrub status /[device]" gives me the following output:
> >>> >> >> "scrub status for [UUID]
> >>> >> >>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
> >>> >> >>        total bytes scrubbed: 113.04GiB with 0 errors"
> >>> >> >
> >>> >> > So a short summary:
> >>> >> > - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
> >>> >> > - does not mount at boot up, "open_ctree failed" (disk 3)
> >>> >> > - mounts successfully after bootup
> >>> >> > - randomly "checksum verify failed" (disk 5)
> >>> >> > - balance and scrub crash after some time
> >>> >> > - after a while the volume gets unreadable, saying "parent transid
> >>> >> > verify failed" (disk 4 or 5)
> >>> >> >
> >>> >> > And it looks like there still is no way to btrfsck a raid6.
> >>> >> >
> >>> >> > Any ideas how to repair this filesystem?
> >>> >> >
> >>> >> > Regards,
> >>> >> > Tobias

* Re: Repair broken btrfs raid6?
  2015-02-15  3:30             ` Liu Bo
@ 2015-02-15 20:45               ` Tobias Holst
  0 siblings, 0 replies; 14+ messages in thread
From: Tobias Holst @ 2015-02-15 20:45 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs

OK, I see. Maybe there is even more damage...

Now I finished my second backup of the important data and just
"killed" this damaged raid. I created a new one and now I am restoring
my data. Let's hope it will last longer this time :)
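
In case it helps anyone who lands here with similar logs, below is a minimal
sketch for tallying the two error types quoted in this thread per device. It
is only an illustration: the function name summarize() and both regular
expressions are my own assumptions, modelled on the message formats shown
above for kernel 3.19, and other kernel versions may word these lines
differently.

```python
import re
from collections import Counter

# Modelled on lines quoted in this thread, e.g.
# "BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958"
TRANSID_RE = re.compile(
    r"BTRFS \(device (?P<dev>[\w-]+)\): parent transid verify failed "
    r"on (?P<bytenr>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)
# "BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0"
CSUM_RE = re.compile(
    r"BTRFS: (?P<dev>[\w-]+) checksum verify failed on (?P<bytenr>\d+)"
)


def summarize(lines):
    """Count both failure types per device and track the largest transid gap.

    Returns (transid_counts, csum_counts, max_gap), where max_gap maps a
    device name to the largest observed "wanted - found" generation gap.
    """
    transid = Counter()
    csum = Counter()
    max_gap = {}
    for line in lines:
        m = TRANSID_RE.search(line)
        if m:
            dev = m["dev"]
            transid[dev] += 1
            gap = int(m["wanted"]) - int(m["found"])
            max_gap[dev] = max(max_gap.get(dev, 0), gap)
            continue
        m = CSUM_RE.search(line)
        if m:
            csum[m["dev"]] += 1
    return transid, csum, max_gap
```

Feeding it an open syslog file, for example summarize(open(path)) with path
pointing at wherever your distribution keeps the kernel log, gives a rough
picture of which devices report the errors and how far behind the stale
metadata blocks are. It does not say anything about the cause.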

Regards,
Tobias


2015-02-15 4:30 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
> On Fri, Feb 13, 2015 at 10:54:22PM +0100, Tobias Holst wrote:
>> It's me again. I just found out why my system crashed during the backup.
>>
>> I don't know what it means, but maybe it helps you?
>
> The warning means the checksums have somehow become inconsistent with the file extents, but there are no clear clues about the cause :-(
>
> Thanks,
>
> -liubo
>
>>
>> WARNING: CPU: 7 PID: 22878 at
>> /home/kernel/COD/linux/fs/btrfs/extent_io.c:5203
>> read_extent_buffer+0xe3/0x120 [btrfs]()
>> Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
>> ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E)
>> raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
>> crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E) ghash_clmulni_intel(E)
>> aesni_intel(E) aes_x86_64(E) virtio_rng(E) lrw(E) gf128mul(E)
>> glue_helper(E) ablk_helper(E) cryptd(E) serio_raw(E) 8250_fintek(E)
>> parport_pc(E) pvpanic(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E)
>> cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) mpt2sas(E) ttm(E)
>> drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E)
>> scsi_transport_sas(E) drm(E)
>>  [<ffffffffc05089f3>] read_extent_buffer+0xe3/0x120 [btrfs]
>>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
>>  hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
>> xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E)
>> crct10dif_pclmul(E) ppdev(E) crc32_pclmul(E)
>>  [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
>>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
>> Modules linked in: raid0(E) ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E)
>> ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E)
>>  [<ffffffffc05089ca>] ? read_extent_buffer+0xba/0x120 [btrfs]
>>  [<ffffffffc04d7dde>] __btrfs_lookup_bio_sums.isra.8+0x2ce/0x540 [btrfs]
>>  [<ffffffffc04d82a6>] btrfs_lookup_bio_sums+0x36/0x40 [btrfs]
>>  [<ffffffffc05301e6>] btrfs_submit_compressed_read+0x316/0x4e0 [btrfs]
>>  [<ffffffffc04ea031>] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
>>  [<ffffffffc05010ca>] submit_one_bio+0x6a/0xa0 [btrfs]
>>  [<ffffffffc0504958>] submit_extent_page.isra.34+0xe8/0x210 [btrfs]
>>  [<ffffffffc0506087>] __do_readpage+0x3f7/0x640 [btrfs]
>>  [<ffffffffc05057a0>] ? clean_io_failure+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc0506606>] __extent_readpages.constprop.45+0x266/0x290 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc050734e>] extent_readpages+0x15e/0x1a0 [btrfs]
>>  [<ffffffffc04eb400>] ? btrfs_submit_direct+0x1b0/0x1b0 [btrfs]
>>  [<ffffffffc04e771f>] btrfs_readpages+0x1f/0x30 [btrfs]
>>  [<ffffffffc04dc969>] ? btrfs_congested_fn+0x49/0xb0 [btrfs]
>>
>> Regards,
>> Tobias
>>
>>
>> 2015-02-13 19:26 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
>> > 2015-02-13 9:06 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
>> >> On Fri, Feb 13, 2015 at 12:22:16AM +0100, Tobias Holst wrote:
>> >>> Hi
>> >>>
>> >>> I don't remember the exact mkfs.btrfs options anymore but
>> >>> > ls /sys/fs/btrfs/[UUID]/features/
>> >>> shows the following output:
>> >>> > big_metadata  compress_lzo  extended_iref  mixed_backref  raid56
>> >>
>> >> Well... mkfs.btrfs takes '-m' for the metadata profile and '-d' for the
>> >> data profile. The default metadata profile is RAID1, so we're not sure
>> >> whether your metadata is RAID1 or RAID6. If it is RAID1 and both copies
>> >> are corrupted, then please use your backup.
>> >
>> > Ah, I used RAID6 for both, so "btrfs fi df /[mountpoint]" looks like this:
>> > Data, RAID6: total=13.11TiB, used=13.10TiB
>> > System, RAID6: total=64.00MiB, used=928.00KiB
>> > Metadata, RAID6: total=25.00GiB, used=23.29GiB
>> > GlobalReserve, single: total=512.00MiB, used=0.00B
>> >
>> >
>> >>>
>> >>> I also tested my device with a short
>> >>> > hdparm -tT /dev/dm5
>> >>> and got
>> >>> > /dev/mapper/sdc_crypt:
>> >>> >  Timing cached reads:   30712 MB in  2.00 seconds = 15376.11 MB/sec
>> >>> >  Timing buffered disk reads: 444 MB in  3.01 seconds = 147.51 MB/sec
>> >>>
>> >>> Looks ok to me. Should I test more?
>> >>
>> >> Okay, looks good.
>> >>
>> >>>
>> >>> I bought a few new hard drives so currently I am copying all my data
>> >>> to a second (faster) backup, so I can maybe overwrite the current file
>> >>> system, if it's not repairable.
>> >>
>> >> Another question, have you tried "mount -o recovery", did it work?
>> >
>> > Yes and no. At the moment I mounted it with
>> > "defaults,recovery,ro,compress-force=lzo,nospace_cache,clear_cache". I
>> > am still getting some errors in the syslog, but less than before. Also
>> > it doesn't get unreadable after a while like before. But it seems to
>> > be a little bit slow sometimes, and twice the whole system froze
>> > until I did a hard reset.
>> >
>> >>
>> >> Thanks,
>> >>
>> >> -liubo
>> >>>
>> >>> Regards,
>> >>> Tobias
>> >>>
>> >>>
>> >>> 2015-02-12 10:16 GMT+01:00 Liu Bo <bo.li.liu@oracle.com>:
>> >>> > On Wed, Feb 11, 2015 at 03:46:33PM +0100, Tobias Holst wrote:
>> >>> >> Hmm, it looks like it is getting worse... Here are some parts of my
>> >>> >> syslog, including two crashed btrfs-threads:
>> >>> >>
>> >>> >> So I am still getting many of these:
>> >>> >> > BTRFS (device dm-5): parent transid verify failed on 25033166798848 wanted 108976 found 108958
>> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166798848
>> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166802944
>> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166807040
>> >>> >> > BTRFS warning (device dm-5): page private not zero on page 25033166811136
>> >>> >
>> >>> > First we should probably make sure that your device is set up properly, since these
>> >>> > messages usually occur after a drive is removed (the device is somehow dropping
>> >>> > writes). The -EIO below also implies btrfs cannot read/write data from or to that drive.
>> >>> >
>> >>> > And in theory, RAID6 can tolerate two drive failures, so what's your mkfs.btrfs option?
>> >>> >
>> >>> > Thanks,
>> >>> >
>> >>> > -liubo
>> >>> >
>> >>> >> > BTRFS info (device dm-5): force lzo compression
>> >>> >> > BTRFS info (device dm-5): disk space caching is enabled
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >> > BTRFS: dm-5 checksum verify failed on 30525304061952 wanted 55270A94 found B18E3934 level 0
>> >>> >>
>> >>> >> Then there is this crash of "super"/btrfs_abort_transaction:
>> >>> >> > ------------[ cut here ]------------
>> >>> >> > WARNING: CPU: 0 PID: 30526 at /home/kernel/COD/linux/fs/btrfs/super.c:260 __btrfs_abort_transaction+0x5f/0x140 [btrfs]()
>> >>> >> > BTRFS: Transaction aborted (error -5)
>> >>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>> >>> >> > CPU: 0 PID: 30526 Comm: kworker/u16:6 Tainted: G        W   E  3.19.0-031900-generic #201502091451
>> >>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >>> >> > Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>> >>> >> > 0000000000000104 ffff880002743c18 ffffffff817c4c00 0000000000000007
>> >>> >> > ffff880002743c68 ffff880002743c58 ffffffff81076e87 ffff880002743c58
>> >>> >> > ffff88020a8694d0 ffff8801fb715800 00000000fffffffb 0000000000000ae8
>> >>> >> > Call Trace:
>> >>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>> >>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>> >>> >> > [<ffffffff81076f86>] warn_slowpath_fmt+0x46/0x50
>> >>> >> > [<ffffffffc06375cf>] __btrfs_abort_transaction+0x5f/0x140 [btrfs]
>> >>> >> > [<ffffffffc0655105>] btrfs_run_delayed_refs.part.82+0x175/0x290 [btrfs]
>> >>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>> >>> >> > [<ffffffffc0655507>] delayed_ref_async_start+0x37/0x90 [btrfs]
>> >>> >> > [<ffffffffc069720e>] normal_work_helper+0x7e/0x1b0 [btrfs]
>> >>> >> > [<ffffffffc0697572>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
>> >>> >> > [<ffffffff8108f76d>] process_one_work+0x14d/0x460
>> >>> >> > [<ffffffff8109014b>] worker_thread+0x11b/0x3f0
>> >>> >> > [<ffffffff81090030>] ? create_worker+0x1e0/0x1e0
>> >>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >>> >> > ---[ end trace dd65465954546462 ]---
>> >>> >> > BTRFS: error (device dm-5) in btrfs_run_delayed_refs:2792: errno=-5 IO failure
>> >>> >> > BTRFS info (device dm-5): forced readonly
>> >>> >>
>> >>> >> and this crash of "delayed-ref"/btrfs_select_ref_head:
>> >>> >> > ------------[ cut here ]------------
>> >>> >> > WARNING: CPU: 7 PID: 3159 at /home/kernel/COD/linux/fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0x120/0x130 [btrfs]()
>> >>> >> > Modules linked in: ufs(E) qnx4(E) hfsplus(E) hfs(E) minix(E) ntfs(E) msdos(E) jfs(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) iosf_mbi(E) dm_crypt(E) kvm_intel(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) ppdev(E) ghash_clmulni_intel(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) 8250_fintek(E) serio_raw(E) virtio_rng(E) parport_pc(E) mac_hid(E) pvpanic(E) i2c_piix4(E) lp(E) parport(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) mpt2sas(E) drm_kms_helper(E) raid_class(E) floppy(E) psmouse(E) drm(E) scsi_transport_sas(E)
>> >>> >> > CPU: 7 PID: 3159 Comm: btrfs-transacti Tainted: G        W   E  3.19.0-031900-generic #201502091451
>> >>> >> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> >>> >> > 00000000000001b6 ffff8801cb687c48 ffffffff817c4c00 0000000000000007
>> >>> >> > 0000000000000000 ffff8801cb687c88 ffffffff81076e87 0000000000000001
>> >>> >> > ffff8801fe80bf00 0000000000000000 ffff8801fe80bfc8 ffff8802345d8280
>> >>> >> > Call Trace:
>> >>> >> > [<ffffffff817c4c00>] dump_stack+0x45/0x57
>> >>> >> > [<ffffffff81076e87>] warn_slowpath_common+0x97/0xe0
>> >>> >> > [<ffffffff81076eea>] warn_slowpath_null+0x1a/0x20
>> >>> >> > [<ffffffffc06b2d40>] btrfs_select_ref_head+0x120/0x130 [btrfs]
>> >>> >> > [<ffffffffc0652cd1>] __btrfs_run_delayed_refs+0x1e1/0x5f0 [btrfs]
>> >>> >> > [<ffffffffc0654ffa>] btrfs_run_delayed_refs.part.82+0x6a/0x290 [btrfs]
>> >>> >> > [<ffffffffc0664e5c>] ? join_transaction.isra.31+0x13c/0x380 [btrfs]
>> >>> >> > [<ffffffffc0655237>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
>> >>> >> > [<ffffffffc0665e50>] btrfs_commit_transaction+0xb0/0xa70 [btrfs]
>> >>> >> > [<ffffffffc0663d95>] transaction_kthread+0x1d5/0x250 [btrfs]
>> >>> >> > [<ffffffffc0663bc0>] ? open_ctree+0x1f40/0x1f40 [btrfs]
>> >>> >> > [<ffffffff81095d59>] kthread+0xc9/0xe0
>> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >>> >> > [<ffffffff817d1e7c>] ret_from_fork+0x7c/0xb0
>> >>> >> > [<ffffffff81095c90>] ? flush_kthread_worker+0x90/0x90
>> >>> >> > ---[ end trace dd65465954546463 ]---
>> >>> >> > BTRFS warning (device dm-5): Skipping commit of aborted transaction.
>> >>> >> > BTRFS: error (device dm-5) in cleanup_transaction:1670: errno=-5 IO failure
>> >>> >>
>> >>> >>
>> >>> >> Any thoughts? Would it help to unplug the "dm5"-device which seems to
>> >>> >> be causing these errors and then balance the array?
>> >>> >>
>> >>> >> Regards,
>> >>> >> Tobias
>> >>> >>
>> >>> >> 2015-02-09 23:45 GMT+01:00 Tobias Holst <tobby@tobby.eu>:
>> >>> >> > Hi
>> >>> >> >
>> >>> >> > I'm having some trouble with my six-drive btrfs raid6 (each drive
>> >>> >> > encrypted with LUKS). First of all: yes, I do have backups, but it may
>> >>> >> > take at least days, maybe weeks or even some months to restore
>> >>> >> > everything from the (offsite) backups. So it is not essential to
>> >>> >> > recover the data, but it would be great ;-)
>> >>> >> >
>> >>> >> > OS: Ubuntu 14.04
>> >>> >> > Kernel: 3.19.0
>> >>> >> > btrfs-progs: 3.19-rc2
>> >>> >> >
>> >>> >> > When booting my server I am getting this in the syslog:
>> >>> >> >> [    8.026362] BTRFS: device label tobby-btrfs devid 3 transid 108721 /dev/dm-0
>> >>> >> >> [    8.118896] BTRFS: device label tobby-btrfs devid 6 transid 108721 /dev/dm-1
>> >>> >> >> [    8.202477] BTRFS: device label tobby-btrfs devid 1 transid 108721 /dev/dm-2
>> >>> >> >> [    8.520988] BTRFS: device label tobby-btrfs devid 4 transid 108721 /dev/dm-3
>> >>> >> >> [    8.555570] BTRFS info (device dm-3): force lzo compression
>> >>> >> >> [    8.555574] BTRFS info (device dm-3): disk space caching is enabled
>> >>> >> >> [    8.556310] BTRFS: failed to read the system array on dm-3
>> >>> >> >> [    8.592135] BTRFS: open_ctree failed
>> >>> >> >> [    9.039187] BTRFS: device label tobby-btrfs devid 2 transid 108721 /dev/dm-4
>> >>> >> >> [    9.107779] BTRFS: device label tobby-btrfs devid 5 transid 108721 /dev/dm-5
>> >>> >> > Looks like there is something wrong on drive 3, giving me "open_ctree
>> >>> >> > failed". I have to press "S" to skip mounting of the btrfs volume. It
>> >>> >> > boots and with "sudo mount --all" I can successfully mount the btrfs
>> >>> >> > volume. Sometimes it takes one or two minutes but it will mount.
>> >>> >> >
>> >>> >> > After a while I am sometimes/randomly getting this in the syslog:
>> >>> >> >> [ 1161.283246] BTRFS: dm-5 checksum verify failed on 39099619901440 wanted BB5B0AD5 found 6B6F5040 level 0
>> >>> >> > Looks like something else is broken on dm-5... But shouldn't this be
>> >>> >> > repaired with the new raid56-repair-features of kernel 3.19?
>> >>> >> >
>> >>> >> > After some more time I am getting this:
>> >>> >> >> [637017.631044] BTRFS (device dm-4): parent transid verify failed on 39099305132032 wanted 108722 found 108719
>> >>> >> > Then it is not possible to access the mounted volume anymore. I have
>> >>> >> > to "umount -l" to unmount it and then I can remount it. Until it
>> >>> >> > happens again (after some time)...
>> >>> >> >
>> >>> >> > I also tried a balance and a scrub but they "crash". Syslog is full of
>> >>> >> > messages like the following examples:
>> >>> >> >> [ 3355.523157] csum_tree_block: 53 callbacks suppressed
>> >>> >> >> [ 3355.523160] BTRFS: dm-5 checksum verify failed on 39099306917888 wanted F90D8231 found 5981C697 level 0
>> >>> >> >> [ 4006.935632]  BTRFS (device dm-5): parent transid verify failed on 30525418536960 wanted 108975 found 108767
>> >>> >> > and "btrfs scrub status /[device]" gives me the following output:
>> >>> >> >> "scrub status for [UUID]
>> >>> >> >>        scrub started at Mon Feb  9 18:16:38 2015 and was aborted after 2008 seconds
>> >>> >> >>        total bytes scrubbed: 113.04GiB with 0 errors"
>> >>> >> >
>> >>> >> > So a short summary:
>> >>> >> > - btrfs raid6 on 3.19.0 with btrfs-progs 3.19-rc2
>> >>> >> > - does not mount at boot up, "open_ctree failed" (disk 3)
>> >>> >> > - mounts successfully after bootup
>> >>> >> > - randomly "checksum verify failed" (disk 5)
>> >>> >> > - balance and scrub crash after some time
>> >>> >> > - after a while the volume gets unreadable, saying "parent transid
>> >>> >> > verify failed" (disk 4 or 5)
>> >>> >> >
>> >>> >> > And it looks like there still is no way to btrfsck a raid6.
>> >>> >> >
>> >>> >> > Any ideas how to repair this filesystem?
>> >>> >> >
>> >>> >> > Regards,
>> >>> >> > Tobias

end of thread, other threads:[~2015-02-15 20:46 UTC | newest]

Thread overview: 14+ messages
2015-02-09 22:45 Repair broken btrfs raid6? Tobias Holst
2015-02-10  3:36 ` Duncan
2015-02-10  7:17 ` Kai Krakow
2015-02-10 13:15   ` Ed Tomlinson
2015-02-13  1:12     ` Kai Krakow
2015-02-10 18:18   ` Tobias Holst
2015-02-11 14:46 ` Tobias Holst
2015-02-12  9:16   ` Liu Bo
2015-02-12 23:22     ` Tobias Holst
2015-02-13  8:06       ` Liu Bo
2015-02-13 18:26         ` Tobias Holst
2015-02-13 21:54           ` Tobias Holst
2015-02-15  3:30             ` Liu Bo
2015-02-15 20:45               ` Tobias Holst
