* USB reset + raid6 = majority of files unreadable
@ 2020-02-25 20:39 Jonathan H
  2020-02-25 23:58 ` Qu Wenruo
  0 siblings, 1 reply; 24+ messages in thread
From: Jonathan H @ 2020-02-25 20:39 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

Previously, I was running an array with six disks all connected via
USB. I am running raid1c3 for metadata and raid6 for data, kernel
5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
deduplication. Four of the six drives are stored in a single four-bay
enclosure. Due to my oversight, TLER was not enabled for any of the
drives, so when one of them started failing, the enclosure was reset
and all four drives were disconnected.

After rebooting, the file system was still mountable. I saw some
transid errors in dmesg, but I didn't really pay attention to them
because I was trying to get rid of the now failed drive. I tried to
"btrfs replace" the drive with a different one, but the replace
stopped making progress because all reads to the dead drive in a
certain location were failing (even with the "-r" flag). So I tried
mounting degraded without the dead drive and doing "btrfs dev delete
missing" instead. The deletion failed with the following kernel
message:

[  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
[  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
[  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
[  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
[  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
[  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4

I noticed that almost all of the files give an I/O error when read,
and similar kernel messages are generated, but with positive roots. I
also see "read error corrected" messages, but
if I try to read the files again, the exact same messages are
printed again, which seems to suggest that the errors haven't really
been corrected? (But maybe this is intended behavior.)

I also attempted to use "btrfs restore" to recover the files, but
almost all of the files produce "ERROR: zstd decompress failed Unknown
frame descriptor" and the recovery does not succeed.

Since then, I have been scrubbing the file system. The first scrub
produced lots of uncorrectable read errors and several hundred csum
errors. I'm assuming the read errors are due to the missing drive. The
puzzling thing is, the scrub can "complete" (actually, it is aborted
after it completes on all drives but the missing one) and I can delete
all of the files with unrecoverable csum errors, but all of the issues
above persist. I can then turn around and scrub again, and the scrub will
find new csum errors, which seems bizarre to me, since I would have
expected them all to be fixed. However, all transid related errors
have disappeared after the first scrub.

I have also tried deleting the file referenced in the device deletion
error and restarting the deletion. This seems to be working, but
progress has been very slow and I fear I'll have to delete all of the
I/O error-producing files above, which I would like to avoid if
possible.

What should I do in this situation and how can I avoid this in the future?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 20:39 USB reset + raid6 = majority of files unreadable Jonathan H
@ 2020-02-25 23:58 ` Qu Wenruo
  2020-02-26  0:51   ` Steven Fosdick
  2020-02-26  3:38   ` Jonathan H
  0 siblings, 2 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-02-25 23:58 UTC (permalink / raw)
  To: Jonathan H, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4371 bytes --]



On 2020/2/26 4:39 AM, Jonathan H wrote:
> Hello everyone,
> 
> Previously, I was running an array with six disks all connected via
> USB. I am running raid1c3 for metadata and raid6 for data, kernel
> 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
> deduplication. Four of the six drives are stored in a single four-bay
> enclosure. Due to my oversight, TLER was not enabled for any of the
> drives, so when one of them started failing, the enclosure was reset
> and all four drives were disconnected.
> 
> After rebooting, the file system was still mountable. I saw some
> transid errors in dmesg,

This means the fs is already corrupted.
If btrfs check is run before mount, it may provide some pretty good
debugging info.
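
For example, something like this against any one member device while the
fs is unmounted (the device name is only a placeholder):

  btrfs check --readonly /dev/sdX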

Also the exact message for the transid error and some context would help
us to determine how serious the corruption is.

> but I didn't really pay attention to them
> because I was trying to get rid of the now failed drive. I tried to
> "btrfs replace" the drive with a different one, but the replace
> stopped making progress because all reads to the dead drive in a
> certain location were failing (even with the "-r" flag). So I tried
> mounting degraded without the dead drive and doing "btrfs dev delete
> missing" instead. The deletion failed with the following kernel
> message:
> 
> [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
> [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4

This is a different error.
It means the data reloc tree is corrupted.
This somewhat looks like an existing bug, especially when every rebuild
results in the same csum.

> 
> I noticed that almost all of the files give an I/O error when read,
> and similar kernel messages are generated, but with positive roots.

Please give the exact dmesg.
Including all the messages for the same bytenr.

> I
> also see "read error corrected" messages, but
> if I try to read the files again, the exact same messages are
> printed again, which seems to suggest that the errors haven't really
> been corrected? (But maybe this is intended behavior.)
> 
> I also attempted to use "btrfs restore" to recover the files, but
> almost all of the files produce "ERROR: zstd decompress failed Unknown
> frame descriptor" and the recovery does not succeed.
> 
> Since then, I have been scrubbing the file system. The first scrub
> produced lots of uncorrectable read errors and several hundred csum
> errors. I'm assuming the read errors are due to the missing drive. The
> puzzling thing is, the scrub can "complete" (actually, it is aborted
> after it completes on all drives but the missing one) and I can delete
> all of the files with unrecoverable csum errors, but all of the issues
> above persist. I can then turn around and scrub again, and the scrub will
> find new csum errors, which seems bizarre to me, since I would have
> expected them all to be fixed. However, all transid related errors
> have disappeared after the first scrub.
> 
> I have also tried deleting the file referenced in the device deletion
> error and restarting the deletion. This seems to be working, but
> progress has been very slow and I fear I'll have to delete all of the
> I/O error-producing files above, which I would like to avoid if
> possible.
> 
> What should I do in this situation and how can I avoid this in the future?

Although I don't believe the hardware is to blame, you can still try
disabling the write cache on all related devices, as an experiment to
rule out bad disk flush/FUA behavior.
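
One way to do that (device names are placeholders, the setting does not
survive a power cycle, and some USB bridges may not pass the command
through):

  hdparm -W 0 /dev/sdX   # turn off the drive's volatile write cache
  hdparm -W /dev/sdX     # verify the current write-cache setting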

Thanks,
Qu

> 
> Thanks,
> Jonathan
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 23:58 ` Qu Wenruo
@ 2020-02-26  0:51   ` Steven Fosdick
  2020-02-26  0:55     ` Qu Wenruo
  2020-02-26  3:38   ` Jonathan H
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26  0:51 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Jonathan H

On Tue, 25 Feb 2020 at 23:59, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2020/2/26 4:39 AM, Jonathan H wrote:
> > [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> > [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> > [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
> > [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
> > [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
> > [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
> > off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>
> This is a different error.
> It means the data reloc tree is corrupted.
> This somewhat looks like an existing bug, especially when every rebuild
> results in the same csum.

This looks remarkably similar to the errors I am getting that prevent
removing a failed device from the array, except in my case only the
found value is the same across the whole set of messages; the expected
value varies.  Do you know if/when this bug was/will be corrected?
Can the reloc tree be repaired/rebuilt?

Here are the messages I am getting:

Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494745088 csum 0x8941f998 expected csum
0x99726972 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494749184 csum 0x8941f998 expected csum
0x4c946d24 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494753280 csum 0x8941f998 expected csum
0x3cacfa54 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494757376 csum 0x8941f998 expected csum
0x453f4f60 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494761472 csum 0x8941f998 expected csum
0x5630f6fa mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494765568 csum 0x8941f998 expected csum
0xbf215c7a mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494769664 csum 0x8941f998 expected csum
0x242df5b3 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494773760 csum 0x8941f998 expected csum
0x84d8643c mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494777856 csum 0x8941f998 expected csum
0xcd4799e3 mirror 2
Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 261 off 1494781952 csum 0x8941f998 expected csum
0x84e72065 mirror 2


Regards,
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  0:51   ` Steven Fosdick
@ 2020-02-26  0:55     ` Qu Wenruo
  0 siblings, 0 replies; 24+ messages in thread
From: Qu Wenruo @ 2020-02-26  0:55 UTC (permalink / raw)
  To: Steven Fosdick, linux-btrfs; +Cc: Jonathan H


[-- Attachment #1.1: Type: text/plain, Size: 3557 bytes --]



On 2020/2/26 8:51 AM, Steven Fosdick wrote:
> On Tue, 25 Feb 2020 at 23:59, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2020/2/26 4:39 AM, Jonathan H wrote:
>>> [  +2.697798] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
>>> [  +0.003381] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
>>> [  +0.002514] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>>> [  +0.000543] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 1
>>> [  +0.001170] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 2
>>> [  +0.001151] BTRFS warning (device sdb): csum failed root -9 ino 257
>>> off 2083160064 csum 0xd0a0b14c expected csum 0x7f3ec5ab mirror 4
>>
>> This is a different error.
>> It means the data reloc tree is corrupted.
>> This somewhat looks like an existing bug, especially when every rebuild
>> results in the same csum.
> 
> This looks remarkably similar to the errors I am getting that prevent
> removing a failed device from the array, except in my case only the
> found value is the same across the whole set of messages; the expected
> value varies.  Do you know if/when this bug was/will be corrected?
> Can the reloc tree be repaired/rebuilt?

The data reloc tree is just a temporary tree used during relocation.

Since the data failed to pass csum verification, relocation will abort
and clean up the data reloc tree.

> 
> Here are the messages I am getting:
> 
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494745088 csum 0x8941f998 expected csum
> 0x99726972 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494749184 csum 0x8941f998 expected csum
> 0x4c946d24 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494753280 csum 0x8941f998 expected csum
> 0x3cacfa54 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494757376 csum 0x8941f998 expected csum
> 0x453f4f60 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494761472 csum 0x8941f998 expected csum
> 0x5630f6fa mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494765568 csum 0x8941f998 expected csum
> 0xbf215c7a mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494769664 csum 0x8941f998 expected csum
> 0x242df5b3 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494773760 csum 0x8941f998 expected csum
> 0x84d8643c mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494777856 csum 0x8941f998 expected csum
> 0xcd4799e3 mirror 2
> Feb 22 20:08:01 meije kernel: BTRFS warning (device sda): csum failed
> root -9 ino 261 off 1494781952 csum 0x8941f998 expected csum
> 0x84e72065 mirror 2

Since the root is still -9, it's still the data reloc tree, so these
messages don't tell us much on their own.

BTW, full dmesg and btrfs check output are appreciated.

Thanks,
Qu
> 
> 
> Regards,
> Steve.
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-25 23:58 ` Qu Wenruo
  2020-02-26  0:51   ` Steven Fosdick
@ 2020-02-26  3:38   ` Jonathan H
  2020-02-26  3:44     ` Jonathan H
  2020-02-26  4:37     ` Qu Wenruo
  1 sibling, 2 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26  3:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 3:58 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2020/2/26 4:39 AM, Jonathan H wrote:
> > Hello everyone,
> >
> > Previously, I was running an array with six disks all connected via
> > USB. I am running raid1c3 for metadata and raid6 for data, kernel
> > 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
> > deduplication. Four of the six drives are stored in a single four-bay
> > enclosure. Due to my oversight, TLER was not enabled for any of the
> > drives, so when one of them started failing, the enclosure was reset
> > and all four drives were disconnected.
> >
> > After rebooting, the file system was still mountable. I saw some
> > transid errors in dmesg,
>
> This means the fs is already corrupted.
> If btrfs check is run before mount, it may provide some pretty good
> debugging

Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

> Also the exact message for the transid error and some context would help
> us to determine how serious the corruption is.

Unfortunately, I can't reproduce the transid errors, since they seem
to have been fixed. However, I do recall that the wanted and found
generation numbers differed by less than a hundred?

> > I noticed that almost all of the files give an I/O error when read,
> > and similar kernel messages are generated, but with positive roots.
>
> Please give the exact dmesg.
> Including all the messages for the same bytenr.

Here is the dmesg: http://ix.io/2cH7

> Although I don't believe the hardware is to blame, you can still try
> disabling the write cache on all related devices, as an experiment to
> rule out bad disk flush/FUA behavior.

I've disabled the write caches for now.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  3:38   ` Jonathan H
@ 2020-02-26  3:44     ` Jonathan H
  2020-02-26  4:37     ` Qu Wenruo
  1 sibling, 0 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26  3:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 7:38 PM Jonathan H <pythonnut@gmail.com> wrote:

> Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

Whoops, it's actually at http://ix.io/2cGZ

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  3:38   ` Jonathan H
  2020-02-26  3:44     ` Jonathan H
@ 2020-02-26  4:37     ` Qu Wenruo
  2020-02-26 16:45       ` Jonathan H
  1 sibling, 1 reply; 24+ messages in thread
From: Qu Wenruo @ 2020-02-26  4:37 UTC (permalink / raw)
  To: Jonathan H; +Cc: linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3211 bytes --]



On 2020/2/26 11:38 AM, Jonathan H wrote:
> On Tue, Feb 25, 2020 at 3:58 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2020/2/26 4:39 AM, Jonathan H wrote:
>>> Hello everyone,
>>>
>>> Previously, I was running an array with six disks all connected via
>>> USB. I am running raid1c3 for metadata and raid6 for data, kernel
>>> 5.5.4-arch1-1 and btrfs --version v5.4, and I use bees for
>>> deduplication. Four of the six drives are stored in a single four-bay
>>> enclosure. Due to my oversight, TLER was not enabled for any of the
>>> drives, so when one of them started failing, the enclosure was reset
>>> and all four drives were disconnected.
>>>
>>> After rebooting, the file system was still mountable. I saw some
>>> transid errors in dmesg,
>>
>> This means the fs is already corrupted.
>> If btrfs check is run before mount, it may provide some pretty good
>> debugging
> 
> Here is the output from "btrfs check --check-data-csum": http://ix.io/2cH7

It's great that your metadata is safe.

> 
>> Also the exact message for the transid error and some context would help
>> us to determine how serious the corruption is.
> 
> Unfortunately, I can't reproduce the transid errors, since they seem
> to have been fixed. However, I do recall that the wanted and found
> generation numbers differed by less than a hundred?

So your previous transid errors may have come from a bad copy on your
failing disk, and btrfs found its way to the next good copy.

The biggest concern is no longer a concern now.

> 
>>> I noticed that almost all of the files give an I/O error when read,
>>> and similar kernel messages are generated, but with positive roots.
>>
>> Please give the exact dmesg.
>> Including all the messages for the same bytenr.
> 
> Here is the dmesg: http://ix.io/2cH7

More context would be welcomed.

Anyway, even with more context, it may still lack the needed info, as
such csum failure messages are rate limited.

The mirror num 2 means the first rebuild attempt failed.

Since only the first rebuild attempt failed, and there are some corrected
data reads, it looks like btrfs can still rebuild the data.

Since you have already observed some EIO, it looks like a write hole is
involved, screwing up the rebuild process.
But it's still very strange, as I'm expecting mirror numbers other
than 2.
For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
data, so seeing only mirror num 2 doesn't look correct to me.

> 
>> Although I don't believe the hardware is to blame, you can still try
>> disabling the write cache on all related devices, as an experiment to
>> rule out bad disk flush/FUA behavior.
> 
> I've disabled the write caches for now.
> 
BTW, since your free space cache is already corrupted, it's recommended
to clear the space cache.
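
For example, either a one-time mount option or, with the fs unmounted,
btrfs check (device, mount point and the v1 cache version are placeholders
for your setup):

  mount -o clear_cache /dev/sdX /mnt
  # or, on the unmounted filesystem:
  btrfs check --clear-space-cache v1 /dev/sdX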

For now, since it looks like write hole is involved, the only way to
solve the problem is to remove all offending files (including a super
large file in root 5).

You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
the involved files.

The <bytenr> values are the bytenrs shown in the btrfs check
--check-data-csum output.
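
For example (the bytenr and mount point are placeholders):

  # run once for each bytenr reported by btrfs check --check-data-csum
  btrfs inspect-internal logical-resolve 123456789012 /mnt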

Thanks,
Qu


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26  4:37     ` Qu Wenruo
@ 2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jonathan H @ 2020-02-26 16:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> It's great that your metadata is safe.
>
> The biggest concern is no longer a concern now.

Glad to hear.

> More context would be welcomed.

Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM

Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
The last two lines are produced when trying to read the file a second time.

Here's the state of the currently running scrub: http://ix.io/2cJU
I had to cancel and resume the scrub to run `btrfs check` earlier, but
otherwise it has been uninterrupted.

> Anyway, even with more context, it may still lack the needed info, as
> such csum failure messages are rate limited.
>
> The mirror num 2 means the first rebuild attempt failed.
>
> Since only the first rebuild attempt failed, and there are some corrected
> data reads, it looks like btrfs can still rebuild the data.
>
> Since you have already observed some EIO, it looks like a write hole is
> involved, screwing up the rebuild process.
> But it's still very strange, as I'm expecting mirror numbers other
> than 2.
> For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
> data, so seeing only mirror num 2 doesn't look correct to me.

I'm sort of curious why so many files have been affected. It seems
like most of the file system has become unreadable, but I was under
the impression that if the write hole occurred it would at least not
damage too much data at once. Is that incorrect?

> BTW, since your free space cache is already corrupted, it's recommended
> to clear the space cache.

It's strange to me that the free space cache is giving an error, since
I cleared it previously and the most recent unmount was clean.

> For now, since it looks like write hole is involved, the only way to
> solve the problem is to remove all offending files (including a super
> large file in root 5).
>
> You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> the involved files.
>
> The <bytenr> values are the bytenrs shown in the btrfs check
> --check-data-csum output.

The strange thing is that when I run `btrfs inspect logical-resolve` on
all of the bytenrs mentioned in the check output, all of the corruption
resolves to the same file (see http://ix.io/2cJP), but this does
not seem consistent with the uncorrectable csum errors the scrub is
detecting.

I've been calculating the offsets of the files mentioned in the
relocation csum errors (by adding the block group and offset),
resolving the files with `btrfs inspect logical-resolve` and deleting
them. But it seems like the set of files I'm deleting is also totally
unrelated to the set of files the scrub is detecting errors in. Given
the frequency of relocation errors, I fear I will need to delete
almost everything on the file system for the deletion to complete. I
can't tell if I should expect these errors to be fixable since the
relocation isn't making any attempt to correct them as far as I can
tell.
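
For reference, what I'm doing looks roughly like this (the numbers and
mount point below are placeholders, not values from my logs):

  bg=10000000000000   # block group start from the "relocating block group" line
  off=1234567890      # "off" value from the "csum failed root -9" warning
  btrfs inspect-internal logical-resolve $((bg + off)) /mnt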

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
@ 2020-02-26 22:36         ` Steven Fosdick
  2020-02-26 23:14           ` Steven Fosdick
  2020-02-27  0:39           ` Chris Murphy
  2020-03-03 15:42         ` Jonathan H
  2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 2 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26 22:36 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

Ok, so to expand on my previous message, I have what was a 3-drive
filesystem with RAID1 metadata and RAID5 data.  One drive failed so I
mounted degraded, added a replacement and tried to remove the missing
(failed) drive.  It won't remove - the remove aborts with an I/O error
after checksum errors have been logged as reported in my last e-mail.

I have run a btrfs check on the filesystem and this gives the following output:

WARNING: filesystem mounted, continuing because of --force
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sda
UUID: a3d38933-ee90-4b84-8f24-3a5c36dfd9be
found 9834224820224 bytes used, no error found
total csum bytes: 9588337304
total tree bytes: 13656375296
total fs tree bytes: 2760966144
total extent tree bytes: 388759552
btree space waste bytes: 1321640764
file data blocks allocated: 9820591190016
 referenced 9820501786624

The filesystem was mounted r/o to avoid any changes upsetting the
check.  I have now started a scrub to see what that finds, but the ETA
is Sat Feb 29 07:57:49 2020, so I will report back then.

Regarding kernel messages I have found a few of these in the log
starting before the disc failure:

Sep 27 15:16:08 meije kernel: INFO: task kworker/u128:1:18864 blocked
for more than 122 seconds.
Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1
Sep 27 15:16:08 meije kernel: "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 27 15:16:08 meije kernel: kworker/u128:1  D    0 18864      2 0x80000080
Sep 27 15:16:08 meije kernel: Workqueue: events_unbound
btrfs_async_reclaim_metadata_space [btrfs]
Sep 27 15:16:08 meije kernel: Call Trace:
Sep 27 15:16:08 meije kernel:  ? __schedule+0x26c/0x8c0
Sep 27 15:16:08 meije kernel:  ? preempt_count_add+0x79/0xb0
Sep 27 15:16:08 meije kernel:  schedule+0x3c/0x80
Sep 27 15:16:08 meije kernel:  wait_for_commit+0x53/0x80 [btrfs]
Sep 27 15:16:08 meije kernel:  ? wait_woken+0x70/0x70
Sep 27 15:16:08 meije kernel:  btrfs_commit_transaction+0x145/0x930 [btrfs]
Sep 27 15:16:08 meije kernel:  ? _raw_spin_lock_irqsave+0x26/0x50
Sep 27 15:16:08 meije kernel:  ? _raw_spin_unlock_irqrestore+0x20/0x40
Sep 27 15:16:08 meije kernel:  ? __percpu_counter_sum+0x52/0x60
Sep 27 15:16:08 meije kernel:  flush_space+0x163/0x6a0 [btrfs]
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x41/0x70
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x35/0x70
Sep 27 15:16:08 meije kernel:  ? __switch_to_asm+0x41/0x70
Sep 27 15:16:08 meije kernel:
btrfs_async_reclaim_metadata_space+0xc4/0x4a0 [btrfs]
Sep 27 15:16:08 meije kernel:  ? __schedule+0x274/0x8c0
Sep 27 15:16:08 meije kernel:  process_one_work+0x1d1/0x3e0
Sep 27 15:16:08 meije kernel:  worker_thread+0x4a/0x3d0
Sep 27 15:16:08 meije kernel:  kthread+0xfb/0x130
Sep 27 15:16:08 meije kernel:  ? process_one_work+0x3e0/0x3e0
Sep 27 15:16:08 meije kernel:  ? kthread_park+0x90/0x90
Sep 27 15:16:08 meije kernel:  ret_from_fork+0x35/0x40

but I think these may have nothing to do with it - they may be another
filesystem (root) and the timeout may be because that is a USB stick
which is rather slow.  My reason for thinking that is that the process
that gave rise to the timeout appears to be pacman, the Arch package
manager, which primarily writes to the root filesystem.

It looks like the disc started to fail here:

Jan 30 13:41:04 meije kernel: scsi_io_completion_action: 806 callbacks
suppressed
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 CDB: Read(16)
88 00 00 00 00 00 a2 d3 3b 00 00 00 00 40 00 00
Jan 30 13:41:04 meije kernel: print_req_error: 806 callbacks suppressed
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 00 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: btrfs_dev_stat_print_on_error: 732
callbacks suppressed
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 743, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#19 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#19 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 08 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752200 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 18 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752216 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#20 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#20 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 10 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752208 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 744, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 745, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 20 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752224 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 746, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#23 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#23 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 28 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752232 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 747, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 748, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#21 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 30 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752240 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 FAILED Result:
hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#22 CDB: Write(16)
8a 00 00 00 00 00 a2 d3 3b 38 00 00 00 08 00 00
Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
sector 2731752248 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 749, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:41:04 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 750, rd 0, flush 0, corrupt 0, gen 0
Jan 30 13:42:37 meije rwhod[451]: sending on interface eno1
...

This goes on for pages and quite a few days, I can extract more if it
is of interest.  Next is a reboot:

Feb 10 16:19:50 meije kernel: sd 3:0:0:0: [sde] tag#0 CDB: Write(16)
8a 00 00 00 00 02 18 c2 76 00 00 00 00 80 00 00
Feb 10 16:19:50 meije kernel: blk_update_request: I/O error, dev sde,
sector 9005331968 op 0x1:(WRITE) flags 0x0 phys_seg 16 prio class 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857068, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857069, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857070, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857071, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857072, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857073, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857074, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857075, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857076, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): bdev /dev/sde
errs: wr 2857077, rd 20518, flush 0, corrupt 0, gen 0
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS warning (device sda): lost page
write due to IO error on /dev/sde
Feb 10 16:19:50 meije kernel: BTRFS error (device sda): error writing
primary super block to device 3
Feb 10 16:19:50 meije systemd[1]: data.mount: Succeeded.
Feb 10 16:19:50 meije systemd[1]: Unmounted /data.
Feb 10 16:19:50 meije systemd[1]: Stopped target Local File Systems (Pre).

then on the way back up:

Feb 10 16:52:44 meije kernel: Btrfs loaded, crc32c=crc32c-intel
Feb 10 16:52:44 meije kernel: BTRFS: device fsid
f8ffa6cb-ab56-466a-99f1-438e705609c6 devid 1 transid 2007 /dev/sdb
scanned by systemd-udevd (199)
Feb 10 16:52:44 meije kernel: usb 3-1: New USB device found,
idVendor=051d, idProduct=0002, bcdDevice= 1.06
Feb 10 16:52:44 meije kernel: usb 3-1: New USB device strings: Mfr=3,
Product=1, SerialNumber=2
Feb 10 16:52:44 meije kernel: usb 3-1: Product: Back-UPS ES 550
FW:828.D3 .I USB FW:D3
Feb 10 16:52:44 meije kernel: usb 3-1: Manufacturer: APC
Feb 10 16:52:44 meije kernel: usb 3-1: SerialNumber: 3B0918X18511
Feb 10 16:52:44 meije kernel: BTRFS: device label data devid 2 transid
389104 /dev/sdc scanned by systemd-udevd (191)
Feb 10 16:52:44 meije kernel: hid: raw HID events driver (C) Jiri Kosina
Feb 10 16:52:44 meije kernel: BTRFS: device fsid
aae38dd1-9303-4169-97b1-35a7fc951e02 devid 1 transid 2954881 /dev/sdd1
scanned by systemd-udevd (199)
Feb 10 16:52:44 meije kernel: usbcore: registered new interface driver usbhid
Feb 10 16:52:44 meije kernel: usbhid: USB HID core driver
Feb 10 16:52:44 meije kernel: hid-generic 0003:051D:0002.0001:
hiddev0,hidraw0: USB HID v1.10 Device [APC Back-UPS ES 550 FW:828.D3
.I USB FW:D3 ] on usb-0000:0>
Feb 10 16:52:44 meije kernel: ata4: SATA link up 3.0 Gbps (SStatus 123
SControl 300)
Feb 10 16:52:44 meije kernel: ata4.00: failed to IDENTIFY
(INIT_DEV_PARAMS failed, err_mask=0x80)
Feb 10 16:52:44 meije kernel: BTRFS info (device sdd1): disk space
caching is enabled
Feb 10 16:52:44 meije kernel: BTRFS info (device sdd1): has skinny extents
...
Feb 10 17:01:46 meije kernel: BTRFS info (device sda): disk space
caching is enabled
Feb 10 17:01:46 meije kernel: BTRFS info (device sda): has skinny extents
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): devid 3 uuid
443cd484-9ae5-49cf-b630-3f886f523302 is missing
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): failed to read
the system array: -2
Feb 10 17:01:46 meije kernel: BTRFS error (device sda): open_ctree failed

then after mounting degraded, add a new device and attempt to remove
the missing one:

Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
block group 10045992468480 flags data|raid5
Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
block group 10043844984832 flags data|raid5
Feb 10 19:39:41 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:39:52 meije kernel: BTRFS info (device sda): found 38 extents
Feb 10 19:39:57 meije kernel: BTRFS info (device sda): found 38 extents
Feb 10 19:40:01 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:01 meije kernel: BTRFS info (device sda): relocating
block group 10041697501184 flags data|raid5
Feb 10 19:40:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:32 meije kernel: BTRFS info (device sda): found 155 extents
Feb 10 19:40:37 meije kernel: BTRFS info (device sda): found 155 extents
Feb 10 19:40:42 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:40:42 meije kernel: BTRFS info (device sda): relocating
block group 10039550017536 flags data|raid5
Feb 10 19:41:01 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:13 meije kernel: BTRFS info (device sda): found 166 extents
Feb 10 19:41:18 meije kernel: BTRFS info (device sda): found 166 extents
Feb 10 19:41:23 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:23 meije kernel: BTRFS info (device sda): relocating
block group 10037402533888 flags data|raid5
Feb 10 19:41:45 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:41:59 meije kernel: BTRFS info (device sda): found 518 extents
Feb 10 19:42:04 meije kernel: BTRFS info (device sda): found 518 extents
Feb 10 19:42:09 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:09 meije kernel: BTRFS info (device sda): relocating
block group 10035255050240 flags data|raid5
Feb 10 19:42:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:40 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:42:45 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:42:47 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:42:47 meije kernel: BTRFS info (device sda): relocating
block group 10033107566592 flags data|raid5
Feb 10 19:42:55 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:18 meije kernel: BTRFS info (device sda): found 45 extents
Feb 10 19:43:21 meije kernel: BTRFS info (device sda): found 45 extents
Feb 10 19:43:26 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:26 meije kernel: BTRFS info (device sda): relocating
block group 10030960082944 flags data|raid5
Feb 10 19:43:38 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:43:58 meije kernel: BTRFS info (device sda): found 66 extents
Feb 10 19:44:00 meije rwhod[422]: sending on interface eno1
Feb 10 19:44:02 meije kernel: BTRFS info (device sda): found 66 extents
Feb 10 19:44:06 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:06 meije kernel: BTRFS info (device sda): relocating
block group 10028812599296 flags data|raid5
Feb 10 19:44:21 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:36 meije kernel: BTRFS info (device sda): found 64 extents
Feb 10 19:44:41 meije kernel: BTRFS info (device sda): found 64 extents
Feb 10 19:44:44 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:44:44 meije kernel: BTRFS info (device sda): relocating
block group 10026665115648 flags data|raid5
Feb 10 19:44:55 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:13 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:45:17 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:45:22 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:22 meije kernel: BTRFS info (device sda): relocating
block group 10024517632000 flags data|raid5
Feb 10 19:45:36 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:45:52 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:45:56 meije kernel: BTRFS info (device sda): found 46 extents
Feb 10 19:46:00 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:00 meije kernel: BTRFS info (device sda): relocating
block group 10022370148352 flags data|raid5
Feb 10 19:46:17 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:31 meije kernel: BTRFS info (device sda): found 49 extents
Feb 10 19:46:36 meije kernel: BTRFS info (device sda): found 49 extents
Feb 10 19:46:41 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:46:41 meije kernel: BTRFS info (device sda): relocating
block group 10020222664704 flags data|raid5
Feb 10 19:47:00 meije rwhod[422]: sending on interface eno1
Feb 10 19:47:00 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:47:14 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:47:19 meije kernel: BTRFS info (device sda): found 51 extents
Feb 10 19:47:24 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:47:24 meije kernel: BTRFS info (device sda): relocating
block group 10018075181056 flags data|raid5
Feb 10 19:47:44 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:48:04 meije kernel: BTRFS info (device sda): found 1060 extents
Feb 10 19:48:11 meije kernel: BTRFS info (device sda): found 1060 extents
Feb 10 19:48:22 meije kernel: BTRFS info (device sda): clearing
incompat feature flag for RAID56 (0x80)
Feb 10 19:48:25 meije kernel: BTRFS info (device sda): relocating
block group 10015927697408 flags data|raid5
Feb 10 19:48:44 meije kernel: BTRFS info (device sda): setting
incompat feature flag for RAID56 (0x80)
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494749184 csum 0x8941f998 expected csum
0x4c946d24 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494753280 csum 0x8941f998 expected csum
0x3cacfa54 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494757376 csum 0x8941f998 expected csum
0x453f4f60 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494761472 csum 0x8941f998 expected csum
0x5630f6fa mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494765568 csum 0x8941f998 expected csum
0xbf215c7a mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494769664 csum 0x8941f998 expected csum
0x242df5b3 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494773760 csum 0x8941f998 expected csum
0x84d8643c mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494777856 csum 0x8941f998 expected csum
0xcd4799e3 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494781952 csum 0x8941f998 expected csum
0x84e72065 mirror 2
Feb 10 19:49:44 meije kernel: BTRFS warning (device sda): csum failed
root -9 ino 272 off 1494786048 csum 0x8941f998 expected csum
0xa1a55d97 mirror 2

and at that point the device remove aborted with an I/O error.

I did discover I could use balance with a filter to balance much of
the data onto the three working discs, away from the missing one but I also
discovered that whenever the checksum error appears the space cache
seems to get corrupted.  Any further balance attempt results in
getting stuck in a loop.  Mounting with clear_cache resolves that.

Regards.
Steve.

On Wed, 26 Feb 2020 at 16:49, Jonathan H <pythonnut@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > It's great that your metadata is safe.
> >
> > The biggest concern is no longer a concern now.
>
> Glad to hear.
>
> > More context would be welcomed.
>
> Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM
>
> Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
> The last two lines are produced when trying to read the file a second time.
>
> Here's the state of the currently running scrub: http://ix.io/2cJU
> I had to cancel and resume the scrub to run `btrfs check` earlier, but
> otherwise it has been uninterrupted.
>
> > Anyway, even with more context, it may still lack the needed info, as
> > such csum failure messages are rate limited.
> >
> > The mirror num 2 means the first rebuild attempt failed.
> >
> > Since only the first rebuild attempt failed, and there are some corrected
> > data reads, it looks like btrfs can still rebuild the data.
> >
> > Since you have already observed some EIO, it looks like a write hole is
> > involved, screwing up the rebuild process.
> > But it's still very strange, as I'm expecting mirror numbers other
> > than 2.
> > For your 6 disks with 1 bad disk, there are still 5 ways to rebuild the
> > data, so seeing only mirror num 2 doesn't look correct to me.
>
> I'm sort of curious why so many files have been affected. It seems
> like most of the file system has become unreadable, but I was under
> the impression that if the write hole occurred it would at least not
> damage too much data at once. Is that incorrect?
>
> > BTW, since your free space cache is already corrupted, it's recommended
> > to clear the space cache.
>
> It's strange to me that the free space cache is giving an error, since
> I cleared it previously and the most recent unmount was clean.
>
> > For now, since it looks like write hole is involved, the only way to
> > solve the problem is to remove all offending files (including a super
> > large file in root 5).
> >
> > You can use `btrfs inspect logical-resolve <bytenr> <mnt>` to see all
> > the involved files.
> >
> > The <bytenr> values are the bytenrs shown in the btrfs check
> > --check-data-csum output.
>
> The strange thing is that when I run `btrfs inspect logical-resolve` on
> all of the bytenrs mentioned in the check output, all of the corruption
> resolves to the same file (see http://ix.io/2cJP), but this does
> not seem consistent with the uncorrectable csum errors the scrub is
> detecting.
>
> I've been calculating the offsets of the files mentioned in the
> relocation csum errors (by adding the block group and offset),
> resolving the files with `btrfs inspect logical-resolve` and deleting
> them. But it seems like the set of files I'm deleting is also totally
> unrelated to the set of files the scrub is detecting errors in. Given
> the frequency of relocation errors, I fear I will need to delete
> almost everything on the file system for the deletion to complete. I
> can't tell if I should expect these errors to be fixable since the
> relocation isn't making any attempt to correct them as far as I can
> tell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 22:36         ` Steven Fosdick
@ 2020-02-26 23:14           ` Steven Fosdick
  2020-02-27  0:42             ` Chris Murphy
  2020-02-27  0:39           ` Chris Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-02-26 23:14 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

Ok so the last message wasn't so easy to read due to line wrap so here
it is again with the log output replaced by ix.io links.

To expand on my previous message, I have what was a 3-drive
filesystem with RAID1 metadata and RAID5 data.  One drive failed so I
mounted degraded, added a replacement and tried to remove the missing
(failed) drive.  It won't remove - the remove aborts with an I/O error
after checksum errors have been logged as reported in my last e-mail.

I have run a btrfs check on the filesystem and this gives the following output:

WARNING: filesystem mounted, continuing because of --force
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/sda
UUID: a3d38933-ee90-4b84-8f24-3a5c36dfd9be
found 9834224820224 bytes used, no error found
total csum bytes: 9588337304
total tree bytes: 13656375296
total fs tree bytes: 2760966144
total extent tree bytes: 388759552
btree space waste bytes: 1321640764
file data blocks allocated: 9820591190016
 referenced 9820501786624

The filesystem was mounted r/o to avoid any changes upsetting the
check.  I have now started a scrub to see what that finds, but the ETA
is Sat Feb 29 07:57:49 2020, so I will report back then.

Regarding kernel messages I have found a few of these in the log
starting before the disc failure:

http://ix.io/2cLX

but I think these may have nothing to do with it - they may be another
filesystem (root) and the timeout may be because that is a USB stick
which is rather slow.  My reason for thinking that is that the process
that gave rise to the timeout appears to be pacman, the Arch package
manager, which primarily writes to the root filesystem.

It looks like the disc started to fail here:

http://ix.io/2cM1

This goes on for pages and quite a few days, I can extract more if it
is of interest.  Next is a reboot - this is the shutdown part:

http://ix.io/2cM2

then on the way back up:

http://ix.io/2cM3

then after mounting degraded, add a new device and attempt to remove
the missing one:

http://ix.io/2cM4

and at that point the device remove aborted with an I/O error.

I did discover I could use balance with a filter to balance much of
the data onto the three working discs, away from the missing one but I also
discovered that whenever the checksum error appears the space cache
seems to get corrupted.  Any further balance attempt results in
getting stuck in a loop.  Mounting with clear_cache resolves that.

Regards.
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 22:36         ` Steven Fosdick
  2020-02-26 23:14           ` Steven Fosdick
@ 2020-02-27  0:39           ` Chris Murphy
  2020-02-27  0:47             ` Chris Murphy
  2020-02-27 12:20             ` Steven Fosdick
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:39 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 3:37 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:

> It looks like the disc started to fail here:
>
> Jan 30 13:41:04 meije kernel: scsi_io_completion_action: 806 callbacks
> suppressed
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#18 CDB: Read(16)
> 88 00 00 00 00 00 a2 d3 3b 00 00 00 00 40 00 00
> Jan 30 13:41:04 meije kernel: print_req_error: 806 callbacks suppressed
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 FAILED Result:
> hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
> Jan 30 13:41:04 meije kernel: sd 3:0:0:0: [sde] tag#15 CDB: Write(16)
> 8a 00 00 00 00 00 a2 d3 3b 00 00 00 00 08 00 00
> Jan 30 13:41:04 meije kernel: blk_update_request: I/O error, dev sde,
> sector 2731752192 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
> Jan 30 13:41:04 meije kernel: btrfs_dev_stat_print_on_error: 732
> callbacks suppressed

Both read and write errors reported by the hardware. These aren't the
typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
errors might be suppressed.

Write errors are generally fatal. Read errors, if they include sector
LBA, Btrfs can fix if there's an extra copy (dup, raid1, raid56, etc),
otherwise, it may or may not be fatal depending on what's missing, and
what's affected by it being missing.

Btrfs might survive the write errors though with metadata raid1c3. But
later you get more concerning messages...


> This goes on for pages and quite a few days, I can extract more if it
> is of interest.

Ahhh yeah. So for what it's worth, in an md driver backed world, this
drive would have been ejected (faulty) upon the first write error. md
retries reads, but on write errors it pretty much considers the drive
written off, which means the array is degraded.

As btrfs doesn't have such a concept of faulty drives, ejected drives,
yet; you kinda have to keep an eye on this, and setup monitoring so
you know when the array goes degraded like this.
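
One simple way to do that (the mount point is a placeholder) is to poll
the per-device error counters, e.g. from cron:

  btrfs device stats /mnt           # write/read/flush/corruption/generation counters
  btrfs device stats --check /mnt   # exits non-zero if any counter is non-zero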

It's vaguely possible to get the array into a kind of split brain
situation, if two drives experience transient write errors. And in
that case, right now, there's no recovery. Btrfs just gets too
confused.

You need to replace the bad drive, and do a scrub to fix things up.
And double check with 'btrfs fi us /mountpoint/' that all block groups
have one profile set, and that it's the correct one.
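
Roughly, with placeholder names (use the devid of the dead drive if
it's no longer present):

    btrfs replace start <devid-or-/dev/bad> /dev/sdX /mountpoint
    btrfs scrub start -Bd /mountpoint
    btrfs filesystem usage /mountpoint    # each chunk type should show a single profile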


> then after mounting degraded, add a new device and attempt to remove
> the missing one:

That's not a good idea in my opinion... you really need to replace the
drive. Otherwise you're effectively doing a really expensive full
rebalance while degraded. That means nothing else can go wrong or
you're in much bigger trouble. In particular it's really common for
there to be a mismatch between physical drive SCT ERC timeouts and the
kernel's command timer. Mismatches can cause a lot of confusion
because when the kernel timer is reached, it resets the block device
that contains the "late" command, which then blows away that drive's
entire command queue.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
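
Checking and fixing that mismatch looks roughly like this (sdX is a
placeholder; 180 seconds is the value the wiki suggests for the kernel
timer when ERC can't be enabled on the drive):

    smartctl -l scterc /dev/sdX               # query SCT ERC
    smartctl -l scterc,70,70 /dev/sdX         # set 7.0s read/write ERC, if supported
    cat /sys/block/sdX/device/timeout         # kernel command timer, 30s by default
    echo 180 > /sys/block/sdX/device/timeout  # raise it if ERC can't be set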


> Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> block group 10045992468480 flags data|raid5
> Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> incompat feature flag for RAID56 (0x80)
> Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> block group 10043844984832 flags data|raid5

I'm not sure what's going on here. This is a raid6 volume and the
raid56 flag is being cleared? That's unexpected, and I don't know why
you have raid5 block groups on a raid6 array.


> and at that point the device remove aborted with an I/O error.

OK well you didn't include that, so we have no idea whether this I/O
error is about the same failed device or another device. If it's
another device, what can happen to the array is more complicated.
Hence why timeout mismatches are important. And why it's important to
have monitoring so you aren't running a degraded array for three days.

>
> I did discover I could use balance with a filter to balance much of
> the onto the three working discs, away from the missing one but I also
> discovered that whenever the checksum error appears the space cache
> seems to get corrupted.  Any further balance attempt results in
> getting stuck in a loop.  Mounting with clear_cache resolves that.

This sounds like a bug. The default space cache is stored in the data
block group, which for you should be raid6; with a missing device it's
effectively raid5. But there's some kind of conversion happening
during the balance/missing device removal, hence the clearing of the
raid56 flag per block group, and maybe this corruption is related to
that removal.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 23:14           ` Steven Fosdick
@ 2020-02-27  0:42             ` Chris Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:42 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 4:15 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> then after mounting degraded, add a new device and attempt to remove
> the missing one:

Sorry. I misread this as just removing the missing device. I didn't
catch that you had added a new device first. That's OK.

But it's still better to use the 'btrfs replace' command for this,
because it has quite a lot of shortcuts to do the rebuild much faster.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27  0:39           ` Chris Murphy
@ 2020-02-27  0:47             ` Chris Murphy
  2020-02-27 12:20             ` Steven Fosdick
  1 sibling, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-02-27  0:47 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Steven Fosdick, Jonathan H, Qu Wenruo, Btrfs BTRFS

On Wed, Feb 26, 2020 at 5:39 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> > Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> > Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> > block group 10045992468480 flags data|raid5
> > Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> > incompat feature flag for RAID56 (0x80)
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> > block group 10043844984832 flags data|raid5
>
> I'm not sure what's going on here. This is a raid6 volume and raid56
> flag is being cleared? That's unexpected and I dn't know why you have
> raid5 block groups on a raid6 array.


OK, part of my confusion is that you sorta threadjacked, while still
being on topic, and I didn't realize you weren't the original poster.
So you started out with a raid5 from the get-go. The original poster
had a raid6.

I still don't know why you're getting messages:

clearing incompat feature flag for RAID56 (0x80)

I think that's confusing if you haven't asked for a conversion from
raid5 to a non-raid56 profile.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27  0:39           ` Chris Murphy
  2020-02-27  0:47             ` Chris Murphy
@ 2020-02-27 12:20             ` Steven Fosdick
  2020-02-27 12:35               ` Steven Fosdick
  2020-02-29  6:31               ` Chris Murphy
  1 sibling, 2 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-27 12:20 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

Chris,

Thanks for getting back to me.  I have read your subsequent e-mails too.

On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:

> Both read and write errors reported by the hardware. These aren't the
> typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
> errors might be suppressed.

What are UNC errors?  My suspicion is that the drive failed in some
catastrophic way rather than just experiencing a large number of
errors.  Upon rebooting the drive made lots of clicking noises but did
not respond to whatever probe the BIOS does to find hard drives.

> Write errors are generally fatal. Read errors, if they include sector
> LBA, Btrfs can fix if there's an extra copy (dup, raid1, raid56, etc),
> otherwise, it may or may not be fatal depending on what's missing, and
> what's affected by it being missing.
>
> Btrfs might survive the write errors though with metadata raid1c3. But
> later you get more concerning messages...

So if each of these writes had succeeded, in that the drive reported
success, but the drive was subsequently unable to read the data back,
surely btrfs should be able to reconstruct the data in the failed
write, as it will either be mirrored, in the case of metadata, or
raid5, in the case of data.  So why should this be worse if the drive
reports that the write has failed?

> Ahhh yeah. So for what it's worth, in an md driver backed world, this
> drive would have been ejected (faulty) upon the first write error. md
> does retries for reads but writes it pretty much considers the drive
> written off, which means the array is degraded.

That sounds very sensible.

> As btrfs doesn't have such a concept of faulty drives, ejected drives,
> yet; you kinda have to keep an eye on this, and setup monitoring so
> you know when the array goes degraded like this.

So given that I probably don't have a spare drive lying around to
replace the failed one, and also given that this server is not
hot-plug, is there anything, other than just alerting me, that this
monitoring could do automatically to minimise or avoid damage to the
filesystem?  Can btrfs be instructed to mark a drive missing and go
into degraded mode on-line?

> You need to replace the bad drive, and do a scrub to fix things up.

Qu previously asked Jonathan for the output of btrfs check.
I ran 'btrfs check --force --check-data-csum /dev/sda'
and it found no errors.  Does this check all accessible devices in the
filesystem or just the one specified on the command line?

I'll resume the scrub I started earlier.

> And double check with 'btrfs fi us /mountpoint/' that all block groups
> have one profile set, and that it's the correct one.

Here is the output:

# btrfs fi us /data
WARNING: RAID56 detected, not implemented
Overall:
    Device size:   21.83TiB
    Device allocated:   30.06GiB
    Device unallocated:   21.80TiB
    Device missing:    5.46TiB
    Used:   25.44GiB
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       2.00
    Global reserve: 512.00MiB (used: 0.00B)

Data,RAID5: Size:8.93TiB, Used:8.93TiB (99.98%)
   /dev/sda    4.47TiB
   /dev/sdc    4.47TiB
   missing   65.00GiB
   /dev/sdb    4.40TiB

Metadata,RAID1: Size:15.00GiB, Used:12.72GiB (84.78%)
   /dev/sda    9.00GiB
   /dev/sdc   11.00GiB
   /dev/sdb   10.00GiB

System,RAID1: Size:32.00MiB, Used:816.00KiB (2.49%)
   /dev/sda   32.00MiB
   /dev/sdb   32.00MiB

Unallocated:
   /dev/sda 1006.00GiB
   /dev/sdc 1004.03GiB
   missing    5.39TiB
   /dev/sdb    1.04TiB

The only thing that looks odd to me is the overall summary in that
both "Device Allocated" and "Free (estimated)" seem wrong.

> That's not a good idea in my opinion... you really need to replace the
> drive. Otherwise you're doing a really expensive full rebalance while
> degraded, effectively. That means nothing else can go wrong or you're
> in much bigger trouble.

As you noted in your other e-mail I did add another device first.  I
would have used btrfs device replace but the documentation I found
said that the superblock of the device being removed needs to be
readable.  After the reboot the failed device was not found by the
BIOS or Linux so the superblock is not readable.

Is the documentation correct, or can device replace now cope with
completely unreadable devices?

> In particular it's really common for there to
> be a mismatch between physical drive SCT ERC timeouts, and the
> kernel's command timer. Mismatches can cause a of confusion because
> upon kernel timer being reached, it resets the block device that
> contains the "late" command, which then blows away that drive's entire
> command queue.

These are NAS-specific hard discs and the SCT ERC timeout is set to 70:


>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
>
> > Feb 10 19:38:36 meije kernel: BTRFS info (device sda): disk added /dev/sdb
> > Feb 10 19:39:18 meije kernel: BTRFS info (device sda): relocating
> > block group 10045992468480 flags data|raid5
> > Feb 10 19:39:27 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:34 meije kernel: BTRFS info (device sda): found 19 extents
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): clearing
> > incompat feature flag for RAID56 (0x80)
> > Feb 10 19:39:39 meije kernel: BTRFS info (device sda): relocating
> > block group 10043844984832 flags data|raid5
>
> I'm not sure what's going on here. This is a raid6 volume and raid56
> flag is being cleared? That's unexpected and I dn't know why you have
> raid5 block groups on a raid6 array.
>
>
> > and at that point the device remove aborted with an I/O error.
>
> OK well you didn't include that so we have no idea if this I/O error
> is about the same failed device or another device. If it's another
> device it's more complicated what can happen to the array. Hence why
> timeout mismatches are important. And why it's important to have
> monitoring so you aren't running a degraded array for three days.
>
> >
> > I did discover I could use balance with a filter to balance much of
> > the onto the three working discs, away from the missing one but I also
> > discovered that whenever the checksum error appears the space cache
> > seems to get corrupted.  Any further balance attempt results in
> > getting stuck in a loop.  Mounting with clear_cache resolves that.
>
> This sounds like a bug. The default space cache is stored in the data
> block group which for you should be raid6, with a missing device it's
> effectively raid5. But there's some kind of conversion happening
> during the balance/missing device removal, hence the clearing of the
> raid56 flag per block group, and maybe this corruption is happening
> related to that removal.
>
> --
> Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27 12:20             ` Steven Fosdick
@ 2020-02-27 12:35               ` Steven Fosdick
  2020-02-29  6:31               ` Chris Murphy
  1 sibling, 0 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-02-27 12:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Jonathan H, Qu Wenruo, Btrfs BTRFS

Chris,

Apologies, I was halfway through replying and managed to send the
e-mail by mistake, so here is the second half.

These are NAS-specific hard discs and the SCT ERC timeout is set to 70:
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:
> > and at that point the device remove aborted with an I/O error.
>
> OK well you didn't include that so we have no idea if this I/O error
> is about the same failed device or another device. If it's another
> device it's more complicated what can happen to the array. Hence why
> timeout mismatches are important. And why it's important to have
> monitoring so you aren't running a degraded array for three days.

When I first tried this there was nothing in the log except the
checksum errors.  Nothing from btrfs and nothing from the block device
driver either to indicate that there had been any hardware errors.

I did work out what code was being run within the kernel, added some
extra messages, and got as far as working out that the error is being
detected here, in relocate_file_extent_cluster, relocation.c, starting
around line 3336:

        if (!PageUptodate(page)) {
            btrfs_readpage(NULL, page);
            lock_page(page);
            if (!PageUptodate(page)) {
                unlock_page(page);
                put_page(page);
                btrfs_delalloc_release_metadata(BTRFS_I(inode),
                            PAGE_SIZE, true);
                btrfs_delalloc_release_extents(BTRFS_I(inode),
                                   PAGE_SIZE);
                btrfs_err(fs_info,
                    "relocate_file_extent_cluster: err#%d from btrfs_readpage/PageUptodate",
                    ret);
                ret = -EIO;
                goto out;
            }
        }

> This sounds like a bug. The default space cache is stored in the data
> block group which for you should be raid6, with a missing device it's
> effectively raid5. But there's some kind of conversion happening
> during the balance/missing device removal, hence the clearing of the
> raid56 flag per block group, and maybe this corruption is happening
> related to that removal.

Presumably it is still a bug with RAID5 - given that there are now no
hardware errors being logged, btrfs should not be corrupting the
space cache.  I can work around it, of course, by clearing the cache,
but there is still about 65GiB of data I cannot balance away from the
failed device so that it is properly resilient on the working devices.

Is there anything I can do here to narrow down this bug?  I probably
can't send you 14TB of data but I could run tools on this filesystem
or apply patches and post the output.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-27 12:20             ` Steven Fosdick
  2020-02-27 12:35               ` Steven Fosdick
@ 2020-02-29  6:31               ` Chris Murphy
  2020-03-03 23:40                 ` Steven Fosdick
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-02-29  6:31 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Jonathan H, Qu Wenruo, Btrfs BTRFS

On Thu, Feb 27, 2020 at 5:20 AM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> Chris,
>
> Thanks for getting back to me.  I have read your subsequent e-mails too.
>
> On Thu, 27 Feb 2020 at 00:39, Chris Murphy <lists@colorremedies.com> wrote:
>
> > Both read and write errors reported by the hardware. These aren't the
> > typical UNC error though. I'm not sure what DID_BAD_TARGET means. Some
> > errors might be suppressed.
>
> What are UNC errors?

Uncorrectable (read or write). An error associated with bad sectors.

> > Btrfs might survive the write errors though with metadata raid1c3. But
> > later you get more concerning messages...
>
> So if each of these writes had succeeded, in that the drive reported
> success, but then the drive was subsequently unable to read the data
> back surely btrfs should be able to reconstruct the data in the failed
> write as it will either be mirrored, in the case of metadata, or
> raid5, in the case of data.  So why should this be worse if the drive
> reports that the write has failed?

s/might/should


>
> > As btrfs doesn't have such a concept of faulty drives, ejected drives,
> > yet; you kinda have to keep an eye on this, and setup monitoring so
> > you know when the array goes degraded like this.
>
> So given that I probably don't have a spare drive lying around to
> replace the failed one and also given that this server is not
> hot-plug, is there anything, other than just alerting me, that this
> monitoring could do automatically to minimise or avoid damage to the
> filesystem?  Can be btrfs be instructed to mark a drive missing and go
> into degraded mode on-line?

No. I'm not sure how to quiet the kernel noise that'll happen for
every write failure. You could stop all writes, unmount, remove the
offending drive, and mount degraded, and that'll be quiet. I haven't
tried yanking the bad drive, then using -o remount,degraded, so I'm
not sure if that will silence things.
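
Sketch, with placeholder device and mount point:

    # stop anything writing to the filesystem, then
    umount /data
    # pull or power down the bad drive, then
    mount -o degraded /dev/sda /data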


>
> > You need to replace the bad drive, and do a scrub to fix things up.
>
> Qu previously asked Jonathan for the output of btrfs check.
> I ran 'btrfs check --force --check-data-csum /dev/sda'
> and it found no errors.  Does this check all accessible devices in the
> filesystem or just the one specified on the command line?

I'm curious why you had to use force, but yes, that should check all
of them. If this is a mounted file system, there's 'btrfs scrub' for
this purpose too, and it can be set to run read-only on a read-only
mounted file system.

> > And double check with 'btrfs fi us /mountpoint/' that all block groups
> > have one profile set, and that it's the correct one.
>
> Here is the output:
>
> # btrfs fi us /data
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   21.83TiB
>     Device allocated:   30.06GiB
>     Device unallocated:   21.80TiB
>     Device missing:    5.46TiB
>     Used:   25.44GiB
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>
> Data,RAID5: Size:8.93TiB, Used:8.93TiB (99.98%)
>    /dev/sda    4.47TiB
>    /dev/sdc    4.47TiB
>    missing   65.00GiB
>    /dev/sdb    4.40TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:12.72GiB (84.78%)
>    /dev/sda    9.00GiB
>    /dev/sdc   11.00GiB
>    /dev/sdb   10.00GiB
>
> System,RAID1: Size:32.00MiB, Used:816.00KiB (2.49%)
>    /dev/sda   32.00MiB
>    /dev/sdb   32.00MiB
>
> Unallocated:
>    /dev/sda 1006.00GiB
>    /dev/sdc 1004.03GiB
>    missing    5.39TiB
>    /dev/sdb    1.04TiB
>
> The only thing that looks odd to me is the overall summary in that
> both "Device Allocated" and "Free (estimated)" seem wrong.

That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
is EOL but I don't think that's related to the usage info. Still, tons
of btrfs bugs fixed between 5.1 and 5.5...
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1

Including raid56 specific fixes:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1


> As you noted in your other e-mail I did add another device first.  I
> would have used btrfs device replace but the documentation I found
> said that the superblock of the device being removed needs to be
> readable.  After the reboot the failed device was not found by the
> BIOS or Linux so the superblock is not readable.
>
> Is the documentation correct, or can device replace now cope with
> completely unreadable devices?

I'm not sure what btrfs-progs version you've got or what man page
you're looking at. But replace will work whether the drive being
replaced is present or not.

man btrfs replace

           On a live filesystem, duplicate the data to the target
device which is currently stored on the source device. If the source
device is not available anymore, or if the -r option is set, the data
is built only using the RAID redundancy mechanisms.

There are two requirements that don't apply to add then remove: a) the
replacement drive needs to be the same size or bigger than the source;
b) fs resize is not automatically performed, so if the replacement
drive is bigger, and you want to use all of that space, you need to
run 'btrfs fi resize <devid>:max' or whatever size you want it set to.
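
As a sketch with placeholder names (when the source drive is gone you
refer to it by its devid, which 'btrfs filesystem show' lists as
missing):

    btrfs filesystem show /data
    btrfs replace start <missing-devid> /dev/sdX /data
    btrfs replace status /data
    btrfs filesystem resize <devid>:max /data   # only if the new drive is bigger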


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
@ 2020-03-03 15:42         ` Jonathan H
  2020-03-04 20:17           ` RooSoft Ltd @ bluehost
  2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 1 reply; 24+ messages in thread
From: Jonathan H @ 2020-03-03 15:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Update:

My most recent scrub just finished. It found a few hundred errors, but
many files that were not mentioned by the scrub at all are still
unreadable. I started another scrub and it is finding new errors and
correcting them, but I aborted it since I don't feel like constantly
scrubbing is making progress.

Much more interestingly, I ran `btrfs rescue chunk-recover` and it
reported that the majority (2847/3514) of my chunks were
unrecoverable. The output from the `chunk-recover` is too long to
include in a pastebin. Is there anything in particular that might be
of interest?

Also, I take the existence of unrecoverable chunks to mean that the
filesystem is not salvageable, is that true?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-29  6:31               ` Chris Murphy
@ 2020-03-03 23:40                 ` Steven Fosdick
  2020-03-04  6:32                   ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-03-03 23:40 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS; +Cc: Jonathan H, Qu Wenruo

On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@colorremedies.com> wrote:

> s/might/should

I do think it is worth looking at the possibility that the "write
hole", because it is well documented, is being blamed for all cases
where data proves to be unrecoverable, when some of these may be due
to a bug or bugs.  From what I've found, the write hole arises from
uncertainty over which of several discs actually got written to, so
when copies don't match there is no way to know which one is right.
In the case of a disc failure, though, surely the copy that is right
is the one that doesn't involve the failed disc?  Or is there
something else I don't understand?

> I'm curious why you had to use force, but yes that should check all of
> them. If this is a mounted file system, there's 'btrfs scrub' for this
> purpose though too and it can be set to run read-only on a read-only
> mounted file system.

In the case of 'btrfs check' the filesystem was mounted r/o but I had
things reading it so didn't want to unmount it completely.  It
requires --force to work on a mounted filesystem even if the mount is
r/o.

I did try running a scrub but had to abandon it as it wasn't proving
very useful.  It wasn't fixing the errors and wasn't providing any
messages that would help diagnose or fix them some other way - it only
seems to provide a count of the errors it didn't fix.  That seems to
be a general thing, in that there seem to be plenty of ways an overall
'failed' status can be returned to userspace, usually without anything
being logged.  That obviously makes sense if the request was to do
something stupid, but if instead the error return is because corruption
has been found, would it not be better to log an error?

> That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> is EOL but I don't think that's related to the usage info. Still, tons
> of btrfs bugs fixed between 5.1 and 5.5...
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
>
> Including raid56 specific fixes:z
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1

This was in response to posting dodgy output from btrfs fi usage.  My
output was from btrfs-progs v5.4 which, when I checked yesterday,
seemed to be the latest.  I am also running Linux 5.5.7.  It may have
been slightly older when the disk failed but would still have been
5.5.x

Since my previous e-mail I have managed to get a 'btrfs device remove
missing' to work by reading all the files from userspace, deleting
those that returned an I/O error, and restoring from backup.  Even
after that the summary information is still wacky:

WARNING: RAID56 detected, not implemented
Overall:
    Device size:   16.37TiB
    Device allocated:   30.06GiB
    Device unallocated:   16.34TiB
    Device missing:      0.00B
    Used:   25.40GiB
    Free (estimated):      0.00B (min: 8.00EiB)
    Data ratio:       0.00
    Metadata ratio:       2.00
    Global reserve: 512.00MiB (used: 0.00B)

Is the clue in the warning message?  It looks like it is failing to
count any of the RAID5 blocks.

Point taken about device replace.  What would device replace do if the
remove step failed in the same way that device remove has been failing
for me recently?

I'm a little disappointed we didn't get to the bottom of the bug that
was causing the free space cache to become corrupted when a balance
operation failed, but when I asked what I could do to help I got no
reply to that part of my message (not just from you, from anyone on
the list).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-03 23:40                 ` Steven Fosdick
@ 2020-03-04  6:32                   ` Chris Murphy
  2020-03-04 12:45                     ` Steven Fosdick
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2020-03-04  6:32 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Btrfs BTRFS, Jonathan H, Qu Wenruo

On Tue, Mar 3, 2020 at 4:40 PM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> On Sat, 29 Feb 2020 at 06:31, Chris Murphy <lists@colorremedies.com> wrote:
>
> > s/might/should
>
> I do think it is worth looking at the possibility that the "write
> hole", because it well documented, is being blamed for all cases that
> data proves to be unrecoverable when some of these may be due to a bug
> or bugs.  From what I've found about the write hole this is because of
> uncertainty over which of several discs actually got written to so
> when copies don't match there is no way to know which one is right.
> In the case of a disc failure, though, surely the copy that is right
> is the one that doesn't involve the failed disc?  Or is there
> something else I don't understand?

a. the write hole doesn't happen with raid1, and your metadata is
raid1, so any file system corruption is not related to the write hole
b. the write hole can apply to your raid5 data stripes, but this is a
rare case that happens with a crash or power failure during write and
causes a stripe to be incompletely rewritten while it's being modified.
That's rare with conventional raid5, more rare on btrfs, but it can
happen.
c. to actually be affected by the write hole problem, the stripe with
the mismatching parity strip must have a missing data strip, such as a
bad sector making up one of the strips, or a failed device. If neither
of those is the case, it's not the write hole, it's something else.
d. before there's a device or sector failure, a scrub following a
crash or power loss will correct the problem resulting from the write
hole



> I did try running a scrub but had to abandon it as it wasn't proving
> very useful.  It wasn't fixing the errors and wasn't providing any
> messages that would help diagnose or fix them some other way - it only
> seems to provide a count of the errors it didn't fix.

It can't fix them when the file system is mounted read only.


>  That seems to
> be general thing in that there seem plenty of ways an overall 'failed'
> status can be returned to userspace, usually without anything being
> logged.  That obviously makes sense if the request was to do something
> stupid but if instead the error return is because corruption has been
> found would it not be better to log an error?

The most obvious case of corruption is a checksum mismatch (the
on-the-fly checksum for a node/leaf/block compared to the recorded
checksum). Btrfs always reports this.

Parity strips are not checksummed. If parity is corrupt, it's only
corrected on a scrub (or balance). They're not used during normal read
operations. Upon degraded reads, parity is used to reconstruct data.
Since there's no checksum, the parity is trusted, and bad parity will
cause a corrupt reconstruction of data; that corruption fails the
checksum - and Btrfs will tell you about it, and also return EIO.

So that leaves the less obvious cases of corruption where some
metadata or data is corrupt in memory, and a valid checksum is
computed on already corrupt data/metadata, and then written to disk.
Now when Btrfs reads it, there's no checksum mismatch, and yet there
is corruption. For metadata reads, the tree checker has gotten quite a
bit better lately at sanity checking metadata. For data, you're out of
luck: the application will have to sanity check it, and if it doesn't,
the data is just corrupt - but that's no different from any other
file system. At least Btrfs gave you a chance. But that's the gotcha
of bad RAM or other sources of bit flips in the storage stack.


> > That looks like a bug. I'd try a newer btrfs-progs version. Kernel 5.1
> > is EOL but I don't think that's related to the usage info. Still, tons
> > of btrfs bugs fixed between 5.1 and 5.5...
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/?id=v5.5&id2=v5.1
> >
> > Including raid56 specific fixes:z
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v5.5&id2=v5.1
>
> This was in response to posting dodgy output from btrfs fi usage.  My
> output was from btrfs-progs v5.4 which, when I checked yesterday,
> seemed to be the latest.  I am also running Linux 5.5.7.  It may have
> been slightly older when the disk failed but would still have been
> 5.5.x

From six days ago, your dmesg:

Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1

Actually what I should have asked is whether you ever ran 5.2 - 5.2.14
kernels because that series had a known corruption bug in it, fixed in
5.2.15

> Since my previous e-mail I have managed to get a 'btrfs device remove
> missing' to work by reading all the files from userspace, deleting
> those that returned I/O error and restoring from backup.  Even after
> that the summary information is still wacky:
>
> WARNING: RAID56 detected, not implemented
> Overall:
>     Device size:   16.37TiB
>     Device allocated:   30.06GiB
>     Device unallocated:   16.34TiB
>     Device missing:      0.00B
>     Used:   25.40GiB
>     Free (estimated):      0.00B (min: 8.00EiB)
>     Data ratio:       0.00
>     Metadata ratio:       2.00
>     Global reserve: 512.00MiB (used: 0.00B)
>
> is the clue in the warning message?  It looks like it is failing to
> count any of the RAID5 blocks.

I think 'btrfs filesystem usage' doesn't completely support raid56;
that's all it's saying.

'btrfs fi df' and 'btrfs fi show' should show things correctly

>
> Point taken about device replace.  What would device replace do if the
> remove step failed in the same way that device remove has been failing
> for me recently?

I don't understand the question. The device replace command includes
'device add' and 'device remove' in one step, it just lacks the
implied resize that happens with add and remove.



> I'm a little disappointed we didn't get to the bottom of the bug that
> was causing the free space cache to become corrupted when a balance
> operation failed but when I asked what I could do to help I got no
> reply to that part of my message (not just from you, from anyone on
> the list).

The free space cache isn't that important. It can be discarded and
reconstructed. It's an optimization. I don't think it's checksummed
anyway, instead corruption is determined by mismatching
generation/transid? So it may not literally be corrupt, it's just
ambiguous whether it can be relied upon, therefore it's marked for
reconstruction.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04  6:32                   ` Chris Murphy
@ 2020-03-04 12:45                     ` Steven Fosdick
  2020-03-04 20:10                       ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Steven Fosdick @ 2020-03-04 12:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS, Jonathan H, Qu Wenruo

On Wed, 4 Mar 2020 at 06:32, Chris Murphy <lists@colorremedies.com> wrote:

> a. the write hole doesn't happen with raid1, and your metadata is
> raid1 so any file system corruption is not related to the write hole

That makes sense and I had no problem moving the metadata blocks from
the failed disk onto the working ones.

> c. to actually be affected by the write hole problem, the stripe with
> mismatching parity strip must have a missing data strip such as a bad
> sector making up one of the strips, or a failed device. If neither of
> those are the case, it's not the write hole, it's something else.

Normally there would not have been any power failures as the machine
is protected by a UPS and auto shuts down when the battery is getting
low.  Certainly there would have been none between the filesystem
being created and the disk failure.  The only exception is that after
the disk failure, and after the device remove had failed, I had to turn
the power off because the machine would not shut down.  As the device
remove had moved very little data before failing, I had used balance to
move data away from the failed disk, i.e. to restore redundancy and
mitigate the risk of a further disk failure.  That is how I came to
migrate the metadata ahead of the data and then used a range filter to
migrate the data in stages.  That is where I discovered that when
migrating a range of block groups failed, all subsequent attempts to
use balance resulted in an infinite loop, and that is what led to the
need to turn the power off.  Subsequently I was able to avoid a repeat
of that by applying a patch from this list to make the balance
cancellable in more places, and also discovered that clearing the
space cache avoided the loop anyway.

> d. before there's a device or sector failure, a scrub following a
> crash or power loss will correct the problem resulting from the write
> hole.

Worth knowing for the future.

> It can't fix them when the file system is mounted read only.

I had mounted it r/w by then.

> The most obvious case of corruption is a checksum mismatch (the
> onthefly checksum for a node/leaf/block compared to the recorded
> checksum). Btrfs always reports this.

And it did, but only for the relocation tree that was being built as
part of the balance.  I am sure you or Qu said in a previous e-mail
that this is a temporary structure only built during that operation, so
it should not have been corrupted by previous problems.  As no media
errors were logged either, that must surely mean that either there is
a bug in constructing the tree, or corrupted data was being copied from
elsewhere into the tree and only detected after that copy rather than
before.

> So that leaves the less obvious cases of corruption where some
> metadata or data is corrupt in memory, and a valid checksum is
> computed on already corrupt data/metadata, and then written to disk.

But if the relocation tree is constructed during the balance operation
rather than being a permanent structure then the chance of flipped
bits in memory corrupting it on successive attempts is surely very
small indeed.

> At least Btrfs gave you a chance. But that's the gotcha
> of bad RAM or other sources of bit flips in the storage stack.

I am not complaining about checksums - yes it is better to know if
your data has been corrupted, I just want btrfs to be as robust as
possible.

> From six days ago, your dmesg:
>
> Sep 27 15:16:08 meije kernel:       Not tainted 5.1.10-arch1-1-ARCH #1

Sorry for the confusion from having threadjacked.  My kernel history is:

Sep 04 13:53:53 meije kernel: Linux version 5.1.10-arch1-1-ARCH
Oct 14 14:31:56 meije kernel: Linux version 5.3.6-arch1-1-ARCH
Nov 29 12:25:21 meije kernel: Linux version 5.4.0-arch1-1
Dec 30 17:30:49 meije kernel: Linux version 5.4.6-arch3-1
Jan 04 15:54:22 meije kernel: Linux version 5.4.7-arch1-1
Jan 09 15:43:56 meije kernel: Linux version 5.4.8-arch1-1
Jan 22 10:15:30 meije kernel: Linux version 5.4.13-arch1-1
Jan 26 17:23:36 meije kernel: Linux version 5.4.15-arch1-1
Jan 29 22:56:42 meije kernel: Linux version 5.5.0-arch1-1
Feb 10 16:28:52 meije kernel: Linux version 5.5.2-arch2-2
Feb 14 20:23:08 meije kernel: Linux version 5.5.3-arch1-1
Feb 16 16:06:43 meije kernel: Linux version 5.5.3-arch1-1
Feb 18 08:36:49 meije kernel: Linux version 5.5.4-arch1-1
Feb 26 17:11:08 meije kernel: Linux version 5.5.6-arch1-1
Mar 03 20:45:28 meije kernel: Linux version 5.5.7-arch1-1

> Actually what I should have asked is whether you ever ran 5.2 - 5.2.14
> kernels because that series had a known corruption bug in it, fixed in
> 5.2.15

No, I skipped the whole of 5.2 because I saw messages about corruption
on this list.

> I think btrf filesystem usage doesn't completely support raid56 is all
> it's saying.
>
> 'btrfs fi df' and 'btrfs fi show' should show things correctly

They do, and the info for individual devices in the output of btrfs fi
usage also looks completely believable.  It's only the summary at the
top that's obviously incorrect.

> I don't understand the question. The device replace command includes
> 'device add' and 'device remove' in one step, it just lacks the
> implied resize that happens with add and remove.

When I did the add and remove separately, the add succeeded and the
remove failed (initially), having moved very little data.  If that were
to happen with those same steps within a replace, would it simply stop
where it found the problem, leaving the new device added and the old
one not yet removed, or would it try to back out the whole operation?

> The free space cache isn't that important. It can be discarded and
> reconstructed. It's an optimization.

Of course, but if it wasn't for Jonathan mentioning that btrfs check
had found his space cache to be corrupt I would never have
hypothesised that the same might be happening for me as there were no
messages in the log about the cache or anything that looked like an
error, only an infinite loop.  I think what was happening was that
moving the block group was failing with an "out of space" error and
the loop simply retries it in the hope that some space has become
available since, and the out of space error was in turn caused by the
corrupt space cache.

So in summary, I believe I was suffering from a situation in which,
for a very small number of data blocks, something went wrong in the
reconstruction process or some associated metadata was bad such that:

1. Building the relocation tree went wrong so that it had a checksum error.
2. The free space cache became corrupt.

Regards,
Steve.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04 12:45                     ` Steven Fosdick
@ 2020-03-04 20:10                       ` Chris Murphy
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Murphy @ 2020-03-04 20:10 UTC (permalink / raw)
  To: Steven Fosdick; +Cc: Chris Murphy, Btrfs BTRFS, Jonathan H, Qu Wenruo

On Wed, Mar 4, 2020 at 5:45 AM Steven Fosdick <stevenfosdick@gmail.com> wrote:
>
> > The most obvious case of corruption is a checksum mismatch (the
> > onthefly checksum for a node/leaf/block compared to the recorded
> > checksum). Btrfs always reports this.
>
> And it did, but only for the relocation tree that was being built as
> part of the balance.  I am sure you or Qu said in a previous e-mail
> that this is a temporary structure only built during that operation so
> should not have been corrupted by previous problems.  As no media
> errors were logged either that must surely mean that either there is a
> bug in constructing the tree or corrupted data was being copied from
> elsewhere into the tree and only detected after that copy rather than
> before.

I'm not familiar enough with the data relocation tree; all I can do is
wild speculation: it could be that the reported corruption, which might
just be reporting noise, is a consequence of the stalled/failed device
removal, and that the actual problem remains obscured.


>
> > So that leaves the less obvious cases of corruption where some
> > metadata or data is corrupt in memory, and a valid checksum is
> > computed on already corrupt data/metadata, and then written to disk.
>
> But if the relocation tree is constructed during the balance operation
> rather than being a permanent structure then the chance of flipped
> bits in memory corrupting it on successive attempts is surely very
> small indeed.

Probably true.

> > I don't understand the question. The device replace command includes
> > 'device add' and 'device remove' in one step, it just lacks the
> > implied resize that happens with add and remove.
>
> When i did the add and remove separately, the add succeeded and the
> remove failed (initially) having moved very little data.  If that were
> to happen with those same steps within a replace would it simply stop
> where it found the problem, leaving the new device added and the old
> one not yet removed, or would it try to back out the whole operation?

Yeah the replace code has its own ioctl in the kernel. So it's not
entirely fair to refer to it as a mere shortcut of the add then remove
method.

First data is copied from source to new target, the copy reuses scrub
code, and the new target isn't actually "added" until the very end of
the process. During the copy, new blocks are written to both source
and destination devices. Only once replication is definitely
successful is the new device really added, and the old device removed.
Up to the point where the two are swapped out, the source device is
not in a "being removed" state like the add then remove method.

The device add then remove method takes a while, involves resize and
balance code, and is migrating chunks on the source to other devices.
In the case of raid5 it means restriping all devices. Every device is
reading and writing. It's a lot more expensive than just replacing.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-03 15:42         ` Jonathan H
@ 2020-03-04 20:17           ` RooSoft Ltd @ bluehost
  2020-03-09 21:10             ` Steven Fosdick
  0 siblings, 1 reply; 24+ messages in thread
From: RooSoft Ltd @ bluehost @ 2020-03-04 20:17 UTC (permalink / raw)
  To: linux-btrfs

On 03/03/2020 15:42, Jonathan H wrote:
> Update:
>
> My most recent scrub just finished. It found a few hundred errors, but
> many files that were not mentioned by the scrub at all are still
> unreadable. I started another scrub and it is finding new errors and
> correcting them, but I aborted it since I do feel like constantly
> scrubbing is making progress.
>
> Much more interestingly, I ran `btrfs rescue chunk-recover` and it
> reported that the majority (2847/3514) of my chunks were
> unrecoverable. The output from the `chunk-recover` is too long to
> include in a pastebin. Is there anything in particular that might be
> of interest?
>
> Also, I take the existence of unrecoverable chunks to mean that the
> filesystem is not salvageable, is that true?

I would suspect memory corruption at this stage then. Run memtest and
if it finds anything wrong, switch to ECC and don't use that memory
for anything critical.

-- 
==

Don Alexander


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-03-04 20:17           ` RooSoft Ltd @ bluehost
@ 2020-03-09 21:10             ` Steven Fosdick
  0 siblings, 0 replies; 24+ messages in thread
From: Steven Fosdick @ 2020-03-09 21:10 UTC (permalink / raw)
  To: RooSoft Ltd @ bluehost, linux-btrfs

Thanks for the suggestion.  I ran memtest and it found no errors.
This is an HP Gen8 microserver and according to the spec sheet uses
ECC RAM.

On Wed, 4 Mar 2020 at 21:05, RooSoft Ltd @ bluehost
<roosoft@casa-di-locascio.net> wrote:
>
> On 03/03/2020 15:42, Jonathan H wrote:
> > Update:
> >
> > My most recent scrub just finished. It found a few hundred errors, but
> > many files that were not mentioned by the scrub at all are still
> > unreadable. I started another scrub and it is finding new errors and
> > correcting them, but I aborted it since I do feel like constantly
> > scrubbing is making progress.
> >
> > Much more interestingly, I ran `btrfs rescue chunk-recover` and it
> > reported that the majority (2847/3514) of my chunks were
> > unrecoverable. The output from the `chunk-recover` is too long to
> > include in a pastebin. Is there anything in particular that might be
> > of interest?
> >
> > Also, I take the existence of unrecoverable chunks to mean that the
> > filesystem is not salvageable, is that true?
>
> I would suspect memory corruption at this stage then. Run memtest and if
> it finds anything wrong switch to ECC and don't use that memory for
> critical anything.
>
> --
> ==
>
> Don Alexander
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: USB reset + raid6 = majority of files unreadable
  2020-02-26 16:45       ` Jonathan H
  2020-02-26 22:36         ` Steven Fosdick
  2020-03-03 15:42         ` Jonathan H
@ 2020-03-11  4:41         ` Zygo Blaxell
  2 siblings, 0 replies; 24+ messages in thread
From: Zygo Blaxell @ 2020-03-11  4:41 UTC (permalink / raw)
  To: Jonathan H; +Cc: Qu Wenruo, linux-btrfs

On Wed, Feb 26, 2020 at 08:45:17AM -0800, Jonathan H wrote:
> On Tue, Feb 25, 2020 at 8:37 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > It's great that your metadata is safe.
> >
> > The biggest concern is no longer a concern now.
> 
> Glad to hear.
> 
> > More context would be welcomed.
> 
> Here's a string of uncorrectable errors detected by the scrub: http://ix.io/2cJM
> 
> Here is another attempt to read a file giving an I/O error: http://ix.io/2cJS
> The last two lines are produced when trying to read the file a second time.
> 
> Here's the state of the currently running scrub: http://ix.io/2cJU
> I had to cancel and resume the scrub to run `btrfs check` earlier, but
> otherwise it has been uninterrupted.
> 
> > Anyway, even with more context, it may still lack the needed info as
> > such csum failure message is rate limited.
> >
> > The mirror num 2 means it's the first rebuild try failed.
> >
> > Since only the first rebuild try failed, and there are some corrected
> > data read, it looks btrfs can still rebuild the data.
> >
> > Since you have already observed some EIO, it looks like write hole is
> > involved, screwing up the rebuild process.
> > But it's still very strange, as I'm expecting more mirror number other
> > than 2.
> > For your 6 disks with 1 bad disk, we still have 5 ways to rebuild data,
> > only showing mirror num 2 doesn't look correct to me.
> 
> I'm sort of curious why so many files have been affected. It seems
> like most of the file system has become unreadable, but I was under
> the impression that if the write hole occurred it would at least not
> damage too much data at once. Is that incorrect?

There are still unfixed bugs in btrfs parity RAID:

	https://www.spinics.net/lists/linux-btrfs/msg94594.html

If you have an array where some of the drives go offline for a while and
come back online, then you will see a lot of what looks like disk-level
corruption.  The unwritten blocks on drives that come back online are
treated as corrupted data (csums or transid fields don't match expected
values recorded on the other drives) and btrfs will attempt to repair
them.

If you have parent transid verify failures, you are very likely to also
have correctable data errors made uncorrectable due to the above raid5/6
bug.  The two visible error cases are two different possible consequences
of the same low-level write loss events.  This means that at some point,
you had two disks offline for a while, but the other disks in the array
were still getting updates (array failures are never simple--multiple
modes of failure at different times during a single event are the norm).

If you have corrupted data on raid5/6 on btrfs, some of it won't come
back due to the data recovery corruption bug linked above.  Until this
bug is fixed, the only alternative is to restore the lost data from
backups.  Replacing the missing drive before fixing the correction bug
in the kernel will damage some more data, so data that is theoretically
readable now may be lost in the future as you replace drives; however,
losses should be 1% or less, so raid5/6 recovery in-place can still be
quicker than a full mkfs+restore for raid0.

Note that the raid5/6 write hole is a separate issue.  It's possible
for both issues to occur at the same time in a failing array, but the
correction bug will affect several orders of magnitude more data than
the write hole.

raid1 and raid1c3 have no such problems.  The parent transid verify errors
come from btrfs metadata, which in your filesystem is raid1c3, so they
would have been easily and correctly repaired as they were encountered.

> > BTW, since your free space cache is already corrupted, it's recommended
> > to clear the space cache.
> 
> It's strange to me that the free space cache is giving an error, since
> I cleared it previously and the most recent unmount was clean.

Free space cache is stored in data block groups and subject to all
of the above btrfs parity raid data integrity problems.  Do not use
space_cache=v1 with raid5 or raid6.  Better not to use space_cache=v1
at all, but v1 + raid5/6 is bad in ways that go beyond merely being slow
and unreliable.

Free space tree (space_cache=v2) is stored in btrfs metadata, so it will
work properly with raid1 or raid1c3 metadata.  Probably faster too,
and nothing can break v2 that doesn't also destroy the filesystem.
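
Switching over is roughly (device name and mount point are
placeholders; the free space tree is created on the first read-write
mount with space_cache=v2 and persists afterwards):

    umount /data
    btrfs check --clear-space-cache v1 /dev/sda   # optional: drop the old v1 cache
    mount -o space_cache=v2 /dev/sda /data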

All that said, there are internal data integrity checks in free space
cache (v1), so it's possible that the only bad thing that happens here
is that you get a bunch of free space cache invalidation error messages.

> > For now, since it looks like write hole is involved, the only way to
> > solve the problem is to remove all offending files (including a super
> > large file in root 5).
> >
> > You can use `btrfs inspect logical-resolve <bytenr> <mnt>" to see all
> > the involved files.
> >
> > The full <bytenr> are the bytenr shown in btrfs check --check-data-csum
> > output.
> 
> The strange thing is if you use `btrfs inspect logical-resolve` on all
> of the bytenrs mentioned in the check output, I get that all of the
> corruption is in the same file (see http://ix.io/2cJP), but this does
> not seem consistent with the uncorrectable csum errors the scrub is
> detecting.

The uncorrectable csum errors, and changes in errors over time, are
probably the correction bug in action.  Scrub also produces highly
questionable error statistics on raid5/6.  That may be a distinct bug
from the correction/corruption bug--it's hard to tell without fixing
all the current bugs and testing again.

Note: even if scrub is fully debugged, it is limited by the btrfs
on-disk format.  Corruption in data blocks with csums can always be
corrected up to raid5/6 drive-loss limits.  nodatasum files cannot be
reliably corrected.  Free space cache will be corrupted, but btrfs
should detect this and invalidate/rebuild the cache (but don't use
space_cache=v1 anyway).  In the best case scrub will count some csum
errors against the wrong disks in some cases (though the best case
is certainly better than what scrub does now).

In a RAID5/6 stripe that is completely filled with data blocks
belonging to files that have csums, every data block in the stripe can
be individually tested against its csum.  If all the data blocks have
correct csums, but the parity block on disk does not match computed
parity of the data, then we know that the parity block is corrupted
because we eliminated every other possible corrupted block.

If one of the data blocks in a RAID5/6 stripe has an incorrect csum then
we can try to recover the data using the parity block.  If that recovered
data fails the csum check too, then we know both data and parity blocks
are corrupt, since all other blocks in the raid stripe have good csums.
RAID6 has another parity block and some more combinations to try,
but eventually ends up either recovering the entire stripe or knowing
exactly which blocks were corrupt.

If there is:

	- a data block in a RAID5/6 stripe which does not have a csum
	(either because it is unoccupied, part of free space cache,
	or part of a nodatasum file)

	- all the data blocks that do have csums in the RAID stripe
	are OK (otherwise we would know that those blocks were the
	corrupted ones)

	- a parity mismatch detected in the RAID stripe, i.e. by scrub

then btrfs cannot determine whether the parity block is corrupted or one
of the no-csum data blocks.  The parity mismatch can be detected, but any
of the drives without a csum on its data block could have contributed to
the mismatch, and there is no way to tell which no-csum data block(s) is
(are) correct.  This will cause scrub to place csum error counts on the
wrong disks, e.g. blaming the disk that happens to hold the parity block
for the raid stripe when one of the other disks is the one flipping bits.

None of this explains why scrub reports "read" errors on healthy drives
when there is data corruption on other drives.  That part is a bug, the
only question is whether it's the _same_ bug as the correction corruption
bug, and that won't be known until at least one of the bugs is fixed.

> I've been calculating the offsets of the files mentioned in the
> relocation csum errors (by adding the block group and offset),
> resolving the files with `btrfs inspect logical-resolve` and deleting
> them. But it seems like the set of files I'm deleting is also totally
> unrelated to the set of files the scrub is detecting errors in. Given
> the frequency of relocation errors, I fear I will need to delete
> almost everything on the file system for the deletion to complete. I
> can't tell if I should expect these errors to be fixable since the
> relocation isn't making any attempt to correct them as far as I can
> tell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-03-11  4:41 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-25 20:39 USB reset + raid6 = majority of files unreadable Jonathan H
2020-02-25 23:58 ` Qu Wenruo
2020-02-26  0:51   ` Steven Fosdick
2020-02-26  0:55     ` Qu Wenruo
2020-02-26  3:38   ` Jonathan H
2020-02-26  3:44     ` Jonathan H
2020-02-26  4:37     ` Qu Wenruo
2020-02-26 16:45       ` Jonathan H
2020-02-26 22:36         ` Steven Fosdick
2020-02-26 23:14           ` Steven Fosdick
2020-02-27  0:42             ` Chris Murphy
2020-02-27  0:39           ` Chris Murphy
2020-02-27  0:47             ` Chris Murphy
2020-02-27 12:20             ` Steven Fosdick
2020-02-27 12:35               ` Steven Fosdick
2020-02-29  6:31               ` Chris Murphy
2020-03-03 23:40                 ` Steven Fosdick
2020-03-04  6:32                   ` Chris Murphy
2020-03-04 12:45                     ` Steven Fosdick
2020-03-04 20:10                       ` Chris Murphy
2020-03-03 15:42         ` Jonathan H
2020-03-04 20:17           ` RooSoft Ltd @ bluehost
2020-03-09 21:10             ` Steven Fosdick
2020-03-11  4:41         ` Zygo Blaxell
