* [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
@ 2016-06-25 12:21 Goffredo Baroncelli
  2016-06-25 17:25 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2016-06-25 12:21 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4691 bytes --]

Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit BTRFS's ability to scrub a corrupted raid5 filesystem. To test it, I first found where a file was stored, and then I tried to corrupt the data disks (while unmounted) or the parity disk.
The results showed that sometimes the kernel recomputes the parity wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) I created a 1500MB filesystem in raid5 mode (for data and metadata):

	truncate -s 500M disk1.img; losetup -f disk1.img
	truncate -s 500M disk2.img; losetup -f disk2.img
	truncate -s 500M disk3.img; losetup -f disk3.img
	sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
	sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128KiB:

	python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
	sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096        (64k, second half of the file: 'bdbbbb.....)
/dev/loop1: offset=61865984+16*4096        (64k, first half of the file: 'adaaaa.....)
/dev/loop2: offset=61865984+16*4096        (64k, parity: '\x03\x00\x03\x03\x03.....)

4) I tried to corrupt each disk (one disk per test), and then ran a scrub:

for example for the disk /dev/loop2:
	sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
        	        seek=$((61865984+16*4096)) count=5
	sudo mount /dev/loop0 mnt
	sudo btrfs scrub start mnt/.

5) I checked the disks at the offsets above, to verify that the data/parity was correct.
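
For reference, a minimal Python 3 sketch of the check in step 5 (it assumes the loop device names and the offsets found in step 3, which will differ on any other filesystem; run it as root with the filesystem unmounted):

	# verify_stripe.py - read the two data strips and the parity strip at the
	# offsets reported by btrfs-debug-tree and verify that parity == d0 XOR d1
	STRIP = 64 * 1024
	layout = [('/dev/loop1', 61865984 + 16 * 4096),   # first half:  'adaaaa...'
	          ('/dev/loop0', 81788928 + 16 * 4096),   # second half: 'bdbbbb...'
	          ('/dev/loop2', 61865984 + 16 * 4096)]   # parity:      '\x03\x00\x03...'
	strips = []
	for dev, off in layout:
	    with open(dev, 'rb') as f:
	        f.seek(off)
	        strips.append(f.read(STRIP))
	d0, d1, parity = strips
	computed = bytes(a ^ b for a, b in zip(d0, d1))
	print('data strips start with:', d0[:4], d1[:4])
	print('parity matches d0^d1  :', computed == parity)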

However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't report any corruption, but recomputes the parity (always correctly);

2) when I corrupted the other disks (/dev/loop[01]), btrfs was able to find the corruption. But I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity: where the parity should have been, the kernel copied the data of the second disk onto the parity disk;

2.b) the kernel repaired the damage, and rebuilt a correct parity.

I have to point out another strange thing: in dmesg I found two kinds of messages:

msg1)
	[....]
	[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
	[ 1021.366949] BTRFS: has skinny extents
	[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
	[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

msg2)
	[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
	[ 1017.435074] BTRFS: has skinny extents
	[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
	[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
	[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
	[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
	[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0


but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)

Another strangeness is that SCRUB sometimes reports
 ERROR: there are uncorrectable errors
and sometimes reports
 WARNING: errors detected during scrubbing, corrected

but these also seem UNrelated to behavior 2.a) or 2.b), or to msg1 or msg2.


Enclosed you can find the script which I used to trigger the bug. I had to rerun it several times to show the problem, because it doesn't happen every time. Note that the offsets and the loop device names are hard-coded. You must run the script from the directory where it resides, e.g. "bash test.sh".

Br
G.Baroncelli


 
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



[-- Attachment #2: test.sh --]
[-- Type: application/x-sh, Size: 2819 bytes --]


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 12:21 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Goffredo Baroncelli
@ 2016-06-25 17:25 ` Chris Murphy
  2016-06-25 17:58   ` Chris Murphy
  2016-06-26  2:53   ` Duncan
  2016-09-21  7:28 ` Qu Wenruo
  2016-11-04  2:10 ` Qu Wenruo
  2 siblings, 2 replies; 33+ messages in thread
From: Chris Murphy @ 2016-06-25 17:25 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:

> 5) I check the disks at the offsets above, to verify that the data/parity is correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub don't find any corruption, but recomputed the parity (always correctly);

This is mostly good news: it is fixing bad parity during scrub.
What's not clear, due to the lack of any message, is whether the scrub
always writes out new parity, or only writes it when there's a
mismatch.


> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but compute the wrong parity. Where it was the parity, the kernel copied the data of the second disk on the parity disk

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked. So in other words, if there are any errors fixed up
during a scrub, you should do a 2nd scrub. The first scrub should make
sure data is correct, and the 2nd scrub should make sure the bug is
papered over by computing correct parity and replacing the bad parity.

I wonder if the same problem happens with balance or if this is just a
bug in scrub code?


> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometime reports
>  ERROR: there are uncorrectable errors
> and sometime reports
>  WARNING: errors detected during scrubbing, corrected
>
> but also these seems UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2

I've seen this also, errors in user space but no kernel messages.


-- 
Chris Murphy


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 17:25 ` Chris Murphy
@ 2016-06-25 17:58   ` Chris Murphy
  2016-06-25 18:42     ` Goffredo Baroncelli
  2016-06-26  2:53   ` Duncan
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2016-06-25 17:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: kreijack, linux-btrfs

On Sat, Jun 25, 2016 at 11:25 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Sat, Jun 25, 2016 at 6:21 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>
>> 5) I check the disks at the offsets above, to verify that the data/parity is correct
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub don't find any corruption, but recomputed the parity (always correctly);
>
> This is mostly good news, that it is fixing bad parity during scrub.
> What's not clear due to the lack of any message is if the scrub is
> always writing out new parity, or only writes it if there's a
> mismatch.
>
>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but compute the wrong parity. Where it was the parity, the kernel copied the data of the second disk on the parity disk
>
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

The wrong parity: is it valid for the data strips that include the
(intentionally) corrupt data?

Can parity computation happen before the csum check? Where sometimes you get:

read data strips > compute parity > csum check fails > read good
parity from disk > fix up the bad data chunk > write wrong parity
(based on wrong data)?

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

Lines 2371-2383 suggest that there's a parity check, so parity isn't
always being rewritten to disk if it's already correct. But perhaps it
doesn't know the recomputed parity is wrong: it thinks the on-disk
parity is wrong, and so writes out the wrongly computed parity?
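
To make that guess concrete, here is a toy model of the suspected ordering (purely an illustration of the hypothesis above, not the actual raid56.c logic; the values are made up):

	# Toy model of the suspected bug: parity recomputed from the strips as
	# first read (before the csum-driven repair), then written out last.
	def xor(a, b):
	    return bytes(x ^ y for x, y in zip(a, b))

	good_d0, good_d1 = b'adaaaaaa', b'bdbbbbbb'
	disk = {'d0': b'\x00' * 5 + good_d0[5:],    # d0 corrupted on disk (5 bytes zeroed)
	        'd1': good_d1,
	        'p':  xor(good_d0, good_d1)}        # on-disk parity is still good

	stale_parity = xor(disk['d0'], disk['d1'])  # parity computed from the bad read
	disk['d0'] = xor(disk['p'], disk['d1'])     # data strip repaired from good parity
	disk['p'] = stale_parity                    # stale parity overwrites good parity

	print(disk['d0'] == good_d0)                # True:  the data was fixed
	print(disk['p'] == xor(good_d0, good_d1))   # False: parity is now silently wrong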



-- 
Chris Murphy


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 17:58   ` Chris Murphy
@ 2016-06-25 18:42     ` Goffredo Baroncelli
  2016-06-25 22:33       ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2016-06-25 18:42 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On 2016-06-25 19:58, Chris Murphy wrote:
[...]
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
> 
> The wrong parity, is it valid for the data strips that includes the
> (intentionally) corrupt data?
> 
> Can parity computation happen before the csum check? Where sometimes you get:
> 
> read data strips > computer parity > check csum fails > read good
> parity from disk > fix up the bad data chunk > write wrong parity
> (based on wrong data)?
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
> 
> 2371-2383 suggest that there's a parity check, it's not always being
> rewritten to disk if it's already correct. But it doesn't know it's
> not correct, it thinks it's wrong so writes out the wrongly computed
> parity?

The parity is not valid for either the corrected data or the corrupted data. It seems that the scrub process copies the contents of disk2 to disk3. That could happen only if the contents of disk1 were zero.
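
That inference follows directly from XOR parity; a quick illustrative check (made-up values):

	# raid5 parity is the XOR of the data strips, so the parity can equal one
	# strip only when the other strip is all zeroes
	d2 = b'bdbbbbbb'
	zero = bytes(len(d2))
	parity = bytes(a ^ b for a, b in zip(zero, d2))
	print(parity == d2)          # True: 0 XOR x == x, for any x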

BR
GB


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 18:42     ` Goffredo Baroncelli
@ 2016-06-25 22:33       ` Chris Murphy
  2016-06-26  9:20         ` Goffredo Baroncelli
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2016-06-25 22:33 UTC (permalink / raw)
  To: kreijack; +Cc: Chris Murphy, linux-btrfs

On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
<kreijack@inwind.it> wrote:
> On 2016-06-25 19:58, Chris Murphy wrote:
> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>
>> The wrong parity, is it valid for the data strips that includes the
>> (intentionally) corrupt data?
>>
>> Can parity computation happen before the csum check? Where sometimes you get:
>>
>> read data strips > computer parity > check csum fails > read good
>> parity from disk > fix up the bad data chunk > write wrong parity
>> (based on wrong data)?
>>
>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>
>> 2371-2383 suggest that there's a parity check, it's not always being
>> rewritten to disk if it's already correct. But it doesn't know it's
>> not correct, it thinks it's wrong so writes out the wrongly computed
>> parity?
>
> The parity is not valid for both the corrected data and the corrupted data. It seems that the scrub process copy the contents of the disk2 to disk3. It could happens only if the contents of disk1 is zero.

I'm not sure what it takes to hit this exactly. I just tested 3x
raid5 with two files, a 128KiB "a" and a 128KiB "b", so each one is a
full stripe write. I corrupted the 64KiB of "a" on devid 1 and the
64KiB of "b" on devid 2, then did a scrub: the errors were detected
and corrected, and parity is still correct.

I also tried to corrupt both parities and scrub, and like you I get no
messages from scrub in user space or kernel but the parity is
corrected.

The fixup is also not cow'd. It is an overwrite, which seems
unproblematic to me at face value. But?

Next I corrupted the parities, failed one drive, mounted degraded, and
read in both files. If there is a write hole, I should get back
corrupt data, from the corrupt parity being blindly trusted and the
data being wrongly reconstructed.

[root@f24s ~]# cp /mnt/5/* /mnt/1/tmp
cp: error reading '/mnt/5/a128.txt': Input/output error
cp: error reading '/mnt/5/b128.txt': Input/output error

[607594.478720] BTRFS warning (device dm-7): csum failed ino 295 off 0
csum 1940348404 expected csum 650595490
[607594.478818] BTRFS warning (device dm-7): csum failed ino 295 off
4096 csum 463855480 expected csum 650595490
[607594.478869] BTRFS warning (device dm-7): csum failed ino 295 off
8192 csum 3317251692 expected csum 650595490
[607594.479227] BTRFS warning (device dm-7): csum failed ino 295 off
12288 csum 2973611336 expected csum 650595490
[607594.479244] BTRFS warning (device dm-7): csum failed ino 295 off
16384 csum 2556299655 expected csum 650595490
[607594.479254] BTRFS warning (device dm-7): csum failed ino 295 off
20480 csum 1098993191 expected csum 650595490
[607594.479263] BTRFS warning (device dm-7): csum failed ino 295 off
24576 csum 1503293813 expected csum 650595490
[607594.479272] BTRFS warning (device dm-7): csum failed ino 295 off
28672 csum 1538866238 expected csum 650595490
[607594.479282] BTRFS warning (device dm-7): csum failed ino 295 off
36864 csum 2855931166 expected csum 650595490
[607594.479292] BTRFS warning (device dm-7): csum failed ino 295 off
32768 csum 3351364818 expected csum 650595490


Soo.....no write hole? Clearly it must reconstruct from the corrupt
parity, then check the csum tree for EXTENT_CSUM; the csum doesn't
match, so the bad data fails to propagate upstream and doesn't result
in a fixup. Good.
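
A toy illustration of that containment, with zlib.crc32 standing in for the real csum (btrfs actually uses crc32c; this only shows the shape of the check, not kernel code):

	import zlib

	def xor(a, b):
	    return bytes(x ^ y for x, y in zip(a, b))

	good, other = b'a' * 8, b'b' * 8
	stored_csum = zlib.crc32(good)             # what the csum tree says the strip should be

	bad_parity = b'\x00' * 8                   # the parity strip was corrupted
	rebuilt = xor(bad_parity, other)           # degraded read: rebuild the missing strip

	# The rebuilt data fails the csum check, so the read returns an I/O error
	# instead of silently handing back garbage.
	print(zlib.crc32(rebuilt) == stored_csum)  # False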

What happens if I umount, make the missing device visible again, and
mount not degraded?

[607775.394504] BTRFS error (device dm-7): parent transid verify
failed on 18517852160 wanted 143 found 140
[607775.424505] BTRFS info (device dm-7): read error corrected: ino 1
off 18517852160 (dev /dev/mapper/VG-a sector 67584)
[607775.425055] BTRFS info (device dm-7): read error corrected: ino 1
off 18517856256 (dev /dev/mapper/VG-a sector 67592)
[607775.425560] BTRFS info (device dm-7): read error corrected: ino 1
off 18517860352 (dev /dev/mapper/VG-a sector 67600)
[607775.425850] BTRFS info (device dm-7): read error corrected: ino 1
off 18517864448 (dev /dev/mapper/VG-a sector 67608)
[607775.431867] BTRFS error (device dm-7): parent transid verify
failed on 16303439872 wanted 145 found 139
[607775.432973] BTRFS info (device dm-7): read error corrected: ino 1
off 16303439872 (dev /dev/mapper/VG-a sector 4262240)
[607775.433438] BTRFS info (device dm-7): read error corrected: ino 1
off 16303443968 (dev /dev/mapper/VG-a sector 4262248)
[607775.433842] BTRFS info (device dm-7): read error corrected: ino 1
off 16303448064 (dev /dev/mapper/VG-a sector 4262256)
[607775.434220] BTRFS info (device dm-7): read error corrected: ino 1
off 16303452160 (dev /dev/mapper/VG-a sector 4262264)
[607775.434847] BTRFS error (device dm-7): parent transid verify
failed on 16303456256 wanted 145 found 139
[607775.435972] BTRFS info (device dm-7): read error corrected: ino 1
off 16303456256 (dev /dev/mapper/VG-a sector 4262272)
[607775.436426] BTRFS info (device dm-7): read error corrected: ino 1
off 16303460352 (dev /dev/mapper/VG-a sector 4262280)
[607775.439786] BTRFS error (device dm-7): parent transid verify
failed on 16303259648 wanted 143 found 140
[607775.441974] BTRFS error (device dm-7): parent transid verify
failed on 16303472640 wanted 145 found 139
[607775.453652] BTRFS error (device dm-7): parent transid verify
failed on 16303341568 wanted 144 found 138

OK? Btrfs sees the wrong generation on the now re-added device, and it
looks like it's also fixing up the metadata missing on the formerly
missing device. Good.

Can I copy the files? Yes, no complaints. But it's the parity that's
bad, not the data. What happens if I scrub?

Parity is fixed, with no messages in user space or the kernel. But for
the formerly "failed" and missing disk, scrub -BdR does show:
[...snip...]
    super_errors: 2
    malloc_errors: 0
    uncorrectable_errors: 0
    unverified_errors: 0
    corrected_errors: 0

Curious. Super errors, but neither uncorrectable nor corrected?

[root@f24s ~]# btrfs rescue super-recover -v /dev/VG/c
All Devices:
    Device: id = 1, name = /dev/mapper/VG-a
    Device: id = 2, name = /dev/mapper/VG-b
    Device: id = 3, name = /dev/VG/c

Before Recovering:
    [All good supers]:
        device name = /dev/mapper/VG-a
        superblock bytenr = 65536

        device name = /dev/mapper/VG-a
        superblock bytenr = 67108864

        device name = /dev/mapper/VG-b
        superblock bytenr = 65536

        device name = /dev/mapper/VG-b
        superblock bytenr = 67108864

        device name = /dev/VG/c
        superblock bytenr = 65536

        device name = /dev/VG/c
        superblock bytenr = 67108864

    [All bad supers]:

All supers are valid, no need to recover. There are only two supers on
these devices because they're 250GiB each, and the 3rd super would
have been at 256GiB.


Alright so the errors were fixed. *shrug*


-- 
Chris Murphy


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 17:25 ` Chris Murphy
  2016-06-25 17:58   ` Chris Murphy
@ 2016-06-26  2:53   ` Duncan
  2016-06-26 22:33     ` ronnie sahlberg
  1 sibling, 1 reply; 33+ messages in thread
From: Duncan @ 2016-06-26  2:53 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?
> That's fucked. So in other words, if there are any errors fixed up
> during a scrub, you should do a 2nd scrub. The first scrub should make
> sure data is correct, and the 2nd scrub should make sure the bug is
> papered over by computing correct parity and replacing the bad parity.
> 
> I wonder if the same problem happens with balance or if this is just a
> bug in scrub code?

Could this explain why people have been reporting so many raid56 mode 
cases of btrfs replacing a first drive appearing to succeed just fine, 
but then they go to btrfs replace a second drive, and the array crashes 
as if the first replace didn't work correctly after all, resulting in two 
bad devices once the second replace gets under way, of course bringing 
down the array?

If so, then it looks like we have our answer as to what has been going 
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're 
seeing if the writes are all exactly 128 MB, with that possibly 
explaining the super-slow replaces, and this thread may have just given 
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it 
currently exists is more or less fatally flawed, and a full scrap and 
rewrite to an entirely different raid56 mode on-disk format may be 
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using raid56 
mode for anything but testing with throw-away data, at this point.  
Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level 
warning in the manpage and require a --force to either mkfs.btrfs or 
balance to raid56 mode?  Because that's about where I am on it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 22:33       ` Chris Murphy
@ 2016-06-26  9:20         ` Goffredo Baroncelli
  2016-06-26 16:43           ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2016-06-26  9:20 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On 2016-06-26 00:33, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
> <kreijack@inwind.it> wrote:
>> On 2016-06-25 19:58, Chris Murphy wrote:
>> [...]
>>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>>> does it wrongly, and then overwrites good parity with bad parity?
>>>
>>> The wrong parity, is it valid for the data strips that includes the
>>> (intentionally) corrupt data?
>>>
>>> Can parity computation happen before the csum check? Where sometimes you get:
>>>
>>> read data strips > computer parity > check csum fails > read good
>>> parity from disk > fix up the bad data chunk > write wrong parity
>>> (based on wrong data)?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>
>>> 2371-2383 suggest that there's a parity check, it's not always being
>>> rewritten to disk if it's already correct. But it doesn't know it's
>>> not correct, it thinks it's wrong so writes out the wrongly computed
>>> parity?
>>
>> The parity is not valid for both the corrected data and the corrupted data. It seems that the scrub process copy the contents of the disk2 to disk3. It could happens only if the contents of disk1 is zero.
> 
> I'm not sure what it takes to hit this exactly. I just tested 3x
> raid5, where two files 128KiB "a" and 128KiB "b", so that's a full
> stripe write for each. I corrupted devid 1 64KiB of "a" and devid2
> 64KiB of "b" did a scrub, error is detected, and corrected, and parity
> is still correct.

How many times did you try this corruption test? I was unable to trigger the bug systematically; on average I got the bug once every three tests....


[....]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26  9:20         ` Goffredo Baroncelli
@ 2016-06-26 16:43           ` Chris Murphy
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2016-06-26 16:43 UTC (permalink / raw)
  To: kreijack; +Cc: Chris Murphy, linux-btrfs

On Sun, Jun 26, 2016 at 3:20 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> On 2016-06-26 00:33, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
>> <kreijack@inwind.it> wrote:
>>> On 2016-06-25 19:58, Chris Murphy wrote:
>>> [...]
>>>>> Wow. So it sees the data strip corruption, uses good parity on disk to
>>>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>>>> does it wrongly, and then overwrites good parity with bad parity?
>>>>
>>>> The wrong parity, is it valid for the data strips that includes the
>>>> (intentionally) corrupt data?
>>>>
>>>> Can parity computation happen before the csum check? Where sometimes you get:
>>>>
>>>> read data strips > computer parity > check csum fails > read good
>>>> parity from disk > fix up the bad data chunk > write wrong parity
>>>> (based on wrong data)?
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>>
>>>> 2371-2383 suggest that there's a parity check, it's not always being
>>>> rewritten to disk if it's already correct. But it doesn't know it's
>>>> not correct, it thinks it's wrong so writes out the wrongly computed
>>>> parity?
>>>
>>> The parity is not valid for both the corrected data and the corrupted data. It seems that the scrub process copy the contents of the disk2 to disk3. It could happens only if the contents of disk1 is zero.
>>
>> I'm not sure what it takes to hit this exactly. I just tested 3x
>> raid5, where two files 128KiB "a" and 128KiB "b", so that's a full
>> stripe write for each. I corrupted devid 1 64KiB of "a" and devid2
>> 64KiB of "b" did a scrub, error is detected, and corrected, and parity
>> is still correct.
>
> How many time tried this corruption test ? I was unable to raise the bug systematically; in average every three tests I got 1 bug....

Once.

I just did it a 2nd time and both files' parity is wrong now. So I
did it several more times. Sometimes both files' parity is bad.
Sometimes just one file's parity is bad. Sometimes neither file's
parity is bad.

It's a very bad bug, because it is a form of silent data corruption
induced by Btrfs, and it's hit non-deterministically. Is this some
form of race condition?

Somewhat orthogonal to this: while Btrfs is subject to the write hole
problem, where parity can be wrong, that case is detected and warned
about. Bad data doesn't propagate up to user space.

This might explain how users are getting hit with corrupt files only
after they have a degraded volume. They did a scrub where some fixups
happened, but behind the scenes parity was possibly corrupted even
though their data was fixed. Later they have a failed device, the bad
parity is needed, and now there are a bunch of scary checksum errors
with piles of files listed as unrecoverable. And in fact they are
unrecoverable, because the bad parity means bad reconstruction, so even
scraping the data off with btrfs restore won't work.


-- 
Chris Murphy


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26  2:53   ` Duncan
@ 2016-06-26 22:33     ` ronnie sahlberg
  2016-06-26 22:38       ` Hugo Mills
                         ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: ronnie sahlberg @ 2016-06-26 22:33 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>> That's fucked. So in other words, if there are any errors fixed up
>> during a scrub, you should do a 2nd scrub. The first scrub should make
>> sure data is correct, and the 2nd scrub should make sure the bug is
>> papered over by computing correct parity and replacing the bad parity.
>>
>> I wonder if the same problem happens with balance or if this is just a
>> bug in scrub code?
>
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode?  Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings: they really should not use raid56 right now
for new filesystems.

2, instead of a --force flag (users tend to ignore --force and
warnings in documentation), ifdef out the options to create raid56 in
mkfs.btrfs. Developers who want to test can just remove the ifdef and
recompile the tools anyway. But if end-users have to recompile
userspace, that really forces the point that "you really should not
use this right now".

3, reach out to the documentation and fora for the major distros and
make sure they update their documentation accordingly. I think a lot
of end-users, if they try to research something, are more likely to go
to <their-distro> fora and wiki than to search out an upstream forum.


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26 22:33     ` ronnie sahlberg
@ 2016-06-26 22:38       ` Hugo Mills
  2016-06-27  3:22         ` Steven Haigh
  2016-06-27  3:21       ` Steven Haigh
  2016-06-27  3:50       ` Christoph Anton Mitterer
  2 siblings, 1 reply; 33+ messages in thread
From: Hugo Mills @ 2016-06-26 22:38 UTC (permalink / raw)
  To: ronnie sahlberg; +Cc: Duncan, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 2903 bytes --]

On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> > Could this explain why people have been reporting so many raid56 mode
> > cases of btrfs replacing a first drive appearing to succeed just fine,
> > but then they go to btrfs replace a second drive, and the array crashes
> > as if the first replace didn't work correctly after all, resulting in two
> > bad devices once the second replace gets under way, of course bringing
> > down the array?
> >
> > If so, then it looks like we have our answer as to what has been going
> > wrong that has been so hard to properly trace and thus to bugfix.
> >
> > Combine that with the raid4 dedicated parity device behavior you're
> > seeing if the writes are all exactly 128 MB, with that possibly
> > explaining the super-slow replaces, and this thread may have just given
> > us answers to both of those until-now-untraceable issues.
> >
> > Regardless, what's /very/ clear by now is that raid56 mode as it
> > currently exists is more or less fatally flawed, and a full scrap and
> > rewrite to an entirely different raid56 mode on-disk format may be
> > necessary to fix it.
> >
> > And what's even clearer is that people /really/ shouldn't be using raid56
> > mode for anything but testing with throw-away data, at this point.
> > Anything else is simply irresponsible.
> >
> > Does that mean we need to put a "raid56 mode may eat your babies" level
> > warning in the manpage and require a --force to either mkfs.btrfs or
> > balance to raid56 mode?  Because that's about where I am on it.
> 
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.
> 
> 
> I would suggest:
> 
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.

   I beefed up the warnings in several places in the wiki a couple of
days ago.

   Hugo.

> 2, Instead of a --force flag. (Users tend to ignore ---force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".
> 
> 3, reach out to the documentation and fora for the major distros and
> make sure they update their
> documentation accordingly.
> I think a lot of end-users, if they try to research something, are
> more likely to go to <their-distro> fora and wiki
> than search out an upstream fora.

-- 
Hugo Mills             | "No! My collection of rare, incurable diseases!
hugo@... carfax.org.uk | Violated!"
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                Stimpson J. Cat, The Ren & Stimpy Show

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26 22:33     ` ronnie sahlberg
  2016-06-26 22:38       ` Hugo Mills
@ 2016-06-27  3:21       ` Steven Haigh
  2016-06-27 19:47         ` Duncan
  2016-06-27  3:50       ` Christoph Anton Mitterer
  2 siblings, 1 reply; 33+ messages in thread
From: Steven Haigh @ 2016-06-27  3:21 UTC (permalink / raw)
  To: linux-btrfs

On 2016-06-27 08:33, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>> 
>>> Wow. So it sees the data strip corruption, uses good parity on disk 
>>> to
>>> fix it, writes the fix to disk, recomputes parity for some reason but
>>> does it wrongly, and then overwrites good parity with bad parity?
>>> That's fucked. So in other words, if there are any errors fixed up
>>> during a scrub, you should do a 2nd scrub. The first scrub should 
>>> make
>>> sure data is correct, and the 2nd scrub should make sure the bug is
>>> papered over by computing correct parity and replacing the bad 
>>> parity.
>>> 
>>> I wonder if the same problem happens with balance or if this is just 
>>> a
>>> bug in scrub code?
>> 
>> Could this explain why people have been reporting so many raid56 mode
>> cases of btrfs replacing a first drive appearing to succeed just fine,
>> but then they go to btrfs replace a second drive, and the array 
>> crashes
>> as if the first replace didn't work correctly after all, resulting in 
>> two
>> bad devices once the second replace gets under way, of course bringing
>> down the array?
>> 
>> If so, then it looks like we have our answer as to what has been going
>> wrong that has been so hard to properly trace and thus to bugfix.
>> 
>> Combine that with the raid4 dedicated parity device behavior you're
>> seeing if the writes are all exactly 128 MB, with that possibly
>> explaining the super-slow replaces, and this thread may have just 
>> given
>> us answers to both of those until-now-untraceable issues.
>> 
>> Regardless, what's /very/ clear by now is that raid56 mode as it
>> currently exists is more or less fatally flawed, and a full scrap and
>> rewrite to an entirely different raid56 mode on-disk format may be
>> necessary to fix it.
>> 
>> And what's even clearer is that people /really/ shouldn't be using 
>> raid56
>> mode for anything but testing with throw-away data, at this point.
>> Anything else is simply irresponsible.
>> 
>> Does that mean we need to put a "raid56 mode may eat your babies" 
>> level
>> warning in the manpage and require a --force to either mkfs.btrfs or
>> balance to raid56 mode?  Because that's about where I am on it.
> 
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.

+1

> I would suggest:
> 
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.

I voiced my concern on #btrfs about this - it really should show that 
this may eat your data and is properly experimental. At the moment, it 
looks as if the features are implemented and working as expected. In my 
case, with nothing out of the ordinary, I've now got ~3.8TB of free disk 
space. Certainly not ready for *ANY* kind of public use.

> 2, Instead of a --force flag. (Users tend to ignore ---force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".

I think this is a somewhat good idea - however it should be a warning 
along the lines of:
"BTRFS RAID56 is VERY experimental and is known to corrupt data in 
certain cases. Use at your own risk!

Continue? (y/N):"

> 3, reach out to the documentation and fora for the major distros and
> make sure they update their
> documentation accordingly.
> I think a lot of end-users, if they try to research something, are
> more likely to go to <their-distro> fora and wiki
> than search out an upstream fora.

Another good idea.

I'd also recommend updates to the ArchLinux wiki - as for some reason I 
always seem to end up there when searching for a certain topic...

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26 22:38       ` Hugo Mills
@ 2016-06-27  3:22         ` Steven Haigh
  0 siblings, 0 replies; 33+ messages in thread
From: Steven Haigh @ 2016-06-27  3:22 UTC (permalink / raw)
  To: Hugo Mills, ronnie sahlberg, Duncan, Btrfs BTRFS

On 2016-06-27 08:38, Hugo Mills wrote:
> On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
>> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>> > Could this explain why people have been reporting so many raid56 mode
>> > cases of btrfs replacing a first drive appearing to succeed just fine,
>> > but then they go to btrfs replace a second drive, and the array crashes
>> > as if the first replace didn't work correctly after all, resulting in two
>> > bad devices once the second replace gets under way, of course bringing
>> > down the array?
>> >
>> > If so, then it looks like we have our answer as to what has been going
>> > wrong that has been so hard to properly trace and thus to bugfix.
>> >
>> > Combine that with the raid4 dedicated parity device behavior you're
>> > seeing if the writes are all exactly 128 MB, with that possibly
>> > explaining the super-slow replaces, and this thread may have just given
>> > us answers to both of those until-now-untraceable issues.
>> >
>> > Regardless, what's /very/ clear by now is that raid56 mode as it
>> > currently exists is more or less fatally flawed, and a full scrap and
>> > rewrite to an entirely different raid56 mode on-disk format may be
>> > necessary to fix it.
>> >
>> > And what's even clearer is that people /really/ shouldn't be using raid56
>> > mode for anything but testing with throw-away data, at this point.
>> > Anything else is simply irresponsible.
>> >
>> > Does that mean we need to put a "raid56 mode may eat your babies" level
>> > warning in the manpage and require a --force to either mkfs.btrfs or
>> > balance to raid56 mode?  Because that's about where I am on it.
>> 
>> Agree. At this point letting ordinary users create raid56 filesystems
>> is counterproductive.
>> 
>> 
>> I would suggest:
>> 
>> 1, a much more strongly worded warning in the wiki. Make sure there
>> are no misunderstandings
>> that they really should not use raid56 right now for new filesystems.
> 
>    I beefed up the warnings in several places in the wiki a couple of
> days ago.

Not to sound rude - but I don't think these go anywhere near far enough. 
It needs to be completely obvious that there's a good chance you'll lose 
everything. IMHO that's the only way that will stop BTRFS from getting 
the 'data eater' reputation. It can be revisited and reworded when the 
implementation is more tested and stable.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-26 22:33     ` ronnie sahlberg
  2016-06-26 22:38       ` Hugo Mills
  2016-06-27  3:21       ` Steven Haigh
@ 2016-06-27  3:50       ` Christoph Anton Mitterer
  2016-06-27  4:35         ` Andrei Borzenkov
  2 siblings, 1 reply; 33+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-27  3:50 UTC (permalink / raw)
  To: ronnie sahlberg, Duncan; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1511 bytes --]

On Sun, 2016-06-26 at 15:33 -0700, ronnie sahlberg wrote:
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.
I doubt most end users can be assumed to read the wiki...


> 2, Instead of a --force flag. (Users tend to ignore ---force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".

Well, if one does --force or --yes-i-know-what-i-do and one actually
doesn't, then such a person is on his own.
People can always shoot themselves if they want to.

Actually I think that the compile-time way is inferior here.
Distros may just always enable raid56 there to allow people to continue
mounting their existing filesystems.



What should IMHO be done as well is giving a big fat warning in the
manpages/etc. that when nodatacow is used, RAID recovery cannot produce
valid data (at least as long as there isn't checksumming implemented
for nodatacow).
Probably it should also be documented what btrfs does in such a
situation. E.g. does it just randomly pick a readable block from one of
the copies? Simply error out and consider the file broken? Fill the
blocks in question with zeroes?

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-27  3:50       ` Christoph Anton Mitterer
@ 2016-06-27  4:35         ` Andrei Borzenkov
  2016-06-27 16:39           ` Christoph Anton Mitterer
  0 siblings, 1 reply; 33+ messages in thread
From: Andrei Borzenkov @ 2016-06-27  4:35 UTC (permalink / raw)
  To: Christoph Anton Mitterer, ronnie sahlberg, Duncan; +Cc: Btrfs BTRFS

On 27.06.2016 06:50, Christoph Anton Mitterer wrote:
> 
> What should IMHO be done as well is giving a big fat warning in the
> manpages/etc. that when nodatacow is used RAID recovery cannot produce
> valid data (at least as long as there isn't checksumming implemented
> for nodatacow).

The problem is that the current implementation of RAID56 puts exactly the
CoW data at risk. I.e. writing a new copy of data may suddenly make an old
copy of data inaccessible, even though it had been safely committed to
disk and is now in a read-only snapshot.




* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-27  4:35         ` Andrei Borzenkov
@ 2016-06-27 16:39           ` Christoph Anton Mitterer
  0 siblings, 0 replies; 33+ messages in thread
From: Christoph Anton Mitterer @ 2016-06-27 16:39 UTC (permalink / raw)
  To: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

On Mon, 2016-06-27 at 07:35 +0300, Andrei Borzenkov wrote:
> The problem is that current implementation of RAID56 puts exactly CoW
> data at risk. I.e. writing new (copy of) data may suddenly make old
> (copy of) data inaccessible, even though it had been safely committed
> to
> disk and is now in read-only snapshot.

Sure,... mine was just a general thing to be added.
No checksums => no way to tell which block is valid in case of silent
block errors => no way to recover except by chance.

=> This should be included as a warning, especially as userland software
starts to automatically set nodatacow (IIRC systemd does so), thereby
silently breaking functionality (integrity+recoverability) assumed by
the user.


Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]


* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-27  3:21       ` Steven Haigh
@ 2016-06-27 19:47         ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2016-06-27 19:47 UTC (permalink / raw)
  To: linux-btrfs

Steven Haigh posted on Mon, 27 Jun 2016 13:21:00 +1000 as excerpted:

> I'd also recommend updates to the ArchLinux wiki - as for some reason I
> always seem to end up there when searching for a certain topic...

Not really btrfs related, but for people using popular search engines, at 
least, this is likely for two reasons:

1) Arch is apparently the most popular distro among reasonably 
technically literate users -- those who will both tend to have good 
technical knowledge and advice on the various real-life issues Linux 
users tend to encounter, and are likely to post it to an easily publicly 
indexable forum.  (And in that regard, wikis are likely to be more 
indexable than (web archives of) mailing lists like this, because that's 
(part of) what wikis /do/ by design, make topics keyword searchable.  
Lists, not so much.)

2) Specifically to the point of being _publicly_ indexable, Arch may have 
a more liberal robots.txt that allows indexers more access than other 
distros, some of which may limit robot access for performance reasons.


With this combination, arch's wiki is a natural place for searches to 
point.

So agreed, a high priority on getting the raid56 warning up there on the 
arch wiki is a good idea, indeed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 12:21 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Goffredo Baroncelli
  2016-06-25 17:25 ` Chris Murphy
@ 2016-09-21  7:28 ` Qu Wenruo
  2016-09-21  7:35   ` Tomasz Torcz
  2016-09-21 15:02   ` Chris Murphy
  2016-11-04  2:10 ` Qu Wenruo
  2 siblings, 2 replies; 33+ messages in thread
From: Qu Wenruo @ 2016-09-21  7:28 UTC (permalink / raw)
  To: kreijack, linux-btrfs

Hi,

For this well-known bug, is anyone working on a fix?

It couldn't be more frustrating to spend days digging only to find that 
someone has already worked on it.

BTW, since the kernel scrub is somewhat broken for raid5/6, I'd like to 
implement btrfsck scrub support; at least we could use btrfsck to fix bad 
stripes before the kernel fix lands.

Thanks,
Qu

At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:
> Hi all,
>
> following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first find where a file was stored, and then I tried to corrupt the data disks (when unmounted) or the parity disk.
> The result showed that sometime the kernel recomputed the parity wrongly.
>
> I tested the following kernel
> - 4.6.1
> - 4.5.4
> and both showed the same behavior.
>
> The test was performed as described below:
>
> 1) create a filesystem in raid5 mode (for data and metadata) of 1500MB
>
> 	truncate -s 500M disk1.img; losetup -f disk1.img
> 	truncate -s 500M disk2.img; losetup -f disk2.img
> 	truncate -s 500M disk3.img; losetup -f disk3.img
> 	sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
> 	sudo mount /dev/loop0 mnt/
>
> 2) I created a file with a length of 128kb:
>
> 	python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
> 	sudo umount mnt/
>
> 3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to find where the file stripe is located:
>
> /dev/loop0: offset=81788928+16*4096        (64k, second half of the file: 'bdbbbb.....)
> /dev/loop1: offset=61865984+16*4096        (64k, first half of the file: 'adaaaa.....)
> /dev/loop2: offset=61865984+16*4096        (64k, parity: '\x03\x00\x03\x03\x03.....)
>
> 4) I tried to corrupt each disk (one disk per test), and then run a scrub:
>
> for example for the disk /dev/loop2:
> 	sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
>         	        seek=$((61865984+16*4096)) count=5
> 	sudo mount /dev/loop0 mnt
> 	sudo btrfs scrub start mnt/.
>
> 5) I check the disks at the offsets above, to verify that the data/parity is correct
>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub don't find any corruption, but recomputed the parity (always correctly);
>
> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but compute the wrong parity. Where it was the parity, the kernel copied the data of the second disk on the parity disk
>
> 2.b) the kernel repaired the damage, and rebuild a correct parity
>
> I have to point out another strange thing: in dmesg I found two kinds of messages:
>
> msg1)
> 	[....]
> 	[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
> 	[ 1021.366949] BTRFS: has skinny extents
> 	[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> 	[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
> msg2)
> 	[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
> 	[ 1017.435074] BTRFS: has skinny extents
> 	[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> 	[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, 	root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> 	[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> 	[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
> 	[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
>
> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometime reports
>  ERROR: there are uncorrectable errors
> and sometime reports
>  WARNING: errors detected during scrubbing, corrected
>
> but also these seems UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2
>
>
> Enclosed you can find the script which I used to trigger the bug. I have to rerun it several times to show the problem because it doesn't happen every time. Pay attention that the offset and the loop device name are hard coded. You must run the script in the same directory where it is: eg "bash test.sh".
>
> Br
> G.Baroncelli
>
>
>
>




* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-21  7:28 ` Qu Wenruo
@ 2016-09-21  7:35   ` Tomasz Torcz
  2016-09-21  9:15     ` Qu Wenruo
  2016-09-21 15:02   ` Chris Murphy
  1 sibling, 1 reply; 33+ messages in thread
From: Tomasz Torcz @ 2016-09-21  7:35 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
> Hi,
> 
> For this well-known bug, is there any one fixing it?
> 
> It can't be more frustrating finding some one has already worked on it after
> spending days digging.
> 
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
> btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
> kernel fix.

  Why wouldn't you fix in-kernel code?  Why implement duplicate functionality
when you can fix the root cause?

-- 
Tomasz   .. oo o.   oo o. .o   .o o. o. oo o.   ..
Torcz    .. .o .o   .o .o oo   oo .o .. .. oo   oo
o.o.o.   .o .. o.   o. o. o.   o. o. oo .. ..   o.



* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-21  7:35   ` Tomasz Torcz
@ 2016-09-21  9:15     ` Qu Wenruo
  2016-09-21 15:13       ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2016-09-21  9:15 UTC (permalink / raw)
  To: Tomasz Torcz, linux-btrfs



At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>> Hi,
>>
>> For this well-known bug, is there any one fixing it?
>>
>> It can't be more frustrating finding some one has already worked on it after
>> spending days digging.
>>
>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
>> btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
>> kernel fix.
>
>   Why wouldn't you fix in-kernel code?  Why implement duplicate functionality
> when you can fix the root cause?
>
We'll fix the in-kernel code.

The fsck one is not a duplicate; we need a known-good reference to compare 
the kernel behavior against.

Just like the qgroup fix in btrfsck: if the kernel can't handle something 
well, we do need to fix the kernel, but a good off-line fixer won't hurt.
(Btrfs-progs is much easier to implement in, has a faster review/merge 
cycle, and it can help us find a better solution before screwing the kernel 
up again.)

Thanks,
Qu




* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-21  7:28 ` Qu Wenruo
  2016-09-21  7:35   ` Tomasz Torcz
@ 2016-09-21 15:02   ` Chris Murphy
  1 sibling, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2016-09-21 15:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Goffredo Baroncelli, linux-btrfs

On Wed, Sep 21, 2016 at 1:28 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Hi,
>
> For this well-known bug, is there any one fixing it?
>
> It can't be more frustrating finding some one has already worked on it after
> spending days digging.
>
> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to implement
> btrfsck scrub support, at least we can use btrfsck to fix bad stripes before
> kernel fix.

Well, the kernel will fix it if the user just scrubs again. The problem
is the user doesn't know their file system might have bad parity. So I
don't know how implementing an optional check in btrfsck helps. If it's
non-optional, that means reading 100% of the volume, not just metadata,
which is not workable for btrfsck. The user just needs to do another
scrub if they suspect they have been hit by this, and if they get no
errors they're OK. If they get an error that something is being fixed,
they might have to do a 2nd scrub to avoid this bug - but I'm not sure
whether there's any different error message between a non-parity strip
being fixed and a parity strip being replaced.
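
(For reference, a minimal sketch of that double-scrub workaround - the mount
point below is illustrative:)

    # foreground scrub, then a second pass to repair any parity the first
    # pass may have left stale
    sudo btrfs scrub start -B /mnt
    sudo btrfs scrub start -B /mnt
    # per-device error counters, to see whether anything was actually fixed
    sudo btrfs dev stats /mnt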

The central thing in this bug is that it requires a degraded full
stripe [1] to already exist. That is, a non-parity strip [2] is already
corrupt. What this bug does is fix that strip from good parity,
but then wrongly recompute parity for some reason and write bad
parity to disk. So it shifts the "degradedness" of the full stripe
from non-parity to parity. There's no actual additional loss of
redundancy; it's just that the messages will say a problem was found
and fixed, which is not entirely true. Non-parity data is fixed, but
now parity is wrong, silently. There is no consequence of this unless
it's raid5 and there's another strip loss in that same stripe.
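
(As a toy sketch of that "shift", with one byte per strip and purely made-up
values - the "copy of D2 ends up on the parity disk" outcome is the one
Goffredo reported:)

    # healthy 3-device stripe: P = D1 XOR D2
    d1=$(( 0xA5 )); d2=$(( 0x3C )); p=$(( d1 ^ d2 ))
    d1_on_disk=$(( 0x00 ))            # D1 strip silently corrupted
    [ "$d1_on_disk" -ne "$d1" ] && echo "csum mismatch on D1"
    d1_fixed=$(( d2 ^ p ))            # scrub rebuilds D1 from D2 and parity
    p_written=$d2                     # the bug: D2 copied over P instead of D1 XOR D2
    printf 'D1 fixed: 0x%02X  P written: 0x%02X  P expected: 0x%02X\n' \
           "$d1_fixed" "$p_written" "$p"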

Uncertain if the bug happens with raid6, or if raid6 extra redundancy
has just masked the problem. Uncertain if the bug happens with
balance, or passively with normal reads. Only scrub has been tested
and it's non-deterministic, maybe happens 1 in 3 or 4 attempts.


[1][2]
I'm using SNIA terms. Strip = stripe element = mdadm chunk = the 64KiB
per device block. Stripe = full stripe = data strips + parity strip
(or 2 strips for raid6).


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-21  9:15     ` Qu Wenruo
@ 2016-09-21 15:13       ` Chris Murphy
  2016-09-22  2:08         ` Qu Wenruo
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2016-09-21 15:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Tomasz Torcz, linux-btrfs

On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
>>
>> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>>>
>>> Hi,
>>>
>>> For this well-known bug, is there any one fixing it?
>>>
>>> It can't be more frustrating finding some one has already worked on it
>>> after
>>> spending days digging.
>>>
>>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>>> implement
>>> btrfsck scrub support, at least we can use btrfsck to fix bad stripes
>>> before
>>> kernel fix.
>>
>>
>>   Why wouldn't you fix in-kernel code?  Why implement duplicate
>> functionality
>> when you can fix the root cause?
>>
> We'll fix in-kernel code.
>
> Fsck one is not duplicate, we need a better standard thing to compare with
> kernel behavior.
>
> Just like qgroup fix in btrfsck, if kernel can't handle something well, we
> do need to fix kernel, but a good off-line fixer won't hurt.
> (Btrfs-progs is much easier to implement, and get fast review/merge cycle,
> and it can help us to find better solution before screwing kernel up again)

I understand some things should go in fsck for comparison. But in this
case I don't see how it can help. Parity is not checksummed. The only
way to know whether it's wrong is to read all of the data strips, compute
parity, and compare the in-memory parity from the current read to the
on-disk parity. It takes a long time, and at least scrub is online, where
btrfsck scrub is not. There is already an offline scrub in btrfs
check which doesn't repair, but I don't know whether it checks parity.

       --check-data-csum
           verify checksums of data blocks

           This expects that the filesystem is otherwise OK, so this is
           basically an offline scrub but does not repair data from spare copies.
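
(As a hedged usage sketch - btrfs check has to run on an unmounted filesystem,
and the device name below is just the one from Goffredo's loop-device test:)

    sudo umount mnt/
    sudo btrfs check --check-data-csum /dev/loop0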

Is it possible to put parities into their own tree? They'd be
checksummed there. Somehow I think the long-term approach is that
partial stripe writes, which apparently are overwrites and not CoW,
need to go away. In particular I wonder what the metadata raid56 write
pattern is: whether it usually means a lot of full-stripe CoW writes, or
whether there are many small metadata RMW changes that make them partial
stripe writes and not CoW, and thus not safe.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-21 15:13       ` Chris Murphy
@ 2016-09-22  2:08         ` Qu Wenruo
  2016-09-22  2:44           ` Chris Murphy
  2016-09-22  3:07           ` Christoph Anton Mitterer
  0 siblings, 2 replies; 33+ messages in thread
From: Qu Wenruo @ 2016-09-22  2:08 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Tomasz Torcz, linux-btrfs



At 09/21/2016 11:13 PM, Chris Murphy wrote:
> On Wed, Sep 21, 2016 at 3:15 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 09/21/2016 03:35 PM, Tomasz Torcz wrote:
>>>
>>> On Wed, Sep 21, 2016 at 03:28:25PM +0800, Qu Wenruo wrote:
>>>>
>>>> Hi,
>>>>
>>>> For this well-known bug, is there any one fixing it?
>>>>
>>>> It can't be more frustrating finding some one has already worked on it
>>>> after
>>>> spending days digging.
>>>>
>>>> BTW, since kernel scrub is somewhat scrap for raid5/6, I'd like to
>>>> implement
>>>> btrfsck scrub support, at least we can use btrfsck to fix bad stripes
>>>> before
>>>> kernel fix.
>>>
>>>
>>>   Why wouldn't you fix in-kernel code?  Why implement duplicate
>>> functionality
>>> when you can fix the root cause?
>>>
>> We'll fix in-kernel code.
>>
>> Fsck one is not duplicate, we need a better standard thing to compare with
>> kernel behavior.
>>
>> Just like qgroup fix in btrfsck, if kernel can't handle something well, we
>> do need to fix kernel, but a good off-line fixer won't hurt.
>> (Btrfs-progs is much easier to implement, and get fast review/merge cycle,
>> and it can help us to find better solution before screwing kernel up again)
>
> I understand some things should go in fsck for comparison. But in this
> case I don't see how it can help. Parity is not checksummed. The only
> way to know if it's wrong is to read all of the data strips, compute
> parity, and compare in-memory parity from current read to on-disk
> parity.

That's what we plan to do.
And I don't see the need to csum the parity.
Why csum a csum again?

> It takes a long time, and at least scrub is online, where
> btrfsck scrub is not.

At least btrfsck scrub will work and is easier to implement, while the 
kernel scrub currently doesn't work reliably.

The more important thing is that we can forget all about the complicated 
concurrency of an online scrub and focus on the implementation itself in 
user space, which is easier to implement and easier to maintain.

> There is already an offline scrub in btrfs
> check which doesn't repair, but also I don't know if it checks parity.
>
>        --check-data-csum
>            verify checksums of data blocks

Just as you expected, it doesn't check parity.
Even for RAID1/DUP, it won't check the backup copy if it succeeded in 
reading the first stripe.

The current implementation doesn't really care whether it's the data or 
the copy that is corrupted; as long as the data can be read out, there is 
no problem. The same thing applies to tree blocks.

So the ability to check every stripe/copy is still quite needed for that 
option.

And that's what I'm planning to enhance: make --check-data-csum equivalent 
to the kernel scrub.

>
>            This expects that the filesystem is otherwise OK, so this
> is basically and
>            offline scrub but does not repair data from spare coipes.

Repair can be implemented, but it may just rewrite the same data into the 
same place.
If that's a bad block, then it can't repair any further unless we can 
relocate the extent to another place.

>
> Is it possible to put parities into their own tree? They'd be
> checksummed there.

Personally speaking, this seems like quite a bad idea to me.
I prefer to keep different logical layers in their own code,
not mix them together.

Block-level things belong at the block level (RAID/chunk), logical things 
at the logical level (tree blocks).

The current btrfs csum design is already much, much better than pure RAID.
Just think of RAID1: when one copy is corrupted, which copy is the 
correct one?

Thanks,
Qu

> Somehow I think the long term approach is that
> partial stripe writes, which apparently are overwrites and not CoW,
> need to go away. In particular I wonder what the metadata raid56 write
> pattern is, if this usually means a lot of full stripe CoW writes, or
> if there are many small metadata RMW changes that makes them partial
> stripe writes and not CoW and thus not safe.
>
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-22  2:08         ` Qu Wenruo
@ 2016-09-22  2:44           ` Chris Murphy
  2016-09-22  3:00             ` Qu Wenruo
  2016-09-22  3:07           ` Christoph Anton Mitterer
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Murphy @ 2016-09-22  2:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Tomasz Torcz, linux-btrfs

On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> At 09/21/2016 11:13 PM, Chris Murphy wrote:

>> I understand some things should go in fsck for comparison. But in this
>> case I don't see how it can help. Parity is not checksummed. The only
>> way to know if it's wrong is to read all of the data strips, compute
>> parity, and compare in-memory parity from current read to on-disk
>> parity.
>
>
> That's what we plan to do.
> And I don't see the necessary to csum the parity.
> Why csum a csum again?

parity!=csum


http://www.spinics.net/lists/linux-btrfs/msg56602.html



>> There is already an offline scrub in btrfs
>> check which doesn't repair, but also I don't know if it checks parity.
>>
>>        --check-data-csum
>>            verify checksums of data blocks
>
>
> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded reading the
> first stripe.

Both copies are not scrubbed? Oh hell...

[chris@f24s ~]$ sudo btrfs scrub status /brick2
scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
    scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
    total bytes scrubbed: 955.93GiB with 0 errors

How can this possibly correctly say 956GiB scrubbed if it has not
checked both copies? That message is saying *all* the data, both
copies, were scrubbed. You're saying that message is wrong? It only
scrubbed half that amount?

[chris@f24s ~]$ sudo btrfs fi df /brick2
Data, RAID1: total=478.00GiB, used=477.11GiB
System, RAID1: total=32.00MiB, used=96.00KiB
Metadata, RAID1: total=2.00GiB, used=877.59MiB
GlobalReserve, single: total=304.00MiB, used=0.00B


When that scrub was happening, both drives were being accessed at 100%
throughput.
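
(A hedged aside: "btrfs scrub start -Bd" prints separate per-device
statistics, which at least shows how much each device reports as scrubbed:)

    sudo btrfs scrub start -Bd /brick2
    sudo btrfs dev stats /brick2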







>
> Current implement doesn't really care if it's the data or the copy
> corrupted, any data can be read out, then there is no problem.
> The same thing applies to tree blocks.
>
> So the ability to check every stripe/copy is still quite needed for that
> option.
>
> And that's what I'm planning to enhance, make --check-data-csum to kernel
> scrub equivalent.

OK thanks.


>
>>
>>            This expects that the filesystem is otherwise OK, so this
>> is basically and
>>            offline scrub but does not repair data from spare coipes.
>
>
> Repair can be implemented, but maybe just rewrite the same data into the
> same place.
> If that's a bad block, then it can't repair further more unless we can
> relocate extent to other place.

Any device that's out of reserve sectors and can no longer remap LBAs
on its own is a drive that needs to be decommissioned. It's only in the
last year or so that mdadm gained a badblocks map so it can do what the
drive won't do, but I'm personally not a fan of keeping malfunctioning
drives in a RAID.


>
>>
>> Is it possible to put parities into their own tree? They'd be
>> checksummed there.
>
>
> Personally speaking, this is quite a bad idea to me.
> I prefer to separate different logical layers into their own codes.
> Not mixing them together.
>
> Block level things to block level(RAID/Chunk), logical thing to logical
> level(tree blocks).

OK.

>
> Current btrfs csum design is already much much better than pure RAID.
> Just think of RAID1, while one copy is corrupted, then which copy is correct
> then?

Yes.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-22  2:44           ` Chris Murphy
@ 2016-09-22  3:00             ` Qu Wenruo
  2016-09-22  3:12               ` Chris Murphy
  0 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2016-09-22  3:00 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Tomasz Torcz, linux-btrfs



At 09/22/2016 10:44 AM, Chris Murphy wrote:
> On Wed, Sep 21, 2016 at 8:08 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 09/21/2016 11:13 PM, Chris Murphy wrote:
>
>>> I understand some things should go in fsck for comparison. But in this
>>> case I don't see how it can help. Parity is not checksummed. The only
>>> way to know if it's wrong is to read all of the data strips, compute
>>> parity, and compare in-memory parity from current read to on-disk
>>> parity.
>>
>>
>> That's what we plan to do.
>> And I don't see the necessary to csum the parity.
>> Why csum a csum again?
>
> parity!=csum
>
>
> http://www.spinics.net/lists/linux-btrfs/msg56602.html

I know; it's more than a csum, as a normal csum will only tell you whether 
data is corrupted, while parity can be used to recover data.

But parity needs enough intact data stripes to recover data.
It sits between a full copy (RAID1) and a pure csum (hash).

But a csum for parity is still not worth it, and it would tangle up the 
whole block (chunk) layer and the logical layer.

>
>
>
>>> There is already an offline scrub in btrfs
>>> check which doesn't repair, but also I don't know if it checks parity.
>>>
>>>        --check-data-csum
>>>            verify checksums of data blocks
>>
>>
>> Just as you expected, it doesn't check parity.
>> Even for RAID1/DUP, it won't check the backup if it succeeded reading the
>> first stripe.
>
> Both copies are not scrubbed? Oh hell...

I was replying to the "--check-data-csum" option of btrfsck.

I mean, for --check-data-csum, it doesn't read the backup copy if the first 
copy can be read out without error.

And if the first copy is wrong, btrfsck will read the backup and output an 
error about the wrong csum, but it won't return an error.


>
> [chris@f24s ~]$ sudo btrfs scrub status /brick2
> scrub status for 7fea4120-9581-43cb-ab07-6631757c0b55
>     scrub started at Tue Sep 20 12:16:18 2016 and finished after 01:46:58
>     total bytes scrubbed: 955.93GiB with 0 errors
>
> How can this possibly correctly say 956GiB scrubbed if it has not
> checked both copies? That message is saying *all* the data, both
> copies, were scrubbed. You're saying that message is wrong? It only
> scrubbed half that amount?
>
> [chris@f24s ~]$ sudo btrfs fi df /brick2
> Data, RAID1: total=478.00GiB, used=477.11GiB
> System, RAID1: total=32.00MiB, used=96.00KiB
> Metadata, RAID1: total=2.00GiB, used=877.59MiB
> GlobalReserve, single: total=304.00MiB, used=0.00B
>
>
> When that scrub was happening, both drives were being accessed at 100%
> throughput.
>
>
>
>
>
>
>
>>
>> Current implement doesn't really care if it's the data or the copy
>> corrupted, any data can be read out, then there is no problem.
>> The same thing applies to tree blocks.
>>
>> So the ability to check every stripe/copy is still quite needed for that
>> option.
>>
>> And that's what I'm planning to enhance, make --check-data-csum to kernel
>> scrub equivalent.
>
> OK thanks.
>
>
>>
>>>
>>>            This expects that the filesystem is otherwise OK, so this
>>> is basically and
>>>            offline scrub but does not repair data from spare coipes.
>>
>>
>> Repair can be implemented, but maybe just rewrite the same data into the
>> same place.
>> If that's a bad block, then it can't repair further more unless we can
>> relocate extent to other place.
>
> Any device that's out of reserve sectors and can no longer remap LBA's
> on its own, is a drive that needs to be decommissioned. It's a new
> feature in just the last year or so that mdadm has a badblocks map so
> it can do what the drive won't do, but I'm personally not a fan of
> keeping malfunctioning drives in RAID.
>
>
>>
>>>
>>> Is it possible to put parities into their own tree? They'd be
>>> checksummed there.
>>
>>
>> Personally speaking, this is quite a bad idea to me.
>> I prefer to separate different logical layers into their own codes.
>> Not mixing them together.
>>
>> Block level things to block level(RAID/Chunk), logical thing to logical
>> level(tree blocks).
>
> OK.
>
>>
>> Current btrfs csum design is already much much better than pure RAID.
>> Just think of RAID1, while one copy is corrupted, then which copy is correct
>> then?
>
> Yes.
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-22  2:08         ` Qu Wenruo
  2016-09-22  2:44           ` Chris Murphy
@ 2016-09-22  3:07           ` Christoph Anton Mitterer
  2016-09-22  3:18             ` Qu Wenruo
  1 sibling, 1 reply; 33+ messages in thread
From: Christoph Anton Mitterer @ 2016-09-22  3:07 UTC (permalink / raw)
  To: Qu Wenruo, Chris Murphy; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:
> And I don't see the necessary to csum the parity.
> Why csum a csum again?

I'd say simply for the following reason:
Imagine the smallest RAID5: 2x data D1 D2, 1x parity P.
If D2 is lost, it can be recalculated via D1 and P.

What if only (all) the checksum information for D2 is lost (e.g.
because of further silent data corruption on the blocks holding these
csums)?
Then we'd only know D1 is valid (which still has working csums). But we
wouldn't know whether D2 is (because its csums are gone), nor whether P is
(because it is not csummed).
And next imagine silent data corruption in either D2 or P => we cannot
tell which of them is valid, no repair possible... or do I miss
something?


> Just as you expected, it doesn't check parity.
> Even for RAID1/DUP, it won't check the backup if it succeeded
> reading 
> the first stripe.
That would IMO be a really severe bug... making scrubbing close
to completely useless for a multi-device fs.
I mean, the whole reason for doing it is to find [silently] corrupted
blocks... in order to be able to do something about it.
If one would only notice when the read actually fails because the one
working block is also gone... then why have RAID in the first place?

> Current implement doesn't really care if it's the data or the copy 
> corrupted, any data can be read out, then there is no problem.
Except it makes RAID practically useless... => problem


Cheers,
Chris.



[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-22  3:00             ` Qu Wenruo
@ 2016-09-22  3:12               ` Chris Murphy
  0 siblings, 0 replies; 33+ messages in thread
From: Chris Murphy @ 2016-09-22  3:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Tomasz Torcz, linux-btrfs

On Wed, Sep 21, 2016 at 9:00 PM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:

>> Both copies are not scrubbed? Oh hell...
>
>
> I was replying to the "--check-data-csum" of btrfsck.
>
> I mean, for --check-data-csum, it doesn't read the backup if the first data
> can be read out without error.
>
> And if the first data is wrong, btrfsck will read backup, and output error
> about wrong csum, but won't return error.

OK I'm convinced both progs and kernel scrubs have problems. I guess
it makes sense to fix progs first, and make sure it's fixed there,
then get it fixed in the kernel.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-09-22  3:07           ` Christoph Anton Mitterer
@ 2016-09-22  3:18             ` Qu Wenruo
  0 siblings, 0 replies; 33+ messages in thread
From: Qu Wenruo @ 2016-09-22  3:18 UTC (permalink / raw)
  To: Christoph Anton Mitterer, Chris Murphy; +Cc: linux-btrfs



At 09/22/2016 11:07 AM, Christoph Anton Mitterer wrote:
> On Thu, 2016-09-22 at 10:08 +0800, Qu Wenruo wrote:
>> And I don't see the necessary to csum the parity.
>> Why csum a csum again?
>
> I'd say simply for the following reason:
> Imagine the smallest RAID5: 2x data D1 D2, 1x parity P
> If D2 is lost it could be recalculated via D1 and P.
>
> What if only (all) the checksum information for D2 is lost (e.g.
> because of further silent data corruption on the blocks of these
> csums)?

First things first: csums are metadata, which is normally protected by 
the metadata profile.

So you are saying the csum tree is corrupted.
Wow, that is protected by RAID5 (assuming you're using RAID5 for both 
metadata and data) and by the tree-leaf csum.
If that happens, it means at least 2 devices are already corrupted.

Your data is doomed anyway.

> Then we'd only know D1 is valid (which still has working csums).

If that really happens, the csum tree leaf is corrupted, and your D1 csum 
will be corrupted too, considering how close they are.

There is no way to recover anyway.

> But we
> wouldn't know whether D2 is (because gone) neither whether P is
> (because not csummed).
> And next imagine silent data corruption in either D2 or P => we cannot
> tell which of them is valid, no repair possible... or do I miss
> something?

As stated above, your fs is already screwed up in that case.

The biggest problem is that parity is a block-level thing (its bytenr is a 
device bytenr, not a logical bytenr), while csum is a logical-level thing.

Mixing these layers up would cause more problems and more maintenance 
effort than the benefit is worth.

>
>
>> Just as you expected, it doesn't check parity.
>> Even for RAID1/DUP, it won't check the backup if it succeeded
>> reading
>> the first stripe.
> That would IMO be really a highly severe bug... making scrubbing close
> to completely useless for multi-device fs.
> I mean the whole reason for doing it is to find [silently] corrupted
> blocks... in order to be able to do something about it.
> If on would only notice if the read actually fails because the one
> working block is also gone.. then why having a RAID in the first place?

That's "--check-data-csum", not in-kernel scrub.

Thanks,
Qu

>
>> Current implement doesn't really care if it's the data or the copy
>> corrupted, any data can be read out, then there is no problem.
> Except it makes RAID practically useless... => problem
>
>
> Cheers,
> Chris.
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-06-25 12:21 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Goffredo Baroncelli
  2016-06-25 17:25 ` Chris Murphy
  2016-09-21  7:28 ` Qu Wenruo
@ 2016-11-04  2:10 ` Qu Wenruo
  2016-11-05  7:23   ` Goffredo Baroncelli
  2 siblings, 1 reply; 33+ messages in thread
From: Qu Wenruo @ 2016-11-04  2:10 UTC (permalink / raw)
  To: kreijack, linux-btrfs



At 06/25/2016 08:21 PM, Goffredo Baroncelli wrote:
> Hi all,
>
> following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first find where a file was stored, and then I tried to corrupt the data disks (when unmounted) or the parity disk.
> The result showed that sometime the kernel recomputed the parity wrongly.
>
> I tested the following kernel
> - 4.6.1
> - 4.5.4
> and both showed the same behavior.
>
> The test was performed as described below:
>
> 1) create a filesystem in raid5 mode (for data and metadata) of 1500MB
>
> 	truncate -s 500M disk1.img; losetup -f disk1.img
> 	truncate -s 500M disk2.img; losetup -f disk2.img
> 	truncate -s 500M disk3.img; losetup -f disk3.img
> 	sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
> 	sudo mount /dev/loop0 mnt/
>
> 2) I created a file with a length of 128kb:
>
> 	python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
> 	sudo umount mnt/
>
> 3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to find where the file stripe is located:
>
> /dev/loop0: offset=81788928+16*4096        (64k, second half of the file: 'bdbbbb.....)
> /dev/loop1: offset=61865984+16*4096        (64k, first half of the file: 'adaaaa.....)
> /dev/loop2: offset=61865984+16*4096        (64k, parity: '\x03\x00\x03\x03\x03.....)
>
> 4) I tried to corrupt each disk (one disk per test), and then run a scrub:
>
> for example for the disk /dev/loop2:
> 	sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
>         	        seek=$((61865984+16*4096)) count=5
> 	sudo mount /dev/loop0 mnt
> 	sudo btrfs scrub start mnt/.

I reproduced your problem and found that it seems to be a race.

The problem I found is mostly the same as yours, but with some more details:

The script is like:
-------
#!/bin/bash

dev1=/dev/vdb6
dev2=/dev/vdb7
dev3=/dev/vdb8
mnt=/mnt/test

umount $mnt
# First full stripe layout is
# dev1: Parity, dev2: Data Stripe1 dev3: Data Stripe2
mkfs.btrfs $dev1 $dev2 $dev3 -f -m raid5 -d raid5
mount $dev1 $mnt -o nospace_cache
xfs_io -f -c "pwrite 0 128k" $mnt/file1
sync
umount $mnt

dmesg -C
# corrupt data stripe 1 (on dev2)
dd if=/dev/urandom of=$dev2 bs=1 count=64k seek=546308096


# btrfs-progs scrub
# Newly introduced function, offline scrub, quite handy here
btrfs check --scrub $dev1

# kernel scrub
mount $dev1 $mnt -o nospace_cache
btrfs scrub start -B $mnt

# Feel free to umount and call offline scrub
# umount $mnt
# btrfs check --scrub $dev1
-------

The results fall into the following types:

1) Kernel scrub reports 16 recoverable csum errors
    Same as offline scrub. All correct.

2) Kernel scrub reports 17 or more recoverable csum errors
    Even though we only corrupted 16 pages, the extra csum errors come out
    of almost nowhere

3) Kernel scrub reports some unrecoverable csum errors
    Totally insane, but quite low probability.

I dug a little further into case 2) and found:
a) The kernel is scrubbing the correct range
    So the extra csum errors are not caused by checking the wrong logical
    bytenr range

b) The kernel is scrubbing some pages twice
    That's the cause of the problem.


And unlike what most of us assume, scrubbing the full fs is in fact a very racy thing.
Scrubbing the full fs is split into N scrubbing ioctls, one for each device.

So for the above script, the kernel is doing *3* scrub jobs.
For other profiles this may not be a problem, but for RAID5/6 a race can 
happen easily, like:

Scrub dev1(P)           |  Scrub dev2(D1)        | Scrub dev3(D2)
---------------------------------------------------------------
Read out full stripe    | Read out D1            | Read out D2
                         | Check Csum for D1      | Check Csum for D2
                         | Csum mismatch (err++)  | Csum matches
Cal parity              | Read out full stripe   |
Parity mismatch         | Do recovery            |
Check full stripe       |
D1 csum mismatch (err++)|

So csum mismatch can be counted twice.

And since scrubbing a corrupted data stripe can easily race with 
scrubbing its parity, if the timing lands in a worse situation it can 
lead to an unrecoverable csum error.

On the other hand, if you scrub the damaged device only, no race 
will happen, so cases 2) and 3) will just disappear.

Would you please try to scrub only one device at a time?
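
(A minimal sketch of that, assuming the same three loop devices as in the
original test - scrubbing the devices one after another keeps the per-device
scrub workers from racing:)

    for dev in /dev/loop0 /dev/loop1 /dev/loop2; do
        sudo btrfs scrub start -B "$dev"
    done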

>
> 5) I check the disks at the offsets above, to verify that the data/parity is correct

You could try the new offline scrub; it can save you a lot of time in 
finding data/parity corruption.
https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

And of course it is more reliable than the kernel scrub (single thread, no 
extra IO, no race) :)
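
(A hedged sketch of building and running that offline scrub, assuming the
fsck_scrub branch builds like mainline btrfs-progs; the device and mount
point are the ones from the script above:)

    git clone -b fsck_scrub https://github.com/adam900710/btrfs-progs.git
    cd btrfs-progs && ./autogen.sh && ./configure && make
    umount /mnt/test
    ./btrfs check --scrub /dev/vdb6    # offline: the filesystem must stay unmounted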

Thanks,
Qu

>
> However I found that:
> 1) if I corrupt the parity disk (/dev/loop2), scrub don't find any corruption, but recomputed the parity (always correctly);
>
> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the corruption. But I found two main behaviors:
>
> 2.a) the kernel repaired the damage, but compute the wrong parity. Where it was the parity, the kernel copied the data of the second disk on the parity disk
>
> 2.b) the kernel repaired the damage, and rebuild a correct parity
>
> I have to point out another strange thing: in dmesg I found two kinds of messages:
>
> msg1)
> 	[....]
> 	[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
> 	[ 1021.366949] BTRFS: has skinny extents
> 	[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> 	[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
> msg2)
> 	[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
> 	[ 1017.435074] BTRFS: has skinny extents
> 	[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> 	[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, 	root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
> 	[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
> 	[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
> 	[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
> 	[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>
>
> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>
> Another strangeness is that SCRUB sometime reports
>  ERROR: there are uncorrectable errors
> and sometime reports
>  WARNING: errors detected during scrubbing, corrected
>
> but also these seems UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2
>
>
> Enclosed you can find the script which I used to trigger the bug. I have to rerun it several times to show the problem because it doesn't happen every time. Pay attention that the offset and the loop device name are hard coded. You must run the script in the same directory where it is: eg "bash test.sh".
>
> Br
> G.Baroncelli
>
>
>
>



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-11-04  2:10 ` Qu Wenruo
@ 2016-11-05  7:23   ` Goffredo Baroncelli
  0 siblings, 0 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2016-11-05  7:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Resent because I don't see it in ml
----------------------------
Hi Qu,

On 2016-11-04 03:10, Qu Wenruo wrote:
[...]
> 
> I reproduced your problem and find that seems to be a problem of race.
[...]
[...]> 
> I digged a little further into the case 2) and found:
> a) Kernel is scrubbing correct range
>    So the extra csum error is not caused by checking wrong logical
>    bytenr range
> 
> b) Kernel is scrubbing some pages twice
>    That's the cause of the problem.

Due to time constraints, I was unable to dig further into this issue. I tried to add some printk()s to check whether the code processes the data correctly; what I found is:

1) the code seems to process the right data
2) the code seems to produce the correct data (i.e. the code was able to rebuild the correct data on the basis of the checksums and the parity)
3) like you, I found that the code processed the same data twice

Unfortunately I was unable to figure out why the code failed to write the right data on the platter. Following the write path through the several bio handlers was a job beyond my capabilities (and my time :-) )

> 
> 
> And unlike most of us assume, in fact scrub full fs is a very racy thing.

On the basis of your (and my) observation that the code seems to process the same data multiple times, this may be the explanation.

> Scrubbing full fs is split into N scrubbing ioctls for each device.
> 
> So for above script, kernel is doing *3* scrubbing work.
> For other profile it may not be a problem, but for RAID5/6 race can happen easily like:
> 
> Scrub dev1(P)           |  Scrub dev2(D1)        | Scrub dev3(D2)
> ---------------------------------------------------------------
> Read out full stripe    | Read out D1            | Read out D2
>                         | Check Csum for D1      | Check Csum for D2
>                         | Csum mismatch (err++)  | Csum matches
> Cal parity              | Read out full stripe   |
> Parity mismatch         | Do recovery            |
> Check full stripe       |
> D1 csum mismatch (err++)|
> 
> So csum mismatch can be counted twice.
> 
> And since scrubbing for corrupted data stripe can easily race with
> scrubbing for parity, if timing happens in a worse situation, it can
> lead to unrecoverable csum error.

Interesting, I wasn't aware that the scrub is done in parallel on the different disks. This explains a lot of the strangeness....




 
> On the other hand, if you only scrub the damaged device only, no race
> will happen so case 2) 3) will just disappear.

> 
> Would you please try to only scrub one device one time?

I did it and I can confirm your hypothesis: if I run the scrub one disk at a time, I am unable to reproduce the corruption. Instead, if I run the scrub in parallel on all the disks, I sometimes get a corruption: on average I got from 1 to 3 failures every 6 tests.

So the scrub strategy must be different in the case of a RAID5/6 chunk: the parallel scrub must be avoided, and the scrub must be performed on a per-stripe basis.


> 
>>
>> 5) I check the disks at the offsets above, to verify that the data/parity is correct
> 
> You could try the new offline scrub, it can save you a lot of time to find data/parity corruption.
> https://github.com/adam900710/btrfs-progs/tree/fsck_scrub

I will try it

BR
G.Baroncelli


> 
> And of course, more reliable than kernel scrub (single thread, no extra IO no race) :)
> 
> Thanks,
> Qu
> 
>>
>> However I found that:
>> 1) if I corrupt the parity disk (/dev/loop2), scrub don't find any corruption, but recomputed the parity (always correctly);
>>
>> 2) when I corrupted the other disks (/dev/loop[01]) btrfs was able to find the corruption. But I found two main behaviors:
>>
>> 2.a) the kernel repaired the damage, but compute the wrong parity. Where it was the parity, the kernel copied the data of the second disk on the parity disk
>>
>> 2.b) the kernel repaired the damage, and rebuild a correct parity
>>
>> I have to point out another strange thing: in dmesg I found two kinds of messages:
>>
>> msg1)
>>     [....]
>>     [ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
>>     [ 1021.366949] BTRFS: has skinny extents
>>     [ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
>>     [ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>     [ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>>
>> msg2)
>>     [ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
>>     [ 1017.435074] BTRFS: has skinny extents
>>     [ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
>>     [ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872,     root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
>>     [ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
>>     [ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
>>     [ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
>>     [ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
>>     [ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0
>>
>>
>> but these seem to be UNrelated to the kernel behavior 2.a) or 2.b)
>>
>> Another strangeness is that SCRUB sometime reports
>>  ERROR: there are uncorrectable errors
>> and sometime reports
>>  WARNING: errors detected during scrubbing, corrected
>>
>> but also these seems UNrelated to the behavior 2.a) or 2.b) or msg1 or msg2
>>
>>
>> Enclosed you can find the script which I used to trigger the bug. I have to rerun it several times to show the problem because it doesn't happen every time. Pay attention that the offset and the loop device name are hard coded. You must run the script in the same directory where it is: eg "bash test.sh".
>>
>> Br
>> G.Baroncelli
>>
>>
>>
>>
> 
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
@ 2016-08-19 13:17 Philip Espunkt
  0 siblings, 0 replies; 33+ messages in thread
From: Philip Espunkt @ 2016-08-19 13:17 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>
>> most of the time, it seems that btrfs-raid5 is not capable to
>> rebuild parity and data. Worse the message returned by scrub is
>> incoherent by the status on the disk. The tests didn't fail every
>> time; this complicate the diagnosis. However my script fails most
>> of the time.

Have you opened a bug ticket at http://bugzilla.kernel.org/?
I think that would be useful for tracking.


Philip

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-07-16 15:51 ` [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Jarkko Lavinen
  2016-07-17 19:46   ` Jarkko Lavinen
@ 2016-07-18 18:56   ` Goffredo Baroncelli
  1 sibling, 0 replies; 33+ messages in thread
From: Goffredo Baroncelli @ 2016-07-18 18:56 UTC (permalink / raw)
  To: Jarkko Lavinen, linux-btrfs

Hi

On 2016-07-16 17:51, Jarkko Lavinen wrote:
> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>> Using "btrfs insp phy" I developed a script to trigger the bug.
> 
> Thank you for the script and all for sharing the raid5 and scrubbing
> issues. I have been using two raid5 arrays and ran scrub occasionally
> without any problems lately and been in false confidence. I converted
> successfully raid5 arrays into raid10 without any glitch.
> 
> I tried to modify the shell script so that instead of corrupting data
> with dd, a simulated bad block is created with device mapper. Modern
> disks are likely to either return the correct data or an error if
> they cannot.


You are right; but doing so we complicate the test case further:
- my tests show what happens when there is a corruption but the drive behaves well
- your tests show what happens when there is a corruption AND the drive has a failure

I agree that your simulation is more realistic, but I fear that it complicates finding the bug.

> 
> The modified script behaves very much like the original dd version.
> With dd version I see wrong data instead of expected data. 

When you say "I see wrong data", do you mean with
1) "cat mnt/out.txt"
or 2) "dd if=/dev/loop....."?

In the first case I always see good data; in the second case I see wrong data, but of course no read error.

> With simulated bad block I see no data at all instead of expected data
> since dd quits on read error.
> 
> Jarkko Lavinen
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-07-16 15:51 ` [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Jarkko Lavinen
@ 2016-07-17 19:46   ` Jarkko Lavinen
  2016-07-18 18:56   ` Goffredo Baroncelli
  1 sibling, 0 replies; 33+ messages in thread
From: Jarkko Lavinen @ 2016-07-17 19:46 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 5491 bytes --]

On Sat, Jul 16, 2016 at 06:51:11PM +0300, Jarkko Lavinen wrote:
>  The modified script behaves very much like the original dd version.

Not quite. The bad sector simulation works like old hard drives without error correction and bad block remapping. This changes the error behaviour.

My script now prints kernel messages once check_fs fails. The time range of the messages is from the adding of the bad-sector device to the point when check_fs fails.

The parity test, which often passes with Goffredo's script, always fails with my bad-sector version, and scrub says the error is uncorrectable. In the kernel messages there are two buffer I/O read errors but no write error, as if scrub quits before writing?

In the data2 test, scrub again says the error is uncorrectable, but according to the kernel messages the bad sector is read 4 times and written twice during the scrub. With my bad-sector script, data2 is still corrupted and the parity is OK, since the bad sector cannot be written and scrub likely quits earlier than in Goffredo's script. In his script data2 gets fixed but the parity gets corrupted.

Jarkko Lavinen

$ bash h2.sh
--- test 1: corrupt parity
scrub started on mnt/., fsid 2625e2d0-420c-40b6-befa-97fc18eaed48 (pid=32490)
ERROR: there are uncorrectable errors
******* Wrong data on disk:off /dev/mapper/loop0:61931520 (parity)
Data read ||, expected |0300 0303|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-0, logical block 15120, async page read
Scrub started
Second Check_fs started
Buffer I/O error on dev dm-0, logical block 15120, async page read

--- test 2: corrupt data2
scrub started on mnt/., fsid 8e506268-16c7-48fa-b176-0a8877f2a7aa (pid=434)
ERROR: there are uncorrectable errors
******* Wrong data on disk:off /dev/mapper/loop2:81854464 (data2)
Data read ||, expected |bdbbb|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-2, logical block 19984, async page read
Scrub started
BTRFS warning (device dm-0): i/o error at logical 142802944 on dev /dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 1, flush 0, corrupt 0, gen 0
BTRFS warning (device dm-0): i/o error at logical 142802944 on dev /dev/mapper/loop2, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 on dev /dev/mapper/loop2
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 2, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142802944 on dev /dev/mapper/loop2
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 3, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, corrupt 0, gen 0
Second Check_fs started
BTRFS info (device dm-0): bdev /dev/mapper/loop2 errs: wr 2, rd 4, flush 0, corrupt 0, gen 0
Buffer I/O error on dev dm-2, logical block 19984, async page read

--- test 3: corrupt data1
scrub started on mnt/., fsid f8a4ecca-2475-4e5e-9651-65d9478b56fe (pid=856)
ERROR: there are uncorrectable errors
******* Wrong data on disk:off /dev/mapper/loop1:61931520 (data1)
Data read ||, expected |adaaa|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-1, logical block 15120, async page read
Scrub started
BTRFS warning (device dm-0): i/o error at logical 142737408 on dev /dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
BTRFS warning (device dm-0): i/o error at logical 142737408 on dev /dev/mapper/loop1, sector 120960, root 5, inode 257, offset 0, length 4096, links 1 (path: out.txt)
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 2, flush 0, corrupt 0, gen 0
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 on dev /dev/mapper/loop1
BTRFS error (device dm-0): unable to fixup (regular) error at logical 142737408 on dev /dev/mapper/loop1
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 3, flush 0, corrupt 0, gen 0
Second Check_fs started
BTRFS error (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
BTRFS info (device dm-0): bdev /dev/mapper/loop1 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
Buffer I/O error on dev dm-1, logical block 15120, async page read

--- test 4: corrupt data2; read without scrub
******* Wrong data on disk:off /dev/mapper/loop2:81854464 (data2)
Data read ||, expected |bdbbb|

Kernel messages in the test
First Check_fs started
Buffer I/O error on dev dm-2, logical block 19984, async page read
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Second Check_fs started
BTRFS error (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
BTRFS info (device dm-0): bdev /dev/mapper/loop2 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Buffer I/O error on dev dm-2, logical block 19984, async page read

[-- Attachment #2: h2.sh --]
[-- Type: application/x-sh, Size: 7811 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
  2016-07-12 21:50 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two Goffredo Baroncelli
@ 2016-07-16 15:51 ` Jarkko Lavinen
  2016-07-17 19:46   ` Jarkko Lavinen
  2016-07-18 18:56   ` Goffredo Baroncelli
  0 siblings, 2 replies; 33+ messages in thread
From: Jarkko Lavinen @ 2016-07-16 15:51 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 849 bytes --]

On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
> Using "btrfs insp phy" I developed a script to trigger the bug.

Thank you for the script, and thanks to all for sharing the raid5 and scrubbing issues. I have been using two raid5 arrays, ran scrub occasionally without any problems lately, and was in false confidence. I successfully converted the raid5 arrays into raid10 without any glitch.

I tried to modify the shell script so that instead of corrupting data with dd, a simulated bad block is created with device-mapper. Modern disks are likely to either return the correct data or an error if they cannot.

The modified script behaves very much like the original dd version. With the dd version I see wrong data instead of the expected data. With the simulated bad block I see no data at all instead of the expected data, since dd quits on the read error.
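
(For illustration, a minimal sketch of that kind of bad-sector setup using the
device-mapper "error" target; the device name and sector number below are
placeholders, not the values from the attached script:)

-------
#!/bin/bash
dev=/dev/loop2
size=$(blockdev --getsz "$dev")   # device size in 512-byte sectors
bad=159872                        # sector to turn into an I/O error
len=8                             # 8 sectors = one 4 KiB block
# map the whole device 1:1, except for $len sectors that always fail
dmsetup create loop2_bad <<EOF
0 $bad linear $dev 0
$bad $len error
$((bad + len)) $((size - bad - len)) linear $dev $((bad + len))
EOF
-------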

Jarkko Lavinen

[-- Attachment #2: h.sh --]
[-- Type: application/x-sh, Size: 6526 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2016-11-05  7:23 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-25 12:21 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Goffredo Baroncelli
2016-06-25 17:25 ` Chris Murphy
2016-06-25 17:58   ` Chris Murphy
2016-06-25 18:42     ` Goffredo Baroncelli
2016-06-25 22:33       ` Chris Murphy
2016-06-26  9:20         ` Goffredo Baroncelli
2016-06-26 16:43           ` Chris Murphy
2016-06-26  2:53   ` Duncan
2016-06-26 22:33     ` ronnie sahlberg
2016-06-26 22:38       ` Hugo Mills
2016-06-27  3:22         ` Steven Haigh
2016-06-27  3:21       ` Steven Haigh
2016-06-27 19:47         ` Duncan
2016-06-27  3:50       ` Christoph Anton Mitterer
2016-06-27  4:35         ` Andrei Borzenkov
2016-06-27 16:39           ` Christoph Anton Mitterer
2016-09-21  7:28 ` Qu Wenruo
2016-09-21  7:35   ` Tomasz Torcz
2016-09-21  9:15     ` Qu Wenruo
2016-09-21 15:13       ` Chris Murphy
2016-09-22  2:08         ` Qu Wenruo
2016-09-22  2:44           ` Chris Murphy
2016-09-22  3:00             ` Qu Wenruo
2016-09-22  3:12               ` Chris Murphy
2016-09-22  3:07           ` Christoph Anton Mitterer
2016-09-22  3:18             ` Qu Wenruo
2016-09-21 15:02   ` Chris Murphy
2016-11-04  2:10 ` Qu Wenruo
2016-11-05  7:23   ` Goffredo Baroncelli
2016-07-12 21:50 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two Goffredo Baroncelli
2016-07-16 15:51 ` [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Jarkko Lavinen
2016-07-17 19:46   ` Jarkko Lavinen
2016-07-18 18:56   ` Goffredo Baroncelli
2016-08-19 13:17 Philip Espunkt
