* [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two
@ 2016-07-12 21:50 Goffredo Baroncelli
  2016-07-14 21:20 ` Chris Mason
  2016-07-16 15:51 ` [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Jarkko Lavinen
  0 siblings, 2 replies; 43+ messages in thread
From: Goffredo Baroncelli @ 2016-07-12 21:50 UTC (permalink / raw)
  To: linux-btrfs

Hi All,

I developed a new btrfs command, "btrfs insp phy" [1], to further investigate this bug [2]. Using "btrfs insp phy" I wrote a script to trigger the bug. The bug is not always triggered, but most of the time it is.

Basically the script creates a raid5 filesystem (using three loop devices backed by three files called disk[123].img) and creates a file on it. Then, using "btrfs insp phy", the physical placement of the data on the devices is computed.

First the script checks that the data on disk is correct (for data1, data2 and parity), then it corrupts the data:

test1: the parity is corrupted, then scrub is run. Then the (data1, data2, parity) data on disk are checked. This test passes every time.

test2: data2 is corrupted, then scrub is run. Then the (data1, data2, parity) data on disk are checked. This test fails most of the time: the data on disk is not correct; the parity is wrong. Scrub sometimes reports "WARNING: errors detected during scrubbing, corrected" and sometimes reports "ERROR: there are uncorrectable errors", but this seems unrelated to whether the data on disk ends up correct or not.

test3: like test2, but data1 is corrupted. The results are the same as above.


test4: data2 is corrupted, then the file is read. The read doesn't return an error (the data returned seems correct), but data2 on disk is still corrupted.


Note: data1, data2 and parity are the on-disk elements of the raid5 stripe.
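
For reference, the raid5 parity element is the byte-wise XOR of the data elements, which is why the check below expects the parity to start with the bytes 03 00 03 03 03 ('a'^'b' = 0x03, 'd'^'d' = 0x00). A quick sanity check, in the same python2 one-liner style used by the script:

	python -c "print ' '.join('%02x' % (ord(x) ^ ord(y)) for x, y in zip('adaaa', 'bdbbb'))"
	# prints: 03 00 03 03 03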

Conclusion:

Most of the time, it seems that btrfs raid5 is not capable of rebuilding parity and data correctly. Worse, the message returned by scrub is inconsistent with the status on disk. The tests don't fail every time, which complicates the diagnosis, but my script fails most of the time.

BR
G.Baroncelli

----

root="$(pwd)"
disks="disk1.img disk2.img disk3.img"
imgsize=500M
BTRFS=../btrfs-progs/btrfs


#
# returns all the loopback devices
#
loop_disks() {
	sudo losetup | grep "$root" | awk '{ print $1 }'
}

#init the fs

init_fs() {
	#destroy fs
	echo umount mnt
	sudo umount mnt
	for i in $( loop_disks ); do
		echo "losetup -d $i"
		sudo losetup -d $i
	done

	for i in $disks; do
		rm $i
		truncate -s $imgsize $i
		sudo losetup -f $i
	done

	loops="$(loop_disks)"
	loop1="$(echo $loops | awk '{ print $1 }')"
	echo "loops=$loops; loop1=$loop1"

	sudo mkfs.btrfs -d raid5 -m raid5 $loops
	sudo mount $loop1 mnt/

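	# create a 128KiB file whose first 64KiB block starts with "ad" and whose
	# second 64KiB block starts with "bd", so the two raid5 data stripe
	# elements can be told apart on the raw devices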
	python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt >/dev/null

	ls -l mnt/out.txt

	sudo umount mnt
	sync; sync
}

check_fs() {

        sudo mount $loop1 mnt
        data="$(sudo $BTRFS insp phy mnt/out.txt)"
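        # the greps/awk below extract, for each stripe element (DATA, OTHER,
        # PARITY), the device node (field 3) and the physical offset (field 5)
        # from the "btrfs insp phy" output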
        
        data1_off="$(echo "$data" | grep "DATA$" | awk '{ print $5 }')"
        data2_off="$(echo "$data" | grep "OTHER$" | awk '{ print $5 }')"
        parity_off="$(echo "$data" | grep "PARITY$" | awk '{ print $5 }')"
        data1_dev="$(echo "$data" | grep "DATA$" | awk '{ print $3 }')"
        data2_dev="$(echo "$data" | grep "OTHER$" | awk '{ print $3 }')"
        parity_dev="$(echo "$data" | grep "PARITY$" | awk '{ print $3 }')"
        
        sudo umount mnt
        
	# check
	d="$(dd 2>/dev/null if=$data1_dev bs=1 skip=$data1_off count=5)"
	if [ "$d" != "adaaa" ]; then
		echo "******* Wrong data on disk:off $data1_dev:$data1_off (data1)"
		return 1
	fi

	d="$(dd 2>/dev/null if=$data2_dev bs=1 skip=$data2_off count=5)"
	if [ "$d" != "bdbbb" ]; then
		echo "******* Wrong data on disk:off $data2_dev:$data2_off (data2)"
		return 1
	fi

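	# the raid5 parity is the XOR of data1 and data2 ('a'^'b' = 0x03,
	# 'd'^'d' = 0x00), hence the expected "0300 0303" in the xxd output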
	d="$(dd 2>/dev/null if=$parity_dev bs=1 skip=$parity_off count=5 | 
                xxd | dd 2>/dev/null bs=1 count=9 skip=10)"
	if [ "x$d" != "x0300 0303" ]; then
		echo "******* Wrong data on disk:off $parity_dev:$parity_off (parity)"
		return 1
	fi

	return 0
}

test_corrupt_parity() {
	echo "--- test 1: corrupt parity"
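	# drop the page cache so the corruption below and the later checks hit
	# the loop devices rather than cached data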
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo dd 2>/dev/null if=/dev/zero of=$parity_dev bs=1 \
		seek=$parity_off count=5
	check_fs &>/dev/null && {
			echo Corruption failed
			exit 100
		}
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo mount $loop1 mnt
	sudo btrfs scrub start mnt/.
	sync; sync
	cat mnt/out.txt &>/dev/null || echo "Read FAIL"
	sudo umount mnt
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	check_fs || return 1
	echo "--- test1: OK"
	return 0
}



test_corrupt_data2() {
	echo "--- test 2: corrupt data2"
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
		seek=$data2_off count=5
	check_fs &>/dev/null && {
			echo Corruption failed
			exit 100
		}
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo mount $loop1 mnt
	sudo btrfs scrub start mnt/.
	sync; sync
	cat mnt/out.txt &>/dev/null || echo "Read FAIL"
	sudo umount mnt
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	check_fs || return 1
	echo "--- test2: OK"
	return 0
}

test_corrupt_data1() {
	echo "--- test 3: corrupt data1"
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo dd 2>/dev/null if=/dev/zero of=$data1_dev bs=1 \
		seek=$data1_off count=5
	check_fs &>/dev/null && {
			echo Corruption failed
			exit 100
		}
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo mount $loop1 mnt
	sudo btrfs scrub start mnt/.
	sync; sync
	cat mnt/out.txt &>/dev/null || echo "Read FAIL"
	sudo umount mnt
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	check_fs || return 1
	echo "--- test3: OK"
	return 0
}

test_corrupt_data2_wo_scrub() {
	echo "--- test 4: corrupt data2; read without scrub"
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo dd 2>/dev/null if=/dev/zero of=$data2_dev bs=1 \
		seek=$data2_off count=5
	check_fs &>/dev/null && {
			echo Corruption failed
			exit 100
		}
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	sudo mount $loop1 mnt
	cat mnt/out.txt &>/dev/null || echo "Read FAIL"
	sudo umount mnt
	echo 3 | sudo tee >/dev/null /proc/sys/vm/drop_caches
	check_fs || return 1
	echo "--- test 4: OK"
	return 0
}


for t in test_corrupt_parity test_corrupt_data2 test_corrupt_data1 \
    test_corrupt_data2_wo_scrub; do
    
        init_fs &>/dev/null
        if ! check_fs &>/dev/null; then 
             echo Integrity test failed
             exit 100
        fi

        $t
        echo
    
done


-----------------




[1] See email "New btrfs sub command: btrfs inspect physical-find"
[2] See email "[BUG] Btrfs scrub sometime recalculate wrong parity in raid5"



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
@ 2016-08-19 13:17 Philip Espunkt
  0 siblings, 0 replies; 43+ messages in thread
From: Philip Espunkt @ 2016-08-19 13:17 UTC (permalink / raw)
  To: kreijack; +Cc: linux-btrfs

> On 07/12/2016 05:50 PM, Goffredo Baroncelli wrote:
>>
>> Most of the time, it seems that btrfs raid5 is not capable of
>> rebuilding parity and data correctly. Worse, the message returned
>> by scrub is inconsistent with the status on disk. The tests don't
>> fail every time, which complicates the diagnosis, but my script
>> fails most of the time.

Have you opened a bug ticket at http://bugzilla.kernel.org/?
I think that would be useful for tracking.


Philip

* [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
@ 2016-06-25 12:21 Goffredo Baroncelli
  2016-06-25 17:25 ` Chris Murphy
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Goffredo Baroncelli @ 2016-06-25 12:21 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4691 bytes --]

Hi all,

following the thread "Adventures in btrfs raid5 disk recovery", I investigated a bit the BTRFS capability to scrub a corrupted raid5 filesystem. To test it, I first found where a file was stored, and then I tried to corrupt the data disks (while the filesystem was unmounted) or the parity disk.
The results showed that sometimes the kernel recomputes the parity wrongly.

I tested the following kernels:
- 4.6.1
- 4.5.4
and both showed the same behavior.

The test was performed as described below:

1) I created a filesystem in raid5 mode (for data and metadata) of 1500MB:

	truncate -s 500M disk1.img; losetup -f disk1.img
	truncate -s 500M disk2.img; losetup -f disk2.img
	truncate -s 500M disk3.img; losetup -f disk3.img
	sudo mkfs.btrfs -d raid5 -m raid5 /dev/loop[0-2]
	sudo mount /dev/loop0 mnt/

2) I created a file with a length of 128KiB:

	python -c "print 'ad'+'a'*65534+'bd'+'b'*65533" | sudo tee mnt/out.txt
	sudo umount mnt/

3) I looked at the output of 'btrfs-debug-tree /dev/loop0' and I was able to find where the file stripe is located:

/dev/loop0: offset=81788928+16*4096        (64k, second half of the file: 'bdbbbb.....)
/dev/loop1: offset=61865984+16*4096        (64k, first half of the file: 'adaaaa.....)
/dev/loop2: offset=61865984+16*4096        (64k, parity: '\x03\x00\x03\x03\x03.....)

4) I tried to corrupt each disk (one disk per test), and then ran a scrub:

for example for the disk /dev/loop2:
	sudo dd if=/dev/zero of=/dev/loop2 bs=1 \
        	        seek=$((61865984+16*4096)) count=5
	sudo mount /dev/loop0 mnt
	sudo btrfs scrub start mnt/.

5) I checked the disks at the offsets above, to verify that the data/parity is correct.
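
As a sketch, the parity element on /dev/loop2 can be inspected with the same dd/xxd approach used in the attached script (offset as found in step 3; the first bytes are expected to be the XOR of the two data elements, i.e. 03 00 03 03 03 ...):

	dd if=/dev/loop2 bs=1 skip=$((61865984+16*4096)) count=16 2>/dev/null | xxd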

However I found that:
1) if I corrupt the parity disk (/dev/loop2), scrub doesn't report any corruption, but recomputes the parity (always correctly);

2) when I corrupted one of the other disks (/dev/loop[01]), btrfs was able to detect the corruption, but I found two main behaviors:

2.a) the kernel repaired the damage, but computed the wrong parity: where the parity should be, the kernel copied the data of the second disk onto the parity disk;

2.b) the kernel repaired the damage, and rebuilt a correct parity.

I have to point out another strange thing: in dmesg I found two kinds of messages:

msg1)
	[....]
	[ 1021.366944] BTRFS info (device loop2): disk space caching is enabled
	[ 1021.366949] BTRFS: has skinny extents
	[ 1021.399208] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1021.399214] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
	[ 1021.399291] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0

msg2)
	[ 1017.435068] BTRFS info (device loop2): disk space caching is enabled
	[ 1017.435074] BTRFS: has skinny extents
	[ 1017.436778] BTRFS info (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
	[ 1017.463403] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1017.463409] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
	[ 1017.463467] BTRFS warning (device loop2): checksum error at logical 142802944 on dev /dev/loop0, sector 159872, root 5, inode 257, offset 65536, length 4096, links 1 (path: out.txt)
	[ 1017.463472] BTRFS error (device loop2): bdev /dev/loop0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
	[ 1017.463512] BTRFS error (device loop2): unable to fixup (regular) error at logical 142802944 on dev /dev/loop0
	[ 1017.463535] BTRFS error (device loop2): fixed up error at logical 142802944 on dev /dev/loop0


but these seem to be unrelated to the kernel behavior 2.a) or 2.b).

Another strangeness is that scrub sometimes reports
 ERROR: there are uncorrectable errors
and sometimes reports
 WARNING: errors detected during scrubbing, corrected

but these also seem unrelated to behavior 2.a) or 2.b), or to msg1 or msg2.


Enclosed you can find the script which I used to trigger the bug. I had to rerun it several times to show the problem because it doesn't happen every time. Note that the offsets and the loop device names are hard coded. You must run the script from the directory it is in, e.g. "bash test.sh".

Br
G.Baroncelli


 
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5



[-- Attachment #2: test.sh --]
[-- Type: application/x-sh, Size: 2819 bytes --]


Thread overview: 43+ messages
2016-07-12 21:50 [BUG] Btrfs scrub sometime recalculate wrong parity in raid5: take two Goffredo Baroncelli
2016-07-14 21:20 ` Chris Mason
2016-07-15  4:39   ` Andrei Borzenkov
2016-07-15 13:20     ` Chris Mason
2016-07-15 15:10       ` Andrei Borzenkov
2016-07-15 15:21         ` Chris Mason
2016-07-15 16:30     ` Goffredo Baroncelli
2016-07-15 16:28   ` Goffredo Baroncelli
2016-07-15 16:29     ` Chris Mason
2016-07-15 16:34       ` Andrei Borzenkov
2016-07-16 15:51 ` [BUG] Btrfs scrub sometime recalculate wrong parity in raid5 Jarkko Lavinen
2016-07-17 19:46   ` Jarkko Lavinen
2016-07-18 18:56   ` Goffredo Baroncelli
  -- strict thread matches above, loose matches on Subject: below --
2016-08-19 13:17 Philip Espunkt
2016-06-25 12:21 Goffredo Baroncelli
2016-06-25 17:25 ` Chris Murphy
2016-06-25 17:58   ` Chris Murphy
2016-06-25 18:42     ` Goffredo Baroncelli
2016-06-25 22:33       ` Chris Murphy
2016-06-26  9:20         ` Goffredo Baroncelli
2016-06-26 16:43           ` Chris Murphy
2016-06-26  2:53   ` Duncan
2016-06-26 22:33     ` ronnie sahlberg
2016-06-26 22:38       ` Hugo Mills
2016-06-27  3:22         ` Steven Haigh
2016-06-27  3:21       ` Steven Haigh
2016-06-27 19:47         ` Duncan
2016-06-27  3:50       ` Christoph Anton Mitterer
2016-06-27  4:35         ` Andrei Borzenkov
2016-06-27 16:39           ` Christoph Anton Mitterer
2016-09-21  7:28 ` Qu Wenruo
2016-09-21  7:35   ` Tomasz Torcz
2016-09-21  9:15     ` Qu Wenruo
2016-09-21 15:13       ` Chris Murphy
2016-09-22  2:08         ` Qu Wenruo
2016-09-22  2:44           ` Chris Murphy
2016-09-22  3:00             ` Qu Wenruo
2016-09-22  3:12               ` Chris Murphy
2016-09-22  3:07           ` Christoph Anton Mitterer
2016-09-22  3:18             ` Qu Wenruo
2016-09-21 15:02   ` Chris Murphy
2016-11-04  2:10 ` Qu Wenruo
2016-11-05  7:23   ` Goffredo Baroncelli
