* RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes).  With reproducer for kernel 5.3.11
From: Zygo Blaxell @ 2019-11-19  4:08 UTC
  To: linux-btrfs

Sometimes, btrfs raid5 not only fails to recover corrupt data with a
parity stripe, it also copies bad data over good data.  This propagates
errors between drives and makes a correctable failure uncorrectable.
Reproducer script at the end.
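
For reference, recovering a bad strip is pure XOR arithmetic.  Here is a
minimal sketch in Perl (an illustration of the math only, not btrfs kernel
code) of how a lost or corrupt data strip should be rebuilt from parity:

	#!/usr/bin/perl
	# Illustration of RAID5 recovery math (not btrfs code).
	# Parity is the XOR of all data strips in a stripe; any one bad
	# strip equals the XOR of parity and the surviving data strips.
	my $d0 = "A" x 16;              # data strip 0
	my $d1 = "B" x 16;              # data strip 1
	my $parity = $d0 ^ $d1;         # Perl's ^ XORs strings bytewise

	my $rebuilt = $parity ^ $d1;    # rebuild strip 0 without reading it
	print $rebuilt eq $d0 ? "recovered OK\n" : "MISMATCH\n";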

This doesn't happen very often.  The repro script corrupts *every*
data block on one of the RAID5 drives, and only a handful of blocks
fail to be corrected--about 16 errors per 3GB of data, but sometimes
half or double that rate.  It behaves more like a race condition than
a boundary condition.  It can take a few tries to get a failure with a
16GB disk array.  It seems to happen more often on 2-disk raid5 than
5-disk raid5, but if you repeat the test enough times even a 5-disk
raid5 will eventually fail.

Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
I haven't tried this reproducer on kernels earlier than 4.16 due to
other raid5 issues in earlier kernels.

I found 16 corrupted files after one test with 3GB of data (multiple
copies of /usr on a Debian vm).  I dumped out the files with 'btrfs
restore'.  These are the differences between the restored files and the
original data:

	# find -type f -print | while read x; do ls -l "$x"; cmp -l /tmp/restored/"$x" /usr/"${x#*/*/}"; done
	-rw-r--r-- 1 root root 4179 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
	2532 253 147
	-rw-r--r-- 1 root root 18725 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/File.pm
	 2481  20 145
	 6481 270 145
	 8876   3 150
	13137 232  75
	16805 103  55
	-rw-r--r-- 1 root root 3421 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
	2064   0 157
	-rw-r--r-- 1 root root 4948 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
	2262 226 145
	-rw-r--r-- 1 root root 11692 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State.pm
	 7115 361 164
	 8333 330  12
	-rw-r--r-- 1 root root 316211 Nov 18 20:47 ./1574119549/share/perl/5.24.1/perl5db.pl
	263868  35  40
	268307 143  40
	272168 370 154
	275138  25 145
	280076 125  40
	282683 310 136
	286949 132  44
	293453 176 163
	296803  40  52
	300719 307  40
	305953  77 174
	310419 124 161
	312922  47  40
	-rw-r--r-- 1 root root 11113 Nov 18 20:47 ./1574119549/share/perl/5.24.1/B/Debug.pm
	  787 323 102
	 6775 372 141
	-rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119549/share/man/man1/getconf.1.gz
	 484 262  41
	-rw-r--r-- 1 root root 3296 Nov 18 20:47 ./1574119549/share/man/man1/genrsa.1ssl.gz
	2777 247 164
	-rw-r--r-- 1 root root 4815 Nov 18 20:47 ./1574119549/share/man/man1/genpkey.1ssl.gz
	3128  22   6
	-rw-r--r-- 1 root root 6558 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
	3378 253 146
	6224 162  42
	-rw-r--r-- 1 root root 75950 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
	68112   2 111
	73226 344 150
	75622  12  40
	-rw-r--r-- 1 root root 3873 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
	1859 247  40
	-rw-r--r-- 1 root root 86458 Nov 18 20:47 ./1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
	66721 200 346
	72692 211 270
	74596 336 101
	79179 257   0
	85438 104 256
	-rw-r--r-- 1 root root 2528 Nov 18 20:47 ./1574119553/share/man/man1/getent.1.gz
	1722 243 356
	-rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119553/share/man/man1/getconf.1.gz
	1062 212 267

Note that the reproducer script corrupts exactly one random byte per 4K
block, so that the corruption is always detected by the crc32c checksum.
In all cases the restored files differ by one byte per 4K block, as
expected.
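
As a sanity check on that claim, here is a short Perl sketch showing that
a one-byte flip changes a 4K block's CRC-32C.  It assumes the non-core
Digest::CRC module; the constructor parameters below are the standard
Castagnoli (CRC-32C) ones:

	#!/usr/bin/perl
	# Sketch: one flipped byte always changes a block's CRC-32C.
	# Assumes the (non-core) Digest::CRC module is installed.
	use Digest::CRC;

	sub crc32c {
		# Standard CRC-32C (Castagnoli) parameters, as used by btrfs.
		my $d = Digest::CRC->new(width => 32, poly => 0x1EDC6F41,
			init => 0xFFFFFFFF, xorout => 0xFFFFFFFF,
			refin => 1, refout => 1);
		$d->add($_[0]);
		return $d->digest;
	}

	my $block = "x" x 4096;
	my $bad = $block;
	my $pos = int(rand(4096));
	substr($bad, $pos, 1) =
		chr(ord(substr($bad, $pos, 1)) ^ (1 + int(rand(255))));
	printf "good=%08x bad=%08x\n", crc32c($block), crc32c($bad);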

I dumped out the files by reading the blocks directly from the file
system.  Data and parity blocks from btrfs were identical and matched the
corrupted data from btrfs restore.  This is interesting because the repro
script only corrupts one drive!  The only way blocks on both drives end
up corrupted identically (or at all) is if btrfs copies the bad data
over the good.
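
To make that concrete, here is a sketch of the end state (a model of what
the on-disk result looks like, not the actual kernel code path): if repair
trusts an unverified corrupt strip and recomputes parity from it, the
corruption becomes self-consistent and therefore uncorrectable:

	#!/usr/bin/perl
	# Model of the observed end state (not the kernel code path).
	my $d0 = "A" x 16;
	my $d1 = "B" x 16;
	my $parity = $d0 ^ $d1;         # consistent stripe on disk

	substr($d0, 3, 1) = "Z";        # corrupt one drive only

	# A correct repair would rebuild d0 from parity and verify its csum:
	my $good = $parity ^ $d1;       # equals the original d0

	# A bad repair recomputes parity from the corrupt strip instead:
	$parity = $d0 ^ $d1;            # bad data propagated into parity

	my $rebuilt = $parity ^ $d1;    # reconstruction now returns bad data
	print $rebuilt eq $good ? "still correctable\n" : "now uncorrectable\n";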

There is also some spatial clustering of the unrecoverable blocks.
Here are the physical block addresses (in hex to make mod-4K and mod-64K
easier to see):

	Extent bytenr start..end   Filename
	0xcc160000..0xcc176000 1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
	0xcd0f0000..0xcd0f2000 1574119549/share/man/man1/genpkey.1ssl.gz
	0xcd0f3000..0xcd0f4000 1574119549/share/man/man1/genrsa.1ssl.gz
	0xcd0f4000..0xcd0f5000 1574119549/share/man/man1/getconf.1.gz
	0xcd0fb000..0xcd0fc000 1574119553/share/man/man1/getconf.1.gz
	0xcd13f000..0xcd140000 1574119553/share/man/man1/getent.1.gz
	0xd0d70000..0xd0dbe000 1574119549/share/perl/5.24.1/perl5db.pl
	0xd0f8f000..0xd0f92000 1574119549/share/perl/5.24.1/App/Prove/State.pm
	0xd0f92000..0xd0f94000 1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
	0xd0f94000..0xd0f95000 1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
	0xd0fd6000..0xd0fd8000 1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
	0xd0fd8000..0xd0fdd000 1574119549/share/perl/5.24.1/Archive/Tar/File.pm
	0xd0fdd000..0xd0fe0000 1574119549/share/perl/5.24.1/B/Debug.pm
	0xd1540000..0xd1553000 1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
	0xd155c000..0xd155e000 1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
	0xd155e000..0xd155f000 1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm

Notice that 0xcd0f0000 to 0xcd0fb000 are in the same RAID5 strip (64K),
as are 0xd0f92000 to 0xd0fdd000, and 0xd155c000 to 0xd155e000.  The files
gnupg2.mo and perl5db.pl also include multiple corrupted blocks within
a single RAID strip.
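
(The grouping is easy to check mechanically: assuming strips are 64K-aligned
in the logical address space, shifting a bytenr right by 16 bits gives its
strip index.  A quick sketch using a few addresses from the table above:)

	#!/usr/bin/perl
	# Group corrupted extent start addresses by 64K RAID5 strip.
	my @bytenrs = (0xcd0f0000, 0xcd0f3000, 0xcd0f4000, 0xcd0fb000,
		       0xd0f92000, 0xd0f94000, 0xd155c000, 0xd155e000);
	my %strips;
	push @{ $strips{$_ >> 16} }, sprintf("0x%x", $_) for @bytenrs;
	for my $s (sort { $a <=> $b } keys %strips) {
		printf "strip 0x%x0000: %s\n", $s, join(", ", @{ $strips{$s} });
	}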

All files that had sha1sum failures also had EIO/csum failures, so btrfs
did detect all the (now uncorrectable) corrupted blocks correctly.  Also
no problems have been seen with btrfs raid1 (metadata or data).

Reproducer (runs in a qemu with test disks on /dev/vdb and /dev/vdc):

	#!/bin/bash
	set -x

	# Reset state
	umount /try
	mkdir -p /try

	# Create FS and mount.	Use raid1 metadata so the filesystem
	# has a fair chance of survival.
	mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
	btrfs dev scan
	mount -onoatime /dev/vdb /try || exit 1

	# Must be on btrfs
	cd /try || exit 1
	btrfs sub list . || exit 1

	# Fill disk with files.  Increase seq for more test data
	# to increase the chance of finding corruption.
	for x in $(seq 0 3); do
		sync &
		rsync -axHSWI "/usr/." "/try/$(date +%s)" &
		sleep 2
	done
	wait

	# Remove half the files.  If you increased seq above, increase the
	# '-2' here as well.  'unsort' randomizes line order (a non-standard
	# tool; shuf(1) from coreutils would also work).
	find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
		sync &
		rm -fr "$x" &
		sleep 2
	done
	wait

	# Fill in some of the holes.  This is to get a good mix of
	# partially filled RAID stripes of various sizes.
	for x in $(seq 0 1); do
		sync &
		rsync -axHSWI "/usr/." "/try/$(date +%s)" &
		sleep 2
	done
	wait

	# Calculate hash we will use to verify data later
	find -type f -exec sha1sum {} + > /tmp/sha1sums.txt

	# Make sure it's all on the disk
	sync
	sysctl vm.drop_caches=3

	# See distribution of data across drives
	btrfs dev usage /try
	btrfs fi usage /try

	# Corrupt one random byte in each 4K block of the first 4GB of
	# /dev/vdb, so that the crc32c checksum always detects the corruption.
	# If you need a bigger test disk then increase the '4'.
	# Leave the first 16MB of the disk alone so we don't kill the superblock.
	perl -MFcntl -e '
		for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
			my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
			sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
			sysread(STDIN, $dat, 1) or die "read: $!";
			sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
			syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
		}
	' </dev/vdb >/dev/vdb

	# Make sure all that's on disk and our caches are empty
	sync
	sysctl vm.drop_caches=3

	# Before and after dev stat and read-only scrub to see what the damage looks like.
	# This will produce some ratelimited kernel output.
	btrfs dev stat /try | grep -v ' 0$'
	btrfs scrub start -rBd /try
	btrfs dev stat /try | grep -v ' 0$'

	# Verify all the files are correctly restored transparently by btrfs.
	# btrfs repairs correctable blocks as a side-effect.
	sha1sum --quiet -c /tmp/sha1sums.txt

	# Do a scrub to clean up stray corrupted blocks (including superblocks)
	btrfs dev stat /try | grep -v ' 0$'
	btrfs scrub start -Bd /try
	btrfs dev stat /try | grep -v ' 0$'

	# This scrub should be clean, but sometimes is not.
	btrfs scrub start -Bd /try
	btrfs dev stat /try | grep -v ' 0$'

	# Verify that the scrub didn't corrupt anything.
	sha1sum --quiet -c /tmp/sha1sums.txt


* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Qu Wenruo @ 2019-11-19  6:58 UTC
  To: Zygo Blaxell, linux-btrfs


On 2019/11/19 12:08 PM, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data.  This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
> 
> [...snip...]
> 
> 	# Create FS and mount.	Use raid1 metadata so the filesystem
> 	# has a fair chance of survival.
> 	mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1

Two-device raid5 is effectively RAID1, but with all the RAID5 stripe
limitations.  But even two-disk raid5 shouldn't be this unstable, so it
looks like a bug.

I'll try to make the reproducer smaller and look into the behavior.
To be honest, I expected neither 2-disk raid5 nor the strange corruption.

BTW, I guess you also tried 3-disk raid5 -- what was the result?

Thanks,
Qu


* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Zygo Blaxell @ 2019-11-19 23:57 UTC
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 19, 2019 at 02:58:00PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/11/19 12:08 PM, Zygo Blaxell wrote:
> > Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> > parity stripe, it also copies bad data over good data.  This propagates
> > errors between drives and makes a correctable failure uncorrectable.
> > Reproducer script at the end.
> > 
> > [...snip...]
> > 
> > 	# Create FS and mount.	Use raid1 metadata so the filesystem
> > 	# has a fair chance of survival.
> > 	mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
> 
> Two-device raid5 is effectively RAID1, but with all the RAID5 stripe
> limitations.  But even two-disk raid5 shouldn't be this unstable, so it
> looks like a bug.
> 
> I'll try to make the reproducer smaller and look into the behavior.
> To be honest, I expected neither 2-disk raid5 nor the strange corruption.

2-disk raid5 happens naturally with unequally sized devices in a RAID5
array, in a metadata-raid1 data-raid5 array, and in a degraded 3-disk
raid5 array that allocates a new block group with a missing drive.

Arguably those 2-disk raid5 block groups should switch automatically to
raid1 at allocation time, but that's a separate topic filled with gotchas.

> BTW, I guess you also tried 3-disk raid5 -- what was the result?

With 3 and 5 disks it takes longer (more data, more attempts) for this
behavior to appear.  I had about a 20% repro rate with 11GB of data on
5 drives.  With 2 disks it is much faster: about 60% at 3GB.

I didn't analyze the 3- and 5-disk cases in as much detail (didn't want
to do the parity math ;).  The final scrub result has uncorrectable errors,
and files are unreadable at the end.  Hopefully it's the same bug.

I only tried 2, 3, and 5 disk arrays, so I don't know if it's different
with 4 or 6 or any other number.

> Thanks,
> Qu
> 

* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes).  With reproducer for kernel 5.3.11, still works on 5.5.1
From: Zygo Blaxell @ 2020-02-06 23:40 UTC
  To: linux-btrfs

On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data.  This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
> 
> [...snip...]
> 
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.

Still reproducible on 5.5.1.  Some more details below:

> [...snip...]
> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
> 
> 	[...snip: same script as the original post, through the corruption step...]
> 
> 	# Make sure all that's on disk and our caches are empty
> 	sync
> 	sysctl vm.drop_caches=3

I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2).  Part 1 creates
a RAID5 array with corruption on one disk.  Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.

I saved a copy of the VM's disk images after part 1, so I could repeatedly
reset and run part 2 on identical filesystem images.

I also split up

	btrfs scrub start -rBd /try

and replaced it with

	btrfs scrub start -rBd /dev/vdb
	btrfs scrub start -rBd /dev/vdc

(and same for the non-read-only scrubs) so the scrub runs sequentially
on each disk, instead of running on both in parallel.

Original part 2:

> 	# Before and after dev stat and read-only scrub to see what the damage looks like.
> 	# This will produce some ratelimited kernel output.
> 	btrfs dev stat /try | grep -v ' 0$'
> 	btrfs scrub start -rBd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# Verify all the files are correctly restored transparently by btrfs.
> 	# btrfs repairs correctable blocks as a side-effect.
> 	sha1sum --quiet -c /tmp/sha1sums.txt
> 
> 	# Do a scrub to clean up stray corrupted blocks (including superblocks)
> 	btrfs dev stat /try | grep -v ' 0$'
> 	btrfs scrub start -Bd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# This scrub should be clean, but sometimes is not.
> 	btrfs scrub start -Bd /try
> 	btrfs dev stat /try | grep -v ' 0$'
> 
> 	# Verify that the scrub didn't corrupt anything.
> 	sha1sum --quiet -c /tmp/sha1sums.txt

Multiple runs of part 2 produce different results in scrub:

	result-1581019560.txt:scrub device /dev/vdb (id 1) done
	result-1581019560.txt:Error summary:    super=1 csum=273977
	result-1581019560.txt:scrub device /dev/vdc (id 2) done
	result-1581019560.txt:Error summary:    read=1600744 csum=230813
	result-1581019560.txt:[/dev/vdb].corruption_errs  504791

	result-1581029949.txt:scrub device /dev/vdb (id 1) done
	result-1581029949.txt:Error summary:    super=1 csum=273799
	result-1581029949.txt:scrub device /dev/vdc (id 2) done
	result-1581029949.txt:Error summary:    read=1600744 csum=230813
	result-1581029949.txt:[/dev/vdb].corruption_errs  504613

With scrub on the filesystem it is no better:

	result-1.txt:scrub device /dev/vdb (id 1) done
	result-1.txt:Error summary:    super=1 csum=272757
	result-1.txt:scrub device /dev/vdc (id 2) done
	result-1.txt:Error summary:    read=1600744 csum=230813
	result-1.txt:[/dev/vdb].corruption_errs  503571
	result-2.txt:scrub device /dev/vdb (id 1) done
	result-2.txt:Error summary:    super=1 csum=273430
	result-2.txt:scrub device /dev/vdc (id 2) done
	result-2.txt:Error summary:    read=1600744 csum=230813
	result-2.txt:[/dev/vdb].corruption_errs  504244
	result-3.txt:scrub device /dev/vdb (id 1) done
	result-3.txt:Error summary:    super=1 csum=273456
	result-3.txt:scrub device /dev/vdc (id 2) done
	result-3.txt:Error summary:    read=1600744 csum=230813
	result-3.txt:[/dev/vdb].corruption_errs  504270

The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):

	result-1.txt:scrub device /dev/vdb (id 1) done
	result-1.txt:Error summary:    csum=29911
	result-1.txt:scrub device /dev/vdc (id 2) done
	result-1.txt:Error summary:    csum=11
	result-2.txt:scrub device /dev/vdb (id 1) done
	result-2.txt:Error summary:    csum=29919
	result-2.txt:scrub device /dev/vdc (id 2) done
	result-2.txt:Error summary:    csum=14
	result-3.txt:scrub device /dev/vdb (id 1) done
	result-3.txt:Error summary:    csum=29713
	result-3.txt:scrub device /dev/vdc (id 2) done
	result-3.txt:Error summary:    csum=9

The error counts on /dev/vdb are different after the sha1sum -c,
indicating that file reads are nondeterministically correcting or not
correcting csum errors on btrfs raid5.  This could be due to readahead
or maybe something else.

The error counts on /dev/vdc are interesting: that drive was never
corrupted and has no device-level read errors, yet scrub very consistently
reports read=1600744 csum=230813 for it.


* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Filipe Manana @ 2020-04-02 11:10 UTC
  To: Zygo Blaxell; +Cc: linux-btrfs

On Tue, Nov 19, 2019 at 4:09 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data.  This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> [...snip...]
>
> All files that had sha1sum failures also had EIO/csum failures, so btrfs
> did detect all the (now uncorrectable) corrupted blocks correctly.  Also
> no problems have been seen with btrfs raid1 (metadata or data).

Indeed, there's a serious problem with raid5 and raid6.
I've just started a thread (I cc'ed you) describing the problem, to see
if anyone else was aware of it and already has thoughts on how to fix it.

See: https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/

Thanks.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
