On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.

Still reproducible on 5.5.1. Some more details below:

> [...snip...]

> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
>         #!/bin/bash
>         set -x
>
>         # Reset state
>         umount /try
>         mkdir -p /try
>
>         # Create FS and mount. Use raid1 metadata so the filesystem
>         # has a fair chance of survival.
>         mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
>         btrfs dev scan
>         mount -onoatime /dev/vdb /try || exit 1
>
>         # Must be on btrfs
>         cd /try || exit 1
>         btrfs sub list . || exit 1
>
>         # Fill disk with files. Increase seq for more test data
>         # to increase the chance of finding corruption.
>         for x in $(seq 0 3); do
>                 sync &
>                 rsync -axHSWI "/usr/." "/try/$(date +%s)" &
>                 sleep 2
>         done
>         wait
>
>         # Remove half the files. If you increased seq above, increase the
>         # '-2' here as well.
>         find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
>                 sync &
>                 rm -fr "$x" &
>                 sleep 2
>         done
>         wait
>
>         # Fill in some of the holes. This is to get a good mix of
>         # partially filled RAID stripes of various sizes.
>         for x in $(seq 0 1); do
>                 sync &
>                 rsync -axHSWI "/usr/." "/try/$(date +%s)" &
>                 sleep 2
>         done
>         wait
>
>         # Calculate hash we will use to verify data later
>         find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
>         # Make sure it's all on the disk
>         sync
>         sysctl vm.drop_caches=3
>
>         # See distribution of data across drives
>         btrfs dev usage /try
>         btrfs fi usage /try
>
>         # Corrupt one byte of each 4K block in the first 4G of /dev/vdb,
>         # so that the crc32c algorithm will always detect the corruption.
>         # If you need a bigger test disk then increase the '4'.
>         # Leave the first 16MB of the disk alone so we don't kill the superblock.
>         perl -MFcntl -e '
>                 for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
>                         my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
>                         sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
>                         sysread(STDIN, $dat, 1) or die "read: $!";
>                         sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
>                         syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
>                 }
>         ' < /dev/vdb 1<> /dev/vdb
>
>         # Make sure all that's on disk and our caches are empty
>         sync
>         sysctl vm.drop_caches=3

I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2). Part 1 creates
a RAID5 array with corruption on one disk. Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.
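A minimal way to run the split (sketch only; the script and log file
names are mine, not part of the original reproducer):

        # inside the test VM; part1.sh and part2.sh are hypothetical
        # names for the two halves of the script above
        bash part1.sh 2>&1 | tee part1.log

        # snapshot the VM's disk images on the host at this point,
        # then capture one log per part 2 run
        bash part2.sh 2>&1 | tee "result-$(date +%s).txt"

(The result-<timestamp>.txt names quoted below follow that pattern; the
exact naming is my reconstruction.)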
I saved a copy of the VM's disk images after part 1, so I could
repeatedly reset and run part 2 on identical filesystem images.

I also split up

        btrfs scrub start -rBd /try

and replaced it with

        btrfs scrub start -rBd /dev/vdb
        btrfs scrub start -rBd /dev/vdc

(and the same for the non-read-only scrubs) so that scrub runs on each
disk sequentially instead of on both disks in parallel. The modified
part 2 is sketched at the end of this mail.

Original part 2:

>         # Before and after dev stat and read-only scrub to see what the damage looks like.
>         # This will produce some ratelimited kernel output.
>         btrfs dev stat /try | grep -v ' 0$'
>         btrfs scrub start -rBd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # Verify all the files are correctly restored transparently by btrfs.
>         # btrfs repairs correctable blocks as a side-effect.
>         sha1sum --quiet -c /tmp/sha1sums.txt
>
>         # Do a scrub to clean up stray corrupted blocks (including superblocks)
>         btrfs dev stat /try | grep -v ' 0$'
>         btrfs scrub start -Bd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # This scrub should be clean, but sometimes is not.
>         btrfs scrub start -Bd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # Verify that the scrub didn't corrupt anything.
>         sha1sum --quiet -c /tmp/sha1sums.txt

Multiple runs of part 2 produce different scrub results:

        result-1581019560.txt:scrub device /dev/vdb (id 1) done
        result-1581019560.txt:Error summary: super=1 csum=273977
        result-1581019560.txt:scrub device /dev/vdc (id 2) done
        result-1581019560.txt:Error summary: read=1600744 csum=230813
        result-1581019560.txt:[/dev/vdb].corruption_errs 504791

        result-1581029949.txt:scrub device /dev/vdb (id 1) done
        result-1581029949.txt:Error summary: super=1 csum=273799
        result-1581029949.txt:scrub device /dev/vdc (id 2) done
        result-1581029949.txt:Error summary: read=1600744 csum=230813
        result-1581029949.txt:[/dev/vdb].corruption_errs 504613

With scrub run on the whole filesystem it is no better:

        result-1.txt:scrub device /dev/vdb (id 1) done
        result-1.txt:Error summary: super=1 csum=272757
        result-1.txt:scrub device /dev/vdc (id 2) done
        result-1.txt:Error summary: read=1600744 csum=230813
        result-1.txt:[/dev/vdb].corruption_errs 503571

        result-2.txt:scrub device /dev/vdb (id 1) done
        result-2.txt:Error summary: super=1 csum=273430
        result-2.txt:scrub device /dev/vdc (id 2) done
        result-2.txt:Error summary: read=1600744 csum=230813
        result-2.txt:[/dev/vdb].corruption_errs 504244

        result-3.txt:scrub device /dev/vdb (id 1) done
        result-3.txt:Error summary: super=1 csum=273456
        result-3.txt:scrub device /dev/vdc (id 2) done
        result-3.txt:Error summary: read=1600744 csum=230813
        result-3.txt:[/dev/vdb].corruption_errs 504270

The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):

        result-1.txt:scrub device /dev/vdb (id 1) done
        result-1.txt:Error summary: csum=29911
        result-1.txt:scrub device /dev/vdc (id 2) done
        result-1.txt:Error summary: csum=11

        result-2.txt:scrub device /dev/vdb (id 1) done
        result-2.txt:Error summary: csum=29919
        result-2.txt:scrub device /dev/vdc (id 2) done
        result-2.txt:Error summary: csum=14

        result-3.txt:scrub device /dev/vdb (id 1) done
        result-3.txt:Error summary: csum=29713
        result-3.txt:scrub device /dev/vdc (id 2) done
        result-3.txt:Error summary: csum=9

The error counts on /dev/vdb are different after the sha1sum -c,
indicating that file reads are nondeterministically correcting or not
correcting csum errors on btrfs raid5. This could be due to readahead
or maybe something else.
The error counts on /dev/vdc are interesting, as that drive is not corrupted, nor does it have read errors, but it is very consistently reporting read=1600744 csum=230813 in scrub output.
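
For reference, here is the modified part 2 with the per-device scrubs
substituted in, as described above. It is reconstructed from the
original part 2, so treat it as a sketch rather than a verbatim copy of
what ran:

        # Before each run: on the host, with the VM shut down, the two
        # disk images are copied back from the copies saved after part 1,
        # then the VM is booted and /try is mounted again.

        # Before and after dev stat and read-only scrub, one device at a time.
        btrfs dev stat /try | grep -v ' 0$'
        btrfs scrub start -rBd /dev/vdb
        btrfs scrub start -rBd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # Verify all the files are correctly restored transparently by btrfs.
        # btrfs repairs correctable blocks as a side-effect.
        sha1sum --quiet -c /tmp/sha1sums.txt

        # Read-write scrub to clean up stray corrupted blocks, again per device.
        btrfs dev stat /try | grep -v ' 0$'
        btrfs scrub start -Bd /dev/vdb
        btrfs scrub start -Bd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # This scrub should be clean, but sometimes is not.
        btrfs scrub start -Bd /dev/vdb
        btrfs scrub start -Bd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # Verify that the scrub didn't corrupt anything.
        sha1sum --quiet -c /tmp/sha1sums.txt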