On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.

Still reproducible on 5.5.1. Some more details below:

> [...snip...]

> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
>         #!/bin/bash
>         set -x
>
>         # Reset state
>         umount /try
>         mkdir -p /try
>
>         # Create FS and mount. Use raid1 metadata so the filesystem
>         # has a fair chance of survival.
>         mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
>         btrfs dev scan
>         mount -onoatime /dev/vdb /try || exit 1
>
>         # Must be on btrfs
>         cd /try || exit 1
>         btrfs sub list . || exit 1
>
>         # Fill disk with files. Increase seq for more test data
>         # to increase the chance of finding corruption.
>         for x in $(seq 0 3); do
>                 sync &
>                 rsync -axHSWI "/usr/." "/try/$(date +%s)" &
>                 sleep 2
>         done
>         wait
>
>         # Remove half the files. If you increased seq above, increase the
>         # '-2' here as well.
>         find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
>                 sync &
>                 rm -fr "$x" &
>                 sleep 2
>         done
>         wait
>
>         # Fill in some of the holes. This is to get a good mix of
>         # partially filled RAID stripes of various sizes.
>         for x in $(seq 0 1); do
>                 sync &
>                 rsync -axHSWI "/usr/." "/try/$(date +%s)" &
>                 sleep 2
>         done
>         wait
>
>         # Calculate hash we will use to verify data later
>         find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
>         # Make sure it's all on the disk
>         sync
>         sysctl vm.drop_caches=3
>
>         # See distribution of data across drives
>         btrfs dev usage /try
>         btrfs fi usage /try
>
>         # Corrupt one byte of each 4K block in the first 4G of /dev/vdb,
>         # so that the crc32c algorithm will always detect the corruption.
>         # If you need a bigger test disk then increase the '4'.
>         # Leave the first 16MB of the disk alone so we don't kill the superblock.
>         perl -MFcntl -e '
>                 for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
>                         my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
>                         sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
>                         sysread(STDIN, $dat, 1) or die "read: $!";
>                         sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
>                         syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
>                 }
>         ' < /dev/vdb 1<> /dev/vdb
>
>         # Make sure all that's on disk and our caches are empty
>         sync
>         sysctl vm.drop_caches=3

I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2). Part 1 creates
a RAID5 array with corruption on one disk. Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.
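A minimal way to run the split (sketch only; the script and log file
names are mine, not part of the original reproducer):

        # inside the test VM; part1.sh and part2.sh are hypothetical
        # names for the two halves of the script above
        bash part1.sh 2>&1 | tee part1.log

        # snapshot the VM's disk images on the host at this point,
        # then capture one log per part 2 run
        bash part2.sh 2>&1 | tee "result-$(date +%s).txt"

(The result-<timestamp>.txt names quoted below follow that pattern; the
exact naming is my reconstruction.)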
I saved a copy of the VM's disk images after part 1, so I could
repeatedly reset and run part 2 on identical filesystem images.

I also split up

        btrfs scrub start -rBd /try

and replaced it with

        btrfs scrub start -rBd /dev/vdb
        btrfs scrub start -rBd /dev/vdc

(and the same for the non-read-only scrubs) so that scrub runs on each
disk sequentially instead of on both disks in parallel. The modified
part 2 is sketched at the end of this mail.

Original part 2:

>         # Before and after dev stat and read-only scrub to see what the damage looks like.
>         # This will produce some ratelimited kernel output.
>         btrfs dev stat /try | grep -v ' 0$'
>         btrfs scrub start -rBd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # Verify all the files are correctly restored transparently by btrfs.
>         # btrfs repairs correctable blocks as a side-effect.
>         sha1sum --quiet -c /tmp/sha1sums.txt
>
>         # Do a scrub to clean up stray corrupted blocks (including superblocks)
>         btrfs dev stat /try | grep -v ' 0$'
>         btrfs scrub start -Bd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # This scrub should be clean, but sometimes is not.
>         btrfs scrub start -Bd /try
>         btrfs dev stat /try | grep -v ' 0$'
>
>         # Verify that the scrub didn't corrupt anything.
>         sha1sum --quiet -c /tmp/sha1sums.txt

Multiple runs of part 2 produce different scrub results:

        result-1581019560.txt:scrub device /dev/vdb (id 1) done
        result-1581019560.txt:Error summary: super=1 csum=273977
        result-1581019560.txt:scrub device /dev/vdc (id 2) done
        result-1581019560.txt:Error summary: read=1600744 csum=230813
        result-1581019560.txt:[/dev/vdb].corruption_errs 504791

        result-1581029949.txt:scrub device /dev/vdb (id 1) done
        result-1581029949.txt:Error summary: super=1 csum=273799
        result-1581029949.txt:scrub device /dev/vdc (id 2) done
        result-1581029949.txt:Error summary: read=1600744 csum=230813
        result-1581029949.txt:[/dev/vdb].corruption_errs 504613

With scrub run on the whole filesystem it is no better:

        result-1.txt:scrub device /dev/vdb (id 1) done
        result-1.txt:Error summary: super=1 csum=272757
        result-1.txt:scrub device /dev/vdc (id 2) done
        result-1.txt:Error summary: read=1600744 csum=230813
        result-1.txt:[/dev/vdb].corruption_errs 503571

        result-2.txt:scrub device /dev/vdb (id 1) done
        result-2.txt:Error summary: super=1 csum=273430
        result-2.txt:scrub device /dev/vdc (id 2) done
        result-2.txt:Error summary: read=1600744 csum=230813
        result-2.txt:[/dev/vdb].corruption_errs 504244

        result-3.txt:scrub device /dev/vdb (id 1) done
        result-3.txt:Error summary: super=1 csum=273456
        result-3.txt:scrub device /dev/vdc (id 2) done
        result-3.txt:Error summary: read=1600744 csum=230813
        result-3.txt:[/dev/vdb].corruption_errs 504270

The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):

        result-1.txt:scrub device /dev/vdb (id 1) done
        result-1.txt:Error summary: csum=29911
        result-1.txt:scrub device /dev/vdc (id 2) done
        result-1.txt:Error summary: csum=11

        result-2.txt:scrub device /dev/vdb (id 1) done
        result-2.txt:Error summary: csum=29919
        result-2.txt:scrub device /dev/vdc (id 2) done
        result-2.txt:Error summary: csum=14

        result-3.txt:scrub device /dev/vdb (id 1) done
        result-3.txt:Error summary: csum=29713
        result-3.txt:scrub device /dev/vdc (id 2) done
        result-3.txt:Error summary: csum=9

The error counts on /dev/vdb are different after the sha1sum -c,
indicating that file reads are nondeterministically correcting or not
correcting csum errors on btrfs raid5. This could be due to readahead
or maybe something else.
The error counts on /dev/vdc are interesting, as that drive is not corrupted, nor does it have read errors, but it is very consistently reporting read=1600744 csum=230813 in scrub output.
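
For reference, here is the modified part 2 with the per-device scrubs
substituted in, as described above. It is reconstructed from the
original part 2, so treat it as a sketch rather than a verbatim copy of
what ran:

        # Before each run: on the host, with the VM shut down, the two
        # disk images are copied back from the copies saved after part 1,
        # then the VM is booted and /try is mounted again.

        # Before and after dev stat and read-only scrub, one device at a time.
        btrfs dev stat /try | grep -v ' 0$'
        btrfs scrub start -rBd /dev/vdb
        btrfs scrub start -rBd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # Verify all the files are correctly restored transparently by btrfs.
        # btrfs repairs correctable blocks as a side-effect.
        sha1sum --quiet -c /tmp/sha1sums.txt

        # Read-write scrub to clean up stray corrupted blocks, again per device.
        btrfs dev stat /try | grep -v ' 0$'
        btrfs scrub start -Bd /dev/vdb
        btrfs scrub start -Bd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # This scrub should be clean, but sometimes is not.
        btrfs scrub start -Bd /dev/vdb
        btrfs scrub start -Bd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'

        # Verify that the scrub didn't corrupt anything.
        sha1sum --quiet -c /tmp/sha1sums.txt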