From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: linux-btrfs@vger.kernel.org
Subject: Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11, still works on 5.5.1
Date: Thu, 6 Feb 2020 18:40:37 -0500 [thread overview]
Message-ID: <20200206234037.GD13306@hungrycats.org> (raw)
In-Reply-To: <20191119040827.GC22121@hungrycats.org>
On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.
Still reproducible on 5.5.1. Some more details below:
> [...snip...]
> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
> #!/bin/bash
> set -x
>
> # Reset state
> umount /try
> mkdir -p /try
>
> # Create FS and mount. Use raid1 metadata so the filesystem
> # has a fair chance of survival.
> mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
> btrfs dev scan
> mount -onoatime /dev/vdb /try || exit 1
>
> # Must be on btrfs
> cd /try || exit 1
> btrfs sub list . || exit 1
>
> # Fill disk with files. Increase seq for more test data
> # to increase the chance of finding corruption.
> for x in $(seq 0 3); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Remove half the files. If you increased seq above, increase the
> # '-2' here as well.
> # 'unsort' shuffles its input lines; GNU 'shuf' is an equivalent.
> find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> sync &
> rm -fr "$x" &
> sleep 2
> done
> wait
>
> # Fill in some of the holes. This is to get a good mix of
> # partially filled RAID stripes of various sizes.
> for x in $(seq 0 1); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Calculate hash we will use to verify data later
> find . -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
> # Make sure it's all on the disk
> sync
> sysctl vm.drop_caches=3
>
> # See distribution of data across drives
> btrfs dev usage /try
> btrfs fi usage /try
>
> # Corrupt one random byte in each 4 KiB block of the first 4 GiB
> # of /dev/vdb, so that the crc32c checksum always detects the corruption.
> # If you need a bigger test disk, increase the '4'.
> # Leave the first 16 MiB of the disk alone so we don't kill the superblock.
> perl -MFcntl -e '
> for my $x (0..(4 * 1024 * 1024 * 1024 / 4096) - 1) {
> my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> sysread(STDIN, $dat, 1) or die "read: $!";
> sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> }
> ' </dev/vdb >/dev/vdb
>
> # Make sure all that's on disk and our caches are empty
> sync
> sysctl vm.drop_caches=3
I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2). Part 1 creates
a RAID5 array with corruption on one disk. Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.
I saved a copy of the VM's disk images after part 1, so I could repeatedly
reset and run part 2 on identical filesystem images.
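The reset between part-2 runs can be scripted along these lines (a
sketch only; the image names are hypothetical, and with qcow2 images a
qemu-img backing-file snapshot would do the same job):

```shell
# Checkpoint the test disk images after part 1, then restore them
# before each part-2 run. --reflink=auto makes the copy cheap on
# filesystems that support it and falls back to a plain copy elsewhere.
save_images() {
    for img in "$@"; do
        cp --reflink=auto "$img" "$img.part1"
    done
}

restore_images() {
    for img in "$@"; do
        cp --reflink=auto "$img.part1" "$img"
    done
}

# Example:
#   save_images vdb.img vdc.img      # once, after part 1
#   restore_images vdb.img vdc.img   # before every part-2 run
```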
I also split

    btrfs scrub start -rBd /try

into

    btrfs scrub start -rBd /dev/vdb
    btrfs scrub start -rBd /dev/vdc

(and likewise for the non-read-only scrubs), so the scrub runs
sequentially on each disk instead of running on both in parallel.
Original part 2:
> # Before and after dev stat and read-only scrub to see what the damage looks like.
> # This will produce some ratelimited kernel output.
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -rBd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify all the files are correctly restored transparently by btrfs.
> # btrfs repairs correctable blocks as a side-effect.
> sha1sum --quiet -c /tmp/sha1sums.txt
>
> # Do a scrub to clean up stray corrupted blocks (including superblocks)
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # This scrub should be clean, but sometimes is not.
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify that the scrub didn't corrupt anything.
> sha1sum --quiet -c /tmp/sha1sums.txt
Multiple runs of part 2 produce different results in scrub:
result-1581019560.txt:scrub device /dev/vdb (id 1) done
result-1581019560.txt:Error summary: super=1 csum=273977
result-1581019560.txt:scrub device /dev/vdc (id 2) done
result-1581019560.txt:Error summary: read=1600744 csum=230813
result-1581019560.txt:[/dev/vdb].corruption_errs 504791
result-1581029949.txt:scrub device /dev/vdb (id 1) done
result-1581029949.txt:Error summary: super=1 csum=273799
result-1581029949.txt:scrub device /dev/vdc (id 2) done
result-1581029949.txt:Error summary: read=1600744 csum=230813
result-1581029949.txt:[/dev/vdb].corruption_errs 504613
Scrubbing through the mount point instead of per device is no better:
result-1.txt:scrub device /dev/vdb (id 1) done
result-1.txt:Error summary: super=1 csum=272757
result-1.txt:scrub device /dev/vdc (id 2) done
result-1.txt:Error summary: read=1600744 csum=230813
result-1.txt:[/dev/vdb].corruption_errs 503571
result-2.txt:scrub device /dev/vdb (id 1) done
result-2.txt:Error summary: super=1 csum=273430
result-2.txt:scrub device /dev/vdc (id 2) done
result-2.txt:Error summary: read=1600744 csum=230813
result-2.txt:[/dev/vdb].corruption_errs 504244
result-3.txt:scrub device /dev/vdb (id 1) done
result-3.txt:Error summary: super=1 csum=273456
result-3.txt:scrub device /dev/vdc (id 2) done
result-3.txt:Error summary: read=1600744 csum=230813
result-3.txt:[/dev/vdb].corruption_errs 504270
The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):
result-1.txt:scrub device /dev/vdb (id 1) done
result-1.txt:Error summary: csum=29911
result-1.txt:scrub device /dev/vdc (id 2) done
result-1.txt:Error summary: csum=11
result-2.txt:scrub device /dev/vdb (id 1) done
result-2.txt:Error summary: csum=29919
result-2.txt:scrub device /dev/vdc (id 2) done
result-2.txt:Error summary: csum=14
result-3.txt:scrub device /dev/vdb (id 1) done
result-3.txt:Error summary: csum=29713
result-3.txt:scrub device /dev/vdc (id 2) done
result-3.txt:Error summary: csum=9
The error counts on /dev/vdb differ from run to run after the sha1sum
-c, indicating that file reads nondeterministically correct (or fail to
correct) csum errors on btrfs raid5. Readahead is one possible cause,
but it could be something else.
The error counts on /dev/vdc are interesting: that drive was never
corrupted and has no underlying I/O errors, yet scrub very consistently
reports read=1600744 csum=230813 for it.
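For comparing runs side by side, a small helper along these lines pulls
the csum counts out of the saved logs (the result-*.txt names and the
helper itself are my own convention, not part of the reproducer):

```shell
# Print each saved scrub log's csum error counts on one line, in the
# order they appear, so run-to-run drift is easy to eyeball.
summarize_csums() {
    for f in result-*.txt; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(grep -o 'csum=[0-9]*' "$f" | tr '\n' ' ')"
    done
}
```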
Thread overview: 5+ messages
2019-11-19 4:08 RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11 Zygo Blaxell
2019-11-19 6:58 ` Qu Wenruo
2019-11-19 23:57 ` Zygo Blaxell
2020-02-06 23:40 ` Zygo Blaxell [this message]
2020-04-02 11:10 ` Filipe Manana