* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Qu Wenruo @ 2019-11-19 6:58 UTC
To: Zygo Blaxell, linux-btrfs
On 2019/11/19 12:08 PM, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.
>
> I found 16 corrupted files after one test with 3GB of data (multiple
> copies of /usr on a Debian vm). I dumped out the files with 'btrfs
> restore'. These are the differences between the restored files and the
> original data:
>
> # find -type f -print | while read x; do ls -l "$x"; cmp -l /tmp/restored/"$x" /usr/"${x#*/*/}"; done
> -rw-r--r-- 1 root root 4179 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> 2532 253 147
> -rw-r--r-- 1 root root 18725 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> 2481 20 145
> 6481 270 145
> 8876 3 150
> 13137 232 75
> 16805 103 55
> -rw-r--r-- 1 root root 3421 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> 2064 0 157
> -rw-r--r-- 1 root root 4948 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> 2262 226 145
> -rw-r--r-- 1 root root 11692 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State.pm
> 7115 361 164
> 8333 330 12
> -rw-r--r-- 1 root root 316211 Nov 18 20:47 ./1574119549/share/perl/5.24.1/perl5db.pl
> 263868 35 40
> 268307 143 40
> 272168 370 154
> 275138 25 145
> 280076 125 40
> 282683 310 136
> 286949 132 44
> 293453 176 163
> 296803 40 52
> 300719 307 40
> 305953 77 174
> 310419 124 161
> 312922 47 40
> -rw-r--r-- 1 root root 11113 Nov 18 20:47 ./1574119549/share/perl/5.24.1/B/Debug.pm
> 787 323 102
> 6775 372 141
> -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119549/share/man/man1/getconf.1.gz
> 484 262 41
> -rw-r--r-- 1 root root 3296 Nov 18 20:47 ./1574119549/share/man/man1/genrsa.1ssl.gz
> 2777 247 164
> -rw-r--r-- 1 root root 4815 Nov 18 20:47 ./1574119549/share/man/man1/genpkey.1ssl.gz
> 3128 22 6
> -rw-r--r-- 1 root root 6558 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> 3378 253 146
> 6224 162 42
> -rw-r--r-- 1 root root 75950 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> 68112 2 111
> 73226 344 150
> 75622 12 40
> -rw-r--r-- 1 root root 3873 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
> 1859 247 40
> -rw-r--r-- 1 root root 86458 Nov 18 20:47 ./1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> 66721 200 346
> 72692 211 270
> 74596 336 101
> 79179 257 0
> 85438 104 256
> -rw-r--r-- 1 root root 2528 Nov 18 20:47 ./1574119553/share/man/man1/getent.1.gz
> 1722 243 356
> -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119553/share/man/man1/getconf.1.gz
> 1062 212 267
>
> Note that the reproducer script will corrupt exactly one random byte per
> 4K block to guarantee the corruption is detected by the crc32c algorithm.
> In all cases the corrupted data is one byte per 4K block, as expected.
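(A quick way to confirm the one-byte-per-4K-block pattern from cmp -l output
like the above: the first column is a 1-based byte offset, so dividing by 4096
gives the 4K block each differing byte falls in. A sketch, with placeholder
file paths:)

    # Map each differing byte reported by cmp -l to its 4K block index.
    cmp -l /tmp/restored/some/file /usr/some/file |
        awk '{ printf "byte %8d -> 4K block %d\n", $1, int(($1 - 1) / 4096) }'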
>
> I dumped out the files by reading the blocks directly from the file
> system. Data and parity blocks from btrfs were identical and matched the
> corrupted data from btrfs restore. This is interesting because the repro
> script only corrupts one drive! The only way blocks on both drives end
> up corrupted identically (or at all) is if btrfs copies the bad data
> over the good.
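(For anyone repeating that check: a sketch of how a single block can be dumped
straight off the devices, assuming btrfs-map-logical from btrfs-progs. The
logical address and physical offset below are placeholders, not values to
reuse as-is:)

    # Resolve a logical (bytenr) address to per-device physical offsets,
    # then read the raw 4K block from one device for comparison.
    logical=$(( 0xcd0f0000 ))   # placeholder bytenr; substitute a real extent address
    btrfs-map-logical -l "$logical" /dev/vdb
    # plug the device and physical offset printed above back into dd:
    phys=$(( 0x40010000 ))      # placeholder physical offset from the output above
    dd if=/dev/vdb bs=4096 count=1 skip=$(( phys / 4096 )) 2>/dev/null | xxd | head -4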
>
> There is also some spatial clustering of the unrecoverable blocks.
> Here are the physical block addresses (in hex to make mod-4K and mod-64K
> easier to see):
>
> Extent bytenr start..end Filename
> 0xcc160000..0xcc176000 1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> 0xcd0f0000..0xcd0f2000 1574119549/share/man/man1/genpkey.1ssl.gz
> 0xcd0f3000..0xcd0f4000 1574119549/share/man/man1/genrsa.1ssl.gz
> 0xcd0f4000..0xcd0f5000 1574119549/share/man/man1/getconf.1.gz
> 0xcd0fb000..0xcd0fc000 1574119553/share/man/man1/getconf.1.gz
> 0xcd13f000..0xcd140000 1574119553/share/man/man1/getent.1.gz
> 0xd0d70000..0xd0dbe000 1574119549/share/perl/5.24.1/perl5db.pl
> 0xd0f8f000..0xd0f92000 1574119549/share/perl/5.24.1/App/Prove/State.pm
> 0xd0f92000..0xd0f94000 1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> 0xd0f94000..0xd0f95000 1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> 0xd0fd6000..0xd0fd8000 1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> 0xd0fd8000..0xd0fdd000 1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> 0xd0fdd000..0xd0fe0000 1574119549/share/perl/5.24.1/B/Debug.pm
> 0xd1540000..0xd1553000 1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> 0xd155c000..0xd155e000 1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> 0xd155e000..0xd155f000 1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
>
> Notice that 0xcd0f0000 to 0xcd0fb000 are in the same RAID5 strip (64K),
> as are 0xd0f92000 to 0xd0fdd000, and 0xd155c000 to 0xd155e000. The files
> gnupg2.mo and perl5db.pl also include multiple corrupted blocks within
> a single RAID strip.
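(The strip arithmetic is easy to script if anyone wants to check more
addresses: with a 64K strip, the strip number is just the bytenr divided by
0x10000. For example:)

    # Print the 64K strip number for a few of the corrupted extent starts.
    for bytenr in 0xcd0f0000 0xcd0f3000 0xcd0f4000 0xcd0fb000; do
        printf '%s -> strip 0x%x\n' "$bytenr" $(( bytenr / 0x10000 ))
    done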
>
> All files that had sha1sum failures also had EIO/csum failures, so btrfs
> did detect all the (now uncorrectable) corrupted blocks correctly. Also
> no problems have been seen with btrfs raid1 (metadata or data).
>
> Reproducer (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
> #!/bin/bash
> set -x
>
> # Reset state
> umount /try
> mkdir -p /try
>
> # Create FS and mount. Use raid1 metadata so the filesystem
> # has a fair chance of survival.
> mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
Two-device raid5 is effectively just RAID1, but with all the RAID5 stripe
limitations.
But even a two-disk raid5 shouldn't be that unstable, so it looks like a bug.
I'll try to make the reproducer smaller and take a look at the behavior.
To be honest, I didn't expect a 2-disk raid5 setup, nor the strange corruption.
BTW, I guess you also tried 3-disk raid5; what was the result?
Thanks,
Qu
> btrfs dev scan
> mount -onoatime /dev/vdb /try || exit 1
>
> # Must be on btrfs
> cd /try || exit 1
> btrfs sub list . || exit 1
>
> # Fill disk with files. Increase seq for more test data
> # to increase the chance of finding corruption.
> for x in $(seq 0 3); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Remove half the files. If you increased seq above, increase the
> # '-2' here as well.
> find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> sync &
> rm -fr "$x" &
> sleep 2
> done
> wait
>
> # Fill in some of the holes. This is to get a good mix of
> # partially filled RAID stripes of various sizes.
> for x in $(seq 0 1); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Calculate hash we will use to verify data later
> find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
> # Make sure it's all on the disk
> sync
> sysctl vm.drop_caches=3
>
> # See distribution of data across drives
> btrfs dev usage /try
> btrfs fi usage /try
>
> # Corrupt one byte in each 4K block of the first 4G of /dev/vdb,
> # so that the crc32c algorithm will always detect the corruption.
> # If you need a bigger test disk then increase the '4'.
> # Leave the first 16MB of the disk alone so we don't kill the superblock.
> perl -MFcntl -e '
> for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
> my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> sysread(STDIN, $dat, 1) or die "read: $!";
> sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> }
> ' </dev/vdb >/dev/vdb
>
> # Make sure all that's on disk and our caches are empty
> sync
> sysctl vm.drop_caches=3
>
> # Before and after dev stat and read-only scrub to see what the damage looks like.
> # This will produce some ratelimited kernel output.
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -rBd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify all the files are correctly restored transparently by btrfs.
> # btrfs repairs correctable blocks as a side-effect.
> sha1sum --quiet -c /tmp/sha1sums.txt
>
> # Do a scrub to clean up stray corrupted blocks (including superblocks)
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # This scrub should be clean, but sometimes is not.
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify that the scrub didn't corrupt anything.
> sha1sum --quiet -c /tmp/sha1sums.txt
>
* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Zygo Blaxell @ 2019-11-19 23:57 UTC
To: Qu Wenruo; +Cc: linux-btrfs
On Tue, Nov 19, 2019 at 02:58:00PM +0800, Qu Wenruo wrote:
>
>
> On 2019/11/19 12:08 PM, Zygo Blaxell wrote:
> > Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> > parity stripe, it also copies bad data over good data. This propagates
> > errors between drives and makes a correctable failure uncorrectable.
> > Reproducer script at the end.
> >
> > This doesn't happen very often. The repro script corrupts *every*
> > data block on one of the RAID5 drives, and only a handful of blocks
> > fail to be corrected--about 16 errors per 3GB of data, but sometimes
> > half or double that rate. It behaves more like a race condition than
> > a boundary condition. It can take a few tries to get a failure with a
> > 16GB disk array. It seems to happen more often on 2-disk raid5 than
> > 5-disk raid5, but if you repeat the test enough times even a 5-disk
> > raid5 will eventually fail.
> >
> > Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> > I haven't tried this reproducer on kernels earlier than 4.16 due to
> > other raid5 issues in earlier kernels.
> >
> > I found 16 corrupted files after one test with 3GB of data (multiple
> > copies of /usr on a Debian vm). I dumped out the files with 'btrfs
> > restore'. These are the differences between the restored files and the
> > original data:
> >
> > # find -type f -print | while read x; do ls -l "$x"; cmp -l /tmp/restored/"$x" /usr/"${x#*/*/}"; done
> > -rw-r--r-- 1 root root 4179 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> > 2532 253 147
> > -rw-r--r-- 1 root root 18725 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> > 2481 20 145
> > 6481 270 145
> > 8876 3 150
> > 13137 232 75
> > 16805 103 55
> > -rw-r--r-- 1 root root 3421 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> > 2064 0 157
> > -rw-r--r-- 1 root root 4948 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> > 2262 226 145
> > -rw-r--r-- 1 root root 11692 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State.pm
> > 7115 361 164
> > 8333 330 12
> > -rw-r--r-- 1 root root 316211 Nov 18 20:47 ./1574119549/share/perl/5.24.1/perl5db.pl
> > 263868 35 40
> > 268307 143 40
> > 272168 370 154
> > 275138 25 145
> > 280076 125 40
> > 282683 310 136
> > 286949 132 44
> > 293453 176 163
> > 296803 40 52
> > 300719 307 40
> > 305953 77 174
> > 310419 124 161
> > 312922 47 40
> > -rw-r--r-- 1 root root 11113 Nov 18 20:47 ./1574119549/share/perl/5.24.1/B/Debug.pm
> > 787 323 102
> > 6775 372 141
> > -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119549/share/man/man1/getconf.1.gz
> > 484 262 41
> > -rw-r--r-- 1 root root 3296 Nov 18 20:47 ./1574119549/share/man/man1/genrsa.1ssl.gz
> > 2777 247 164
> > -rw-r--r-- 1 root root 4815 Nov 18 20:47 ./1574119549/share/man/man1/genpkey.1ssl.gz
> > 3128 22 6
> > -rw-r--r-- 1 root root 6558 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> > 3378 253 146
> > 6224 162 42
> > -rw-r--r-- 1 root root 75950 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> > 68112 2 111
> > 73226 344 150
> > 75622 12 40
> > -rw-r--r-- 1 root root 3873 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
> > 1859 247 40
> > -rw-r--r-- 1 root root 86458 Nov 18 20:47 ./1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> > 66721 200 346
> > 72692 211 270
> > 74596 336 101
> > 79179 257 0
> > 85438 104 256
> > -rw-r--r-- 1 root root 2528 Nov 18 20:47 ./1574119553/share/man/man1/getent.1.gz
> > 1722 243 356
> > -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119553/share/man/man1/getconf.1.gz
> > 1062 212 267
> >
> > Note that the reproducer script will corrupt exactly one random byte per
> > 4K block to guarantee the corruption is detected by the crc32c algorithm.
> > In all cases the corrupted data is one byte per 4K block, as expected.
> >
> > I dumped out the files by reading the blocks directly from the file
> > system. Data and parity blocks from btrfs were identical and matched the
> > corrupted data from btrfs restore. This is interesting because the repro
> > script only corrupts one drive! The only way blocks on both drives end
> > up corrupted identically (or at all) is if btrfs copies the bad data
> > over the good.
> >
> > There is also some spatial clustering of the unrecoverable blocks.
> > Here are the physical block addresses (in hex to make mod-4K and mod-64K
> > easier to see):
> >
> > Extent bytenr start..end Filename
> > 0xcc160000..0xcc176000 1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> > 0xcd0f0000..0xcd0f2000 1574119549/share/man/man1/genpkey.1ssl.gz
> > 0xcd0f3000..0xcd0f4000 1574119549/share/man/man1/genrsa.1ssl.gz
> > 0xcd0f4000..0xcd0f5000 1574119549/share/man/man1/getconf.1.gz
> > 0xcd0fb000..0xcd0fc000 1574119553/share/man/man1/getconf.1.gz
> > 0xcd13f000..0xcd140000 1574119553/share/man/man1/getent.1.gz
> > 0xd0d70000..0xd0dbe000 1574119549/share/perl/5.24.1/perl5db.pl
> > 0xd0f8f000..0xd0f92000 1574119549/share/perl/5.24.1/App/Prove/State.pm
> > 0xd0f92000..0xd0f94000 1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> > 0xd0f94000..0xd0f95000 1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> > 0xd0fd6000..0xd0fd8000 1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> > 0xd0fd8000..0xd0fdd000 1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> > 0xd0fdd000..0xd0fe0000 1574119549/share/perl/5.24.1/B/Debug.pm
> > 0xd1540000..0xd1553000 1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> > 0xd155c000..0xd155e000 1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> > 0xd155e000..0xd155f000 1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
> >
> > Notice that 0xcd0f0000 to 0xcd0fb000 are in the same RAID5 strip (64K),
> > as are 0xd0f92000 to 0xd0fdd000, and 0xd155c000 to 0xd155e000. The files
> > gnupg2.mo and perl5db.pl also include multiple corrupted blocks within
> > a single RAID strip.
> >
> > All files that had sha1sum failures also had EIO/csum failures, so btrfs
> > did detect all the (now uncorrectable) corrupted blocks correctly. Also
> > no problems have been seen with btrfs raid1 (metadata or data).
> >
> > Reproducer (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
> >
> > #!/bin/bash
> > set -x
> >
> > # Reset state
> > umount /try
> > mkdir -p /try
> >
> > # Create FS and mount. Use raid1 metadata so the filesystem
> > # has a fair chance of survival.
> > mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
>
> Two-device raid5 is effectively just RAID1, but with all the RAID5 stripe
> limitations.
> But even a two-disk raid5 shouldn't be that unstable, so it looks like a bug.
>
> I'll try to make the reproducer smaller and take a look at the behavior.
> To be honest, I didn't expect a 2-disk raid5 setup, nor the strange corruption.
2-disk raid5 happens naturally with unequally sized devices in a RAID5
array, in a metadata-raid1 data-raid5 array, and in a degraded 3-disk
raid5 array that allocates a new block group with a missing drive.
Arguably those 2-disk raid5 block groups should switch automatically to
raid1 at allocation time, but that's a separate topic filled with gotchas.
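(If it helps to confirm which chunks ended up with only two raid5 stripes, the
stripe count is visible in the chunk tree; a rough sketch, assuming the
dump-tree output format of recent btrfs-progs:)

    # List chunk types and stripe counts; 2-stripe RAID5 chunks are the
    # degenerate case described above.
    btrfs inspect-internal dump-tree -t chunk /dev/vdb |
        grep -E 'CHUNK_ITEM|type|num_stripes'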
> BTW, I guess you also tried 3-disk raid5; what was the result?
With 3 and 5 disks it takes longer (more data, more attempts) for this
behavior to appear. I had about a 20% repro rate with 11GB of data and
5 drives. The 2-disk case is much faster, about 60% with 3GB of data.
I didn't analyze the 3- and 5-disk cases in as much detail (didn't want to
do the parity math ;). The final scrub result has uncorrectable errors
and files are unreadable at the end. Hopefully it's the same bug.
I only tried 2, 3, and 5 disk arrays, so I don't know if it's different
with 4 or 6 or any other number.
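(In case anyone wants to reproduce the rates, a trivial harness along these
lines works; repro.sh is a placeholder name for this reproducer saved to a
file, and the failure signature being grepped for is sha1sum's FAILED lines:)

    # Run the reproducer repeatedly and count runs where the final
    # sha1sum -c reports mismatches.
    fails=0; runs=10
    for i in $(seq 1 "$runs"); do
        ./repro.sh > "repro-log.$i" 2>&1
        grep -q 'FAILED' "repro-log.$i" && fails=$(( fails + 1 ))
    done
    echo "$fails/$runs runs hit the bug"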
> Thanks,
> Qu
>
> > btrfs dev scan
> > mount -onoatime /dev/vdb /try || exit 1
> >
> > # Must be on btrfs
> > cd /try || exit 1
> > btrfs sub list . || exit 1
> >
> > # Fill disk with files. Increase seq for more test data
> > # to increase the chance of finding corruption.
> > for x in $(seq 0 3); do
> > sync &
> > rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> > sleep 2
> > done
> > wait
> >
> > # Remove half the files. If you increased seq above, increase the
> > # '-2' here as well.
> > find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> > sync &
> > rm -fr "$x" &
> > sleep 2
> > done
> > wait
> >
> > # Fill in some of the holes. This is to get a good mix of
> > # partially filled RAID stripes of various sizes.
> > for x in $(seq 0 1); do
> > sync &
> > rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> > sleep 2
> > done
> > wait
> >
> > # Calculate hash we will use to verify data later
> > find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
> >
> > # Make sure it's all on the disk
> > sync
> > sysctl vm.drop_caches=3
> >
> > # See distribution of data across drives
> > btrfs dev usage /try
> > btrfs fi usage /try
> >
> > # Corrupt one byte in each 4K block of the first 4G of /dev/vdb,
> > # so that the crc32c algorithm will always detect the corruption.
> > # If you need a bigger test disk then increase the '4'.
> > # Leave the first 16MB of the disk alone so we don't kill the superblock.
> > perl -MFcntl -e '
> > for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
> > my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> > sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> > sysread(STDIN, $dat, 1) or die "read: $!";
> > sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> > syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> > }
> > ' </dev/vdb >/dev/vdb
> >
> > # Make sure all that's on disk and our caches are empty
> > sync
> > sysctl vm.drop_caches=3
> >
> > # Before and after dev stat and read-only scrub to see what the damage looks like.
> > # This will produce some ratelimited kernel output.
> > btrfs dev stat /try | grep -v ' 0$'
> > btrfs scrub start -rBd /try
> > btrfs dev stat /try | grep -v ' 0$'
> >
> > # Verify all the files are correctly restored transparently by btrfs.
> > # btrfs repairs correctable blocks as a side-effect.
> > sha1sum --quiet -c /tmp/sha1sums.txt
> >
> > # Do a scrub to clean up stray corrupted blocks (including superblocks)
> > btrfs dev stat /try | grep -v ' 0$'
> > btrfs scrub start -Bd /try
> > btrfs dev stat /try | grep -v ' 0$'
> >
> > # This scrub should be clean, but sometimes is not.
> > btrfs scrub start -Bd /try
> > btrfs dev stat /try | grep -v ' 0$'
> >
> > # Verify that the scrub didn't corrupt anything.
> > sha1sum --quiet -c /tmp/sha1sums.txt
> >
>
* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11, still works on 5.5.1
From: Zygo Blaxell @ 2020-02-06 23:40 UTC
To: linux-btrfs
On Mon, Nov 18, 2019 at 11:08:27PM -0500, Zygo Blaxell wrote:
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.
Still reproducible on 5.5.1. Some more details below:
> [...snip...]
> Reproducer part 1 (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
> #!/bin/bash
> set -x
>
> # Reset state
> umount /try
> mkdir -p /try
>
> # Create FS and mount. Use raid1 metadata so the filesystem
> # has a fair chance of survival.
> mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
> btrfs dev scan
> mount -onoatime /dev/vdb /try || exit 1
>
> # Must be on btrfs
> cd /try || exit 1
> btrfs sub list . || exit 1
>
> # Fill disk with files. Increase seq for more test data
> # to increase the chance of finding corruption.
> for x in $(seq 0 3); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Remove half the files. If you increased seq above, increase the
> # '-2' here as well.
> find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> sync &
> rm -fr "$x" &
> sleep 2
> done
> wait
>
> # Fill in some of the holes. This is to get a good mix of
> # partially filled RAID stripes of various sizes.
> for x in $(seq 0 1); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Calculate hash we will use to verify data later
> find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
> # Make sure it's all on the disk
> sync
> sysctl vm.drop_caches=3
>
> # See distribution of data across drives
> btrfs dev usage /try
> btrfs fi usage /try
>
> # Corrupt one byte in each 4K block of the first 4G of /dev/vdb,
> # so that the crc32c algorithm will always detect the corruption.
> # If you need a bigger test disk then increase the '4'.
> # Leave the first 16MB of the disk alone so we don't kill the superblock.
> perl -MFcntl -e '
> for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
> my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> sysread(STDIN, $dat, 1) or die "read: $!";
> sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> }
> ' </dev/vdb >/dev/vdb
>
> # Make sure all that's on disk and our caches are empty
> sync
> sysctl vm.drop_caches=3
I split the test into two parts: everything up to the above line (let's
call it part 1), and everything below this line (part 2). Part 1 creates
a RAID5 array with corruption on one disk. Part 2 tries to read all the
original data and correct the corrupted disk with sha1sum and btrfs scrub.
I saved a copy of the VM's disk images after part 1, so I could repeatedly
reset and run part 2 on identical filesystem images.
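One simple way to do that reset, assuming raw image files back /dev/vdb and
/dev/vdc and the VM is shut down between runs (filenames are placeholders):

    # Snapshot the images once after part 1...
    cp vdb.img vdb.img.part1
    cp vdc.img vdc.img.part1
    # ...and restore them before every part-2 run.
    cp vdb.img.part1 vdb.img
    cp vdc.img.part1 vdc.img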
I also split up
    btrfs scrub start -rBd /try
and replaced it with
    btrfs scrub start -rBd /dev/vdb
    btrfs scrub start -rBd /dev/vdc
(and same for the non-read-only scrubs) so the scrub runs sequentially
on each disk, instead of running on both in parallel.
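Roughly, the modified scrub-and-verify sequence looks like this (a sketch; the
result-file capture is just one way to collect the summaries quoted below):

    # Sequential per-device scrubs, with output captured for comparison
    # across runs.
    out="result-$(date +%s).txt"
    {
        btrfs dev stat /try | grep -v ' 0$'
        btrfs scrub start -rBd /dev/vdb    # read-only scrub, one device at a time
        btrfs scrub start -rBd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'
        sha1sum --quiet -c /tmp/sha1sums.txt
        btrfs scrub start -Bd /dev/vdb     # repairing scrub, again sequential
        btrfs scrub start -Bd /dev/vdc
        btrfs dev stat /try | grep -v ' 0$'
    } > "$out" 2>&1
    grep -E 'scrub device|Error summary|corruption_errs' "$out"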
Original part 2:
> # Before and after dev stat and read-only scrub to see what the damage looks like.
> # This will produce some ratelimited kernel output.
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -rBd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify all the files are correctly restored transparently by btrfs.
> # btrfs repairs correctable blocks as a side-effect.
> sha1sum --quiet -c /tmp/sha1sums.txt
>
> # Do a scrub to clean up stray corrupted blocks (including superblocks)
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # This scrub should be clean, but sometimes is not.
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify that the scrub didn't corrupt anything.
> sha1sum --quiet -c /tmp/sha1sums.txt
Multiple runs of part 2 produce different results in scrub:
result-1581019560.txt:scrub device /dev/vdb (id 1) done
result-1581019560.txt:Error summary: super=1 csum=273977
result-1581019560.txt:scrub device /dev/vdc (id 2) done
result-1581019560.txt:Error summary: read=1600744 csum=230813
result-1581019560.txt:[/dev/vdb].corruption_errs 504791
result-1581029949.txt:scrub device /dev/vdb (id 1) done
result-1581029949.txt:Error summary: super=1 csum=273799
result-1581029949.txt:scrub device /dev/vdc (id 2) done
result-1581029949.txt:Error summary: read=1600744 csum=230813
result-1581029949.txt:[/dev/vdb].corruption_errs 504613
With scrub on the filesystem it is no better:
result-1.txt:scrub device /dev/vdb (id 1) done
result-1.txt:Error summary: super=1 csum=272757
result-1.txt:scrub device /dev/vdc (id 2) done
result-1.txt:Error summary: read=1600744 csum=230813
result-1.txt:[/dev/vdb].corruption_errs 503571
result-2.txt:scrub device /dev/vdb (id 1) done
result-2.txt:Error summary: super=1 csum=273430
result-2.txt:scrub device /dev/vdc (id 2) done
result-2.txt:Error summary: read=1600744 csum=230813
result-2.txt:[/dev/vdb].corruption_errs 504244
result-3.txt:scrub device /dev/vdb (id 1) done
result-3.txt:Error summary: super=1 csum=273456
result-3.txt:scrub device /dev/vdc (id 2) done
result-3.txt:Error summary: read=1600744 csum=230813
result-3.txt:[/dev/vdb].corruption_errs 504270
The scrub summaries after the sha1sum -c are different too, although in
this case the errors were all corrected (sha1sum -c is clean):
result-1.txt:scrub device /dev/vdb (id 1) done
result-1.txt:Error summary: csum=29911
result-1.txt:scrub device /dev/vdc (id 2) done
result-1.txt:Error summary: csum=11
result-2.txt:scrub device /dev/vdb (id 1) done
result-2.txt:Error summary: csum=29919
result-2.txt:scrub device /dev/vdc (id 2) done
result-2.txt:Error summary: csum=14
result-3.txt:scrub device /dev/vdb (id 1) done
result-3.txt:Error summary: csum=29713
result-3.txt:scrub device /dev/vdc (id 2) done
result-3.txt:Error summary: csum=9
The error counts on /dev/vdb are different from run to run after the
sha1sum -c, indicating that file reads are nondeterministically correcting
(or failing to correct) csum errors on btrfs raid5. This could be due to
readahead or maybe something else.
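(A cheap first test of the readahead theory, assuming the virtio disks expose
the usual block-layer knob: zero read_ahead_kb before the sha1sum pass and see
whether the /dev/vdb counts stop varying. This only covers block-layer and bdi
readahead, so it wouldn't be conclusive on its own.)

    # Disable readahead on the test disks, and on the filesystem's own bdi
    # if one shows up under /sys/class/bdi (name is version-dependent).
    for d in vdb vdc; do
        echo 0 > /sys/block/$d/queue/read_ahead_kb
    done
    for f in /sys/class/bdi/btrfs-*/read_ahead_kb; do
        [ -e "$f" ] && echo 0 > "$f"
    done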
The error counts on /dev/vdc are interesting, as that drive is not
corrupted, nor does it have read errors, but it is very consistently
reporting read=1600744 csum=230813 in scrub output.
* Re: RAID5 fails to correct correctable errors, makes them uncorrectable instead (sometimes). With reproducer for kernel 5.3.11
From: Filipe Manana @ 2020-04-02 11:10 UTC
To: Zygo Blaxell; +Cc: linux-btrfs
On Tue, Nov 19, 2019 at 4:09 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> Sometimes, btrfs raid5 not only fails to recover corrupt data with a
> parity stripe, it also copies bad data over good data. This propagates
> errors between drives and makes a correctable failure uncorrectable.
> Reproducer script at the end.
>
> This doesn't happen very often. The repro script corrupts *every*
> data block on one of the RAID5 drives, and only a handful of blocks
> fail to be corrected--about 16 errors per 3GB of data, but sometimes
> half or double that rate. It behaves more like a race condition than
> a boundary condition. It can take a few tries to get a failure with a
> 16GB disk array. It seems to happen more often on 2-disk raid5 than
> 5-disk raid5, but if you repeat the test enough times even a 5-disk
> raid5 will eventually fail.
>
> Kernels 4.16..5.3 all seem to behave similarly, so this is not a new bug.
> I haven't tried this reproducer on kernels earlier than 4.16 due to
> other raid5 issues in earlier kernels.
>
> I found 16 corrupted files after one test with 3GB of data (multiple
> copies of /usr on a Debian vm). I dumped out the files with 'btrfs
> restore'. These are the differences between the restored files and the
> original data:
>
> # find -type f -print | while read x; do ls -l "$x"; cmp -l /tmp/restored/"$x" /usr/"${x#*/*/}"; done
> -rw-r--r-- 1 root root 4179 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> 2532 253 147
> -rw-r--r-- 1 root root 18725 Nov 18 20:47 ./1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> 2481 20 145
> 6481 270 145
> 8876 3 150
> 13137 232 75
> 16805 103 55
> -rw-r--r-- 1 root root 3421 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> 2064 0 157
> -rw-r--r-- 1 root root 4948 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> 2262 226 145
> -rw-r--r-- 1 root root 11692 Nov 18 20:47 ./1574119549/share/perl/5.24.1/App/Prove/State.pm
> 7115 361 164
> 8333 330 12
> -rw-r--r-- 1 root root 316211 Nov 18 20:47 ./1574119549/share/perl/5.24.1/perl5db.pl
> 263868 35 40
> 268307 143 40
> 272168 370 154
> 275138 25 145
> 280076 125 40
> 282683 310 136
> 286949 132 44
> 293453 176 163
> 296803 40 52
> 300719 307 40
> 305953 77 174
> 310419 124 161
> 312922 47 40
> -rw-r--r-- 1 root root 11113 Nov 18 20:47 ./1574119549/share/perl/5.24.1/B/Debug.pm
> 787 323 102
> 6775 372 141
> -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119549/share/man/man1/getconf.1.gz
> 484 262 41
> -rw-r--r-- 1 root root 3296 Nov 18 20:47 ./1574119549/share/man/man1/genrsa.1ssl.gz
> 2777 247 164
> -rw-r--r-- 1 root root 4815 Nov 18 20:47 ./1574119549/share/man/man1/genpkey.1ssl.gz
> 3128 22 6
> -rw-r--r-- 1 root root 6558 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> 3378 253 146
> 6224 162 42
> -rw-r--r-- 1 root root 75950 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> 68112 2 111
> 73226 344 150
> 75622 12 40
> -rw-r--r-- 1 root root 3873 Nov 18 20:47 ./1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
> 1859 247 40
> -rw-r--r-- 1 root root 86458 Nov 18 20:47 ./1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> 66721 200 346
> 72692 211 270
> 74596 336 101
> 79179 257 0
> 85438 104 256
> -rw-r--r-- 1 root root 2528 Nov 18 20:47 ./1574119553/share/man/man1/getent.1.gz
> 1722 243 356
> -rw-r--r-- 1 root root 2346 Nov 18 20:47 ./1574119553/share/man/man1/getconf.1.gz
> 1062 212 267
>
> Note that the reproducer script will corrupt exactly one random byte per
> 4K block to guarantee the corruption is detected by the crc32c algorithm.
> In all cases the corrupted data is one byte per 4K block, as expected.
>
> I dumped out the files by reading the blocks directly from the file
> system. Data and parity blocks from btrfs were identical and matched the
> corrupted data from btrfs restore. This is interesting because the repro
> script only corrupts one drive! The only way blocks on both drives end
> up corrupted identically (or at all) is if btrfs copies the bad data
> over the good.
>
> There is also some spatial clustering of the unrecoverable blocks.
> Here are the physical block addresses (in hex to make mod-4K and mod-64K
> easier to see):
>
> Extent bytenr start..end Filename
> 0xcc160000..0xcc176000 1574119553/share/locale/zh_CN/LC_MESSAGES/gnupg2.mo
> 0xcd0f0000..0xcd0f2000 1574119549/share/man/man1/genpkey.1ssl.gz
> 0xcd0f3000..0xcd0f4000 1574119549/share/man/man1/genrsa.1ssl.gz
> 0xcd0f4000..0xcd0f5000 1574119549/share/man/man1/getconf.1.gz
> 0xcd0fb000..0xcd0fc000 1574119553/share/man/man1/getconf.1.gz
> 0xcd13f000..0xcd140000 1574119553/share/man/man1/getent.1.gz
> 0xd0d70000..0xd0dbe000 1574119549/share/perl/5.24.1/perl5db.pl
> 0xd0f8f000..0xd0f92000 1574119549/share/perl/5.24.1/App/Prove/State.pm
> 0xd0f92000..0xd0f94000 1574119549/share/perl/5.24.1/App/Prove/State/Result.pm
> 0xd0f94000..0xd0f95000 1574119549/share/perl/5.24.1/App/Prove/State/Result/Test.pm
> 0xd0fd6000..0xd0fd8000 1574119549/share/perl/5.24.1/Archive/Tar/Constant.pm
> 0xd0fd8000..0xd0fdd000 1574119549/share/perl/5.24.1/Archive/Tar/File.pm
> 0xd0fdd000..0xd0fe0000 1574119549/share/perl/5.24.1/B/Debug.pm
> 0xd1540000..0xd1553000 1574119553/share/perl/5.24.1/ExtUtils/MM_Any.pm
> 0xd155c000..0xd155e000 1574119553/share/perl/5.24.1/ExtUtils/MM_NW5.pm
> 0xd155e000..0xd155f000 1574119553/share/perl/5.24.1/ExtUtils/MM_OS2.pm
>
> Notice that 0xcd0f0000 to 0xcd0fb000 are in the same RAID5 strip (64K),
> as are 0xd0f92000 to 0xd0fdd000, and 0xd155c000 to 0xd155e000. The files
> gnupg2.mo and perl5db.pl also include multiple corrupted blocks within
> a single RAID strip.
>
> All files that had sha1sum failures also had EIO/csum failures, so btrfs
> did detect all the (now uncorrectable) corrupted blocks correctly. Also
> no problems have been seen with btrfs raid1 (metadata or data).
Indeed, there's a serious problem with raid5 and raid6.
I've just started a thread (I cc'ed you) describing the problem, to see
if anyone else was aware of it and already has any thoughts on how to
fix it.
See: https://lore.kernel.org/linux-btrfs/CAL3q7H4oa70DUhOFE7kot62KjxcbvvZKxu62VfLpAcmgsinBFw@mail.gmail.com/
Thanks.
>
> Reproducer (runs in a qemu with test disks on /dev/vdb and /dev/vdc):
>
> #!/bin/bash
> set -x
>
> # Reset state
> umount /try
> mkdir -p /try
>
> # Create FS and mount. Use raid1 metadata so the filesystem
> # has a fair chance of survival.
> mkfs.btrfs -draid5 -mraid1 -f /dev/vd[bc] || exit 1
> btrfs dev scan
> mount -onoatime /dev/vdb /try || exit 1
>
> # Must be on btrfs
> cd /try || exit 1
> btrfs sub list . || exit 1
>
> # Fill disk with files. Increase seq for more test data
> # to increase the chance of finding corruption.
> for x in $(seq 0 3); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Remove half the files. If you increased seq above, increase the
> # '-2' here as well.
> find /try/* -maxdepth 0 -type d -print | unsort | head -2 | while read x; do
> sync &
> rm -fr "$x" &
> sleep 2
> done
> wait
>
> # Fill in some of the holes. This is to get a good mix of
> # partially filled RAID stripes of various sizes.
> for x in $(seq 0 1); do
> sync &
> rsync -axHSWI "/usr/." "/try/$(date +%s)" &
> sleep 2
> done
> wait
>
> # Calculate hash we will use to verify data later
> find -type f -exec sha1sum {} + > /tmp/sha1sums.txt
>
> # Make sure it's all on the disk
> sync
> sysctl vm.drop_caches=3
>
> # See distribution of data across drives
> btrfs dev usage /try
> btrfs fi usage /try
>
> # Corrupt one byte in each 4K block of the first 4G of /dev/vdb,
> # so that the crc32c algorithm will always detect the corruption.
> # If you need a bigger test disk then increase the '4'.
> # Leave the first 16MB of the disk alone so we don't kill the superblock.
> perl -MFcntl -e '
> for my $x (0..(4 * 1024 * 1024 * 1024 / 4096)) {
> my $pos = int(rand(4096)) + 16777216 + ($x * 4096);
> sysseek(STDIN, $pos, SEEK_SET) or die "seek: $!";
> sysread(STDIN, $dat, 1) or die "read: $!";
> sysseek(STDOUT, $pos, SEEK_SET) or die "seek: $!";
> syswrite(STDOUT, chr(ord($dat) ^ int(rand(255) + 1)), 1) or die "write: $!";
> }
> ' </dev/vdb >/dev/vdb
>
> # Make sure all that's on disk and our caches are empty
> sync
> sysctl vm.drop_caches=3
>
> # Before and after dev stat and read-only scrub to see what the damage looks like.
> # This will produce some ratelimited kernel output.
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -rBd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify all the files are correctly restored transparently by btrfs.
> # btrfs repairs correctable blocks as a side-effect.
> sha1sum --quiet -c /tmp/sha1sums.txt
>
> # Do a scrub to clean up stray corrupted blocks (including superblocks)
> btrfs dev stat /try | grep -v ' 0$'
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # This scrub should be clean, but sometimes is not.
> btrfs scrub start -Bd /try
> btrfs dev stat /try | grep -v ' 0$'
>
> # Verify that the scrub didn't corrupt anything.
> sha1sum --quiet -c /tmp/sha1sums.txt
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”