Subject: Spurious read errors in btrfs raid5 degraded mode
From: Zygo Blaxell
Date: 2020-06-27  2:42 UTC
To: linux-btrfs

I removed a failed disk from a raid5 filesystem (-draid5 -mraid1) and
mounted the filesystem with -o degraded.  I observed a large number
of csum failures and IO errors while reading files, particularly on
recently modified files but also on some files much older than the
disk failure.  These errors occurred under two conditions:

	1.  "cp testfile test2; sync; sysctl vm.drop_caches=3; sha1sum
	test2" The sha1sum would return EIO about half the time, and
	report csum failures in the kernel log.  The write in the cp
	command completed without error, and the file was readable while
	it was cached in RAM, but could not be re-read from disk (i.e.
	some files written after the degraded mount were unreadable).

	2.  Some files that existed before the disk failure would
	return EIO on reads, with csum failures in the kernel log
	(i.e. files written before the degraded mount were
	unreadable).  The set of damaged files grew over time (i.e.
	a file that was readable at one time became unreadable at a
	later time), giving the false impression that new writes were
	corrupting old data.
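
A minimal sketch of the case 1 reproducer as a shell script.  The
mount point and file names are placeholders, and the filesystem is
assumed to already be mounted with -o degraded:

	#!/bin/sh
	# Case 1 repro sketch.  /mnt/btrfs and testfile are
	# placeholders; testfile is any existing file on the
	# degraded-mounted filesystem.
	cd /mnt/btrfs
	cp testfile test2
	sync
	sysctl vm.drop_caches=3
	# About half the time this read fails with EIO and the
	# kernel log reports csum failures, even though the write
	# above completed without error:
	sha1sum test2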

In both cases the EIO errors were persistent: once they appeared,
they did not go away or change location across multiple read attempts
spaced hours (and gigabytes of IO activity) apart, and a reboot did
not affect the locations of the IO errors in files.

A bit of probing showed that the files in case 2 were close enough
on disk to files in case 1 to share raid stripes.  99.2% of the block
groups on this filesystem were completely full, and no affected files
were observed in those completely full block groups.
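
One way such probing might be done (a sketch, not necessarily the
exact method used): filefrag -v reports each extent's starting
offset in btrfs's logical address space, in filesystem blocks
(typically 4K), so files whose extents fall into the same full
stripe are candidates for sharing a raid stripe.  The 64K strip size
and 4-data-disk geometry below are assumptions, not measurements
from this filesystem:

	# Full stripe = 64K strip * 4 data disks = 256K = 64 4K blocks.
	STRIPE_BLOCKS=64
	for f in case1-file case2-file; do
		echo "== $f"
		filefrag -v "$f" |
		awk -v s=$STRIPE_BLOCKS \
			'/^ *[0-9]+:/ { print int($4 / s) }' |
		sort -un
	done
	# A stripe number that appears in both lists suggests the
	# two files share a raid stripe.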

Hypothesis: there is a bug in the degraded read code that is
triggered by partially filled raid stripes.  Such a bug shouldn't
appear in a block group that is completely full, since a completely
full block group cannot have partially empty raid stripes.
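
A sketch of one way the block group fill statistic could be
collected, assuming the BLOCK_GROUP_ITEM output format printed by
btrfs inspect-internal dump-tree (the exact layout varies between
btrfs-progs versions, so treat the field positions as assumptions):

	# Compare each block group's used bytes against its size and
	# flag the ones that are not completely full, i.e. the block
	# groups where the hypothesis predicts the bug can appear.
	# Expects item lines like:
	#   ... key (<bytenr> BLOCK_GROUP_ITEM <length>) ...
	# followed by:
	#   block group used <used> chunk_objectid ...
	btrfs inspect-internal dump-tree -t extent /dev/sdX |
	awk '/BLOCK_GROUP_ITEM/ { gsub(/[()]/, ""); len = $6 }
	     /block group used/ { if ($4 + 0 < len + 0)
	                              print "partial:", $4 "/" len }'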

The apparent corruption is _not_ permanent!  The data is not damaged on
disk, and btrfs replace can recover it.  Most IO errors and csum failures
disappeared at the end of the btrfs replace operation.
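
For reference, the recovery commands (device names and the devid are
placeholders for this example; btrfs replace start accepts the
numeric devid of a missing device in place of a source device path):

	# Find the devid of the missing disk:
	btrfs filesystem show /mnt/btrfs
	# Rebuild its contents onto the new disk:
	btrfs replace start 3 /dev/sdX /mnt/btrfs
	# Watch progress:
	btrfs replace status /mnt/btrfs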

At the end of the btrfs replace operation for the failed disk (which
will be the subject of some different bug reports), only 84K of data
in the 16TB filesystem was permanently lost--all of it on the disks
that had not been replaced.
