* spurious I/O errors from btrfs...at the caching layer?
From: Zygo Blaxell @ 2015-01-24 18:06 UTC
  To: linux-btrfs


I am seeing a lot of spurious I/O errors that look like they come from
the cache-facing side of btrfs.  While running a heavy load with some
extent-sharing (e.g. building 20 Linux kernels at once from source trees
copied with 'cp -a --reflink=always'), some files will return spurious
EIO on read.  It happens often enough to make a Linux kernel build fail
about 1/3 of the time.

I believe the I/O errors to be spurious because:

	- there is no kernel message of any kind during the event

	- scrub detects 0 errors

	- device stats report 0 errors

	- the drive firmware reports nothing wrong through SMART

	- there seems to be no attempt to read the disk when the error
	is reported

	- "sysctl vm.drop_caches={1,2}" makes the I/O error go away.

Files become unreadable at random, and stay unreadable indefinitely;
however, any time I discover a file that gives EIO on read, I can
poke vm.drop_caches and make the EIO go away.  The file can then be
read normally and has correct contents.  The disk does not seem to be
involved in the I/O error return.
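
(For reference, the cache-drop workaround above is just the usual procfs
knob, run as root; this is the standard VM interface, nothing
btrfs-specific:)

	# flush dirty data first, then drop clean caches
	sync
	echo 1 > /proc/sys/vm/drop_caches	# page cache only
	echo 2 > /proc/sys/vm/drop_caches	# dentries and inodes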

This seems to happen more often when snapshots are being deleted;
however, it occurs on systems with no snapshots as well (though
in these cases the system had snapshots in the past).

When a file returns EIO on read, other snapshots of the same file also
return EIO on read.  I have not been able to test whether this affects
reflink copies (clones) as well.
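
(If anyone wants to test that, something like the below should do it;
"badfile" is a placeholder for any file currently returning EIO:)

	cp --reflink=always badfile badfile.clone
	cat badfile.clone > /dev/null	# does the fresh clone EIO too?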

Observed from 3.17 through 3.18.3.  All affected filesystems use
skinny-metadata; filesystems without skinny-metadata do not seem to have
this problem.
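
(For anyone checking their own filesystems: skinny-metadata is bit 0x100
in the superblock incompat_flags.  On older btrfs-progs the command is
btrfs-show-super instead of the below; /dev/sdX is a placeholder:)

	btrfs inspect-internal dump-super /dev/sdX | grep incompat_flags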


* Re: spurious I/O errors from btrfs...at the caching layer?
From: Zygo Blaxell @ 2015-01-25 16:50 UTC
  To: linux-btrfs


On Sat, Jan 24, 2015 at 01:06:01PM -0500, Zygo Blaxell wrote:
> I am seeing a lot of spurious I/O errors that look like they come from
> the cache-facing side of btrfs.  While running a heavy load with some
> extent-sharing (e.g. building 20 Linux kernels at once from source trees
> copied with 'cp -a --reflink=always'), some files will return spurious
> EIO on read.  It happens often enough to make a Linux kernel build fail
> about 1/3 of the time.
[...]
> Observed from 3.17 through 3.18.3.  All affected filesystems use
> skinny-metadata; filesystems without skinny-metadata do not seem to have
> this problem.

I ran a test overnight using 3.18.3 on a freshly formatted filesystem with
no skinny-metadata.
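
(Something like the below, that is; device names are placeholders, and
on progs versions where skinny-metadata is the default it has to be
disabled explicitly with the ^ prefix:)

	mkfs.btrfs -O ^skinny-metadata -m raid1 -d raid1 /dev/sdX /dev/sdY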

The test consisted of creating reflink copies of a Linux kernel source
tree and running kernel builds in each copy simultaneously, like this:

	# assume you have a ready-to-build kernel tree in 'linux'
	for x in $(seq 1 5); do
		cp -a --reflink linux linux-$x
	done

	# build all the kernels at once
	for x in $(seq 1 5); do
		(cd linux-$x && make -j10 2>&1 | tee make.log) &
	done

	wait
	# then tail all the make.logs and see how many failed due to
	# I/O errors
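	#
	# (one way to do that check -- the exact message may vary, but
	# EIO usually renders as "Input/output error"):
	grep -l 'Input/output error' linux-*/make.log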

Spurious I/O errors occurred with as few as two concurrent kernel builds.

The test machine has 16GB of RAM and the filesystem is also 16GB,
RAID1 on two spinning disks.



* Resolved...ish.  was: Re: spurious I/O errors from btrfs...at the caching layer?
From: Zygo Blaxell @ 2015-01-26  4:22 UTC
  To: linux-btrfs


It seems that the rate of spurious I/O errors varies most according to
the vm.vfs_cache_pressure sysctl.  At '10' the I/O errors occur so often
that building a kernel is impossible.  At '100' I can't reproduce even
a single I/O error.

I guess this is my own fault for using non-default sysctl parameters,
although I wouldn't expect any value of this sysctl to cause these
symptoms... :-P
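
(For the record, checking the knob and putting it back to the default
is just:)

	sysctl vm.vfs_cache_pressure		# show the current value
	sysctl -w vm.vfs_cache_pressure=100	# restore the default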


On Sun, Jan 25, 2015 at 11:50:36AM -0500, Zygo Blaxell wrote:
> On Sat, Jan 24, 2015 at 01:06:01PM -0500, Zygo Blaxell wrote:
> > I am seeing a lot of spurious I/O errors that look like they come from
> > the cache-facing side of btrfs.  While running a heavy load with some
> > extent-sharing (e.g. building 20 Linux kernels at once from source trees
> > copied with 'cp -a --reflink=always'), some files will return spurious
> > EIO on read.  It happens often enough to make a Linux kernel build fail
> > about 1/3 of the time.
> [...]
> > Observed from 3.17 through 3.18.3.  All affected filesystems use
> > skinny-metadata; filesystems without skinny-metadata do not seem to have
> > this problem.
> 
> I ran a test overnight using 3.18.3 on a freshly formatted filesystem with
> no skinny-metadata.
> 
> The test consisted of creating reflink copies of a Linux kernel source
> tree and running kernel builds in each copy simultaneously, like this:
> 
> 	# assume you have a ready-to-build kernel tree in 'linux'
> 	for x in $(seq 1 5); do
> 		cp -a --reflink linux linux-$x
> 	done
> 
> 	# build all the kernels at once
> 	for x in $(seq 1 5); do
> 		(cd linux-$x && make -j10 2>&1 | tee make.log) &
> 	done
> 
> 	wait
> 	# then tail all the make.logs and see how many failed due to
> 	# I/O errors
> 
> Spurious I/O errors occurred with as few as two concurrent kernel builds.
> 
> The test machine has 16GB of RAM and the filesystem is also 16GB,
> RAID1 on two spinning disks.
> 




* Re: Resolved...ish.  was: Re: spurious I/O errors from btrfs...at the caching layer?
From: Austin S Hemmelgarn @ 2015-01-26 12:39 UTC
  To: Zygo Blaxell, linux-btrfs


On 2015-01-25 23:22, Zygo Blaxell wrote:
> It seems that the rate of spurious I/O errors varies most according to
> the vm.vfs_cache_pressure sysctl.  At '10' the I/O errors occur so often
> that building a kernel is impossible.  At '100' I can't reproduce even
> a single I/O error.
>
> I guess this is my own fault for using non-default sysctl parameters,
> although I wouldn't expect any value of this sysctl to cause these
> symptoms... :-P
>
>
Setting that to anything less than about 75 is just asking for trouble.
What I think is happening is that you are eating up memory with the
VFS cache (because that caches inodes and dentries, which in turn means
that reflinks don't help decrease its usage at all) faster than it can
be reclaimed, and therefore have no space for the file data to be read
into memory.  This will probably happen with most other filesystems as
well if you set vfs_cache_pressure really low.
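
(One way to watch for that while the builds run, if you want to confirm
it; the slab names below are the usual ones but can vary with kernel
config:)

	# dentry/inode slab growth vs. reclaimable page cache
	grep -E '^(dentry|btrfs_inode)' /proc/slabinfo
	grep -E '^(Cached|Slab|SReclaimable):' /proc/meminfo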

Personally, I wouldn't recommend ever touching that particular sysctl,
with the exception of increasing it somewhat when you have really fast
storage (NVMe or UFS/SCSI-based SSDs, for example) and a relatively
small set of files that you actually access frequently.



