* Cache Device Failure Expectations
@ 2021-03-22 16:09 Marc Smith
From: Marc Smith @ 2021-03-22 16:09 UTC (permalink / raw)
I'm using bcache in a Linux 5.4.69 kernel, and I'm testing transient
cache device failures with a backing device using 'writeback'
mode, and with several gigabytes of dirty data (that has not reached
the backing device).
In my first test, the cache devices are using the default "unregister"
value for the "errors" sysfs attribute (for bcache cache devices
in /sys/fs/bcache/...). When I induce a cache device failure, bcache
backing devices stop, the cache device is detached from all affected
backing devices, and I/O errors are returned on subsequent access
attempts to the backing devices. This all works as I would expect
based on how it's configured.
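
For anyone reproducing this, here is a minimal Python sketch for checking
and changing the "errors" attribute on each cache set. It assumes the
attribute lives at /sys/fs/bcache/<cache set>/errors and reads back as a
list with the active value in brackets (e.g. "[unregister] panic"); the
function names are my own, and writing the attribute needs root.

```python
import glob
import os

BCACHE_SYSFS = "/sys/fs/bcache"  # location mentioned in the message


def parse_selected(text):
    """Return the bracketed entry from a sysfs list such as '[unregister] panic'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    # Assumption: if nothing is bracketed, the file holds just the value.
    return text.strip()


def error_policies(base=BCACHE_SYSFS):
    """Map each cache set directory name to its current 'errors' setting."""
    policies = {}
    for path in sorted(glob.glob(os.path.join(base, "*", "errors"))):
        cset = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            policies[cset] = parse_selected(f.read())
    return policies


def set_error_policy(cset, policy, base=BCACHE_SYSFS):
    """Write 'unregister' or 'panic' to one cache set's 'errors' attribute."""
    if policy not in ("unregister", "panic"):
        raise ValueError("unexpected policy: %r" % policy)
    with open(os.path.join(base, cset, "errors"), "w") as f:
        f.write(policy)


if __name__ == "__main__":
    for cset, policy in error_policies().items():
        print(cset, policy)
```

The base path is a parameter only so the helpers can be exercised against
a scratch directory instead of the live sysfs tree.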
The downside to "unregister" is when I reboot the system (with the
cache block device reinstated/working), the backing devices come up
but with no cache device attached! So this certainly causes file
system corruption as dirty data is not present on the backing device
(since the backing device is started without the cache device).
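
A guard along these lines could run before anything mounts the bcache
devices after a reboot. This is only a sketch: it assumes each device
exposes /sys/block/bcacheN/bcache/state and that a detached backing
device reports "no cache" there. It cannot tell whether dirty data was
left behind on the cache, only that no cache is currently attached. The
function names are hypothetical.

```python
import glob
import os


def backing_states(block_root="/sys/block"):
    """Return {device name: state string} for every bcacheN device found."""
    states = {}
    pattern = os.path.join(block_root, "bcache*", "bcache", "state")
    for path in sorted(glob.glob(pattern)):
        dev = path.split(os.sep)[-3]  # .../bcacheN/bcache/state -> bcacheN
        with open(path) as f:
            states[dev] = f.read().strip()
    return states


def running_without_cache(states):
    """List devices whose state shows no cache set attached."""
    return [dev for dev, state in states.items() if state == "no cache"]


if __name__ == "__main__":
    detached = running_without_cache(backing_states())
    if detached:
        print("refusing to mount, no cache attached:", ", ".join(detached))
```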
On the second test run, I used "panic" for the "errors" sysfs
attribute, and this works more cleanly, most of the time. When I induce a cache
block device failure, the system then panics, but the cache device
stays associated with the backing devices -- and dirty data can then
flush to the backing device. On this second test, when the system
booted back up, one cache device failed to start:
[ 333.116149] bcache: prio_read() bad csum reading priorities
[ 333.116151] bcache: prio_read() bad magic reading priorities
[ 333.116636] bcache: bch_cache_set_error() bcache: error on
[ 333.116637] corrupted btree at bucket 473, block 44, 504 keys
[ 333.116638] bcache: bch_cache_set_error() , disabling caching
[ 333.116649] bcache: register_cache() error dm-12: failed to run cache set
[ 333.116650] bcache: register_bcache() error : failed to register device
This seemed to be a temporary problem -- I rebooted the system again,
and then the bcache cache device started without issue. I did not
check for data loss / corruption in this instance.
A third test run using "panic" mode resulted in everything coming back
up normally, and seemingly operating just fine (no cache/backing
device start errors). I did not check for data loss / corruption in
this instance either.
So, I guess just a couple questions to solidify my expectations on
this type of transient cache device failure (cache block device fails,
but then can come back later fully intact):
- It sounds like for handling this case, "panic" mode for the "errors"
sysfs attribute is best since it does not detach the cache device from
- Is this safe/reliable (transient cache device failures)? Obviously
it's not preferred, but should I expect any problems should this occur
and using "panic" mode? No metadata corruption on the cache device is
Thanks for your time. Appreciate the great work on bcache!