Re: Dirty data loss after cache disk error recovery

From: Kai Krakow <kai@kaishome.de>
To: Coly Li <colyli@suse.de>
Cc: "linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,
	"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>
Subject: Re: Dirty data loss after cache disk error recovery
Date: Fri, 7 May 2021 16:56:27 +0200	[thread overview]
Message-ID: <CAC2ZOYuCLQpD___YBua7yEuuG85+OQ+HiRGDy=FRLS9cgMg4rA@mail.gmail.com> (raw)
In-Reply-To: <70b9cdd0-ace9-9ee7-19c7-5c47a4d2fce9@suse.de>

Hi!

> There is an option to panic the system when cache device failed. It is
> in errors file with available options as "unregister" and "panic". This
> option is default set to "unregister", if you set it to "panic" then
> panic() will be called.

Hmm, okay, I didn't find "panic" documented somewhere. I'll take a
look at it again. If it's missing, I'll create a patch to improve
documentation.

> If the cache set is attached, read-only the bcache device does not
> prevent the meta data I/O on cache device (when try to cache the reading
> data), if the cache device is really disconnected that will be
> problematic too.

I didn't completely understand the sentence, it seems to miss a word.
But whatever it is, it's probably true. ;-)

> The "auto" and "always" options are for "unregister" error action. When
> I enhance the device failure handling, I don't add new error action, all
> my work was to make the "unregister" action work better.

But isn't the failure case here that it hits both code paths: The one
that unregisters the device, and the one that then retires the cache?

> Adding a new "stop" error action IMHO doesn't make things better. When
> the cache device is disconnected, it is always risky that some caching
> data or meta data is not updated onto cache device. Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse....  It was the reason why I didn't add
> new error action for the device failure handling patch set.

But we are actually now seeing silent data loss: The system f'ed up
somehow, needed a hard reset, and after reboot the bcache device was
accessible in cache mode "none" (because they have been unregistered
before, and because udev just detected it and you can use bcache
without an attached cache in "none" mode), completely hiding the fact
that we lost dirty write-back data, it's even not quite obvious that
/dev/bcache0 now is detached, cache mode none, but accessible
nevertheless. To me, this is quite clearly "silent data loss",
especially since the unregister action threw the dirty data away.

So this:

> Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse....

is actually the situation we are facing currently: Device has been
unregistered, after reboot, udev detects it has clean backing device
without cache association, using cache mode none, and it is readable
and writable just fine: It essentially permitted access to the stale
backing device (tho, it didn't re-attach as you outlined, but that's
more or less the same situation).

Maybe devices that become disassociated from a cache due to IO errors
but have dirty data should go to a caching mode "stale", and bcache
should refuse to access such devices or throw away their dirty data
until I decide to force them back online into the cache set or force
discard the dirty data. Then at least I would discover that something
went badly wrong. Otherwise, I may not detect that dirty data wasn't
written. In the best case, that makes my FS unmountable, in the worst
case, some file data is simply lost (aka silent data loss), besides
both situations are the worst-case scenario anyways.

The whole situation probably comes from udev auto-registering bcache
backing devices again, and bcache has no record of why the device was
unregistered - it looks clean after such a situation.

> Sorry I just find this thread from my INBOX. Hope it is not too late.

No worries. ;-)

It was already too late when the dirty cache was discarded but I have
daily backups. My system is up and running again, but it's probably
not a question of IF it happens again but WHEN it does. So I'd like to
discuss how we can get a cleaner fail situation because currently it's
just unclean because every status is lost after reboot, and devices
look clean, and caching mode is simply "none", which is completely
fine for the boot process.

Thanks,
Kai