Dirty data loss after cache disk error recovery

* Dirty data loss after cache disk error recovery
@ 2021-04-20  3:17 吴本卿(云桌面 福州)
  2021-04-28 18:30 ` Kai Krakow
  2023-10-17  1:57 ` Coly Li
  0 siblings, 2 replies; 17+ messages in thread
From: 吴本卿(云桌面 福州) @ 2021-04-20  3:17 UTC (permalink / raw)
  To: linux-bcache

Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.

I checked the log and found that logs:
[12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
[12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
[12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered

I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.

Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit

It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37: 
Apr 19 05:50:18 localhost.localdomain kernel: journal io error
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.

^ permalink raw reply	[flat|nested] 17+ messages in thread