linux-bcache.vger.kernel.org archive mirror
* Dirty data loss after cache disk error recovery
@ 2021-04-20  3:17 吴本卿(云桌面 福州)
  2021-04-28 18:30 ` Kai Krakow
  2023-10-17  1:57 ` Coly Li
  0 siblings, 2 replies; 17+ messages in thread
From: 吴本卿(云桌面 福州) @ 2021-04-20  3:17 UTC (permalink / raw)
  To: linux-bcache

Hi, recently I ran into a problem while using bcache. My cache disk went offline for some reason. When the cache disk came back online, I found the backing device in the detached state. I tried to attach the backing device to the cache again and found that the dirty data was lost. The md5 sum of the same file on the backing device's filesystem differs because of the dirty data loss.

I checked the log and found these messages:
[12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
[12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
[12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered

I checked the bcache code and found that a cache disk I/O error triggers __cache_set_unregister(), which detaches the backing device and thereby loses the dirty data: after the backing device is reattached, the newly allocated bcache_device->id is incremented, while the bkeys that point to the dirty data still carry the old id.

Is there a way to avoid this problem, such as giving users an option to run the stop path instead of detaching when a cache disk error occurs?
I tried to increase cache_set->io_error_limit in order to buy time to stop the cache_set manually:
echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit

That did not help, because bch_count_io_errors() is not the only caller of bch_cache_set_error(); other code paths call it as well. For example, when an I/O error occurs in the journal:
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37: 
Apr 19 05:50:18 localhost.localdomain kernel: journal io error
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

Why is bcache designed to unregister the cache_set when an error occurs on the cache device? What is the original intention? The unregister operation deletes all backing-device relationships, which results in the loss of dirty data.
Would it be possible to give users the choice to stop the cache_set instead of unregistering it?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
@ 2021-04-28 18:30 ` Kai Krakow
  2021-04-28 18:39   ` Kai Krakow
  2023-10-17  1:57 ` Coly Li
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:30 UTC (permalink / raw)
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

Hello!

On Tue, 20 Apr 2021 at 05:24, 吴本卿(云桌面 福州)
<wubenqing@ruijie.com.cn> wrote:
>
> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
>
> I checked the log and found that logs:
> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

"stop it to avoid potential data corruption" is not what it actually
does: neither it stops it, nor it prevents corruption because dirty
data becomes thrown away.

> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
>
> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
>
> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>
> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.

I think the same problem hit me, too, last night.

My kernel choked because of a GPU error, and that somehow disconnected
the cache. I can only guess that there was some sort of timeout due to
blocked queues, and that introduced an IO error which detached the
caches.

Sadly, I only realized this after I had already reformatted and
started restoring from backup: during the restore I watched the bcache
status and found that the devices were not attached.

I don't know if I could have re-attached the devices instead of
formatting. But I think the dirty data would have been discarded
anyways due to incrementing bcache_device->id.

This really needs a better solution; detaching is one of the worst
options. On btrfs in particular it has catastrophic consequences,
because data is not updated in place but via copy on write, which
requires updating a lot of pointers. Usually a CoW filesystem would be
robust against this kind of data loss, but the vast amount of dirty
data that is lost puts the tree generations too far behind what btrfs
expects, making it essentially broken beyond repair. If some trees in
the FS are only a few generations behind, btrfs can repair itself by
using a backup tree root, but when the bcache is lost, generation
numbers usually lag behind by several hundred generations. Detaching
would be fine if there were no dirty data - otherwise the device
should probably stop and refuse any further IO.

@Coly If I patched the source to stop instead of detach, would it have
made anything better? Would there be any side-effects? Is it possible
to atomically check for dirty data in that case and take either the
one or the other action?

Thanks,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:30 ` Kai Krakow
@ 2021-04-28 18:39   ` Kai Krakow
  2021-04-28 18:51     ` Kai Krakow
  2021-05-07 12:13     ` Coly Li
  0 siblings, 2 replies; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:39 UTC (permalink / raw)
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

Hi Coly!

On Wed, 28 Apr 2021 at 20:30, Kai Krakow <kai@kaishome.de> wrote:
>
> Hello!
>
> Am Di., 20. Apr. 2021 um 05:24 Uhr schrieb 吴本卿(云桌面 福州)
> <wubenqing@ruijie.com.cn>:
> >
> > Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
> >
> > I checked the log and found that logs:
> > [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> "stop it to avoid potential data corruption" is not what it actually
> does: neither it stops it, nor it prevents corruption because dirty
> data becomes thrown away.
>
> > [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> > [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
> >
> > I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
> >
> > Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
> > I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
> > echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
> >
> > It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> > Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
> >
> > When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> > Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.
>
> I think the same problem hit me, too, last night.
>
> My kernel choked because of a GPU error, and that somehow disconnected
> the cache. I can only guess that there was some sort of timeout due to
> blocked queues, and that introduced an IO error which detached the
> caches.
>
> Sadly, I only realized this after I already reformatted and started
> restore from backup: During the restore I watched the bcache status
> and found that the devices are not attached.
>
> I don't know if I could have re-attached the devices instead of
> formatting. But I think the dirty data would have been discarded
> anyways due to incrementing bcache_device->id.
>
> This really needs a better solution, detaching is one of the worst,
> especially on btrfs this has catastrophic consequences because data is
> not updated inline but via copy on write. This requires updating a lot
> of pointers. Usually, cow filesystem would be robust to this kind of
> data-loss but the vast amount of dirty data that is lost puts the tree
> generations too far behind of what btrfs is expecting, making it
> essentially broken beyond repair. If some trees in the FS are just a
> few generations behind, btrfs can repair itself by using a backup tree
> root, but when the bcache is lost, generation numbers usually lag
> behind several hundred generations. Detaching would be fine if there'd
> be no dirty data - otherwise the device should probably stop and
> refuse any more IO.
>
> @Coly If I patched the source to stop instead of detach, would it have
> made anything better? Would there be any side-effects? Is it possible
> to atomically check for dirty data in that case and take either the
> one or the other action?

I think this behavior was introduced by https://lwn.net/Articles/748226/

So above is my late review. ;-)

(around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
access LWN for reasons[tm])

Thanks,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:39   ` Kai Krakow
@ 2021-04-28 18:51     ` Kai Krakow
  2021-05-07 12:11       ` Coly Li
  2021-05-07 12:13     ` Coly Li
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:51 UTC (permalink / raw)
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

> I think this behavior was introduced by https://lwn.net/Articles/748226/
>
> So above is my late review. ;-)
>
> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
> access LWN for reasons[tm])

The problem may actually come from a different code path which retires
the cache on metadata error:

commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
"bcache: fix cached_dev->count usage for bch_cache_set_error()"

It probably should consider whether there is any dirty data. As a
first step, it may be sufficient to run a BUG_ON(there_is_dirty_data)
(this would kill the bcache thread, so it may not be a good idea), or
even freeze the system with an unrecoverable error, or at least stop
the device to prevent any IO with possibly stale data (because
retiring throws away dirty data). A good solution would be if the
"with dirty data" error path could somehow force the attached file
system into read-only mode, maybe by just reporting IO errors when
this bdev is accessed through bcache.
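
To make that concrete, here is a minimal sketch of what such an error
path could look like instead of the current detach logic. This is
hypothetical, untested code: cache_set_fail_devices() is an invented
helper name, and locking (bch_register_lock) and refcounting are
ignored.

    /* Hypothetical sketch, not mainline behavior: on a fatal cache set
     * error, stop devices that still hold dirty data instead of
     * detaching them, so the attachment (and the bkeys pointing at the
     * dirty extents) survives for a later recovery attempt.
     */
    static void cache_set_fail_devices(struct cache_set *c)
    {
            struct cached_dev *dc, *t;

            list_for_each_entry_safe(dc, t, &c->cached_devs, list) {
                    if (atomic_read(&dc->has_dirty))
                            /* keep the attachment, just stop I/O */
                            bcache_device_stop(&dc->disk);
                    else
                            /* nothing dirty: detaching is harmless */
                            bch_cached_dev_detach(dc);
            }
    }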

Thanks,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:51     ` Kai Krakow
@ 2021-05-07 12:11       ` Coly Li
  2021-05-07 14:56         ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Coly Li @ 2021-05-07 12:11 UTC (permalink / raw)
  To: Kai Krakow
  Cc: linux-bcache,
	吴本卿(云桌面
	福州)

On 4/29/21 2:51 AM, Kai Krakow wrote:
>> I think this behavior was introduced by https://lwn.net/Articles/748226/
>>
>> So above is my late review. ;-)
>>
>> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
>> access LWN for reasons[tm])
> 
> The problem may actually come from a different code path which retires
> the cache on metadata error:
> 
> commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
> "bcache: fix cached_dev->count usage for bch_cache_set_error()"
> 
> It probably should consider if there's any dirty data. As a first
> step, it may be sufficient to run a BUG_ON(there_is_dirty_data) (this
> would kill the bcache thread, may not be a good idea) or even freeze
> the system with an unrecoverable error, or at least stop the device to
> prevent any IO with possibly stale data (because retiring throws away
> dirty data). A good solution would be if the "with dirty data" error
> path could somehow force the attached file system into read-only mode,
> maybe by just reporting IO errors when this bdev is accessed through
> bcache.


There is an option to panic the system when the cache device fails. It
is the "errors" file, with available options "unregister" and "panic".
This option defaults to "unregister"; if you set it to "panic" then
panic() will be called.
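
For reference, on a running system that looks like this (using the
cache set UUID from the logs above; reading the file shows which
action is currently selected):

echo panic > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/errors
cat /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/errors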

If the cache set is attached, read-only the bcache device does not
prevent the meta data I/O on cache device (when try to cache the reading
data), if the cache device is really disconnected that will be
problematic too.

The "auto" and "always" options are for "unregister" error action. When
I enhance the device failure handling, I don't add new error action, all
my work was to make the "unregister" action work better.

Adding a new "stop" error action IMHO doesn't make things better. When
the cache device is disconnected, there is always a risk that some
cached data or metadata was not written to the cache device. Permitting
the cache device to be re-attached to the backing device may introduce
"silent data loss", which might be worse....  That was the reason why I
didn't add a new error action in the device failure handling patch set.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:39   ` Kai Krakow
  2021-04-28 18:51     ` Kai Krakow
@ 2021-05-07 12:13     ` Coly Li
  1 sibling, 0 replies; 17+ messages in thread
From: Coly Li @ 2021-05-07 12:13 UTC (permalink / raw)
  To: Kai Krakow,
	吴本卿(云桌面
	福州)
  Cc: linux-bcache

On 4/29/21 2:39 AM, Kai Krakow wrote:
> Hi Coly!
> 
> Am Mi., 28. Apr. 2021 um 20:30 Uhr schrieb Kai Krakow <kai@kaishome.de>:
>>
>> Hello!
>>
>> Am Di., 20. Apr. 2021 um 05:24 Uhr schrieb 吴本卿(云桌面 福州)
>> <wubenqing@ruijie.com.cn>:
>>>
>>> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
>>>
>>> I checked the log and found that logs:
>>> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>>
>> "stop it to avoid potential data corruption" is not what it actually
>> does: neither it stops it, nor it prevents corruption because dirty
>> data becomes thrown away.
>>
>>> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
>>> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
>>>
>>> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
>>>
>>> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
>>> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
>>> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>>>
>>> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
>>> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>>>
>>> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
>>> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.
>>
>> I think the same problem hit me, too, last night.
>>
>> My kernel choked because of a GPU error, and that somehow disconnected
>> the cache. I can only guess that there was some sort of timeout due to
>> blocked queues, and that introduced an IO error which detached the
>> caches.
>>
>> Sadly, I only realized this after I already reformatted and started
>> restore from backup: During the restore I watched the bcache status
>> and found that the devices are not attached.
>>
>> I don't know if I could have re-attached the devices instead of
>> formatting. But I think the dirty data would have been discarded
>> anyways due to incrementing bcache_device->id.
>>
>> This really needs a better solution, detaching is one of the worst,
>> especially on btrfs this has catastrophic consequences because data is
>> not updated inline but via copy on write. This requires updating a lot
>> of pointers. Usually, cow filesystem would be robust to this kind of
>> data-loss but the vast amount of dirty data that is lost puts the tree
>> generations too far behind of what btrfs is expecting, making it
>> essentially broken beyond repair. If some trees in the FS are just a
>> few generations behind, btrfs can repair itself by using a backup tree
>> root, but when the bcache is lost, generation numbers usually lag
>> behind several hundred generations. Detaching would be fine if there'd
>> be no dirty data - otherwise the device should probably stop and
>> refuse any more IO.
>>
>> @Coly If I patched the source to stop instead of detach, would it have
>> made anything better? Would there be any side-effects? Is it possible
>> to atomically check for dirty data in that case and take either the
>> one or the other action?
> 
> I think this behavior was introduced by https://lwn.net/Articles/748226/
> 
> So above is my late review. ;-)
> 
> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
> access LWN for reasons[tm])
> 

Hi Kai,

Sorry, I just found this thread in my inbox. Hope it is not too late. I
replied to your latest message in this thread.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-05-07 12:11       ` Coly Li
@ 2021-05-07 14:56         ` Kai Krakow
       [not found]           ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-05-07 14:56 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-bcache,
	吴本卿(云桌面
	福州)

Hi!

> There is an option to panic the system when cache device failed. It is
> in errors file with available options as "unregister" and "panic". This
> option is default set to "unregister", if you set it to "panic" then
> panic() will be called.

Hmm, okay, I didn't find "panic" documented anywhere. I'll take a
look at it again. If it's missing, I'll create a patch to improve the
documentation.

> If the cache set is attached, read-only the bcache device does not
> prevent the meta data I/O on cache device (when try to cache the reading
> data), if the cache device is really disconnected that will be
> problematic too.

I didn't completely understand that sentence; it seems to be missing a
word. But whatever it means, it's probably true. ;-)

> The "auto" and "always" options are for "unregister" error action. When
> I enhance the device failure handling, I don't add new error action, all
> my work was to make the "unregister" action work better.

But isn't the failure case here that it hits both code paths: The one
that unregisters the device, and the one that then retires the cache?

> Adding a new "stop" error action IMHO doesn't make things better. When
> the cache device is disconnected, it is always risky that some caching
> data or meta data is not updated onto cache device. Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse....  It was the reason why I didn't add
> new error action for the device failure handling patch set.

But we are actually seeing silent data loss now: the system f'ed up
somehow, needed a hard reset, and after reboot the bcache device was
accessible in cache mode "none" (because it had been unregistered
before, and because udev just detected it again, and bcache can be used
without an attached cache in "none" mode). That completely hides the
fact that we lost dirty write-back data; it's not even obvious that
/dev/bcache0 is now detached, cache mode "none", yet accessible
nevertheless. To me, this is quite clearly "silent data loss",
especially since the unregister action threw the dirty data away.

So this:

> Permit the cache
> device to be re-attached to the backing device may introduce "silent
> data loss" which might be worse....

is actually the situation we are facing now: the device has been
unregistered; after reboot, udev detects a clean backing device without
a cache association, using cache mode "none", and it is readable and
writable just fine. It essentially permitted access to the stale
backing device (though it didn't re-attach as you outlined, but that's
more or less the same situation).

Maybe devices that become disassociated from a cache due to IO errors
but have dirty data should go to a caching mode "stale", and bcache
should refuse to access such devices or throw away their dirty data
until I decide to force them back online into the cache set or force
discard the dirty data. Then at least I would discover that something
went badly wrong. Otherwise, I may not detect that dirty data wasn't
written. In the best case, that makes my FS unmountable, in the worst
case, some file data is simply lost (aka silent data loss); either way
it is a worst-case scenario.

The whole situation probably comes from udev auto-registering bcache
backing devices again, and bcache has no record of why the device was
unregistered - it looks clean after such a situation.
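
As a stop-gap, a manual check before mounting would at least make the
failure visible. A rough example, assuming the sysfs attribute names
exposed by current kernels and /dev/bcache0 as the device:

cat /sys/block/bcache0/bcache/state        # "no cache" after a reboot means the cache association is gone
cat /sys/block/bcache0/bcache/cache_mode   # the active mode is the one shown in brackets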

> Sorry I just find this thread from my INBOX. Hope it is not too late.

No worries. ;-)

It was already too late when the dirty cache was discarded, but I have
daily backups. My system is up and running again, but it's probably not
a question of IF it happens again but WHEN it does. So I'd like to
discuss how we can get a cleaner failure mode, because currently it is
just unclean: all status is lost after reboot, the devices look clean,
and the caching mode is simply "none", which is perfectly acceptable to
the boot process.

Thanks,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
       [not found]           ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
@ 2023-09-06 22:56             ` Kai Krakow
       [not found]               ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-09-06 22:56 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)

Wow!

I call that a necro-bump... ;-)

On Wed, 6 Sept 2023 at 22:33, Eric Wheeler
<lists@bcache.ewheeler.net> wrote:
>
> On Fri, 7 May 2021, Kai Krakow wrote:
>
> > > Adding a new "stop" error action IMHO doesn't make things better. When
> > > the cache device is disconnected, it is always risky that some caching
> > > data or meta data is not updated onto cache device. Permit the cache
> > > device to be re-attached to the backing device may introduce "silent
> > > data loss" which might be worse....  It was the reason why I didn't add
> > > new error action for the device failure handling patch set.
> >
> > But we are actually now seeing silent data loss: The system f'ed up
> > somehow, needed a hard reset, and after reboot the bcache device was
> > accessible in cache mode "none" (because they have been unregistered
> > before, and because udev just detected it and you can use bcache
> > without an attached cache in "none" mode), completely hiding the fact
> > that we lost dirty write-back data, it's even not quite obvious that
> > /dev/bcache0 now is detached, cache mode none, but accessible
> > nevertheless. To me, this is quite clearly "silent data loss",
> > especially since the unregister action threw the dirty data away.
> >
> > So this:
> >
> > > Permit the cache
> > > device to be re-attached to the backing device may introduce "silent
> > > data loss" which might be worse....
> >
> > is actually the situation we are facing currently: Device has been
> > unregistered, after reboot, udev detects it has clean backing device
> > without cache association, using cache mode none, and it is readable
> > and writable just fine: It essentially permitted access to the stale
> > backing device (tho, it didn't re-attach as you outlined, but that's
> > more or less the same situation).
> >
> > Maybe devices that become disassociated from a cache due to IO errors
> > but have dirty data should go to a caching mode "stale", and bcache
> > should refuse to access such devices or throw away their dirty data
> > until I decide to force them back online into the cache set or force
> > discard the dirty data. Then at least I would discover that something
> > went badly wrong. Otherwise, I may not detect that dirty data wasn't
> > written. In the best case, that makes my FS unmountable, in the worst
> > case, some file data is simply lost (aka silent data loss), besides
> > both situations are the worst-case scenario anyways.
> >
> > The whole situation probably comes from udev auto-registering bcache
> > backing devices again, and bcache has no record of why the device was
> > unregistered - it looks clean after such a situation.

[...]

> I think we hit this same issue from 2021. Here is that original thread from 2021:
>         https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
>
> Kai, did you end up with a good patch for this? We are running a 5.15
> kernel with the many backported bcache commits that Coly suggested here:
>         https://www.spinics.net/lists/linux-bcache/msg12084.html

I'm currently running 6.1 with bcache on mdraid1 and device-level
write caching disabled. I haven't seen this occur again.

BUT: Between that time and now I eventually also replaced my faulty
RAM which had a few rare bit-flips.


> Based on the thread from Kai (from 2021), I think we need to restore from
> backup. While the root of the problem may be hardware related, bcache
> should handle this more gracefully than unplugging the cache.

Yes, it may be hardware-related, and you should probably confirm that
your RAM is working properly.

Currently, I'm running with no bcache patches on LTS 6.1, only some
btrfs patches:
https://github.com/kakra/linux/pull/26

The allocation-hint patches in particular provide better speedups for
metadata than bcache ever could. With these patches, you could dedicate
a small partition on each of two SSDs to a btrfs metadata raid1, and
use the remainder of the SSDs as an mdraid1 cache device for bcache.
Then just don't use writeback caching, but writearound or writethrough
instead. Most btrfs performance issues come from slow metadata, which
allocation hints improve much more than bcache can.
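
Such a layout could look roughly like this - just a sketch, the device
names are placeholders, and the step that actually marks the SSD
partitions as metadata-preferred depends on the out-of-tree
allocation-hint patch interface, so it is left out here:

# small partition on each SSD: btrfs metadata raid1, no bcache/md below it
# remainder of the SSDs: md mirror used as the bcache cache device
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p2 /dev/nvme1n1p2
make-bcache -C /dev/md0 -B /dev/sda1
mkfs.btrfs -m raid1 -d single /dev/nvme0n1p1 /dev/nvme1n1p1 /dev/bcache0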

But as written above, I had bad RAM, have meanwhile upgraded to kernel
6.1, and have had no issues with bcache since, even on power loss.


> Coly, is there already a patch to prevent complete dirty cache loss?

This is probably still an issue. The cache attachment MUST NEVER EVER
automatically degrade to "none" which it did for my fail-cases I had
back then. I don't know if this has changed meanwhile. But because
bcache explicitly does not honor write-barriers from upstream writes
for its own writeback (which is okay because it guarantees to write
back all data anyways and give a consistent view to upstream FS -
well, unless it has to handle write errors), the backed filesystem is
guaranteed to be effed up in that case, and allowing it to mount and
write because bcache silently has fallen back to "none" will only make
the matter worse.

(HINT: I have never used drbd personally; most of the following is
theoretical thinking without real-world experience.)

I see that you're using drbd? Did it fail due to networking issues?
I'm pretty sure it should be robust in that case, but maybe bcache
cannot handle the situation? Does drbd have a write log to replay
writes after a network connection loss? It looks like it doesn't, and
thus bcache exploded.

Anyway, since your backing device seems to be on drbd, using metadata
allocation hinting is probably not an option. You could of course still
use drbd with bcache for metadata-hinted partitions, and then use
writearound caching only for that. At least in the fail case your
btrfs won't be destroyed, but your data chunks may then contain
unreadable files. It should be easy to identify those and restore them
from backup individually. Btrfs is very robust in that fail case: if
metadata is okay, data errors are properly detected and handled. If
you're not using btrfs, all of this doesn't apply, of course.

I'm not sure if write-back caching for drbd backing is a wise decision
anyways. drbd is slow for writes, that's part of the design (and no
writeback caching could fix that). I would not rely on
bcache-writeback to fix that for you because it is not prepared for
storage that may be temporarily not available, iow, it would freeze
and continue when drbd is available again. I think you should really
use writearound/writethrough so your FS can be sure data has been
written, replicated and persisted. In case of btrfs, you could still
split data and metadata as written above, and use writeback for data,
but reliable writes for metadata.

So concluding:

1. I'm now persisting metadata directly to disk with no intermediate
layers (no bcache, no md)

2. I'm using allocation-hinted data-only partitions with bcache
write-back, with bcache on mdraid1. If anything goes wrong, I have
file crc errors in btrfs files only, but the filesystem itself is
valid because no metadata is broken or lost. I have snapshots of
recently modified files. I have daily backups.

3. Your problem is that bcache can - by design - detect write errors
only when it's too late, with no chance of telling the filesystem. In
that case, writethrough/writearound is the correct choice.

4. Maybe bcache should know if backing is on storage that may be
temporarily unavailable and then freeze until the backing storage is
back online, similar to how iSCSI handles that. But otoh, maybe drbd
should freeze until the replicated storage is available again while
writing (from what I've read, it's designed to not do that but let
local storage get ahead of the replica, which is btw incompatible with
bcache-writeback assumptions). Or maybe using async mirroring can fix
this for you but then, the mirror will be compromised if a hardware
failure immediately follows a previous drbd network connection loss.
But, it may still be an issue with the local hardware (bit-flips etc)
because maybe just bcache internals broke - Coly may have a better
idea of that.

I think your main issue here is that bcache decouples write barriers
from the underlying backing storage - and you should just not use
writeback; it is incompatible by design with how drbd works: your
replica will be broken when you need it.


> Here is our trace:
>
> [Sep 6 13:01] bcache: bch_cache_set_error() error on a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, 0:1163806048 gen 3: bad, length too big, disabling caching
> [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> [  +0.000548] block drbd8143: write: error=10 s=9205904s
> [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> [  +0.000866] block drbd8143: Local IO failed in drbd_endio_write_sec_final.
> [  +0.000809] Workqueue: bcache bch_data_insert_keys
> [  +0.000833] block drbd8143: disk( UpToDate -> Inconsistent )
> [  +0.000826] Call Trace:
> [  +0.000875] block drbd8143: write: error=10 s=8394752s
> [  +0.000797]  <TASK>
> [  +0.000006]  dump_stack_lvl+0x57/0x7e
> [  +0.000791] block drbd8143: Local IO failed in drbd_endio_write_sec_final.
> [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
> [  +0.000760] block drbd8143: write: error=10 s=8397840s
> [  +0.000759]  btree_mergesort+0x27e/0x36e
> [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> [  +0.000009]  __btree_sort+0xa4/0x1e9
> [  +0.002085] block drbd8143: drbd_md_sync_page_io(,41943032s,WRITE) failed with error -5
> [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> [  +0.000878] block drbd8143: meta data update failed!
> [  +0.000836]  bch_btree_init_next+0x39/0xb6
> [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> [  +0.000877] block drbd8143: disk( Inconsistent -> Failed )
> [  +0.000863]  btree_insert_fn+0x20/0x48
> [  +0.000866] block drbd8143: Local IO failed in drbd_md_write. Detaching...
> [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> [  +0.000848]  bch_btree_insert+0x102/0x188
> [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> [  +0.000857]  bch_data_insert_keys+0x39/0xde
> [  +0.000845]  process_one_work+0x280/0x5cf
> [  +0.000858]  worker_thread+0x52/0x3bd
> [  +0.000851]  ? process_one_work.cold+0x52/0x51
> [  +0.000877]  kthread+0x13e/0x15b
> [  +0.000858]  ? set_kthread_struct+0x60/0x52
> [  +0.000855]  ret_from_fork+0x22/0x2d
> [  +0.000854]  </TASK>


Regards,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
       [not found]               ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
@ 2023-09-07 12:00                 ` Kai Krakow
  2023-09-07 19:10                   ` Eric Wheeler
  2023-09-12  6:54                 ` 邹明哲
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-09-07 12:00 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Coly Li, linux-bcache,
	吴本卿(云桌面
	福州),
	Mingzhe Zou

On Thu, 7 Sept 2023 at 02:42, Eric Wheeler
<lists@bcache.ewheeler.net> wrote:
>
> +Mingzhe, Coly: please comment on the proposed fix below when you have a
> moment:
>
> > > Coly, is there already a patch to prevent complete dirty cache loss?
> >
> > This is probably still an issue. The cache attachment MUST NEVER EVER
> > automatically degrade to "none" which it did for my fail-cases I had
> > back then. I don't know if this has changed meanwhile.
>
> I would rather that bcache went to a read-only mode in failure
> conditions like this.  Maybe write-around would be acceptable since
> bcache returns -EIO for any failed dirty cache reads.  But if the cache
> is dirty, and it gets an error, it _must_never_ read from the bdev, which
> is what appears to happen now.
>
> Coly, Mingzhe, would this be an easy change?
>
> Here are the relevant bits:
>
> The allocator called btree_mergesort which called bch_extent_invalid:
>         https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
>
> Which called the `cache_bug` macro, which triggered bch_cache_set_error:
>         https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
>
> It then calls `bch_cache_set_unregister` which shuts down the cache:
>         https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
>
>         bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>         {
>                 ...
>                 bch_cache_set_unregister(c);
>                 return true;
>         }
>
> Proposed solution:
>
> What if, instead of bch_cache_set_unregister() that this was called instead:
>         SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
>
> This would bypass the cache for future writes, and allow reads to
> proceed if possible, and -EIO otherwise to let upper layers handle the
> failure.

Ensuring not to read stale content from the bdev by switching to
writearound is probably a proper solution - if there are no other
side-effects. But due to the error, the cdev may be in some broken
limbo state. So it should probably try to write back dirty data while
accepting no more future data - neither for read-caching nor
write-caching. Maybe this was the intention of unregister, but instead
of writing back dirty data and still serving dirty data from the cdev,
it immediately unregisters and invalidates the cdev.

So maybe the bugfix should be about why unregister() doesn't write
back dirty data first...

So actually switching to "none" but without unregister should probably
provide that exact behavior? No more read/write but finishing
outstanding dirty writeback.
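
As a sketch of the writearound variant (hypothetical and untested; it
would replace the bch_cache_set_unregister() call inside
bch_cache_set_error(), where c is the failing cache set, and all
locking/teardown details are ignored):

    struct cached_dev *dc;

    /* Instead of tearing down the cache set, push every attached
     * backing device to writearound so no new data enters the cache,
     * and persist that in the backing superblock.
     */
    list_for_each_entry(dc, &c->cached_devs, list) {
            SET_BDEV_CACHE_MODE(&dc->sb, CACHE_MODE_WRITEAROUND);
            bch_write_bdev_super(dc, NULL);
    }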

Earlier I wrote:

> > This is probably still an issue. The cache attachment MUST NEVER EVER
> > automatically degrade to "none" which it did for my fail-cases I had

This was meant under the assumption that "none" is the state after
unregister - just to differentiate from what I wrote immediately
before.


> What do you think?
>
> > But because bcache explicitly does not honor write-barriers from
> > upstream writes for its own writeback (which is okay because it
> > guarantees to write back all data anyways and give a consistent view to
> > upstream FS - well, unless it has to handle write errors), the backed
> > filesystem is guaranteed to be effed up in that case, and allowing it to
> > mount and write because bcache silently has fallen back to "none" will
> > only make the matter worse.
> >
> > (HINT: I never used brbd personally, most of the following is
> > theoretical thinking without real-world experience)
> >
> > I see that you're using drbd? Did it fail due to networking issues?
> > I'm pretty sure it should be robust in that case but maybe bcache
> > cannot handle the situation? Does brbd have a write log to replay
> > writes after network connection loss? It looks like it doesn't and
> > thus bcache exploded.
>
> DRBD is _above_ bcache, not below it.  In this case, DRBD hung because
> bcache hung, not the other way around, so DRBD is not the issue here.
> Here is our stack:
>
> bcache:
>         bdev:     /dev/sda hardware RAID5
>         cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
>
> And then bcache is stacked like so:
>
>         bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
>                               |
>                               v
>                          [remote host]
>
> > Anyways, since your backing device seems to be on drbd, using metadata
> > allocation hinting is probably no option. You could of course still use
> > drbd with bcache for metadata hinted partitions, and then use
> > writearound caching only for that. At least, in the fail-case, your
> > btrfs won't be destroyed. But your data chunks may have unreadable files
> > then. But it should be easy to select them and restore from backup
> > individually. Btrfs is very robust for that fail case: if metadata is
> > okay, data errors are properly detected and handled. If you're not using
> > btrfs, all of this doesn't apply ofc.
> >
> > I'm not sure if write-back caching for drbd backing is a wise decision
> > anyways. drbd is slow for writes, that's part of the design (and no
> > writeback caching could fix that).
>
> Bcache-backed DRBD provides a noticable difference, especially with a
> 10GbE link (or faster) and the same disk stack on both sides.
>
> > I would not rely on bcache-writeback to fix that for you because it is
> > not prepared for storage that may be temporarily not available
>
> True, which is why we put drbd /on top/ of bcache, so bcache is unaware of
> DRBD's existence.
>
> > iow, it would freeze and continue when drbd is available again. I think
> > you should really use writearound/writethrough so your FS can be sure
> > data has been written, replicated and persisted. In case of btrfs, you
> > could still split data and metadata as written above, and use writeback
> > for data, but reliable writes for metadata.
> >
> > So concluding:
> >
> > 1. I'm now persisting metadata directly to disk with no intermediate
> > layers (no bcache, no md)
> >
> > 2. I'm using allocation-hinted data-only partitions with bcache
> > write-back, with bcache on mdraid1. If anything goes wrong, I have
> > file crc errors in btrfs files only, but the filesystem itself is
> > valid because no metadata is broken or lost. I have snapshots of
> > recently modified files. I have daily backups.
> >
> > 3. Your problem is that bcache can - by design - detect write errors
> > only when it's too late with no chance telling the filesystem. In that
> > case, writethrough/writearound is the correct choice.
> >
> > 4. Maybe bcache should know if backing is on storage that may be
> > temporarily unavailable and then freeze until the backing storage is
> > back online, similar to how iSCSI handles that.
>
> I don't think "temporarily unavailable" should be bcache's burden, as
> bcache is a local-only solution.  If someone is using iSCSI under bcache,
> then good luck ;)
>
> > But otoh, maybe drbd should freeze until the replicated storage is
> > available again while writing (from what I've read, it's designed to not
> > do that but let local storage get ahead of the replica, which is btw
> > incompatible with bcache-writeback assumptions).
>
> N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected
> and has no local copy for some reason.  If local storage is available, it
> will use that and resync when its peer comes up.
>
> > Or maybe using async mirroring can fix this for you but then, the mirror
> > will be compromised if a hardware failure immediately follows a previous
> > drbd network connection loss. But, it may still be an issue with the
> > local hardware (bit-flips etc) because maybe just bcache internals broke
> > - Coly may have a better idea of that.
>
> This isn't DRBD's fault since it is above bcache. I only wish to address
> the bcache cache=none issue.
>
> -Eric
>
> >
> > I think your main issue here is that bcache decouples writebarriers
> > from the underlying backing storage - and you should just not use
> > writeback, it is incompatible by design with how drbd works: your
> > replica will be broken when you need it.
>
>
> >
> >
> > > Here is our trace:
> > >
> > > [Sep 6 13:01] bcache: bch_cache_set_error() error on
> > > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
> > > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
> > > 0:1163806048 gen 3: bad, length too big, disabling caching
> >
> > > [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> > > [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> > > [  +0.000809] Workqueue: bcache bch_data_insert_keys
> > > [  +0.000826] Call Trace:
> > > [  +0.000797]  <TASK>
> > > [  +0.000006]  dump_stack_lvl+0x57/0x7e
> > > [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
> > > [  +0.000759]  btree_mergesort+0x27e/0x36e
> > > [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> > > [  +0.000009]  __btree_sort+0xa4/0x1e9
> > > [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> > > [  +0.000836]  bch_btree_init_next+0x39/0xb6
> > > [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> > > [  +0.000863]  btree_insert_fn+0x20/0x48
> > > [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> > > [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> > > [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > [  +0.000848]  bch_btree_insert+0x102/0x188
> > > [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> > > [  +0.000857]  bch_data_insert_keys+0x39/0xde
> > > [  +0.000845]  process_one_work+0x280/0x5cf
> > > [  +0.000858]  worker_thread+0x52/0x3bd
> > > [  +0.000851]  ? process_one_work.cold+0x52/0x51
> > > [  +0.000877]  kthread+0x13e/0x15b
> > > [  +0.000858]  ? set_kthread_struct+0x60/0x52
> > > [  +0.000855]  ret_from_fork+0x22/0x2d
> > > [  +0.000854]  </TASK>
> >
> >
> > Regards,
> > Kai
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2023-09-07 12:00                 ` Kai Krakow
@ 2023-09-07 19:10                   ` Eric Wheeler
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Wheeler @ 2023-09-07 19:10 UTC (permalink / raw)
  To: Kai Krakow
  Cc: Coly Li, linux-bcache,
	吴本卿(云桌面
	福州),
	Mingzhe Zou

On Thu, 7 Sep 2023, Kai Krakow wrote:
> Am Do., 7. Sept. 2023 um 02:42 Uhr schrieb Eric Wheeler
> <lists@bcache.ewheeler.net>:
> >
> > +Mingzhe, Coly: please comment on the proposed fix below when you have a
> > moment:
> >
> > > > Coly, is there already a patch to prevent complete dirty cache loss?
> > >
> > > This is probably still an issue. The cache attachment MUST NEVER EVER
> > > automatically degrade to "none" which it did for my fail-cases I had
> > > back then. I don't know if this has changed meanwhile.
> >
> > I would rather that bcache went to a read-only mode in failure
> > conditions like this.  Maybe write-around would be acceptable since
> > bcache returns -EIO for any failed dirty cache reads.  But if the cache
> > is dirty, and it gets an error, it _must_never_ read from the bdev, which
> > is what appears to happen now.
> >
> > Coly, Mingzhe, would this be an easy change?
> >
> > Here are the relevant bits:
> >
> > The allocator called btree_mergesort which called bch_extent_invalid:
> >         https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
> >
> > Which called the `cache_bug` macro, which triggered bch_cache_set_error:
> >         https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
> >
> > It then calls `bch_cache_set_unregister` which shuts down the cache:
> >         https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
> >
> >         bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
> >         {
> >                 ...
> >                 bch_cache_set_unregister(c);
> >                 return true;
> >         }
> >
> > Proposed solution:
> >
> > What if, instead of bch_cache_set_unregister() that this was called instead:
> >         SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
> >
> > This would bypass the cache for future writes, and allow reads to
> > proceed if possible, and -EIO otherwise to let upper layers handle the
> > failure.
> 
> Ensuring to not read stale content from bdev by switching to
> writearound is probably a proper solution - if there are no other
> side-effects. But due to the error, the cdev may be in some broken
> limbo state. So it should probably try to writeback dirty data while
> adding no more future data - neither for read-caching nor
> write-caching. Maybe this was the intention of unregister but instead
> of writing back dirty data and still serving dirty data from cdev, it
> immediately unregisters and invalidates the cdev.
> 
> So maybe the bugfix should be about why unregister() doesn't write
> back dirty data first...

So maybe it should "detach" in the same way that 
/sys/block/bcache0/bcache/detach triggers removal of the cache.

There seem to be three proposed graceful failure states in this situation:

1. Set read-only for all bcache gendisk devices that use the failed cache
(a rough sketch of this follows below).
2. Set write-around and try to continue.
3. "Detach" the cache for all bcache devices using the failed cache.  If 
this fails, then maybe fall back to #1 or #2.
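
A rough sketch of option 1, with a hypothetical helper called from the
error path; iteration and locking details are simplified:

    /* Hypothetical, not mainline: make every bcache device belonging
     * to the failing cache set read-only so upper layers stop issuing
     * writes.
     */
    static void cache_set_force_readonly(struct cache_set *c)
    {
            unsigned int i;

            for (i = 0; i < c->devices_max_used; i++) {
                    struct bcache_device *d = c->devices[i];

                    if (d && d->disk)
                            set_disk_ro(d->disk, true);
            }
    }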

Coly, Mingzhe, what do you think would be best in terms of implementation?

--
Eric Wheeler



> 
> So actually switching to "none" but without unregister should probably
> provide that exact behavior? No more read/write but finishing
> outstanding dirty writeback.
> 
> Earlier I write:
> 
> > > This is probably still an issue. The cache attachment MUST NEVER EVER
> > > automatically degrade to "none" which it did for my fail-cases I had
> 
> This was meant under the assumption that "none" is the state after
> unregister - just to differentiate from what I wrote immediately
> before.
> 
> 
> > What do you think?
> >
> > > But because bcache explicitly does not honor write-barriers from
> > > upstream writes for its own writeback (which is okay because it
> > > guarantees to write back all data anyways and give a consistent view to
> > > upstream FS - well, unless it has to handle write errors), the backed
> > > filesystem is guaranteed to be effed up in that case, and allowing it to
> > > mount and write because bcache silently has fallen back to "none" will
> > > only make the matter worse.
> > >
> > > (HINT: I never used brbd personally, most of the following is
> > > theoretical thinking without real-world experience)
> > >
> > > I see that you're using drbd? Did it fail due to networking issues?
> > > I'm pretty sure it should be robust in that case but maybe bcache
> > > cannot handle the situation? Does brbd have a write log to replay
> > > writes after network connection loss? It looks like it doesn't and
> > > thus bcache exploded.
> >
> > DRBD is _above_ bcache, not below it.  In this case, DRBD hung because
> > bcache hung, not the other way around, so DRBD is not the issue here.
> > Here is our stack:
> >
> > bcache:
> >         bdev:     /dev/sda hardware RAID5
> >         cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
> >
> > And then bcache is stacked like so:
> >
> >         bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
> >                               |
> >                               v
> >                          [remote host]
> >
> > > Anyways, since your backing device seems to be on drbd, using metadata
> > > allocation hinting is probably no option. You could of course still use
> > > drbd with bcache for metadata hinted partitions, and then use
> > > writearound caching only for that. At least, in the fail-case, your
> > > btrfs won't be destroyed. But your data chunks may have unreadable files
> > > then. But it should be easy to select them and restore from backup
> > > individually. Btrfs is very robust for that fail case: if metadata is
> > > okay, data errors are properly detected and handled. If you're not using
> > > btrfs, all of this doesn't apply ofc.
> > >
> > > I'm not sure if write-back caching for drbd backing is a wise decision
> > > anyways. drbd is slow for writes, that's part of the design (and no
> > > writeback caching could fix that).
> >
> > Bcache-backed DRBD provides a noticeable difference, especially with a
> > 10GbE link (or faster) and the same disk stack on both sides.
> >
> > > I would not rely on bcache-writeback to fix that for you because it is
> > > not prepared for storage that may be temporarily not available
> >
> > True, which is why we put drbd /on top/ of bcache, so bcache is unaware of
> > DRBD's existence.
> >
> > > iow, it would freeze and continue when drbd is available again. I think
> > > you should really use writearound/writethrough so your FS can be sure
> > > data has been written, replicated and persisted. In case of btrfs, you
> > > could still split data and metadata as written above, and use writeback
> > > for data, but reliable writes for metadata.
> > >
> > > So concluding:
> > >
> > > 1. I'm now persisting metadata directly to disk with no intermediate
> > > layers (no bcache, no md)
> > >
> > > 2. I'm using allocation-hinted data-only partitions with bcache
> > > write-back, with bcache on mdraid1. If anything goes wrong, I have
> > > file crc errors in btrfs files only, but the filesystem itself is
> > > valid because no metadata is broken or lost. I have snapshots of
> > > recently modified files. I have daily backups.
> > >
> > > 3. Your problem is that bcache can - by design - detect write errors
> > > only when it's too late with no chance telling the filesystem. In that
> > > case, writethrough/writearound is the correct choice.
> > >
> > > 4. Maybe bcache should know if backing is on storage that may be
> > > temporarily unavailable and then freeze until the backing storage is
> > > back online, similar to how iSCSI handles that.
> >
> > I don't think "temporarily unavailable" should be bcache's burden, as
> > bcache is a local-only solution.  If someone is using iSCSI under bcache,
> > then good luck ;)
> >
> > > But otoh, maybe drbd should freeze until the replicated storage is
> > > available again while writing (from what I've read, it's designed to not
> > > do that but let local storage get ahead of the replica, which is btw
> > > incompatible with bcache-writeback assumptions).
> >
> > N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected
> > and has no local copy for some reason.  If local storage is available, it
> > will use that and resync when its peer comes up.
> >
> > > Or maybe using async mirroring can fix this for you but then, the mirror
> > > will be compromised if a hardware failure immediately follows a previous
> > > drbd network connection loss. But, it may still be an issue with the
> > > local hardware (bit-flips etc) because maybe just bcache internals broke
> > > - Coly may have a better idea of that.
> >
> > This isn't DRBDs fault since it is above bcache. I wish only address the
> > the bcache cache=none issue.
> >
> > -Eric
> >
> > >
> > > I think your main issue here is that bcache decouples writebarriers
> > > from the underlying backing storage - and you should just not use
> > > writeback, it is incompatible by design with how drbd works: your
> > > replica will be broken when you need it.
> >
> >
> > >
> > >
> > > > Here is our trace:
> > > >
> > > > [Sep 6 13:01] bcache: bch_cache_set_error() error on
> > > > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
> > > > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
> > > > 0:1163806048 gen 3: bad, length too big, disabling caching
> > >
> > > > [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> > > > [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> > > > [  +0.000809] Workqueue: bcache bch_data_insert_keys
> > > > [  +0.000826] Call Trace:
> > > > [  +0.000797]  <TASK>
> > > > [  +0.000006]  dump_stack_lvl+0x57/0x7e
> > > > [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
> > > > [  +0.000759]  btree_mergesort+0x27e/0x36e
> > > > [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> > > > [  +0.000009]  __btree_sort+0xa4/0x1e9
> > > > [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> > > > [  +0.000836]  bch_btree_init_next+0x39/0xb6
> > > > [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> > > > [  +0.000863]  btree_insert_fn+0x20/0x48
> > > > [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> > > > [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > > [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> > > > [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > > [  +0.000848]  bch_btree_insert+0x102/0x188
> > > > [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> > > > [  +0.000857]  bch_data_insert_keys+0x39/0xde
> > > > [  +0.000845]  process_one_work+0x280/0x5cf
> > > > [  +0.000858]  worker_thread+0x52/0x3bd
> > > > [  +0.000851]  ? process_one_work.cold+0x52/0x51
> > > > [  +0.000877]  kthread+0x13e/0x15b
> > > > [  +0.000858]  ? set_kthread_struct+0x60/0x52
> > > > [  +0.000855]  ret_from_fork+0x22/0x2d
> > > > [  +0.000854]  </TASK>
> > >
> > >
> > > Regards,
> > > Kai
> > >
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re:Re: Dirty data loss after cache disk error recovery
       [not found]               ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
  2023-09-07 12:00                 ` Kai Krakow
@ 2023-09-12  6:54                 ` 邹明哲
       [not found]                   ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
  1 sibling, 1 reply; 17+ messages in thread
From: 邹明哲 @ 2023-09-12  6:54 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Coly Li, Kai Krakow, linux-bcache,
	吴本卿(云桌面
	福州)

From: Eric Wheeler <lists@bcache.ewheeler.net>
Date: 2023-09-07 08:42:41
To:  Coly Li <colyli@suse.de>
Cc:  Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn>
Subject: Re: Dirty data loss after cache disk error recovery
>+Mingzhe, Coly: please comment on the proposed fix below when you have a 
>moment:

Hi, Eric

This is an old issue, and it took me a long time to understand what
happened.

>
>On Thu, 7 Sep 2023, Kai Krakow wrote:
>> Wow!
>> 
>> I call that a necro-bump... ;-)
>> 
>> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler
>> <lists@bcache.ewheeler.net>:
>> >
>> > On Fri, 7 May 2021, Kai Krakow wrote:
>> >
>> > > > Adding a new "stop" error action IMHO doesn't make things better. When
>> > > > the cache device is disconnected, it is always risky that some caching
>> > > > data or meta data is not updated onto cache device. Permit the cache
>> > > > device to be re-attached to the backing device may introduce "silent
>> > > > data loss" which might be worse....  It was the reason why I didn't add
>> > > > new error action for the device failure handling patch set.
>> > >
>> > > But we are actually now seeing silent data loss: The system f'ed up
>> > > somehow, needed a hard reset, and after reboot the bcache device was
>> > > accessible in cache mode "none" (because they have been unregistered
>> > > before, and because udev just detected it and you can use bcache
>> > > without an attached cache in "none" mode), completely hiding the fact
>> > > that we lost dirty write-back data, it's even not quite obvious that
>> > > /dev/bcache0 now is detached, cache mode none, but accessible
>> > > nevertheless. To me, this is quite clearly "silent data loss",
>> > > especially since the unregister action threw the dirty data away.
>> > >
>> > > So this:
>> > >
>> > > > Permit the cache
>> > > > device to be re-attached to the backing device may introduce "silent
>> > > > data loss" which might be worse....
>> > >
>> > > is actually the situation we are facing currently: Device has been
>> > > unregistered, after reboot, udev detects it has clean backing device
>> > > without cache association, using cache mode none, and it is readable
>> > > and writable just fine: It essentially permitted access to the stale
>> > > backing device (tho, it didn't re-attach as you outlined, but that's
>> > > more or less the same situation).
>> > >
>> > > Maybe devices that become disassociated from a cache due to IO errors
>> > > but have dirty data should go to a caching mode "stale", and bcache
>> > > should refuse to access such devices or throw away their dirty data
>> > > until I decide to force them back online into the cache set or force
>> > > discard the dirty data. Then at least I would discover that something
>> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't
>> > > written. In the best case, that makes my FS unmountable, in the worst
>> > > case, some file data is simply lost (aka silent data loss), besides
>> > > both situations are the worst-case scenario anyways.
>> > >
>> > > The whole situation probably comes from udev auto-registering bcache
>> > > backing devices again, and bcache has no record of why the device was
>> > > unregistered - it looks clean after such a situation.
>> 
>> [...]
>> 
>> > I think we hit this same issue from 2021. Here is that original thread from 2021:
>> >         https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
>> >
>> > Kai, did you end up with a good patch for this? We are running a 5.15
>> > kernel with the many backported bcache commits that Coly suggested here:
>> >         https://www.spinics.net/lists/linux-bcache/msg12084.html
>> 
>> I'm currently running 6.1 with bcache on mdraid1 and device-level
>> write caching disabled. I didn't see this ever occur again.
>
>Awesome, good to know.
>
>> But as written above, I had bad RAM, and meanwhile upgraded to kernel 
>> 6.1, and had no issues since with bcache even on power loss.
>> 
>> > Coly, is there already a patch to prevent complete dirty cache loss?
>> 
>> This is probably still an issue. The cache attachment MUST NEVER EVER
>> automatically degrade to "none" which it did for my fail-cases I had
>> back then. I don't know if this has changed meanwhile.
>
>I would rather that bcache went to a read-only mode in failure
>conditions like this.  Maybe write-around would be acceptable since
>bcache returns -EIO for any failed dirty cache reads.  But if the cache
>is dirty, and it gets an error, it _must_never_ read from the bdev, which
>is what appears to happens now.
>
>Coly, Mingzhe, would this be an easy change?

First of all, we have never hit this problem ourselves. We did have an
NVMe controller failure, but in that case the cache could not be read or
written at all, so even the unregister did not succeed.

Coly once replied like this:

"""
There is an option to panic the system when cache device failed. It
is in errors file with available options as "unregister" and "panic".
This option is default set to "unregister", if you set it to "panic"
then panic() will be called.
"""

I think "panic" is a better way to handle this situation. If cache
returns an error, there may be more unknown errors if the operation
continues.
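
For readers who want to see where these two actions live, here is a rough
sketch of the current error path, paraphrased from drivers/md/bcache/super.c
(not verbatim, and details vary between kernel versions): the "errors" sysfs
setting only chooses between panic() and the unregister path that ends up
detaching the backing devices.

    /* Simplified sketch of bch_cache_set_error() from super.c */
    bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
    {
        /* ... log the error and set CACHE_SET_IO_DISABLE ... */

        /* /sys/fs/bcache/<uuid>/errors == "panic" */
        if (c->on_error == ON_ERROR_PANIC)
            panic("panic forced after error\n");

        /*
         * Default action ("unregister"): tears down the cache set,
         * which detaches the backing devices and is what makes the
         * dirty data unreachable after a later re-attach.
         */
        bch_cache_set_unregister(c);
        return true;
    }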

>
>Here are the relevant bits:
>
>The allocator called btree_mergesort which called bch_extent_invalid:
>	https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
>
>Which called the `cache_bug` macro, which triggered bch_cache_set_error:
>	https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
>
>It then calls `bch_cache_set_unregister` which shuts down the cache:
>	https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
>
>	bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>	{
>		...
>		bch_cache_set_unregister(c);
>		return true;
>	}
>
>Proposed solution:
>
>What if, instead of bch_cache_set_unregister() that this was called instead:
>	SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)

If cache_mode can be modified automatically, when would it be restored
to writeback? I think users need to be able to enable or disable this
behaviour.
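
As a purely hypothetical sketch (ON_ERROR_WRITEAROUND and
bch_force_write_around() do not exist in bcache today), the idea quoted
above plus an explicit opt-in could be expressed as an additional value of
the existing "errors" action, so that nothing changes unless the user asks
for it:

    switch (c->on_error) {
    case ON_ERROR_PANIC:
        panic("panic forced after error\n");
    case ON_ERROR_WRITEAROUND:          /* hypothetical new action */
        /* stop caching writes; keep serving whatever can still be read */
        bch_force_write_around(c);      /* hypothetical helper */
        break;
    case ON_ERROR_UNREGISTER:
    default:
        bch_cache_set_unregister(c);
    }

Restoring writeback after the admin has repaired or replaced the cache
device would then stay a manual step, just like recovering from
"unregister" is today.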

>
>This would bypass the cache for future writes, and allow reads to
>proceed if possible, and -EIO otherwise to let upper layers handle the
>failure.
>
>What do you think?

If we switch to writearound mode, how do we ensure that the IO path stays
read-only with respect to the cache? A write IO may require invalidating
overlapping dirty data in the cache. If the write to the backing device
succeeds but the invalidation fails, how should we handle that?

Maybe "panic" could be the default option. What do you think?

>
>> But because bcache explicitly does not honor write-barriers from 
>> upstream writes for its own writeback (which is okay because it 
>> guarantees to write back all data anyways and give a consistent view to 
>> upstream FS - well, unless it has to handle write errors), the backed 
>> filesystem is guaranteed to be effed up in that case, and allowing it to 
>> mount and write because bcache silently has fallen back to "none" will 
>> only make the matter worse.
>> 
>> (HINT: I never used brbd personally, most of the following is
>> theoretical thinking without real-world experience)
>> 
>> I see that you're using drbd? Did it fail due to networking issues?
>> I'm pretty sure it should be robust in that case but maybe bcache
>> cannot handle the situation? Does brbd have a write log to replay
>> writes after network connection loss? It looks like it doesn't and
>> thus bcache exploded.
>
>DRBD is _above_ bcache, not below it.  In this case, DRBD hung because
>bcache hung, not the other way around, so DRBD is not the issue here.
>Here is our stack:
>
>bcache: 
>	bdev:     /dev/sda hardware RAID5
>	cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
>
>And then bcache is stacked like so:
>
>        bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
>                              |
>                              v
>                         [remote host]
>
>> Anyways, since your backing device seems to be on drbd, using metadata 
>> allocation hinting is probably no option. You could of course still use 
>> drbd with bcache for metadata hinted partitions, and then use 
>> writearound caching only for that. At least, in the fail-case, your 
>> btrfs won't be destroyed. But your data chunks may have unreadable files 
>> then. But it should be easy to select them and restore from backup 
>> individually. Btrfs is very robust for that fail case: if metadata is 
>> okay, data errors are properly detected and handled. If you're not using 
>> btrfs, all of this doesn't apply ofc.
>> 
>> I'm not sure if write-back caching for drbd backing is a wise decision
>> anyways. drbd is slow for writes, that's part of the design (and no
>> writeback caching could fix that).
>
>Bcache-backed DRBD provides a noticable difference, especially with a 
>10GbE link (or faster) and the same disk stack on both sides.
>
>> I would not rely on bcache-writeback to fix that for you because it is 
>> not prepared for storage that may be temporarily not available
>
>True, which is why we put drbd /on top/ of bcache, so bcache is unaware of 
>DRBD's existence.
>
>> iow, it would freeze and continue when drbd is available again. I think 
>> you should really use writearound/writethrough so your FS can be sure 
>> data has been written, replicated and persisted. In case of btrfs, you 
>> could still split data and metadata as written above, and use writeback 
>> for data, but reliable writes for metadata.
>> 
>> So concluding:
>> 
>> 1. I'm now persisting metadata directly to disk with no intermediate
>> layers (no bcache, no md)
>> 
>> 2. I'm using allocation-hinted data-only partitions with bcache
>> write-back, with bcache on mdraid1. If anything goes wrong, I have
>> file crc errors in btrfs files only, but the filesystem itself is
>> valid because no metadata is broken or lost. I have snapshots of
>> recently modified files. I have daily backups.
>> 
>> 3. Your problem is that bcache can - by design - detect write errors
>> only when it's too late with no chance telling the filesystem. In that
>> case, writethrough/writearound is the correct choice.
>>
>> 4. Maybe bcache should know if backing is on storage that may be
>> temporarily unavailable and then freeze until the backing storage is
>> back online, similar to how iSCSI handles that.
>
>I don't think "temporarily unavailable" should be bcache's burden, as 
>bcache is a local-only solution.  If someone is using iSCSI under bcache, 
>then good luck ;)
>
>> But otoh, maybe drbd should freeze until the replicated storage is 
>> available again while writing (from what I've read, it's designed to not 
>> do that but let local storage get ahead of the replica, which is btw 
>> incompatible with bcache-writeback assumptions).
>
>N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected 
>and has no local copy for some reason.  If local storage is available, it 
>will use that and resync when its peer comes up.
>
>> Or maybe using async mirroring can fix this for you but then, the mirror 
>> will be compromised if a hardware failure immediately follows a previous 
>> drbd network connection loss. But, it may still be an issue with the 
>> local hardware (bit-flips etc) because maybe just bcache internals broke 
>> - Coly may have a better idea of that.
>
>This isn't DRBDs fault since it is above bcache. I wish only address the 
>the bcache cache=none issue.
>
>-Eric
>
>> 
>> I think your main issue here is that bcache decouples writebarriers
>> from the underlying backing storage - and you should just not use
>> writeback, it is incompatible by design with how drbd works: your
>> replica will be broken when you need it.
>
>
>> 
>> 
>> > Here is our trace:
>> >
>> > [Sep 6 13:01] bcache: bch_cache_set_error() error on
>> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
>> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
>> > 0:1163806048 gen 3: bad, length too big, disabling caching
>>
>> > [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
>> > [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
>> > [  +0.000809] Workqueue: bcache bch_data_insert_keys
>> > [  +0.000826] Call Trace:
>> > [  +0.000797]  <TASK>
>> > [  +0.000006]  dump_stack_lvl+0x57/0x7e
>> > [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
>> > [  +0.000759]  btree_mergesort+0x27e/0x36e
>> > [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
>> > [  +0.000009]  __btree_sort+0xa4/0x1e9
>> > [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
>> > [  +0.000836]  bch_btree_init_next+0x39/0xb6
>> > [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
>> > [  +0.000863]  btree_insert_fn+0x20/0x48
>> > [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
>> > [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
>> > [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
>> > [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
>> > [  +0.000848]  bch_btree_insert+0x102/0x188
>> > [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
>> > [  +0.000857]  bch_data_insert_keys+0x39/0xde
>> > [  +0.000845]  process_one_work+0x280/0x5cf
>> > [  +0.000858]  worker_thread+0x52/0x3bd
>> > [  +0.000851]  ? process_one_work.cold+0x52/0x51
>> > [  +0.000877]  kthread+0x13e/0x15b
>> > [  +0.000858]  ? set_kthread_struct+0x60/0x52
>> > [  +0.000855]  ret_from_fork+0x22/0x2d
>> > [  +0.000854]  </TASK>
>> 
>> 
>> Regards,
>> Kai
>> 





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: Dirty data loss after cache disk error recovery
       [not found]                   ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
@ 2023-10-11 16:19                     ` Kai Krakow
  2023-10-16 23:39                       ` Eric Wheeler
  2023-10-11 16:29                     ` Kai Krakow
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-10-11 16:19 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲,
	Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)

Hello!

Sorry for the top-posting. I just want to share my story without
removing all of the context:

I've now faced a similar issue where one of my HDDs spontaneously
decided to have a series of bad blocks. It looks like it had 26145
failed writes due to how bcache handles writeback. It had 5275 failed
reads with btrfs loudly complaining about it. The system also became
really slow to respond until it eventually froze.

After a reboot it worked again, but of course there were still bad
blocks because bcache did writeback, so no blocks had been repaired by
the btrfs auto-repair-on-read feature. This time, the system handled
the situation a bit better, but files became inaccessible in the middle
of writing them, which destroyed my Plasma desktop configuration and
Chrome profile (I restored them from the last snapper snapshot
successfully). Essentially, the file system was in a readonly-like
state: most requests failed with IO errors even though btrfs didn't
switch to read-only. Something messed up in the error path of
userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the
device somewhere in the limbo of not existing and not working - it
still tried to access it while bcache claimed the backing device was
missing. To me this looks like bcache error handling may need some
fine tuning - it should not fail in that way, especially not with
btrfs-raid, but still the system was seeing IO errors and broken files
in the middle of writes.

"bcache show" showed the backend device missing while "btrfs dev show"
was still seeing the attached bcache device, and the system threw IO
errors to user-space despite btrfs still having a valid copy of the
blocks.

I've rebooted and now switched the bad device from bcache writeback to
bcache none - and guess what: the system runs stably now, and btrfs
auto-repair does its thing. The above-mentioned behavior (IO errors in
user-space) no longer occurs. A final scrub across the bad devices
repaired the bad blocks; I currently do not see any more problems.

It's probably better to replace that device, but this also shows that
switching bcache to "none" (if the backing device fails), or at least to
"writethrough", may be a better choice than doing some other error
handling. Or bcache should have been able to make btrfs see the device
as missing (which obviously did not happen).

Of course, if the cache device fails we have a completely different
situation. I'm not sure which situation Eric was seeing (I think the
caching device failed) but for me, the backing device failed - and
with bcache involved, the result was very unexpected.

So we probably need at least two error handlers: Handling caching
device errors, and handling backing device errors (for which bcache
doesn't currently seem to have a setting).

Except for the strange IO errors and resulting incomplete writes (and
I really don't know why that happened), btrfs survived this perfectly
well - and somehow bcache did a good enough job. This has been
different in the past. So this is already a great achievement. Thank
you.

BTW: This probably only worked for me because I split btrfs metadata
and data to different devices
(https://github.com/kakra/linux/pull/26), and metadata does not pass
through bcache at all but goes natively to the SSD. Otherwise I fear btrfs may
have seen partial metadata writes on different RAID members.

Regards,
Kai


On Tue, 12 Sept 2023 at 22:02, Eric Wheeler
<lists@bcache.ewheeler.net> wrote:
>
> On Tue, 12 Sep 2023, 邹明哲 wrote:
> > From: Eric Wheeler <lists@bcache.ewheeler.net>
> > Date: 2023-09-07 08:42:41
> > To:  Coly Li <colyli@suse.de>
> > Cc:  Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn>
> > Subject: Re: Dirty data loss after cache disk error recovery
> > >+Mingzhe, Coly: please comment on the proposed fix below when you have a
> > >moment:
> >
> > Hi, Eric
> >
> > This is an old issue, and it took me a long time to understand what
> > happened.
> >
> > >
> > >On Thu, 7 Sep 2023, Kai Krakow wrote:
> > >> Wow!
> > >>
> > >> I call that a necro-bump... ;-)
> > >>
> > >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler
> > >> <lists@bcache.ewheeler.net>:
> > >> >
> > >> > On Fri, 7 May 2021, Kai Krakow wrote:
> > >> >
> > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When
> > >> > > > the cache device is disconnected, it is always risky that some caching
> > >> > > > data or meta data is not updated onto cache device. Permit the cache
> > >> > > > device to be re-attached to the backing device may introduce "silent
> > >> > > > data loss" which might be worse....  It was the reason why I didn't add
> > >> > > > new error action for the device failure handling patch set.
> > >> > >
> > >> > > But we are actually now seeing silent data loss: The system f'ed up
> > >> > > somehow, needed a hard reset, and after reboot the bcache device was
> > >> > > accessible in cache mode "none" (because they have been unregistered
> > >> > > before, and because udev just detected it and you can use bcache
> > >> > > without an attached cache in "none" mode), completely hiding the fact
> > >> > > that we lost dirty write-back data, it's even not quite obvious that
> > >> > > /dev/bcache0 now is detached, cache mode none, but accessible
> > >> > > nevertheless. To me, this is quite clearly "silent data loss",
> > >> > > especially since the unregister action threw the dirty data away.
> > >> > >
> > >> > > So this:
> > >> > >
> > >> > > > Permit the cache
> > >> > > > device to be re-attached to the backing device may introduce "silent
> > >> > > > data loss" which might be worse....
> > >> > >
> > >> > > is actually the situation we are facing currently: Device has been
> > >> > > unregistered, after reboot, udev detects it has clean backing device
> > >> > > without cache association, using cache mode none, and it is readable
> > >> > > and writable just fine: It essentially permitted access to the stale
> > >> > > backing device (tho, it didn't re-attach as you outlined, but that's
> > >> > > more or less the same situation).
> > >> > >
> > >> > > Maybe devices that become disassociated from a cache due to IO errors
> > >> > > but have dirty data should go to a caching mode "stale", and bcache
> > >> > > should refuse to access such devices or throw away their dirty data
> > >> > > until I decide to force them back online into the cache set or force
> > >> > > discard the dirty data. Then at least I would discover that something
> > >> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't
> > >> > > written. In the best case, that makes my FS unmountable, in the worst
> > >> > > case, some file data is simply lost (aka silent data loss), besides
> > >> > > both situations are the worst-case scenario anyways.
> > >> > >
> > >> > > The whole situation probably comes from udev auto-registering bcache
> > >> > > backing devices again, and bcache has no record of why the device was
> > >> > > unregistered - it looks clean after such a situation.
> > >>
> > >> [...]
> > >>
> > >> > I think we hit this same issue from 2021. Here is that original thread from 2021:
> > >> >         https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
> > >> >
> > >> > Kai, did you end up with a good patch for this? We are running a 5.15
> > >> > kernel with the many backported bcache commits that Coly suggested here:
> > >> >         https://www.spinics.net/lists/linux-bcache/msg12084.html
> > >>
> > >> I'm currently running 6.1 with bcache on mdraid1 and device-level
> > >> write caching disabled. I didn't see this ever occur again.
> > >
> > >Awesome, good to know.
> > >
> > >> But as written above, I had bad RAM, and meanwhile upgraded to kernel
> > >> 6.1, and had no issues since with bcache even on power loss.
> > >>
> > >> > Coly, is there already a patch to prevent complete dirty cache loss?
> > >>
> > >> This is probably still an issue. The cache attachment MUST NEVER EVER
> > >> automatically degrade to "none" which it did for my fail-cases I had
> > >> back then. I don't know if this has changed meanwhile.
> > >
> > >I would rather that bcache went to a read-only mode in failure
> > >conditions like this.  Maybe write-around would be acceptable since
> > >bcache returns -EIO for any failed dirty cache reads.  But if the cache
> > >is dirty, and it gets an error, it _must_never_ read from the bdev, which
> > >is what appears to happens now.
> > >
> > >Coly, Mingzhe, would this be an easy change?
> >
> > First of all, we have never had this problem. We have had an nvme
> > controller failure, but at this time the cache cannot be read or
> > written, so even unregister will not succeed.
> >
> > Coly once replied like this:
> >
> > """
> > There is an option to panic the system when cache device failed. It
> > is in errors file with available options as "unregister" and "panic".
> > This option is default set to "unregister", if you set it to "panic"
> > then panic() will be called.
> > """
> >
> > I think "panic" is a better way to handle this situation. If cache
> > returns an error, there may be more unknown errors if the operation
> > continues.
>
> Depending on how the block devices are stacked, the OS can continue if
> bcache fails (eg, bcache under raid1, drbd, etc).  Returning IO requests
> with -EIO or setting bcache read-only would be better, because a panic
> would crash services that could otherwise proceed without noticing the
> bcache outage.
>
> If bcache has a critical failure, I would rather that it fail the IOs so
> upper-layers in the block stack can compensate.
>
> What if we extend /sys/fs/bcache/<uuid>/errors to include a "readonly"
> option, and make that the default setting?  The gendisk(s) for related
> /dev/bcacheX devices can be flagged BLKROSET in the error handler:
>         https://patchwork.kernel.org/project/dm-devel/patch/20201129181926.897775-2-hch@lst.de/
>
> This would protect the data and keep the host online.
>
> --
> Eric Wheeler
>
>
>
> >
> > >
> > >Here are the relevant bits:
> > >
> > >The allocator called btree_mergesort which called bch_extent_invalid:
> > >     https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
> > >
> > >Which called the `cache_bug` macro, which triggered bch_cache_set_error:
> > >     https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
> > >
> > >It then calls `bch_cache_set_unregister` which shuts down the cache:
> > >     https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
> > >
> > >     bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
> > >     {
> > >             ...
> > >             bch_cache_set_unregister(c);
> > >             return true;
> > >     }
> > >
> > >Proposed solution:
> > >
> > >What if, instead of bch_cache_set_unregister() that this was called instead:
> > >     SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
> >
> > If cache_mode can be automatically modified, when will it be restored
> > to writeback? I think we need to be able to enable or disable this.
> >
> > >
> > >This would bypass the cache for future writes, and allow reads to
> > >proceed if possible, and -EIO otherwise to let upper layers handle the
> > >failure.
> > >
> > >What do you think?
> >
> > If we switch to writearound mode, how to ensure that the IO is read-only,
> > because writing IO may require invalidating dirty data. If the backing
> > write is successful but invalid fails, how should we handle it?
> >
> > Maybe "panic" could be the default option. What do you think?
> >
> > >
> > >> But because bcache explicitly does not honor write-barriers from
> > >> upstream writes for its own writeback (which is okay because it
> > >> guarantees to write back all data anyways and give a consistent view to
> > >> upstream FS - well, unless it has to handle write errors), the backed
> > >> filesystem is guaranteed to be effed up in that case, and allowing it to
> > >> mount and write because bcache silently has fallen back to "none" will
> > >> only make the matter worse.
> > >>
> > >> (HINT: I never used brbd personally, most of the following is
> > >> theoretical thinking without real-world experience)
> > >>
> > >> I see that you're using drbd? Did it fail due to networking issues?
> > >> I'm pretty sure it should be robust in that case but maybe bcache
> > >> cannot handle the situation? Does brbd have a write log to replay
> > >> writes after network connection loss? It looks like it doesn't and
> > >> thus bcache exploded.
> > >
> > >DRBD is _above_ bcache, not below it.  In this case, DRBD hung because
> > >bcache hung, not the other way around, so DRBD is not the issue here.
> > >Here is our stack:
> > >
> > >bcache:
> > >     bdev:     /dev/sda hardware RAID5
> > >     cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
> > >
> > >And then bcache is stacked like so:
> > >
> > >        bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
> > >                              |
> > >                              v
> > >                         [remote host]
> > >
> > >> Anyways, since your backing device seems to be on drbd, using metadata
> > >> allocation hinting is probably no option. You could of course still use
> > >> drbd with bcache for metadata hinted partitions, and then use
> > >> writearound caching only for that. At least, in the fail-case, your
> > >> btrfs won't be destroyed. But your data chunks may have unreadable files
> > >> then. But it should be easy to select them and restore from backup
> > >> individually. Btrfs is very robust for that fail case: if metadata is
> > >> okay, data errors are properly detected and handled. If you're not using
> > >> btrfs, all of this doesn't apply ofc.
> > >>
> > >> I'm not sure if write-back caching for drbd backing is a wise decision
> > >> anyways. drbd is slow for writes, that's part of the design (and no
> > >> writeback caching could fix that).
> > >
> > >Bcache-backed DRBD provides a noticable difference, especially with a
> > >10GbE link (or faster) and the same disk stack on both sides.
> > >
> > >> I would not rely on bcache-writeback to fix that for you because it is
> > >> not prepared for storage that may be temporarily not available
> > >
> > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware of
> > >DRBD's existence.
> > >
> > >> iow, it would freeze and continue when drbd is available again. I think
> > >> you should really use writearound/writethrough so your FS can be sure
> > >> data has been written, replicated and persisted. In case of btrfs, you
> > >> could still split data and metadata as written above, and use writeback
> > >> for data, but reliable writes for metadata.
> > >>
> > >> So concluding:
> > >>
> > >> 1. I'm now persisting metadata directly to disk with no intermediate
> > >> layers (no bcache, no md)
> > >>
> > >> 2. I'm using allocation-hinted data-only partitions with bcache
> > >> write-back, with bcache on mdraid1. If anything goes wrong, I have
> > >> file crc errors in btrfs files only, but the filesystem itself is
> > >> valid because no metadata is broken or lost. I have snapshots of
> > >> recently modified files. I have daily backups.
> > >>
> > >> 3. Your problem is that bcache can - by design - detect write errors
> > >> only when it's too late with no chance telling the filesystem. In that
> > >> case, writethrough/writearound is the correct choice.
> > >>
> > >> 4. Maybe bcache should know if backing is on storage that may be
> > >> temporarily unavailable and then freeze until the backing storage is
> > >> back online, similar to how iSCSI handles that.
> > >
> > >I don't think "temporarily unavailable" should be bcache's burden, as
> > >bcache is a local-only solution.  If someone is using iSCSI under bcache,
> > >then good luck ;)
> > >
> > >> But otoh, maybe drbd should freeze until the replicated storage is
> > >> available again while writing (from what I've read, it's designed to not
> > >> do that but let local storage get ahead of the replica, which is btw
> > >> incompatible with bcache-writeback assumptions).
> > >
> > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected
> > >and has no local copy for some reason.  If local storage is available, it
> > >will use that and resync when its peer comes up.
> > >
> > >> Or maybe using async mirroring can fix this for you but then, the mirror
> > >> will be compromised if a hardware failure immediately follows a previous
> > >> drbd network connection loss. But, it may still be an issue with the
> > >> local hardware (bit-flips etc) because maybe just bcache internals broke
> > >> - Coly may have a better idea of that.
> > >
> > >This isn't DRBDs fault since it is above bcache. I wish only address the
> > >the bcache cache=none issue.
> > >
> > >-Eric
> > >
> > >>
> > >> I think your main issue here is that bcache decouples writebarriers
> > >> from the underlying backing storage - and you should just not use
> > >> writeback, it is incompatible by design with how drbd works: your
> > >> replica will be broken when you need it.
> > >
> > >
> > >>
> > >>
> > >> > Here is our trace:
> > >> >
> > >> > [Sep 6 13:01] bcache: bch_cache_set_error() error on
> > >> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
> > >> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
> > >> > 0:1163806048 gen 3: bad, length too big, disabling caching
> > >>
> > >> > [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> > >> > [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> > >> > [  +0.000809] Workqueue: bcache bch_data_insert_keys
> > >> > [  +0.000826] Call Trace:
> > >> > [  +0.000797]  <TASK>
> > >> > [  +0.000006]  dump_stack_lvl+0x57/0x7e
> > >> > [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
> > >> > [  +0.000759]  btree_mergesort+0x27e/0x36e
> > >> > [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> > >> > [  +0.000009]  __btree_sort+0xa4/0x1e9
> > >> > [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> > >> > [  +0.000836]  bch_btree_init_next+0x39/0xb6
> > >> > [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> > >> > [  +0.000863]  btree_insert_fn+0x20/0x48
> > >> > [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> > >> > [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > >> > [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> > >> > [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > >> > [  +0.000848]  bch_btree_insert+0x102/0x188
> > >> > [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> > >> > [  +0.000857]  bch_data_insert_keys+0x39/0xde
> > >> > [  +0.000845]  process_one_work+0x280/0x5cf
> > >> > [  +0.000858]  worker_thread+0x52/0x3bd
> > >> > [  +0.000851]  ? process_one_work.cold+0x52/0x51
> > >> > [  +0.000877]  kthread+0x13e/0x15b
> > >> > [  +0.000858]  ? set_kthread_struct+0x60/0x52
> > >> > [  +0.000855]  ret_from_fork+0x22/0x2d
> > >> > [  +0.000854]  </TASK>
> > >>
> > >>
> > >> Regards,
> > >> Kai
> > >>
> >
> >
> >
> >
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: Dirty data loss after cache disk error recovery
       [not found]                   ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
  2023-10-11 16:19                     ` Kai Krakow
@ 2023-10-11 16:29                     ` Kai Krakow
  1 sibling, 0 replies; 17+ messages in thread
From: Kai Krakow @ 2023-10-11 16:29 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲,
	Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)

Eric,

your "from" mail (lists@bcache.ewheeler.net) does not exist:
> DNS Error: DNS type 'mx' lookup of bcache.ewheeler.net responded with code NXDOMAIN Domain name not found: bcache.ewheeler.net

Or is something messed up on my side?

All others, please ignore. Doesn't add to the conversation. Thanks. :-)

[...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-11 16:19                     ` Kai Krakow
@ 2023-10-16 23:39                       ` Eric Wheeler
  2023-10-17  0:33                         ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wheeler @ 2023-10-16 23:39 UTC (permalink / raw)
  To: Kai Krakow
  Cc: 邹明哲,
	Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)


On Wed, 11 Oct 2023, Kai Krakow wrote:
> I've now faced a similar issue where one of my HDDs spontaneously
> decided to have a series of bad blocks. It looks like it has 26145
> failed writes due to how bcache handles writeback. It had 5275 failed
> reads with btrfs loudly complaining about it. The system also became
> really slow to respond until it eventually froze.
> 
> After a reboot it worked again but of course there were still bad
> blocks because bcache did writeback, so no blocks have been replaced
> with btrfs auto-repair on read feature. This time, the system handled
> the situation a bit better but files became inaccessible in the middle
> of writing them which destroyed my Plasma desktop configuration and
> Chrome profile (I restored them from the last snapper snapshot
> successfully). Essentially, the file system was in a readonly-like
> state: most requests failed with IO errors despite the btrfs didn't
> switch to read-only. Something messed up in the error path of
> userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the

Do you mean userspace -> btrfs -> bcache -> device?

> device somewhere in the limbo of not existing and not working - it
> still tried to access it while bcache claimed the backend device would
> be missing. To me this looks like bcache error handling may need some
> fine tuning - it should not fail in that way, especially not with
> btrfs-raid, but still the system was seeing IO errors and broken files
> in the middle of writes.
> 
> "bcache show" showed the backend device missing while "btrfs dev show"
> was still seeing the attached bcache device, and the system threw IO
> errors to user-space despite btrfs still having a valid copy of the
> blocks.
> 
> I've rebooted and now switched the bad device from bcache writeback to
> bcache none - and guess what: The system runs stable now, btrfs
> auto-repair does its thing. The above mentioned behavior does not
> occur (IO errors in user-space). A final scrub across the bad devices
> repaired the bad blocks, I currently do not see any more problems.
> 
> It's probably better to replace that device but this also shows that
> switching bcache to "none" (if the backing device fails) or "write
> through" at least may be a better choice than doing some other error
> handling. Or bcache should have been able to make btrfs see the device
> as missing (which obviously did not happen).

Noted.  Did bcache actually detach its cache in the failure scenario 
you describe?

> Of course, if the cache device fails we have a completely different
> situation. I'm not sure which situation Eric was seeing (I think the
> caching device failed) but for me, the backing device failed - and
> with bcache involved, the result was very unexpected.

Ahh, so you are saying the cache continued to service requests even though 
the bdev was offline?  Was the bdev completely "unplugged" or was it just 
having IO errors?

> So we probably need at least two error handlers: Handling caching
> device errors, and handling backing device errors (for which bcache
> doesn't currently seem to have a setting).

I think it tries to write to the cache if the bdev dies.  Dirty or cached 
blocks are read from cache and other IOs are passed to the bdev, which may 
end up returning an EIO.  Coly, is this correct?
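
A rough way to observe that split from user space is the per-device
counters in sysfs (assuming the device is bcache0):

    grep . /sys/block/bcache0/bcache/stats_total/cache_{hits,misses,bypass_hits,bypass_misses}
    cat /sys/block/bcache0/bcache/dirty_data

Reads served from the cache bump cache_hits; IO that bypassed the cache
and went straight to the bdev shows up in the cache_bypass_* counters.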

-Eric
 
> Except for the strange IO errors and resulting incomplete writes (and
> I really don't know why that happened), btrfs survived this perfectly
> well - and somehow bcache did a good enough job. This has been
> different in the past. So this is already a great achievement. Thank
> you.
> 
> BTW: This probably only worked for me because I split btrfs metadata
> and data to different devices
> (https://github.com/kakra/linux/pull/26), and metadata does not pass
> through bcache at all but natively to SSD. Otherwise I fear btrfs may
> have seen partial metadata writes on different RAID members.
> 
> Regards,
> Kai
> 
> 
> Am Di., 12. Sept. 2023 um 22:02 Uhr schrieb Eric Wheeler
> <lists@bcache.ewheeler.net>:
> >
> > On Tue, 12 Sep 2023, 邹明哲 wrote:
> > > From: Eric Wheeler <lists@bcache.ewheeler.net>
> > > Date: 2023-09-07 08:42:41
> > > To:  Coly Li <colyli@suse.de>
> > > Cc:  Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn>
> > > Subject: Re: Dirty data loss after cache disk error recovery
> > > >+Mingzhe, Coly: please comment on the proposed fix below when you have a
> > > >moment:
> > >
> > > Hi, Eric
> > >
> > > This is an old issue, and it took me a long time to understand what
> > > happened.
> > >
> > > >
> > > >On Thu, 7 Sep 2023, Kai Krakow wrote:
> > > >> Wow!
> > > >>
> > > >> I call that a necro-bump... ;-)
> > > >>
> > > >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler
> > > >> <lists@bcache.ewheeler.net>:
> > > >> >
> > > >> > On Fri, 7 May 2021, Kai Krakow wrote:
> > > >> >
> > > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When
> > > >> > > > the cache device is disconnected, it is always risky that some caching
> > > >> > > > data or meta data is not updated onto cache device. Permit the cache
> > > >> > > > device to be re-attached to the backing device may introduce "silent
> > > >> > > > data loss" which might be worse....  It was the reason why I didn't add
> > > >> > > > new error action for the device failure handling patch set.
> > > >> > >
> > > >> > > But we are actually now seeing silent data loss: The system f'ed up
> > > >> > > somehow, needed a hard reset, and after reboot the bcache device was
> > > >> > > accessible in cache mode "none" (because they have been unregistered
> > > >> > > before, and because udev just detected it and you can use bcache
> > > >> > > without an attached cache in "none" mode), completely hiding the fact
> > > >> > > that we lost dirty write-back data, it's even not quite obvious that
> > > >> > > /dev/bcache0 now is detached, cache mode none, but accessible
> > > >> > > nevertheless. To me, this is quite clearly "silent data loss",
> > > >> > > especially since the unregister action threw the dirty data away.
> > > >> > >
> > > >> > > So this:
> > > >> > >
> > > >> > > > Permit the cache
> > > >> > > > device to be re-attached to the backing device may introduce "silent
> > > >> > > > data loss" which might be worse....
> > > >> > >
> > > >> > > is actually the situation we are facing currently: Device has been
> > > >> > > unregistered, after reboot, udev detects it has clean backing device
> > > >> > > without cache association, using cache mode none, and it is readable
> > > >> > > and writable just fine: It essentially permitted access to the stale
> > > >> > > backing device (tho, it didn't re-attach as you outlined, but that's
> > > >> > > more or less the same situation).
> > > >> > >
> > > >> > > Maybe devices that become disassociated from a cache due to IO errors
> > > >> > > but have dirty data should go to a caching mode "stale", and bcache
> > > >> > > should refuse to access such devices or throw away their dirty data
> > > >> > > until I decide to force them back online into the cache set or force
> > > >> > > discard the dirty data. Then at least I would discover that something
> > > >> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't
> > > >> > > written. In the best case, that makes my FS unmountable, in the worst
> > > >> > > case, some file data is simply lost (aka silent data loss), besides
> > > >> > > both situations are the worst-case scenario anyways.
> > > >> > >
> > > >> > > The whole situation probably comes from udev auto-registering bcache
> > > >> > > backing devices again, and bcache has no record of why the device was
> > > >> > > unregistered - it looks clean after such a situation.
> > > >>
> > > >> [...]
> > > >>
> > > >> > I think we hit this same issue from 2021. Here is that original thread from 2021:
> > > >> >         https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
> > > >> >
> > > >> > Kai, did you end up with a good patch for this? We are running a 5.15
> > > >> > kernel with the many backported bcache commits that Coly suggested here:
> > > >> >         https://www.spinics.net/lists/linux-bcache/msg12084.html
> > > >>
> > > >> I'm currently running 6.1 with bcache on mdraid1 and device-level
> > > >> write caching disabled. I didn't see this ever occur again.
> > > >
> > > >Awesome, good to know.
> > > >
> > > >> But as written above, I had bad RAM, and meanwhile upgraded to kernel
> > > >> 6.1, and had no issues since with bcache even on power loss.
> > > >>
> > > >> > Coly, is there already a patch to prevent complete dirty cache loss?
> > > >>
> > > >> This is probably still an issue. The cache attachment MUST NEVER EVER
> > > >> automatically degrade to "none" which it did for my fail-cases I had
> > > >> back then. I don't know if this has changed meanwhile.
> > > >
> > > >I would rather that bcache went to a read-only mode in failure
> > > >conditions like this.  Maybe write-around would be acceptable since
> > > >bcache returns -EIO for any failed dirty cache reads.  But if the cache
> > > >is dirty, and it gets an error, it _must_never_ read from the bdev, which
> > > >is what appears to happen now.
> > > >
> > > >Coly, Mingzhe, would this be an easy change?
> > >
> > > First of all, we have never had this problem. We have had an nvme
> > > controller failure, but in that situation the cache cannot be read or
> > > written, so even unregister will not succeed.
> > >
> > > Coly once replied like this:
> > >
> > > """
> > > There is an option to panic the system when cache device failed. It
> > > is in errors file with available options as "unregister" and "panic".
> > > This option is default set to "unregister", if you set it to "panic"
> > > then panic() will be called.
> > > """
> > >
> > > I think "panic" is a better way to handle this situation. If cache
> > > returns an error, there may be more unknown errors if the operation
> > > continues.
> >
> > Depending on how the block devices are stacked, the OS can continue if
> > bcache fails (eg, bcache under raid1, drbd, etc).  Returning IO requests
> > with -EIO or setting bcache read-only would be better, because a panic
> > would crash services that could otherwise proceed without noticing the
> > bcache outage.
> >
> > If bcache has a critical failure, I would rather that it fail the IOs so
> > upper-layers in the block stack can compensate.
> >
> > What if we extend /sys/fs/bcache/<uuid>/errors to include a "readonly"
> > option, and make that the default setting?  The gendisk(s) for related
> > /dev/bcacheX devices can be flagged BLKROSET in the error handler:
> >         https://patchwork.kernel.org/project/dm-devel/patch/20201129181926.897775-2-hch@lst.de/
> >
> > This would protect the data and keep the host online.
> >
> > --
> > Eric Wheeler
> >
> >
> >
> > >
> > > >
> > > >Here are the relevant bits:
> > > >
> > > >The allocator called btree_mergesort which called bch_extent_invalid:
> > > >     https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
> > > >
> > > >Which called the `cache_bug` macro, which triggered bch_cache_set_error:
> > > >     https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
> > > >
> > > >It then calls `bch_cache_set_unregister` which shuts down the cache:
> > > >     https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
> > > >
> > > >     bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
> > > >     {
> > > >             ...
> > > >             bch_cache_set_unregister(c);
> > > >             return true;
> > > >     }
> > > >
> > > >Proposed solution:
> > > >
> > > >What if, instead of bch_cache_set_unregister(), this was called instead:
> > > >     SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
> > >
> > > If cache_mode can be automatically modified, when will it be restored
> > > to writeback? I think we need to be able to enable or disable this.
> > >
> > > >
> > > >This would bypass the cache for future writes, and allow reads to
> > > >proceed if possible, and -EIO otherwise to let upper layers handle the
> > > >failure.
> > > >
> > > >What do you think?
> > >
> > > If we switch to writearound mode, how do we ensure that the IO is read-only,
> > > since write IO may require invalidating dirty data? If the backing
> > > write succeeds but the invalidate fails, how should we handle it?
> > >
> > > Maybe "panic" could be the default option. What do you think?
> > >
> > > >
> > > >> But because bcache explicitly does not honor write-barriers from
> > > >> upstream writes for its own writeback (which is okay because it
> > > >> guarantees to write back all data anyways and give a consistent view to
> > > >> upstream FS - well, unless it has to handle write errors), the backed
> > > >> filesystem is guaranteed to be effed up in that case, and allowing it to
> > > >> mount and write because bcache silently has fallen back to "none" will
> > > >> only make the matter worse.
> > > >>
> > > >> (HINT: I never used drbd personally, most of the following is
> > > >> theoretical thinking without real-world experience)
> > > >>
> > > >> I see that you're using drbd? Did it fail due to networking issues?
> > > >> I'm pretty sure it should be robust in that case but maybe bcache
> > > >> cannot handle the situation? Does drbd have a write log to replay
> > > >> writes after network connection loss? It looks like it doesn't and
> > > >> thus bcache exploded.
> > > >
> > > >DRBD is _above_ bcache, not below it.  In this case, DRBD hung because
> > > >bcache hung, not the other way around, so DRBD is not the issue here.
> > > >Here is our stack:
> > > >
> > > >bcache:
> > > >     bdev:     /dev/sda hardware RAID5
> > > >     cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1
> > > >
> > > >And then bcache is stacked like so:
> > > >
> > > >        bcache <- dm-thin <- DRBD <- dm-crypt <- KVM
> > > >                              |
> > > >                              v
> > > >                         [remote host]
> > > >
> > > >> Anyways, since your backing device seems to be on drbd, using metadata
> > > >> allocation hinting is probably no option. You could of course still use
> > > >> drbd with bcache for metadata hinted partitions, and then use
> > > >> writearound caching only for that. At least, in the fail-case, your
> > > >> btrfs won't be destroyed. But your data chunks may have unreadable files
> > > >> then. But it should be easy to select them and restore from backup
> > > >> individually. Btrfs is very robust for that fail case: if metadata is
> > > >> okay, data errors are properly detected and handled. If you're not using
> > > >> btrfs, all of this doesn't apply ofc.
> > > >>
> > > >> I'm not sure if write-back caching for drbd backing is a wise decision
> > > >> anyways. drbd is slow for writes, that's part of the design (and no
> > > >> writeback caching could fix that).
> > > >
> > > >Bcache-backed DRBD provides a noticeable difference, especially with a
> > > >10GbE link (or faster) and the same disk stack on both sides.
> > > >
> > > >> I would not rely on bcache-writeback to fix that for you because it is
> > > >> not prepared for storage that may be temporarily not available
> > > >
> > > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware of
> > > >DRBD's existence.
> > > >
> > > >> iow, it would freeze and continue when drbd is available again. I think
> > > >> you should really use writearound/writethrough so your FS can be sure
> > > >> data has been written, replicated and persisted. In case of btrfs, you
> > > >> could still split data and metadata as written above, and use writeback
> > > >> for data, but reliable writes for metadata.
> > > >>
> > > >> So concluding:
> > > >>
> > > >> 1. I'm now persisting metadata directly to disk with no intermediate
> > > >> layers (no bcache, no md)
> > > >>
> > > >> 2. I'm using allocation-hinted data-only partitions with bcache
> > > >> write-back, with bcache on mdraid1. If anything goes wrong, I have
> > > >> file crc errors in btrfs files only, but the filesystem itself is
> > > >> valid because no metadata is broken or lost. I have snapshots of
> > > >> recently modified files. I have daily backups.
> > > >>
> > > >> 3. Your problem is that bcache can - by design - detect write errors
> > > >> only when it's too late with no chance telling the filesystem. In that
> > > >> case, writethrough/writearound is the correct choice.
> > > >>
> > > >> 4. Maybe bcache should know if backing is on storage that may be
> > > >> temporarily unavailable and then freeze until the backing storage is
> > > >> back online, similar to how iSCSI handles that.
> > > >
> > > >I don't think "temporarily unavailable" should be bcache's burden, as
> > > >bcache is a local-only solution.  If someone is using iSCSI under bcache,
> > > >then good luck ;)
> > > >
> > > >> But otoh, maybe drbd should freeze until the replicated storage is
> > > >> available again while writing (from what I've read, it's designed to not
> > > >> do that but let local storage get ahead of the replica, which is btw
> > > >> incompatible with bcache-writeback assumptions).
> > > >
> > > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected
> > > >and has no local copy for some reason.  If local storage is available, it
> > > >will use that and resync when its peer comes up.
> > > >
> > > >> Or maybe using async mirroring can fix this for you but then, the mirror
> > > >> will be compromised if a hardware failure immediately follows a previous
> > > >> drbd network connection loss. But, it may still be an issue with the
> > > >> local hardware (bit-flips etc) because maybe just bcache internals broke
> > > >> - Coly may have a better idea of that.
> > > >
> > > >This isn't DRBD's fault since it is above bcache. I wish only to address
> > > >the bcache cache=none issue.
> > > >
> > > >-Eric
> > > >
> > > >>
> > > >> I think your main issue here is that bcache decouples writebarriers
> > > >> from the underlying backing storage - and you should just not use
> > > >> writeback, it is incompatible by design with how drbd works: your
> > > >> replica will be broken when you need it.
> > > >
> > > >
> > > >>
> > > >>
> > > >> > Here is our trace:
> > > >> >
> > > >> > [Sep 6 13:01] bcache: bch_cache_set_error() error on
> > > >> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
> > > >> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
> > > >> > 0:1163806048 gen 3: bad, length too big, disabling caching
> > > >>
> > > >> > [  +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> > > >> > [  +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> > > >> > [  +0.000809] Workqueue: bcache bch_data_insert_keys
> > > >> > [  +0.000826] Call Trace:
> > > >> > [  +0.000797]  <TASK>
> > > >> > [  +0.000006]  dump_stack_lvl+0x57/0x7e
> > > >> > [  +0.000755]  bch_extent_invalid.cold+0x9/0x10
> > > >> > [  +0.000759]  btree_mergesort+0x27e/0x36e
> > > >> > [  +0.000005]  ? bch_cache_allocator_start+0x50/0x50
> > > >> > [  +0.000009]  __btree_sort+0xa4/0x1e9
> > > >> > [  +0.000109]  bch_btree_sort_partial+0xbc/0x14d
> > > >> > [  +0.000836]  bch_btree_init_next+0x39/0xb6
> > > >> > [  +0.000004]  bch_btree_insert_node+0x26e/0x2d3
> > > >> > [  +0.000863]  btree_insert_fn+0x20/0x48
> > > >> > [  +0.000864]  bch_btree_map_nodes_recurse+0x111/0x1a7
> > > >> > [  +0.004270]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > >> > [  +0.000850]  __bch_btree_map_nodes+0x1e0/0x1fb
> > > >> > [  +0.000858]  ? bch_btree_insert_check_key+0x1f0/0x1e1
> > > >> > [  +0.000848]  bch_btree_insert+0x102/0x188
> > > >> > [  +0.000844]  ? do_wait_intr_irq+0xb0/0xaf
> > > >> > [  +0.000857]  bch_data_insert_keys+0x39/0xde
> > > >> > [  +0.000845]  process_one_work+0x280/0x5cf
> > > >> > [  +0.000858]  worker_thread+0x52/0x3bd
> > > >> > [  +0.000851]  ? process_one_work.cold+0x52/0x51
> > > >> > [  +0.000877]  kthread+0x13e/0x15b
> > > >> > [  +0.000858]  ? set_kthread_struct+0x60/0x52
> > > >> > [  +0.000855]  ret_from_fork+0x22/0x2d
> > > >> > [  +0.000854]  </TASK>
> > > >>
> > > >>
> > > >> Regards,
> > > >> Kai
> > > >>
> > >
> > >
> > >
> > >
> > >
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-16 23:39                       ` Eric Wheeler
@ 2023-10-17  0:33                         ` Kai Krakow
  2023-10-17  0:39                           ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-10-17  0:33 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲,
	Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)

Am Di., 17. Okt. 2023 um 01:39 Uhr schrieb Eric Wheeler
<bcache@lists.ewheeler.net>:
>
> On Wed, 11 Oct 2023, Kai Krakow wrote:
> > After a reboot it worked again but of course there were still bad
> > blocks because bcache did writeback, so no blocks have been replaced
> > with btrfs auto-repair on read feature. This time, the system handled
> > the situation a bit better but files became inaccessible in the middle
> > of writing them which destroyed my Plasma desktop configuration and
> > Chrome profile (I restored them from the last snapper snapshot
> > successfully). Essentially, the file system was in a readonly-like
> > state: most requests failed with IO errors even though btrfs didn't
> > switch to read-only. Something messed up in the error path of
> > userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the
>
> Do you mean userspace -> btrfs -> bcache -> device

Ehm.. Yes...


> > device somewhere in the limbo of not existing and not working - it
> > still tried to access it while bcache claimed the backend device would
> > be missing. To me this looks like bcache error handling may need some
> > fine tuning - it should not fail in that way, especially not with
> > btrfs-raid, but still the system was seeing IO errors and broken files
> > in the middle of writes.
> >
> > "bcache show" showed the backend device missing while "btrfs dev show"
> > was still seeing the attached bcache device, and the system threw IO
> > errors to user-space despite btrfs still having a valid copy of the
> > blocks.
> >
> > I've rebooted and now switched the bad device from bcache writeback to
> > bcache none - and guess what: The system runs stable now, btrfs
> > auto-repair does its thing. The above mentioned behavior does not
> > occur (IO errors in user-space). A final scrub across the bad devices
> > repaired the bad blocks, I currently do not see any more problems.
> >
> > It's probably better to replace that device but this also shows that
> > switching bcache to "none" (if the backing device fails) or "write
> > through" at least may be a better choice than doing some other error
> > handling. Or bcache should have been able to make btrfs see the device
> > as missing (which obviously did not happen).
>
> Noted.  Did bcache actually detach its cache in the failure scenario
> you describe?

It seemed still attached but was marked as "missing" in the bcache CLI tool.


> > Of course, if the cache device fails we have a completely different
> > situation. I'm not sure which situation Eric was seeing (I think the
> > caching device failed) but for me, the backing device failed - and
> > with bcache involved, the result was very unexpected.
>
> Ahh, so you are saying the cache continued to service requests even though
> the bdev was offline?  Was the bdev completely "unplugged" or was it just
> having IO errors?

smartctl was still seeing the device, so I think it "just" had IO errors.


> > So we probably need at least two error handlers: Handling caching
> > device errors, and handling backing device errors (for which bcache
> > doesn't currently seem to have a setting).
>
> I think it tries to write to the cache if the bdev dies.  Dirty or cached
> blocks are read from cache and other IOs are passed to the bdev, which may
> end up returning an EIO.

Hmm, yes that makes sense... But it seems to confuse user-space a lot.

Except that in writeback mode, it won't (and cannot) return errors to
user-space although writes eventually fail later and data does not
persist. So it may be better to turn writeback off as soon as bdev IO
errors are found, or trigger an immediate writeback by temporarily
setting writeback_percent to 0. Usually, HDDs support self-healing -
which didn't work in this case because of delayed writeback. After I
switched to "none", it worked. After some more experimenting, it looks
like even "writethrough" may lack behind and not bubble bdev IO errors
back up to user-space (or it was due to writeback_percent=0, errors
are gone so I can no longer reproduce). I would expect it to do
exactly that, tho. I didn't test "writearound".
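
For reference, the manual version of that workaround via sysfs looks
roughly like this (bcache0 is just the example device name here):

    # tell the writeback thread to flush dirty data as fast as it can
    echo 0 > /sys/block/bcache0/bcache/writeback_percent

    # watch the remaining dirty data drain to 0
    watch cat /sys/block/bcache0/bcache/dirty_data

    # then stop caching writes for the flaky backing device
    echo writethrough > /sys/block/bcache0/bcache/cache_mode   # or "none"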

Also, it looks like a failed delayed write of writeback dirty data may
not be retried by bcache. Or at least, I needed to run "btrfs scrub"
with bcache mode "none" to make it work properly and let the HDD heal
itself. OTOH, the HDD probably didn't fail writes but reads (except
when the situation got completely messed up and even writes returned
IO errors but maybe btrfs was involved here).
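
A sketch of that sequence (bcache0 and /mnt/pool are placeholders for
the real device and mount point):

    # stop caching the flaky disk so reads hit the bdev directly
    echo none > /sys/block/bcache0/bcache/cache_mode

    # let btrfs rewrite bad blocks from the good RAID copy
    btrfs scrub start /mnt/pool
    btrfs scrub status /mnt/pool   # check progress and corrected errors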

BTW: The failing HDD has run fine for a few days now, even with writeback
switched on again. It properly healed itself. But still, time to swap
it sooner rather than later.


>  Coly, is this correct?
>
> -Eric


Regards,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-17  0:33                         ` Kai Krakow
@ 2023-10-17  0:39                           ` Kai Krakow
  0 siblings, 0 replies; 17+ messages in thread
From: Kai Krakow @ 2023-10-17  0:39 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲,
	Coly Li, linux-bcache,
	吴本卿(云桌面
	福州)

Just another thought...

Am Di., 17. Okt. 2023 um 02:33 Uhr schrieb Kai Krakow <kai@kaishome.de>:

> Except that in writeback mode, it won't (and cannot) return errors to
> user-space although writes eventually fail later and data does not
> persist. So it may be better to turn writeback off as soon as bdev IO
> errors are found, or trigger an immediate writeback by temporarily
> setting writeback_percent to 0. Usually, HDDs support self-healing -
> which didn't work in this case because of delayed writeback. After I
> switched to "none", it worked.

In that light, it might be worth thinking about how bcache could be
used to encourage self-healing of HDDs:

1. If a read IO error occurs, it should start flushing dirty data,
maybe switch to "none" or "writethrough/writearound" (a rough user-space
sketch of this follows after this list).

2. Cached bcache contents could be used to rewrite data - in case a
sector has become bad. But I think this needs the firmware to detect a
read error on that sector first - which doesn't help us because then
the data would not be in bcache in the first place.

3. How does bcache handle bdev write errors in general, and in the case
of delayed writeback in particular?
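
For item 1, a minimal user-space approximation, purely as a sketch (the
log pattern and device name are illustrative, this is not an existing
bcache feature):

    journalctl -k -f | grep --line-buffered 'bcache: .* error' | \
    while read -r line; do
        echo 0            > /sys/block/bcache0/bcache/writeback_percent
        echo writethrough > /sys/block/bcache0/bcache/cache_mode
    done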


Regards,
Kai

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Dirty data loss after cache disk error recovery
  2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
  2021-04-28 18:30 ` Kai Krakow
@ 2023-10-17  1:57 ` Coly Li
  1 sibling, 0 replies; 17+ messages in thread
From: Coly Li @ 2023-10-17  1:57 UTC (permalink / raw)
  To: "吴本卿(云桌面
	福州)"
  Cc: linux-bcache



> 2021年4月20日 11:17,吴本卿(云桌面 福州) <wubenqing@ruijie.com.cn> 写道:
> 
> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
> 
> I checked the log and found that logs:
> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered

When you report a bcache-related issue, it would be better to also provide the kernel version and distribution information. Some distributions don't support bcache, and it is possible that some necessary fixes were not backported to an older kernel version.
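
Something like the following collects that information (bcache-super-show
comes with bcache-tools; the cache device path is a placeholder):

    uname -r                          # running kernel version
    cat /etc/os-release               # distribution and release
    bcache-super-show /dev/sdb        # backing device superblock
    bcache-super-show /dev/nvme0n1    # cache device superblock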

Thanks.

Coly Li

> 
> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
> 
> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
> 
> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37: 
> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
> 
> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-10-17  1:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
2021-04-28 18:30 ` Kai Krakow
2021-04-28 18:39   ` Kai Krakow
2021-04-28 18:51     ` Kai Krakow
2021-05-07 12:11       ` Coly Li
2021-05-07 14:56         ` Kai Krakow
     [not found]           ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
2023-09-06 22:56             ` Kai Krakow
     [not found]               ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
2023-09-07 12:00                 ` Kai Krakow
2023-09-07 19:10                   ` Eric Wheeler
2023-09-12  6:54                 ` 邹明哲
     [not found]                   ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
2023-10-11 16:19                     ` Kai Krakow
2023-10-16 23:39                       ` Eric Wheeler
2023-10-17  0:33                         ` Kai Krakow
2023-10-17  0:39                           ` Kai Krakow
2023-10-11 16:29                     ` Kai Krakow
2021-05-07 12:13     ` Coly Li
2023-10-17  1:57 ` Coly Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).