* Dirty data loss after cache disk error recovery
@ 2021-04-20  3:17 吴本卿(云桌面 福州)
  2021-04-28 18:30 ` Kai Krakow
  2023-10-17  1:57 ` Coly Li
  0 siblings, 2 replies; 17+ messages in thread
From: 吴本卿(云桌面 福州) @ 2021-04-20  3:17 UTC (permalink / raw)
  To: linux-bcache

Hi,

Recently I found a problem in the process of using bcache. My cache disk was offline for some reason. When the cache disk was back online, I found that the backend was in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on the backend's filesystem is different because of the dirty data loss.

I checked the log and found these messages:
[12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
[12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
[12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered

I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be detached, which also causes the loss of dirty data. This is because, after the backend is reattached, the allocated bcache_device->id is incremented, while the bkey that points to the dirty data stores the old id.

Is there a way to avoid this problem, such as providing users with an option to execute the stop process instead of detach if a cache disk error occurs?
I tried to increase cache_set->io_error_limit, in order to gain time to stop the cache_set:
echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit

It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, when an io error occurs in the journal:
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
Apr 19 05:50:18 localhost.localdomain kernel: journal io error
Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
Is it possible to provide users with a choice to stop the cache_set instead of unregistering it?

^ permalink raw reply	[flat|nested] 17+ messages in thread
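Editor's sketch: the id mismatch described in the message above can be illustrated with a small user-space model. This is not bcache code. `ToyCacheSet`, its methods, and the `(device_id, offset)` key layout are assumptions made for illustration; the only thing taken from the message is the claim that dirty keys keep the old device id while a reattach hands out a new one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCacheSet:
    """Toy model: dirty keys are indexed by the backing device's id."""
    next_id: int = 0
    # Maps (device_id, offset) -> dirty data that has not been written back.
    dirty_keys: dict = field(default_factory=dict)

    def attach(self) -> int:
        """Attach a backing device and hand out a fresh id (assumed behaviour)."""
        device_id = self.next_id
        self.next_id += 1
        return device_id

    def write_dirty(self, device_id: int, offset: int, data: bytes) -> None:
        """Record dirty data under the id the device had at write time."""
        self.dirty_keys[(device_id, offset)] = data

    def lookup(self, device_id: int, offset: int):
        """Return cached dirty data for this device id, or None on a miss."""
        return self.dirty_keys.get((device_id, offset))


cache = ToyCacheSet()
old_id = cache.attach()
cache.write_dirty(old_id, 4096, b"new contents")

# Cache error: the backing device is detached, then attached again later.
new_id = cache.attach()

print(cache.lookup(old_id, 4096))  # still stored under the old id
print(cache.lookup(new_id, 4096))  # lookup under the new id misses
```

In this model the dirty data is never erased; it simply becomes unreachable, which matches the symptom of differing md5 sums after the reattach.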
* Re: Dirty data loss after cache disk error recovery
  2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
@ 2021-04-28 18:30 ` Kai Krakow
  2021-04-28 18:39   ` Kai Krakow
  2023-10-17  1:57 ` Coly Li
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:30 UTC
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

Hello!

On Tue, 20 Apr 2021 at 05:24, 吴本卿(云桌面 福州) <wubenqing@ruijie.com.cn> wrote:
>
> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reason. When the cache disk was back online, I found that the backend was in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on the backend's filesystem is different because of the dirty data loss.
>
> I checked the log and found these messages:
> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.

"stop it to avoid potential data corruption" is not what it actually does: it neither stops the device nor prevents corruption, because the dirty data is thrown away.

> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
>
> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be detached, which also causes the loss of dirty data. This is because, after the backend is reattached, the allocated bcache_device->id is incremented, while the bkey that points to the dirty data stores the old id.
>
> Is there a way to avoid this problem, such as providing users with an option to execute the stop process instead of detach if a cache disk error occurs?
> I tried to increase cache_set->io_error_limit, in order to gain time to stop the cache_set:
> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>
> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, when an io error occurs in the journal:
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it?

I think the same problem hit me, too, last night.

My kernel choked because of a GPU error, and that somehow disconnected the cache. I can only guess that there was some sort of timeout due to blocked queues, and that introduced an IO error which detached the caches.

Sadly, I only realized this after I had already reformatted and started the restore from backup: during the restore I watched the bcache status and found that the devices were not attached.

I don't know if I could have re-attached the devices instead of formatting. But I think the dirty data would have been discarded anyway due to the incremented bcache_device->id.

This really needs a better solution. Detaching is one of the worst options; especially on btrfs this has catastrophic consequences because data is not updated in place but via copy on write. This requires updating a lot of pointers. Usually, a COW filesystem would be robust against this kind of data loss, but the vast amount of dirty data that is lost puts the tree generations too far behind what btrfs expects, making it essentially broken beyond repair. If some trees in the FS are just a few generations behind, btrfs can repair itself by using a backup tree root, but when the bcache is lost, generation numbers usually lag behind by several hundred generations. Detaching would be fine if there were no dirty data; otherwise the device should probably stop and refuse any more IO.

@Coly If I patched the source to stop instead of detach, would it have made anything better? Would there be any side effects? Is it possible to atomically check for dirty data in that case and take either the one or the other action?

Thanks,
Kai
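Editor's sketch: the behaviour asked for at the end of the message above (detach only when nothing is dirty, otherwise stop and refuse IO, with the check and the action taken atomically) could look roughly like this in a user-space toy. It is not kernel code; `ToyCachedDevice`, `Action`, and the single-lock design are assumptions for illustration and say nothing about how bcache actually serializes these paths.

```python
import errno
import threading
from enum import Enum


class Action(Enum):
    DETACH = "detach"  # drop the cache, keep serving the backing device
    STOP = "stop"      # take the bcache device down and refuse further IO


class ToyCachedDevice:
    """User-space toy, not kernel code: one lock guards the dirty flag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._has_dirty = False
        self.accepting_io = True

    def write(self) -> None:
        """A write-back write: it leaves dirty data in the cache."""
        with self._lock:
            if not self.accepting_io:
                raise OSError(errno.EIO, "device stopped after cache failure")
            self._has_dirty = True

    def on_cache_failure(self) -> Action:
        """Check for dirty data and act under the same lock as write()."""
        with self._lock:
            if self._has_dirty:
                # The backing device alone is stale; refuse all further IO.
                self.accepting_io = False
                return Action.STOP
            # Nothing is lost, so running uncached is safe.
            return Action.DETACH
```

Because `write()` and `on_cache_failure()` share one lock, no write can slip in between the dirty check and the decision, which is the property the question about atomicity is after.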
* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:30 ` Kai Krakow
@ 2021-04-28 18:39   ` Kai Krakow
  2021-04-28 18:51     ` Kai Krakow
  2021-05-07 12:13     ` Coly Li
  0 siblings, 2 replies; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:39 UTC
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

Hi Coly!

On Wed, 28 Apr 2021 at 20:30, Kai Krakow <kai@kaishome.de> wrote:
>
> Hello!
>
> On Tue, 20 Apr 2021 at 05:24, 吴本卿(云桌面 福州) <wubenqing@ruijie.com.cn> wrote:
> >
> > Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reason. When the cache disk was back online, I found that the backend was in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on the backend's filesystem is different because of the dirty data loss.
> >
> > I checked the log and found these messages:
> > [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> "stop it to avoid potential data corruption" is not what it actually does: it neither stops the device nor prevents corruption, because the dirty data is thrown away.
>
> > [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> > [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
> >
> > I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be detached, which also causes the loss of dirty data. This is because, after the backend is reattached, the allocated bcache_device->id is incremented, while the bkey that points to the dirty data stores the old id.
> >
> > Is there a way to avoid this problem, such as providing users with an option to execute the stop process instead of detach if a cache disk error occurs?
> > I tried to increase cache_set->io_error_limit, in order to gain time to stop the cache_set:
> > echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
> >
> > It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, when an io error occurs in the journal:
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> > Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> > Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
> >
> > When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> > Is it possible to provide users with a choice to stop the cache_set instead of unregistering it?
>
> I think the same problem hit me, too, last night.
>
> My kernel choked because of a GPU error, and that somehow disconnected the cache. I can only guess that there was some sort of timeout due to blocked queues, and that introduced an IO error which detached the caches.
>
> Sadly, I only realized this after I had already reformatted and started the restore from backup: during the restore I watched the bcache status and found that the devices were not attached.
>
> I don't know if I could have re-attached the devices instead of formatting. But I think the dirty data would have been discarded anyway due to the incremented bcache_device->id.
>
> This really needs a better solution. Detaching is one of the worst options; especially on btrfs this has catastrophic consequences because data is not updated in place but via copy on write. This requires updating a lot of pointers. Usually, a COW filesystem would be robust against this kind of data loss, but the vast amount of dirty data that is lost puts the tree generations too far behind what btrfs expects, making it essentially broken beyond repair. If some trees in the FS are just a few generations behind, btrfs can repair itself by using a backup tree root, but when the bcache is lost, generation numbers usually lag behind by several hundred generations. Detaching would be fine if there were no dirty data; otherwise the device should probably stop and refuse any more IO.
>
> @Coly If I patched the source to stop instead of detach, would it have made anything better? Would there be any side effects? Is it possible to atomically check for dirty data in that case and take either the one or the other action?

I think this behavior was introduced by https://lwn.net/Articles/748226/

So above is my late review. ;-)

(around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot access LWN for reasons[tm])

Thanks,
Kai
* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:39 ` Kai Krakow
@ 2021-04-28 18:51   ` Kai Krakow
  2021-05-07 12:11     ` Coly Li
  2021-05-07 12:13   ` Coly Li
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-04-28 18:51 UTC
  To: 吴本卿(云桌面 福州)
  Cc: linux-bcache

> I think this behavior was introduced by https://lwn.net/Articles/748226/
>
> So above is my late review. ;-)
>
> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot access LWN for reasons[tm])

The problem may actually come from a different code path which retires the cache on metadata error:

commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
"bcache: fix cached_dev->count usage for bch_cache_set_error()"

It probably should consider whether there is any dirty data. As a first step, it may be sufficient to run a BUG_ON(there_is_dirty_data) (this would kill the bcache thread, which may not be a good idea) or even freeze the system with an unrecoverable error, or at least stop the device to prevent any IO with possibly stale data (because retiring throws away dirty data). A good solution would be if the "with dirty data" error path could somehow force the attached file system into read-only mode, maybe by just reporting IO errors when this bdev is accessed through bcache.

Thanks,
Kai
* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:51 ` Kai Krakow
@ 2021-05-07 12:11   ` Coly Li
  2021-05-07 14:56     ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Coly Li @ 2021-05-07 12:11 UTC
  To: Kai Krakow
  Cc: linux-bcache, 吴本卿(云桌面 福州)

On 4/29/21 2:51 AM, Kai Krakow wrote:
>> I think this behavior was introduced by https://lwn.net/Articles/748226/
>>
>> So above is my late review. ;-)
>>
>> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot access LWN for reasons[tm])
>
> The problem may actually come from a different code path which retires the cache on metadata error:
>
> commit 804f3c6981f5e4a506a8f14dc284cb218d0659ae
> "bcache: fix cached_dev->count usage for bch_cache_set_error()"
>
> It probably should consider whether there is any dirty data. As a first step, it may be sufficient to run a BUG_ON(there_is_dirty_data) (this would kill the bcache thread, which may not be a good idea) or even freeze the system with an unrecoverable error, or at least stop the device to prevent any IO with possibly stale data (because retiring throws away dirty data). A good solution would be if the "with dirty data" error path could somehow force the attached file system into read-only mode, maybe by just reporting IO errors when this bdev is accessed through bcache.

There is an option to panic the system when the cache device fails. It is in the errors file, with "unregister" and "panic" as the available options. The option defaults to "unregister"; if you set it to "panic", then panic() will be called.

If the cache set is attached, read-only the bcache device does not prevent the meta data I/O on cache device (when try to cache the reading data), if the cache device is really disconnected that will be problematic too.

The "auto" and "always" options are for the "unregister" error action. When I enhanced the device failure handling, I didn't add a new error action; all my work was to make the "unregister" action work better.

Adding a new "stop" error action IMHO doesn't make things better. When the cache device is disconnected, it is always risky that some caching data or meta data is not updated onto the cache device. Permitting the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse.... That was the reason why I didn't add a new error action in the device failure handling patch set.

Thanks.

Coly Li
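Editor's sketch: the options named in the message above are exposed through sysfs files that list every choice and mark the active one with square brackets. A small parser for that format might look like the following. The two paths in the comments are the usual bcache sysfs locations, with `<cset-uuid>` and `bcache0` as placeholders to adjust for the system at hand; the helper names are made up for this example.

```python
from pathlib import Path


def selected_option(text: str) -> str:
    """Return the option wrapped in [brackets] in a sysfs selection file."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no selected option in {text!r}")


def read_selected(path: str) -> str:
    """Read a sysfs selection file and return its active option."""
    return selected_option(Path(path).read_text())


# Typical locations (placeholders, adjust to your system):
#   /sys/fs/bcache/<cset-uuid>/errors                      -> unregister / panic
#   /sys/block/bcache0/bcache/stop_when_cache_set_failed   -> auto / always
print(selected_option("[unregister] panic"))  # unregister
```

Changing an option is a plain write of the bare option name to the same file, in the same way the original report wrote a number to io_error_limit.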
* Re: Dirty data loss after cache disk error recovery
  2021-05-07 12:11 ` Coly Li
@ 2021-05-07 14:56   ` Kai Krakow
  [not found]         ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2021-05-07 14:56 UTC
  To: Coly Li
  Cc: linux-bcache, 吴本卿(云桌面 福州)

Hi!

> There is an option to panic the system when the cache device fails. It is in the errors file, with "unregister" and "panic" as the available options. The option defaults to "unregister"; if you set it to "panic", then panic() will be called.

Hmm, okay, I didn't find "panic" documented anywhere. I'll take a look at it again. If it's missing, I'll create a patch to improve the documentation.

> If the cache set is attached, read-only the bcache device does not prevent the meta data I/O on cache device (when try to cache the reading data), if the cache device is really disconnected that will be problematic too.

I didn't completely understand the sentence, it seems to miss a word. But whatever it is, it's probably true. ;-)

> The "auto" and "always" options are for the "unregister" error action. When I enhanced the device failure handling, I didn't add a new error action; all my work was to make the "unregister" action work better.

But isn't the failure case here that it hits both code paths: the one that unregisters the device, and the one that then retires the cache?

> Adding a new "stop" error action IMHO doesn't make things better. When the cache device is disconnected, it is always risky that some caching data or meta data is not updated onto the cache device. Permitting the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse.... That was the reason why I didn't add a new error action in the device failure handling patch set.

But we are actually now seeing silent data loss: the system f'ed up somehow, needed a hard reset, and after reboot the bcache device was accessible in cache mode "none" (because it had been unregistered before, and because udev just detected it and you can use bcache without an attached cache in "none" mode), completely hiding the fact that we lost dirty write-back data. It's not even obvious that /dev/bcache0 is now detached and in cache mode none, but accessible nevertheless. To me, this is quite clearly "silent data loss", especially since the unregister action threw the dirty data away.

So this:

> Permitting the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse....

is actually the situation we are facing currently: the device has been unregistered; after reboot, udev detects a seemingly clean backing device without cache association, using cache mode none, and it is readable and writable just fine. It essentially permitted access to the stale backing device (though it didn't re-attach as you outlined, that's more or less the same situation).

Maybe devices that become disassociated from a cache due to IO errors but have dirty data should go to a caching mode "stale", and bcache should refuse to access such devices or throw away their dirty data until I decide to force them back online into the cache set or force discard the dirty data. Then at least I would discover that something went badly wrong. Otherwise, I may not detect that dirty data wasn't written. In the best case, that makes my FS unmountable; in the worst case, some file data is simply lost (aka silent data loss). Both situations are worst-case scenarios anyway.

The whole situation probably comes from udev auto-registering bcache backing devices again, and bcache has no record of why the device was unregistered - it looks clean after such a situation.

> Sorry I just find this thread from my INBOX. Hope it is not too late.

No worries. ;-)

It was already too late when the dirty cache was discarded but I have daily backups.

My system is up and running again, but it's probably not a question of IF it happens again but WHEN it does. So I'd like to discuss how we can get a cleaner failure situation. Currently it's just unclean: every status is lost after reboot, the devices look clean, and the caching mode is simply "none", which is completely fine for the boot process.

Thanks,
Kai
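Editor's sketch: the "stale" mode proposed in the message above amounts to persisting the reason a device lost its cache, so that auto-registration after a reboot cannot mistake it for a clean device. A toy model of that idea follows. The `STALE` state does not exist in bcache; `ToyBackingSuperblock` and its methods are hypothetical and only stand in for metadata that would survive a reboot.

```python
import errno
from enum import Enum


class State(Enum):
    CLEAN = "clean"
    DIRTY = "dirty"
    STALE = "stale"  # hypothetical state: cache lost while dirty


class ToyBackingSuperblock:
    """Stands in for on-disk metadata that survives a reboot."""

    def __init__(self) -> None:
        self.state = State.CLEAN

    def cache_lost(self) -> None:
        """Record why the device lost its cache before anything else happens."""
        if self.state is State.DIRTY:
            self.state = State.STALE

    def register(self) -> str:
        """What auto-registration at boot would do with this device."""
        if self.state is State.STALE:
            raise OSError(errno.EIO, "backing device is stale, admin action required")
        return "/dev/bcache0"

    def force_discard_dirty(self) -> None:
        """Explicit admin decision to accept the data loss."""
        self.state = State.CLEAN
```

The point of the model is that the failure stays visible until someone makes a decision: registration fails loudly while the state is `STALE`, and only an explicit call such as `force_discard_dirty()` brings the device back.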
[parent not found: <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>]
* Re: Dirty data loss after cache disk error recovery
  [not found]         ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
@ 2023-09-06 22:56   ` Kai Krakow
  [not found]             ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-09-06 22:56 UTC
  To: Eric Wheeler
  Cc: Coly Li, linux-bcache, 吴本卿(云桌面 福州)

Wow! I call that a necro-bump... ;-)

On Wed, 6 Sept 2023 at 22:33, Eric Wheeler <lists@bcache.ewheeler.net> wrote:
>
> On Fri, 7 May 2021, Kai Krakow wrote:
>
> > > Adding a new "stop" error action IMHO doesn't make things better. When the cache device is disconnected, it is always risky that some caching data or meta data is not updated onto the cache device. Permitting the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse.... That was the reason why I didn't add a new error action in the device failure handling patch set.
> >
> > But we are actually now seeing silent data loss: the system f'ed up somehow, needed a hard reset, and after reboot the bcache device was accessible in cache mode "none" (because it had been unregistered before, and because udev just detected it and you can use bcache without an attached cache in "none" mode), completely hiding the fact that we lost dirty write-back data. It's not even obvious that /dev/bcache0 is now detached and in cache mode none, but accessible nevertheless. To me, this is quite clearly "silent data loss", especially since the unregister action threw the dirty data away.
> >
> > So this:
> >
> > > Permitting the cache device to be re-attached to the backing device may introduce "silent data loss" which might be worse....
> >
> > is actually the situation we are facing currently: the device has been unregistered; after reboot, udev detects a seemingly clean backing device without cache association, using cache mode none, and it is readable and writable just fine. It essentially permitted access to the stale backing device (though it didn't re-attach as you outlined, that's more or less the same situation).
> >
> > Maybe devices that become disassociated from a cache due to IO errors but have dirty data should go to a caching mode "stale", and bcache should refuse to access such devices or throw away their dirty data until I decide to force them back online into the cache set or force discard the dirty data. Then at least I would discover that something went badly wrong. Otherwise, I may not detect that dirty data wasn't written. In the best case, that makes my FS unmountable; in the worst case, some file data is simply lost (aka silent data loss). Both situations are worst-case scenarios anyway.
> >
> > The whole situation probably comes from udev auto-registering bcache backing devices again, and bcache has no record of why the device was unregistered - it looks clean after such a situation.

[...]

> I think we hit this same issue from 2021. Here is that original thread from 2021:
> https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
>
> Kai, did you end up with a good patch for this? We are running a 5.15 kernel with the many backported bcache commits that Coly suggested here:
> https://www.spinics.net/lists/linux-bcache/msg12084.html

I'm currently running 6.1 with bcache on mdraid1 and device-level write caching disabled. I didn't see this ever occur again.

BUT: Between that time and now I eventually also replaced my faulty RAM which had a few rare bit-flips.

> Based on the thread from Kai (from 2021), I think we need to restore from backup. While the root of the problem may be hardware related, bcache should be more graceful than unplugging the cache.

Yes, it may be hardware-related and you should probably confirm that your RAM is working properly.

Currently, I'm running with no bcache patches on LTS 6.1, only some btrfs patches: https://github.com/kakra/linux/pull/26

Especially the allocation-hint patches provide better speedups for metadata than bcache could ever do. With these patches, you could dedicate a small part of two SSDs (partitions on different drives) to a btrfs metadata raid1, and use the remainder of the SSDs as a bcache mdraid1. Then just don't use writeback caching but writearound or writethrough instead. Most btrfs performance issues come from slow metadata, which allocator hints improve much better than bcache does.

But as written above, I had bad RAM, and meanwhile upgraded to kernel 6.1, and have had no issues with bcache since, even on power loss.

> Coly, is there already a patch to prevent complete dirty cache loss?

This is probably still an issue. The cache attachment MUST NEVER EVER automatically degrade to "none" which it did for my fail-cases I had back then. I don't know if this has changed meanwhile.

But because bcache explicitly does not honor write-barriers from upstream writes for its own writeback (which is okay because it guarantees to write back all data anyways and give a consistent view to upstream FS - well, unless it has to handle write errors), the backed filesystem is guaranteed to be effed up in that case, and allowing it to mount and write because bcache silently has fallen back to "none" will only make the matter worse.

(HINT: I have never used drbd personally; most of the following is theoretical thinking without real-world experience)

I see that you're using drbd? Did it fail due to networking issues? I'm pretty sure it should be robust in that case but maybe bcache cannot handle the situation? Does drbd have a write log to replay writes after network connection loss? It looks like it doesn't and thus bcache exploded.

Anyways, since your backing device seems to be on drbd, using metadata allocation hinting is probably not an option. You could of course still use drbd with bcache for metadata-hinted partitions, and then use writearound caching only for that. At least, in the failure case, your btrfs won't be destroyed. But your data chunks may then contain unreadable files. It should be easy to identify them and restore them from backup individually. Btrfs is very robust for that failure case: if metadata is okay, data errors are properly detected and handled. If you're not using btrfs, all of this doesn't apply ofc.

I'm not sure if write-back caching for drbd backing is a wise decision anyways. drbd is slow for writes, that's part of the design (and no writeback caching could fix that). I would not rely on bcache-writeback to fix that for you because it is not prepared for storage that may be temporarily unavailable, iow, it would not freeze and continue when drbd is available again. I think you should really use writearound/writethrough so your FS can be sure data has been written, replicated and persisted. In case of btrfs, you could still split data and metadata as written above, and use writeback for data, but reliable writes for metadata.

So concluding:

1. I'm now persisting metadata directly to disk with no intermediate layers (no bcache, no md)

2. I'm using allocation-hinted data-only partitions with bcache write-back, with bcache on mdraid1. If anything goes wrong, I have file crc errors in btrfs files only, but the filesystem itself is valid because no metadata is broken or lost. I have snapshots of recently modified files. I have daily backups.

3. Your problem is that bcache can - by design - detect write errors only when it's too late, with no chance of telling the filesystem. In that case, writethrough/writearound is the correct choice.

4. Maybe bcache should know if backing is on storage that may be temporarily unavailable and then freeze until the backing storage is back online, similar to how iSCSI handles that.

But otoh, maybe drbd should freeze until the replicated storage is available again while writing (from what I've read, it's designed to not do that but let local storage get ahead of the replica, which is btw incompatible with bcache-writeback assumptions). Or maybe using async mirroring can fix this for you but then, the mirror will be compromised if a hardware failure immediately follows a previous drbd network connection loss. But, it may still be an issue with the local hardware (bit-flips etc) because maybe just bcache internals broke - Coly may have a better idea of that.

I think your main issue here is that bcache decouples write barriers from the underlying backing storage - and you should just not use writeback, it is incompatible by design with how drbd works: your replica will be broken when you need it.

> Here is our trace:
>
> [Sep 6 13:01] bcache: bch_cache_set_error() error on a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, 0:1163806048 gen 3: bad, length too big, disabling caching
> [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
> [ +0.000548] block drbd8143: write: error=10 s=9205904s
> [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
> [ +0.000866] block drbd8143: Local IO failed in drbd_endio_write_sec_final.
> [ +0.000809] Workqueue: bcache bch_data_insert_keys
> [ +0.000833] block drbd8143: disk( UpToDate -> Inconsistent )
> [ +0.000826] Call Trace:
> [ +0.000875] block drbd8143: write: error=10 s=8394752s
> [ +0.000797] <TASK>
> [ +0.000006] dump_stack_lvl+0x57/0x7e
> [ +0.000791] block drbd8143: Local IO failed in drbd_endio_write_sec_final.
> [ +0.000755] bch_extent_invalid.cold+0x9/0x10
> [ +0.000760] block drbd8143: write: error=10 s=8397840s
> [ +0.000759] btree_mergesort+0x27e/0x36e
> [ +0.000005] ? bch_cache_allocator_start+0x50/0x50
> [ +0.000009] __btree_sort+0xa4/0x1e9
> [ +0.002085] block drbd8143: drbd_md_sync_page_io(,41943032s,WRITE) failed with error -5
> [ +0.000109] bch_btree_sort_partial+0xbc/0x14d
> [ +0.000878] block drbd8143: meta data update failed!
> [ +0.000836] bch_btree_init_next+0x39/0xb6
> [ +0.000004] bch_btree_insert_node+0x26e/0x2d3
> [ +0.000877] block drbd8143: disk( Inconsistent -> Failed )
> [ +0.000863] btree_insert_fn+0x20/0x48
> [ +0.000866] block drbd8143: Local IO failed in drbd_md_write. Detaching...
> [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7
> [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1
> [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb
> [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1
> [ +0.000848] bch_btree_insert+0x102/0x188
> [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf
> [ +0.000857] bch_data_insert_keys+0x39/0xde
> [ +0.000845] process_one_work+0x280/0x5cf
> [ +0.000858] worker_thread+0x52/0x3bd
> [ +0.000851] ? process_one_work.cold+0x52/0x51
> [ +0.000877] kthread+0x13e/0x15b
> [ +0.000858] ? set_kthread_struct+0x60/0x52
> [ +0.000855] ret_from_fork+0x22/0x2d
> [ +0.000854] </TASK>

Regards,
Kai
[parent not found: <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>]
* Re: Dirty data loss after cache disk error recovery [not found] ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net> @ 2023-09-07 12:00 ` Kai Krakow 2023-09-07 19:10 ` Eric Wheeler 2023-09-12 6:54 ` 邹明哲 1 sibling, 1 reply; 17+ messages in thread From: Kai Krakow @ 2023-09-07 12:00 UTC (permalink / raw) To: Eric Wheeler Cc: Coly Li, linux-bcache, 吴本卿(云桌面 福州), Mingzhe Zou Am Do., 7. Sept. 2023 um 02:42 Uhr schrieb Eric Wheeler <lists@bcache.ewheeler.net>: > > +Mingzhe, Coly: please comment on the proposed fix below when you have a > moment: > > > > Coly, is there already a patch to prevent complete dirty cache loss? > > > > This is probably still an issue. The cache attachment MUST NEVER EVER > > automatically degrade to "none" which it did for my fail-cases I had > > back then. I don't know if this has changed meanwhile. > > I would rather that bcache went to a read-only mode in failure > conditions like this. Maybe write-around would be acceptable since > bcache returns -EIO for any failed dirty cache reads. But if the cache > is dirty, and it gets an error, it _must_never_ read from the bdev, which > is what appears to happens now. > > Coly, Mingzhe, would this be an easy change? > > Here are the relevant bits: > > The allocator called btree_mergesort which called bch_extent_invalid: > https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480 > > Which called the `cache_bug` macro, which triggered bch_cache_set_error: > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626 > > It then calls `bch_cache_set_unregister` which shuts down the cache: > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845 > > bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...) > { > ... 
>         bch_cache_set_unregister(c);
>         return true;
>     }
>
> Proposed solution:
>
> What if, instead of bch_cache_set_unregister() that this was called instead:
>         SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
>
> This would bypass the cache for future writes, and allow reads to
> proceed if possible, and -EIO otherwise to let upper layers handle the
> failure.

Making sure not to read stale content from bdev by switching to
writearound is probably a proper solution - if there are no other
side-effects. But due to the error, the cdev may be in some broken
limbo state. So it should probably try to write back dirty data while
adding no more future data - neither for read-caching nor
write-caching. Maybe this was the intention of unregister but instead
of writing back dirty data and still serving dirty data from cdev, it
immediately unregisters and invalidates the cdev.

So maybe the bugfix should be about why unregister() doesn't write
back dirty data first...

So actually switching to "none" but without unregister should probably
provide that exact behavior? No more read/write but finishing
outstanding dirty writeback.

Earlier I wrote:
> > This is probably still an issue. The cache attachment MUST NEVER EVER
> > automatically degrade to "none" which it did for my fail-cases I had

This was meant under the assumption that "none" is the state after
unregister - just to differentiate from what I wrote immediately
before.

> What do you think?
>
> > But because bcache explicitly does not honor write-barriers from
> > upstream writes for its own writeback (which is okay because it
> > guarantees to write back all data anyways and give a consistent view to
> > upstream FS - well, unless it has to handle write errors), the backed
> > filesystem is guaranteed to be effed up in that case, and allowing it to
> > mount and write because bcache silently has fallen back to "none" will
> > only make the matter worse.
> > > > (HINT: I never used brbd personally, most of the following is > > theoretical thinking without real-world experience) > > > > I see that you're using drbd? Did it fail due to networking issues? > > I'm pretty sure it should be robust in that case but maybe bcache > > cannot handle the situation? Does brbd have a write log to replay > > writes after network connection loss? It looks like it doesn't and > > thus bcache exploded. > > DRBD is _above_ bcache, not below it. In this case, DRBD hung because > bcache hung, not the other way around, so DRBD is not the issue here. > Here is our stack: > > bcache: > bdev: /dev/sda hardware RAID5 > cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1 > > And then bcache is stacked like so: > > bcache <- dm-thin <- DRBD <- dm-crypt <- KVM > | > v > [remote host] > > > Anyways, since your backing device seems to be on drbd, using metadata > > allocation hinting is probably no option. You could of course still use > > drbd with bcache for metadata hinted partitions, and then use > > writearound caching only for that. At least, in the fail-case, your > > btrfs won't be destroyed. But your data chunks may have unreadable files > > then. But it should be easy to select them and restore from backup > > individually. Btrfs is very robust for that fail case: if metadata is > > okay, data errors are properly detected and handled. If you're not using > > btrfs, all of this doesn't apply ofc. > > > > I'm not sure if write-back caching for drbd backing is a wise decision > > anyways. drbd is slow for writes, that's part of the design (and no > > writeback caching could fix that). > > Bcache-backed DRBD provides a noticable difference, especially with a > 10GbE link (or faster) and the same disk stack on both sides. 
> > > I would not rely on bcache-writeback to fix that for you because it is > > not prepared for storage that may be temporarily not available > > True, which is why we put drbd /on top/ of bcache, so bcache is unaware of > DRBD's existence. > > > iow, it would freeze and continue when drbd is available again. I think > > you should really use writearound/writethrough so your FS can be sure > > data has been written, replicated and persisted. In case of btrfs, you > > could still split data and metadata as written above, and use writeback > > for data, but reliable writes for metadata. > > > > So concluding: > > > > 1. I'm now persisting metadata directly to disk with no intermediate > > layers (no bcache, no md) > > > > 2. I'm using allocation-hinted data-only partitions with bcache > > write-back, with bcache on mdraid1. If anything goes wrong, I have > > file crc errors in btrfs files only, but the filesystem itself is > > valid because no metadata is broken or lost. I have snapshots of > > recently modified files. I have daily backups. > > > > 3. Your problem is that bcache can - by design - detect write errors > > only when it's too late with no chance telling the filesystem. In that > > case, writethrough/writearound is the correct choice. > > > > 4. Maybe bcache should know if backing is on storage that may be > > temporarily unavailable and then freeze until the backing storage is > > back online, similar to how iSCSI handles that. > > I don't think "temporarily unavailable" should be bcache's burden, as > bcache is a local-only solution. If someone is using iSCSI under bcache, > then good luck ;) > > > But otoh, maybe drbd should freeze until the replicated storage is > > available again while writing (from what I've read, it's designed to not > > do that but let local storage get ahead of the replica, which is btw > > incompatible with bcache-writeback assumptions). 
> > N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected > and has no local copy for some reason. If local storage is available, it > will use that and resync when its peer comes up. > > > Or maybe using async mirroring can fix this for you but then, the mirror > > will be compromised if a hardware failure immediately follows a previous > > drbd network connection loss. But, it may still be an issue with the > > local hardware (bit-flips etc) because maybe just bcache internals broke > > - Coly may have a better idea of that. > > This isn't DRBDs fault since it is above bcache. I wish only address the > the bcache cache=none issue. > > -Eric > > > > > I think your main issue here is that bcache decouples writebarriers > > from the underlying backing storage - and you should just not use > > writeback, it is incompatible by design with how drbd works: your > > replica will be broken when you need it. > > > > > > > > > Here is our trace: > > > > > > [Sep 6 13:01] bcache: bch_cache_set_error() error on > > > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent > > > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, > > > 0:1163806048 gen 3: bad, length too big, disabling caching > > > > > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7 > > > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021 > > > [ +0.000809] Workqueue: bcache bch_data_insert_keys > > > [ +0.000826] Call Trace: > > > [ +0.000797] <TASK> > > > [ +0.000006] dump_stack_lvl+0x57/0x7e > > > [ +0.000755] bch_extent_invalid.cold+0x9/0x10 > > > [ +0.000759] btree_mergesort+0x27e/0x36e > > > [ +0.000005] ? 
bch_cache_allocator_start+0x50/0x50 > > > [ +0.000009] __btree_sort+0xa4/0x1e9 > > > [ +0.000109] bch_btree_sort_partial+0xbc/0x14d > > > [ +0.000836] bch_btree_init_next+0x39/0xb6 > > > [ +0.000004] bch_btree_insert_node+0x26e/0x2d3 > > > [ +0.000863] btree_insert_fn+0x20/0x48 > > > [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7 > > > [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb > > > [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > [ +0.000848] bch_btree_insert+0x102/0x188 > > > [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf > > > [ +0.000857] bch_data_insert_keys+0x39/0xde > > > [ +0.000845] process_one_work+0x280/0x5cf > > > [ +0.000858] worker_thread+0x52/0x3bd > > > [ +0.000851] ? process_one_work.cold+0x52/0x51 > > > [ +0.000877] kthread+0x13e/0x15b > > > [ +0.000858] ? set_kthread_struct+0x60/0x52 > > > [ +0.000855] ret_from_fork+0x22/0x2d > > > [ +0.000854] </TASK> > > > > > > Regards, > > Kai > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Dirty data loss after cache disk error recovery 2023-09-07 12:00 ` Kai Krakow @ 2023-09-07 19:10 ` Eric Wheeler 0 siblings, 0 replies; 17+ messages in thread From: Eric Wheeler @ 2023-09-07 19:10 UTC (permalink / raw) To: Kai Krakow Cc: Coly Li, linux-bcache, 吴本卿(云桌面 福州), Mingzhe Zou On Thu, 7 Sep 2023, Kai Krakow wrote: > Am Do., 7. Sept. 2023 um 02:42 Uhr schrieb Eric Wheeler > <lists@bcache.ewheeler.net>: > > > > +Mingzhe, Coly: please comment on the proposed fix below when you have a > > moment: > > > > > > Coly, is there already a patch to prevent complete dirty cache loss? > > > > > > This is probably still an issue. The cache attachment MUST NEVER EVER > > > automatically degrade to "none" which it did for my fail-cases I had > > > back then. I don't know if this has changed meanwhile. > > > > I would rather that bcache went to a read-only mode in failure > > conditions like this. Maybe write-around would be acceptable since > > bcache returns -EIO for any failed dirty cache reads. But if the cache > > is dirty, and it gets an error, it _must_never_ read from the bdev, which > > is what appears to happens now. > > > > Coly, Mingzhe, would this be an easy change? > > > > Here are the relevant bits: > > > > The allocator called btree_mergesort which called bch_extent_invalid: > > https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480 > > > > Which called the `cache_bug` macro, which triggered bch_cache_set_error: > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626 > > > > It then calls `bch_cache_set_unregister` which shuts down the cache: > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845 > > > > bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...) > > { > > ... 
> >         bch_cache_set_unregister(c);
> >         return true;
> >     }
> >
> > Proposed solution:
> >
> > What if, instead of bch_cache_set_unregister() that this was called instead:
> >         SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)
> >
> > This would bypass the cache for future writes, and allow reads to
> > proceed if possible, and -EIO otherwise to let upper layers handle the
> > failure.
>
> Ensuring to not read stale content from bdev by switching to
> writearound is probably a proper solution - if there are no other
> side-effects. But due to the error, the cdev may be in some broken
> limbo state. So it should probably try to writeback dirty data while
> adding no more future data - neither for read-caching nor
> write-caching. Maybe this was the intention of unregister but instead
> of writing back dirty data and still serving dirty data from cdev, it
> immediately unregisters and invalidates the cdev.
>
> So maybe the bugfix should be about why unregister() doesn't write
> back dirty data first...

So maybe it should "detach" in the same way that
/sys/block/bcache0/bcache/detach triggers removal of the cache.

There seem to be three proposed graceful failure states in this
situation:

1. Set read-only for all bcache gendisk devices that use the failed
   cache
2. Set write-around and try to continue.
3. "Detach" the cache for all bcache devices using the failed cache.
   If this fails, then maybe fall back to #1 or #2.

Coly, Mingzhe, what do you think would be best in terms of
implementation?

--
Eric Wheeler

>
> So actually switching to "none" but without unregister should probably
> provide that exact behavior? No more read/write but finishing
> outstanding dirty writeback.
>
> Earlier I write:
> > > This is probably still an issue.
The cache attachment MUST NEVER EVER > > > automatically degrade to "none" which it did for my fail-cases I had > > This was meant under the assumption that "none" is the state after > unregister - just to differentiate from what I wrote immediately > before. > > > > What do you think? > > > > > But because bcache explicitly does not honor write-barriers from > > > upstream writes for its own writeback (which is okay because it > > > guarantees to write back all data anyways and give a consistent view to > > > upstream FS - well, unless it has to handle write errors), the backed > > > filesystem is guaranteed to be effed up in that case, and allowing it to > > > mount and write because bcache silently has fallen back to "none" will > > > only make the matter worse. > > > > > > (HINT: I never used brbd personally, most of the following is > > > theoretical thinking without real-world experience) > > > > > > I see that you're using drbd? Did it fail due to networking issues? > > > I'm pretty sure it should be robust in that case but maybe bcache > > > cannot handle the situation? Does brbd have a write log to replay > > > writes after network connection loss? It looks like it doesn't and > > > thus bcache exploded. > > > > DRBD is _above_ bcache, not below it. In this case, DRBD hung because > > bcache hung, not the other way around, so DRBD is not the issue here. > > Here is our stack: > > > > bcache: > > bdev: /dev/sda hardware RAID5 > > cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1 > > > > And then bcache is stacked like so: > > > > bcache <- dm-thin <- DRBD <- dm-crypt <- KVM > > | > > v > > [remote host] > > > > > Anyways, since your backing device seems to be on drbd, using metadata > > > allocation hinting is probably no option. You could of course still use > > > drbd with bcache for metadata hinted partitions, and then use > > > writearound caching only for that. At least, in the fail-case, your > > > btrfs won't be destroyed. 
But your data chunks may have unreadable files > > > then. But it should be easy to select them and restore from backup > > > individually. Btrfs is very robust for that fail case: if metadata is > > > okay, data errors are properly detected and handled. If you're not using > > > btrfs, all of this doesn't apply ofc. > > > > > > I'm not sure if write-back caching for drbd backing is a wise decision > > > anyways. drbd is slow for writes, that's part of the design (and no > > > writeback caching could fix that). > > > > Bcache-backed DRBD provides a noticable difference, especially with a > > 10GbE link (or faster) and the same disk stack on both sides. > > > > > I would not rely on bcache-writeback to fix that for you because it is > > > not prepared for storage that may be temporarily not available > > > > True, which is why we put drbd /on top/ of bcache, so bcache is unaware of > > DRBD's existence. > > > > > iow, it would freeze and continue when drbd is available again. I think > > > you should really use writearound/writethrough so your FS can be sure > > > data has been written, replicated and persisted. In case of btrfs, you > > > could still split data and metadata as written above, and use writeback > > > for data, but reliable writes for metadata. > > > > > > So concluding: > > > > > > 1. I'm now persisting metadata directly to disk with no intermediate > > > layers (no bcache, no md) > > > > > > 2. I'm using allocation-hinted data-only partitions with bcache > > > write-back, with bcache on mdraid1. If anything goes wrong, I have > > > file crc errors in btrfs files only, but the filesystem itself is > > > valid because no metadata is broken or lost. I have snapshots of > > > recently modified files. I have daily backups. > > > > > > 3. Your problem is that bcache can - by design - detect write errors > > > only when it's too late with no chance telling the filesystem. In that > > > case, writethrough/writearound is the correct choice. > > > > > > 4. 
Maybe bcache should know if backing is on storage that may be > > > temporarily unavailable and then freeze until the backing storage is > > > back online, similar to how iSCSI handles that. > > > > I don't think "temporarily unavailable" should be bcache's burden, as > > bcache is a local-only solution. If someone is using iSCSI under bcache, > > then good luck ;) > > > > > But otoh, maybe drbd should freeze until the replicated storage is > > > available again while writing (from what I've read, it's designed to not > > > do that but let local storage get ahead of the replica, which is btw > > > incompatible with bcache-writeback assumptions). > > > > N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected > > and has no local copy for some reason. If local storage is available, it > > will use that and resync when its peer comes up. > > > > > Or maybe using async mirroring can fix this for you but then, the mirror > > > will be compromised if a hardware failure immediately follows a previous > > > drbd network connection loss. But, it may still be an issue with the > > > local hardware (bit-flips etc) because maybe just bcache internals broke > > > - Coly may have a better idea of that. > > > > This isn't DRBDs fault since it is above bcache. I wish only address the > > the bcache cache=none issue. > > > > -Eric > > > > > > > > I think your main issue here is that bcache decouples writebarriers > > > from the underlying backing storage - and you should just not use > > > writeback, it is incompatible by design with how drbd works: your > > > replica will be broken when you need it. 
> > > > > > > > > > > > > > Here is our trace: > > > > > > > > [Sep 6 13:01] bcache: bch_cache_set_error() error on > > > > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent > > > > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, > > > > 0:1163806048 gen 3: bad, length too big, disabling caching > > > > > > > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7 > > > > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021 > > > > [ +0.000809] Workqueue: bcache bch_data_insert_keys > > > > [ +0.000826] Call Trace: > > > > [ +0.000797] <TASK> > > > > [ +0.000006] dump_stack_lvl+0x57/0x7e > > > > [ +0.000755] bch_extent_invalid.cold+0x9/0x10 > > > > [ +0.000759] btree_mergesort+0x27e/0x36e > > > > [ +0.000005] ? bch_cache_allocator_start+0x50/0x50 > > > > [ +0.000009] __btree_sort+0xa4/0x1e9 > > > > [ +0.000109] bch_btree_sort_partial+0xbc/0x14d > > > > [ +0.000836] bch_btree_init_next+0x39/0xb6 > > > > [ +0.000004] bch_btree_insert_node+0x26e/0x2d3 > > > > [ +0.000863] btree_insert_fn+0x20/0x48 > > > > [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7 > > > > [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > > [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb > > > > [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > > [ +0.000848] bch_btree_insert+0x102/0x188 > > > > [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf > > > > [ +0.000857] bch_data_insert_keys+0x39/0xde > > > > [ +0.000845] process_one_work+0x280/0x5cf > > > > [ +0.000858] worker_thread+0x52/0x3bd > > > > [ +0.000851] ? process_one_work.cold+0x52/0x51 > > > > [ +0.000877] kthread+0x13e/0x15b > > > > [ +0.000858] ? set_kthread_struct+0x60/0x52 > > > > [ +0.000855] ret_from_fork+0x22/0x2d > > > > [ +0.000854] </TASK> > > > > > > > > > Regards, > > > Kai > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
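The fallback order Eric lists in this message (try to detach, otherwise
fall back to write-around or read-only) can be sketched as a small
decision function. This is only a model under stated assumptions: the
names fallback_action, cache_health and pick_fallback are hypothetical
and do not exist in drivers/md/bcache, and the choice between
write-around and read-only (based on whether cached extents can still
be invalidated) is an assumption made for illustration, not something
the thread settles.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the three graceful failure states discussed in
 * the thread. None of these names exist in the kernel sources.
 */
enum fallback_action {
	FALLBACK_DETACH,      /* #3: flush dirty data, then detach the cache */
	FALLBACK_WRITEAROUND, /* #2: send new writes straight to the backing device */
	FALLBACK_READ_ONLY,   /* #1: mark the bcache gendisk devices read-only */
};

struct cache_health {
	bool has_dirty_data;   /* dirty extents exist only on the cache device */
	bool cache_readable;   /* dirty extents can still be read for writeback */
	bool can_update_index; /* cached extents can still be invalidated */
};

/*
 * Prefer a clean detach. If dirty data cannot be flushed, only keep
 * accepting writes when stale cached extents can be invalidated
 * (assumed policy); otherwise stop accepting writes altogether.
 */
enum fallback_action pick_fallback(const struct cache_health *h)
{
	if (!h->has_dirty_data || h->cache_readable)
		return FALLBACK_DETACH;
	if (h->can_update_index)
		return FALLBACK_WRITEAROUND;
	return FALLBACK_READ_ONLY;
}
```

In this model a cache with unreadable dirty data never ends up detached,
which matches the requirement stated earlier in the thread that the
backing device must not silently become accessible without its dirty
data.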
* Re:Re: Dirty data loss after cache disk error recovery [not found] ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net> 2023-09-07 12:00 ` Kai Krakow @ 2023-09-12 6:54 ` 邹明哲 [not found] ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net> 1 sibling, 1 reply; 17+ messages in thread From: 邹明哲 @ 2023-09-12 6:54 UTC (permalink / raw) To: Eric Wheeler Cc: Coly Li, Kai Krakow, linux-bcache, 吴本卿(云桌面 福州) From: Eric Wheeler <lists@bcache.ewheeler.net> Date: 2023-09-07 08:42:41 To: Coly Li <colyli@suse.de> Cc: Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn> Subject: Re: Dirty data loss after cache disk error recovery >+Mingzhe, Coly: please comment on the proposed fix below when you have a >moment: Hi, Eric This is an old issue, and it took me a long time to understand what happened. > >On Thu, 7 Sep 2023, Kai Krakow wrote: >> Wow! >> >> I call that a necro-bump... ;-) >> >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler >> <lists@bcache.ewheeler.net>: >> > >> > On Fri, 7 May 2021, Kai Krakow wrote: >> > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When >> > > > the cache device is disconnected, it is always risky that some caching >> > > > data or meta data is not updated onto cache device. Permit the cache >> > > > device to be re-attached to the backing device may introduce "silent >> > > > data loss" which might be worse.... It was the reason why I didn't add >> > > > new error action for the device failure handling patch set. 
>> > > >> > > But we are actually now seeing silent data loss: The system f'ed up >> > > somehow, needed a hard reset, and after reboot the bcache device was >> > > accessible in cache mode "none" (because they have been unregistered >> > > before, and because udev just detected it and you can use bcache >> > > without an attached cache in "none" mode), completely hiding the fact >> > > that we lost dirty write-back data, it's even not quite obvious that >> > > /dev/bcache0 now is detached, cache mode none, but accessible >> > > nevertheless. To me, this is quite clearly "silent data loss", >> > > especially since the unregister action threw the dirty data away. >> > > >> > > So this: >> > > >> > > > Permit the cache >> > > > device to be re-attached to the backing device may introduce "silent >> > > > data loss" which might be worse.... >> > > >> > > is actually the situation we are facing currently: Device has been >> > > unregistered, after reboot, udev detects it has clean backing device >> > > without cache association, using cache mode none, and it is readable >> > > and writable just fine: It essentially permitted access to the stale >> > > backing device (tho, it didn't re-attach as you outlined, but that's >> > > more or less the same situation). >> > > >> > > Maybe devices that become disassociated from a cache due to IO errors >> > > but have dirty data should go to a caching mode "stale", and bcache >> > > should refuse to access such devices or throw away their dirty data >> > > until I decide to force them back online into the cache set or force >> > > discard the dirty data. Then at least I would discover that something >> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't >> > > written. In the best case, that makes my FS unmountable, in the worst >> > > case, some file data is simply lost (aka silent data loss), besides >> > > both situations are the worst-case scenario anyways. 
>> > >
>> > > The whole situation probably comes from udev auto-registering bcache
>> > > backing devices again, and bcache has no record of why the device was
>> > > unregistered - it looks clean after such a situation.
>>
>> [...]
>>
>> > I think we hit this same issue from 2021. Here is that original thread from 2021:
>> >       https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de
>> >
>> > Kai, did you end up with a good patch for this? We are running a 5.15
>> > kernel with the many backported bcache commits that Coly suggested here:
>> >       https://www.spinics.net/lists/linux-bcache/msg12084.html
>>
>> I'm currently running 6.1 with bcache on mdraid1 and device-level
>> write caching disabled. I didn't see this ever occur again.
>
>Awesome, good to know.
>
>> But as written above, I had bad RAM, and meanwhile upgraded to kernel
>> 6.1, and had no issues since with bcache even on power loss.
>>
>> > Coly, is there already a patch to prevent complete dirty cache loss?
>>
>> This is probably still an issue. The cache attachment MUST NEVER EVER
>> automatically degrade to "none" which it did for my fail-cases I had
>> back then. I don't know if this has changed meanwhile.
>
>I would rather that bcache went to a read-only mode in failure
>conditions like this. Maybe write-around would be acceptable since
>bcache returns -EIO for any failed dirty cache reads. But if the cache
>is dirty, and it gets an error, it _must_never_ read from the bdev, which
>is what appears to happens now.
>
>Coly, Mingzhe, would this be an easy change?

First of all, we have never had this problem. We did have an NVMe
controller failure, but in that case the cache could be neither read
nor written, so even unregister would not succeed.

Coly once replied like this:

"""
There is an option to panic the system when cache device failed. It is
in errors file with available options as "unregister" and "panic".
This option is default set to "unregister", if you set it to "panic"
then panic() will be called.
"""

I think "panic" is a better way to handle this situation. If the cache
returns an error, continuing to operate may lead to further unknown
errors.

>
>Here are the relevant bits:
>
>The allocator called btree_mergesort which called bch_extent_invalid:
>       https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480
>
>Which called the `cache_bug` macro, which triggered bch_cache_set_error:
>       https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626
>
>It then calls `bch_cache_set_unregister` which shuts down the cache:
>       https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845
>
>       bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...)
>       {
>               ...
>               bch_cache_set_unregister(c);
>               return true;
>       }
>
>Proposed solution:
>
>What if, instead of bch_cache_set_unregister() that this was called instead:
>       SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND)

If cache_mode can be changed automatically, when will it be restored
to writeback? I think this behavior needs to be something we can
enable or disable.

>
>This would bypass the cache for future writes, and allow reads to
>proceed if possible, and -EIO otherwise to let upper layers handle the
>failure.
>
>What do you think?

If we switch to writearound mode, how do we ensure that the IO is
read-only? Write IO may require invalidating dirty data. If the
backing write succeeds but the invalidation fails, how should we
handle it?

Maybe "panic" could be the default option. What do you think?
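The concern about a write-around write whose invalidation fails can be
made concrete with a toy model of a single sector. This is a sketch,
not bcache code: sector_model, writearound_write, read_sector and
writeback are hypothetical names, and the model assumes only what the
thread describes (dirty cached data is preferred on read and is later
copied to the backing device by writeback).

```c
#include <stdbool.h>

/* Toy model of one sector; all names here are hypothetical. */
struct sector_model {
	int  backing_value; /* data currently on the backing device */
	int  cached_value;  /* dirty copy held on the cache device */
	bool cached_dirty;  /* true while the index still maps the sector to the cache */
};

/*
 * Write-around: the data goes to the backing device, then the cached
 * copy has to be invalidated. invalidate_ok models whether that index
 * update succeeded.
 */
void writearound_write(struct sector_model *s, int value, bool invalidate_ok)
{
	s->backing_value = value;
	if (invalidate_ok)
		s->cached_dirty = false;
	/* otherwise the stale dirty copy survives in the index */
}

/* Reads prefer the dirty cached copy over the backing device. */
int read_sector(const struct sector_model *s)
{
	return s->cached_dirty ? s->cached_value : s->backing_value;
}

/* Writeback copies a surviving dirty copy over the backing device. */
void writeback(struct sector_model *s)
{
	if (s->cached_dirty) {
		s->backing_value = s->cached_value;
		s->cached_dirty = false;
	}
}
```

With backing value 1 and a dirty cached value 2, a write-around write
of 3 whose invalidation fails leaves reads returning 2, and a later
writeback replaces 3 with 2 on the backing device. In this model the
failed invalidation turns into loss of the newer write, which is the
case the question above asks about.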
> >> But because bcache explicitly does not honor write-barriers from >> upstream writes for its own writeback (which is okay because it >> guarantees to write back all data anyways and give a consistent view to >> upstream FS - well, unless it has to handle write errors), the backed >> filesystem is guaranteed to be effed up in that case, and allowing it to >> mount and write because bcache silently has fallen back to "none" will >> only make the matter worse. >> >> (HINT: I never used brbd personally, most of the following is >> theoretical thinking without real-world experience) >> >> I see that you're using drbd? Did it fail due to networking issues? >> I'm pretty sure it should be robust in that case but maybe bcache >> cannot handle the situation? Does brbd have a write log to replay >> writes after network connection loss? It looks like it doesn't and >> thus bcache exploded. > >DRBD is _above_ bcache, not below it. In this case, DRBD hung because >bcache hung, not the other way around, so DRBD is not the issue here. >Here is our stack: > >bcache: > bdev: /dev/sda hardware RAID5 > cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1 > >And then bcache is stacked like so: > > bcache <- dm-thin <- DRBD <- dm-crypt <- KVM > | > v > [remote host] > >> Anyways, since your backing device seems to be on drbd, using metadata >> allocation hinting is probably no option. You could of course still use >> drbd with bcache for metadata hinted partitions, and then use >> writearound caching only for that. At least, in the fail-case, your >> btrfs won't be destroyed. But your data chunks may have unreadable files >> then. But it should be easy to select them and restore from backup >> individually. Btrfs is very robust for that fail case: if metadata is >> okay, data errors are properly detected and handled. If you're not using >> btrfs, all of this doesn't apply ofc. >> >> I'm not sure if write-back caching for drbd backing is a wise decision >> anyways. 
drbd is slow for writes, that's part of the design (and no >> writeback caching could fix that). > >Bcache-backed DRBD provides a noticable difference, especially with a >10GbE link (or faster) and the same disk stack on both sides. > >> I would not rely on bcache-writeback to fix that for you because it is >> not prepared for storage that may be temporarily not available > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware of >DRBD's existence. > >> iow, it would freeze and continue when drbd is available again. I think >> you should really use writearound/writethrough so your FS can be sure >> data has been written, replicated and persisted. In case of btrfs, you >> could still split data and metadata as written above, and use writeback >> for data, but reliable writes for metadata. >> >> So concluding: >> >> 1. I'm now persisting metadata directly to disk with no intermediate >> layers (no bcache, no md) >> >> 2. I'm using allocation-hinted data-only partitions with bcache >> write-back, with bcache on mdraid1. If anything goes wrong, I have >> file crc errors in btrfs files only, but the filesystem itself is >> valid because no metadata is broken or lost. I have snapshots of >> recently modified files. I have daily backups. >> >> 3. Your problem is that bcache can - by design - detect write errors >> only when it's too late with no chance telling the filesystem. In that >> case, writethrough/writearound is the correct choice. >> >> 4. Maybe bcache should know if backing is on storage that may be >> temporarily unavailable and then freeze until the backing storage is >> back online, similar to how iSCSI handles that. > >I don't think "temporarily unavailable" should be bcache's burden, as >bcache is a local-only solution. 
If someone is using iSCSI under bcache, >then good luck ;) > >> But otoh, maybe drbd should freeze until the replicated storage is >> available again while writing (from what I've read, it's designed to not >> do that but let local storage get ahead of the replica, which is btw >> incompatible with bcache-writeback assumptions). > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected >and has no local copy for some reason. If local storage is available, it >will use that and resync when its peer comes up. > >> Or maybe using async mirroring can fix this for you but then, the mirror >> will be compromised if a hardware failure immediately follows a previous >> drbd network connection loss. But, it may still be an issue with the >> local hardware (bit-flips etc) because maybe just bcache internals broke >> - Coly may have a better idea of that. > >This isn't DRBDs fault since it is above bcache. I wish only address the >the bcache cache=none issue. > >-Eric > >> >> I think your main issue here is that bcache decouples writebarriers >> from the underlying backing storage - and you should just not use >> writeback, it is incompatible by design with how drbd works: your >> replica will be broken when you need it. 
>
>
>
>>
>>
>> > Here is our trace:
>> >
>> > [Sep 6 13:01] bcache: bch_cache_set_error() error on
>> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent
>> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48,
>> > 0:1163806048 gen 3: bad, length too big, disabling caching
>>
>> > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7
>> > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021
>> > [ +0.000809] Workqueue: bcache bch_data_insert_keys
>> > [ +0.000826] Call Trace:
>> > [ +0.000797] <TASK>
>> > [ +0.000006] dump_stack_lvl+0x57/0x7e
>> > [ +0.000755] bch_extent_invalid.cold+0x9/0x10
>> > [ +0.000759] btree_mergesort+0x27e/0x36e
>> > [ +0.000005] ? bch_cache_allocator_start+0x50/0x50
>> > [ +0.000009] __btree_sort+0xa4/0x1e9
>> > [ +0.000109] bch_btree_sort_partial+0xbc/0x14d
>> > [ +0.000836] bch_btree_init_next+0x39/0xb6
>> > [ +0.000004] bch_btree_insert_node+0x26e/0x2d3
>> > [ +0.000863] btree_insert_fn+0x20/0x48
>> > [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7
>> > [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1
>> > [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb
>> > [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1
>> > [ +0.000848] bch_btree_insert+0x102/0x188
>> > [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf
>> > [ +0.000857] bch_data_insert_keys+0x39/0xde
>> > [ +0.000845] process_one_work+0x280/0x5cf
>> > [ +0.000858] worker_thread+0x52/0x3bd
>> > [ +0.000851] ? process_one_work.cold+0x52/0x51
>> > [ +0.000877] kthread+0x13e/0x15b
>> > [ +0.000858] ? set_kthread_struct+0x60/0x52
>> > [ +0.000855] ret_from_fork+0x22/0x2d
>> > [ +0.000854] </TASK>
>>
>>
>> Regards,
>> Kai
>>

^ permalink raw reply [flat|nested] 17+ messages in thread
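Point 3 in the message above (in writeback mode bcache learns of a backing-device write error only after the write has already been acknowledged) can be illustrated with a toy model. This is only a sketch in Python, not bcache code: `ToyBacking`, `ToyCache` and `BackingError` are hypothetical names made up for the illustration, and the model leaves out that the real driver also stores the data on the cache device in writethrough mode.

```python
class BackingError(OSError):
    """Raised by the toy backing device when a write fails."""


class ToyBacking:
    """Hypothetical stand-in for a backing device (not a bcache structure)."""

    def __init__(self, fail_writes=False):
        self.blocks = {}
        self.fail_writes = fail_writes

    def write(self, lba, data):
        if self.fail_writes:
            raise BackingError(f"write to block {lba} failed")
        self.blocks[lba] = data


class ToyCache:
    """Toy cache in front of a ToyBacking, in one of two modes."""

    def __init__(self, backing, mode):
        if mode not in ("writeback", "writethrough"):
            raise ValueError(f"unsupported mode: {mode}")
        self.backing = backing
        self.mode = mode
        self.dirty = {}    # acknowledged to the writer, not yet on backing
        self.failed = []   # blocks whose writeback failed after the ack

    def write(self, lba, data):
        if self.mode == "writethrough":
            # The backing write happens before the acknowledgement,
            # so a failure is raised to the caller (the filesystem).
            self.backing.write(lba, data)
            return
        # writeback: acknowledge as soon as the data sits in the cache.
        self.dirty[lba] = data

    def flush(self):
        """Background writeback; the original writer is no longer waiting."""
        for lba, data in list(self.dirty.items()):
            try:
                self.backing.write(lba, data)
            except BackingError:
                # Nobody is left to report this to; the block stays dirty.
                self.failed.append(lba)
            else:
                del self.dirty[lba]


if __name__ == "__main__":
    bad_disk = ToyBacking(fail_writes=True)

    wb = ToyCache(bad_disk, "writeback")
    wb.write(7, b"data")   # looks like a success to the writer
    wb.flush()             # the error only shows up here
    print("writeback: blocks that failed after the ack:", wb.failed)

    wt = ToyCache(bad_disk, "writethrough")
    try:
        wt.write(7, b"data")
    except BackingError as err:
        print("writethrough: writer sees the error:", err)
```

In the writeback case the caller of `write()` never sees an exception, which is the "too late, with no chance of telling the filesystem" situation described above; in the writethrough case the same failure is visible to the writer.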
[parent not found: <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>]
* Re: Re: Dirty data loss after cache disk error recovery
  [not found] ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
@ 2023-10-11 16:19 ` Kai Krakow
  2023-10-16 23:39 ` Eric Wheeler
  2023-10-11 16:29 ` Kai Krakow
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-10-11 16:19 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲, Coly Li, linux-bcache, 吴本卿(云桌面 福州)

Hello!

Sorry for the top-posting. I just want to share my story without
removing all of the context:

I've now faced a similar issue where one of my HDDs spontaneously
decided to have a series of bad blocks. It looks like it has 26145
failed writes due to how bcache handles writeback. It had 5275 failed
reads with btrfs loudly complaining about it. The system also became
really slow to respond until it eventually froze.

After a reboot it worked again, but of course there were still bad
blocks because bcache did writeback, so no blocks had been replaced
by the btrfs auto-repair-on-read feature. This time, the system handled
the situation a bit better, but files became inaccessible in the middle
of writing them, which destroyed my Plasma desktop configuration and
Chrome profile (I restored them from the last snapper snapshot
successfully). Essentially, the file system was in a readonly-like
state: most requests failed with IO errors even though btrfs didn't
switch to read-only. Something messed up in the error path of
userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the
device in a limbo between not existing and not working - it still
tried to access it while bcache claimed the backend device was
missing. To me this looks like bcache error handling may need some
fine tuning - it should not fail in that way, especially not with
btrfs-raid, but still the system was seeing IO errors and broken files
in the middle of writes.

"bcache show" showed the backend device missing while "btrfs dev show"
was still seeing the attached bcache device, and the system threw IO
errors to user-space despite btrfs still having a valid copy of the
blocks.

I've rebooted and now switched the bad device from bcache writeback to
bcache none - and guess what: The system runs stable now, btrfs
auto-repair does its thing. The above-mentioned behavior (IO errors in
user-space) no longer occurs. A final scrub across the bad devices
repaired the bad blocks; I currently do not see any more problems.

It's probably better to replace that device, but this also shows that
switching bcache to "none" (if the backing device fails), or at least
to "writethrough", may be a better choice than doing some other error
handling. Or bcache should have been able to make btrfs see the device
as missing (which obviously did not happen).

Of course, if the cache device fails we have a completely different
situation. I'm not sure which situation Eric was seeing (I think the
caching device failed) but for me, the backing device failed - and
with bcache involved, the result was very unexpected.

So we probably need at least two error handlers: one for caching
device errors and one for backing device errors (for which bcache
doesn't currently seem to have a setting).

Except for the strange IO errors and resulting incomplete writes (and
I really don't know why that happened), btrfs survived this perfectly
well - and somehow bcache did a good enough job. This has been
different in the past. So this is already a great achievement. Thank
you.

BTW: This probably only worked for me because I split btrfs metadata
and data to different devices
(https://github.com/kakra/linux/pull/26), and metadata does not pass
through bcache at all but goes natively to SSD. Otherwise I fear btrfs
may have seen partial metadata writes on different RAID members.

Regards,
Kai


Am Di., 12. Sept.
2023 um 22:02 Uhr schrieb Eric Wheeler <lists@bcache.ewheeler.net>: > > On Tue, 12 Sep 2023, 邹明哲 wrote: > > From: Eric Wheeler <lists@bcache.ewheeler.net> > > Date: 2023-09-07 08:42:41 > > To: Coly Li <colyli@suse.de> > > Cc: Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn> > > Subject: Re: Dirty data loss after cache disk error recovery > > >+Mingzhe, Coly: please comment on the proposed fix below when you have a > > >moment: > > > > Hi, Eric > > > > This is an old issue, and it took me a long time to understand what > > happened. > > > > > > > >On Thu, 7 Sep 2023, Kai Krakow wrote: > > >> Wow! > > >> > > >> I call that a necro-bump... ;-) > > >> > > >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler > > >> <lists@bcache.ewheeler.net>: > > >> > > > >> > On Fri, 7 May 2021, Kai Krakow wrote: > > >> > > > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When > > >> > > > the cache device is disconnected, it is always risky that some caching > > >> > > > data or meta data is not updated onto cache device. Permit the cache > > >> > > > device to be re-attached to the backing device may introduce "silent > > >> > > > data loss" which might be worse.... It was the reason why I didn't add > > >> > > > new error action for the device failure handling patch set. 
> > >> > > > > >> > > But we are actually now seeing silent data loss: The system f'ed up > > >> > > somehow, needed a hard reset, and after reboot the bcache device was > > >> > > accessible in cache mode "none" (because they have been unregistered > > >> > > before, and because udev just detected it and you can use bcache > > >> > > without an attached cache in "none" mode), completely hiding the fact > > >> > > that we lost dirty write-back data, it's even not quite obvious that > > >> > > /dev/bcache0 now is detached, cache mode none, but accessible > > >> > > nevertheless. To me, this is quite clearly "silent data loss", > > >> > > especially since the unregister action threw the dirty data away. > > >> > > > > >> > > So this: > > >> > > > > >> > > > Permit the cache > > >> > > > device to be re-attached to the backing device may introduce "silent > > >> > > > data loss" which might be worse.... > > >> > > > > >> > > is actually the situation we are facing currently: Device has been > > >> > > unregistered, after reboot, udev detects it has clean backing device > > >> > > without cache association, using cache mode none, and it is readable > > >> > > and writable just fine: It essentially permitted access to the stale > > >> > > backing device (tho, it didn't re-attach as you outlined, but that's > > >> > > more or less the same situation). > > >> > > > > >> > > Maybe devices that become disassociated from a cache due to IO errors > > >> > > but have dirty data should go to a caching mode "stale", and bcache > > >> > > should refuse to access such devices or throw away their dirty data > > >> > > until I decide to force them back online into the cache set or force > > >> > > discard the dirty data. Then at least I would discover that something > > >> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't > > >> > > written. 
In the best case, that makes my FS unmountable, in the worst > > >> > > case, some file data is simply lost (aka silent data loss), besides > > >> > > both situations are the worst-case scenario anyways. > > >> > > > > >> > > The whole situation probably comes from udev auto-registering bcache > > >> > > backing devices again, and bcache has no record of why the device was > > >> > > unregistered - it looks clean after such a situation. > > >> > > >> [...] > > >> > > >> > I think we hit this same issue from 2021. Here is that original thread from 2021: > > >> > https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de > > >> > > > >> > Kai, did you end up with a good patch for this? We are running a 5.15 > > >> > kernel with the many backported bcache commits that Coly suggested here: > > >> > https://www.spinics.net/lists/linux-bcache/msg12084.html > > >> > > >> I'm currently running 6.1 with bcache on mdraid1 and device-level > > >> write caching disabled. I didn't see this ever occur again. > > > > > >Awesome, good to know. > > > > > >> But as written above, I had bad RAM, and meanwhile upgraded to kernel > > >> 6.1, and had no issues since with bcache even on power loss. > > >> > > >> > Coly, is there already a patch to prevent complete dirty cache loss? > > >> > > >> This is probably still an issue. The cache attachment MUST NEVER EVER > > >> automatically degrade to "none" which it did for my fail-cases I had > > >> back then. I don't know if this has changed meanwhile. > > > > > >I would rather that bcache went to a read-only mode in failure > > >conditions like this. Maybe write-around would be acceptable since > > >bcache returns -EIO for any failed dirty cache reads. But if the cache > > >is dirty, and it gets an error, it _must_never_ read from the bdev, which > > >is what appears to happens now. > > > > > >Coly, Mingzhe, would this be an easy change? 
> > > > First of all, we have never had this problem. We have had an nvme > > controller failure, but at this time the cache cannot be read or > > written, so even unregister will not succeed. > > > > Coly once replied like this: > > > > """ > > There is an option to panic the system when cache device failed. It > > is in errors file with available options as "unregister" and "panic". > > This option is default set to "unregister", if you set it to "panic" > > then panic() will be called. > > """ > > > > I think "panic" is a better way to handle this situation. If cache > > returns an error, there may be more unknown errors if the operation > > continues. > > Depending on how the block devices are stacked, the OS can continue if > bcache fails (eg, bcache under raid1, drbd, etc). Returning IO requests > with -EIO or setting bcache read-only would be better, because a panic > would crash services that could otherwise proceed without noticing the > bcache outage. > > If bcache has a critical failure, I would rather that it fail the IOs so > upper-layers in the block stack can compensate. > > What if we extend /sys/fs/bcache/<uuid>/errors to include a "readonly" > option, and make that the default setting? The gendisk(s) for related > /dev/bcacheX devices can be flagged BLKROSET in the error handler: > https://patchwork.kernel.org/project/dm-devel/patch/20201129181926.897775-2-hch@lst.de/ > > This would protect the data and keep the host online. 
> > -- > Eric Wheeler > > > > > > > > > > >Here are the relevant bits: > > > > > >The allocator called btree_mergesort which called bch_extent_invalid: > > > https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480 > > > > > >Which called the `cache_bug` macro, which triggered bch_cache_set_error: > > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626 > > > > > >It then calls `bch_cache_set_unregister` which shuts down the cache: > > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845 > > > > > > bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...) > > > { > > > ... > > > bch_cache_set_unregister(c); > > > return true; > > > } > > > > > >Proposed solution: > > > > > >What if, instead of bch_cache_set_unregister() that this was called instead: > > > SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND) > > > > If cache_mode can be automatically modified, when will it be restored > > to writeback? I think we need to be able to enable or disable this. > > > > > > > >This would bypass the cache for future writes, and allow reads to > > >proceed if possible, and -EIO otherwise to let upper layers handle the > > >failure. > > > > > >What do you think? > > > > If we switch to writearound mode, how to ensure that the IO is read-only, > > because writing IO may require invalidating dirty data. If the backing > > write is successful but invalid fails, how should we handle it? > > > > Maybe "panic" could be the default option. What do you think? 
> > > > > > > >> But because bcache explicitly does not honor write-barriers from > > >> upstream writes for its own writeback (which is okay because it > > >> guarantees to write back all data anyways and give a consistent view to > > >> upstream FS - well, unless it has to handle write errors), the backed > > >> filesystem is guaranteed to be effed up in that case, and allowing it to > > >> mount and write because bcache silently has fallen back to "none" will > > >> only make the matter worse. > > >> > > >> (HINT: I never used brbd personally, most of the following is > > >> theoretical thinking without real-world experience) > > >> > > >> I see that you're using drbd? Did it fail due to networking issues? > > >> I'm pretty sure it should be robust in that case but maybe bcache > > >> cannot handle the situation? Does brbd have a write log to replay > > >> writes after network connection loss? It looks like it doesn't and > > >> thus bcache exploded. > > > > > >DRBD is _above_ bcache, not below it. In this case, DRBD hung because > > >bcache hung, not the other way around, so DRBD is not the issue here. > > >Here is our stack: > > > > > >bcache: > > > bdev: /dev/sda hardware RAID5 > > > cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1 > > > > > >And then bcache is stacked like so: > > > > > > bcache <- dm-thin <- DRBD <- dm-crypt <- KVM > > > | > > > v > > > [remote host] > > > > > >> Anyways, since your backing device seems to be on drbd, using metadata > > >> allocation hinting is probably no option. You could of course still use > > >> drbd with bcache for metadata hinted partitions, and then use > > >> writearound caching only for that. At least, in the fail-case, your > > >> btrfs won't be destroyed. But your data chunks may have unreadable files > > >> then. But it should be easy to select them and restore from backup > > >> individually. 
Btrfs is very robust for that fail case: if metadata is > > >> okay, data errors are properly detected and handled. If you're not using > > >> btrfs, all of this doesn't apply ofc. > > >> > > >> I'm not sure if write-back caching for drbd backing is a wise decision > > >> anyways. drbd is slow for writes, that's part of the design (and no > > >> writeback caching could fix that). > > > > > >Bcache-backed DRBD provides a noticable difference, especially with a > > >10GbE link (or faster) and the same disk stack on both sides. > > > > > >> I would not rely on bcache-writeback to fix that for you because it is > > >> not prepared for storage that may be temporarily not available > > > > > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware of > > >DRBD's existence. > > > > > >> iow, it would freeze and continue when drbd is available again. I think > > >> you should really use writearound/writethrough so your FS can be sure > > >> data has been written, replicated and persisted. In case of btrfs, you > > >> could still split data and metadata as written above, and use writeback > > >> for data, but reliable writes for metadata. > > >> > > >> So concluding: > > >> > > >> 1. I'm now persisting metadata directly to disk with no intermediate > > >> layers (no bcache, no md) > > >> > > >> 2. I'm using allocation-hinted data-only partitions with bcache > > >> write-back, with bcache on mdraid1. If anything goes wrong, I have > > >> file crc errors in btrfs files only, but the filesystem itself is > > >> valid because no metadata is broken or lost. I have snapshots of > > >> recently modified files. I have daily backups. > > >> > > >> 3. Your problem is that bcache can - by design - detect write errors > > >> only when it's too late with no chance telling the filesystem. In that > > >> case, writethrough/writearound is the correct choice. > > >> > > >> 4. 
Maybe bcache should know if backing is on storage that may be > > >> temporarily unavailable and then freeze until the backing storage is > > >> back online, similar to how iSCSI handles that. > > > > > >I don't think "temporarily unavailable" should be bcache's burden, as > > >bcache is a local-only solution. If someone is using iSCSI under bcache, > > >then good luck ;) > > > > > >> But otoh, maybe drbd should freeze until the replicated storage is > > >> available again while writing (from what I've read, it's designed to not > > >> do that but let local storage get ahead of the replica, which is btw > > >> incompatible with bcache-writeback assumptions). > > > > > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected > > >and has no local copy for some reason. If local storage is available, it > > >will use that and resync when its peer comes up. > > > > > >> Or maybe using async mirroring can fix this for you but then, the mirror > > >> will be compromised if a hardware failure immediately follows a previous > > >> drbd network connection loss. But, it may still be an issue with the > > >> local hardware (bit-flips etc) because maybe just bcache internals broke > > >> - Coly may have a better idea of that. > > > > > >This isn't DRBDs fault since it is above bcache. I wish only address the > > >the bcache cache=none issue. > > > > > >-Eric > > > > > >> > > >> I think your main issue here is that bcache decouples writebarriers > > >> from the underlying backing storage - and you should just not use > > >> writeback, it is incompatible by design with how drbd works: your > > >> replica will be broken when you need it. 
> > > > > > > > >> > > >> > > >> > Here is our trace: > > >> > > > >> > [Sep 6 13:01] bcache: bch_cache_set_error() error on > > >> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent > > >> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, > > >> > 0:1163806048 gen 3: bad, length too big, disabling caching > > >> > > >> > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7 > > >> > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021 > > >> > [ +0.000809] Workqueue: bcache bch_data_insert_keys > > >> > [ +0.000826] Call Trace: > > >> > [ +0.000797] <TASK> > > >> > [ +0.000006] dump_stack_lvl+0x57/0x7e > > >> > [ +0.000755] bch_extent_invalid.cold+0x9/0x10 > > >> > [ +0.000759] btree_mergesort+0x27e/0x36e > > >> > [ +0.000005] ? bch_cache_allocator_start+0x50/0x50 > > >> > [ +0.000009] __btree_sort+0xa4/0x1e9 > > >> > [ +0.000109] bch_btree_sort_partial+0xbc/0x14d > > >> > [ +0.000836] bch_btree_init_next+0x39/0xb6 > > >> > [ +0.000004] bch_btree_insert_node+0x26e/0x2d3 > > >> > [ +0.000863] btree_insert_fn+0x20/0x48 > > >> > [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7 > > >> > [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > >> > [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb > > >> > [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > >> > [ +0.000848] bch_btree_insert+0x102/0x188 > > >> > [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf > > >> > [ +0.000857] bch_data_insert_keys+0x39/0xde > > >> > [ +0.000845] process_one_work+0x280/0x5cf > > >> > [ +0.000858] worker_thread+0x52/0x3bd > > >> > [ +0.000851] ? process_one_work.cold+0x52/0x51 > > >> > [ +0.000877] kthread+0x13e/0x15b > > >> > [ +0.000858] ? set_kthread_struct+0x60/0x52 > > >> > [ +0.000855] ret_from_fork+0x22/0x2d > > >> > [ +0.000854] </TASK> > > >> > > >> > > >> Regards, > > >> Kai > > >> > > > > > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
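A recurring complaint in the message above is that a bcache device can end up detached or in cache mode "none" without anyone noticing. The following is a minimal monitoring sketch in Python, assuming the sysfs layout described in the kernel's bcache documentation: a `state` file per backing device (values "no cache", "clean", "dirty", "inconsistent") and a `cache_mode` file that lists all modes with the active one in square brackets. The function names `active_choice` and `bcache_warnings` are hypothetical helpers invented here, not part of any bcache tool.

```python
from pathlib import Path
from typing import List


def active_choice(text: str) -> str:
    """Return the bracketed entry of a sysfs choice list,
    e.g. 'writeback' for 'writethrough [writeback] writearound none'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active entry in {text!r}")


def bcache_warnings(bcache_dir: Path, expect_cached: bool = True) -> List[str]:
    """Collect warnings for one backing device directory,
    for example /sys/block/bcache0/bcache."""
    warnings = []
    state = (bcache_dir / "state").read_text().strip()
    mode = active_choice((bcache_dir / "cache_mode").read_text())
    if expect_cached and state == "no cache":
        warnings.append("backing device is not attached to a cache set")
    if expect_cached and mode == "none":
        warnings.append("cache mode is 'none'")
    if state == "inconsistent":
        warnings.append("backing device state is 'inconsistent'")
    return warnings


if __name__ == "__main__":
    # Assumption: bcache devices appear as /sys/block/bcache<N>/bcache.
    for d in sorted(Path("/sys/block").glob("bcache*/bcache")):
        for w in bcache_warnings(d):
            print(f"{d.parent.name}: {w}")
```

Run from cron or a monitoring agent, a check like this would at least turn the silent fallback into a visible alert. It cannot recover dirty data that was already dropped.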
* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-11 16:19 ` Kai Krakow
@ 2023-10-16 23:39 ` Eric Wheeler
  2023-10-17 0:33 ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wheeler @ 2023-10-16 23:39 UTC (permalink / raw)
  To: Kai Krakow
  Cc: 邹明哲, Coly Li, linux-bcache, 吴本卿(云桌面 福州)

[-- Attachment #1: Type: text/plain, Size: 20608 bytes --]

On Wed, 11 Oct 2023, Kai Krakow wrote:
> I've now faced a similar issue where one of my HDDs spontaneously
> decided to have a series of bad blocks. It looks like it has 26145
> failed writes due to how bcache handles writeback. It had 5275 failed
> reads with btrfs loudly complaining about it. The system also became
> really slow to respond until it eventually froze.
>
> After a reboot it worked again but of course there were still bad
> blocks because bcache did writeback, so no blocks have been replaced
> with btrfs auto-repair on read feature. This time, the system handled
> the situation a bit better but files became inaccessible in the middle
> of writing them which destroyed my Plasma desktop configuration and
> Chrome profile (I restored them from the last snapper snapshot
> successfully). Essentially, the file system was in a readonly-like
> state: most requests failed with IO errors despite the btrfs didn't
> switch to read-only. Something messed up in the error path of
> userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the

Do you mean userspace -> btrfs -> bcache -> device?

> device somewhere in the limbo of not existing and not working - it
> still tried to access it while bcache claimed the backend device would
> be missing. To me this looks like bcache error handling may need some
> fine tuning - it should not fail in that way, especially not with
> btrfs-raid, but still the system was seeing IO errors and broken files
> in the middle of writes.
>
> "bcache show" showed the backend device missing while "btrfs dev show"
> was still seeing the attached bcache device, and the system threw IO
> errors to user-space despite btrfs still having a valid copy of the
> blocks.
>
> I've rebooted and now switched the bad device from bcache writeback to
> bcache none - and guess what: The system runs stable now, btrfs
> auto-repair does its thing. The above mentioned behavior does not
> occur (IO errors in user-space). A final scrub across the bad devices
> repaired the bad blocks, I currently do not see any more problems.
>
> It's probably better to replace that device but this also shows that
> switching bcache to "none" (if the backing device fails) or "write
> through" at least may be a better choice than doing some other error
> handling. Or bcache should have been able to make btrfs see the device
> as missing (which obviously did not happen).

Noted. Did bcache actually detach its cache in the failure scenario
you describe?

> Of course, if the cache device fails we have a completely different
> situation. I'm not sure which situation Eric was seeing (I think the
> caching device failed) but for me, the backing device failed - and
> with bcache involved, the result was very unexpected.

Ahh, so you are saying the cache continued to service requests even
though the bdev was offline? Was the bdev completely "unplugged" or
was it just having IO errors?

> So we probably need at least two error handlers: Handling caching
> device errors, and handling backing device errors (for which bcache
> doesn't currently seem to have a setting).

I think it tries to write to the cache if the bdev dies. Dirty or
cached blocks are read from the cache, and other IOs are passed to the
bdev, which may end up returning an EIO. Coly, is this correct?
-Eric > Except for the strange IO errors and resulting incomplete writes (and > I really don't know why that happened), btrfs survived this perfectly > well - and somehow bcache did a good enough job. This has been > different in the past. So this is already a great achievement. Thank > you. > > BTW: This probably only worked for me because I split btrfs metadata > and data to different devices > (https://github.com/kakra/linux/pull/26), and metadata does not pass > through bcache at all but natively to SSD. Otherwise I fear btrfs may > have seen partial metadata writes on different RAID members. > > Regards, > Kai > > > Am Di., 12. Sept. 2023 um 22:02 Uhr schrieb Eric Wheeler > <lists@bcache.ewheeler.net>: > > > > On Tue, 12 Sep 2023, 邹明哲 wrote: > > > From: Eric Wheeler <lists@bcache.ewheeler.net> > > > Date: 2023-09-07 08:42:41 > > > To: Coly Li <colyli@suse.de> > > > Cc: Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn> > > > Subject: Re: Dirty data loss after cache disk error recovery > > > >+Mingzhe, Coly: please comment on the proposed fix below when you have a > > > >moment: > > > > > > Hi, Eric > > > > > > This is an old issue, and it took me a long time to understand what > > > happened. > > > > > > > > > > >On Thu, 7 Sep 2023, Kai Krakow wrote: > > > >> Wow! > > > >> > > > >> I call that a necro-bump... ;-) > > > >> > > > >> Am Mi., 6. Sept. 2023 um 22:33 Uhr schrieb Eric Wheeler > > > >> <lists@bcache.ewheeler.net>: > > > >> > > > > >> > On Fri, 7 May 2021, Kai Krakow wrote: > > > >> > > > > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When > > > >> > > > the cache device is disconnected, it is always risky that some caching > > > >> > > > data or meta data is not updated onto cache device. 
Permit the cache > > > >> > > > device to be re-attached to the backing device may introduce "silent > > > >> > > > data loss" which might be worse.... It was the reason why I didn't add > > > >> > > > new error action for the device failure handling patch set. > > > >> > > > > > >> > > But we are actually now seeing silent data loss: The system f'ed up > > > >> > > somehow, needed a hard reset, and after reboot the bcache device was > > > >> > > accessible in cache mode "none" (because they have been unregistered > > > >> > > before, and because udev just detected it and you can use bcache > > > >> > > without an attached cache in "none" mode), completely hiding the fact > > > >> > > that we lost dirty write-back data, it's even not quite obvious that > > > >> > > /dev/bcache0 now is detached, cache mode none, but accessible > > > >> > > nevertheless. To me, this is quite clearly "silent data loss", > > > >> > > especially since the unregister action threw the dirty data away. > > > >> > > > > > >> > > So this: > > > >> > > > > > >> > > > Permit the cache > > > >> > > > device to be re-attached to the backing device may introduce "silent > > > >> > > > data loss" which might be worse.... > > > >> > > > > > >> > > is actually the situation we are facing currently: Device has been > > > >> > > unregistered, after reboot, udev detects it has clean backing device > > > >> > > without cache association, using cache mode none, and it is readable > > > >> > > and writable just fine: It essentially permitted access to the stale > > > >> > > backing device (tho, it didn't re-attach as you outlined, but that's > > > >> > > more or less the same situation). 
> > > >> > > > > > >> > > Maybe devices that become disassociated from a cache due to IO errors > > > >> > > but have dirty data should go to a caching mode "stale", and bcache > > > >> > > should refuse to access such devices or throw away their dirty data > > > >> > > until I decide to force them back online into the cache set or force > > > >> > > discard the dirty data. Then at least I would discover that something > > > >> > > went badly wrong. Otherwise, I may not detect that dirty data wasn't > > > >> > > written. In the best case, that makes my FS unmountable, in the worst > > > >> > > case, some file data is simply lost (aka silent data loss), besides > > > >> > > both situations are the worst-case scenario anyways. > > > >> > > > > > >> > > The whole situation probably comes from udev auto-registering bcache > > > >> > > backing devices again, and bcache has no record of why the device was > > > >> > > unregistered - it looks clean after such a situation. > > > >> > > > >> [...] > > > >> > > > >> > I think we hit this same issue from 2021. Here is that original thread from 2021: > > > >> > https://lore.kernel.org/all/2662a21d-8f12-186a-e632-964ac7bae72d@suse.de/T/#m5a6cc34a043ecedaeb9469ec9d218e084ffec0de > > > >> > > > > >> > Kai, did you end up with a good patch for this? We are running a 5.15 > > > >> > kernel with the many backported bcache commits that Coly suggested here: > > > >> > https://www.spinics.net/lists/linux-bcache/msg12084.html > > > >> > > > >> I'm currently running 6.1 with bcache on mdraid1 and device-level > > > >> write caching disabled. I didn't see this ever occur again. > > > > > > > >Awesome, good to know. > > > > > > > >> But as written above, I had bad RAM, and meanwhile upgraded to kernel > > > >> 6.1, and had no issues since with bcache even on power loss. > > > >> > > > >> > Coly, is there already a patch to prevent complete dirty cache loss? > > > >> > > > >> This is probably still an issue. 
The cache attachment MUST NEVER EVER > > > >> automatically degrade to "none" which it did for my fail-cases I had > > > >> back then. I don't know if this has changed meanwhile. > > > > > > > >I would rather that bcache went to a read-only mode in failure > > > >conditions like this. Maybe write-around would be acceptable since > > > >bcache returns -EIO for any failed dirty cache reads. But if the cache > > > >is dirty, and it gets an error, it _must_never_ read from the bdev, which > > > >is what appears to happens now. > > > > > > > >Coly, Mingzhe, would this be an easy change? > > > > > > First of all, we have never had this problem. We have had an nvme > > > controller failure, but at this time the cache cannot be read or > > > written, so even unregister will not succeed. > > > > > > Coly once replied like this: > > > > > > """ > > > There is an option to panic the system when cache device failed. It > > > is in errors file with available options as "unregister" and "panic". > > > This option is default set to "unregister", if you set it to "panic" > > > then panic() will be called. > > > """ > > > > > > I think "panic" is a better way to handle this situation. If cache > > > returns an error, there may be more unknown errors if the operation > > > continues. > > > > Depending on how the block devices are stacked, the OS can continue if > > bcache fails (eg, bcache under raid1, drbd, etc). Returning IO requests > > with -EIO or setting bcache read-only would be better, because a panic > > would crash services that could otherwise proceed without noticing the > > bcache outage. > > > > If bcache has a critical failure, I would rather that it fail the IOs so > > upper-layers in the block stack can compensate. > > > > What if we extend /sys/fs/bcache/<uuid>/errors to include a "readonly" > > option, and make that the default setting? 
The gendisk(s) for related > > /dev/bcacheX devices can be flagged BLKROSET in the error handler: > > https://patchwork.kernel.org/project/dm-devel/patch/20201129181926.897775-2-hch@lst.de/ > > > > This would protect the data and keep the host online. > > > > -- > > Eric Wheeler > > > > > > > > > > > > > > > > >Here are the relevant bits: > > > > > > > >The allocator called btree_mergesort which called bch_extent_invalid: > > > > https://elixir.bootlin.com/linux/latest/source/drivers/md/bcache/extents.c#L480 > > > > > > > >Which called the `cache_bug` macro, which triggered bch_cache_set_error: > > > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1626 > > > > > > > >It then calls `bch_cache_set_unregister` which shuts down the cache: > > > > https://elixir.bootlin.com/linux/v6.5/source/drivers/md/bcache/super.c#L1845 > > > > > > > > bool bch_cache_set_error(struct cache_set *c, const char *fmt, ...) > > > > { > > > > ... > > > > bch_cache_set_unregister(c); > > > > return true; > > > > } > > > > > > > >Proposed solution: > > > > > > > >What if, instead of bch_cache_set_unregister() that this was called instead: > > > > SET_BDEV_CACHE_MODE(&c->cache->sb, CACHE_MODE_WRITEAROUND) > > > > > > If cache_mode can be automatically modified, when will it be restored > > > to writeback? I think we need to be able to enable or disable this. > > > > > > > > > > >This would bypass the cache for future writes, and allow reads to > > > >proceed if possible, and -EIO otherwise to let upper layers handle the > > > >failure. > > > > > > > >What do you think? > > > > > > If we switch to writearound mode, how to ensure that the IO is read-only, > > > because writing IO may require invalidating dirty data. If the backing > > > write is successful but invalid fails, how should we handle it? > > > > > > Maybe "panic" could be the default option. What do you think? 
> > > > > > > > > > >> But because bcache explicitly does not honor write-barriers from > > > >> upstream writes for its own writeback (which is okay because it > > > >> guarantees to write back all data anyways and give a consistent view to > > > >> upstream FS - well, unless it has to handle write errors), the backed > > > >> filesystem is guaranteed to be effed up in that case, and allowing it to > > > >> mount and write because bcache silently has fallen back to "none" will > > > >> only make the matter worse. > > > >> > > > >> (HINT: I never used brbd personally, most of the following is > > > >> theoretical thinking without real-world experience) > > > >> > > > >> I see that you're using drbd? Did it fail due to networking issues? > > > >> I'm pretty sure it should be robust in that case but maybe bcache > > > >> cannot handle the situation? Does brbd have a write log to replay > > > >> writes after network connection loss? It looks like it doesn't and > > > >> thus bcache exploded. > > > > > > > >DRBD is _above_ bcache, not below it. In this case, DRBD hung because > > > >bcache hung, not the other way around, so DRBD is not the issue here. > > > >Here is our stack: > > > > > > > >bcache: > > > > bdev: /dev/sda hardware RAID5 > > > > cachedev: LVM volume from /dev/md0, which is /dev/nvme{0,1} RAID1 > > > > > > > >And then bcache is stacked like so: > > > > > > > > bcache <- dm-thin <- DRBD <- dm-crypt <- KVM > > > > | > > > > v > > > > [remote host] > > > > > > > >> Anyways, since your backing device seems to be on drbd, using metadata > > > >> allocation hinting is probably no option. You could of course still use > > > >> drbd with bcache for metadata hinted partitions, and then use > > > >> writearound caching only for that. At least, in the fail-case, your > > > >> btrfs won't be destroyed. But your data chunks may have unreadable files > > > >> then. But it should be easy to select them and restore from backup > > > >> individually. 
Btrfs is very robust for that fail case: if metadata is > > > >> okay, data errors are properly detected and handled. If you're not using > > > >> btrfs, all of this doesn't apply ofc. > > > >> > > > >> I'm not sure if write-back caching for drbd backing is a wise decision > > > >> anyways. drbd is slow for writes, that's part of the design (and no > > > >> writeback caching could fix that). > > > > > > > >Bcache-backed DRBD provides a noticable difference, especially with a > > > >10GbE link (or faster) and the same disk stack on both sides. > > > > > > > >> I would not rely on bcache-writeback to fix that for you because it is > > > >> not prepared for storage that may be temporarily not available > > > > > > > >True, which is why we put drbd /on top/ of bcache, so bcache is unaware of > > > >DRBD's existence. > > > > > > > >> iow, it would freeze and continue when drbd is available again. I think > > > >> you should really use writearound/writethrough so your FS can be sure > > > >> data has been written, replicated and persisted. In case of btrfs, you > > > >> could still split data and metadata as written above, and use writeback > > > >> for data, but reliable writes for metadata. > > > >> > > > >> So concluding: > > > >> > > > >> 1. I'm now persisting metadata directly to disk with no intermediate > > > >> layers (no bcache, no md) > > > >> > > > >> 2. I'm using allocation-hinted data-only partitions with bcache > > > >> write-back, with bcache on mdraid1. If anything goes wrong, I have > > > >> file crc errors in btrfs files only, but the filesystem itself is > > > >> valid because no metadata is broken or lost. I have snapshots of > > > >> recently modified files. I have daily backups. > > > >> > > > >> 3. Your problem is that bcache can - by design - detect write errors > > > >> only when it's too late with no chance telling the filesystem. In that > > > >> case, writethrough/writearound is the correct choice. > > > >> > > > >> 4. 
Maybe bcache should know if backing is on storage that may be > > > >> temporarily unavailable and then freeze until the backing storage is > > > >> back online, similar to how iSCSI handles that. > > > > > > > >I don't think "temporarily unavailable" should be bcache's burden, as > > > >bcache is a local-only solution. If someone is using iSCSI under bcache, > > > >then good luck ;) > > > > > > > >> But otoh, maybe drbd should freeze until the replicated storage is > > > >> available again while writing (from what I've read, it's designed to not > > > >> do that but let local storage get ahead of the replica, which is btw > > > >> incompatible with bcache-writeback assumptions). > > > > > > > >N/A for this thread, but FYI: DRBD will wait (hang) if it is disconnected > > > >and has no local copy for some reason. If local storage is available, it > > > >will use that and resync when its peer comes up. > > > > > > > >> Or maybe using async mirroring can fix this for you but then, the mirror > > > >> will be compromised if a hardware failure immediately follows a previous > > > >> drbd network connection loss. But, it may still be an issue with the > > > >> local hardware (bit-flips etc) because maybe just bcache internals broke > > > >> - Coly may have a better idea of that. > > > > > > > >This isn't DRBDs fault since it is above bcache. I wish only address the > > > >the bcache cache=none issue. > > > > > > > >-Eric > > > > > > > >> > > > >> I think your main issue here is that bcache decouples writebarriers > > > >> from the underlying backing storage - and you should just not use > > > >> writeback, it is incompatible by design with how drbd works: your > > > >> replica will be broken when you need it. 
> > > > > > > > > > > >> > > > >> > > > >> > Here is our trace: > > > >> > > > > >> > [Sep 6 13:01] bcache: bch_cache_set_error() error on > > > >> > a3292185-39ff-4f67-bec7-0f738d3cc28a: spotted extent > > > >> > 829560:7447835265109722923 len 26330 -> [0:112365451 gen 48, > > > >> > 0:1163806048 gen 3: bad, length too big, disabling caching > > > >> > > > >> > [ +0.001940] CPU: 12 PID: 2435752 Comm: kworker/12:0 Kdump: loaded Not tainted 5.15.0-7.86.6.1.el9uek.x86_64-TEST+ #7 > > > >> > [ +0.000301] Hardware name: Supermicro Super Server/H11SSL-i, BIOS 2.4 12/27/2021 > > > >> > [ +0.000809] Workqueue: bcache bch_data_insert_keys > > > >> > [ +0.000826] Call Trace: > > > >> > [ +0.000797] <TASK> > > > >> > [ +0.000006] dump_stack_lvl+0x57/0x7e > > > >> > [ +0.000755] bch_extent_invalid.cold+0x9/0x10 > > > >> > [ +0.000759] btree_mergesort+0x27e/0x36e > > > >> > [ +0.000005] ? bch_cache_allocator_start+0x50/0x50 > > > >> > [ +0.000009] __btree_sort+0xa4/0x1e9 > > > >> > [ +0.000109] bch_btree_sort_partial+0xbc/0x14d > > > >> > [ +0.000836] bch_btree_init_next+0x39/0xb6 > > > >> > [ +0.000004] bch_btree_insert_node+0x26e/0x2d3 > > > >> > [ +0.000863] btree_insert_fn+0x20/0x48 > > > >> > [ +0.000864] bch_btree_map_nodes_recurse+0x111/0x1a7 > > > >> > [ +0.004270] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > >> > [ +0.000850] __bch_btree_map_nodes+0x1e0/0x1fb > > > >> > [ +0.000858] ? bch_btree_insert_check_key+0x1f0/0x1e1 > > > >> > [ +0.000848] bch_btree_insert+0x102/0x188 > > > >> > [ +0.000844] ? do_wait_intr_irq+0xb0/0xaf > > > >> > [ +0.000857] bch_data_insert_keys+0x39/0xde > > > >> > [ +0.000845] process_one_work+0x280/0x5cf > > > >> > [ +0.000858] worker_thread+0x52/0x3bd > > > >> > [ +0.000851] ? process_one_work.cold+0x52/0x51 > > > >> > [ +0.000877] kthread+0x13e/0x15b > > > >> > [ +0.000858] ? 
set_kthread_struct+0x60/0x52 > > > >> > [ +0.000855] ret_from_fork+0x22/0x2d > > > >> > [ +0.000854] </TASK> > > > >> > > > >> > > > >> Regards, > > > >> Kai > > > >> > > > > > > > > > > > > > > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
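The `errors` option quoted in the message above can be inspected and changed from user space. The following is a minimal shell sketch, assuming the attribute is exposed at /sys/fs/bcache/<uuid>/errors as described in the thread. `SYSFS_ROOT` and both function names are hypothetical helpers introduced for illustration; they are not part of bcache. The sketch accepts only "unregister" and "panic", the two values named in the thread; the "readonly" action is only a proposal in this discussion, so it is rejected here.

```shell
# Illustrative sketch only: SYSFS_ROOT and the function names are assumptions,
# not part of bcache. SYSFS_ROOT exists so the helpers can be pointed at a
# scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the contents of the "errors" attribute of the cache set with UUID $1.
bcache_errors_show() {
    cat "$SYSFS_ROOT/fs/bcache/$1/errors"
}

# Set the error action ($2) for cache set $1.
# Only the two actions named in the thread are accepted.
bcache_errors_set() {
    case "$2" in
        unregister|panic)
            printf '%s\n' "$2" > "$SYSFS_ROOT/fs/bcache/$1/errors"
            ;;
        *)
            echo "unsupported error action: $2" >&2
            return 1
            ;;
    esac
}
```

On a real system this would be run as root, for example `bcache_errors_set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 panic`, using the cache set UUID from the original report. Whether "panic" is the right policy is exactly what the thread is debating, so treat this as a way to try the existing options, not as a recommendation.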
* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-16 23:39 ` Eric Wheeler
@ 2023-10-17  0:33   ` Kai Krakow
  2023-10-17  0:39     ` Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2023-10-17  0:33 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲, Coly Li, linux-bcache, 吴本卿(云桌面 福州)

On Tue, 17 Oct 2023 at 01:39, Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>
> On Wed, 11 Oct 2023, Kai Krakow wrote:
> > After a reboot it worked again but of course there were still bad
> > blocks because bcache did writeback, so no blocks have been replaced
> > with btrfs auto-repair on read feature. This time, the system handled
> > the situation a bit better but files became inaccessible in the middle
> > of writing them which destroyed my Plasma desktop configuration and
> > Chrome profile (I restored them from the last snapper snapshot
> > successfully). Essentially, the file system was in a readonly-like
> > state: most requests failed with IO errors despite the btrfs didn't
> > switch to read-only. Something messed up in the error path of
> > userspace -> bcache -> btrfs -> device. Also, btrfs was seeing the
>
> Do you mean userspace -> btrfs -> bcache -> device

Ehm.. Yes...

> > device somewhere in the limbo of not existing and not working - it
> > still tried to access it while bcache claimed the backend device would
> > be missing. To me this looks like bcache error handling may need some
> > fine tuning - it should not fail in that way, especially not with
> > btrfs-raid, but still the system was seeing IO errors and broken files
> > in the middle of writes.
> >
> > "bcache show" showed the backend device missing while "btrfs dev show"
> > was still seeing the attached bcache device, and the system threw IO
> > errors to user-space despite btrfs still having a valid copy of the
> > blocks.
> >
> > I've rebooted and now switched the bad device from bcache writeback to
> > bcache none - and guess what: The system runs stable now, btrfs
> > auto-repair does its thing. The above mentioned behavior does not
> > occur (IO errors in user-space). A final scrub across the bad devices
> > repaired the bad blocks, I currently do not see any more problems.
> >
> > It's probably better to replace that device but this also shows that
> > switching bcache to "none" (if the backing device fails) or "write
> > through" at least may be a better choice than doing some other error
> > handling. Or bcache should have been able to make btrfs see the device
> > as missing (which obviously did not happen).
>
> Noted. Did bcache actually detach its cache in the failure scenario
> you describe?

It still seemed to be attached, but the bcache CLI tool marked it as
"missing".

> > Of course, if the cache device fails we have a completely different
> > situation. I'm not sure which situation Eric was seeing (I think the
> > caching device failed) but for me, the backing device failed - and
> > with bcache involved, the result was very unexpected.
>
> Ahh, so you are saying the cache continued to service requests even though
> the bdev was offline? Was the bdev completely "unplugged" or was it just
> having IO errors?

smartctl was still seeing the device, so I think it "just" had IO errors.

> > So we probably need at least two error handlers: Handling caching
> > device errors, and handling backing device errors (for which bcache
> > doesn't currently seem to have a setting).
>
> I think it tries to write to the cache if the bdev dies. Dirty or cached
> blocks are read from cache and other IOs are passed to bdev which may
> end up returning an EIO.

Hmm, yes that makes sense... But it seems to confuse user-space a lot.

Except that in writeback mode, it won't (and cannot) return errors to
user-space although writes eventually fail later and data does not
persist. So it may be better to turn writeback off as soon as bdev IO
errors are found, or trigger an immediate writeback by temporarily
setting writeback_percent to 0. Usually, HDDs support self-healing -
which didn't work in this case because of delayed writeback. After I
switched to "none", it worked.

After some more experimenting, it looks like even "writethrough" may
lag behind and not bubble bdev IO errors back up to user-space (or that
was due to writeback_percent=0; the errors are gone now, so I can no
longer reproduce it). I would expect it to do exactly that, though. I
didn't test "writearound".

Also, it looks like a failed delayed write of dirty writeback data may
not be retried by bcache. Or at least, I needed to run "btrfs scrub"
with bcache mode "none" to make it work properly and let the HDD heal
itself. OTOH, the HDD probably didn't fail writes but reads (except
when the situation got completely messed up and even writes returned IO
errors, but maybe btrfs was involved there).

BTW: The failed HDD has run fine for a few days now, and I even switched
writeback on again. It properly healed itself. But still, it is time to
swap it sooner rather than later.

> Coly, is this correct?
>
> -Eric

Regards,
Kai
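The manual workaround described in the message above (leave writeback mode and set writeback_percent to 0 so dirty data is flushed to the backing device) can be sketched in shell. This is a minimal sketch, assuming the per-device attributes `cache_mode`, `writeback_percent`, `state` and `dirty_data` under /sys/block/<bcacheN>/bcache/. `SYSFS_ROOT` and the function names are hypothetical helpers introduced here, and "writethrough" is just one of the modes the message mentions.

```shell
# Illustrative sketch only: SYSFS_ROOT and the function names are assumptions,
# not part of bcache. SYSFS_ROOT exists so the helpers can be pointed at a
# scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# For bcache device $1 (e.g. bcache0): stop accepting new dirty data by
# switching to writethrough, and set writeback_percent to 0 as the message
# suggests so the remaining dirty data is written back without throttling.
bcache_stop_writeback() {
    bcache_dev_dir="$SYSFS_ROOT/block/$1/bcache"
    echo writethrough > "$bcache_dev_dir/cache_mode"
    echo 0 > "$bcache_dev_dir/writeback_percent"
}

# Print the device state and the amount of dirty data still in the cache,
# so the progress of the flush can be watched.
bcache_dirty_status() {
    bcache_dev_dir="$SYSFS_ROOT/block/$1/bcache"
    printf 'state=%s dirty_data=%s\n' \
        "$(cat "$bcache_dev_dir/state")" \
        "$(cat "$bcache_dev_dir/dirty_data")"
}
```

Run as root, `bcache_stop_writeback bcache0` followed by repeated calls to `bcache_dirty_status bcache0` shows whether the dirty data drains. This only mirrors what was done by hand in the thread; it does not address the open question of whether bcache retries failed writeback IO.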
* Re: Re: Dirty data loss after cache disk error recovery
  2023-10-17  0:33 ` Kai Krakow
@ 2023-10-17  0:39   ` Kai Krakow
  0 siblings, 0 replies; 17+ messages in thread
From: Kai Krakow @ 2023-10-17  0:39 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲, Coly Li, linux-bcache, 吴本卿(云桌面 福州)

Just another thought...

On Tue, 17 Oct 2023 at 02:33, Kai Krakow <kai@kaishome.de> wrote:

> Except that in writeback mode, it won't (and cannot) return errors to
> user-space although writes eventually fail later and data does not
> persist. So it may be better to turn writeback off as soon as bdev IO
> errors are found, or trigger an immediate writeback by temporarily
> setting writeback_percent to 0. Usually, HDDs support self-healing -
> which didn't work in this case because of delayed writeback. After I
> switched to "none", it worked.

In that light, it might be worth thinking about how bcache could be
used to encourage self-healing of HDDs:

1. If a read IO error occurs, bcache should start flushing dirty data,
   and maybe switch to "none" or "writethrough/writearound".

2. Cached bcache contents could be used to rewrite data in case a
   sector has become bad. But I think this needs the firmware to detect
   a read error on that sector first - which doesn't help us because
   then the data would not be in bcache in the first place.

3. How does bcache handle bdev write errors in general, and in the case
   of delayed writeback in particular?

Regards,
Kai
* Re: Re: Dirty data loss after cache disk error recovery
       [not found] ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
  2023-10-11 16:19   ` Kai Krakow
@ 2023-10-11 16:29   ` Kai Krakow
  1 sibling, 0 replies; 17+ messages in thread
From: Kai Krakow @ 2023-10-11 16:29 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: 邹明哲, Coly Li, linux-bcache, 吴本卿(云桌面 福州)

Eric, your "from" mail (lists@bcache.ewheeler.net) does not exist:

> DNS Error: DNS type 'mx' lookup of bcache.ewheeler.net responded with
> code NXDOMAIN Domain name not found: bcache.ewheeler.net

Or is something messed up on my side?

All others, please ignore. Doesn't add to the conversation.

Thanks. :-)

On Tue, 12 Sep 2023 at 22:02, Eric Wheeler <lists@bcache.ewheeler.net> wrote:
>
> On Tue, 12 Sep 2023, 邹明哲 wrote:
> > From: Eric Wheeler <lists@bcache.ewheeler.net>
> > Date: 2023-09-07 08:42:41
> > To: Coly Li <colyli@suse.de>
> > Cc: Kai Krakow <kai@kaishome.de>,"linux-bcache@vger.kernel.org" <linux-bcache@vger.kernel.org>,"吴本卿(云桌面 福州)" <wubenqing@ruijie.com.cn>,Mingzhe Zou <mingzhe.zou@easystack.cn>
> > Subject: Re: Dirty data loss after cache disk error recovery
> > >+Mingzhe, Coly: please comment on the proposed fix below when you have a
> > >moment:
> >
> > Hi, Eric
> >
> > This is an old issue, and it took me a long time to understand what
> > happened.
> >
> > >
> > >On Thu, 7 Sep 2023, Kai Krakow wrote:
> > >> Wow!
> > >>
> > >> I call that a necro-bump... ;-)
> > >>
> > >> On Wed, 6 Sep 2023 at 22:33, Eric Wheeler
> > >> <lists@bcache.ewheeler.net> wrote:
> > >> >
> > >> > On Fri, 7 May 2021, Kai Krakow wrote:
> > >> >
> > >> > > > Adding a new "stop" error action IMHO doesn't make things better. When
> > >> > > > the cache device is disconnected, it is always risky that some caching
> > >> > > > data or meta data is not updated onto cache device. Permit the cache
> > >> > > > device to be re-attached to the backing device may introduce "silent
> > >> > > > data loss" which might be worse....
* Re: Dirty data loss after cache disk error recovery
  2021-04-28 18:39 ` Kai Krakow
  2021-04-28 18:51 ` Kai Krakow
@ 2021-05-07 12:13 ` Coly Li
  1 sibling, 0 replies; 17+ messages in thread
From: Coly Li @ 2021-05-07 12:13 UTC
To: Kai Krakow, 吴本卿(云桌面 福州)
Cc: linux-bcache

On 4/29/21 2:39 AM, Kai Krakow wrote:
> Hi Coly!
>
> On Wed, 28 Apr 2021 at 20:30, Kai Krakow <kai@kaishome.de> wrote:
>>
>> Hello!
>>
>> On Tue, 20 Apr 2021 at 05:24, 吴本卿(云桌面 福州)
>> <wubenqing@ruijie.com.cn> wrote:
>>>
>>> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
>>>
>>> I checked the log and found that logs:
>>> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>>
>> "stop it to avoid potential data corruption" is not what it actually
>> does: neither it stops it, nor it prevents corruption because dirty
>> data becomes thrown away.
>>
>>> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
>>> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered
>>>
>>> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
>>>
>>> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
>>> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
>>> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>>>
>>> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
>>> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
>>> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>>>
>>> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
>>> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.
>>
>> I think the same problem hit me, too, last night.
>>
>> My kernel choked because of a GPU error, and that somehow disconnected
>> the cache. I can only guess that there was some sort of timeout due to
>> blocked queues, and that introduced an IO error which detached the
>> caches.
>>
>> Sadly, I only realized this after I already reformatted and started
>> restore from backup: During the restore I watched the bcache status
>> and found that the devices are not attached.
>>
>> I don't know if I could have re-attached the devices instead of
>> formatting. But I think the dirty data would have been discarded
>> anyways due to incrementing bcache_device->id.
>>
>> This really needs a better solution, detaching is one of the worst,
>> especially on btrfs this has catastrophic consequences because data is
>> not updated inline but via copy on write. This requires updating a lot
>> of pointers. Usually, cow filesystem would be robust to this kind of
>> data-loss but the vast amount of dirty data that is lost puts the tree
>> generations too far behind of what btrfs is expecting, making it
>> essentially broken beyond repair. If some trees in the FS are just a
>> few generations behind, btrfs can repair itself by using a backup tree
>> root, but when the bcache is lost, generation numbers usually lag
>> behind several hundred generations. Detaching would be fine if there'd
>> be no dirty data - otherwise the device should probably stop and
>> refuse any more IO.
>>
>> @Coly If I patched the source to stop instead of detach, would it have
>> made anything better? Would there be any side-effects? Is it possible
>> to atomically check for dirty data in that case and take either the
>> one or the other action?
>
> I think this behavior was introduced by https://lwn.net/Articles/748226/
>
> So above is my late review. ;-)
>
> (around commit 7e027ca4b534b6b99a7c0471e13ba075ffa3f482 if you cannot
> access LWN for reasons[tm])
>

Hi Kai,

Sorry, I only just found this thread in my inbox. I hope it is not too
late. I have replied to your latest message in this thread.

Thanks.

Coly Li
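
The policy Kai asks about above (detach only when there is no dirty data, otherwise stop the device and refuse further IO) can be sketched as a small decision function. This is a toy model in Python, not kernel code: `Action` and `on_cache_set_failure` are hypothetical names, the "auto" mode comes from the log message quoted in this thread, and the "always" mode is an assumption about the second value that stop_when_cache_set_failed accepts. It models the behaviour the log message describes, which is not necessarily what the reported kernel actually did.

```python
from enum import Enum


class Action(Enum):
    DETACH = "detach"  # drop the cache, keep serving IO from the backing device
    STOP = "stop"      # tear down the bcacheN device and refuse further IO


def on_cache_set_failure(mode: str, has_dirty_data: bool) -> Action:
    """Hypothetical model of the stop-vs-detach decision discussed above."""
    if mode == "always":
        # Assumed mode: always stop, whether or not the cache is dirty.
        return Action.STOP
    if mode == "auto":
        # Dirty data exists only on the cache device, so detaching would
        # expose a stale backing device; stop instead.
        return Action.STOP if has_dirty_data else Action.DETACH
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for dirty in (False, True):
        print("auto", dirty, on_cache_set_failure("auto", dirty).value)
```

In this model the dirty-data check and the chosen action are a single step; the open question raised in the thread is whether the kernel can make that check atomically with respect to the failing cache set.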
* Re: Dirty data loss after cache disk error recovery
  2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
  2021-04-28 18:30 ` Kai Krakow
@ 2023-10-17  1:57 ` Coly Li
  1 sibling, 0 replies; 17+ messages in thread
From: Coly Li @ 2023-10-17 1:57 UTC
To: "吴本卿(云桌面 福州)"
Cc: linux-bcache

> On 20 Apr 2021, at 11:17, 吴本卿(云桌面 福州) <wubenqing@ruijie.com.cn> wrote:
>
> Hi, Recently I found a problem in the process of using bcache. My cache disk was offline for some reasons. When the cache disk was back online, I found that the backend in the detached state. I tried to attach the backend to the bcache again, and found that the dirty data was lost. The md5 value of the same file on backend's filesystem is different because dirty data loss.
>
> I checked the log and found that logs:
> [12228.642630] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
> [12228.644072] bcache: cached_dev_detach_finish() Caching disabled for sdb
> [12228.644352] bcache: cache_set_free() Cache set 55b9112d-d52b-4e15-aa93-e7d5ccfcac37 unregistered

When you report a bcache-related issue, please also provide the kernel
version and distribution. Some distributions don't support bcache, so
necessary fixes may be missing from the backports to older kernel
versions.

Thanks.

Coly Li

>
> I checked the code of bcache and found that a cache disk IO error will trigger __cache_set_unregister, which will cause the backend to be datach, which also causes the loss of dirty data. Because after the backend is reattached, the allocated bcache_device->id is incremented, and the bkey that points to the dirty data stores the old id.
>
> Is there a way to avoid this problem, such as providing users with options, if a cache disk error occurs, execute the stop process instead of detach.
> I tried to increase cache_set->io_error_limit, in order to win the time to execute stop cache_set.
> echo 4294967295 > /sys/fs/bcache/55b9112d-d52b-4e15-aa93-e7d5ccfcac37/io_error_limit
>
> It did not work at that time, because in addition to bch_count_io_errors, which calls bch_cache_set_error, there are other code paths that also call bch_cache_set_error. For example, an io error occurs in the journal:
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() bcache: error on 55b9112d-d52b-4e15-aa93-e7d5ccfcac37:
> Apr 19 05:50:18 localhost.localdomain kernel: journal io error
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: bch_cache_set_error() , disabling caching
> Apr 19 05:50:18 localhost.localdomain kernel: bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
>
> When an error occurs in the cache device, why is it designed to unregister the cache_set? What is the original intention? The unregister operation means that all backend relationships are deleted, which will result in the loss of dirty data.
> Is it possible to provide users with a choice to stop the cache_set instead of unregistering it.
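
The reason raising io_error_limit did not help, as described in the quoted report, can be illustrated with a toy model. This is a sketch in Python, not kernel code: `CacheSetModel` and its methods are hypothetical stand-ins for the paths named in the report (bch_count_io_errors and the journal error path, both ending in bch_cache_set_error), and the plain counter with a `>=` comparison is a simplification of whatever accounting the kernel really does.

```python
from typing import Optional


class CacheSetModel:
    """Toy model of two independent paths that mark a cache set as failed."""

    def __init__(self, io_error_limit: int) -> None:
        self.io_error_limit = io_error_limit
        self.io_errors = 0
        self.failure_reason: Optional[str] = None

    @property
    def failed(self) -> bool:
        return self.failure_reason is not None

    def cache_set_error(self, reason: str) -> None:
        # Stand-in for bch_cache_set_error(): the first error is kept.
        if self.failure_reason is None:
            self.failure_reason = reason

    def count_io_error(self) -> None:
        # Stand-in for bch_count_io_errors(): only this path consults the limit.
        self.io_errors += 1
        if self.io_errors >= self.io_error_limit:
            self.cache_set_error("too many IO errors")

    def journal_io_error(self) -> None:
        # Stand-in for the journal path: fails the set regardless of the limit.
        self.cache_set_error("journal io error")


if __name__ == "__main__":
    cs = CacheSetModel(io_error_limit=4294967295)
    for _ in range(1000):
        cs.count_io_error()
    print("after counted IO errors:", cs.failed)
    cs.journal_io_error()
    print("after journal error:", cs.failed, cs.failure_reason)
```

With the limit set as high as in the report, the counted path never trips in this model, but a single journal error still fails the cache set, which matches the "journal io error" lines in the quoted log.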
End of thread (newest message: 2023-10-17 1:57 UTC)

Thread overview: 17+ messages
2021-04-20  3:17 Dirty data loss after cache disk error recovery 吴本卿(云桌面 福州)
2021-04-28 18:30 ` Kai Krakow
2021-04-28 18:39 ` Kai Krakow
2021-04-28 18:51 ` Kai Krakow
2021-05-07 12:11 ` Coly Li
2021-05-07 14:56 ` Kai Krakow
[not found] ` <6ab4d6a-de99-6464-cb2-ad66d0918446@ewheeler.net>
2023-09-06 22:56 ` Kai Krakow
[not found] ` <7cadf9ff-b496-5567-9d60-f0af48122595@ewheeler.net>
2023-09-07 12:00 ` Kai Krakow
2023-09-07 19:10 ` Eric Wheeler
2023-09-12  6:54 ` 邹明哲
[not found] ` <f2fcf354-29ec-e2f7-b251-fb9b7d36f4@ewheeler.net>
2023-10-11 16:19 ` Kai Krakow
2023-10-16 23:39 ` Eric Wheeler
2023-10-17  0:33 ` Kai Krakow
2023-10-17  0:39 ` Kai Krakow
2023-10-11 16:29 ` Kai Krakow
2021-05-07 12:13 ` Coly Li
2023-10-17  1:57 ` Coly Li