linux-bcache.vger.kernel.org archive mirror
* I/O error on cache device can cause user observable errors
From: Arnaldo Montagner @ 2024-02-01 22:25 UTC
  To: linux-bcache

The bcache documentation says that errors on the cache device are
handled transparently.

I'm seeing a case where the cache device is unregistered in response
to repeated write errors (expected), but that results in a read error
on the bcache device (unexpected).

Here's how I'm reproducing the problem:
1. Create a device with dm-error to simulate I/O errors. The device is
1G in size and it will fail I/Os in a 4M extent starting at offset
128M:
    $ dmsetup create cache_disk << EOF
      0      262144    linear /dev/sdb 0
      262144 8192      error
      270336 1826816   linear /dev/sdb 270336
    EOF

2. Set up bcache in writethrough mode. The backing device is 1000G in length:
    $ make-bcache --cache /dev/mapper/cache_disk --bdev /dev/sdc \
          --wipe-bcache --bucket 256k
    $ echo writethrough > /sys/block/bcache0/bcache/cache_mode
    $ echo 0 > /sys/block/bcache0/bcache/cache/synchronous

    $ lsblk
    NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    ...
    sdb            8:16   0    10G  0 disk
    └─cache_disk 253:0    0     1G  0 dm
      └─bcache0  252:0    0  1000G  0 disk
    sdc            8:32   0  1000G  0 disk
    └─bcache0    252:0    0  1000G  0 disk

3. Start a random read workload on the bcache device (using fio):
    $ fio --name=basic --filename=/dev/bcache0 --size=1000G \
          --rw=randread --blocksize=256k --blockalign=256k

4. After a while I see that the cache device gets unregistered.
However, the application output indicates it saw an I/O error on a
read request:
    fio: io_u error on file /dev/bcache0: Input/output error: read offset=592264298496, buflen=262144
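
The sector counts in the dm table from step 1 can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming 512-byte sectors (the unit dm tables use); the helper name mib_to_sectors is made up for illustration. It only prints the table and does not call dmsetup.

```shell
#!/bin/sh
# Convert mebibytes to 512-byte sectors, the unit used in dm tables.
# (mib_to_sectors is a hypothetical helper, not part of any tool.)
mib_to_sectors() {
    echo $(( $1 * 1024 * 1024 / 512 ))
}

good_head=$(mib_to_sectors 128)   # linear region before the failing extent
bad_extent=$(mib_to_sectors 4)    # extent mapped to the "error" target
total=$(mib_to_sectors 1024)      # the whole 1G cache device
bad_end=$(( good_head + bad_extent ))
good_tail=$(( total - bad_end ))  # linear region after the failing extent

# Print the three table lines; /dev/sdb is the device used in step 1.
printf '%s %s linear /dev/sdb 0\n' 0 "$good_head"
printf '%s %s error\n' "$good_head" "$bad_extent"
printf '%s %s linear /dev/sdb %s\n' "$bad_end" "$good_tail" "$bad_end"
```

The printed values agree with the table in step 1: 262144, 8192 and 1826816 sectors, with the last segment starting at sector 270336.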

I can see in the syslogs that bcache unregistered the cache. The logs
also show that there was an I/O error on the bcache device:
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.176867] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.186494] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.195743] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.204869] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.234722] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.246102] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.274013] bcache: bch_count_io_errors() dm-0: IO error on writing data to cache.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.289128] bcache: bch_cache_set_error() error on 427201f5-5c86-4890-9866-f9860e518041: dm-0: too many IO errors writing data to cache
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.289128] , disabling caching
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.306212] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is clean, keep it alive.
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.306543] Buffer I/O error on dev bcache0, logical block 144595776, async page read
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.316119] bcache: cached_dev_detach_finish() Caching disabled for sdc
    Feb  1 19:47:23 armont-bcache-test kernel: [ 3327.316398] bcache: cache_set_free() Cache set 427201f5-5c86-4890-9866-f9860e518041 unregistered
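
The logical block in the "Buffer I/O error" line can be compared with the offset fio reported. This is a rough sketch, assuming that line counts 4096-byte blocks; the log does not state the block size, so that value is an assumption, and block_to_offset is a made-up helper name.

```shell
#!/bin/sh
# Byte offset of a logical block, given the block size in bytes.
# (block_to_offset is a hypothetical helper for this check.)
block_to_offset() {
    echo $(( $1 * $2 ))
}

fio_offset=592264298496    # from the fio error message
logical_block=144595776    # from the "Buffer I/O error" kernel line
block_size=4096            # assumption: not stated in the log

offset=$(block_to_offset "$logical_block" "$block_size")
if [ "$offset" -eq "$fio_offset" ]; then
    echo "block $logical_block maps to fio offset $fio_offset"
else
    echo "no match: block $logical_block maps to offset $offset"
fi
```

Under that assumption the two values agree, which suggests the kernel error and the fio error refer to the same read.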

The steps above reproduce the problem most of the time, but not
always. In a few of the attempts, the cache was unregistered without
resulting in observable I/O errors.

Is this expected?

I'm running the Linux kernel version 6.5.0.
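
For anyone reproducing this, the settings that appear in the logs above can be captured from sysfs. This is a minimal sketch, assuming the attribute names exposed by the bcache driver (state, cache_mode, stop_when_cache_set_failed, io_error_limit, io_error_halflife) and the bcache0 device from the steps above. The show_attr helper and the BDEV variable are made up for illustration, and attributes that are missing are reported instead of causing a failure.

```shell
#!/bin/sh
# Print "path: value" for a readable sysfs attribute, or note that it is missing.
# (show_attr is a hypothetical helper, not part of bcache-tools.)
show_attr() {
    if [ -r "$1" ]; then
        printf '%s: %s\n' "$1" "$(cat "$1")"
    else
        printf '%s: (not available)\n' "$1"
    fi
}

# Backing-device attributes; override BDEV for a different bcache device.
BDEV=${BDEV:-/sys/block/bcache0/bcache}
show_attr "$BDEV/state"
show_attr "$BDEV/cache_mode"
show_attr "$BDEV/stop_when_cache_set_failed"

# Cache-set attributes; the directory name is the cache set UUID.
# If no cache set is registered, the unmatched pattern is reported as unavailable.
for cset in /sys/fs/bcache/*-*; do
    show_attr "$cset/io_error_limit"
    show_attr "$cset/io_error_halflife"
done
```

Running it before and after the cache is unregistered should show whether the backing device stayed up and which error threshold was in effect.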


* Re: I/O error on cache device can cause user observable errors
From: Coly Li @ 2024-02-02  7:00 UTC
  To: Arnaldo Montagner; +Cc: Bcache Linux



> On Feb 2, 2024, at 06:25, Arnaldo Montagner <armont@google.com> wrote:
> 
> Is this expected?

Yes, this is expected: it is handled as a device failure or hot-plug event.

BTW, which part of the documentation says that "errors on the cache device are
handled transparently"? Let me see whether it should be updated.

Thanks.

Coly Li




* Re: I/O error on cache device can cause user observable errors
From: Arnaldo Montagner @ 2024-02-02 17:48 UTC
  To: Coly Li; +Cc: Bcache Linux

Thanks for clarifying.

Regarding the documentation, the first sentence in
https://docs.kernel.org/admin-guide/bcache.html#error-handling says:
"Bcache tries to transparently handle IO errors to/from the cache
device without affecting normal operation"

I guess I interpreted it in absolute terms, as some kind of guarantee
that normal operation would not be affected.


* Re: I/O error on cache device can cause user observable errors
From: Coly Li @ 2024-02-03  3:43 UTC
  To: Arnaldo Montagner; +Cc: Bcache Linux



> On Feb 3, 2024, at 01:48, Arnaldo Montagner <armont@google.com> wrote:
> 
> Thanks for clarifying.
> 
> Regarding the documentation, the first sentence in
> https://docs.kernel.org/admin-guide/bcache.html#error-handling says:
> "Bcache tries to transparently handle IO errors to/from the cache
> device without affecting normal operation"
> 

Oh I see. That sentence is about temporary I/O failures or errors. If the cache is clean and an I/O error is encountered, bcache may fetch the data from the backing device.
Such operations are transparent to upper-layer code.

> I guess I interpreted it in absolute terms, as some kind of guarantee
> that normal operation would not be affected.
> 

For device failure or absence, bcache needs to make upper-layer code aware of it, so that it can handle the aftermath when necessary. Your situation falls into this category.

Thanks.

Coly Li
