* all cache blocks marked as dirty in writethrough mode, "Data loss may occur", constant write activity
From: Marc Lehmann @ 2014-06-28 22:54 UTC
  To: dm-devel

[I am not subscribed to the list, CC: appreciated]

Hi!

I tried to look for contact info for dm-cache bug reports, and decided to
write to this list. If this is the wrong way to report errors, pointers on
the correct way are appreciated. Also, if this report is bogus, I apologise
in advance :)

Anyway, I tried out dm-cache on debian kernel 3.14-0.bpo.1-amd64
(3.14.7-1~bpo70+1) and 3.14-1-amd64 (3.14.7-1), both had essentially the
same behaviour. I previously tried on a 3.12 kernel, which showed none of
these issues.

Namely, after creating a writethrough dm-cache mapping, removing it,
and setting it up again, the whole cache is marked as dirty and written
back to the origin device, which obviously shouldn't happen when using
writethrough. I noticed this because my box was very sluggish for a while
after each reboot (due to the huge write load).

On further inspection, I get kernel error messages on "dmsetup remove" of the
dm-cache device:

   [ 3137.734148] device-mapper: space map metadata: unable to allocate new metadata block
   [ 3137.734152] device-mapper: cache: could not resize on-disk discard bitset
   [ 3137.734153] device-mapper: cache: could not write discard bitset
   [ 3137.734155] device-mapper: space map metadata: unable to allocate new metadata block
   [ 3137.734155] device-mapper: cache metadata: begin_hints failed
   [ 3137.734156] device-mapper: cache: could not write hints
   [ 3137.734159] device-mapper: space map metadata: unable to allocate new metadata block
   [ 3137.734160] device-mapper: cache: could not write cache metadata.  Data loss may occur.

I used the documented formula "4MB + (16 bytes * nr_blocks)" to size the
metadata device, so it shouldn't be too small (the cache device is 10G,
the block size is 64kb, and the calculated metadata partition comes out at
about 6MB); the calculation is sketched below.
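
(Roughly how I arrived at the 6MB figure; a quick shell sketch, assuming
the documented 16 bytes of metadata per cache block:)

   # back-of-envelope check of the documented sizing rule
   # "4MB + (16 bytes * nr_blocks)" for a 10G cache with 64kb blocks
   cache_bytes=$((10 * 1024 * 1024 * 1024))
   block_bytes=$((64 * 1024))
   nr_blocks=$((cache_bytes / block_bytes))            # 163840
   meta_bytes=$((4 * 1024 * 1024 + 16 * nr_blocks))    # ~6.5MB
   echo "$nr_blocks blocks -> $meta_bytes bytes of metadata"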

I still get the above messages after increasing the metadata partition to
40MB. Only after increasing it to 70MB did the errors go away, which also
stopped all cache blocks from being marked as dirty.

Even with the 70MB metadata partition, behaviour is strange: dmsetup
remove takes 18 seconds, with one CPU spending 100% of its time in sys and
no I/O happening, and while the partitions are mounted there is constant
4kb write activity to each cache partition, with no activity on the origin
partition (which causes ~1GB/day of unnecessary wear).

Obviously dm-cache should not ever mark blocks as dirty in writethrough
mode, and obviously, the metadata requirements are much higher than
documented. Also, I think dm-cache should not constantly write to the
cache partition when the system is idle.

Details:

All devices are lvm volumes.

I tried with both a 9TB and a 19TB volume; both showed the same behaviour:

   RO    RA   SSZ   BSZ   StartSec            Size   Device
   rw   256   512  4096          0   9499955953664   /dev/dm-7
   rw   256   512  4096          0  20450918793216   /dev/dm-5

The cache devices are both 10G:

   RO    RA   SSZ   BSZ   StartSec            Size   Device
   rw   256   512  4096          0     10737418240   /dev/dm-11
   rw   256   512  4096          0     10737418240   /dev/dm-12

I use a script which divides each cache device into a 128kb header
"partition", a metadata partition and a cache block partition. The working
configuration is as follows (the first line of each block is the cache
partition mapping created by lvm, followed by the header/metadata/block
mappings and then the cache mapping; a rough sketch of the equivalent
dmsetup calls follows the tables):

   vg_cerebro-cache_bp: 0 20971520 linear 8:17 209715584
   cache-bp-header: 0 256 linear 253:12 0
   cache-bp-meta: 0 144384 linear 253:12 256
   cache-bp-cache: 0 20826880 linear 253:12 144640
   cache-bp: 0 18554601472 cache 253:22 253:23 253:7 128 1 writethrough mq 2 sequential_threshold 32

   vg_cerebro-cache_wd: 0 20971520 linear 8:17 188744064
   cache-wd-header: 0 256 linear 253:11 0
   cache-wd-meta: 0 144384 linear 253:11 256
   cache-wd-cache: 0 20826880 linear 253:11 144640
   cache-wd: 0 39943200768 cache 253:16 253:17 253:5 128 1 writethrough mq 2 sequential_threshold 32
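
For reference, the script boils down to roughly the following (a
simplified sketch; the sector counts are the ones from the cache-bp
tables above, and the origin device path is a placeholder):

   SSD=/dev/mapper/vg_cerebro-cache_bp    # the 10G cache LV
   ORIGIN=/dev/mapper/vg_cerebro-bp       # origin LV (placeholder name)

   echo "0 256 linear $SSD 0"           | dmsetup create cache-bp-header
   echo "0 144384 linear $SSD 256"      | dmsetup create cache-bp-meta
   echo "0 20826880 linear $SSD 144640" | dmsetup create cache-bp-cache

   # cache <metadata dev> <cache dev> <origin dev> <block size (sectors)>
   #       <#features> <features> <policy> <#policy args> <policy args>
   echo "0 18554601472 cache /dev/mapper/cache-bp-meta \
      /dev/mapper/cache-bp-cache $ORIGIN 128 1 writethrough \
      mq 2 sequential_threshold 32" | dmsetup create cache-bp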

The configuration where the kernel complains about a too-small metadata
partition is:

   vg_cerebro-cache_bp: 0 20971520 linear 8:17 209715584
   cache-bp-header: 0 256 linear 253:12 0
   cache-bp-meta: 0 78848 linear 253:12 256
   cache-bp-cache: 0 20892416 linear 253:12 79104
   cache-bp: 0 18554601472 cache 253:22 253:23 253:7 128 1 writethrough mq 2 sequential_threshold 32

   vg_cerebro-cache_wd: 0 20971520 linear 8:17 188744064
   cache-wd-header: 0 256 linear 253:11 0
   cache-wd-meta: 0 78848 linear 253:11 256
   cache-wd-cache: 0 20892416 linear 253:11 79104
   cache-wd: 0 39943200768 cache 253:16 253:17 253:5 128 1 writethrough mq 2 sequential_threshold 32

If more details are needed, drop me a note.

Greetings,
Marc Lehmann

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: all cache blocks marked as dirty in writethrough mode, "Data loss may occur", constant write activity
From: Joe Thornber @ 2014-06-30  9:14 UTC
  To: device-mapper development

On Sun, Jun 29, 2014 at 12:54:53AM +0200, Marc Lehmann wrote:
> [I am not subscribed to the list, CC: appreciated]
> 
> Hi!
> 
> I tried to look for contact info for dm-cache bug reports, and decided to
> write to this list. If this is the wrong way to report errors, pointers on
> the correct way are appreciated. Also, if this report is bogus, I apologise
> in advance :)

This is the right place, don't worry.

> Anyway, I tried out dm-cache on debian kernel 3.14-0.bpo.1-amd64
> (3.14.7-1~bpo70+1) and 3.14-1-amd64 (3.14.7-1), both had essentially the
> same behaviour. I previously tried on a 3.12 kernel, which showed none of
> these issues.

I think your symptoms are due to us temporarily making the granularity
of the discard bitset match that of the cache block size.  This was a
quick fix for a race condition.  There are some big changes coming in
the handling of discard bios and we'll change it back then.

Let's break down your problems:

- It uses more metadata than expected.

  Yes, the discard bitset size is proportional to the size of the origin,
  unlike the dirty bitset, which is proportional to the cache size.
  Reducing the granularity has increased the space it takes up
  considerably (see the rough numbers after this list).

- It assumes the cache is dirty if something went wrong during the
  last activation.

  Yes, constantly updating the dirty bitset on disk would have a large
  impact on IO latency.  So we skip it until you do a clean shutdown
  (ie. deactivate the device).  If a clean shutdown didn't occur then
  we assume all blocks are dirty and resync.

- There is an 18 second pause when removing the cache dev.

  IO will occur to the metadata device when you tear down the cache.
  This is when the dirty bitset and discard bitset get written.
  Increasing the discard bitset size has made this worse, though I
  think 18 seconds is way too long and I will look into it.

- There's a constant 4k background load to the metadata device.

  I'll look into this.  Sounds like the periodic commit is rewriting
  the superblock even if there's no change to the mappings.
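
To put rough numbers on the first point (a back-of-envelope sketch,
assuming one bit of discard state per cache-block-sized chunk of the
origin; the actual on-disk structures add further overhead):

   origin_bytes=9499955953664                # your 9TB origin
   cache_bytes=$((10 * 1024 * 1024 * 1024))  # 10G cache device
   block_bytes=$((64 * 1024))                # 64k cache blocks
   # discard bitset scales with the origin, dirty bitset with the cache
   echo "discard bitset: ~$((origin_bytes / block_bytes / 8 / 1024 / 1024)) MB"
   echo "dirty bitset:   ~$((cache_bytes / block_bytes / 8 / 1024)) KB"

That would roughly explain why the documented "4MB + 16 bytes per cache
block" rule no longer covers it once the discard granularity drops to the
cache block size.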

- Joe

* Re: all cache blocks marked as dirty in writethrough mode, "Data loss may occur", constant write activity
From: Joe Thornber @ 2014-06-30 11:25 UTC
  To: device-mapper development

On Mon, Jun 30, 2014 at 10:14:03AM +0100, Joe Thornber wrote:
> - There's a constant 4k background load to the metadata device.
> 
>   I'll look into this.  Sounds like the periodic commit is rewriting
>   the superblock even if there's no change to the mappings.

I've added this test to the dm test suite:

https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/tests/cache/io_use_tests.rb

Using my latest development tree I can't see any spurious IO.

I also instrumented the code to see exactly when the superblock is
written; again, nothing happens while the cache is idle.  So I'm puzzled
about what you are seeing.
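
One way to narrow down where those 4k writes are going is to sample the
per-device write counters in sysfs while the cache is idle (a quick
sketch; dm-5/7/11/12 are the minors from your tables):

   # field 7 of /sys/block/<dev>/stat is sectors written; run this
   # repeatedly (e.g. under watch) and see which counter keeps moving
   for dev in dm-11 dm-12 dm-5 dm-7; do
       awk -v d="$dev" '{ print d, "sectors written:", $7 }' /sys/block/$dev/stat
   done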

- Joe

