* A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk
@ 2013-01-18 10:19 Daniel Browning
  2013-01-21 18:49 ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Daniel Browning @ 2013-01-18 10:19 UTC (permalink / raw)
  To: dm-devel

Why do I get the following error, and what should I do about it? When I 
create a raid0 md with a non-power-of-two chunk size (e.g. 1152K instead of 
512K), then create a thinly-provisioned volume that is over 256 GiB, I get 
the following dmesg error when I try to create a file system on it:

    "make_request bug: can't convert block across chunks or bigger than 1152k 4384 127"

This bubbles up to mkfs.xfs as

    "libxfs_device_zero write failed: Input/output error"

What I find interesting is that it seems to require all three conditions 
(chunk size, thin-p, and >256 GiB) in order to fail. Without those, it seems 
to work fine:

    * Power-of-two chunk (e.g. 512K), thin-p vol, >256 GiB? Works.
    * Non-power-of-two chunk (e.g. 1152K), thin-p vol, <256 GiB? Works.
    * Non-power-of-two chunk (e.g. 1152K), regular vol, >256 GiB? Works.
    * Non-power-of-two chunk (e.g. 1152K), thin-p vol, >256 GiB? FAIL.
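
For reference, here is roughly how each "works" row above was exercised,
relative to the reproduce script in Appendix A below (only one thing changes
per case; the exact sizes shown here are illustrative):

    # Power-of-two chunk, thin-p vol, >256 GiB:
    mdadm --create /dev/md99 --level=0 --raid-devices=2 --chunk=512K \
          /dev/loop0 /dev/loop1

    # Non-power-of-two chunk, thin-p vol, <256 GiB:
    lvcreate --size 255G --type thin-pool --thinpool test_thin_pool test_vg
    lvcreate --virtualsize 255G --thin test_vg/test_thin_pool --name test_lv

    # Non-power-of-two chunk, regular (non-thin) vol, >256 GiB:
    lvcreate --size 257G --name test_lv test_vg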

Attached are a self-contained test case to reproduce the error, version 
numbers, and an strace. Thank you in advance,
--
Daniel Browning
Kavod Technologies

Appendix A. Self-contained reproduce script
===========================================================
dd if=/dev/zero of=loop0.img bs=1G count=150; losetup /dev/loop0 loop0.img
dd if=/dev/zero of=loop1.img bs=1G count=150; losetup /dev/loop1 loop1.img
mdadm --create /dev/md99 --verbose --level=0 --raid-devices=2 \
      --chunk=1152K /dev/loop0 /dev/loop1
pvcreate /dev/md99
vgcreate test_vg /dev/md99
lvcreate --size 257G --type thin-pool --thinpool test_thin_pool test_vg
lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
mkfs.xfs /dev/test_vg/test_lv

# That is where the error occurs. Next is cleanup.
lvremove -f /dev/test_vg/test_lv
lvremove -f /dev/mapper/test_vg-test_thin_pool
vgremove -f test_vg
pvremove /dev/md99
mdadm --stop /dev/md99
mdadm --zero-superblock /dev/loop0 /dev/loop1
losetup -d /dev/loop0 /dev/loop1
rm loop*.img

Appendix B. Versions
===========================================================
Distro:          CentOS 6.3
Kernel:          3.7.2-1.el6xen.x86_64 from dev.crc.id.au
LVM version:     2.02.99(2)-git (2012-10-22)
Library version: 1.02.78-git (2012-10-22)
Driver version:  4.23.0
XFS userspace:   xfsprogs-3.1.1-7.el6.x86_64

Appendix C. strace of mkfs.xfs
===========================================================
See http://pastebin.com/raw.php?i=hLLm0jVC for the full strace. An excerpt:

lseek(4, 137975840768, SEEK_SET)        = 137975840768
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144
write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = -1 EIO (Input/output error)

* Re: A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk
  2013-01-18 10:19 A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk Daniel Browning
@ 2013-01-21 18:49 ` Mike Snitzer
  2013-01-22 11:10   ` Zdenek Kabelac
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2013-01-21 18:49 UTC (permalink / raw)
  To: Daniel Browning; +Cc: sandeen, dm-devel

On Fri, Jan 18 2013 at  5:19am -0500,
Daniel Browning <db@kavod.com> wrote:

> Why do I get the following error, and what should I do about it? When I 
> create a raid0 md with a non-power-of-two chunk size (e.g. 1152K instead of 
> 512K), then create a thinly-provisioned volume that is over 256 GiB, I get 
> the following dmesg error when I try to create a file system on it:
> 
>     "make_request bug: can't convert block across chunks or bigger than 1152k 4384 127"
> 
> This bubbles up to mkfs.xfs as
> 
>     "libxfs_device_zero write failed: Input/output error"
> 
> What I find interesting is that it seems to require all three conditions 
> (chunk size, thin-p, and >256 GiB) in order to fail. Without those, it seems 
> to work fine:
> 
>     * Power-of-two chunk (e.g. 512K), thin-p vol, >256 GiB? Works.
>     * Non-power-of-two chunk (e.g. 1152K), thin-p vol, <256 GiB? Works.
>     * Non-power-of-two chunk (e.g. 1152K), regular vol, >256 GiB? Works.
>     * Non-power-of-two chunk (e.g. 1152K), thin-p vol, >256 GiB? FAIL.
> 
> Attached is a self-contained test case to reproduce the error, version 
> numbers, and an strace. Thank you in advance,
> --
> Daniel Browning
> Kavod Technologies
> 
> Appendix A. Self-contained reproduce script
> ===========================================================
> dd if=/dev/zero of=loop0.img bs=1G count=150; losetup /dev/loop0 loop0.img
> dd if=/dev/zero of=loop1.img bs=1G count=150; losetup /dev/loop1 loop1.img
> mdadm --create /dev/md99 --verbose --level=0 --raid-devices=2 \
>       --chunk=1152K /dev/loop0 /dev/loop1
> pvcreate /dev/md99
> vgcreate test_vg /dev/md99
> lvcreate --size 257G --type thin-pool --thinpool test_thin_pool test_vg
> lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
> mkfs.xfs /dev/test_vg/test_lv
> 
> # That is where the error occurs. Next is cleanup.
> lvremove -f /dev/test_vg/test_lv
> lvremove -f /dev/mapper/test_vg-test_thin_pool
> vgremove -f test_vg
> pvremove /dev/md99
> mdadm --stop /dev/md99
> mdadm --zero-superblock /dev/loop0 /dev/loop1
> losetup -d /dev/loop0 /dev/loop1
> rm loop*.img

Limits of the raid0 device (/dev/md99):
cat /sys/block/md99/queue/minimum_io_size
1179648
cat /sys/block/md99/queue/optimal_io_size
2359296

Limits of the thin-pool device (/dev/test_vg/test_thin_pool):
cat /sys/block/dm-9/queue/minimum_io_size
512
cat /sys/block/dm-9/queue/optimal_io_size
262144

Limits of the thin device (/dev/test_vg/test_lv):
cat /sys/block/dm-10/queue/minimum_io_size
512
cat /sys/block/dm-10/queue/optimal_io_size
262144

I notice that lvcreate is not using a thin-pool chunksize that matches
the raid0's chunksize (just uses the lvm2 default of 256K).

Switching the thin-pool lvcreate to use --chunksize 1152K at least
enables me to format the filesystem.
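
For reference, the adjusted thin-pool creation looks roughly like this (VG,
pool and LV names follow the reproduce script; otherwise the options are
unchanged):

    lvcreate --size 257G --type thin-pool --chunksize 1152K \
             --thinpool test_thin_pool test_vg
    lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
    mkfs.xfs /dev/test_vg/test_lv   # completes without the EIO here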

And both the thin-pool and thin device have an optimal_io_size that
matches the chunk_size of the underlying raid volume:

cat /sys/block/dm-9/queue/optimal_io_size
1179648
cat /sys/block/dm-10/queue/optimal_io_size
1179648

I'm still investigating the limits issue when --chunksize 1152K isn't
used for the thin-pool lvcreate.

* Re: A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk
  2013-01-21 18:49 ` Mike Snitzer
@ 2013-01-22 11:10   ` Zdenek Kabelac
  2013-01-22 13:51     ` Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Zdenek Kabelac @ 2013-01-22 11:10 UTC (permalink / raw)
  To: device-mapper development; +Cc: sandeen, Daniel Browning, Mike Snitzer

On 21.1.2013 19:49, Mike Snitzer wrote:
> On Fri, Jan 18 2013 at  5:19am -0500,
> Daniel Browning <db@kavod.com> wrote:
>
>> Why do I get the following error, and what should I do about it? When I
>> create a raid0 md with a non-power-of-two chunk size (e.g. 1152K instead of
>> 512K), then create a thinly-provisioned volume that is over 256 GiB, I get
>> the following dmesg error when I try to create a file system on it:
>>
>>      "make_request bug: can't convert block across chunks or bigger than 1152k 4384 127"
>>
>> This bubbles up to mkfs.xfs as
>>
>>      "libxfs_device_zero write failed: Input/output error"
>>
>> What I find interesting is that it seems to require all three conditions
>> (chunk size, thin-p, and >256 GiB) in order to fail. Without those, it seems
>> to work fine:
>>
>>      * Power-of-two chunk (e.g. 512K), thin-p vol, >256 GiB? Works.
>>      * Non-power-of-two chunk (e.g. 1152K), thin-p vol, <256 GiB? Works.
>>      * Non-power-of-two chunk (e.g. 1152K), regular vol, >256 GiB? Works.
>>      * Non-power-of-two chunk (e.g. 1152K), thin-p vol, >256 GiB? FAIL.
>>
>> Attached is a self-contained test case to reproduce the error, version
>> numbers, and an strace. Thank you in advance,
>> --
>> Daniel Browning
>> Kavod Technologies
>>
>> Appendix A. Self-contained reproduce script
>> ===========================================================
>> dd if=/dev/zero of=loop0.img bs=1G count=150; losetup /dev/loop0 loop0.img
>> dd if=/dev/zero of=loop1.img bs=1G count=150; losetup /dev/loop1 loop1.img
>> mdadm --create /dev/md99 --verbose --level=0 --raid-devices=2 \
>>        --chunk=1152K /dev/loop0 /dev/loop1
>> pvcreate /dev/md99
>> vgcreate test_vg /dev/md99
>> lvcreate --size 257G --type thin-pool --thinpool test_thin_pool test_vg
>> lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
>> mkfs.xfs /dev/test_vg/test_lv
>>
>> # That is where the error occurs. Next is cleanup.
>> lvremove -f /dev/test_vg/test_lv
>> lvremove -f /dev/mapper/test_vg-test_thin_pool
>> vgremove -f test_vg
>> pvremove /dev/md99
>> mdadm --stop /dev/md99
>> mdadm --zero-superblock /dev/loop0 /dev/loop1
>> losetup -d /dev/loop0 /dev/loop1
>> rm loop*.img
>
> Limits of the raid0 device (/dev/md99):
> cat /sys/block/md99/queue/minimum_io_size
> 1179648
> cat /sys/block/md99/queue/optimal_io_size
> 2359296
>
> Limits of the thin-pool device (/dev/test_vg/test_thin_pool):
> cat /sys/block/dm-9/queue/minimum_io_size
> 512
> cat /sys/block/dm-9/queue/optimal_io_size
> 262144
>
> Limits of the thin device (/dev/test_vg/test_lv):
> cat /sys/block/dm-10/queue/minimum_io_size
> 512
> cat /sys/block/dm-10/queue/optimal_io_size
> 262144
>
> I notice that lvcreate is not using a thin-pool chunksize that matches
> the raid0's chunksize (just uses the lvm2 default of 256K).
>
> Switching the thin-pool lvcreate to use --chunksize 1152K at least
> enables me to format the filesystem.
>
> And both the thin-pool and thin device have an optimal_io_size that
> matches the chunk_size of the underlying raid volume:
>
> cat /sys/block/dm-9/queue/optimal_io_size
> 1179648
> cat /sys/block/dm-10/queue/optimal_io_size
> 1179648
>
> I'm still investigating the limits issue when --chunksize 1152K isn't
> used for the thin-pool lvcreate.

Just a comment on the selection of the thin chunksize here -

There are a couple of aspects to it - by default (unless changed via
lvm.conf {allocation/thin_pool_chunk_size}) lvm2 targets 64K
and scales the chunksize up to fit thin metadata within 128MB
(compiled in as DEFAULT_THIN_POOL_OPTIMAL_SIZE).
So lvm2 here scaled the chunksize from 64K up to 256K (and it scales
further for multi-TB pools).

lvcreate currently doesn't look at the geometry of the underlying PV(s) during 
its allocation (somewhat of a chicken-and-egg problem) - there are possible ways 
to try to put this into the equation, though it might not actually be wanted by 
the user, since for snapshots a smaller chunksize is more usable
(>1MB is quite a lot here IMHO) - but it is probably worth some thinking.
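
As a rough back-of-the-envelope sketch of that scaling (the ~64 bytes of
metadata per mapped chunk used below is an assumption for illustration, not a
figure taken from the lvm2 source):

    limit=$((128 * 1024 * 1024))        # 128MB metadata target
    pool_kib=$((257 * 1024 * 1024))     # 257 GiB pool, in KiB
    for chunk_kib in 64 128 256; do
        meta=$(( (pool_kib / chunk_kib) * 64 ))   # assumed ~64 B per mapping
        echo "chunk ${chunk_kib}K -> ${meta} bytes of metadata (target ${limit})"
    done
    # 64K and 128K chunks both land above the 128MB target for a 257G pool;
    # 256K is the first size that fits, which matches the chunksize seen here.

If that estimate is roughly right, it would also line up with the >256 GiB
threshold in the original report: below that size the scaled chunksize (64K or
128K) still divides 1152K evenly, while 256K does not.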

Zdenek

* Re: A thin-p over 256 GiB fails with I/O errors with non-power-of-two chunk
  2013-01-22 11:10   ` Zdenek Kabelac
@ 2013-01-22 13:51     ` Mike Snitzer
  2013-01-23 22:16       ` [PATCH] dm thin: fix queue limits stacking when data device has compulsory merge_bvec_fn Mike Snitzer
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Snitzer @ 2013-01-22 13:51 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: sandeen, device-mapper development, Daniel Browning

On Tue, Jan 22 2013 at  6:10am -0500,
Zdenek Kabelac <zkabelac@redhat.com> wrote:

> On 21.1.2013 19:49, Mike Snitzer wrote:
> >
> >Switching the thin-pool lvcreate to use --chunksize 1152K at least
> >enables me to format the filesystem.
> >
> >And both the thin-pool and thin device have an optimal_io_size that
> >matches the chunk_size of the underlying raid volume:
> >
> >cat /sys/block/dm-9/queue/optimal_io_size
> >1179648
> >cat /sys/block/dm-10/queue/optimal_io_size
> >1179648
> >
> >I'm still investigating the limits issue when --chunksize 1152K isn't
> >used for the thin-pool lvcreate.
> 
> Just a comment on the selection of the thin chunksize here -
> 
> There are a couple of aspects to it - by default (unless changed via
> lvm.conf {allocation/thin_pool_chunk_size}) lvm2 targets 64K
> and scales the chunksize up to fit thin metadata within 128MB
> (compiled in as DEFAULT_THIN_POOL_OPTIMAL_SIZE).
> So lvm2 here scaled the chunksize from 64K up to 256K (and it scales
> further for multi-TB pools).

Not quite sure what you mean by "to fit thin metadata within 128MB".
Why is fitting within 128MB the goal?  I recall Joe helping to establish
the rule of thumb for lvm2 but I don't recall specifics at this point.

> lvcreate currently doesn't look at the geometry of the underlying PV(s)
> during its allocation (somewhat of a chicken-and-egg problem) - there are
> possible ways to try to put this into the equation, though it might
> not actually be wanted by the user, since for snapshots a smaller
> chunksize is more usable
> (>1MB is quite a lot here IMHO) - but it is probably worth some thinking.

I've found that mkfs.xfs (which uses direct IO) will work if the
thinp chunksize is a factor of the raid0 chunksize.  So all of the
following thinp chunksizes "work" given that the raid0 chunksize is
1152K: 64K, 128K, 384K, 576K, and 1152K.
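
A rough sketch of how those sizes were cycled through (names follow the
reproduce script; treat this as illustrative rather than a polished test
script):

    for cs in 64K 128K 384K 576K 1152K; do
        lvcreate --size 257G --type thin-pool --chunksize "$cs" \
                 --thinpool test_thin_pool test_vg
        lvcreate --virtualsize 257G --thin test_vg/test_thin_pool --name test_lv
        mkfs.xfs -f /dev/test_vg/test_lv && echo "thinp chunksize $cs: mkfs OK"
        lvremove -f test_vg/test_lv test_vg/test_thin_pool
    done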

I haven't done extensive IO testing on the resulting XFS filesystem though,
so I don't want to get too far into shaping lvm2's chunksize selection
algorithm until I can dive into the kernel's limits stacking further
(which I'm doing now).

* [PATCH] dm thin: fix queue limits stacking when data device has compulsory merge_bvec_fn
  2013-01-22 13:51     ` Mike Snitzer
@ 2013-01-23 22:16       ` Mike Snitzer
  0 siblings, 0 replies; 5+ messages in thread
From: Mike Snitzer @ 2013-01-23 22:16 UTC (permalink / raw)
  To: device-mapper development
  Cc: sandeen, Daniel Browning, ejt, Jeff Moyer, Zdenek Kabelac,
	Alasdair G. Kergon

When a thin-pool uses an MD device for the data device, a thin device
from the thin-pool must respect MD's constraints about disallowing a bio
to span multiple chunks.  Otherwise we can see problems.  If the raid0
chunksize is 1152K and the thin-pool chunksize is 256K, I see the following
md/raid0 error (with extra debug tracing added to thin_endio) when
mkfs.xfs is executed against the thin device:

md/raid0:md99: make_request bug: can't convert block across chunks or bigger than 1152k 6688 127
device-mapper: thin: bio sector=2080 err=-5 bi_size=130560 bi_rw=17 bi_vcnt=32 bi_idx=0

This extra DM debugging shows that the failing bio is spanning across
the first and second logical 1152K chunk (sector 2080 + 255 takes the
bio beyond the first chunk's boundary of sector 2304).  So the bio
splitting that DM is doing clearly isn't respecting the MD limits.

max_hw_sectors_kb is 127 for both the thin-pool and thin device
(queue_max_hw_sectors returns 255 so we'll excuse sysfs's lack of
precision).  So this explains why bi_size is 130560.
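
For anyone checking the arithmetic above (all figures come from the debug
line and the limits already quoted):

    echo $((1152 * 1024 / 512))   # 2304: sectors per 1152K raid0 chunk
    echo $((2080 + 255))          # 2335: 2080 + 255, beyond the 2304 chunk boundary
    echo $((255 * 512))           # 130560: bytes in a 255-sector bio == bi_size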

But the thin device's max_hw_sectors_kb should be 4 (PAGE_SIZE) given
that it doesn't have a .merge function (for bio_add_page to consult
indirectly via dm_merge_bvec) yet the thin-pool does sit above an MD
device that has a compulsory merge_bvec_fn.  This scenario is exactly
why DM must resort to sending single PAGE_SIZE bios to the underlying
layer -- some additional context for this is available in the header for
commit 8cbeb67a.

Long story short, the reason a thin device doesn't properly get
configured to have a max_hw_sectors_kb of 4 (PAGE_SIZE) is that
thin_io_hints() is blindly copying the queue limits from the thin-pool
device directly to the thin device's queue limits.

Fix this by eliminating thin_io_hints.  Doing so is safe because the
block layer's queue limits stacking already enables the upper-level thin
device to inherit the thin-pool device's discard, minimum_io_size, and
optimal_io_size limits that get set in pool_io_hints.  But avoiding the
queue limits copy allows the thin and thin-pool limits to differ where it
matters, namely max_hw_sectors_kb.
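
A quick way to sanity-check the result after this patch (dm-10 was the thin
device in the traces above; the exact dm minor and LV names are setup-specific):

    cat /sys/block/dm-10/queue/max_hw_sectors_kb   # expected to drop to 4 (PAGE_SIZE)
    cat /sys/block/dm-10/queue/optimal_io_size     # still stacked from the thin-pool
    mkfs.xfs -f /dev/test_vg/test_lv               # should now complete without EIO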

Reported-by: Daniel Browning <db@kavod.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: stable@vger.kernel.org
---
 drivers/md/dm-thin.c |   11 -----------
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 074e570..9bd59ae 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -2819,16 +2819,6 @@ static int thin_iterate_devices(struct dm_target *ti,
 	return 0;
 }
 
-/*
- * A thin device always inherits its queue limits from its pool.
- */
-static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
-{
-	struct thin_c *tc = ti->private;
-
-	*limits = bdev_get_queue(tc->pool_dev->bdev)->limits;
-}
-
 static struct target_type thin_target = {
 	.name = "thin",
 	.version = {1, 6, 0},
@@ -2840,7 +2830,6 @@ static struct target_type thin_target = {
 	.postsuspend = thin_postsuspend,
 	.status = thin_status,
 	.iterate_devices = thin_iterate_devices,
-	.io_hints = thin_io_hints,
 };
 
 /*----------------------------------------------------------------*/
