linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [dm-devel] [v2] block: Fix a WRITE SAME BUG_ON
       [not found] <20190125021107.4595-1-zhangxiaoxu5@huawei.com>
@ 2019-01-26 11:17 ` John Dorminy
  2019-01-28  5:48   ` zhangxiaoxu (A)
  2019-01-28 22:14   ` Mike Snitzer
  0 siblings, 2 replies; 20+ messages in thread
From: John Dorminy @ 2019-01-26 11:17 UTC (permalink / raw)
  To: Zhang Xiaoxu
  Cc: axboe, linux-block, Alasdair G Kergon, dm-devel, Mike Snitzer

Hi. I have read a bit of DM code and spent an hour reviewing this... I
didn't get to the point of knowing what the right fix for the problem
is, and I may have a wrong understanding, but I have two thoughts
about the patch:

I don't think this is the right solution for two reasons:

In the first place, if it's an LVM-only issue, we should fix it only
for device-mapper devices. If this is the right way to fix it,
possibly the way to do that would be to change DM calls to
blk_queue_max_write_same_sectors() to only set the max sectors to more
than 0 if and only if the logical block sizes match.

In the second place, I don't think this is the problem. The line of
code that's calling BUG_ON is, I think,
BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

This is because write_same bios are supposed to have a single sector
of data at the beginning of a page in their bio_iovec(bio) [I think,
based on a commit message I've read] -- in other words, bio_iovec(bio)
is supposed to say something like { .page = something, .offset = 0,
.len = 1 native sector }. But clearly, since the BUG_ON is being
called, one of these is not true. Have you added a log statement right
before the BUG_ON() to print out bio_offset(bio) and
bio_iovec(bio).bv_len to see which value defies expectations?

I would be happy to help you trace through this or attempt to
reproduce it myself -- what stack of devices can you reproduce this
on? Is this a dm-linear device on top of a disk? Does it have a
filesystem on top, and if so, what filesystem?

Thank you!

John Dorminy


On Fri, Jan 25, 2019 at 9:53 AM Zhang Xiaoxu <zhangxiaoxu5@huawei.com> wrote:
>
> If the lvm is stacked by different logical_block_size disks,
> when WRITE SAME on it, will bug_on:
>
> kernel BUG at drivers/scsi/sd.c:968!
> invalid opcode: 0000 [#1] SMP PTI
> CPU: 11 PID: 525 Comm: kworker/11:1H Tainted: G           O      5.0.0-rc3+ #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
> Workqueue: kblockd blk_mq_run_work_fn
> RIP: 0010:sd_init_command+0x7aa/0xdb0
> Code: 30 75 00 00 c7 85 44 01 00 00 18 00 00 00 0f 85 fa 04 00 00 48 83 c4 40 48
>       89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b6 ca fe ff <0f> 0b 41 bc 09
> RSP: 0018:ffffb55f80ddbca0 EFLAGS: 00010206
> RAX: 0000000000001000 RBX: ffff9ed23fb927a0 RCX: 000000000000f000
> RDX: ffff9ed23f0a8400 RSI: ffff9ed27bc79800 RDI: 0000000000000000
> RBP: ffff9ed23fb92680 R08: ffff9ed27c8c0000 R09: ffff9ed23fb927d8
> R10: 0000000000000000 R11: fefefefefefefeff R12: ffff9ed27bc79800
> R13: ffff9ed2787a0000 R14: ffff9ed27bdf3400 R15: ffff9ed23fb927a0
> FS:  0000000000000000(0000) GS:ffff9ed27c8c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f6b14cf9341 CR3: 0000000069058000 CR4: 00000000000006e0
> Call Trace:
>  ? vp_notify+0x12/0x20
>  scsi_queue_rq+0x525/0xa30
>  blk_mq_dispatch_rq_list+0x8d/0x580
>  ? syscall_return_via_sysret+0x10/0x7f
>  ? elv_rb_del+0x1f/0x30
>  ? deadline_remove_request+0x55/0xc0
>  blk_mq_do_dispatch_sched+0x76/0x110
>  blk_mq_sched_dispatch_requests+0xf9/0x170
>  __blk_mq_run_hw_queue+0x51/0xd0
>  process_one_work+0x195/0x380
>  worker_thread+0x30/0x390
>  ? process_one_work+0x380/0x380
>  kthread+0x113/0x130
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x35/0x40
> Modules linked in: alloc(O+)
> ---[ end trace dc92ddeb2e6d1fe5 ]---
>
> The logical_block_size of the LVM is the max value of the sub disks,
> it maybe different with one of the sub disk. when WRITE SAME on the
> disk, it will BUG_ON when setup WRITE SAME cmd.
>
> Close WRITE_SAME feature on the LVM if it was stacked by different
> logical_block_size disk.
>
> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
> ---
>  block/blk-settings.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 3e7038e475ee..e4664280edb4 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -497,8 +497,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>         t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
>         t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
>         t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
> -       t->max_write_same_sectors = min(t->max_write_same_sectors,
> -                                       b->max_write_same_sectors);
>         t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
>                                         b->max_write_zeroes_sectors);
>         t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
> @@ -537,6 +535,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>                 }
>         }
>
> +       /* If the logical block size is different, forbid write same */
> +       if (t->logical_block_size != b->logical_block_size &&
> +           t->max_write_same_sectors != UINT_MAX)
> +               t->max_write_same_sectors = 0;
> +       else
> +               t->max_write_same_sectors = min(t->max_write_same_sectors,
> +                                               b->max_write_same_sectors);
> +
>         t->logical_block_size = max(t->logical_block_size,
>                                     b->logical_block_size);
>
> --
> 2.14.4
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] [v2] block: Fix a WRITE SAME BUG_ON
  2019-01-26 11:17 ` [dm-devel] [v2] block: Fix a WRITE SAME BUG_ON John Dorminy
@ 2019-01-28  5:48   ` zhangxiaoxu (A)
  2019-01-28 22:14   ` Mike Snitzer
  1 sibling, 0 replies; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-01-28  5:48 UTC (permalink / raw)
  To: John Dorminy
  Cc: axboe, linux-block, Alasdair G Kergon, dm-devel, Mike Snitzer

Hi, Thanks for your reply.

BUG_ON is because the 'bio_iovec(bio).bv_len' not same with 'sdp->sector_size'.

Just as below reproducer, If the 'golden' is stacked by disks 'sda'(logical_block_size=512)
and 'sdb'(logical_block_size=4096), call 'blkdev_issue_write_same' on it will BUG_ON because
of the
	bio->bi_io_vec->bv_len = bdev_logical_block_size('golden'), and the
	bdev_logical_block_size('golden') = max_logical_block_size(sda, sdb).

There are 2 solutions about the problem:
1. Disable the WRITE SAME for that situation: Just like this patch.
2. Improve the WRITE SAME to adapt that situation:
	Reorganization the 'WRITE SAME' bio for the disks which logical_block_size is smaller than the logical volume.


For now, I think we should close the 'WRITE SAME' for this situation.


Reproducer:
1. Start qemu with parameter:
     -drive file=/home/qemu/disk/512k.img,if=none,format=raw,cache=none,id=data.1,discard=on \
     -device scsi-hd,drive=data.1,id=vdata.1 \
     -drive file=/home/qemu/disk/4096k.img,if=none,format=raw,cache=none,id=data.2,discard=on \
     -device scsi-hd,drive=data.2,id=vdata.2,logical_block_size=4096,physical_block_size=4096 \

2. Create LV:
     vgremove golden -y
     vgcreate golden /dev/sda /dev/sdb
     lvcreate -L1G -n testvol -i 2 golden -y

3. Call 'blkdev_issue_write_same' on the device:
	#include <linux/module.h>
	#include <linux/kernel.h>
	#include <linux/init.h>
	#include <linux/fs.h>
	#include <linux/slab.h>
	#include <linux/blkdev.h>

	static int __init hello_module_init(void)
	{
		struct block_device *dev;
		int err;

		printk(KERN_ERR "/dev/golden/testvol\n");

		dev = lookup_bdev("/dev/golden/testvol");
		if (IS_ERR(dev)) {
			pr_warn("lookup blk error!\n");
			return 0;
		}

		err = blkdev_get(dev, FMODE_WRITE | FMODE_READ | FMODE_EXCL, (void *)&err);

		if (err) {
			pr_warn("get blk error!\n");
			return 0;
		}

		err = blkdev_issue_write_same(dev, 8, 0xf8, GFP_NOIO, ZERO_PAGE(0));

		printk(KERN_ERR "blkdev_issue_write_same return %d\n", err);

		return 0;
	}

	static void __exit hello_module_exit(void)
	{
		pr_warn("\n");
	}

	module_init(hello_module_init);
	module_exit(hello_module_exit);

	MODULE_DESCRIPTION("hello module");
	MODULE_LICENSE("GPL");

On 1/26/2019 7:17 PM, John Dorminy wrote:
> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
> 
> I don't think this is the right solution for two reasons:
> 
> In the first place, if it's an LVM-only issue, we should fix it only
> for device-mapper devices. If this is the right way to fix it,
> possibly the way to do that would be to change DM calls to
> blk_queue_max_write_same_sectors() to only set the max sectors to more
> than 0 if and only if the logical block sizes match.
> 
> In the second place, I don't think this is the problem. The line of
> code that's calling BUG_ON is, I think,
> BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);
> 
> This is because write_same bios are supposed to have a single sector
> of data at the beginning of a page in their bio_iovec(bio) [I think,
> based on a commit message I've read] -- in other words, bio_iovec(bio)
> is supposed to say something like { .page = something, .offset = 0,
> .len = 1 native sector }. But clearly, since the BUG_ON is being
> called, one of these is not true. Have you added a log statement right
> before the BUG_ON() to print out bio_offset(bio) and
> bio_iovec(bio).bv_len to see which value defies expectations?
> 
> I would be happy to help you trace through this or attempt to
> reproduce it myself -- what stack of devices can you reproduce this
> on? Is this a dm-linear device on top of a disk? Does it have a
> filesystem on top, and if so, what filesystem?
> 
> Thank you!
> 
> John Dorminy
> 
> 
> On Fri, Jan 25, 2019 at 9:53 AM Zhang Xiaoxu <zhangxiaoxu5@huawei.com> wrote:
>>
>> If the lvm is stacked by different logical_block_size disks,
>> when WRITE SAME on it, will bug_on:
>>
>> kernel BUG at drivers/scsi/sd.c:968!
>> invalid opcode: 0000 [#1] SMP PTI
>> CPU: 11 PID: 525 Comm: kworker/11:1H Tainted: G           O      5.0.0-rc3+ #2
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>> Workqueue: kblockd blk_mq_run_work_fn
>> RIP: 0010:sd_init_command+0x7aa/0xdb0
>> Code: 30 75 00 00 c7 85 44 01 00 00 18 00 00 00 0f 85 fa 04 00 00 48 83 c4 40 48
>>        89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 b6 ca fe ff <0f> 0b 41 bc 09
>> RSP: 0018:ffffb55f80ddbca0 EFLAGS: 00010206
>> RAX: 0000000000001000 RBX: ffff9ed23fb927a0 RCX: 000000000000f000
>> RDX: ffff9ed23f0a8400 RSI: ffff9ed27bc79800 RDI: 0000000000000000
>> RBP: ffff9ed23fb92680 R08: ffff9ed27c8c0000 R09: ffff9ed23fb927d8
>> R10: 0000000000000000 R11: fefefefefefefeff R12: ffff9ed27bc79800
>> R13: ffff9ed2787a0000 R14: ffff9ed27bdf3400 R15: ffff9ed23fb927a0
>> FS:  0000000000000000(0000) GS:ffff9ed27c8c0000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007f6b14cf9341 CR3: 0000000069058000 CR4: 00000000000006e0
>> Call Trace:
>>   ? vp_notify+0x12/0x20
>>   scsi_queue_rq+0x525/0xa30
>>   blk_mq_dispatch_rq_list+0x8d/0x580
>>   ? syscall_return_via_sysret+0x10/0x7f
>>   ? elv_rb_del+0x1f/0x30
>>   ? deadline_remove_request+0x55/0xc0
>>   blk_mq_do_dispatch_sched+0x76/0x110
>>   blk_mq_sched_dispatch_requests+0xf9/0x170
>>   __blk_mq_run_hw_queue+0x51/0xd0
>>   process_one_work+0x195/0x380
>>   worker_thread+0x30/0x390
>>   ? process_one_work+0x380/0x380
>>   kthread+0x113/0x130
>>   ? kthread_park+0x90/0x90
>>   ret_from_fork+0x35/0x40
>> Modules linked in: alloc(O+)
>> ---[ end trace dc92ddeb2e6d1fe5 ]---
>>
>> The logical_block_size of the LVM is the max value of the sub disks,
>> it maybe different with one of the sub disk. when WRITE SAME on the
>> disk, it will BUG_ON when setup WRITE SAME cmd.
>>
>> Close WRITE_SAME feature on the LVM if it was stacked by different
>> logical_block_size disk.
>>
>> Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
>> ---
>>   block/blk-settings.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 3e7038e475ee..e4664280edb4 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -497,8 +497,6 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>          t->max_sectors = min_not_zero(t->max_sectors, b->max_sectors);
>>          t->max_hw_sectors = min_not_zero(t->max_hw_sectors, b->max_hw_sectors);
>>          t->max_dev_sectors = min_not_zero(t->max_dev_sectors, b->max_dev_sectors);
>> -       t->max_write_same_sectors = min(t->max_write_same_sectors,
>> -                                       b->max_write_same_sectors);
>>          t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
>>                                          b->max_write_zeroes_sectors);
>>          t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
>> @@ -537,6 +535,14 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
>>                  }
>>          }
>>
>> +       /* If the logical block size is different, forbid write same */
>> +       if (t->logical_block_size != b->logical_block_size &&
>> +           t->max_write_same_sectors != UINT_MAX)
>> +               t->max_write_same_sectors = 0;
>> +       else
>> +               t->max_write_same_sectors = min(t->max_write_same_sectors,
>> +                                               b->max_write_same_sectors);
>> +
>>          t->logical_block_size = max(t->logical_block_size,
>>                                      b->logical_block_size);
>>
>> --
>> 2.14.4
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-26 11:17 ` [dm-devel] [v2] block: Fix a WRITE SAME BUG_ON John Dorminy
  2019-01-28  5:48   ` zhangxiaoxu (A)
@ 2019-01-28 22:14   ` Mike Snitzer
  2019-01-29  4:54     ` Martin K. Petersen
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2019-01-28 22:14 UTC (permalink / raw)
  To: John Dorminy
  Cc: Zhang Xiaoxu, axboe, linux-block, dm-devel, Alasdair G Kergon

On Sat, Jan 26 2019 at  6:17am -0500,
John Dorminy <jdorminy@redhat.com> wrote:

> Hi. I have read a bit of DM code and spent an hour reviewing this... I
> didn't get to the point of knowing what the right fix for the problem
> is, and I may have a wrong understanding, but I have two thoughts
> about the patch:
> 
> I don't think this is the right solution for two reasons:
> 
> In the first place, if it's an LVM-only issue, we should fix it only
> for device-mapper devices. If this is the right way to fix it,
> possibly the way to do that would be to change DM calls to
> blk_queue_max_write_same_sectors() to only set the max sectors to more
> than 0 if and only if the logical block sizes match.

There is no way this is specific to lvm (or DM).  It may _seem_ that way
because lvm/dm are in the business of creating stacked devices --
whereby exposing users to blk_stack_limits().

I'll have a closer look at this issue, hopefully tomorrow, but Zhang
Xiaoxu's proposed fix looks bogus to me.  Not disputing there is an
issue, just feels like a different fix is needed.

Mike

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-28 22:14   ` Mike Snitzer
@ 2019-01-29  4:54     ` Martin K. Petersen
  2019-01-29  8:52       ` Christoph Hellwig
  2019-01-30 14:08       ` John Dorminy
  0 siblings, 2 replies; 20+ messages in thread
From: Martin K. Petersen @ 2019-01-29  4:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: John Dorminy, Zhang Xiaoxu, axboe, linux-block, dm-devel,
	Alasdair G Kergon


Mike,

>> In the first place, if it's an LVM-only issue, we should fix it only
>> for device-mapper devices. If this is the right way to fix it,
>> possibly the way to do that would be to change DM calls to
>> blk_queue_max_write_same_sectors() to only set the max sectors to
>> more than 0 if and only if the logical block sizes match.
>
> There is no way this is specific to lvm (or DM).  It may _seem_ that way
> because lvm/dm are in the business of creating stacked devices --
> whereby exposing users to blk_stack_limits().
>
> I'll have a closer look at this issue, hopefully tomorrow, but Zhang
> Xiaoxu's proposed fix looks bogus to me.  Not disputing there is an
> issue, just feels like a different fix is needed.

It's caused by a remnant of the old bio payload hack in sd.c:

	BUG_ON(bio_offset(bio) || bio_iovec(bio).bv_len != sdp->sector_size);

We rounded up LBS when we created the DM device. And therefore the
bv_len coming down is 4K. But one of the component devices has a LBS of
512 and fails this check.

At first glance one could argue we should just nuke the BUG_ON since the
sd code no longer relies on bv_len. However, the semantics for WRITE
SAME are particularly challenging in this scenario. Say the filesystem
wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
component device only has a 512-byte LBS, we would end up writing zeroes
to the entire 4K block on that component device instead of the correct
pattern. Not good.

So disallowing WRITE SAME unless all component devices have the same LBS
is the correct fix.

That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
non-zero payload. The target code ends up manually iterating, if I
remember correctly...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-29  4:54     ` Martin K. Petersen
@ 2019-01-29  8:52       ` Christoph Hellwig
  2019-01-30  6:50         ` zhangxiaoxu (A)
  2019-01-30 14:08       ` John Dorminy
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-01-29  8:52 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, John Dorminy, Zhang Xiaoxu, axboe, linux-block,
	dm-devel, Alasdair G Kergon

On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
> non-zero payload. The target code ends up manually iterating, if I
> remember correctly...

I had a series to remove it a while ago, but you wanted to keep it at
that point.  I can resurrect it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-29  8:52       ` Christoph Hellwig
@ 2019-01-30  6:50         ` zhangxiaoxu (A)
  0 siblings, 0 replies; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-01-30  6:50 UTC (permalink / raw)
  To: Christoph Hellwig, Martin K. Petersen
  Cc: Mike Snitzer, John Dorminy, axboe, linux-block, dm-devel,
	Alasdair G Kergon



On 1/29/2019 4:52 PM, Christoph Hellwig wrote:
> On Mon, Jan 28, 2019 at 11:54:08PM -0500, Martin K. Petersen wrote:
>> That said, now that we have REQ_OP_WRITE_ZEROES (where the LBS is
>> irrelevant due to the payload being the ZERO_PAGE), it may be worthwhile
>> to remove REQ_OP_WRITE_SAME. I think drbd is the only user relying on a
>> non-zero payload. The target code ends up manually iterating, if I
>> remember correctly...
> 
> I had a series to remove it a while ago, but you wanted to keep it at
> that point.  I can resurrect it.
> 
Do you mean to remove the `blkdev_issue_write_same` interface?
I think we should fix the problem first because other users may also encounter
the same problem.

> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-29  4:54     ` Martin K. Petersen
  2019-01-29  8:52       ` Christoph Hellwig
@ 2019-01-30 14:08       ` John Dorminy
  2019-01-31  0:58         ` zhangxiaoxu (A)
                           ` (2 more replies)
  1 sibling, 3 replies; 20+ messages in thread
From: John Dorminy @ 2019-01-30 14:08 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Mike Snitzer, Zhang Xiaoxu, axboe, linux-block, dm-devel,
	Alasdair G Kergon

On Mon, Jan 28, 2019 at 11:54 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> We rounded up LBS when we created the DM device. And therefore the
> bv_len coming down is 4K. But one of the component devices has a LBS of
> 512 and fails this check.
>
> At first glance one could argue we should just nuke the BUG_ON since the
> sd code no longer relies on bv_len. However, the semantics for WRITE
> SAME are particularly challenging in this scenario. Say the filesystem
> wants to WRITE SAME a 4K PAGE consisting of 512 bytes of zeroes,
> followed by 512 bytes of ones, followed by 512 bytes of twos, etc. If a
> component device only has a 512-byte LBS, we would end up writing zeroes
> to the entire 4K block on that component device instead of the correct
> pattern. Not good.
>
> So disallowing WRITE SAME unless all component devices have the same LBS
> is the correct fix.

Alternately, could possibly WRITE_SAME bios be accepted with the
minimum sector size of the stack rather than the max, e.g. 512 in this
example rather than 4k? They'd need to have a granularity of the
larger sector size, though, presumabily necessitating new queue limits
write_same_{granularity,block_size}, which might be too much work. For
devices with bigger sectors, the block layer or DM would need to
expand the small-sector payload to an appropriate larger-sector
payload, but it would preserve the ability to use WRITE_SAME with
non-zero payloads.

(I use WRITE_SAME to fill devices with a particular pattern in order
to catch failures to initialize disk structures appropriately,
personally, but it's just for convenience/speed.)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-30 14:08       ` John Dorminy
@ 2019-01-31  0:58         ` zhangxiaoxu (A)
  2019-01-31  2:23         ` Martin K. Petersen
  2019-01-31 10:39         ` Christoph Hellwig
  2 siblings, 0 replies; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-01-31  0:58 UTC (permalink / raw)
  To: John Dorminy, Martin K. Petersen
  Cc: Mike Snitzer, axboe, linux-block, dm-devel, Alasdair G Kergon



On 1/30/2019 10:08 PM, John Dorminy wrote:
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumabily necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work. For
> devices with bigger sectors, the block layer or DM would need to
> expand the small-sector payload to an appropriate larger-sector
> payload, but it would preserve the ability to use WRITE_SAME with
> non-zero payloads.
> 
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)
I think two LBSs will produce ambiguity.
Reference spec
	Information technology -
	SCSI Block Commands – 4 (SBC-4)
	ISO/IEC 14776-324:201x
	BSR INCITS 506:201x
5.50 WRITE SAME (10) command
The WRITE SAME (10) command (see table 145) requests that the device server
**transfer a single logical block** from the Data-Out Buffer and for each LBA in
the specified range of LBAs:





^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-30 14:08       ` John Dorminy
  2019-01-31  0:58         ` zhangxiaoxu (A)
@ 2019-01-31  2:23         ` Martin K. Petersen
  2019-01-31 10:39         ` Christoph Hellwig
  2 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2019-01-31  2:23 UTC (permalink / raw)
  To: John Dorminy
  Cc: Martin K. Petersen, Mike Snitzer, Zhang Xiaoxu, axboe,
	linux-block, dm-devel, Alasdair G Kergon


John,

>> So disallowing WRITE SAME unless all component devices have the same LBS
>> is the correct fix.
>
> Alternately, could possibly WRITE_SAME bios be accepted with the
> minimum sector size of the stack rather than the max, e.g. 512 in this
> example rather than 4k? They'd need to have a granularity of the
> larger sector size, though, presumabily necessitating new queue limits
> write_same_{granularity,block_size}, which might be too much work.

I don't have a problem restricting the buffer contents to be consistent
within a page. Or even change the upper layer semantics to specify the
buffer contents using a single byte (0x00..0xff).

But the issue of head and tail remains if there is a block size mismatch
so it's important that we keep scaling the logical block size up when
stacking and reject any bio that can't be honored on a 4Kn device.

> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

The intent was for stuff like MD to use it to initialize parity disks,
etc. But adoption has been pretty slow.

I don't have any problems keeping WRITE_SAME around if people are
actually using it. It just seemed like most active users only cared
about writing zeroes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-30 14:08       ` John Dorminy
  2019-01-31  0:58         ` zhangxiaoxu (A)
  2019-01-31  2:23         ` Martin K. Petersen
@ 2019-01-31 10:39         ` Christoph Hellwig
  2019-01-31 19:41           ` John Dorminy
  2 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-01-31 10:39 UTC (permalink / raw)
  To: John Dorminy
  Cc: Martin K. Petersen, Mike Snitzer, Zhang Xiaoxu, axboe,
	linux-block, dm-devel, Alasdair G Kergon

On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> (I use WRITE_SAME to fill devices with a particular pattern in order
> to catch failures to initialize disk structures appropriately,
> personally, but it's just for convenience/speed.)

How do you use it?  We don't have a user interface to generate
REQ_OP_WRITE_SAME requests.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-31 10:39         ` Christoph Hellwig
@ 2019-01-31 19:41           ` John Dorminy
  2019-02-01  7:35             ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: John Dorminy @ 2019-01-31 19:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Mike Snitzer, Zhang Xiaoxu, axboe,
	linux-block, dm-devel, Alasdair G Kergon

On Thu, Jan 31, 2019 at 5:39 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > (I use WRITE_SAME to fill devices with a particular pattern in order
> > to catch failures to initialize disk structures appropriately,
> > personally, but it's just for convenience/speed.)
>
> How do you use it?  We don't have a user interface to generate
> REQ_OP_WRITE_SAME requests.

A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
exposing a sysfs node to trigger filling a block device with a test
pattern.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-01-31 19:41           ` John Dorminy
@ 2019-02-01  7:35             ` Christoph Hellwig
  2019-02-01 14:09               ` John Dorminy
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-02-01  7:35 UTC (permalink / raw)
  To: John Dorminy
  Cc: Christoph Hellwig, Martin K. Petersen, Mike Snitzer,
	Zhang Xiaoxu, axboe, linux-block, dm-devel, Alasdair G Kergon

On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
> > On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > > (I use WRITE_SAME to fill devices with a particular pattern in order
> > > to catch failures to initialize disk structures appropriately,
> > > personally, but it's just for convenience/speed.)
> >
> > How do you use it?  We don't have a user interface to generate
> > REQ_OP_WRITE_SAME requests.
> 
> A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
> exposing a sysfs node to trigger filling a block device with a test
> pattern.

Any reason you don't just use SCSI/NVMe passthrough directly from
userspace for that?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: block: Fix a WRITE SAME BUG_ON
  2019-02-01  7:35             ` Christoph Hellwig
@ 2019-02-01 14:09               ` John Dorminy
  2019-02-01 16:03                 ` [dm-devel] " Heinz Mauelshagen
  0 siblings, 1 reply; 20+ messages in thread
From: John Dorminy @ 2019-02-01 14:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, Mike Snitzer, Zhang Xiaoxu, axboe,
	linux-block, dm-devel, Alasdair G Kergon

I didn't know such a thing existed... does it work on any block
device? Where do I read more about this?

On Fri, Feb 1, 2019 at 2:35 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
> > > On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
> > > > (I use WRITE_SAME to fill devices with a particular pattern in order
> > > > to catch failures to initialize disk structures appropriately,
> > > > personally, but it's just for convenience/speed.)
> > >
> > > How do you use it?  We don't have a user interface to generate
> > > REQ_OP_WRITE_SAME requests.
> >
> > A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
> > exposing a sysfs node to trigger filling a block device with a test
> > pattern.
>
> Any reason you don't just use SCSI/NVMe passthrough directly from
> userspace for that?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-01 14:09               ` John Dorminy
@ 2019-02-01 16:03                 ` Heinz Mauelshagen
  2019-02-01 16:18                   ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Heinz Mauelshagen @ 2019-02-01 16:03 UTC (permalink / raw)
  To: John Dorminy, Christoph Hellwig
  Cc: axboe, Martin K. Petersen, Mike Snitzer, linux-block, dm-devel,
	Zhang Xiaoxu, Alasdair G Kergon

On 2/1/19 3:09 PM, John Dorminy wrote:
> I didn't know such a thing existed... does it work on any block
> device? Where do I read more about this?


Use sg_write_same(8) from package sg3_utils.

For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000 
--xferlen=512 /dev/sdwhatever'

will read the pattern to write same from file 'foobarfile' with length 
explicitely set to 512 bytes
(rather than derived from foobarfile's size) and write it 20000 times 
starting at LBA 0 to /dev/sdwhatever.

> --lba=
>
> On Fri, Feb 1, 2019 at 2:35 AM Christoph Hellwig <hch@infradead.org> wrote:
>> On Thu, Jan 31, 2019 at 02:41:52PM -0500, John Dorminy wrote:
>>>> On Wed, Jan 30, 2019 at 09:08:50AM -0500, John Dorminy wrote:
>>>>> (I use WRITE_SAME to fill devices with a particular pattern in order
>>>>> to catch failures to initialize disk structures appropriately,
>>>>> personally, but it's just for convenience/speed.)
>>>> How do you use it?  We don't have a user interface to generate
>>>> REQ_OP_WRITE_SAME requests.
>>> A not-checked-in test module, similar to Zhang Xiaoxu's reproducer,
>>> exposing a sysfs node to trigger filling a block device with a test
>>> pattern.
>> Any reason you don't just use SCSI/NVMe passthrough directly from
>> userspace for that?
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-01 16:03                 ` [dm-devel] " Heinz Mauelshagen
@ 2019-02-01 16:18                   ` Christoph Hellwig
  2019-02-12  3:11                     ` zhangxiaoxu (A)
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2019-02-01 16:18 UTC (permalink / raw)
  To: Heinz Mauelshagen
  Cc: John Dorminy, Christoph Hellwig, axboe, Martin K. Petersen,
	Mike Snitzer, linux-block, dm-devel, Zhang Xiaoxu,
	Alasdair G Kergon

On Fri, Feb 01, 2019 at 05:03:40PM +0100, Heinz Mauelshagen wrote:
> On 2/1/19 3:09 PM, John Dorminy wrote:
> > I didn't know such a thing existed... does it work on any block
> > device? Where do I read more about this?
> 
> 
> Use sg_write_same(8) from package sg3_utils.
> 
> For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000
> --xferlen=512 /dev/sdwhatever'
> 
> will read the pattern to write same from file 'foobarfile' with length
> explicitely set to 512 bytes
> (rather than derived from foobarfile's size) and write it 20000 times
> starting at LBA 0 to /dev/sdwhatever.

Yeah.  Note that this will only work on SCSI disks (and maybe ATA
for a very specific corner case).  But for actual devices and not
remappers the same is true of REQ_OP_WRITE_SAME.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-01 16:18                   ` Christoph Hellwig
@ 2019-02-12  3:11                     ` zhangxiaoxu (A)
  2019-02-14  2:31                       ` Martin K. Petersen
  0 siblings, 1 reply; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-02-12  3:11 UTC (permalink / raw)
  To: Christoph Hellwig, Heinz Mauelshagen, John Dorminy
  Cc: axboe, Martin K. Petersen, Mike Snitzer, linux-block, dm-devel,
	Alasdair G Kergon

Any progress about the problme?
Should we disable the write same when stack the different LBA disks?

On 2/2/2019 12:18 AM, Christoph Hellwig wrote:
> On Fri, Feb 01, 2019 at 05:03:40PM +0100, Heinz Mauelshagen wrote:
>> On 2/1/19 3:09 PM, John Dorminy wrote:
>>> I didn't know such a thing existed... does it work on any block
>>> device? Where do I read more about this?
>>
>>
>> Use sg_write_same(8) from package sg3_utils.
>>
>> For instance 'sg_write_same --in=foobarfile --lba=0 --num=20000
>> --xferlen=512 /dev/sdwhatever'
>>
>> will read the pattern to write same from file 'foobarfile' with length
>> explicitely set to 512 bytes
>> (rather than derived from foobarfile's size) and write it 20000 times
>> starting at LBA 0 to /dev/sdwhatever.
> 
> Yeah.  Note that this will only work on SCSI disks (and maybe ATA
> for a very specific corner case).  But for actual devices and not
> remappers the same is true of REQ_OP_WRITE_SAME.
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-12  3:11                     ` zhangxiaoxu (A)
@ 2019-02-14  2:31                       ` Martin K. Petersen
  2019-02-14  9:36                         ` zhangxiaoxu (A)
  0 siblings, 1 reply; 20+ messages in thread
From: Martin K. Petersen @ 2019-02-14  2:31 UTC (permalink / raw)
  To: zhangxiaoxu (A)
  Cc: Christoph Hellwig, Heinz Mauelshagen, John Dorminy, axboe,
	Martin K. Petersen, Mike Snitzer, linux-block, dm-devel,
	Alasdair G Kergon


zhangxiaoxu,

> Any progress about the problme?
> Should we disable the write same when stack the different LBA disks?

Yes, please.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-14  2:31                       ` Martin K. Petersen
@ 2019-02-14  9:36                         ` zhangxiaoxu (A)
  2019-02-18 14:10                           ` zhangxiaoxu (A)
  0 siblings, 1 reply; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-02-14  9:36 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Heinz Mauelshagen, John Dorminy, axboe,
	Mike Snitzer, linux-block, dm-devel, Alasdair G Kergon

Any comments about the patch?

On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
> 
> zhangxiaoxu,
> 
>> Any progress about the problme?
>> Should we disable the write same when stack the different LBA disks?
> 
> Yes, please.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-14  9:36                         ` zhangxiaoxu (A)
@ 2019-02-18 14:10                           ` zhangxiaoxu (A)
  2019-02-19 23:10                             ` Martin K. Petersen
  0 siblings, 1 reply; 20+ messages in thread
From: zhangxiaoxu (A) @ 2019-02-18 14:10 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, Heinz Mauelshagen, John Dorminy, axboe,
	Mike Snitzer, linux-block, dm-devel, Alasdair G Kergon

ping.

Anyone can help merge this patch? or any other solutions?

On 2/14/2019 5:36 PM, zhangxiaoxu (A) wrote:
> Any comments about the patch?
> 
> On 2/14/2019 10:31 AM, Martin K. Petersen wrote:
>>
>> zhangxiaoxu,
>>
>>> Any progress about the problme?
>>> Should we disable the write same when stack the different LBA disks?
>>
>> Yes, please.
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [dm-devel] block: Fix a WRITE SAME BUG_ON
  2019-02-18 14:10                           ` zhangxiaoxu (A)
@ 2019-02-19 23:10                             ` Martin K. Petersen
  0 siblings, 0 replies; 20+ messages in thread
From: Martin K. Petersen @ 2019-02-19 23:10 UTC (permalink / raw)
  To: zhangxiaoxu (A)
  Cc: Martin K. Petersen, Christoph Hellwig, Heinz Mauelshagen,
	John Dorminy, axboe, Mike Snitzer, linux-block, dm-devel,
	Alasdair G Kergon


Hi Zhang,

> ping.
>
> Anyone can help merge this patch? or any other solutions?

+	/* If the logical block size is different, forbid write same */
+	if (t->logical_block_size != b->logical_block_size &&
+	    t->max_write_same_sectors != UINT_MAX)
+		t->max_write_same_sectors = 0;
+	else
+		t->max_write_same_sectors = min(t->max_write_same_sectors,
+						b->max_write_same_sectors);
+

I am not particularly keen on this UINT_MAX magic. I would prefer to
have the stacking driver default for lbs be set to 0 so the stacking
function could avoid special-casing the first iteration. But I am not
sure whether that would break any assumptions in DM/MD wrt. the logical
block size being non-zero prior to calling the stacking function.

Mike? Any comments?

If we stick with the UINT_MAX check, the comment should at least point
out why it's there.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-02-19 23:11 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190125021107.4595-1-zhangxiaoxu5@huawei.com>
2019-01-26 11:17 ` [dm-devel] [v2] block: Fix a WRITE SAME BUG_ON John Dorminy
2019-01-28  5:48   ` zhangxiaoxu (A)
2019-01-28 22:14   ` Mike Snitzer
2019-01-29  4:54     ` Martin K. Petersen
2019-01-29  8:52       ` Christoph Hellwig
2019-01-30  6:50         ` zhangxiaoxu (A)
2019-01-30 14:08       ` John Dorminy
2019-01-31  0:58         ` zhangxiaoxu (A)
2019-01-31  2:23         ` Martin K. Petersen
2019-01-31 10:39         ` Christoph Hellwig
2019-01-31 19:41           ` John Dorminy
2019-02-01  7:35             ` Christoph Hellwig
2019-02-01 14:09               ` John Dorminy
2019-02-01 16:03                 ` [dm-devel] " Heinz Mauelshagen
2019-02-01 16:18                   ` Christoph Hellwig
2019-02-12  3:11                     ` zhangxiaoxu (A)
2019-02-14  2:31                       ` Martin K. Petersen
2019-02-14  9:36                         ` zhangxiaoxu (A)
2019-02-18 14:10                           ` zhangxiaoxu (A)
2019-02-19 23:10                             ` Martin K. Petersen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).