linux-block.vger.kernel.org archive mirror
* [Bug Report] Discard bios cannot be correctly merged in blk-mq
@ 2021-06-05 20:54 Wang Shanker
  2021-06-05 22:38 ` antlists
  2021-06-07 13:07 ` Ming Lei
  0 siblings, 2 replies; 13+ messages in thread
From: Wang Shanker @ 2021-06-05 20:54 UTC (permalink / raw)
  To: linux-block; +Cc: linux-raid

Hi, all

I'm writing to report my recent findings about the handling of discard 
operations. As indicated by a few tests, discard operations cannot be 
correctly merged, which leads to poor performance of RAID456 on discard
requests. I'm not quite familiar with the block subsystem, so please correct
me if there are any mistakes in the following analysis.

In blk_discard_mergable(), we can see that merging of discard operations
takes one of two paths, depending on whether queue_max_discard_segments is
greater than one. If the device requires the sectors in one discard
operation to be contiguous, merging follows the same path as for normal
read/write operations. Otherwise, bio_attempt_discard_merge() will try to
merge as many bios as the device allows, ignoring contiguity. Sadly, both
cases have problems.

For devices requiring contiguous sector ranges (such as SCSI disks), 
bio_attempt_front_merge() or bio_attempt_back_merge() handles the merging, 
and both control flows arrive at ll_new_hw_segment(), where the following 
condition:

    req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)

is always true, since blk_rq_get_max_segments(req) will always be 1 and
req->nr_phys_segments is already 1. As a result, no discard operations will
ever be merged.

For devices supporting multiple segments of sector ranges, 
bio_attempt_discard_merge() takes over the process. It does indeed merge
some bios. But how many bios can be merged into one request? In the 
implementation, the maximum number of bios is limited mainly by 
queue_max_discard_segments (and also by blk_rq_get_max_sectors(), but that is
not where the problem lies). However, this limit is applied incorrectly,
because bio_attempt_discard_merge() is not aware of the contiguity of the
bios. Suppose there are 20 contiguous bios: they should be counted as a
single segment, not as 20.

You may wonder why merging discard operations matters. In the 
implementation of RAID456, bios are committed in 4k chunks (called stripes
in the code; the size is determined by DEFAULT_STRIPE_SIZE).
Proper merging of these bios is of vital importance for reasonable 
performance of RAID456 devices. In fact, I ran into this problem
when attempting to create a raid5 volume on a bunch of NVMe SSDs with trim
support enabled. Someone also reported similar issues on the linux-raid list
(https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
reported "lots of small 4k discard requests that get merged into larger
512k chunks submitted to devices". This matches my discovery above,
because NVMe allows at most 128 segments in a DSM command.

The above two scenarios can be reproduced with the latest QEMU, using emulated
scsi drives (for the first scenario) or nvme drives (for the second scenario)
and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
The detailed reproduction steps are as follows:

1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
2. obtain a kernel image vmlinuz and generate an initramfs image initrd.img ;
3. create 3 empty sparse disk images:
  # truncate -s 1T disk1 disk2 disk3
4. use the following qemu command to start the guest vm (9p is used 
 as the rootfs because we don't want io operations on the rootfs to influence
 the debugging of the block layer in the guest vm)
  # qemu-system-x86_64 \
        -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G  \
        -chardev stdio,mux=on,id=char0,signal=off  \
        -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
        -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
        -monitor chardev:char0 \
        -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0  \
        -nographic \
        -kernel vmlinuz -initrd initrd.img  \
        -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
        -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
        -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
        -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
        -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
        -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
        -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
        -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
5. enable trim support of the raid456 module:
  # modprobe raid456
  # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
6. use mdadm to create a raid5 device in the guest vm:
  # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
7. issue a discard request on the md device (limit the size of the 
 discard request because discarding all 2T of data is too slow):
  # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k 
 length (nlb 1).

Similarly, the problem with scsi devices can be emulated using the following 
options for qemu:

        -device virtio-scsi,id=scsi \
        -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
        -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
        -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
        -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log


Despite these findings, I cannot come up with a proper fix for this issue due
to my lack of familiarity with the block subsystem. I look forward to your
feedback. Thanks in advance.

Cheers,

Miao Wang


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-05 20:54 [Bug Report] Discard bios cannot be correctly merged in blk-mq Wang Shanker
@ 2021-06-05 22:38 ` antlists
  2021-06-06  3:44   ` Wang Shanker
  2021-06-07 13:07 ` Ming Lei
  1 sibling, 1 reply; 13+ messages in thread
From: antlists @ 2021-06-05 22:38 UTC (permalink / raw)
  To: Wang Shanker, linux-block; +Cc: linux-raid

On 05/06/2021 21:54, Wang Shanker wrote:
> You may wonder the importance of merging discard operations. In the
> implementation of RAID456, bios are committed in 4k trunks (they call
> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
> The proper merging of the bios is of vital importance for a reasonable
> operating performance of RAID456 devices.

Note that I have seen reports (I'm not sure where, or how true they are) 
that even when requests are sent as 512k or whatever, certain upper 
layers break them into 4k chunks, presumably expecting lower layers to merge 
them again.

You might have better luck looking for and suppressing the breaking up 
of large chunks.

Cheers,
Wol


* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-05 22:38 ` antlists
@ 2021-06-06  3:44   ` Wang Shanker
  0 siblings, 0 replies; 13+ messages in thread
From: Wang Shanker @ 2021-06-06  3:44 UTC (permalink / raw)
  To: antlists; +Cc: linux-block, linux-raid


> On Jun 6, 2021, at 06:38, antlists <antlists@youngman.org.uk> wrote:
> 
> On 05/06/2021 21:54, Wang Shanker wrote:
>> You may wonder the importance of merging discard operations. In the
>> implementation of RAID456, bios are committed in 4k trunks (they call
>> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
>> The proper merging of the bios is of vital importance for a reasonable
>> operating performance of RAID456 devices.
> 
> Note that I have seen reports (I'm not sure where or how true they are), that even when requests are sent as 512k or whatever, certain upper layers break them into 4k's, presumably expecting lower layers to merge them again.

Yes, that's true for RAID456. RAID456 issues 4k requests to the lower backing
devices, no matter how large the received request is. That is not a problem
for normal read and write operations, because those 4k requests can be
nicely merged into larger ones; otherwise, we would be flooded with reports 
complaining about unacceptable raid456 performance.

> You might have better luck looking for and suppressing the breaking up of large chunks.

I'm quite aware that it is raid456 that is breaking up the requests. It might
be better if we could avoid this in raid456, but I believe that would be very
difficult given the design of the raid456 code. The question now, however, is
why merging succeeds for normal read/write operations but fails for discard
operations. Where does the difference lie? I did some testing in a qemu vm
and added some debug printing to the control flow. From what I have
discovered, I'm quite confident that discard requests are not handled
correctly when they are merged.

Cheers,

Miao Wang


* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-05 20:54 [Bug Report] Discard bios cannot be correctly merged in blk-mq Wang Shanker
  2021-06-05 22:38 ` antlists
@ 2021-06-07 13:07 ` Ming Lei
  2021-06-08 15:49   ` Wang Shanker
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-06-07 13:07 UTC (permalink / raw)
  To: Wang Shanker; +Cc: linux-block, linux-raid

On Sun, Jun 06, 2021 at 04:54:09AM +0800, Wang Shanker wrote:
> Hi, all
> 
> I'm writing to report my recent findings about the handling of discard 
> operations. As indicated by a few tests, discard operation cannot be 
> correctly merged, which leads to poor performance of RAID456 on discard
> requests. I'm not quite familiar with block subsystem, so please correct
> me if there are any mistakes in the following analysis.
> 
> In blk_discard_mergable(), we can see the handling of merging discard 
> operations goes through different processes, decided by whether we have
> more than one queue_max_discard_segments. If the device requires the 
> sectors should be contiguous in one discard operation, the merging process
> will be the same as that for normal read/write operations. Otherwise, 
> bio_attempt_discard_merge will try to merge as many bios as the device
> allows, ignoring the contiguity. Sadly, for both cases, there are problems.
> 
> For devices requiring contiguous sector ranges(such as scsi disks), 
> bio_attempt_front_merge() or bio_attempt_back_merge() will be handling 
> the merging process, and both control flows will arrive at 
> ll_new_hw_segment(), where the following statement:
> 
>     req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)
> 
> can never be true, since blk_rq_get_max_segments(req) will always be 1.
> As a result, no discard operations shall be merged.

OK, that looks a bug, and the following change may fix the issue:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4d97fb6dd226..65210e9a8efa 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -559,10 +559,14 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
 		unsigned int nr_phys_segs)
 {
-	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
+	if (blk_integrity_merge_bio(req->q, req, bio) == false)
 		goto no_merge;
 
-	if (blk_integrity_merge_bio(req->q, req, bio) == false)
+	/* discard request merge won't add new segment */
+	if (req_op(req) == REQ_OP_DISCARD)
+		return 1;
+
+	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
 		goto no_merge;
 
 	/*

> 
> For devices supporting multiple segments of sector ranges, 
> bio_attempt_discard_merge() will take over the process. Indeed it will merge
> some bios. But how many bios can be merged into one request? In the 
> implementation, the maximum number of bios is limited mainly by 
> queue_max_discard_segments (also by blk_rq_get_max_sectors, but it's not where
> the problem is). However, it is not the case, since bio_attempt_discard_merge
> is not aware of the contiguity of bios. Suppose there are 20 contiguous bios.
> They should be considered as only one segment instead 20 of them.

Right, so far ELEVATOR_DISCARD_MERGE doesn't merge bios actually, but it
can be supported without much difficulty.

> 
> You may wonder the importance of merging discard operations. In the 
> implementation of RAID456, bios are committed in 4k trunks (they call
> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
> The proper merging of the bios is of vital importance for a reasonable 
> operating performance of RAID456 devices. In fact, I met this problem
> when attempting to create a raid5 volume on a bunch of Nvme SSDs enabling trim
> support. Someone also reported similar issues in the linux-raid list
> (https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
> reported that ``lots of small 4k discard requests that get merged into larger
> 512k chunks submitted to devices". This can be explained by my above discovery
> because nvme allows 128 segments at the maximum in a dsm instruction.
> 
> The above two scenarios can be reproduced utilizing latest QEMU, with emulated
> scsi drives (for the first scenario) or nvme drives (for the second scenario)
> and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
> The detailed process reproducing is as follows:
> 
> 1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
> 2. obtain a kernel image vmlinuz and generate a initramfs image initrd.img ;
> 3. create 3 empty sparse disk images:
>   # truncate -s 1T disk1 disk2 disk3
> 4. using the following qemu command to start the guest vm (here 9p is used 
>  as the rootfs because we don't want the io operations on the rootfs influence
>  the debugging of the block layer of the guest vm)
>   # qemu-system-x86_64 \
>         -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G  \
>         -chardev stdio,mux=on,id=char0,signal=off  \
>         -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
>         -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
>         -monitor chardev:char0 \
>         -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0  \
>         -nographic \
>         -kernel vmlinuz -initrd initrd.img  \
>         -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
>         -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
>         -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
>         -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
>         -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
>         -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>         -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>         -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
> 5. enable trim support of the raid456 module:
>   # modprobe raid456
>   # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
> 6. using mdaam to create a raid5 device in the guest vm:
>   # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
> 7. and issue a discard request on the dm device: (limit the size of 
>  discard request because discarding all the 2T data is too slow)
>   # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
> 8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k 
>  length (nlb 1).

4kb should be the discard segment length, instead of discard request
length, which should be 512k in the above test.

> 
> Similarly, the problem with scsi devices can be emulated using the following 
> options for qemu:
> 
>         -device virtio-scsi,id=scsi \
>         -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>         -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>         -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
>         -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
> 
> 
> Despite the discovery, I cannot come up with a proper fix of this issue due
> to my lack of familiarity of the block subsystem. I expect your kind feedback
> on this. Thanks in advance.

In the above setting and raid456 test, I observe that rq->nr_phys_segments can
reach 128, but queue_max_discard_segments() reports 256. So discard
request size can be 512KB, which is the max size when you run 1MB discard on
raid456. However, if the discard length on raid456 is increased, the
current way will become inefficient.


Thanks,
Ming



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-07 13:07 ` Ming Lei
@ 2021-06-08 15:49   ` Wang Shanker
  2021-06-09  0:41     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Shanker @ 2021-06-08 15:49 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-raid

> On Jun 7, 2021, at 21:07, Ming Lei <ming.lei@redhat.com> wrote:
> 
> On Sun, Jun 06, 2021 at 04:54:09AM +0800, Wang Shanker wrote:
>> Hi, all
>> 
>> I'm writing to report my recent findings about the handling of discard 
>> operations. As indicated by a few tests, discard operation cannot be 
>> correctly merged, which leads to poor performance of RAID456 on discard
>> requests. I'm not quite familiar with block subsystem, so please correct
>> me if there are any mistakes in the following analysis.
>> 
>> In blk_discard_mergable(), we can see the handling of merging discard 
>> operations goes through different processes, decided by whether we have
>> more than one queue_max_discard_segments. If the device requires the 
>> sectors should be contiguous in one discard operation, the merging process
>> will be the same as that for normal read/write operations. Otherwise, 
>> bio_attempt_discard_merge will try to merge as many bios as the device
>> allows, ignoring the contiguity. Sadly, for both cases, there are problems.
>> 
>> For devices requiring contiguous sector ranges(such as scsi disks), 
>> bio_attempt_front_merge() or bio_attempt_back_merge() will be handling 
>> the merging process, and both control flows will arrive at 
>> ll_new_hw_segment(), where the following statement:
>> 
>>    req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)
>> 
>> can never be true, since blk_rq_get_max_segments(req) will always be 1.
>> As a result, no discard operations shall be merged.
> 
> OK, that looks a bug, and the following change may fix the issue:
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 4d97fb6dd226..65210e9a8efa 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -559,10 +559,14 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
> 		unsigned int nr_phys_segs)
> {
> -	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> +	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> 		goto no_merge;
> 
> -	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> +	/* discard request merge won't add new segment */
> +	if (req_op(req) == REQ_OP_DISCARD)
> +		return 1;
> +
> +	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> 		goto no_merge;
> 
> 	/*
> 

Many thanks for this patch. I wonder whether it will be applied soon or 
whether it will go through a formal [PATCH] process.

>> 
>> For devices supporting multiple segments of sector ranges, 
>> bio_attempt_discard_merge() will take over the process. Indeed it will merge
>> some bios. But how many bios can be merged into one request? In the 
>> implementation, the maximum number of bios is limited mainly by 
>> queue_max_discard_segments (also by blk_rq_get_max_sectors, but it's not where
>> the problem is). However, it is not the case, since bio_attempt_discard_merge
>> is not aware of the contiguity of bios. Suppose there are 20 contiguous bios.
>> They should be considered as only one segment instead 20 of them.
> 
> Right, so far ELEVATOR_DISCARD_MERGE doesn't merge bios actually, but it
> can be supported without much difficulty.
> 
>> 
>> You may wonder the importance of merging discard operations. In the 
>> implementation of RAID456, bios are committed in 4k trunks (they call
>> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
>> The proper merging of the bios is of vital importance for a reasonable 
>> operating performance of RAID456 devices. In fact, I met this problem
>> when attempting to create a raid5 volume on a bunch of Nvme SSDs enabling trim
>> support. Someone also reported similar issues in the linux-raid list
>> (https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
>> reported that ``lots of small 4k discard requests that get merged into larger
>> 512k chunks submitted to devices". This can be explained by my above discovery
>> because nvme allows 128 segments at the maximum in a dsm instruction.
>> 
>> The above two scenarios can be reproduced utilizing latest QEMU, with emulated
>> scsi drives (for the first scenario) or nvme drives (for the second scenario)
>> and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
>> The detailed process reproducing is as follows:
>> 
>> 1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
>> 2. obtain a kernel image vmlinuz and generate a initramfs image initrd.img ;
>> 3. create 3 empty sparse disk images:
>>  # truncate -s 1T disk1 disk2 disk3
>> 4. using the following qemu command to start the guest vm (here 9p is used 
>> as the rootfs because we don't want the io operations on the rootfs influence
>> the debugging of the block layer of the guest vm)
>>  # qemu-system-x86_64 \
>>        -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G  \
>>        -chardev stdio,mux=on,id=char0,signal=off  \
>>        -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
>>        -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
>>        -monitor chardev:char0 \
>>        -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0  \
>>        -nographic \
>>        -kernel vmlinuz -initrd initrd.img  \
>>        -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
>>        -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
>>        -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
>>        -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
>>        -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
>>        -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>>        -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>>        -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
>> 5. enable trim support of the raid456 module:
>>  # modprobe raid456
>>  # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
>> 6. using mdaam to create a raid5 device in the guest vm:
>>  # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
>> 7. and issue a discard request on the dm device: (limit the size of 
>> discard request because discarding all the 2T data is too slow)
>>  # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
>> 8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k 
>> length (nlb 1).
> 
> 4kb should be the discard segment length, instead of discard request
> length, which should be 512k in the above test.

Actually, what the nvme controller receives are discard requests with 128
segments of 4k each, instead of one 512k segment.

> 
>> 
>> Similarly, the problem with scsi devices can be emulated using the following 
>> options for qemu:
>> 
>>        -device virtio-scsi,id=scsi \
>>        -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>>        -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>>        -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
>>        -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
>> 
>> 
>> Despite the discovery, I cannot come up with a proper fix of this issue due
>> to my lack of familiarity of the block subsystem. I expect your kind feedback
>> on this. Thanks in advance.
> 
> In the above setting and raid456 test, I observe that rq->nr_phys_segments can
> reach 128, but queue_max_discard_segments() reports 256. So discard
> request size can be 512KB, which is the max size when you run 1MB discard on
> raid456. However, if the discard length on raid456 is increased, the
> current way will become inefficient.

Exactly. 

I suggest that bios be merged, and counted as a single segment, when they are
contiguous and contain no data.

And I also discovered later that even normal large write requests, e.g.
a 10M write, will be split into 4k bios. The maximum number of bios which can 
be merged into one request is limited by queue_max_segments, regardless
of whether those bios are contiguous. In my test environment, for scsi devices,
queue_max_segments can be 254, which means requests of about 1M. For nvme
devices (e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts
is 5, which results in requests of only 132k.

So, I would also suggest that raid456 be improved to issue bigger bios to the
underlying drives.

Cheers,

Miao Wang



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-08 15:49   ` Wang Shanker
@ 2021-06-09  0:41     ` Ming Lei
  2021-06-09  2:40       ` Wang Shanker
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-06-09  0:41 UTC (permalink / raw)
  To: Wang Shanker; +Cc: linux-block, linux-raid, Xiao Ni

On Tue, Jun 08, 2021 at 11:49:04PM +0800, Wang Shanker wrote:
> > On Jun 7, 2021, at 21:07, Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > On Sun, Jun 06, 2021 at 04:54:09AM +0800, Wang Shanker wrote:
> >> Hi, all
> >> 
> >> I'm writing to report my recent findings about the handling of discard 
> >> operations. As indicated by a few tests, discard operation cannot be 
> >> correctly merged, which leads to poor performance of RAID456 on discard
> >> requests. I'm not quite familiar with block subsystem, so please correct
> >> me if there are any mistakes in the following analysis.
> >> 
> >> In blk_discard_mergable(), we can see the handling of merging discard 
> >> operations goes through different processes, decided by whether we have
> >> more than one queue_max_discard_segments. If the device requires the 
> >> sectors should be contiguous in one discard operation, the merging process
> >> will be the same as that for normal read/write operations. Otherwise, 
> >> bio_attempt_discard_merge will try to merge as many bios as the device
> >> allows, ignoring the contiguity. Sadly, for both cases, there are problems.
> >> 
> >> For devices requiring contiguous sector ranges(such as scsi disks), 
> >> bio_attempt_front_merge() or bio_attempt_back_merge() will be handling 
> >> the merging process, and both control flows will arrive at 
> >> ll_new_hw_segment(), where the following statement:
> >> 
> >>    req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)
> >> 
> >> can never be true, since blk_rq_get_max_segments(req) will always be 1.
> >> As a result, no discard operations shall be merged.
> > 
> > OK, that looks a bug, and the following change may fix the issue:
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 4d97fb6dd226..65210e9a8efa 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -559,10 +559,14 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> > static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
> > 		unsigned int nr_phys_segs)
> > {
> > -	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> > +	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> > 		goto no_merge;
> > 
> > -	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> > +	/* discard request merge won't add new segment */
> > +	if (req_op(req) == REQ_OP_DISCARD)
> > +		return 1;
> > +
> > +	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> > 		goto no_merge;
> > 
> > 	/*
> > 
> 
> Many thanks for this patch. I wonder whether it will be applied soon or it will go 
> through a formal [PATCH] process.
> 
> >> 
> >> For devices supporting multiple segments of sector ranges, 
> >> bio_attempt_discard_merge() will take over the process. Indeed it will merge
> >> some bios. But how many bios can be merged into one request? In the 
> >> implementation, the maximum number of bios is limited mainly by 
> >> queue_max_discard_segments (also by blk_rq_get_max_sectors, but it's not where
> >> the problem is). However, it is not the case, since bio_attempt_discard_merge
> >> is not aware of the contiguity of bios. Suppose there are 20 contiguous bios.
> >> They should be considered as only one segment instead 20 of them.
> > 
> > Right, so far ELEVATOR_DISCARD_MERGE doesn't merge bios actually, but it
> > can be supported without much difficulty.
> > 
> >> 
> >> You may wonder the importance of merging discard operations. In the 
> >> implementation of RAID456, bios are committed in 4k trunks (they call
> >> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
> >> The proper merging of the bios is of vital importance for a reasonable 
> >> operating performance of RAID456 devices. In fact, I met this problem
> >> when attempting to create a raid5 volume on a bunch of Nvme SSDs enabling trim
> >> support. Someone also reported similar issues in the linux-raid list
> >> (https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
> >> reported that ``lots of small 4k discard requests that get merged into larger
> >> 512k chunks submitted to devices". This can be explained by my above discovery
> >> because nvme allows 128 segments at the maximum in a dsm instruction.
> >> 
> >> The above two scenarios can be reproduced utilizing latest QEMU, with emulated
> >> scsi drives (for the first scenario) or nvme drives (for the second scenario)
> >> and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
> >> The detailed process reproducing is as follows:
> >> 
> >> 1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
> >> 2. obtain a kernel image vmlinuz and generate a initramfs image initrd.img ;
> >> 3. create 3 empty sparse disk images:
> >>  # truncate -s 1T disk1 disk2 disk3
> >> 4. using the following qemu command to start the guest vm (here 9p is used 
> >> as the rootfs because we don't want the io operations on the rootfs influence
> >> the debugging of the block layer of the guest vm)
> >>  # qemu-system-x86_64 \
> >>        -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G  \
> >>        -chardev stdio,mux=on,id=char0,signal=off  \
> >>        -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
> >>        -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
> >>        -monitor chardev:char0 \
> >>        -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0  \
> >>        -nographic \
> >>        -kernel vmlinuz -initrd initrd.img  \
> >>        -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
> >>        -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
> >>        -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
> >>        -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
> >>        -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
> >>        -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
> >>        -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
> >>        -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
> >> 5. enable trim support of the raid456 module:
> >>  # modprobe raid456
> >>  # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
> >> 6. using mdaam to create a raid5 device in the guest vm:
> >>  # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
> >> 7. and issue a discard request on the dm device: (limit the size of 
> >> discard request because discarding all the 2T data is too slow)
> >>  # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
> >> 8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k 
> >> length (nlb 1).
> > 
> > 4kb should be the discard segment length, instead of discard request
> > length, which should be 512k in the above test.
> 
> Actually, what are received by the nvme controller are discard requests
> with 128 segments of 4k, instead of one segment of 512k.

Right, I am just wondering whether this makes any difference from the
device's viewpoint compared with a single-range/segment discard request,
but in any case it is better to send fewer segments.
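As a rough userspace sketch of the contiguity-aware accounting suggested above (the helper name and the (start_sector, nr_sectors) bio representation are made up for illustration, not kernel APIs), counting contiguous discard bios as one hardware segment could look like:

```python
def count_discard_segments(bios):
    """Count hardware discard segments for a list of (start_sector,
    nr_sectors) bios, treating contiguous ranges as a single segment --
    the accounting proposed above for discard merging."""
    segments = 0
    prev_end = None
    for start, nr in sorted(bios):
        if prev_end is None or start != prev_end:
            segments += 1          # discontiguous: opens a new segment
        prev_end = start + nr      # contiguous: extends the current one
    return segments

# 20 contiguous 4KiB bios (8 sectors of 512B each) form one segment,
# not 20, matching the example earlier in the thread.
bios = [(i * 8, 8) for i in range(20)]
print(count_discard_segments(bios))
```

With this accounting, the 20-contiguous-bio case from the original report would consume one segment of the device's discard budget instead of 20.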

> 
> > 
> >> 
> >> Similarly, the problem with scsi devices can be emulated using the following 
> >> options for qemu:
> >> 
> >>        -device virtio-scsi,id=scsi \
> >>        -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
> >>        -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
> >>        -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
> >>        -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
> >> 
> >> 
> >> Despite the discovery, I cannot come up with a proper fix of this issue due
> >> to my lack of familiarity of the block subsystem. I expect your kind feedback
> >> on this. Thanks in advance.
> > 
> > In the above setting and raid456 test, I observe that rq->nr_phys_segments can
> > reach 128, but queue_max_discard_segments() reports 256. So discard
> > request size can be 512KB, which is the max size when you run 1MB discard on
> > raid456. However, if the discard length on raid456 is increased, the
> > current way will become inefficient.
> 
> Exactly. 
> 
> I suggest that bio's can be merged and be calculated as one segment if they are
> contiguous and contain no data.

Fine.

> 
> And I also discovered later that, even normal long write requests, e.g.
> a 10m write, will be split into 4k bio's. The maximum number of bio's which can 
> be merged into one request is limited by queue_max_segments, regardless
> of whether those bio's are contiguous. In my test environment, for scsi devices,
> queue_max_segments can be 254, which means about 1m size of requests. For nvme
> devices(e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts is 5,
> which results in only 132k of requests. 

Here what matters is queue_max_discard_segments().

> 
> So, I would also suggest that raid456 should be improved to issue bigger bio's to
> underlying drives.

Right, that should be the root solution.

Cc Xiao; I remember he has worked in this area.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-09  0:41     ` Ming Lei
@ 2021-06-09  2:40       ` Wang Shanker
  2021-06-09  8:44         ` Xiao Ni
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Shanker @ 2021-06-09  2:40 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, linux-raid, Xiao Ni


> 2021年06月09日 08:41,Ming Lei <ming.lei@redhat.com> 写道:
> 
> On Tue, Jun 08, 2021 at 11:49:04PM +0800, Wang Shanker wrote:
>> 
>> 
>> Actually, what are received by the nvme controller are discard requests
>> with 128 segments of 4k, instead of one segment of 512k.
> 
> Right, I am just wondering if this way makes a difference wrt. single
> range/segment discard request from device viewpoint, but anyway it is
> better to send less segment.
It would make a real difference once more than queue_max_discard_segments()
bios are sent: merging contiguous ones into single segments would then let
them fit into one request.
> 
>> 
>>> 
>>>> 
>>>> Similarly, the problem with scsi devices can be emulated using the following 
>>>> options for qemu:
>>>> 
>>>>       -device virtio-scsi,id=scsi \
>>>>       -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>>>>       -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>>>>       -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
>>>>       -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
>>>> 
>>>> 
>>>> Despite the discovery, I cannot come up with a proper fix of this issue due
>>>> to my lack of familiarity of the block subsystem. I expect your kind feedback
>>>> on this. Thanks in advance.
>>> 
>>> In the above setting and raid456 test, I observe that rq->nr_phys_segments can
>>> reach 128, but queue_max_discard_segments() reports 256. So discard
>>> request size can be 512KB, which is the max size when you run 1MB discard on
>>> raid456. However, if the discard length on raid456 is increased, the
>>> current way will become inefficient.
>> 
>> Exactly. 
>> 
>> I suggest that bio's can be merged and be calculated as one segment if they are
>> contiguous and contain no data.
> 
> Fine.
> 
>> 
>> And I also discovered later that, even normal long write requests, e.g.
>> a 10m write, will be split into 4k bio's. The maximum number of bio's which can 
>> be merged into one request is limited by queue_max_segments, regardless
>> of whether those bio's are contiguous. In my test environment, for scsi devices,
>> queue_max_segments can be 254, which means about 1m size of requests. For nvme
>> devices(e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts is 5,
>> which results in only 132k of requests. 
> 
> Here what matters is queue_max_discard_segments().
Here I was considering normal read/write bios: I had taken it for granted
that normal read/write IO would be optimal in raid456, and eventually
discovered that those 4k bios can only be merged into rather small requests.
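The arithmetic behind the figures quoted above is simple: when every 4KiB bio counts as its own segment, the segment limit, not the device's transfer capability, caps the merged request size. A minimal sketch (the helper name is made up):

```python
def max_merged_kib(queue_max_segments, bio_kib=4):
    """Upper bound, in KiB, on a request assembled from 4KiB bios when
    each bio is counted as a separate segment: the per-request segment
    limit caps the total size."""
    return queue_max_segments * bio_kib

print(max_merged_kib(254))  # 1016 KiB -- roughly the ~1M seen for SCSI
print(max_merged_kib(33))   # 132 KiB -- the NVMe figure quoted above
```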
> 
>> 
>> So, I would also suggest that raid456 should be improved to issue bigger bio's to
>> underlying drives.
> 
> Right, that should be root solution.
> 
> Cc Xiao, I remembered that he worked on this area.

Many thanks for looking into this issue.

Cheers,

Miao Wang


* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-09  2:40       ` Wang Shanker
@ 2021-06-09  8:44         ` Xiao Ni
  2021-06-09  9:03           ` Wang Shanker
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Ni @ 2021-06-09  8:44 UTC (permalink / raw)
  To: Wang Shanker; +Cc: Ming Lei, linux-block, linux-raid

Hi all

Thanks for reporting this. I did a test in my environment.
time blkdiscard /dev/nvme5n1  (477GB)
real    0m0.398s
time blkdiscard /dev/md0
real    9m16.569s

I'm not familiar with the block layer code. I'll try to understand the
code related to discard requests and try to fix this problem.

I have a question about raid5 discard: it needs to consider more than
raid0 and raid10 do. For example, take a raid5 with 3 disks:
D11 D21 P1 (stripe size is 4KB)
D12 D22 P2
D13 D23 P3
D14 D24 P4
...  (chunk size is 512KB)
If there is a discard request on D13 and D14, but no discard request
on D23 and D24, the discard can't be sent to D13 and D14, right?
P3 = D23 xor D13, so if we discard D13 and disk2 then fails, the
correct data can't be recovered from D13 and P3. (The discard on D13
may also end up returning zeros for the discarded region, right?)

If so, it can only handle a discard bio that is at least big enough to
cover whole stripes of data (data disks * chunk size); in this case
that is 1024KB (512KB * 2).
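The alignment rule above can be sketched numerically; this is a pure illustration with made-up helper names (the real md code works in sectors inside make_discard_request(), not KiB):

```python
CHUNK_KIB = 512      # chunk size from the example above
NR_DATA_DISKS = 2    # 3-disk raid5: two data chunks + one parity per stripe

def full_stripe_kib():
    """Smallest safely discardable unit: all data chunks of a stripe,
    so no surviving parity still protects discarded data."""
    return CHUNK_KIB * NR_DATA_DISKS

def stripe_aligned_range(start_kib, len_kib):
    """Trim [start, start+len) down to whole stripes, mirroring the
    stripe-sector alignment done at the top of make_discard_request().
    Returns (start, length) of the aligned part, or None."""
    stripe = full_stripe_kib()
    first = -(-start_kib // stripe) * stripe            # round start up
    last = (start_kib + len_kib) // stripe * stripe     # round end down
    return (first, last - first) if last > first else None

print(full_stripe_kib())  # 1024 KiB, the figure computed above
```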

Regards
Xiao


On Wed, Jun 9, 2021 at 10:40 AM Wang Shanker <shankerwangmiao@gmail.com> wrote:
>
>
> > 2021年06月09日 08:41,Ming Lei <ming.lei@redhat.com> 写道:
> >
> > On Tue, Jun 08, 2021 at 11:49:04PM +0800, Wang Shanker wrote:
> >>
> >>
> >> Actually, what are received by the nvme controller are discard requests
> >> with 128 segments of 4k, instead of one segment of 512k.
> >
> > Right, I am just wondering if this way makes a difference wrt. single
> > range/segment discard request from device viewpoint, but anyway it is
> > better to send less segment.
> It would be meaningful if more than queue_max_discard_segments() bio's
> are sent and merged into big segments.
> >
> >>
> >>>
> >>>>
> >>>> Similarly, the problem with scsi devices can be emulated using the following
> >>>> options for qemu:
> >>>>
> >>>>       -device virtio-scsi,id=scsi \
> >>>>       -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
> >>>>       -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
> >>>>       -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
> >>>>       -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
> >>>>
> >>>>
> >>>> Despite the discovery, I cannot come up with a proper fix of this issue due
> >>>> to my lack of familiarity of the block subsystem. I expect your kind feedback
> >>>> on this. Thanks in advance.
> >>>
> >>> In the above setting and raid456 test, I observe that rq->nr_phys_segments can
> >>> reach 128, but queue_max_discard_segments() reports 256. So discard
> >>> request size can be 512KB, which is the max size when you run 1MB discard on
> >>> raid456. However, if the discard length on raid456 is increased, the
> >>> current way will become inefficient.
> >>
> >> Exactly.
> >>
> >> I suggest that bio's can be merged and be calculated as one segment if they are
> >> contiguous and contain no data.
> >
> > Fine.
> >
> >>
> >> And I also discovered later that, even normal long write requests, e.g.
> >> a 10m write, will be split into 4k bio's. The maximum number of bio's which can
> >> be merged into one request is limited by queue_max_segments, regardless
> >> of whether those bio's are contiguous. In my test environment, for scsi devices,
> >> queue_max_segments can be 254, which means about 1m size of requests. For nvme
> >> devices(e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts is 5,
> >> which results in only 132k of requests.
> >
> > Here what matters is queue_max_discard_segments().
> Here I was considering normal write/read bio's, since I first took it for granted
> that normal write/read IOs would be optimal in raid456, and finally discovered
> that those 4k IOs can only be merged into not-so-big requests.
> >
> >>
> >> So, I would also suggest that raid456 should be improved to issue bigger bio's to
> >> underlying drives.
> >
> > Right, that should be root solution.
> >
> > Cc Xiao, I remembered that he worked on this area.
>
> Many thanks for looking into this issue.
>
> Cheers,
>
> Miao Wang
>



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-09  8:44         ` Xiao Ni
@ 2021-06-09  9:03           ` Wang Shanker
  2021-06-18  6:28             ` Wang Shanker
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Shanker @ 2021-06-09  9:03 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Ming Lei, linux-block, linux-raid


> 2021年06月09日 16:44,Xiao Ni <xni@redhat.com> 写道:
> 
> Hi all
> 
> Thanks for reporting about this. I did a test in my environment.
> time blkdiscard /dev/nvme5n1  (477GB)
> real    0m0.398s
> time blkdiscard /dev/md0
> real    9m16.569s
> 
> I'm not familiar with the block layer codes. I'll try to understand
> the codes related with discard request and
> try to fix this problem.
> 
> I have a question for raid5 discard, it needs to consider more than
> raid0 and raid10. For example, there is a raid5 with 3 disks.
> D11 D21 P1 (stripe size is 4KB)
> D12 D22 P2
> D13 D23 P3
> D14 D24 P4
> ...  (chunk size is 512KB)
> If there is a discard request on D13 and D14, and there is no discard
> request on D23 D24. It can't send
> discard request to D13 and D14, right? P3 = D23 xor D13. If we discard
> D13 and disk2 is broken, it can't
> get the right data from D13 and P3. The discard request on D13 can
> write 0 to the discard region, right?

Yes. It can be seen at the beginning of make_discard_request(), where
the requested range being discarded is aligned to "stripe_sectors",
which is chunk_sectors * nr_data_disks.

Cheers,

Miao Wang



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-09  9:03           ` Wang Shanker
@ 2021-06-18  6:28             ` Wang Shanker
  2021-06-18 12:49               ` Xiao Ni
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Shanker @ 2021-06-18  6:28 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Ming Lei, linux-block, linux-raid

Hi, Xiao

Any ideas on this issue?

Cheers,

Miao Wang

> 2021年06月09日 17:03,Wang Shanker <shankerwangmiao@gmail.com> 写道:
> 
>> 
>> 2021年06月09日 16:44,Xiao Ni <xni@redhat.com> 写道:
>> 
>> Hi all
>> 
>> Thanks for reporting about this. I did a test in my environment.
>> time blkdiscard /dev/nvme5n1  (477GB)
>> real    0m0.398s
>> time blkdiscard /dev/md0
>> real    9m16.569s
>> 
>> I'm not familiar with the block layer codes. I'll try to understand
>> the codes related with discard request and
>> try to fix this problem.
>> 
>> I have a question for raid5 discard, it needs to consider more than
>> raid0 and raid10. For example, there is a raid5 with 3 disks.
>> D11 D21 P1 (stripe size is 4KB)
>> D12 D22 P2
>> D13 D23 P3
>> D14 D24 P4
>> ...  (chunk size is 512KB)
>> If there is a discard request on D13 and D14, and there is no discard
>> request on D23 D24. It can't send
>> discard request to D13 and D14, right? P3 = D23 xor D13. If we discard
>> D13 and disk2 is broken, it can't
>> get the right data from D13 and P3. The discard request on D13 can
>> write 0 to the discard region, right?
> 
> Yes. It can be seen at the beginning of make_discard_request(), where
> the requested range being discarded is aligned to ``stripe_sectors", 
> which should be chunk_sectors * nr_data_disks.
> 
> Cheers,
> 
> Miao Wang



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-18  6:28             ` Wang Shanker
@ 2021-06-18 12:49               ` Xiao Ni
  2021-06-21  7:49                 ` Wang Shanker
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Ni @ 2021-06-18 12:49 UTC (permalink / raw)
  To: Wang Shanker; +Cc: Ming Lei, linux-raid, linux-block

Hi Miao

So you plan to fix this problem now? The plan is to submit the discard
bio directly to the disks, similar to raid0/raid10. As we discussed, it
needs to consider the discard region: it should be larger than
chunk_sectors * nr_data_disks, and the bio needs to be split when its
size is not aligned to chunk_sectors * nr_data_disks. The start address
of the bio needs to be considered too; if it is not aligned to a
chunk_sectors boundary, it's better to split that part off as well.
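The splitting just described could be prototyped as below; this is only a sketch with made-up names and KiB units, not the md implementation:

```python
STRIPE_KIB = 1024  # chunk_sectors * nr_data_disks for the 3-disk example

def split_discard(start, length, stripe=STRIPE_KIB):
    """Split [start, start+length) into an unaligned head, a
    stripe-aligned middle that could be discarded directly, and an
    unaligned tail. Head and tail would need the slower per-stripe
    handling; only the middle can go straight to the disks."""
    first = -(-start // stripe) * stripe           # round start up
    last = (start + length) // stripe * stripe     # round end down
    if last <= first:                              # no full stripe covered
        return (start, length), None, None
    head = (start, first - start) if first > start else None
    tail = (last, start + length - last) if start + length > last else None
    return head, (first, last - first), tail
```

For example, a discard of 3000 KiB starting at offset 100 KiB would yield a 924 KiB head, a 2048 KiB aligned middle, and a 28 KiB tail.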

I'm working on another job, so I don't have time to do this now. If
you submit patches, I can help review them :)

Regards
Xiao

On Fri, Jun 18, 2021 at 2:28 PM Wang Shanker <shankerwangmiao@gmail.com> wrote:
>
> Hi, Xiao
>
> Any ideas on this issue?
>
> Cheers,
>
> Miao Wang
>
> > 2021年06月09日 17:03,Wang Shanker <shankerwangmiao@gmail.com> 写道:
> >
> >>
> >> 2021年06月09日 16:44,Xiao Ni <xni@redhat.com> 写道:
> >>
> >> Hi all
> >>
> >> Thanks for reporting about this. I did a test in my environment.
> >> time blkdiscard /dev/nvme5n1  (477GB)
> >> real    0m0.398s
> >> time blkdiscard /dev/md0
> >> real    9m16.569s
> >>
> >> I'm not familiar with the block layer codes. I'll try to understand
> >> the codes related with discard request and
> >> try to fix this problem.
> >>
> >> I have a question for raid5 discard, it needs to consider more than
> >> raid0 and raid10. For example, there is a raid5 with 3 disks.
> >> D11 D21 P1 (stripe size is 4KB)
> >> D12 D22 P2
> >> D13 D23 P3
> >> D14 D24 P4
> >> ...  (chunk size is 512KB)
> >> If there is a discard request on D13 and D14, and there is no discard
> >> request on D23 D24. It can't send
> >> discard request to D13 and D14, right? P3 = D23 xor D13. If we discard
> >> D13 and disk2 is broken, it can't
> >> get the right data from D13 and P3. The discard request on D13 can
> >> write 0 to the discard region, right?
> >
> > Yes. It can be seen at the beginning of make_discard_request(), where
> > the requested range being discarded is aligned to ``stripe_sectors",
> > which should be chunk_sectors * nr_data_disks.
> >
> > Cheers,
> >
> > Miao Wang
>



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-18 12:49               ` Xiao Ni
@ 2021-06-21  7:49                 ` Wang Shanker
  2021-06-22  1:48                   ` Xiao Ni
  0 siblings, 1 reply; 13+ messages in thread
From: Wang Shanker @ 2021-06-21  7:49 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Ming Lei, linux-raid, linux-block

Hi, Xiao

Many thanks for your reply. I realized that this problem is not limited
to discard requests. Normal read/write requests are also first split into
4k-sized bios and then merged into larger ones. The merging of bios is
limited by queue_max_segments, which leads to small chunks of IO being
issued to the physical devices. Such behavior seems suboptimal and should
be improved.

I'm not very familiar with raid456. Could you have a look at its code when
you are free? It seems that improving this may require big changes.

Cheers,

Miao Wang

> 2021年06月18日 20:49,Xiao Ni <xni@redhat.com> 写道:
> 
> Hi Miao
> 
> So you plan to fix this problem now? The plan is to submit the discard
> bio directly to disk similar with raid0/raid10.
> As we talked, it needs to consider the discard region. It should be
> larger than chunk_sectors * nr_data_disks. It needs
> to split the bio when its size not aligned with chunk_sectors *
> nr_data_disks. And it needs to consider the start address
> of the bio too. If it's not aligned with a start address of
> chunk_sectors, it's better to split this part too.
> 
> I'm working on another job. So I don't have time to do this now. If
> you submit the patches, I can help to review :)
> 
> Regards
> Xiao
> 
> On Fri, Jun 18, 2021 at 2:28 PM Wang Shanker <shankerwangmiao@gmail.com> wrote:
>> 
>> Hi, Xiao
>> 
>> Any ideas on this issue?
>> 
>> Cheers,
>> 
>> Miao Wang
>> 
>>> 2021年06月09日 17:03,Wang Shanker <shankerwangmiao@gmail.com> 写道:
>>> 
>>>> 
>>>> 2021年06月09日 16:44,Xiao Ni <xni@redhat.com> 写道:
>>>> 
>>>> Hi all
>>>> 
>>>> Thanks for reporting about this. I did a test in my environment.
>>>> time blkdiscard /dev/nvme5n1  (477GB)
>>>> real    0m0.398s
>>>> time blkdiscard /dev/md0
>>>> real    9m16.569s
>>>> 
>>>> I'm not familiar with the block layer codes. I'll try to understand
>>>> the codes related with discard request and
>>>> try to fix this problem.
>>>> 
>>>> I have a question for raid5 discard, it needs to consider more than
>>>> raid0 and raid10. For example, there is a raid5 with 3 disks.
>>>> D11 D21 P1 (stripe size is 4KB)
>>>> D12 D22 P2
>>>> D13 D23 P3
>>>> D14 D24 P4
>>>> ...  (chunk size is 512KB)
>>>> If there is a discard request on D13 and D14, and there is no discard
>>>> request on D23 D24. It can't send
>>>> discard request to D13 and D14, right? P3 = D23 xor D13. If we discard
>>>> D13 and disk2 is broken, it can't
>>>> get the right data from D13 and P3. The discard request on D13 can
>>>> write 0 to the discard region, right?
>>> 
>>> Yes. It can be seen at the beginning of make_discard_request(), where
>>> the requested range being discarded is aligned to ``stripe_sectors",
>>> which should be chunk_sectors * nr_data_disks.
>>> 
>>> Cheers,
>>> 
>>> Miao Wang
>> 
> 



* Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
  2021-06-21  7:49                 ` Wang Shanker
@ 2021-06-22  1:48                   ` Xiao Ni
  0 siblings, 0 replies; 13+ messages in thread
From: Xiao Ni @ 2021-06-22  1:48 UTC (permalink / raw)
  To: Wang Shanker; +Cc: Ming Lei, linux-raid, linux-block

Hi

For normal read/write requests, raid456 needs to do parity calculation.
It currently uses the page size as its unit, and the whole design is
based on that, so it is very hard to change. But there have been many
efforts to improve performance: batching requests
(https://www.spinics.net/lists/raid/msg47207.html) helps avoid sending
many small bios to the disks.

And for the discard part, I'll try to do this job.

Regards
Xiao

On Mon, Jun 21, 2021 at 3:49 PM Wang Shanker <shankerwangmiao@gmail.com> wrote:
>
> Hi, Xiao
>
> Many thanks for your reply. I realized that this problem is not limited
> to discard requests. For normal read/write requests, they are also first
> get split into 4k-sized ones and then merged into larger ones. The merging
> of bio's is limited by queue_max_segments, which leads to small trunks of
> io operations issued to physical devices. It seems that such behavior is
> not optimal and should be improved.
>
> I'm not so familiar with raid456. Could you have a look of its code when you
> are free? It seems that improving this may result in big changes.
>
> Cheers,
>
> Miao Wang
>
> > 2021年06月18日 20:49,Xiao Ni <xni@redhat.com> 写道:
> >
> > Hi Miao
> >
> > So you plan to fix this problem now? The plan is to submit the discard
> > bio directly to disk similar with raid0/raid10.
> > As we talked, it needs to consider the discard region. It should be
> > larger than chunk_sectors * nr_data_disks. It needs
> > to split the bio when its size not aligned with chunk_sectors *
> > nr_data_disks. And it needs to consider the start address
> > of the bio too. If it's not aligned with a start address of
> > chunk_sectors, it's better to split this part too.
> >
> > I'm working on another job. So I don't have time to do this now. If
> > you submit the patches, I can help to review :)
> >
> > Regards
> > Xiao
> >
> > On Fri, Jun 18, 2021 at 2:28 PM Wang Shanker <shankerwangmiao@gmail.com> wrote:
> >>
> >> Hi, Xiao
> >>
> >> Any ideas on this issue?
> >>
> >> Cheers,
> >>
> >> Miao Wang
> >>
> >>> 2021年06月09日 17:03,Wang Shanker <shankerwangmiao@gmail.com> 写道:
> >>>
> >>>>
> >>>> 2021年06月09日 16:44,Xiao Ni <xni@redhat.com> 写道:
> >>>>
> >>>> Hi all
> >>>>
> >>>> Thanks for reporting about this. I did a test in my environment.
> >>>> time blkdiscard /dev/nvme5n1  (477GB)
> >>>> real    0m0.398s
> >>>> time blkdiscard /dev/md0
> >>>> real    9m16.569s
> >>>>
> >>>> I'm not familiar with the block layer codes. I'll try to understand
> >>>> the codes related with discard request and
> >>>> try to fix this problem.
> >>>>
> >>>> I have a question for raid5 discard, it needs to consider more than
> >>>> raid0 and raid10. For example, there is a raid5 with 3 disks.
> >>>> D11 D21 P1 (stripe size is 4KB)
> >>>> D12 D22 P2
> >>>> D13 D23 P3
> >>>> D14 D24 P4
> >>>> ...  (chunk size is 512KB)
> >>>> If there is a discard request on D13 and D14, and there is no discard
> >>>> request on D23 D24. It can't send
> >>>> discard request to D13 and D14, right? P3 = D23 xor D13. If we discard
> >>>> D13 and disk2 is broken, it can't
> >>>> get the right data from D13 and P3. The discard request on D13 can
> >>>> write 0 to the discard region, right?
> >>>
> >>> Yes. It can be seen at the beginning of make_discard_request(), where
> >>> the requested range being discarded is aligned to ``stripe_sectors",
> >>> which should be chunk_sectors * nr_data_disks.
> >>>
> >>> Cheers,
> >>>
> >>> Miao Wang
> >>
> >
>



end of thread, other threads:[~2021-06-22  1:49 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-05 20:54 [Bug Report] Discard bios cannot be correctly merged in blk-mq Wang Shanker
2021-06-05 22:38 ` antlists
2021-06-06  3:44   ` Wang Shanker
2021-06-07 13:07 ` Ming Lei
2021-06-08 15:49   ` Wang Shanker
2021-06-09  0:41     ` Ming Lei
2021-06-09  2:40       ` Wang Shanker
2021-06-09  8:44         ` Xiao Ni
2021-06-09  9:03           ` Wang Shanker
2021-06-18  6:28             ` Wang Shanker
2021-06-18 12:49               ` Xiao Ni
2021-06-21  7:49                 ` Wang Shanker
2021-06-22  1:48                   ` Xiao Ni
