From: Wang Shanker <shankerwangmiao@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
Date: Tue, 8 Jun 2021 23:49:04 +0800 [thread overview]
Message-ID: <C866C380-7A71-4722-957F-2CE65BDACF26@gmail.com> (raw)
In-Reply-To: <YL4Z/QJCKc0NCV5L@T590>
> 2021年06月07日 21:07,Ming Lei <ming.lei@redhat.com> 写道:
>
> On Sun, Jun 06, 2021 at 04:54:09AM +0800, Wang Shanker wrote:
>> Hi, all
>>
>> I'm writing to report my recent findings about the handling of discard
>> operations. As indicated by a few tests, discard operation cannot be
>> correctly merged, which leads to poor performance of RAID456 on discard
>> requests. I'm not quite familiar with block subsystem, so please correct
>> me if there are any mistakes in the following analysis.
>>
>> In blk_discard_mergable(), we can see the handling of merging discard
>> operations goes through different processes, decided by whether we have
>> more than one queue_max_discard_segments. If the device requires the
>> sectors should be contiguous in one discard operation, the merging process
>> will be the same as that for normal read/write operations. Otherwise,
>> bio_attempt_discard_merge will try to merge as many bios as the device
>> allows, ignoring the contiguity. Sadly, for both cases, there are problems.
>>
>> For devices requiring contiguous sector ranges(such as scsi disks),
>> bio_attempt_front_merge() or bio_attempt_back_merge() will be handling
>> the merging process, and both control flows will arrive at
>> ll_new_hw_segment(), where the following statement:
>>
>> req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)
>>
>> can never be true, since blk_rq_get_max_segments(req) will always be 1.
>> As a result, no discard operations shall be merged.
>
> OK, that looks a bug, and the following change may fix the issue:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 4d97fb6dd226..65210e9a8efa 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -559,10 +559,14 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
> unsigned int nr_phys_segs)
> {
> - if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> + if (blk_integrity_merge_bio(req->q, req, bio) == false)
> goto no_merge;
>
> - if (blk_integrity_merge_bio(req->q, req, bio) == false)
> + /* discard request merge won't add new segment */
> + if (req_op(req) == REQ_OP_DISCARD)
> + return 1;
> +
> + if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> goto no_merge;
>
> /*
>
Many thanks for this patch. I wonder whether it will be applied soon or it will go
through a formal [PATCH] process.
>>
>> For devices supporting multiple segments of sector ranges,
>> bio_attempt_discard_merge() will take over the process. Indeed it will merge
>> some bios. But how many bios can be merged into one request? In the
>> implementation, the maximum number of bios is limited mainly by
>> queue_max_discard_segments (also by blk_rq_get_max_sectors, but it's not where
>> the problem is). However, it is not the case, since bio_attempt_discard_merge
>> is not aware of the contiguity of bios. Suppose there are 20 contiguous bios.
>> They should be considered as only one segment instead 20 of them.
>
> Right, so far ELEVATOR_DISCARD_MERGE doesn't merge bios actually, but it
> can be supported without much difficulty.
>
>>
>> You may wonder the importance of merging discard operations. In the
>> implementation of RAID456, bios are committed in 4k trunks (they call
>> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
>> The proper merging of the bios is of vital importance for a reasonable
>> operating performance of RAID456 devices. In fact, I met this problem
>> when attempting to create a raid5 volume on a bunch of Nvme SSDs enabling trim
>> support. Someone also reported similar issues in the linux-raid list
>> (https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
>> reported that ``lots of small 4k discard requests that get merged into larger
>> 512k chunks submitted to devices". This can be explained by my above discovery
>> because nvme allows 128 segments at the maximum in a dsm instruction.
>>
>> The above two scenarios can be reproduced utilizing latest QEMU, with emulated
>> scsi drives (for the first scenario) or nvme drives (for the second scenario)
>> and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
>> The detailed process reproducing is as follows:
>>
>> 1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
>> 2. obtain a kernel image vmlinuz and generate a initramfs image initrd.img ;
>> 3. create 3 empty sparse disk images:
>> # truncate -s 1T disk1 disk2 disk3
>> 4. using the following qemu command to start the guest vm (here 9p is used
>> as the rootfs because we don't want the io operations on the rootfs influence
>> the debugging of the block layer of the guest vm)
>> # qemu-system-x86_64 \
>> -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G \
>> -chardev stdio,mux=on,id=char0,signal=off \
>> -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
>> -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
>> -monitor chardev:char0 \
>> -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0 \
>> -nographic \
>> -kernel vmlinuz -initrd initrd.img \
>> -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
>> -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
>> -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
>> -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
>> -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
>> -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>> -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>> -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
>> 5. enable trim support of the raid456 module:
>> # modprobe raid456
>> # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
>> 6. using mdaam to create a raid5 device in the guest vm:
>> # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
>> 7. and issue a discard request on the dm device: (limit the size of
>> discard request because discarding all the 2T data is too slow)
>> # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
>> 8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k
>> length (nlb 1).
>
> 4kb should be the discard segment length, instead of discard request
> length, which should be 512k in the above test.
Actually, what are received by the nvme controller are discard requests
with 128 segments of 4k, instead of one segment of 512k.
>
>>
>> Similarly, the problem with scsi devices can be emulated using the following
>> options for qemu:
>>
>> -device virtio-scsi,id=scsi \
>> -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>> -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>> -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
>> -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
>>
>>
>> Despite the discovery, I cannot come up with a proper fix of this issue due
>> to my lack of familiarity of the block subsystem. I expect your kind feedback
>> on this. Thanks in advance.
>
> In the above setting and raid456 test, I observe that rq->nr_phys_segments can
> reach 128, but queue_max_discard_segments() reports 256. So discard
> request size can be 512KB, which is the max size when you run 1MB discard on
> raid456. However, if the discard length on raid456 is increased, the
> current way will become inefficient.
Exactly.
I suggest that bio's can be merged and be calculated as one segment if they are
contiguous and contain no data.
And I also discovered later that, even normal long write requests, e.g.
a 10m write, will be split into 4k bio's. The maximum number of bio's which can
be merged into one request is limited by queue_max_segments, regardless
of whether those bio's are contiguous. In my test environment, for scsi devices,
queue_max_segments can be 254, which means about 1m size of requests. For nvme
devices(e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts is 5,
which results in only 132k of requests.
So, I would also suggest that raid456 should be improved to issue bigger bio's to
underlying drives.
Cheers,
Miao Wang
next prev parent reply other threads:[~2021-06-08 15:50 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-06-05 20:54 [Bug Report] Discard bios cannot be correctly merged in blk-mq Wang Shanker
2021-06-05 22:38 ` antlists
2021-06-06 3:44 ` Wang Shanker
2021-06-07 13:07 ` Ming Lei
2021-06-08 15:49 ` Wang Shanker [this message]
2021-06-09 0:41 ` Ming Lei
2021-06-09 2:40 ` Wang Shanker
2021-06-09 8:44 ` Xiao Ni
2021-06-09 9:03 ` Wang Shanker
2021-06-18 6:28 ` Wang Shanker
2021-06-18 12:49 ` Xiao Ni
2021-06-21 7:49 ` Wang Shanker
2021-06-22 1:48 ` Xiao Ni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=C866C380-7A71-4722-957F-2CE65BDACF26@gmail.com \
--to=shankerwangmiao@gmail.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-raid@vger.kernel.org \
--cc=ming.lei@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).