From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
To: Ming Lei <ming.lei@redhat.com>,
	Andreas Hindborg <andreas.hindborg@wdc.com>
Cc: Jens Axboe <axboe@kernel.dk>, linux-block@vger.kernel.org
Subject: Re: Reordering of ublk IO requests
Date: Sat, 19 Nov 2022 09:24:01 +0900
Message-ID: <be6940cf-7b23-4b11-1f6f-f3d4853d9a34@opensource.wdc.com>
In-Reply-To: <Y3d+78qMOusdYUAG@T590>

On 11/18/22 21:47, Ming Lei wrote:
> On Fri, Nov 18, 2022 at 12:49:15PM +0100, Andreas Hindborg wrote:
>>
>> Ming Lei <ming.lei@redhat.com> writes:
>>
>>> On Fri, Nov 18, 2022 at 10:41:31AM +0100, Andreas Hindborg wrote:
>>>>
>>>> Ming Lei <ming.lei@redhat.com> writes:
>>>>
>>>>> On Fri, Nov 18, 2022 at 01:35:29PM +0900, Damien Le Moal wrote:
>>>>>> On 11/18/22 13:12, Ming Lei wrote:
>>>>>> [...]
>>>>>>>>> You can only assign it to zoned write requests, but you still have to
>>>>>>>>> check the sequence inside each zone, right? Then why not simply check
>>>>>>>>> the LBAs in each zone?
>>>>>>>>
>>>>>>>> We would need to know the zone map, which is not otherwise required.
>>>>>>>> Then we would need to track the write pointer for each open zone for
>>>>>>>> each queue, so that we can stall writes that are not issued at the write
>>>>>>>> pointer. In effect that means all zones, because we cannot track when
>>>>>>>> zones are implicitly closed. Then, if different queues are issuing writes to
>>>>>>>
>>>>>>> Can you explain the "implicitly closed" state a bit?
>>>>>>>
>>>>>>> From https://zonedstorage.io/docs/introduction/zoned-storage, only the
>>>>>>> following words are mentioned about the closed state:
>>>>>>>
>>>>>>>     ```Conversely, implicitly or explicitly opened zones can be transitioned to the
>>>>>>>     closed state using the CLOSE ZONE command.```
>>>>>>
>>>>>> When a write is issued to an empty or closed zone, the drive will
>>>>>> automatically transition the zone into the implicit open state. This is
>>>>>> called implicit open because the host did not (explicitly) issue an open
>>>>>> zone command.
>>>>>>
>>>>>> When there are too many implicitly open zones, the drive may choose to
>>>>>> close one of the implicitly opened zones in order to implicitly open the
>>>>>> zone that is the target of a write command.
>>>>>>
>>>>>> That is it in a nutshell. This is done so that the drive can work with a
>>>>>> limited set of resources needed to handle open zones, that is, zones that
>>>>>> are being written. There are some more nasty details to all this with
>>>>>> limits on the number of open zones and active zones that a zoned drive may
>>>>>> have.
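
To make the implicit open/close behavior described above concrete, here is a
toy model (no real drive or kernel API involved; the zone count and the open
limit are made-up numbers):

```
enum zone_cond { Z_EMPTY, Z_IMP_OPEN, Z_EXP_OPEN, Z_CLOSED, Z_FULL };

#define NR_ZONES	16
#define MAX_OPEN	4	/* drive resource limit for open zones */

static enum zone_cond zones[NR_ZONES];

static int nr_open_zones(void)
{
	int i, n = 0;

	for (i = 0; i < NR_ZONES; i++)
		if (zones[i] == Z_IMP_OPEN || zones[i] == Z_EXP_OPEN)
			n++;
	return n;
}

/* A write to an empty or closed zone implicitly opens it. */
static void write_hits_zone(int z)
{
	int i;

	if (zones[z] == Z_IMP_OPEN || zones[z] == Z_EXP_OPEN)
		return;				/* already open, nothing changes */

	if (nr_open_zones() == MAX_OPEN) {
		/* The drive picks an implicitly open zone and closes it. */
		for (i = 0; i < NR_ZONES; i++) {
			if (zones[i] == Z_IMP_OPEN) {
				zones[i] = Z_CLOSED;	/* write pointer unchanged */
				break;
			}
		}
	}
	zones[z] = Z_IMP_OPEN;	/* no explicit OPEN ZONE command was issued */
}
```
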
>>>>>
>>>>> OK, thanks for the clarification about the implicitly closed state, but I
>>>>> understand this close can't change the zone's write pointer.
>>>>
>>>> You are right, it does not matter if the zone is implicitly closed; I
>>>> was mistaken. But we still have to track the write pointer of every zone
>>>> in the open or active state, otherwise we cannot know whether a write that
>>>> arrives at a zone with no outstanding IO is actually at the write pointer,
>>>> or whether we need to hold it.
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Zone info can be cached in a mapping (hash table, with the zone start sector
>>>>>>> as the key and the zone info as the value), which can be implemented in an
>>>>>>> LRU style. If a zone's info is not found in the mapping table,
>>>>>>> ioctl(BLKREPORTZONE) can be called to obtain it.
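
Side note: the BLKREPORTZONE fallback suggested above is easy to do from
userspace. A minimal sketch (the LRU cache itself is left out;
zcache_lookup()/zcache_insert() would be hypothetical helpers around it):

```
#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <stdlib.h>

/*
 * Return the zone containing 'sector' in *out by asking the device with
 * BLKREPORTZONE. A real implementation would first try an LRU cache keyed
 * by the zone start sector (zcache_lookup()/zcache_insert(), not shown).
 */
static int zone_info_get(int fd, __u64 sector, struct blk_zone *out)
{
	size_t sz = sizeof(struct blk_zone_report) + sizeof(struct blk_zone);
	struct blk_zone_report *rep = calloc(1, sz);
	int ret = -1;

	if (!rep)
		return -1;

	rep->sector = sector;		/* start reporting from this sector */
	rep->nr_zones = 1;		/* only the containing zone is needed */

	if (!ioctl(fd, BLKREPORTZONE, rep) && rep->nr_zones >= 1) {
		*out = rep->zones[0];	/* out->wp is the zone write pointer */
		ret = 0;
	}
	free(rep);
	return ret;
}
```
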
>>>>>>>
>>>>>>>> the same zone, we need to sync across queues. Userspace may have
>>>>>>>> synchronization in place to issue writes with multiple threads while
>>>>>>>> still hitting the write pointer.
>>>>>>>
>>>>>>> You can trust mq-deadline, which guarantees that write IO is sent to ->queue_rq()
>>>>>>> in order, no matter whether MQ or SQ.
>>>>>>>
>>>>>>> Yes, it could be issued from multiple queues for ublksrv, which doesn't sync
>>>>>>> among multiple queues.
>>>>>>>
>>>>>>> But per-zone re-ordering can still solve the issue; we just need one lock
>>>>>>> per zone to cover the MQ re-ordering.
>>>>>>
>>>>>> That lock is already there and, using it, mq-deadline will never dispatch
>>>>>> more than one write per zone at any time. This is to avoid write
>>>>>> reordering. So multi-queue or not, for any zone, there is no possibility
>>>>>> of having writes reordered.
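
In other words, the scheduler keeps a per-zone write lock and holds back
further writes to a zone until the write in flight completes. A rough sketch
of the idea (this mirrors the concept only, not mq-deadline's actual code):

```
#include <stdbool.h>
#include <stdint.h>

#define NR_ZONES	1024

static bool zone_write_locked[NR_ZONES];

/* Called under the scheduler lock before dispatching a write. */
static bool can_dispatch_write(uint32_t zone)
{
	if (zone_write_locked[zone])
		return false;		/* a write to this zone is already in flight */
	zone_write_locked[zone] = true;	/* take the zone write lock */
	return true;
}

/* Called when the write completes (or is requeued). */
static void zone_write_unlock(uint32_t zone)
{
	zone_write_locked[zone] = false;	/* next write to this zone may dispatch */
}
```
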
>>>>>
>>>>> Oops, I missed the single-queue-depth-per-zone point, so ublk won't break
>>>>> zoned writes at all. I agree the ordering of batched IOs is one problem, but
>>>>> it is not hard to solve.
>>>>
>>>> The current implementation _does_ break zoned writes because it reverses
>>>> batched writes. But if it is an easy fix, that is cool :)
>>>
>>> Please look at Damien's comment:
>>>
>>>>> That lock is already there and, using it, mq-deadline will never dispatch
>>>>> more than one write per zone at any time. This is to avoid write
>>>>> reordering. So multi-queue or not, for any zone, there is no possibility
>>>>> of having writes reordered.
>>>
>>> For zoned writes, mq-deadline is used to limit writes to at most one in
>>> flight per zone.
>>>
>>> So can you explain a bit how the current implementation breaks zoned
>>> writes?
>>
>> Like Damien wrote in another email, mq-deadline will only impose
>> ordering on requests submitted as a batch. The flow we have is the
>> following:
>>
>>  - Userspace sends requests to the ublk gendisk.
>>  - Requests go through the block layer and are _not_ reordered when using
>>    mq-deadline. They may be split.
>>  - Requests hit ublk_drv, and ublk_drv will reverse the order of _all_
>>    batched-up requests (including split requests).
> 
> For ublk-zone, the ublk driver eventually needs to be exposed as a zoned
> device by calling disk_set_zoned(), which definitely isn't supported now.
> So once ublk-zone is supported, mq-deadline will send at most one write IO
> per zone; see blk_req_can_dispatch_to_zone().
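
Right. For reference, a rough outline of what "exposed as zoned" would mean on
the driver side (helper names and signatures can differ between kernel
versions, and the limits would come from ublksrv; this is an assumption-laden
sketch, not an actual patch):

```
#include <linux/blkdev.h>

static int ublk_setup_zoned(struct gendisk *disk,
			    unsigned int zone_size_sectors,
			    unsigned int max_open, unsigned int max_active)
{
	disk_set_zoned(disk, BLK_ZONED_HM);		/* host-managed model */
	blk_queue_chunk_sectors(disk->queue, zone_size_sectors);
	disk_set_max_open_zones(disk, max_open);
	disk_set_max_active_zones(disk, max_active);

	/*
	 * With a zoned model set, mq-deadline's zone write locking
	 * (blk_req_can_dispatch_to_zone()) applies to this disk.
	 */
	return blk_revalidate_disk_zones(disk, NULL);
}
```
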
> 
>>  - ublk_drv sends requests to ublksrv in _reverse_ order.
>>  - ublksrv sends the requests, _not_ batched up, to the target device.
>>  - Requests that enter mq-deadline at the same time are reordered in LBA
>>    order; that is all good.
>>  - Requests that enter the kernel in different batches are not reordered
>>    in LBA order and end up missing the write pointer. This is bad.
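
The reversal described in the flow above is what you get from any LIFO
deferral of a batch. A toy illustration (plain C, not ublk code; whether
ublk_drv's deferral path really works like this is the thing to confirm):

```
#include <stdio.h>

struct req {
	int lba;
	struct req *next;
};

static struct req *pending;	/* LIFO: most recently queued request first */

/* Called once per request while the batch is handed to the daemon. */
static void defer(struct req *r)
{
	r->next = pending;
	pending = r;
}

int main(void)
{
	struct req a = { .lba = 0 }, b = { .lba = 8 }, c = { .lba = 16 };
	struct req *r;

	defer(&a);	/* batch arrives in LBA order: 0, 8, 16 */
	defer(&b);
	defer(&c);

	for (r = pending; r; r = r->next)
		printf("dispatch lba %d\n", r->lba);	/* prints 16, 8, 0 */
	return 0;
}
```
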
> 
> Again, please read Damien's comment:
> 
>>> That lock is already there and, using it, mq-deadline will never dispatch
>>> more than one write per zone at any time.
> 
> At any time, there is at most one write IO in flight for each zone, so how
> can that single write IO be re-ordered?

If the user issues writes one at a time out of order (not aligned to the
write pointer), mq-deadline will not help at all. The zone write locking
will still limit write dispatching to one per zone, but the writes will fail.

mq-deadline will reorder write commands into the correct LBA order only if:
- the commands are inserted as a batch (more than one request passed to
->insert_requests), or
- the commands are inserted individually while the target zone is locked (a
write is already being executed)

This has been the semantics from the start: the block layer makes no
guarantee about the correct ordering of writes to a zoned drive. What is
guaranteed is that if (1) the user issues writes in order AND (2)
mq-deadline is used, then the writes will be dispatched to the device in
that same order.

I have not looked at the details of ublk, but from the thread, I think (1)
is not done and (2) is missing-ish as the ublk device is not marked as zoned.
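
For (1), whatever issues the writes (the application, or ublksrv itself) has
to track a soft write pointer per zone and only submit a write that starts
exactly there. A minimal, purely illustrative sketch of that rule
(synchronous, all names made up):

```
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

struct zone_state {
	uint64_t wp;	/* next byte offset this zone expects */
	uint64_t end;	/* first byte past the end of the zone */
};

/* Issue the write only if it starts exactly at the zone's write pointer. */
static bool submit_in_order(int fd, struct zone_state *z,
			    const void *buf, size_t len, uint64_t off)
{
	if (off != z->wp || off + len > z->end)
		return false;	/* not at the write pointer: hold it for later */

	if (pwrite(fd, buf, len, (off_t)off) != (ssize_t)len)
		return false;

	z->wp += len;		/* advance the soft write pointer */
	return true;
}
```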

> 
> 
> Thanks,
> Ming
> 

-- 
Damien Le Moal
Western Digital Research

