* [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees (2nd try)
@ 2019-07-18 13:52 Евгений Яковлев
From: Евгений Яковлев @ 2019-07-18 13:52 UTC (permalink / raw)
To: qemu-devel, stefanha, kwolf, mreitz; +Cc: qemu-block, yc-core
Hi everyone,
My previous message was misformatted, so here's another one. Sorry about
that.
We're currently implementing a QEMU BDRV format driver, which we use
with virtio-blk devices.
I have a question about BDRV request fragmentation and virtio-blk
write request submission that is not entirely clear to me from reading
the virtio spec alone. Could you please consider the following case and
give some additional guidance?
1. Our BDRV format driver has a notion of a maximum supported transfer
size, so we implement BlockDriver::bdrv_refresh_limits, where we fill in
the BlockLimits::max_transfer and opt_transfer fields.
2. virtio-blk exposes max_transfer as the virtio_blk_config::opt_io_size
field, which (according to spec 1.1) is a **suggested** maximum. We read
"suggested" as "the guest driver may still send requests that don't fit
into opt_io_size, and we should handle those"...
3. ... and, judging by the code in block/io.c, the QEMU block layer
handles such requests by fragmenting them into several BDRV requests if
the request size is > max_transfer.
4. The guest will see request completion only after all fragments are
handled. However, each fragment's submission path can call
qemu_coroutine_yield, and QEMU can then move on to submitting the next
request available in the virtqueue before it has submitted the rest of
the fragments. This means the following situation is possible, where
BDRV sees 2 write requests in the virtqueue, both larger than
max_transfer:
Blocks:       -----------------------------
Write1:       ------xxxxxxxx
Write2:       ------yyyyyyyy
Write1Chunk1: ------xxxx
Write2Chunk1: ------yyyy
Write2Chunk2: ----------yyyy
Write1Chunk2: ----------xxxx
Blocks:       ------yyyyxxxx---------------
In the above scenario, the guest virtio-blk driver decided to submit 2
overlapping write requests, both larger than max_transfer, and then to
notify the hypervisor.
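
To make the scenario concrete, here is a minimal Python sketch of the interleaving in the diagram. It is a toy model only, not QEMU code: the helper names `fragment` and `apply_chunks`, the 29-block disk, and max_transfer = 4 are assumptions chosen to match the picture.

```python
def fragment(offset, data, max_transfer):
    """Split one write into (offset, bytes) chunks of at most max_transfer."""
    return [(offset + i, data[i:i + max_transfer])
            for i in range(0, len(data), max_transfer)]


def apply_chunks(disk, chunks):
    """Apply chunks to the disk in the order given."""
    for off, buf in chunks:
        disk[off:off + len(buf)] = buf


disk = bytearray(b"-" * 29)                  # hypothetical 29-block device
w1 = fragment(6, b"x" * 8, max_transfer=4)   # Write1 -> two chunks
w2 = fragment(6, b"y" * 8, max_transfer=4)   # Write2 -> two chunks

# Order from the diagram: Write1 chunk 1, Write2 chunk 1,
# Write2 chunk 2, then Write1 chunk 2.
apply_chunks(disk, [w1[0], w2[0], w2[1], w1[1]])
result = disk.decode()
print(result)
```

With this order, the first half of the range ends up holding Write2's data and the second half Write1's, which is the mixed state shown in the last "Blocks" line.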
I understand that virtio-blk may handle requests out of order, so the
guest must not make any assumptions about the relative order in which
those requests will be handled.
However, can the guest driver expect that, whatever the submission order
is, the overlapping writes themselves will be atomic?
In other words, is it correct for a conforming virtio-blk driver to
expect only "xxxxxxxx" or "yyyyyyyy", and nothing in between,
after both requests are reported as completed?
I ask because I think a mixed result is something that may happen in
QEMU right now, if I understood correctly.
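
As a rough illustration of the question, the sketch below enumerates every order in which the four chunks could land and collects the resulting final states. It assumes, for simplicity, that chunks may complete in any order and that each chunk is itself applied atomically; `MAX_TRANSFER`, `final_state`, and the 8-block region are hypothetical names for this toy model, not anything taken from QEMU or the virtio spec.

```python
from itertools import permutations

MAX_TRANSFER = 4

# Two 8-block writes ("x" and "y") to the same region, each split in two.
chunks = [(off, ch) for ch in ("x", "y") for off in (0, MAX_TRANSFER)]


def final_state(order):
    """Apply the chunks in the given order and return the region contents."""
    region = ["-"] * (2 * MAX_TRANSFER)
    for off, ch in order:
        region[off:off + MAX_TRANSFER] = ch * MAX_TRANSFER
    return "".join(region)


outcomes = {final_state(p) for p in permutations(chunks)}
atomic = {"x" * 8, "y" * 8}      # what an "atomic writes" contract would allow
torn = outcomes - atomic         # mixed results
print(sorted(torn))
```

In this model the region ends in one of four states: the two atomic ones plus "xxxxyyyy" and "yyyyxxxx".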
Thanks!
* Re: [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees (2nd try)
From: Kevin Wolf @ 2019-07-18 14:59 UTC (permalink / raw)
To: Евгений Яковлев
Cc: qemu-block, yc-core, qemu-devel, stefanha, mreitz
I don't think atomicity is promised anywhere in the virtio
specification, and I agree with you that this case can happen (it
probably happens much more frequently when you use image formats instead
of raw files).
On the other hand, there is no good reason for a guest OS to submit two
write requests to the same blocks in parallel. Even if it could expect
that one of the requests wins, the end result would still be undefined,
so I don't think this could ever be a useful thing to do. (Well, I guess
it could replace flipping a coin...)
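
For illustration only, a guest that wants a defined result could avoid having overlapping writes in flight at the same time. The following is a minimal sketch of such a check; the `InFlightWrites` class and its methods are hypothetical and do not correspond to any existing guest driver or QEMU API.

```python
def overlaps(a_start, a_len, b_start, b_len):
    """True if block ranges [a_start, a_start+a_len) and
    [b_start, b_start+b_len) share at least one block."""
    return a_start < b_start + b_len and b_start < a_start + a_len


class InFlightWrites:
    """Tracks submitted-but-not-completed write ranges (hypothetical helper)."""

    def __init__(self):
        self._ranges = []

    def try_submit(self, start, length):
        """Record the write and return True, or return False if it overlaps
        a write that is still in flight (the caller should wait and retry)."""
        if any(overlaps(start, length, s, l) for s, l in self._ranges):
            return False
        self._ranges.append((start, length))
        return True

    def complete(self, start, length):
        """Mark a previously submitted write as completed."""
        self._ranges.remove((start, length))


tracker = InFlightWrites()
first = tracker.try_submit(6, 8)    # accepted
second = tracker.try_submit(6, 8)   # overlaps the first, so it is deferred
```

Once `tracker.complete(6, 8)` is called for the first write, the second can be submitted, so the final contents are determined by the guest's own ordering.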
Kevin
* Re: [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees (2nd try)
From: Евгений Яковлев @ 2019-07-18 15:27 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-block, yc-core, qemu-devel, stefanha, mreitz
Evgeny Yakovlev
Lead Software Engineer, Yandex.Cloud Hypervisor Team
Thanks Kevin. I agree that the described guest behavior does not have a
sensible reason behind it. However, on a purely theoretical basis,
according to the virtio-blk contract, is it valid for the guest to even
_assume_ that the above situation with 2 requests _must_ be resolved in
one of the two specific ways I described, and not anything in between?
In other words, that the writes will be atomic even if their relative
order is undefined? We could not get a clear answer from the virtio spec
ourselves.
For instance, IIRC, the NVMe spec declares atomicity guarantees as well
as ordering for specific commands ("6.4 Atomic Operations").
Evgeny