* [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees
@ 2019-07-18 13:44 Евгений Яковлев
From: Евгений Яковлев @ 2019-07-18 13:44 UTC (permalink / raw)
  To: qemu-devel, stefanha, kwolf, mreitz; +Cc: qemu-block, yc-core

Hi everyone,

We're currently working on implementing a qemu BDRV format driver which 
we are using with virtio-blk devices.

I have a question concerning BDRV request fragmentation and virtio-blk 
write request submission which is not entirely clear to me from reading 
the virtio spec alone. Could you please consider the following case and 
give some additional guidance?

1. Our BDRV format driver has a notion of max supported transfer size. 
So we implement BlockDriver::bdrv_refresh_limits where we fill out 
BlockLimits::max_transfer and opt_transfer fields.
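
For concreteness, a minimal sketch of what that looks like (the "ourfmt" 
name and the limit values are made up for illustration; this assumes the 
usual bdrv_refresh_limits callback signature from block_int.h and the 
MiB/KiB macros from "qemu/units.h"):

static void ourfmt_refresh_limits(BlockDriverState *bs, Error **errp)
{
    bs->bl.max_transfer = 1 * MiB;   /* hard cap on a single BDRV request */
    bs->bl.opt_transfer = 256 * KiB; /* suggested optimal request size */
}

static BlockDriver bdrv_ourfmt = {
    .format_name         = "ourfmt",
    .bdrv_refresh_limits = ourfmt_refresh_limits,
    /* ... other callbacks ... */
};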

2. virtio-blk exposes max_transfer as a virtio_blk_config::opt_io_size 
field, which (according to spec 1.1) is a **suggested** maximum. We read 
"suggested" as "guest driver may still send requests that don't fit into 
opt_io_size and we should handle those"...
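
For reference, the relevant config space fields (an excerpt from the 
layout in linux/virtio_blk.h, trimmed to the topology part):

struct virtio_blk_config {
    /* ... */
    __u32 blk_size;           /* logical block size, in bytes */
    __u8  physical_block_exp;
    __u8  alignment_offset;
    __u16 min_io_size;        /* minimum I/O size, in blocks */
    __u32 opt_io_size;        /* optimal (suggested max) I/O size, in blocks */
    /* ... */
};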

3. ... and judging by the code in block/io.c, the qemu block layer 
handles such requests by fragmenting them into several BDRV requests if 
the request size is > max_transfer.
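
Roughly, the fragmentation loop looks like this (a paraphrased sketch of 
bdrv_aligned_pwritev() in block/io.c, with iovec slicing and error 
handling trimmed; not the literal code):

if (bytes > max_transfer) {
    uint64_t bytes_remaining = bytes;
    while (bytes_remaining) {
        int num = MIN(bytes_remaining, max_transfer);
        /* Each chunk becomes a separate driver request.  The coroutine
         * may yield inside this call, letting the virtq handler move on
         * to other requests before the remaining chunks are submitted. */
        ret = bdrv_driver_pwritev(bs, offset + bytes - bytes_remaining,
                                  num, qiov, flags);
        if (ret < 0) {
            break;
        }
        bytes_remaining -= num;
    }
}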

4. The guest will see request completion only after all fragments are 
handled. However, each fragment submission path can call 
qemu_coroutine_yield and move on to submitting the next request 
available in the virtq before completely submitting the rest of the 
fragments. This means the following situation is possible, where BDRV 
sees 2 write requests in the virtq, both of which are larger than 
max_transfer:

Blocks:        |------------------------------------->|
Write1:                     xxxxxxxx
Write2:                     yyyyyyyy

Write1Chunk1:               xxxx
Write2Chunk1:               yyyy
Write2Chunk2:                   yyyy
Write1Chunk2:                   xxxx

Blocks:        |------------yyyyxxxx----------------->|

In the above scenario the guest virtio-blk driver decided to submit 2 
intersecting write requests, both of which are larger than max_transfer, 
and then called the hypervisor.

I understand that virtio-blk may handle requests out of order, so the 
guest must not make any assumptions about the relative order in which 
those requests will be handled.

However, can the guest driver expect that, whatever the submission order 
turns out to be, the intersecting writes will be atomic?

In other words, is it correct for a conforming virtio-blk driver to 
expect only "xxxxxxxx" or "yyyyyyyy", but nothing in between, after both 
requests are reported as completed?

Because I think that is something that may happen in qemu right now, if 
I understood correctly.

Thanks!




* Re: [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees
@ 2019-07-19 10:17 ` Stefan Hajnoczi
From: Stefan Hajnoczi @ 2019-07-19 10:17 UTC (permalink / raw)
  To: Евгений Яковлев
  Cc: kwolf, stefanha, qemu-block, qemu-devel, mreitz, yc-core


On Thu, Jul 18, 2019 at 04:44:17PM +0300, Евгений Яковлев wrote:
> Hi everyone,
> 
> We're currently working on implementing a qemu BDRV format driver which we
> are using with virtio-blk devices.
> 
> I have a question concerning BDRV request fragmentation and virtio-blk write
> request submission which is not entirely clear to me from reading the virtio
> spec alone. Could you please consider the following case and give some
> additional guidance?
> 
> 1. Our BDRV format driver has a notion of max supported transfer size. So we
> implement BlockDriver::bdrv_refresh_limits where we fill out
> BlockLimits::max_transfer and opt_transfer fields.
> 
> 2. virtio-blk exposes max_transfer as a virtio_blk_config::opt_io_size
> field, which (according to spec 1.1) is a **suggested** maximum. We read
> "suggested" as "guest driver may still send requests that don't fit into
> opt_io_size and we should handle those"...
> 
> 3. ... and judging by the code in block/io.c, the qemu block layer handles
> such requests by fragmenting them into several BDRV requests if the request
> size is > max_transfer.
> 
> 4. The guest will see request completion only after all fragments are
> handled. However, each fragment submission path can call
> qemu_coroutine_yield and move on to submitting the next request available
> in the virtq before completely submitting the rest of the fragments. This
> means the following situation is possible, where BDRV sees 2 write requests
> in the virtq, both of which are larger than max_transfer:
> 
> Blocks:        |------------------------------------->|
> Write1:                     xxxxxxxx
> Write2:                     yyyyyyyy
>
> Write1Chunk1:               xxxx
> Write2Chunk1:               yyyy
> Write2Chunk2:                   yyyy
> Write1Chunk2:                   xxxx
>
> Blocks:        |------------yyyyxxxx----------------->|
>
> In the above scenario the guest virtio-blk driver decided to submit 2
> intersecting write requests, both of which are larger than max_transfer,
> and then called the hypervisor.
>
> I understand that virtio-blk may handle requests out of order, so the
> guest must not make any assumptions about the relative order in which
> those requests will be handled.
>
> However, can the guest driver expect that, whatever the submission order
> turns out to be, the intersecting writes will be atomic?
>
> In other words, is it correct for a conforming virtio-blk driver to
> expect only "xxxxxxxx" or "yyyyyyyy", but nothing in between, after both
> requests are reported as completed?
>
> Because I think that is something that may happen in qemu right now, if
> I understood correctly.

Write requests are not atomic in general.  Specific storage technologies
support atomic writes via special commands with certain restrictions but
applications using this feature aren't portable.

Portable applications either don't submit intersecting write requests or
they do not depend on atomicity.
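
For example, an application that needs a deterministic result can wait 
for the first write to complete before submitting an overlapping one.  
An untested sketch using POSIX AIO (error handling omitted):

#include <aio.h>
#include <stddef.h>
#include <sys/types.h>

static void overlapping_writes_in_order(int fd, off_t off,
                                        void *buf_a, void *buf_b,
                                        size_t len)
{
    struct aiocb cb = { 0 };
    const struct aiocb *list[] = { &cb };

    cb.aio_fildes = fd;
    cb.aio_offset = off;
    cb.aio_nbytes = len;

    cb.aio_buf = buf_a;
    aio_write(&cb);             /* first write */
    aio_suspend(list, 1, NULL); /* wait for it to complete */
    aio_return(&cb);

    cb.aio_buf = buf_b;
    aio_write(&cb);             /* overlapping write, now strictly ordered */
    aio_suspend(list, 1, NULL);
    aio_return(&cb);
}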

Out of curiosity I took a quick look at Linux device-mapper.  The same
issue applies in device-mapper when intersecting write requests cross
device-mapper targets.  I think Linux submits split bios in parallel and
without serialization.

Stefan



* Re: [Qemu-devel] BDRV request fragmentation and virtio-blk write submission guarantees
@ 2019-07-19 10:48   ` Evgeny Yakovlev
From: Evgeny Yakovlev @ 2019-07-19 10:48 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, stefanha, qemu-block, qemu-devel, mreitz, yc-core

On 19.07.2019 13:17, Stefan Hajnoczi wrote:
> On Thu, Jul 18, 2019 at 04:44:17PM +0300, Евгений Яковлев wrote:
>> Hi everyone,
>>
>> We're currently working on implementing a qemu BDRV format driver which we
>> are using with virtio-blk devices.
>>
>> I have a question concerning BDRV request fragmentation and virtio-blk write
>> request submission which is not entirely clear to me from reading the virtio
>> spec alone. Could you please consider the following case and give some
>> additional guidance?
>>
>> 1. Our BDRV format driver has a notion of max supported transfer size. So we
>> implement BlockDriver::bdrv_refresh_limits where we fill out
>> BlockLimits::max_transfer and opt_transfer fields.
>>
>> 2. virtio-blk exposes max_transfer as a virtio_blk_config::opt_io_size
>> field, which (according to spec 1.1) is a **suggested** maximum. We read
>> "suggested" as "guest driver may still send requests that don't fit into
>> opt_io_size and we should handle those"...
>>
>> 3. ... and judging by the code in block/io.c, the qemu block layer handles
>> such requests by fragmenting them into several BDRV requests if the request
>> size is > max_transfer.
>>
>> 4. The guest will see request completion only after all fragments are
>> handled. However, each fragment submission path can call
>> qemu_coroutine_yield and move on to submitting the next request available
>> in the virtq before completely submitting the rest of the fragments. This
>> means the following situation is possible, where BDRV sees 2 write requests
>> in the virtq, both of which are larger than max_transfer:
>>
>> Blocks:        |------------------------------------->|
>> Write1:                     xxxxxxxx
>> Write2:                     yyyyyyyy
>>
>> Write1Chunk1:               xxxx
>> Write2Chunk1:               yyyy
>> Write2Chunk2:                   yyyy
>> Write1Chunk2:                   xxxx
>>
>> Blocks:        |------------yyyyxxxx----------------->|
>>
>> In the above scenario the guest virtio-blk driver decided to submit 2
>> intersecting write requests, both of which are larger than max_transfer,
>> and then called the hypervisor.
>>
>> I understand that virtio-blk may handle requests out of order, so the
>> guest must not make any assumptions about the relative order in which
>> those requests will be handled.
>>
>> However, can the guest driver expect that, whatever the submission order
>> turns out to be, the intersecting writes will be atomic?
>>
>> In other words, is it correct for a conforming virtio-blk driver to
>> expect only "xxxxxxxx" or "yyyyyyyy", but nothing in between, after both
>> requests are reported as completed?
>>
>> Because I think that is something that may happen in qemu right now, if
>> I understood correctly.
> Write requests are not atomic in general.  Specific storage technologies
> support atomic writes via special commands with certain restrictions but
> applications using this feature aren't portable.
>
> Portable applications either don't submit intersecting write requests or
> they do not depend on atomicity.
>
> Out of curiosity I took a quick look at Linux device-mapper.  The same
> issue applies in device-mapper when intersecting write requests cross
> device-mapper targets.  I think Linux submits split bios in parallel and
> without serialization.
>
> Stefan


Thanks, Stefan!

(By the way, there is a v2 of this message without all the formatting bugs)


Evgeny




