* [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers
@ 2017-02-08  2:38 Adrian Suarez
  2017-02-08 13:59 ` Max Reitz
  0 siblings, 1 reply; 5+ messages in thread
From: Adrian Suarez @ 2017-02-08  2:38 UTC (permalink / raw)
  To: qemu-devel

We’ve implemented a block driver that exposes storage to QEMU VMs. Our
block driver (O) is interposing on writes to some other type of storage
(B). O performs low latency replication and then asynchronously issues the
write to the backing block driver, B, using bdrv_aio_writev(). Our problem
is that the write latencies seen by the workload in the guest should be
those imposed by O plus the guest I/O and QEMU stack (around 25us total
based on our measurements), but we’re actually seeing much higher latencies
(around 120us). We suspect that this is due to the backing block driver B’s
coroutines blocking our coroutines. The sequence of events is as follows
(see diagram:
https://docs.google.com/drawings/d/12h1QbecvxzlKxSFvGKYAzvAJ18kTW6AVTwDR6VA8hkw/pub?w=576&h=565
):

1. Write is issued to our block driver O using the asynchronous interface
for QEMU block drivers.
2. Write is replicated to a fast device asynchronously.
2.a. In a different thread, the fast device invokes a callback on
completion that causes a coroutine to be scheduled to run in the QEMU
iothread that acknowledges completion of the write to the guest OS.
2.b. The coroutine scheduled in (2.a) is executed.
3. Write is issued asynchronously to the backing block driver, B.
3.a. The backing block driver, B, invokes the completion function supplied
by us, which frees any memory associated with the write (e.g. copies of IO
vectors).

Steps (1), (2), and (3) are performed in the same coroutine (our driver's
bdrv_aio_writev() implementation). (2.a) is executed in a thread that is
part of our transport library linked by O, and (2.b) and (3.a) are executed
as coroutines in the QEMU iothread.
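
To make this concrete, the submission path looks roughly like the sketch
below (simplified, not our real code: OFSRequest, the ofs_* helpers and
ofs_replicate_async() are made-up names, and the exact bdrv_aio_writev()
and driver callback signatures vary between QEMU versions):

#include "qemu/osdep.h"
#include "block/block_int.h"

/* Hypothetical replication entry point provided by our transport library. */
void ofs_replicate_async(void *dev, int64_t sector_num, int nb_sectors,
                         QEMUIOVector *qiov, void (*cb)(void *opaque),
                         void *opaque);

/* Per-request state; freed only once both the guest acknowledgement (2.b)
 * and B's completion (3.a) have run. */
typedef struct OFSRequest {
    BlockDriverState *bs;
    QEMUIOVector qiov_copy;          /* private copy of the guest's iovec */
    BlockCompletionFunc *cb;         /* guest completion callback */
    void *cb_opaque;
    AioContext *ctx;                 /* AioContext of O's iothread */
    int pending;                     /* outstanding completions */
} OFSRequest;

static void ofs_request_put(OFSRequest *req)
{
    if (--req->pending == 0) {
        qemu_iovec_destroy(&req->qiov_copy);
        g_free(req);
    }
}

/* 2.b: runs as a one-shot BH in the QEMU iothread */
static void ofs_complete_to_guest(void *opaque)
{
    OFSRequest *req = opaque;
    req->cb(req->cb_opaque, 0);      /* acknowledge the write to the guest */
    ofs_request_put(req);
}

/* 2.a: invoked by the transport library in its own thread */
static void ofs_replication_done(void *opaque)
{
    OFSRequest *req = opaque;
    /* Bounce into the iothread; block-layer state is not touched here. */
    aio_bh_schedule_oneshot(req->ctx, ofs_complete_to_guest, req);
}

/* 3.a: completion callback from the backing driver B */
static void ofs_backing_write_done(void *opaque, int ret)
{
    ofs_request_put(opaque);
}

/* 1: O's asynchronous write entry point; 2 and 3 are both issued here. */
static BlockAIOCB *ofs_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                  QEMUIOVector *qiov, int nb_sectors,
                                  BlockCompletionFunc *cb, void *opaque)
{
    OFSRequest *req = g_new0(OFSRequest, 1);

    req->bs = bs;
    req->cb = cb;
    req->cb_opaque = opaque;
    req->ctx = bdrv_get_aio_context(bs);
    req->pending = 2;                /* guest ack + B's completion */
    qemu_iovec_init(&req->qiov_copy, qiov->niov);
    qemu_iovec_concat(&req->qiov_copy, qiov, 0, qiov->size);

    /* 2: replicate to the fast device (asynchronous, completes in 2.a) */
    ofs_replicate_async(bs->opaque, sector_num, nb_sectors, &req->qiov_copy,
                        ofs_replication_done, req);

    /* 3: asynchronously pass the write down to the backing driver B */
    bdrv_aio_writev(bs->file, sector_num, &req->qiov_copy, nb_sectors,
                    ofs_backing_write_done, req);

    return NULL;                     /* AIOCB bookkeeping omitted here */
}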

We've tried improving the performance by using separate iothreads for the
two devices, but this only lowered the latency to around 100us
and caused stability issues. What's the best way to create a separate
iothread for the backing driver to do all of its work in?

-Adrian


* Re: [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers
  2017-02-08  2:38 [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers Adrian Suarez
@ 2017-02-08 13:59 ` Max Reitz
  2017-02-08 14:30   ` Fam Zheng
  0 siblings, 1 reply; 5+ messages in thread
From: Max Reitz @ 2017-02-08 13:59 UTC (permalink / raw)
  To: Adrian Suarez, qemu-devel, Qemu-block, Stefan Hajnoczi, Fam Zheng

CC-ing qemu-block, Stefan, Fam.


On 08.02.2017 03:38, Adrian Suarez wrote:
> We’ve implemented a block driver that exposes storage to QEMU VMs. Our
> block driver (O) is interposing on writes to some other type of storage
> (B). O performs low latency replication and then asynchronously issues the
> write to the backing block driver, B, using bdrv_aio_writev(). Our problem
> is that the write latencies seen by the workload in the guest should be
> those imposed by O plus the guest I/O and QEMU stack (around 25us total
> based on our measurements), but we’re actually seeing much higher latencies
> (around 120us). We suspect that this is due to the backing block driver B’s
> coroutines blocking our coroutines. The sequence of events is as follows
> (see diagram:
> https://docs.google.com/drawings/d/12h1QbecvxzlKxSFvGKYAzvAJ18kTW6AVTwDR6VA8hkw/pub?w=576&h=565
> ):
> 
> 1. Write is issued to our block driver O using the asynchronous interface
> for QEMU block drivers.
> 2. Write is replicated to a fast device asynchronously.
> 2.a. In a different thread, the fast device invokes a callback on
> completion that causes a coroutine to be scheduled to run in the QEMU
> iothread that acknowledges completion of the write to the guest OS.
> 2.b. The coroutine scheduled in (2.a) is executed.
> 3. Write is issued asynchronously to the backing block driver, B.
> 3.a. The backing block driver, B, invokes the completion function supplied
> by us, which frees any memory associated with the write (e.g. copies of IO
> vectors).
> 
> Steps (1), (2), and (3) are performed in the same coroutine (our driver's
> bdrv_aio_writev() implementation). (2.a) is executed in a thread that is
> part of our transport library linked by O, and (2.b) and (3.a) are executed
> as coroutines in the QEMU iothread.
> 
> We've tried improving the performance by using separate iothreads for the
> two devices, but this only lowered the latency to around 100us
> and caused stability issues. What's the best way to create a separate
> iothread for the backing driver to do all of its work in?

I don't think it's possible to use different AioContexts for
BlockDriverStates in the same BDS chain, at least not currently. But
others may know more about this.

Max


* Re: [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers
  2017-02-08 13:59 ` Max Reitz
@ 2017-02-08 14:30   ` Fam Zheng
  2017-02-08 19:00     ` Adrian Suarez
  0 siblings, 1 reply; 5+ messages in thread
From: Fam Zheng @ 2017-02-08 14:30 UTC (permalink / raw)
  To: Adrian Suarez; +Cc: qemu-devel, Max Reitz, Qemu-block, Stefan Hajnoczi

On Wed, 02/08 14:59, Max Reitz wrote:
> CC-ing qemu-block, Stefan, Fam.
> 
> 
> On 08.02.2017 03:38, Adrian Suarez wrote:
> > We’ve implemented a block driver that exposes storage to QEMU VMs. Our
> > block driver (O) is interposing on writes to some other type of storage
> > (B). O performs low latency replication and then asynchronously issues the
> > write to the backing block driver, B, using bdrv_aio_writev(). Our problem
> > is that the write latencies seen by the workload in the guest should be
> > those imposed by O plus the guest I/O and QEMU stack (around 25us total
> > based on our measurements), but we’re actually seeing much higher latencies
> > (around 120us). We suspect that this is due to the backing block driver B’s
> > coroutines blocking our coroutines. The sequence of events is as follows
> > (see diagram:
> > https://docs.google.com/drawings/d/12h1QbecvxzlKxSFvGKYAzvAJ18kTW6AVTwDR6VA8hkw/pub?w=576&h=565

I cannot open this, so just trying to understand from the steps below...

> > ):
> > 
> > 1. Write is issued to our block driver O using the asynchronous interface
> > for QEMU block drivers.
> > 2. Write is replicated to a fast device asynchronously.
> > 2.a. In a different thread, the fast device invokes a callback on
> > completion that causes a coroutine to be scheduled to run in the QEMU
> > iothread that acknowledges completion of the write to the guest OS.
> > 2.b. The coroutine scheduled in (2.a) is executed.
> > 3. Write is issued asynchronously to the backing block driver, B.
> > 3.a. The backing block driver, B, invokes the completion function supplied
> > by us, which frees any memory associated with the write (e.g. copies of IO
> > vectors).

Do you only start submitting the request to B (step 3) after the fast device
I/O completes (step 2.a)? The fact that they are serialized incurs extra
latency. Have you tried to do 2 and 3 in parallel with AIO?

> > 
> > Steps (1), (2), and (3) are performed in the same coroutine (our driver's
> > bdrv_aio_writev() implementation). (2.a) is executed in a thread that is
> > part of our transport library linked by O, and (2.b) and (3.a) are executed
> > as coroutines in the QEMU iothread.
> > 
> > We've tried improving the performance by using separate iothreads for the
> > two devices, but this only lowered the latency to around 100us
> > and caused stability issues. What's the best way to create a separate
> > iothread for the backing driver to do all of its work in?
> 
> I don't think it's possible to use different AioContexts for
> BlockDriverStates in the same BDS chain, at least not currently. But
> others may know more about this.

This may change in the future but currently all the BDSes in a chain need to
stay on the same AioContext.
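
For reference, the IOThread (and with it the AioContext) is attached at the
device frontend, e.g. roughly like this (a sketch; option spellings can
differ a bit between QEMU versions):

  -object iothread,id=iothread0 \
  -drive if=none,id=drive0,... \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0

Everything under drive0 -- O as well as B -- then runs in iothread0; there
is no knob to split them across two iothread objects.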

Fam


* Re: [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers
  2017-02-08 14:30   ` Fam Zheng
@ 2017-02-08 19:00     ` Adrian Suarez
  2017-02-09  1:27       ` Fam Zheng
  0 siblings, 1 reply; 5+ messages in thread
From: Adrian Suarez @ 2017-02-08 19:00 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Max Reitz, Qemu-block, Stefan Hajnoczi

> Do you only start submitting the request to B (step 3) after the fast
> device I/O completes (step 2.a)? The fact that they are serialized incurs
> extra latency. Have you tried to do 2 and 3 in parallel with AIO?


In step 2, we perform an asynchronous call to the fast device, supplying a
callback that calls aio_bh_schedule_oneshot() to schedule the completion in
the AioContext of the block driver. Step 3 uses bdrv_aio_writev(), but I'm
not sure if this is actually causing the write to be performed
synchronously to the backing device. What I'm expecting is that
bdrv_aio_writev() issues the write and then yields so that we don't
serialize all writes to the backing device.

Thanks,
Adrian


* Re: [Qemu-devel] Question about QEMU's threading model and stacking multiple block drivers
  2017-02-08 19:00     ` Adrian Suarez
@ 2017-02-09  1:27       ` Fam Zheng
  0 siblings, 0 replies; 5+ messages in thread
From: Fam Zheng @ 2017-02-09  1:27 UTC (permalink / raw)
  To: Adrian Suarez; +Cc: qemu-devel, Max Reitz, Qemu-block, Stefan Hajnoczi

On Wed, 02/08 11:00, Adrian Suarez wrote:
> > Do you only start submitting the request to B (step 3) after the fast
> > device I/O completes (step 2.a)? The fact that they are serialized incurs
> > extra latency. Have you tried to do 2 and 3 in parallel with AIO?
> 
> 
> In step 2, we perform an asynchronous call to the fast device, supplying a
> callback that calls aio_bh_schedule_oneshot() to schedule the completion in
> the AioContext of the block driver. Step 3 uses bdrv_aio_writev(), but I'm
> not sure if this is actually causing the write to be performed
> synchronously to the backing device. What I'm expecting is that
> bdrv_aio_writev() issues the write and then yields so that we don't
> serialize all writes to the backing device.

OK, what I'm wondering is why call bdrv_aio_writev() in a BH instead of right
away. IOW, have you traced how much time is spent before even calling
bdrv_aio_writev()?
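
Even crude timestamps would show where the time goes, e.g. something like
this in O (an ad-hoc sketch, not QEMU's trace infrastructure; req->submit_ns
is a made-up per-request field):

#include "qemu/timer.h"

/* in O's .bdrv_aio_writev, on entry: */
req->submit_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

/* immediately before the bdrv_aio_writev() call down to B: */
fprintf(stderr, "gap before bdrv_aio_writev: %" PRId64 " ns\n",
        qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - req->submit_ns);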

