[Qemu-devel] Block layer complexity: what to do to keep it under control?

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] Block layer complexity: what to do to keep it under control?
@ 2017-11-29  3:55 Fam Zheng
  2017-11-29  6:30 ` Jeff Cody
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Fam Zheng @ 2017-11-29  3:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, pbonzini, jcody, kwolf, stefanha, mreitz, eblake, famz

Hi all,

As we move forwards with new features in the block layer, the chances of tricky
bugs happening have been increasing alongside - block jobs, coroutines,
throttling, AioContext, op blockers and image locking combined together make a
large and complex picture that is hard to fully understand and work with. Some
bugs we've encountered are quite challenging already.  Examples are:

- segfault in parallel blockjobs (iotest 30)
  https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html

- Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage
  migration)
  https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01626.html

- Drainage in bdrv_replace_child_noperm()
  https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg00868.html

- Regression from 2.8: stuck in bdrv_drain()
  https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02193.html

So in principle, what should we do to make the block layer easy to understand,
develop with and debug? I think we have opportunities in these aspects:

- Documentation

  There is no central developer doc about block layer, especially how all pieces
  fit together. Having one will make it a lot easier for new contributors to
  understand better. Of course, we're facing the old problem: the code is
  moving, maintaining an updated document needs effort.

  Idea: add ./doc/deve/block.txt?

- Tests

  Writing tests is a great way not only to exercise code, verify new features
  work as expected and catch regression bugs, but also a way to show how the
  feature can be used. There is this trend that the QEMU user interface
  gradually moves from high level commands and args to small and flexible
  building blocks, therefore demostrating the usage in iotests is meaningful.

  Idea: Add tests to simulate how libvirt uses block layer, or how we expect it
  to. This would be a long term investment. We could reuse iotests, or create a
  new test framework specifically, if it's easier (for example, use docker/vm
  tests that just uses libvirt).

  Idea: Patchew already tests the quick group of iotests for a few
  formats/protocols, but we should really add it to "make check".

- Simplified code, or more orthogonal/modularized architecture.

  Each aspect of block layer is complex enough so isolating them as much as
  possible is a reasonable approach to control the complexity. Block jobs and
  throttling becoming block filters is a good example, we should identify more.

  Idea: rethink event loops. Create coroutines ubiquitously (for example for
  each fd handler, BH and timer), so that many nested aio_poll() can be removed.

  Crazy idea: move the whole block layer to a vhost process, and implement
  existing features differently, especially in terms of multi-threading (hint:
  rust?).

- Debuggability.

  Working with backtraces when coroutine is used is pretty hard, it would be
  nice if ./scripts/qemugdb/coroutine.py could work with core files (i.e.
  without a process to debug), and trace back to co->caller automatically.

  It's always useful to dump block graph. Maybe we should add a helper function
  in block layer that dumps all node graphs in graphviz DOT format, and even
  make it available in QMP as x-dump-block-graph?

  Of course gdb scripts to dump various lists are also nice little things to
  have.

  Idea: write more ./scripts/qemugdb/<scriptlet>.py.

More thoughts?

Fam

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29  3:55 [Qemu-devel] Block layer complexity: what to do to keep it under control? Fam Zheng
@ 2017-11-29  6:30 ` Jeff Cody
  2017-11-29 12:16   ` Stefan Hajnoczi
  2017-11-29 12:00 ` Stefan Hajnoczi
  2017-11-29 12:32 ` Daniel P. Berrange
  2 siblings, 1 reply; 21+ messages in thread
From: Jeff Cody @ 2017-11-29  6:30 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, pbonzini, kwolf, stefanha, mreitz, eblake

On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> Hi all,
> 
> As we move forwards with new features in the block layer, the chances of tricky
> bugs happening have been increasing alongside - block jobs, coroutines,
> throttling, AioContext, op blockers and image locking combined together make a
> large and complex picture that is hard to fully understand and work with. Some
> bugs we've encountered are quite challenging already.  Examples are:
> 
> - segfault in parallel blockjobs (iotest 30)
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html
> 
> - Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage
>   migration)
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01626.html
> 
> - Drainage in bdrv_replace_child_noperm()
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg00868.html
> 
> - Regression from 2.8: stuck in bdrv_drain()
>   https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02193.html
> 

I agree, it seems the complexity is growing by quite a bit.

> So in principle, what should we do to make the block layer easy to understand,
> develop with and debug? I think we have opportunities in these aspects:
> 
> - Documentation
> 
>   There is no central developer doc about block layer, especially how all pieces
>   fit together. Having one will make it a lot easier for new contributors to
>   understand better. Of course, we're facing the old problem: the code is
>   moving, maintaining an updated document needs effort.
> 
>   Idea: add ./doc/deve/block.txt?
> 

There are some bits of brilliance in what is already there; for instance,
devel/atomics.txt is very thorough.  But I agree that a major piece missing
is an overall design document, that provides the "why" to the "what".

Even given the cost of maintaining a higher level design document, I
think your suggestion here is probably the one that can help mitigate the
complexity the most; the more we (developers) can keep a coherent design
model in mind, the better we are able to do your _other_ suggestions: create
effective tests, simplify code, and enhance debuggability.

> - Tests
> 
>   Writing tests is a great way not only to exercise code, verify new features
>   work as expected and catch regression bugs, but also a way to show how the
>   feature can be used. There is this trend that the QEMU user interface
>   gradually moves from high level commands and args to small and flexible
>   building blocks, therefore demostrating the usage in iotests is meaningful.
> 
>   Idea: Add tests to simulate how libvirt uses block layer, or how we expect it
>   to. This would be a long term investment. We could reuse iotests, or create a
>   new test framework specifically, if it's easier (for example, use docker/vm
>   tests that just uses libvirt).
> 
>   Idea: Patchew already tests the quick group of iotests for a few
>   formats/protocols, but we should really add it to "make check".
> 

Perhaps higher level testing (like your example of how libvirt uses the
block layer) is a good candidate for avocado?


> - Simplified code, or more orthogonal/modularized architecture.
> 
>   Each aspect of block layer is complex enough so isolating them as much as
>   possible is a reasonable approach to control the complexity. Block jobs and
>   throttling becoming block filters is a good example, we should identify more.
> 
>   Idea: rethink event loops. Create coroutines ubiquitously (for example for
>   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> 
>   Crazy idea: move the whole block layer to a vhost process, and implement
>   existing features differently, especially in terms of multi-threading (hint:
>   rust?).
> 
> - Debuggability.
> 
>   Working with backtraces when coroutine is used is pretty hard, it would be
>   nice if ./scripts/qemugdb/coroutine.py could work with core files (i.e.
>   without a process to debug), and trace back to co->caller automatically.
> 

IIRC, this used to work, right?

>   It's always useful to dump block graph. Maybe we should add a helper function
>   in block layer that dumps all node graphs in graphviz DOT format, and even
>   make it available in QMP as x-dump-block-graph?
> 
>   Of course gdb scripts to dump various lists are also nice little things to
>   have.
> 
>   Idea: write more ./scripts/qemugdb/<scriptlet>.py.

More qemugdb macros would be great, especially for dumping the block chain
and making coroutines less opaque.

-Jeff

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29  3:55 [Qemu-devel] Block layer complexity: what to do to keep it under control? Fam Zheng
  2017-11-29  6:30 ` Jeff Cody
@ 2017-11-29 12:00 ` Stefan Hajnoczi
  2017-11-29 12:24   ` Paolo Bonzini
  2017-11-30  9:47   ` Fam Zheng
  2017-11-29 12:32 ` Daniel P. Berrange
  2 siblings, 2 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-11-29 12:00 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, qemu-block, pbonzini, jcody, kwolf, mreitz, eblake

[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]

On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> As we move forwards with new features in the block layer, the chances of tricky
> bugs happening have been increasing alongside - block jobs, coroutines,
> throttling, AioContext, op blockers and image locking combined together make a
> large and complex picture that is hard to fully understand and work with. Some
> bugs we've encountered are quite challenging already.  Examples are:
> 
> - segfault in parallel blockjobs (iotest 30)
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html
> 
> - Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage
>   migration)
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01626.html
> 
> - Drainage in bdrv_replace_child_noperm()
>   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg00868.html
> 
> - Regression from 2.8: stuck in bdrv_drain()
>   https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02193.html
> 
> So in principle, what should we do to make the block layer easy to understand,
> develop with and debug?

The assumptions that the code relies on are unclear so it's easy to
introduce new bugs.

We are at a point where code review isn't finding certain bugs because
no single person knows all the assumptions.  Previously the problem was
contained because maintainers spotted problems before patches were
merged.

This is not primarily a documentation problem though.  We cannot
document our way out of this because no single person (patch author or
code reviewer) can know or check everything anymore due to the scale.

I think it's a (lack of) design problem because we have many incomplete
abstractions like block jobs, IOThreads, block graph, image locking,
etc.  They do not cover all possibly states and interactions today.
Extending them leads to complex bugs.

A little progress has been made with defining higher-level APIs for
block drivers and block jobs.  This way they either don't deal with
low-level details of the concurrency and event loop models (e.g.
bdrv_coroutine_enter()) or there is an interface that prompts them to
integrate properly like bdrv_attach/detach_aio_context().

Event loops and coroutines are good but they should not be used directly
by block drivers and block jobs.  We need safe, high-level APIs that
implement commonly-used operations.

> - Documentation
> 
>   There is no central developer doc about block layer, especially how all pieces
>   fit together. Having one will make it a lot easier for new contributors to
>   understand better. Of course, we're facing the old problem: the code is
>   moving, maintaining an updated document needs effort.
> 
>   Idea: add ./doc/deve/block.txt?

IOThreads and AioContexts are addressed here:
docs/devel/multiple-iothreads.txt

The game has become significantly more complex than what the document
describes.  It's lacking aio_co_wake() and aio_co_schedule() for
example.

> - Simplified code, or more orthogonal/modularized architecture.
> 
>   Each aspect of block layer is complex enough so isolating them as much as
>   possible is a reasonable approach to control the complexity. Block jobs and
>   throttling becoming block filters is a good example, we should identify more.
> 
>   Idea: rethink event loops. Create coroutines ubiquitously (for example for
>   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> 
>   Crazy idea: move the whole block layer to a vhost process, and implement
>   existing features differently, especially in terms of multi-threading (hint:
>   rust?).

A reimplementation will not solve the problem because:

1. If it still has the same feature set and requirements then the level
   of complexity will be comparable.

2. We can reduce accidental (inessential) complexity by continuing the
   various efforts around the block graph, block jobs, multi-queue block
   layer with an eye towards higher level APIs.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29  6:30 ` Jeff Cody
@ 2017-11-29 12:16   ` Stefan Hajnoczi
  2017-11-29 12:22     ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-11-29 12:16 UTC (permalink / raw)
  To: Jeff Cody
  Cc: Fam Zheng, qemu-devel, qemu-block, pbonzini, kwolf, mreitz, eblake

[-- Attachment #1: Type: text/plain, Size: 2612 bytes --]

On Wed, Nov 29, 2017 at 01:30:06AM -0500, Jeff Cody wrote:
> On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> > Hi all,
> > 
> > As we move forwards with new features in the block layer, the chances of tricky
> > bugs happening have been increasing alongside - block jobs, coroutines,
> > throttling, AioContext, op blockers and image locking combined together make a
> > large and complex picture that is hard to fully understand and work with. Some
> > bugs we've encountered are quite challenging already.  Examples are:
> > 
> > - segfault in parallel blockjobs (iotest 30)
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html
> > 
> > - Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage
> >   migration)
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01626.html
> > 
> > - Drainage in bdrv_replace_child_noperm()
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg00868.html
> > 
> > - Regression from 2.8: stuck in bdrv_drain()
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02193.html
> > 
> 
> I agree, it seems the complexity is growing by quite a bit.
> 
> > So in principle, what should we do to make the block layer easy to understand,
> > develop with and debug? I think we have opportunities in these aspects:
> > 
> > - Documentation
> > 
> >   There is no central developer doc about block layer, especially how all pieces
> >   fit together. Having one will make it a lot easier for new contributors to
> >   understand better. Of course, we're facing the old problem: the code is
> >   moving, maintaining an updated document needs effort.
> > 
> >   Idea: add ./doc/deve/block.txt?
> > 
> 
> There are some bits of brilliance in what is already there; for instance,
> devel/atomics.txt is very thorough.  But I agree that a major piece missing
> is an overall design document, that provides the "why" to the "what".

While the atomics documentation is good, atomics themselves have been a
source of difficult bugs.

They should be used as little as possible and only where they can be
encapsulated in a composable abstraction (i.e. don't expect users of
your abstraction to understand atomics).

Why?  They are damn hard to use.  None of us is capable of using them
without introducing difficult bugs.

There is also a temptation to rely on implicit effects of other code
(e.g. when you know there is a barrier in another function) for best
performance.  That's a bad property for code to have because it becomes
hard to change safely.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:16   ` Stefan Hajnoczi
@ 2017-11-29 12:22     ` Paolo Bonzini
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2017-11-29 12:22 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jeff Cody
  Cc: Fam Zheng, qemu-devel, qemu-block, kwolf, mreitz, eblake

[-- Attachment #1: Type: text/plain, Size: 1308 bytes --]

On 29/11/2017 13:16, Stefan Hajnoczi wrote:
> While the atomics documentation is good, atomics themselves have been a
> source of difficult bugs.
> 
> They should be used as little as possible and only where they can be
> encapsulated in a composable abstraction (i.e. don't expect users of
> your abstraction to understand atomics).
> 
> Why?  They are damn hard to use.  None of us is capable of using them
> without introducing difficult bugs.
> 
> There is also a temptation to rely on implicit effects of other code
> (e.g. when you know there is a barrier in another function) for best
> performance.  That's a bad property for code to have because it becomes
> hard to change safely.

I agree.  atomics.txt should be augmented with examples of well-known
idioms/patterns and a big red sign "if this is not what you're doing,
don't do it".

However, I'll note that in the end atomics are not very different from
very small critical sections.  If anything, atomics have the advantage
that you have to think more about acquire/release semantics.  So it's in
general a matter of "being as clever as you need, but not more than
that" (and IMO util/async.c and util/qemu-coroutine-lock.c do need all
the cleverness they can muster, but it should stop there).

Thanks,

Paolo

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:00 ` Stefan Hajnoczi
@ 2017-11-29 12:24   ` Paolo Bonzini
  2017-11-29 13:24     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
                       ` (2 more replies)
  2017-11-30  9:47   ` Fam Zheng
  1 sibling, 3 replies; 21+ messages in thread
From: Paolo Bonzini @ 2017-11-29 12:24 UTC (permalink / raw)
  To: Stefan Hajnoczi, Fam Zheng
  Cc: qemu-devel, qemu-block, jcody, kwolf, mreitz, eblake

[-- Attachment #1: Type: text/plain, Size: 2616 bytes --]

On 29/11/2017 13:00, Stefan Hajnoczi wrote:
> We are at a point where code review isn't finding certain bugs because
> no single person knows all the assumptions.  Previously the problem was
> contained because maintainers spotted problems before patches were
> merged.
> 
> This is not primarily a documentation problem though.  We cannot
> document our way out of this because no single person (patch author or
> code reviewer) can know or check everything anymore due to the scale.
> 
> I think it's a (lack of) design problem because we have many incomplete
> abstractions like block jobs, IOThreads, block graph, image locking,
> etc.  They do not cover all possibly states and interactions today.
> Extending them leads to complex bugs.

I think the main interactions are:

1) block graph modifications and drain.  This has always been a carnage.
 Implementing BlockBackend isolation instead of drain would probably be
a starting point to fix it, because IIRC there are extremely few cases
where we really need "drain" semantics.

2) block jobs and coroutines.  Block jobs were too clever about
coroutines.  Using a simplified API is going to fix this problem.
Ideally, if you're not in a coroutine "co", the only coroutine APIs you
should use on "co" are:

- aio_co_enter/qemu_coroutine_enter (start a coroutine, respectively on
another AioContext or this context);

- aio_co_schedule/aio_co_wake (restart a coroutine that has yielded,
respectively on a given AioContext or its own original.

3) block jobs and drain.  This is related to (1) because drain can
terminate jobs and in turn that can cause block graph modifications.
I'm not even sure it's a separate issue.

Regarding documentation, the include file documentation is good for
coroutines and block jobs.  But it's bad for block graph modification
APIs, and even for coroutines + block jobs the docs/devel documentation
could be improved *and* it's ugly that we're not generating anything
readable from include file documentation, to go with docs/devel.

Paolo

> A little progress has been made with defining higher-level APIs for
> block drivers and block jobs.  This way they either don't deal with
> low-level details of the concurrency and event loop models (e.g.
> bdrv_coroutine_enter()) or there is an interface that prompts them to
> integrate properly like bdrv_attach/detach_aio_context().
> 
> Event loops and coroutines are good but they should not be used directly
> by block drivers and block jobs.  We need safe, high-level APIs that
> implement commonly-used operations.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29  3:55 [Qemu-devel] Block layer complexity: what to do to keep it under control? Fam Zheng
  2017-11-29  6:30 ` Jeff Cody
  2017-11-29 12:00 ` Stefan Hajnoczi
@ 2017-11-29 12:32 ` Daniel P. Berrange
  2 siblings, 0 replies; 21+ messages in thread
From: Daniel P. Berrange @ 2017-11-29 12:32 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, kwolf, qemu-block, jcody, mreitz, stefanha, pbonzini

On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> So in principle, what should we do to make the block layer easy to understand,
> develop with and debug? I think we have opportunities in these aspects:
> 
> - Documentation
> 
>   There is no central developer doc about block layer, especially how all pieces
>   fit together. Having one will make it a lot easier for new contributors to
>   understand better. Of course, we're facing the old problem: the code is
>   moving, maintaining an updated document needs effort.
> 
>   Idea: add ./doc/deve/block.txt?

When I was writing the LUKS block driver one of the bug problems I had was
understanding the internal BlockDriver APIs, as there's almost a complete
lack of any API documentation for them. So I would strongly like to see
inline API documentation in the header files to describe the APIs. eg more
of this:

  commit b461151ff31c7925f271c297e8abed20231ac7d3
  Author: Daniel P. Berrange <berrange@redhat.com>
  Date:   Thu Aug 31 11:54:56 2017 +0100

    block: document semantics of bdrv_co_preadv|pwritev

> More thoughts?

Currently the block driver layer has three different ways you can
implement the I/O path. bdrv_aio_{readv,writev},  bdrv_co_{readv,writev}
and bdrv_co_{preadv,pwritev}. This is pretty confusing to people who are
not core block maintainers, and doubtless increases the code complexity
to deal with three different I/O paths.

This is the usual curse of QEMU refactoring - a new framework is introduced
but only a few existing users are updated to use it.  I'd suggest an
aggressive push to convert all block drivers to use the same internal APIs.
Set a deadline for maintainers to convert them, and if missed, drop the
drivers in question.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:24   ` Paolo Bonzini
@ 2017-11-29 13:24     ` Stefan Hajnoczi
  2017-11-29 13:41     ` [Qemu-devel] " Kevin Wolf
  2017-11-29 19:58     ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-11-29 13:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Fam Zheng, kwolf, qemu-block, qemu-devel, mreitz

[-- Attachment #1: Type: text/plain, Size: 2379 bytes --]

On Wed, Nov 29, 2017 at 01:24:46PM +0100, Paolo Bonzini wrote:
> On 29/11/2017 13:00, Stefan Hajnoczi wrote:
> > We are at a point where code review isn't finding certain bugs because
> > no single person knows all the assumptions.  Previously the problem was
> > contained because maintainers spotted problems before patches were
> > merged.
> > 
> > This is not primarily a documentation problem though.  We cannot
> > document our way out of this because no single person (patch author or
> > code reviewer) can know or check everything anymore due to the scale.
> > 
> > I think it's a (lack of) design problem because we have many incomplete
> > abstractions like block jobs, IOThreads, block graph, image locking,
> > etc.  They do not cover all possibly states and interactions today.
> > Extending them leads to complex bugs.
> 
> I think the main interactions are:
> 
> 1) block graph modifications and drain.  This has always been a carnage.
>  Implementing BlockBackend isolation instead of drain would probably be
> a starting point to fix it, because IIRC there are extremely few cases
> where we really need "drain" semantics.

Yes, this is a big one.

> 2) block jobs and coroutines.  Block jobs were too clever about
> coroutines.  Using a simplified API is going to fix this problem.
> Ideally, if you're not in a coroutine "co", the only coroutine APIs you
> should use on "co" are:
> 
> - aio_co_enter/qemu_coroutine_enter (start a coroutine, respectively on
> another AioContext or this context);
> 
> - aio_co_schedule/aio_co_wake (restart a coroutine that has yielded,
> respectively on a given AioContext or its own original.
> 
> 3) block jobs and drain.  This is related to (1) because drain can
> terminate jobs and in turn that can cause block graph modifications.
> I'm not even sure it's a separate issue.
> 
> 
> Regarding documentation, the include file documentation is good for
> coroutines and block jobs.  But it's bad for block graph modification
> APIs, and even for coroutines + block jobs the docs/devel documentation
> could be improved *and* it's ugly that we're not generating anything
> readable from include file documentation, to go with docs/devel.

We still need to be careful about block job completion code that runs in
the main loop, too.  That has been a source of several bugs.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:24   ` Paolo Bonzini
  2017-11-29 13:24     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2017-11-29 13:41     ` Kevin Wolf
  2017-11-29 19:58     ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 21+ messages in thread
From: Kevin Wolf @ 2017-11-29 13:41 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Fam Zheng, qemu-devel, qemu-block, jcody,
	mreitz, eblake

[-- Attachment #1: Type: text/plain, Size: 2106 bytes --]

Am 29.11.2017 um 13:24 hat Paolo Bonzini geschrieben:
> On 29/11/2017 13:00, Stefan Hajnoczi wrote:
> > We are at a point where code review isn't finding certain bugs because
> > no single person knows all the assumptions.  Previously the problem was
> > contained because maintainers spotted problems before patches were
> > merged.
> > 
> > This is not primarily a documentation problem though.  We cannot
> > document our way out of this because no single person (patch author or
> > code reviewer) can know or check everything anymore due to the scale.
> > 
> > I think it's a (lack of) design problem because we have many incomplete
> > abstractions like block jobs, IOThreads, block graph, image locking,
> > etc.  They do not cover all possibly states and interactions today.
> > Extending them leads to complex bugs.
> 
> I think the main interactions are:
> 
> 1) block graph modifications and drain.  This has always been a carnage.
>  Implementing BlockBackend isolation instead of drain would probably be
> a starting point to fix it, because IIRC there are extremely few cases
> where we really need "drain" semantics.

I think it's not just specifically drain, but nested event loops in
general. Drain is just more prominent because it recursively affects the
whole tree and actively waits for callbacks, so if anything can go
wrong, it will certainly affect drain, too.

The big problem I see here is that we have never defined in which places
or under which conditions it's allowed to make changes to the graph.
This means that callers never know when to use an extra bdrv_ref/unref
pair, when to expect that child references change in the middle of the
operation etc.

Maybe what we need there is some coroutine locks that make sure that
e.g. a block job completion simply has to wait until a drain has
completed before the graph change is actually executed. We need to make
sure that these locks don't deadlock the drain operation, but as long as
these things run in a separate coroutine (like the block job coroutine),
it should be okay.

Kevin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:24   ` Paolo Bonzini
  2017-11-29 13:24     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2017-11-29 13:41     ` [Qemu-devel] " Kevin Wolf
@ 2017-11-29 19:58     ` Dr. David Alan Gilbert
  2 siblings, 0 replies; 21+ messages in thread
From: Dr. David Alan Gilbert @ 2017-11-29 19:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Fam Zheng, kwolf, qemu-block, jcody, qemu-devel, mreitz

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 29/11/2017 13:00, Stefan Hajnoczi wrote:
> > We are at a point where code review isn't finding certain bugs because
> > no single person knows all the assumptions.  Previously the problem was
> > contained because maintainers spotted problems before patches were
> > merged.
> > 
> > This is not primarily a documentation problem though.  We cannot
> > document our way out of this because no single person (patch author or
> > code reviewer) can know or check everything anymore due to the scale.
> > 
> > I think it's a (lack of) design problem because we have many incomplete
> > abstractions like block jobs, IOThreads, block graph, image locking,
> > etc.  They do not cover all possibly states and interactions today.
> > Extending them leads to complex bugs.
> 
> I think the main interactions are:
> 
> 1) block graph modifications and drain.  This has always been a carnage.
>  Implementing BlockBackend isolation instead of drain would probably be
> a starting point to fix it, because IIRC there are extremely few cases
> where we really need "drain" semantics.
> 
> 2) block jobs and coroutines.  Block jobs were too clever about
> coroutines.  Using a simplified API is going to fix this problem.
> Ideally, if you're not in a coroutine "co", the only coroutine APIs you
> should use on "co" are:
> 
> - aio_co_enter/qemu_coroutine_enter (start a coroutine, respectively on
> another AioContext or this context);
> 
> - aio_co_schedule/aio_co_wake (restart a coroutine that has yielded,
> respectively on a given AioContext or its own original.
> 
> 3) block jobs and drain.  This is related to (1) because drain can
> terminate jobs and in turn that can cause block graph modifications.
> I'm not even sure it's a separate issue.

Block and migration has been having a rough time for a while, generally
around whether block devices should be inactivated at particular points.
While we've got some changes recently, we've still got at least one
known failure.

Dave

> Regarding documentation, the include file documentation is good for
> coroutines and block jobs.  But it's bad for block graph modification
> APIs, and even for coroutines + block jobs the docs/devel documentation
> could be improved *and* it's ugly that we're not generating anything
> readable from include file documentation, to go with docs/devel.
> 
> Paolo
> 
> > A little progress has been made with defining higher-level APIs for
> > block drivers and block jobs.  This way they either don't deal with
> > low-level details of the concurrency and event loop models (e.g.
> > bdrv_coroutine_enter()) or there is an interface that prompts them to
> > integrate properly like bdrv_attach/detach_aio_context().
> > 
> > Event loops and coroutines are good but they should not be used directly
> > by block drivers and block jobs.  We need safe, high-level APIs that
> > implement commonly-used operations.
> 
> 


--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-29 12:00 ` Stefan Hajnoczi
  2017-11-29 12:24   ` Paolo Bonzini
@ 2017-11-30  9:47   ` Fam Zheng
  2017-11-30 14:19     ` Stefan Hajnoczi
  1 sibling, 1 reply; 21+ messages in thread
From: Fam Zheng @ 2017-11-30  9:47 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz, pbonzini

On Wed, 11/29 12:00, Stefan Hajnoczi wrote:
> On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> > As we move forwards with new features in the block layer, the chances of tricky
> > bugs happening have been increasing alongside - block jobs, coroutines,
> > throttling, AioContext, op blockers and image locking combined together make a
> > large and complex picture that is hard to fully understand and work with. Some
> > bugs we've encountered are quite challenging already.  Examples are:
> > 
> > - segfault in parallel blockjobs (iotest 30)
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01144.html
> > 
> > - Intermittent hang of iotest 194 (bdrv_drain_all after non-shared storage
> >   migration)
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg01626.html
> > 
> > - Drainage in bdrv_replace_child_noperm()
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-11/msg00868.html
> > 
> > - Regression from 2.8: stuck in bdrv_drain()
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02193.html
> > 
> > So in principle, what should we do to make the block layer easy to understand,
> > develop with and debug?
> 
> The assumptions that the code relies on are unclear so it's easy to
> introduce new bugs.

Is that one thing we could do better in documenting?

> 
> We are at a point where code review isn't finding certain bugs because
> no single person knows all the assumptions.  Previously the problem was
> contained because maintainers spotted problems before patches were
> merged.
> 
> This is not primarily a documentation problem though.  We cannot
> document our way out of this because no single person (patch author or
> code reviewer) can know or check everything anymore due to the scale.
> 
> I think it's a (lack of) design problem because we have many incomplete
> abstractions like block jobs, IOThreads, block graph, image locking,
> etc.  They do not cover all possibly states and interactions today.
> Extending them leads to complex bugs.
> 
> A little progress has been made with defining higher-level APIs for
> block drivers and block jobs.  This way they either don't deal with
> low-level details of the concurrency and event loop models (e.g.
> bdrv_coroutine_enter()) or there is an interface that prompts them to
> integrate properly like bdrv_attach/detach_aio_context().

Sounds good.

> 
> Event loops and coroutines are good but they should not be used directly
> by block drivers and block jobs.  We need safe, high-level APIs that
> implement commonly-used operations.
> 
> > - Documentation
> > 
> >   There is no central developer doc about block layer, especially how all pieces
> >   fit together. Having one will make it a lot easier for new contributors to
> >   understand better. Of course, we're facing the old problem: the code is
> >   moving, maintaining an updated document needs effort.
> > 
> >   Idea: add ./doc/deve/block.txt?
> 
> IOThreads and AioContexts are addressed here:
> docs/devel/multiple-iothreads.txt
> 
> The game has become significantly more complex than what the document
> describes.  It's lacking aio_co_wake() and aio_co_schedule() for
> example.
> 
> > - Simplified code, or more orthogonal/modularized architecture.
> > 
> >   Each aspect of block layer is complex enough so isolating them as much as
> >   possible is a reasonable approach to control the complexity. Block jobs and
> >   throttling becoming block filters is a good example, we should identify more.
> > 
> >   Idea: rethink event loops. Create coroutines ubiquitously (for example for
> >   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> > 
> >   Crazy idea: move the whole block layer to a vhost process, and implement
> >   existing features differently, especially in terms of multi-threading (hint:
> >   rust?).
> 
> A reimplementation will not solve the problem because:
> 
> 1. If it still has the same feature set and requirements then the level
>    of complexity will be comparable.
> 
> 2. We can reduce accidental (inessential) complexity by continuing the
>    various efforts around the block graph, block jobs, multi-queue block
>    layer with an eye towards higher level APIs.

Starting over is certainly not the motivation to do qemu-vhost, but it would be
an opportunity to use different async/concurrency paradigms if that is going to
happen. I think in current block layer, event loop + coroutine is a good
combination, but having nested aio_poll()'s made it worse, then mixing IOThreads
in makes it a lot more complicated.

Fam

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-30  9:47   ` Fam Zheng
@ 2017-11-30 14:19     ` Stefan Hajnoczi
  2017-12-01 10:16       ` Fam Zheng
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-11-30 14:19 UTC (permalink / raw)
  To: Fam Zheng; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz, pbonzini

[-- Attachment #1: Type: text/plain, Size: 3418 bytes --]

On Thu, Nov 30, 2017 at 05:47:09PM +0800, Fam Zheng wrote:
> On Wed, 11/29 12:00, Stefan Hajnoczi wrote:
> > On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> > 
> > Event loops and coroutines are good but they should not be used directly
> > by block drivers and block jobs.  We need safe, high-level APIs that
> > implement commonly-used operations.
> > 
> > > - Documentation
> > > 
> > >   There is no central developer doc about block layer, especially how all pieces
> > >   fit together. Having one will make it a lot easier for new contributors to
> > >   understand better. Of course, we're facing the old problem: the code is
> > >   moving, maintaining an updated document needs effort.
> > > 
> > >   Idea: add ./doc/deve/block.txt?
> > 
> > IOThreads and AioContexts are addressed here:
> > docs/devel/multiple-iothreads.txt
> > 
> > The game has become significantly more complex than what the document
> > describes.  It's lacking aio_co_wake() and aio_co_schedule() for
> > example.
> > 
> > > - Simplified code, or more orthogonal/modularized architecture.
> > > 
> > >   Each aspect of block layer is complex enough so isolating them as much as
> > >   possible is a reasonable approach to control the complexity. Block jobs and
> > >   throttling becoming block filters is a good example, we should identify more.
> > > 
> > >   Idea: rethink event loops. Create coroutines ubiquitously (for example for
> > >   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> > > 
> > >   Crazy idea: move the whole block layer to a vhost process, and implement
> > >   existing features differently, especially in terms of multi-threading (hint:
> > >   rust?).
> > 
> > A reimplementation will not solve the problem because:
> > 
> > 1. If it still has the same feature set and requirements then the level
> >    of complexity will be comparable.
> > 
> > 2. We can reduce accidental (inessential) complexity by continuing the
> >    various efforts around the block graph, block jobs, multi-queue block
> >    layer with an eye towards higher level APIs.
> 
> Starting over is certainly not the motivation to do qemu-vhost, but it would be
> an opportunity to use different async/concurrency paradigms if that is going to
> happen. I think in current block layer, event loop + coroutine is a good
> combination, but having nested aio_poll()'s made it worse, then mixing IOThreads
> in makes it a lot more complicated.

Why alternative model are you thinking of?

One slight change is to make everything run in a coroutine so that there
are no while (aio_poll()) loops.  Instead the caller would yield.

But the fundamental problem remains that drain is necessary and the
storage stack does not support request cancellation.  For example, when
the virtio-blk-pci device is reset QEMU has no way of (safely)
cancelling requests so it has to wait for all requests to complete.
This means we're stuck with synchronization points.

I agree that adding threading makes thing more complicated but
ultimately there is a real requirement to do that.  It's like the
difference between a simple block driver that is designed only for
qemu-img convert versus a fully-featured block driver that supports
parallel I/O.  The complexity is much higher but you can't wish it away
if you want parallel I/O.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-11-30 14:19     ` Stefan Hajnoczi
@ 2017-12-01 10:16       ` Fam Zheng
  2017-12-01 14:08         ` Stefan Hajnoczi
  0 siblings, 1 reply; 21+ messages in thread
From: Fam Zheng @ 2017-12-01 10:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz, pbonzini

On Thu, 11/30 14:19, Stefan Hajnoczi wrote:
> On Thu, Nov 30, 2017 at 05:47:09PM +0800, Fam Zheng wrote:
> > On Wed, 11/29 12:00, Stefan Hajnoczi wrote:
> > > On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> > > 
> > > Event loops and coroutines are good but they should not be used directly
> > > by block drivers and block jobs.  We need safe, high-level APIs that
> > > implement commonly-used operations.
> > > 
> > > > - Documentation
> > > > 
> > > >   There is no central developer doc about block layer, especially how all pieces
> > > >   fit together. Having one will make it a lot easier for new contributors to
> > > >   understand better. Of course, we're facing the old problem: the code is
> > > >   moving, maintaining an updated document needs effort.
> > > > 
> > > >   Idea: add ./doc/deve/block.txt?
> > > 
> > > IOThreads and AioContexts are addressed here:
> > > docs/devel/multiple-iothreads.txt
> > > 
> > > The game has become significantly more complex than what the document
> > > describes.  It's lacking aio_co_wake() and aio_co_schedule() for
> > > example.
> > > 
> > > > - Simplified code, or more orthogonal/modularized architecture.
> > > > 
> > > >   Each aspect of block layer is complex enough so isolating them as much as
> > > >   possible is a reasonable approach to control the complexity. Block jobs and
> > > >   throttling becoming block filters is a good example, we should identify more.
> > > > 
> > > >   Idea: rethink event loops. Create coroutines ubiquitously (for example for
> > > >   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> > > > 
> > > >   Crazy idea: move the whole block layer to a vhost process, and implement
> > > >   existing features differently, especially in terms of multi-threading (hint:
> > > >   rust?).
> > > 
> > > A reimplementation will not solve the problem because:
> > > 
> > > 1. If it still has the same feature set and requirements then the level
> > >    of complexity will be comparable.
> > > 
> > > 2. We can reduce accidental (inessential) complexity by continuing the
> > >    various efforts around the block graph, block jobs, multi-queue block
> > >    layer with an eye towards higher level APIs.
> > 
> > Starting over is certainly not the motivation to do qemu-vhost, but it would be
> > an opportunity to use different async/concurrency paradigms if that is going to
> > happen. I think in current block layer, event loop + coroutine is a good
> > combination, but having nested aio_poll()'s made it worse, then mixing IOThreads
> > in makes it a lot more complicated.
> 
> Why alternative model are you thinking of?

To utilize whatever is offered in the different language. In particular I've
heard good things about rust (without programming it myself) that doing
concurrency correctly is easier with it. We'll probably lose all the good bits
about coroutine (unlike what is special in Go), but I expects using simpler
concurrency models (IOW threads only) can lead to simpler code. (I have no
problem with coroutine excpet the debuggability problem I pointed out, which
hopefully can be solved by writing more gdb extensions.)

Another thing about rust is it can call into C code so maybe the change can be
done incrementally like suggested by Dan in his libvirt discussion about using
Go:

https://www.redhat.com/archives/libvir-list/2017-November/msg00528.html

> 
> One slight change is to make everything run in a coroutine so that there
> are no while (aio_poll()) loops.  Instead the caller would yield.

Yes I fully agree this is a good idea that we should try.

Fam

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 10:16       ` Fam Zheng
@ 2017-12-01 14:08         ` Stefan Hajnoczi
  2017-12-01 15:00           ` Fam Zheng
  2017-12-01 17:03           ` Paolo Bonzini
  0 siblings, 2 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-12-01 14:08 UTC (permalink / raw)
  To: Fam Zheng; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz, pbonzini

[-- Attachment #1: Type: text/plain, Size: 6456 bytes --]

On Fri, Dec 01, 2017 at 06:16:44PM +0800, Fam Zheng wrote:
> On Thu, 11/30 14:19, Stefan Hajnoczi wrote:
> > On Thu, Nov 30, 2017 at 05:47:09PM +0800, Fam Zheng wrote:
> > > On Wed, 11/29 12:00, Stefan Hajnoczi wrote:
> > > > On Wed, Nov 29, 2017 at 11:55:02AM +0800, Fam Zheng wrote:
> > > > 
> > > > Event loops and coroutines are good but they should not be used directly
> > > > by block drivers and block jobs.  We need safe, high-level APIs that
> > > > implement commonly-used operations.
> > > > 
> > > > > - Documentation
> > > > > 
> > > > >   There is no central developer doc about block layer, especially how all pieces
> > > > >   fit together. Having one will make it a lot easier for new contributors to
> > > > >   understand better. Of course, we're facing the old problem: the code is
> > > > >   moving, maintaining an updated document needs effort.
> > > > > 
> > > > >   Idea: add ./doc/deve/block.txt?
> > > > 
> > > > IOThreads and AioContexts are addressed here:
> > > > docs/devel/multiple-iothreads.txt
> > > > 
> > > > The game has become significantly more complex than what the document
> > > > describes.  It's lacking aio_co_wake() and aio_co_schedule() for
> > > > example.
> > > > 
> > > > > - Simplified code, or more orthogonal/modularized architecture.
> > > > > 
> > > > >   Each aspect of block layer is complex enough so isolating them as much as
> > > > >   possible is a reasonable approach to control the complexity. Block jobs and
> > > > >   throttling becoming block filters is a good example, we should identify more.
> > > > > 
> > > > >   Idea: rethink event loops. Create coroutines ubiquitously (for example for
> > > > >   each fd handler, BH and timer), so that many nested aio_poll() can be removed.
> > > > > 
> > > > >   Crazy idea: move the whole block layer to a vhost process, and implement
> > > > >   existing features differently, especially in terms of multi-threading (hint:
> > > > >   rust?).
> > > > 
> > > > A reimplementation will not solve the problem because:
> > > > 
> > > > 1. If it still has the same feature set and requirements then the level
> > > >    of complexity will be comparable.
> > > > 
> > > > 2. We can reduce accidental (inessential) complexity by continuing the
> > > >    various efforts around the block graph, block jobs, multi-queue block
> > > >    layer with an eye towards higher level APIs.
> > > 
> > > Starting over is certainly not the motivation to do qemu-vhost, but it would be
> > > an opportunity to use different async/concurrency paradigms if that is going to
> > > happen. I think in current block layer, event loop + coroutine is a good
> > > combination, but having nested aio_poll()'s made it worse, then mixing IOThreads
> > > in makes it a lot more complicated.
> > 
> > Why alternative model are you thinking of?
> 
> To utilize whatever is offered in the different language. In particular I've
> heard good things about rust (without programming it myself) that doing
> concurrency correctly is easier with it. We'll probably lose all the good bits
> about coroutine (unlike what is special in Go), but I expects using simpler
> concurrency models (IOW threads only) can lead to simpler code. (I have no
> problem with coroutine excpet the debuggability problem I pointed out, which
> hopefully can be solved by writing more gdb extensions.)

[A long rant here but I hope it contains useful points.]

Rust's threading model is 1:1.  Besides mutexes it also has channels
(looks similar to Go and communicating sequential processes-style
channels).

It is probably not feasible to make each I/O request a thread (i.e. 1M
IOPS means creating/destroying 1M threads/sec).  There would have to be
some machinery like a request queue and a thread pool to process
requests.  That way requests can be held before those that are ready to
be executed can run as threads.  Hmm...this sounds similar to what we
have.

The important thing is this:

Coroutines with a single event loop (current model in QEMU) are simpler
than threads.  Why?  Because coroutine code is atomic with respect to
other coroutines in the same event loop.  Only yield points or nested
event loops allow other coroutines to execute.  That means less explicit
synchronization is necessary.

When the block layer goes multiqueue this advantage will be lost and
coroutine code will have to synchronize explicitly just like threaded
code.  Coroutines will remain lighter weight than threads and will allow
M:N threading to be configured via IOThreads.

I'm not hopeful that dropping coroutines helps and I don't see that Rust
brings anything new to the table here.  I do like other aspects of Rust
and am open to using it for new code.

What I'm getting at is that the essential complexity of a parallel I/O
engine with a runtime reconfigurable graph, background operations, and
non-trivial disk image file formats sets a certain floor (minimum)
complexity.  The consequence is that our code needs to handle
concurrency and this is where we fall down today.

Switching programming language will not reduce complexity below this
floor.  We'd have to abandon features in order to reduce complexity.
"Let's start from scratch" attempts often hope to do this but ultimately
users don't want to lose features.

We need to think through the corner cases we've been hitting and in the
process of fixing them we should consider if simplified interfaces/APIs
with less concurrency would allow us to write better code at the expense
of less control and performance in cases where we don't need it.

That means insulate block jobs and drivers as much as possible.  Don't
make them use low-level primitives.  Make them use high-level APIs where
the concurrency is baked in and as safe as we can make it without giving
up performance.

> Another thing about rust is it can call into C code so maybe the change can be
> done incrementally like suggested by Dan in his libvirt discussion about using
> Go:
> 
> https://www.redhat.com/archives/libvir-list/2017-November/msg00528.html

Max's qcow2 in Rust driver didn't make me very optimistic about mixed C
and Rust.  Putting a Rust layer on top of C looks sane.  I don't think
you can mix C and Rust side-by-side without a lot of duplication and
boilerplate, but my knowledge is limited to looking at Max's qcow2
driver.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 14:08         ` Stefan Hajnoczi
@ 2017-12-01 15:00           ` Fam Zheng
  2017-12-01 17:03           ` Paolo Bonzini
  1 sibling, 0 replies; 21+ messages in thread
From: Fam Zheng @ 2017-12-01 15:00 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz, pbonzini

On Fri, 12/01 14:08, Stefan Hajnoczi wrote:
> What I'm getting at is that the essential complexity of a parallel I/O
> engine with a runtime reconfigurable graph, background operations, and
> non-trivial disk image file formats sets a certain floor (minimum)
> complexity.  The consequence is that our code needs to handle
> concurrency and this is where we fall down today.

I think complexity comes from too many (incomplete) abstractions going together,
and there is no simple core mechanism to coordinate them. Specifically, there is
no elegant synchronization interface between graph reconfiguration (control path
operation) and I/O (data path operation), and background operations have no
clear modeling (block jobs are not first class citizen in our event loop, they
are just specially purposed coroutines). If we redesign from the ground up, we
can hopefully model better and draw more "rigorous" lines between different
parts of the picture, or we can use a simpler design, such as "everything is a
block node in the graph".

> 
> Switching programming language will not reduce complexity below this
> floor.  We'd have to abandon features in order to reduce complexity.
> "Let's start from scratch" attempts often hope to do this but ultimately
> users don't want to lose features.

The complexity can be hidden with a higher level abstraction in exchange of full
control and extreme performance, that is probably the real difference to expect
by changing the programming language. I don't believe "start from scratch" is a
way to fix the old system, there will always be bugs, it's just different, and
it's extremely hard to do it fully compatible with the old implementation.

> 
> We need to think through the corner cases we've been hitting and in the
> process of fixing them we should consider if simplified interfaces/APIs
> with less concurrency would allow us to write better code at the expense
> of less control and performance in cases where we don't need it.
> 
> That means insulate block jobs and drivers as much as possible.  Don't
> make them use low-level primitives.  Make them use high-level APIs where
> the concurrency is baked in and as safe as we can make it without giving
> up performance.
> 

Yes, we should take this step sooner rather than later.

Fam

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 14:08         ` Stefan Hajnoczi
  2017-12-01 15:00           ` Fam Zheng
@ 2017-12-01 17:03           ` Paolo Bonzini
  2017-12-01 19:03             ` Peter Maydell
  2017-12-01 19:27             ` Eric Blake
  1 sibling, 2 replies; 21+ messages in thread
From: Paolo Bonzini @ 2017-12-01 17:03 UTC (permalink / raw)
  To: Stefan Hajnoczi, Fam Zheng; +Cc: kwolf, qemu-block, jcody, qemu-devel, mreitz

[-- Attachment #1: Type: text/plain, Size: 3516 bytes --]

Just my 2 cents on the language topic, as in general I agree completely
with Stefan.

On 01/12/2017 15:08, Stefan Hajnoczi wrote:
> Rust's threading model is 1:1.  Besides mutexes it also has channels
> (looks similar to Go and communicating sequential processes-style
> channels).
> 
> It is probably not feasible to make each I/O request a thread (i.e. 1M
> IOPS means creating/destroying 1M threads/sec).  There would have to be
> some machinery like a request queue and a thread pool to process
> requests.  That way requests can be held before those that are ready to
> be executed can run as threads.  Hmm...this sounds similar to what we
> have.

Yes, we would probably use coroutines anyway even if using Rust
(wrapping them in a Rust library).

There could still be important gains in readability and type-safety, for
example if the type checker could guarantee that internal qcow2
functions are called with a lock taken.  I'm afraid however that this
would require much stronger Rust knowledge than we can acquire quickly,
and that we would end up with just as much technical debt (without even
realizing it).

Luckily, several benefits don't require a full rewrite or language switch:

- readability from RAII-style code.  If this is important enough, we
could actually use GCC __attribute__((cleanup)) or, heaven forbid,
slowly introduce C++ in QEMU's code base.

- compile-time checks.  For this we can also use clang's
-Wthread-safety.  (GCC doesn't have it yet).

- lack of templates/generics; for example commit 2bb5c936c5 ("curl: do
not do aio_poll when waiting for a free CURLState", 2017-05-16)
open-codes a "CoQueue that's protected by a QemuMutex rather than a
CoMutex"; templates would provide an elegant solution without code
duplication.  Generics could also be used to decouple abstract data
types from the block layer and to unit-test them easily (e.g.
Graph<BlockDriverState, BdrvChildRole>), but: 1) it may be just my very
limited knowledge of the BDS graph code; 2) we couldn't do this anyway
without first understanding the deficiencies of our C implementation
(for example with respect to missing bdrv_ref/unref).

In all cases, the lessons that we learn fixing the C code or slowly
including a little bit of C++ _could_ help with a later larger-scale switch.

> The important thing is this:
> 
> Coroutines with a single event loop (current model in QEMU) are simpler
> than threads.  Why?  Because coroutine code is atomic with respect to
> other coroutines in the same event loop.  Only yield points or nested
> event loops allow other coroutines to execute.  That means less explicit
> synchronization is necessary.
> 
> When the block layer goes multiqueue this advantage will be lost and
> coroutine code will have to synchronize explicitly just like threaded
> code.  Coroutines will remain lighter weight than threads and will allow
> M:N threading to be configured via IOThreads.
> 
> I'm not hopeful that dropping coroutines helps and I don't see that Rust
> brings anything new to the table here.  I do like other aspects of Rust
> and am open to using it for new code.

Same here.  Just like fixing the C code provides a good foundation for a
language switch, some more battle-tested code could be converted from
QEMU to Rust, in order to get familiar with it and probe whether the
benefits are real.  Maybe the memory API could be a good candidate; it
certainly would benefit from generics.

Paolo


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 17:03           ` Paolo Bonzini
@ 2017-12-01 19:03             ` Peter Maydell
  2017-12-04 10:41               ` Stefan Hajnoczi
  2017-12-01 19:27             ` Eric Blake
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Maydell @ 2017-12-01 19:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Stefan Hajnoczi, Fam Zheng, Kevin Wolf, Jeff Cody,
	QEMU Developers, Qemu-block, Max Reitz

On 1 December 2017 at 17:03, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Same here.  Just like fixing the C code provides a good foundation for a
> language switch, some more battle-tested code could be converted from
> QEMU to Rust, in order to get familiar with it and probe whether the
> benefits are real.  Maybe the memory API could be a good candidate; it
> certainly would benefit from generics.

At the moment I feel like betting the future of the project on
Rust would be quite a courageous decision. On the other hand
it might be interesting to look at prototyping to see what
benefits it might bring. (One candidate I had in mind was the
device API -- given that we have had quite a few buffer overruns
in device code, converting some of the device models to Rust
might cut off that particular security issue.)

I agree with you that trying to tackle conversion without at
least one developer with reasonably-expert Rust knowledge is
likely to be fraught, though -- what you really need is to be
able to design APIs which are idiomatic for the language, rather
than the kind of thing you'd write in C but transliterated across...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 17:03           ` Paolo Bonzini
  2017-12-01 19:03             ` Peter Maydell
@ 2017-12-01 19:27             ` Eric Blake
  2017-12-04 10:16               ` Stefan Hajnoczi
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Blake @ 2017-12-01 19:27 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi, Fam Zheng
  Cc: kwolf, jcody, qemu-devel, qemu-block, mreitz

On 12/01/2017 11:03 AM, Paolo Bonzini wrote:
> Just my 2 cents on the language topic, as in general I agree completely
> with Stefan.
> 

> Luckily, several benefits don't require a full rewrite or language switch:
> 
> - readability from RAII-style code.  If this is important enough, we
> could actually use GCC __attribute__((cleanup)) or, heaven forbid,
> slowly introduce C++ in QEMU's code base.

This one is cool, and supported by both gcc and clang (and we really 
don't have any example of anyone clamoring for a different compiler). 
For example, here's what I recently added to nbdkit to make it MUCH 
easier to properly remember to unlock a mutex on all exit paths from a 
scope:

https://github.com/libguestfs/nbdkit/commit/0f24d66648
https://github.com/libguestfs/nbdkit/commit/b5ab4835e1

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 19:27             ` Eric Blake
@ 2017-12-04 10:16               ` Stefan Hajnoczi
  2017-12-04 10:32                 ` Peter Maydell
  0 siblings, 1 reply; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-12-04 10:16 UTC (permalink / raw)
  To: Eric Blake
  Cc: Paolo Bonzini, Fam Zheng, kwolf, jcody, qemu-devel, qemu-block, mreitz

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

On Fri, Dec 01, 2017 at 01:27:37PM -0600, Eric Blake wrote:
> On 12/01/2017 11:03 AM, Paolo Bonzini wrote:
> > Just my 2 cents on the language topic, as in general I agree completely
> > with Stefan.
> > 
> 
> > Luckily, several benefits don't require a full rewrite or language switch:
> > 
> > - readability from RAII-style code.  If this is important enough, we
> > could actually use GCC __attribute__((cleanup)) or, heaven forbid,
> > slowly introduce C++ in QEMU's code base.
> 
> This one is cool, and supported by both gcc and clang (and we really don't
> have any example of anyone clamoring for a different compiler). For example,
> here's what I recently added to nbdkit to make it MUCH easier to properly
> remember to unlock a mutex on all exit paths from a scope:
> 
> https://github.com/libguestfs/nbdkit/commit/0f24d66648
> https://github.com/libguestfs/nbdkit/commit/b5ab4835e1

Yes, please!  __attribute__((cleanup)) is great and we should use it in
QEMU.

I haven't looked at compiler support though... :)

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-04 10:16               ` Stefan Hajnoczi
@ 2017-12-04 10:32                 ` Peter Maydell
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Maydell @ 2017-12-04 10:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Eric Blake, Kevin Wolf, Fam Zheng, Qemu-block, Jeff Cody,
	QEMU Developers, Max Reitz, Paolo Bonzini

On 4 December 2017 at 10:16, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> Yes, please!  __attribute__((cleanup)) is great and we should use it in
> QEMU.
>
> I haven't looked at compiler support though... :)

Looks like it came in in gcc 3.6, which is earlier than our earliest
required version. For clang, it's harder to say because the clang
documentation doesn't mention it at all, even though they actually
support it AFAICT from googling.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] Block layer complexity: what to do to keep it under control?
  2017-12-01 19:03             ` Peter Maydell
@ 2017-12-04 10:41               ` Stefan Hajnoczi
  0 siblings, 0 replies; 21+ messages in thread
From: Stefan Hajnoczi @ 2017-12-04 10:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Paolo Bonzini, Fam Zheng, Kevin Wolf, Jeff Cody, QEMU Developers,
	Qemu-block, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2150 bytes --]

On Fri, Dec 01, 2017 at 07:03:15PM +0000, Peter Maydell wrote:
> On 1 December 2017 at 17:03, Paolo Bonzini <pbonzini@redhat.com> wrote:
> > Same here.  Just like fixing the C code provides a good foundation for a
> > language switch, some more battle-tested code could be converted from
> > QEMU to Rust, in order to get familiar with it and probe whether the
> > benefits are real.  Maybe the memory API could be a good candidate; it
> > certainly would benefit from generics.
> 
> At the moment I feel like betting the future of the project on
> Rust would be quite a courageous decision. On the other hand
> it might be interesting to look at prototyping to see what
> benefits it might bring. (One candidate I had in mind was the
> device API -- given that we have had quite a few buffer overruns
> in device code, converting some of the device models to Rust
> might cut off that particular security issue.)
> 
> I agree with you that trying to tackle conversion without at
> least one developer with reasonably-expert Rust knowledge is
> likely to be fraught, though -- what you really need is to be
> able to design APIs which are idiomatic for the language, rather
> than the kind of thing you'd write in C but transliterated across...

My main concern beyond lack of Rust experience is that quite a bit of
glue code is necessary to mix Rust and C:

  https://doc.rust-lang.org/book/first-edition/ffi.html

You need to redefine structs and functions in Rust.  To make the C
functions easily callable from Rust without unsafe code, you then need
to write Rust wrapper functions.

This means a lot of boilerplate and duplication.  It makes documenting
code harder.

IMO it will be very hard to replace a component in the QEMU codebase
with Rust code unless we're willing to put up with a lot of boilerplate
and duplication.

Where Rust makes a lot of sense to me is for new programs like
qemu-pr-helper, vhost-user programs, etc.  They have a clearly-defined
boundary (command-line, socket protocol, etc) where switching to Rust
doesn't require redeclaring a bunch of existing C interfaces.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-12-04 10:42 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-29  3:55 [Qemu-devel] Block layer complexity: what to do to keep it under control? Fam Zheng
2017-11-29  6:30 ` Jeff Cody
2017-11-29 12:16   ` Stefan Hajnoczi
2017-11-29 12:22     ` Paolo Bonzini
2017-11-29 12:00 ` Stefan Hajnoczi
2017-11-29 12:24   ` Paolo Bonzini
2017-11-29 13:24     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2017-11-29 13:41     ` [Qemu-devel] " Kevin Wolf
2017-11-29 19:58     ` Dr. David Alan Gilbert
2017-11-30  9:47   ` Fam Zheng
2017-11-30 14:19     ` Stefan Hajnoczi
2017-12-01 10:16       ` Fam Zheng
2017-12-01 14:08         ` Stefan Hajnoczi
2017-12-01 15:00           ` Fam Zheng
2017-12-01 17:03           ` Paolo Bonzini
2017-12-01 19:03             ` Peter Maydell
2017-12-04 10:41               ` Stefan Hajnoczi
2017-12-01 19:27             ` Eric Blake
2017-12-04 10:16               ` Stefan Hajnoczi
2017-12-04 10:32                 ` Peter Maydell
2017-11-29 12:32 ` Daniel P. Berrange

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.