* AioContext lock removal: help needed
@ 2022-07-08  8:42 Emanuele Giuseppe Esposito
  2022-07-08  9:42 ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 2+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-07-08  8:42 UTC (permalink / raw)
  To: open list:Block layer core, Vladimir Sementsov-Ogievskiy
  Cc: Kevin Wolf, Hanna Reitz, Paolo Bonzini, Stefan Hajnoczi,
	Alberto Faria, qemu-devel

Hello everyone,

As you all know, I am trying to find a way to replace the well-known
AioContext lock with something else that makes sense and provides the
same (or even better) guarantees as this lock.

The reasons for this change have been explained over and over and I don't
really want to repeat them. Please read the various series I posted in
the past [1] for more information.

The end goal is to get rid of the AioContext lock and have fine-grained
locks in the various components, to make the whole block layer more
multi-thread friendly and eventually be able to assign multiple virtual
queues to a single iothread.

The AioContext lock is used everywhere to protect a huge variety of data.
This greatly limits the level of multithreading that iothreads can achieve.

Before digging into the problem itself and possible solutions, I would
like to add that we are having a weekly (or bi-weekly, we'll see)
public meeting where we plan to discuss this project. Anyone
interested is very welcome to join. The event invitation is here:

https://calendar.google.com/event?action=TEMPLATE&tmeid=NTdja2VwMDFyYm9nNjNyc25pdXU5bm8wb3FfMjAyMjA3MTRUMDgwMDAwWiBlZXNwb3NpdEByZWRoYXQuY29t&tmsrc=eesposit%40redhat.com&scp=ALL

One huge blocker is removing the AioContext lock from the block
API (bdrv_* and friends).
I identified two initial and main candidates that need to lose the
AioContext protection:
- bdrv_replace_child_noperm
- bdrv_try_set_aio_context

Once these two functions can safely run without the AioContext lock, we
will have gotten rid of the majority of its usage.
The main issue is: what can we use as a replacement?

Let's analyze bdrv_replace_child_noperm (probably the tougher of the
two): this function performs a graph modification, removing a child from
one bs and putting it under another. It modifies the bs' ->parents and
->children lists, and it definitely needs protection because these
lists are also read by iothreads in parallel.
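
To give an idea of what needs protecting, here is a very rough sketch of
the kind of list manipulation involved; the helper name and its body are
simplified and partly hypothetical, not the exact block.c code:

#include "block/block_int.h"

/*
 * Very simplified, hypothetical sketch of the graph update done by
 * bdrv_replace_child_noperm(); the real block.c code also handles
 * quiescing, permissions and more.
 */
static void replace_child_sketch(BdrvChild *child, BlockDriverState *new_bs)
{
    BlockDriverState *old_bs = child->bs;

    if (old_bs) {
        /* Unlink this edge from the old child's ->parents list */
        QLIST_REMOVE(child, next_parent);
    }
    child->bs = new_bs;
    if (new_bs) {
        /* Link the edge into the new child's ->parents list */
        QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
    }
    /*
     * An iothread iterating ->parents or ->children right now would see
     * the lists mid-update, hence the need for protection.
     */
}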

Possible candidates to use as a replacement:

- rwlock. With the help of Paolo, I implemented a rwlock optimized for
many fast readers and few writers, which is ideal for
bdrv_replace_child_noperm. However, the problem here is that when a
writer has to wait for other readers to finish (since it needs exclusive
access), it has to run a nested event loop to allow the others (readers
included) to make progress.
And this brings serious complications, because polling with the write
lock taken is prone to a lot of deadlocks, including the fact that the
AioContext lock is still needed in AIO_WAIT_WHILE. The solution would be
to run everything, readers included, in coroutines. However, this is not
easy either: long story short, switching the BlockDriverState callbacks
to coroutines is a big problem, as the AioContext lock is still taken in
many of the callbacks' callers, and therefore calling them from a
coroutine creates a mixture of locks taken that simply results in
deadlocks.
Ideally we would first get rid of the AioContext lock and then switch to
coroutines, but getting rid of the lock is the whole point of the rwlock.
More on this here:
https://patchew.org/QEMU/20220426085114.199647-1-eesposit@redhat.com/#cc5e12d1-d25f-d338-bff2-0d3f5cc0def7@redhat.com
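
To make the intended usage concrete, here is a sketch of how the rwlock
would be used around graph accesses. The graph_rdlock()/graph_wrlock()
names are hypothetical placeholders for the proposed primitives, not an
existing API:

#include "block/block_int.h"

/* Hypothetical primitives of the proposed graph rwlock */
void graph_rdlock(void);
void graph_rdunlock(void);
void graph_wrlock(void);    /* waits for all readers to leave */
void graph_wrunlock(void);

/* Reader side, e.g. an iothread walking the graph */
void reader_sketch(BlockDriverState *bs)
{
    BdrvChild *c;

    graph_rdlock();                 /* cheap, many readers in parallel */
    QLIST_FOREACH(c, &bs->parents, next_parent) {
        /* ... read-only graph traversal ... */
    }
    graph_rdunlock();
}

/* Writer side, e.g. the main loop modifying the graph */
void writer_sketch(void)
{
    /*
     * graph_wrlock() has to wait for in-flight readers; that wait is the
     * nested event loop (AIO_WAIT_WHILE) mentioned above, and it is where
     * the deadlock problems come from.
     */
    graph_wrlock();
    /* ... bdrv_replace_child_noperm() and similar graph updates ... */
    graph_wrunlock();
}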

But I would say this is not an ideal candidate to replace the AioContext
lock. At least not in the immediate future.

- drains. This was the initial and is still the main lead. Using
bdrv_drained_begin/end we are sure that a node and all its parents will
be paused (jobs included), no further I/O will come in since it is
temporarily disabled, and all in-flight requests are guaranteed to be
finished by the time bdrv_drained_begin returns.
Even better than bdrv_drained_begin, I proposed using
bdrv_subtree_drained_begin, which also stops and protects the children
of a node.
I think the major drawback of this is that we need to be sure that there
are no cases where draining is not enough. Together with Kevin and Stefan
we identified that we need to prevent drain from being called in
coroutines, regardless of which AioContext they run in. That's because
they could allow parallel drain/graph reading to happen: for example
(thinking about the general case), a coroutine yielding after calling
drain_begin and in the middle of a graph modification could allow another
coroutine to drain/read the graph.
Note that draining itself also involves reading the graph.
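
As a minimal sketch of the drained-section approach (assuming the graph
writer runs in the main loop, outside a coroutine):

#include "block/block.h"

/*
 * Sketch of the drained-section pattern, assuming we run in the main
 * loop and not in a coroutine: a coroutine yielding between these calls
 * could let another coroutine drain or read the graph mid-modification.
 */
void graph_update_under_drain_sketch(BlockDriverState *bs)
{
    bdrv_subtree_drained_begin(bs); /* pause bs, its parents and children */

    /*
     * No new I/O arrives and no request is in flight here, so the
     * ->parents and ->children lists can be modified safely.
     */

    bdrv_subtree_drained_end(bs);
}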

We thought the only usage of draining from coroutines was in the mirror
run() callback. However, that is just the tip of the iceberg.
Other functions like the .bdrv_open callbacks (qcow2_open, for example)
create coroutines to execute part of their logic, for valid performance
reasons (we don't want to block when we could simply yield and allow
something else to run).
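
For reference, the pattern looks roughly like this; the struct and
function names are illustrative, loosely based on block/qcow2.c:

#include "qemu/osdep.h"
#include "block/block_int.h"

/*
 * Simplified sketch of how a .bdrv_open callback defers part of its work
 * to a coroutine so it can yield on I/O instead of blocking.
 */
typedef struct OpenCo {
    BlockDriverState *bs;
    int ret;
} OpenCo;

static void coroutine_fn open_entry(void *opaque)
{
    OpenCo *oc = opaque;

    /* ... the actual open logic, free to yield on I/O ... */
    oc->ret = 0;
}

static int example_open(BlockDriverState *bs, QDict *options, int flags,
                        Error **errp)
{
    OpenCo oc = { .bs = bs, .ret = -EINPROGRESS };

    if (qemu_in_coroutine()) {
        /* Already in coroutine context: run the logic directly */
        open_entry(&oc);
    } else {
        /* Spawn a coroutine and poll until it has finished */
        qemu_coroutine_enter(qemu_coroutine_create(open_entry, &oc));
        BDRV_POLL_WHILE(bs, oc.ret == -EINPROGRESS);
    }
    return oc.ret;
}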

So another question is: what could we do to solve this coroutine issue?
Ideas?

Main drain series:
https://patchew.org/QEMU/20220118162738.1366281-1-eesposit@redhat.com/
[1]



[1] = https://patchew.org/QEMU/20220301142113.163174-1-eesposit@redhat.com/

Thank you,
Emanuele





* Re: AioContext lock removal: help needed
  2022-07-08  8:42 AioContext lock removal: help needed Emanuele Giuseppe Esposito
@ 2022-07-08  9:42 ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 2+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-07-08  9:42 UTC (permalink / raw)
  To: open list:Block layer core, Vladimir Sementsov-Ogievskiy
  Cc: Kevin Wolf, Hanna Reitz, Paolo Bonzini, Stefan Hajnoczi,
	Alberto Faria, qemu-devel



On 08/07/2022 at 10:42, Emanuele Giuseppe Esposito wrote:
> [...]
> More on this here:
> https://patchew.org/QEMU/20220426085114.199647-1-eesposit@redhat.com/#cc5e12d1-d25f-d338-bff2-0d3f5cc0def7@redhat.com

This is also very useful (on the same thread as above):
https://patchew.org/QEMU/20220426085114.199647-1-eesposit@redhat.com/#6fc3e40e-7682-b9dc-f789-3ca95e0430db@redhat.com




