Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept

From: Hanna Reitz <hreitz@redhat.com>
To: Emanuele Giuseppe Esposito <eesposit@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>, Kevin Wolf <kwolf@redhat.com>,
	Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Paolo Bonzini <pbonzini@redhat.com>, John Snow <jsnow@redhat.com>
Subject: Re: [RFC PATCH 0/5] Removal of AioContext lock, bs->parents and ->children: proof of concept
Date: Wed, 30 Mar 2022 16:12:30 +0200
Message-ID: <c7f6fe7e-c309-010a-eaba-549fbfcb45ce@redhat.com>
In-Reply-To: <a4d3fc47-0769-7d11-47aa-a1c4ac503406@redhat.com>

On 30.03.22 13:55, Emanuele Giuseppe Esposito wrote:
>
> Am 30/03/2022 um 12:53 schrieb Hanna Reitz:
>> On 17.03.22 17:23, Emanuele Giuseppe Esposito wrote:
>>> Am 09/03/2022 um 14:26 schrieb Emanuele Giuseppe Esposito:
>>>>>> * Drains allow the caller (either main loop or iothread running
>>>>>> the context) to wait for all in-flight requests and operations
>>>>>> of a BDS: normal drains target a given node and its parents, while
>>>>>> subtree ones also include the subgraph of the node. Siblings are
>>>>>> not affected by either of these two kinds of drains.
>>>>> Siblings are drained to the extent required for their parent node to
>>>>> reach in_flight == 0.
>>>>>
>>>>> I haven't checked the code but I guess the case you're alluding to is
>>>>> that siblings with multiple parents could have other I/O in flight that
>>>>> will not be drained and further I/O can be submitted after the parent
>>>>> has drained?
>>>> Yes, in theory this can happen. I don't really know whether it happens
>>>> in practice, or how likely it is to happen.
>>>>
>>>> The alternative would be to make a drain that blocks the whole graph,
>>>> siblings included, but that would probably be overkill.
>>>>
>>> So I have thought about this, and I think maybe this is not a concrete
>>> problem.
>>> Suppose we have a graph where "parent" has 2 children: "child" and
>>> "sibling". "sibling" also has a blockjob.
>>>
>>> Now, main loop wants to modify parent-child relation and maybe detach
>>> child from parent.
>>>
>>> 1st wrong assumption: the sibling is not drained. Actually my strategy
>>> takes into account draining both nodes, also because parent could be in
>>> another graph. Therefore sibling is drained.
>>>
>>> But let's assume "sibling" is the sibling of the parent.
>>>
>>> Therefore we have
>>> "child" -> "parent" -> "grandparent"
>>> and
>>> "blockjob" -> "sibling" -> "grandparent"
>>>
>>> The issue is the following: main loop can't drain "sibling", because
>>> subtree_drained does not reach it. Therefore blockjob can still run
>>> while main loop modifies "child" -> "parent". Blockjob can either:
>>> 1) drain, but this won't affect "child" -> "parent"
>>> 2) read the graph in ways other than drain, for example
>>> .set_aio_context recursively touches the whole graph.
>>> 3) write the graph.
>> I don’t really understand the problem here.  If the block job only
>> operates on the sibling subgraph, why would it care what’s going on in
>> the other subgraph?
> We are talking about something that probably does not happen, but what
> if it calls a callback similar to .set_aio_context that goes through the
> whole graph?

Hm.  Quite unfortunate if such a callback can operate on drained nodes, 
I’d say.  Ideally callbacks wouldn’t do that, but probably they will. :/

> Even though the first question is: is there such callback?

I mean, you could say any callback qualifies.  Draining a node will only 
drain its recursive parents, so siblings are not affected.  If the 
sibling issues the callback on its parent...  (E.g. changes in the 
backing chain requiring a qcow2 parent node to change the backing file 
string in its image file)
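
(A toy model of that propagation, purely for illustration. ToyNode and 
the toy_* helpers are made up for this mail; they are not QEMU's 
BlockDriverState or the real bdrv_drained_begin() implementation, and 
the sketch ignores in-flight requests entirely.  It only shows which 
nodes end up marked as drained.)

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PARENTS 4

/* Hypothetical stand-in for a block node; NOT QEMU's BlockDriverState. */
typedef struct ToyNode {
    const char *name;
    struct ToyNode *parents[TOY_MAX_PARENTS];
    size_t n_parents;
    int quiesce_counter;    /* > 0 means "currently drained" */
} ToyNode;

/* Record that @parent is a parent of @child. */
static void toy_add_parent(ToyNode *child, ToyNode *parent)
{
    assert(child->n_parents < TOY_MAX_PARENTS);
    child->parents[child->n_parents++] = parent;
}

/* Mark @node and, recursively, all of its parents as drained. */
static void toy_drained_begin(ToyNode *node)
{
    node->quiesce_counter++;
    for (size_t i = 0; i < node->n_parents; i++) {
        toy_drained_begin(node->parents[i]);
    }
}

/* Undo one toy_drained_begin() on @node and its recursive parents. */
static void toy_drained_end(ToyNode *node)
{
    assert(node->quiesce_counter > 0);
    node->quiesce_counter--;
    for (size_t i = 0; i < node->n_parents; i++) {
        toy_drained_end(node->parents[i]);
    }
}

static int toy_is_drained(const ToyNode *node)
{
    return node->quiesce_counter > 0;
}
```

With one parent and two children, toy_drained_begin() on the first 
child marks that child and the parent, but never visits the second 
child.  So the second child is free to call back into a parent that 
considers itself drained.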

> A second, even more unrealistic case is when a job randomly looks for a bs
> in another connectivity component and for example drains it.
> Again probably impossible.

I hope so, but the block layer sure likes to surprise me.

>> Block jobs should own all nodes that are associated with them (e.g.
>> because they intend to drop or replace them when the job is done), so
>> when part of the graph is drained, all jobs that could modify that part
>> should be drained, too.
> What do you mean by "own"?

They’re added with block_job_add_bdrv(), and then are children of the 
BlockJob object.

>>> 3) can be only performed in the main loop, because it's a graph
>>> operation. It means that the blockjob runs when the graph modifying
>>> coroutine/bh is not running. They never run together.
>>> The safety of this operation relies on where the drains are and will be
>>> inserted. If you do as in my patch "block.c:
>>> bdrv_replace_child_noperm: first call ->attach(), and then add child",
>>> then we would have a problem, because we drain between two writes, and the
>>> blockjob will find an inconsistent graph. If we do it as we seem to do
>>> it so far, then we won't really have any problem.
>>>
>>> 2) is a read, and can theoretically be performed by another thread. But
>>> is there a function that does that? .set_aio_context for example is a GS
>>> function, so we will fall back to case 3) and nothing bad would happen.
>>>
>>> Is there a counter example for this?
>>>
>>> -----------
>>>
>>> Talking about something else, I discussed with Kevin what *seems* to be
>>> an alternative way to do this, instead of adding drains everywhere.
>>> His idea is to replicate what blk_wait_while_drained() currently does
>>> but on a larger scale. It is something in between this subtree_drains
>>> logic and a rwlock.
>>>
>>> Basically, if I understood correctly, we could implement
>>> bdrv_wait_while_drained(), and put it in all places where we would put a
>>> read lock: all the reads of ->parents and ->children.
>>> This function detects if the bdrv is under drain, and if so it will stop
>>> and wait until the drain (i.e. the graph modification) finishes.
>>> On the other side, each write would just need to drain probably both
>>> nodes (simple drain), to signal that we are modifying the graph. Once
>>> bdrv_drained_begin() finishes, we are sure all coroutines are stopped.
>>> Once bdrv_drained_end() finishes, we automatically let all coroutines
>>> restart, and continue where they left off.
>>>
>>> Seems a good compromise between drains and rwlock. What do you think?
>> Well, sounds complicated.  So I’m asking myself whether this would be
>> noticeably better than just an RwLock for graph modifications, like the
>> global lock Vladimir has proposed.
> But the point is then: aren't we re-inventing an AioContext lock?

I don’t know how AioContext locks would even help with graph changes.  
If I want to change a block subgraph that’s in a different I/O thread, 
locking that thread isn’t enough (I would’ve thought); because I have no 
idea what the thread is doing when I’m locking it.  Perhaps it’s 
iterating through ->children right now (with some yields in between), 
and by pausing it, changing the graph, and then resuming it, it’ll still 
cause problems.

> the lock will protect not only ->parents and ->children, but also other
> bdrv fields that are concurrently read/written.

I would’ve thought this lock should only protect ->parents and ->children.

> I don't know, it seems to me that there is a lot of uncertainty about
> which way to go...

Definitely. :)

I wouldn’t call that a bad thing, necessarily.  Let’s look at the 
positive side: There are many ideas!

Hanna