Re: [PATCH v2] block/stream: Drain subtree around graph change

From: Hanna Reitz <hreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>,
	qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
	Emanuele Giuseppe Esposito <eesposit@redhat.com>,
	John Snow <jsnow@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Mon, 28 Mar 2022 10:09:48 +0200	[thread overview]
Message-ID: <8e173cec-d3a1-f8a0-bc07-12b2cfd39d89@redhat.com> (raw)
In-Reply-To: <f4d5e653-9dea-539d-136d-1f36fec597f2@redhat.com>

On 28.03.22 09:44, Hanna Reitz wrote:
> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>> 24.03.2022 17:09, Hanna Reitz wrote:
>>> When the stream block job cuts out the nodes between top and base in
>>> stream_prepare(), it does not drain the subtree manually; it fetches 
>>> the
>>> base node, and tries to insert it as the top node's backing node with
>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, 
>>> and so
>>> the actual base node might change (because the base node is actually 
>>> not
>>> part of the stream job) before the old base node passed to
>>> bdrv_set_backing_hd() is installed.
>>>
>>> This has two implications:
>>>
>>> First, the stream job does not keep a strong reference to the base 
>>> node.
>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>> because some other block job is drained to finish), we will get a
>>> use-after-free.  We should keep a strong reference to that node.
>>>
>>> Second, even with such a strong reference, the problem remains that the
>>> base node might change before bdrv_set_backing_hd() actually runs 
>>> and as
>>> a result the wrong base node is installed.
>>
>> Hmm.
>>
>> So, we don't really need a strong reference, as if it helps to avoid 
>> some use-after-free, it means that we'll finish up with wrong block 
>> graph..
>
> Sure.  But I found it better style to strongly reference a node while 
> it’s used.  I’d rather have an outdated block graph (as in: A node 
> that was supposed to disappear would still be in use) than a 
> use-after-free.
>
>> Graph modifying operations must be somehow isolated from each other.
>>
>>>
>>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>>> case, which has five nodes, and simultaneously streams from the middle
>>> node to the top node, and commits the middle node down to the base 
>>> node.
>>> As it is, this will sometimes crash, namely when we encounter the
>>> above-described use-after-free.
>>>
>>> Taking a strong reference to the base node, we no longer get a crash,
>>> but the resuling block graph is less than ideal: The expected result is
>>> obviously that all middle nodes are cut out and the base node is the
>>> immediate backing child of the top node.  However, if stream_prepare()
>>> takes a strong reference to its base node (the middle node), and then
>>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>>> that middle node, the stream job will just reinstall it again.
>>>
>>> Therefore, we need to keep the whole subtree drained in
>>> stream_prepare(), so that the graph modification it performs is
>>> effectively atomic, i.e. that the base node it fetches is still the 
>>> base
>>> node when bdrv_set_backing_hd() sets it as the top node's backing node.
>>
>> Emanuele has similar idea of isolating graph changes from each other 
>> by subtree-drain.
>>
>> If I understand correctly the idea is that we'll drain all other 
>> block jobs, so the wouldn't do their block-graph modification during 
>> drained section. So, we can safely modify the graph.
>>
>> I don't like this idea:
>>
>> 1. drained section = stop IO. But we don't need to stop IO in the 
>> whole subtree to do a needed block-graph modification.
>
> If you mean to say that draining just the single node should be 
> sufficient, I’ll be happy to change it.
>
> Not sure which node, though, because I’d think it would be `base`, but 
> to safely fetch it I’d need to drain it, which seems to bite itself in 
> the tail.  That’s why I went for a subtree drain from `above_base`.
>
>> 2. Drained section is not a lock, several clients may drain same set 
>> of nodes.. So we exploit the fact that concurrent clients will be 
>> paused by drained section and don't proceed to graph-modification 
>> code.. But are we sure that block-jobs are (and will be?) the only 
>> concurrent block-graph modifying clients? Can qmp commands interleave 
>> somehow?
>
> They can under very specific circumstances and that’s a bug.  See 
> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html .
>
>> Can some jobs from other subtree start a block-graph modification 
>> that touches our subtree?
>
> That would be wrong.  A block job shouldn’t change nodes it doesn’t 
> own; stream doesn’t own the base, but it also doesn’t change it, it 
> only needs to have the top node point to it.
>
>> If go this way, that would be more safe to drain the whole 
>> block-graph on any block-graph modification..
>>
>> I think we'd better have a separate global mechanism for isolating 
>> graph modifications. Something like a global co-mutex or queue, where 
>> clients waits for their turn in block graph modifications.
>>
>> Here is my old proposal on that topic: 
>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/
>
> That would only solve the very specific issue in 030, right?  The 
> stream job isn’t protected from any graph modifications but those 
> coming from mirror.  Might be a solution going forward (I didn’t look 
> closer at it at the time, given I saw you had a discussion with 
> Kevin), if we lock every graph change operation (though a global lock 
> honestly doesn’t sound strictly better than draining subsections of 
> the graph, both have their drawbacks), but that doesn’t look like it’d 
> be something for 7.1.

I wonder whether we could have a short-term version of 
`BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, you 
just can’t change the graph, and you also can’t wait, so that’s just an 
error.  But if `.frozen_lock` is set, you can wait on it. Here, we’d 
keep `.frozen` set for all links between top and above_base, and then in 
prepare() take `.frozen_lock` on the link between above_base and base.