Re: [PATCH v2] block/stream: Drain subtree around graph change

From: Hanna Reitz <hreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>,
	qemu-block@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
	Emanuele Giuseppe Esposito <eesposit@redhat.com>,
	John Snow <jsnow@redhat.com>,
	qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [PATCH v2] block/stream: Drain subtree around graph change
Date: Tue, 29 Mar 2022 14:15:56 +0200	[thread overview]
Message-ID: <7c4aa4e1-4809-628b-1a4c-ccd5a8aa89ce@redhat.com> (raw)
In-Reply-To: <cba4cda5-c70a-e2e3-9a10-a60b418e153a@mail.ru>

On 29.03.22 11:55, Vladimir Sementsov-Ogievskiy wrote:
> 29.03.2022 11:54, Hanna Reitz wrote:
>> On 28.03.22 12:24, Vladimir Sementsov-Ogievskiy wrote:
>>> 28.03.2022 11:09, Hanna Reitz wrote:
>>>> On 28.03.22 09:44, Hanna Reitz wrote:
>>>>> On 25.03.22 17:37, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> 24.03.2022 17:09, Hanna Reitz wrote:
>>>>>>> When the stream block job cuts out the nodes between top and 
>>>>>>> base in
>>>>>>> stream_prepare(), it does not drain the subtree manually; it 
>>>>>>> fetches the
>>>>>>> base node, and tries to insert it as the top node's backing node 
>>>>>>> with
>>>>>>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will 
>>>>>>> drain, and so
>>>>>>> the actual base node might change (because the base node is 
>>>>>>> actually not
>>>>>>> part of the stream job) before the old base node passed to
>>>>>>> bdrv_set_backing_hd() is installed.
>>>>>>>
>>>>>>> This has two implications:
>>>>>>>
>>>>>>> First, the stream job does not keep a strong reference to the 
>>>>>>> base node.
>>>>>>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>>>>>>> because some other block job is drained to finish), we will get a
>>>>>>> use-after-free.  We should keep a strong reference to that node.
>>>>>>>
>>>>>>> Second, even with such a strong reference, the problem remains 
>>>>>>> that the
>>>>>>> base node might change before bdrv_set_backing_hd() actually 
>>>>>>> runs and as
>>>>>>> a result the wrong base node is installed.
>>>>>>
>>>>>> Hmm.
>>>>>>
>>>>>> So, we don't really need a strong reference, as if it helps to 
>>>>>> avoid some use-after-free, it means that we'll finish up with 
>>>>>> wrong block graph..
>>>>>
>>>>> Sure.  But I found it better style to strongly reference a node 
>>>>> while it’s used.  I’d rather have an outdated block graph (as in: 
>>>>> A node that was supposed to disappear would still be in use) than 
>>>>> a use-after-free.
>>>>>
>>>>>> Graph modifying operations must be somehow isolated from each other.
>>>>>>
>>>>>>>
>>>>>>> Both effects can be seen in 030's 
>>>>>>> TestParallelOps.test_overlapping_5()
>>>>>>> case, which has five nodes, and simultaneously streams from the 
>>>>>>> middle
>>>>>>> node to the top node, and commits the middle node down to the 
>>>>>>> base node.
>>>>>>> As it is, this will sometimes crash, namely when we encounter the
>>>>>>> above-described use-after-free.
>>>>>>>
>>>>>>> Taking a strong reference to the base node, we no longer get a 
>>>>>>> crash,
>>>>>>> but the resuling block graph is less than ideal: The expected 
>>>>>>> result is
>>>>>>> obviously that all middle nodes are cut out and the base node is 
>>>>>>> the
>>>>>>> immediate backing child of the top node.  However, if 
>>>>>>> stream_prepare()
>>>>>>> takes a strong reference to its base node (the middle node), and 
>>>>>>> then
>>>>>>> the commit job finishes in bdrv_set_backing_hd(), supposedly 
>>>>>>> dropping
>>>>>>> that middle node, the stream job will just reinstall it again.
>>>>>>>
>>>>>>> Therefore, we need to keep the whole subtree drained in
>>>>>>> stream_prepare(), so that the graph modification it performs is
>>>>>>> effectively atomic, i.e. that the base node it fetches is still 
>>>>>>> the base
>>>>>>> node when bdrv_set_backing_hd() sets it as the top node's 
>>>>>>> backing node.
>>>>>>
>>>>>> Emanuele has similar idea of isolating graph changes from each 
>>>>>> other by subtree-drain.
>>>>>>
>>>>>> If I understand correctly the idea is that we'll drain all other 
>>>>>> block jobs, so the wouldn't do their block-graph modification 
>>>>>> during drained section. So, we can safely modify the graph.
>>>>>>
>>>>>> I don't like this idea:
>>>>>>
>>>>>> 1. drained section = stop IO. But we don't need to stop IO in the 
>>>>>> whole subtree to do a needed block-graph modification.
>>>>>
>>>>> If you mean to say that draining just the single node should be 
>>>>> sufficient, I’ll be happy to change it.
>>>>>
>>>>> Not sure which node, though, because I’d think it would be `base`, 
>>>>> but to safely fetch it I’d need to drain it, which seems to bite 
>>>>> itself in the tail.  That’s why I went for a subtree drain from 
>>>>> `above_base`.
>>>>>
>>>>>> 2. Drained section is not a lock, several clients may drain same 
>>>>>> set of nodes.. So we exploit the fact that concurrent clients 
>>>>>> will be paused by drained section and don't proceed to 
>>>>>> graph-modification code.. But are we sure that block-jobs are 
>>>>>> (and will be?) the only concurrent block-graph modifying clients? 
>>>>>> Can qmp commands interleave somehow?
>>>>>
>>>>> They can under very specific circumstances and that’s a bug. See 
>>>>> https://lists.nongnu.org/archive/html/qemu-block/2022-03/msg00582.html 
>>>>> .
>>>>>
>>>>>> Can some jobs from other subtree start a block-graph modification 
>>>>>> that touches our subtree?
>>>>>
>>>>> That would be wrong.  A block job shouldn’t change nodes it 
>>>>> doesn’t own; stream doesn’t own the base, but it also doesn’t 
>>>>> change it, it only needs to have the top node point to it.
>>>>>
>>>>>> If go this way, that would be more safe to drain the whole 
>>>>>> block-graph on any block-graph modification..
>>>>>>
>>>>>> I think we'd better have a separate global mechanism for 
>>>>>> isolating graph modifications. Something like a global co-mutex 
>>>>>> or queue, where clients waits for their turn in block graph 
>>>>>> modifications.
>>>>>>
>>>>>> Here is my old proposal on that topic: 
>>>>>> https://patchew.org/QEMU/20201120161622.1537-1-vsementsov@virtuozzo.com/ 
>>>>>>
>>>>>
>>>>> That would only solve the very specific issue in 030, right?
>>>
>>> It should solve the same issue as your patch. You don't add 
>>> subtree_drain around every graph modification.. Or we already have it?
>>
>> Well, I’m not saying it will solve every single bug, but draining in 
>> stream_prepare() will at least mean that that is safe from basically 
>> anything else, because it will prevent concurrent automatic graph 
>> changes (e.g. because of jobs finishing), and QMP commands shouldn’t 
>> be executed in drained sections either (when they do, it’s wrong, but 
>> that seems to occur only extremely rarely).  Draining alone should 
>> make a place safe, it isn’t a lock that all sides need to take.
>>
>>>>>   The stream job isn’t protected from any graph modifications but 
>>>>> those coming from mirror.  Might be a solution going forward (I 
>>>>> didn’t look closer at it at the time, given I saw you had a 
>>>>> discussion with Kevin), if we lock every graph change operation 
>>>>> (though a global lock honestly doesn’t sound strictly better than 
>>>>> draining subsections of the graph, both have their drawbacks), but 
>>>>> that doesn’t look like it’d be something for 7.1.
>>>
>>> Same way, with draining solution you should make a subtree-drain for 
>>> every graph change operation.
>>
>> Since we don’t have any lock yet, draining is the de-facto way we use 
>> to forbid concurrent graph modifications.  I’m not saying we use it 
>> correctly and thoroughly, but it is what we do right now.
>>
>>>
>>>>
>>>> I wonder whether we could have a short-term version of 
>>>> `BdrvChild.frozen` that’s a coroutine mutex.  If `.frozen` is set, 
>>>> you just can’t change the graph, and you also can’t wait, so that’s 
>>>> just an error.  But if `.frozen_lock` is set, you can wait on it. 
>>>> Here, we’d keep `.frozen` set for all links between top and 
>>>> above_base, and then in prepare() take `.frozen_lock` on the link 
>>>> between above_base and base.
>>>>
>>>
>>> Yes that's seems an alternative to global lock, that doesn't block 
>>> the whole graph. Still, I don't think that is bad to lock the whole 
>>> graph for graph modificaiton, as modification should be rare and fast.
>>
>> Fair enough.
>>
>>> Another thought: does subtree-drain really drain the whole 
>>> connectivity component of the graph?
>>>
>>> imagine something like this:
>>>
>>> [A]  [   C  ]
>>>  |    |    |
>>>  v    v    v
>>> [ B    ]  [ D ]
>>>
>>>
>>> If we do subtree drain at node A, this will drain B and C, but not D..
>>>
>>> Imagine, some another job is attached to node D, and it will start 
>>> drained section too. So, for example both jobs will share drained 
>>> section on node C. That doesn't seem save, and draining is not a lock.
>>>
>>> So, if we are going to rely on drained section as on lock, that 
>>> isolates graph modifications from each other, we should drain the 
>>> whole connectivity component of the graph.
>>
>> The drained section is not a lock, but if the other job is only 
>> attached to node D, it won’t change node C.  It might change the link 
>> from C to D, but that doesn’t concern the job that is concerned about 
>> A and B. Overlapping drains are fine.
>
> OK. Maybe it works. It's just not obvious to me that subtree_drain 
> works good in all cases. And global graph-modification-lock should 
> obviously work.
>
>>
>>> Next, I'm not relly sure that two jobs can simultaneusly enter 
>>> drained section and do graph modifications. What prevents this? 
>>> Assume two block-stream jobs reaches their finish simultaneously and 
>>> go to subtree-drain. That just means that job_pause will be called 
>>> for both jobs.. But what that means for the block-stream jobs that 
>>> is in bdrv_subtree_drained_beeing() call in stream_prepare()? Seems 
>>> nothing? Then both jobs will start graph modification process 
>>> simultaneously and can interleave on any yield point (for exmaple 
>>> rewriting backing_file in qcow2 metadata).
>>
>> So I don’t think that scenario can really happen, because the stream 
>> job freezes the chain between above_base and top, so you can’t really 
>> have two simultaneous stream jobs that cause problems between each 
>> other.
>
> They cause problem on the boundary, as base of one stream job may be 
> top of another, and we have also a filter, that should be 
> inserted/removed at some moment. As I remember, that's the problematic 
> case in 030..
>
>>
>> Furthermore, the prepare() functions are run in the main thread, so 
>> the only real danger is actually that draining around the actual 
>> graph modification (bdrv_set_backing_hd()) causes another block job 
>> to finish, modifying the block graph.  But then that job will also 
>> actually finish, because it’s all in the main thread.
>>
>> It is true that child_job_drained_poll() says that job that are about 
>> to prepare() are quiesced, but I don’t think that’s a problem, given 
>> that all jobs in that state run in the main thread.
>>
>>>
>>> Another reason, why I think that subtree drain is a wrong tool, as I 
>>> said, is extra IO-stop.
>>
>> I know and agree, but that’s an optimization question.
>>
>>> Imaging the following graph:
>>>
>>> [A]
>>>  |
>>>  v
>>> [B] [C]
>>>  |   |
>>>  v   v
>>> [base]
>>>
>>> If we want to rebase A onto base, we actually need only stop IO 
>>> requests in A and B. Why C should suffer from this graph 
>>> modification? IO requests produced by C, and that are living in C 
>>> and in base don't intersect with rebasing A on base process in any way.
>>>
>>> ====
>>>
>>> Actually, I'm not strictly against your patch, and believe that it 
>>> fixes the problem in most cases. And it's probably OK in short term. 
>>> The only real doubt on including it now is that drained sections 
>>> sometimes lead to dead locks, and is it possible that we now fix the 
>>> bug that happens only in iotest 30 (or is it reported somewhere?) 
>>> and risking to introduce some dead-lock?
>>
>> Saying that the example in 030 is contrived would mean we 
>> could/should re-include the base into the list of nodes that belong 
>> to the stream job, which would simply disallow the case in 030 that’s 
>> being tested and fails.
>>
>> Then we don’t need a subtree drain, and the plain drain in 
>> bdrv_set_backing_hd() would be fine.
>>
>>> Seems that if in some code it's safe to call drained_begin(), it 
>>> should be safe to call subtree_drained_begin(). And if it trigger 
>>> some deadlock, it just shows some another bug.. Is it worth fixing 
>>> now, near to 7.0 release? We live with this bug for years.. Or 
>>> something changed that I missed?
>>
>> I mean...  I can understand your concern that adding a subtree drain 
>> has performance implications (when a stream job ends, which shouldn’t 
>> be often).  But I’m not sure whether I should share the deadlock 
>> concern. Sounds like a sad state of affairs if I can’t just drain 
>> something when I need it to be drained.
>>
>> I wasn’t aware of this bug, actually.  Now I am, and I feel rather 
>> uncomfortable living with a use-after-free bug, because that’s on the 
>> worse end of the bug spectrum.
>>
>
> OK, I don't know:) And others are silent. Agree that global-lock 
> solution is not for 7.0 anyway. And this one is simple enough. So, 
> take my
>
> Acked-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>

Thanks!

A global lock solution sounds good to me for 7.1+, and it even provides 
a solution to solving the problem with QMP commands being executed while 
other main-thread code is running.  (I mean, the QMP command would still 
be executed, but at least concurrent graph changes would be impossible.)