* [PATCH] block/stream: Drain subtree around graph change
@ 2022-03-24 12:57 Hanna Reitz
  2022-04-05 10:14 ` Kevin Wolf
  0 siblings, 1 reply; 12+ messages in thread
From: Hanna Reitz @ 2022-03-24 12:57 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Hanna Reitz, Vladimir Sementsov-Ogievskiy, John Snow,
	qemu-devel

When the stream block job cuts out the nodes between top and base in
stream_prepare(), it does not drain the subtree manually; it fetches the
base node, and tries to insert it as the top node's backing node with
bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
the actual base node might change (because the base node is actually not
part of the stream job) before the old base node passed to
bdrv_set_backing_hd() is installed.

This has two implications:

First, the stream job does not keep a strong reference to the base node.
Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
because some other block job is drained to finish), we will get a
use-after-free.  We should keep a strong reference to that node.

Second, even with such a strong reference, the problem remains that the
base node might change before bdrv_set_backing_hd() actually runs and as
a result the wrong base node is installed.

Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
case, which has five nodes, and simultaneously streams from the middle
node to the top node, and commits the middle node down to the base node.
As it is, this will sometimes crash, namely when we encounter the
above-described use-after-free.

Taking a strong reference to the base node, we no longer get a crash,
but the resulting block graph is less than ideal: The expected result is
obviously that all middle nodes are cut out and the base node is the
immediate backing child of the top node.  However, if stream_prepare()
takes a strong reference to its base node (the middle node), and then
the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
that middle node, the stream job will just reinstall it again.

Therefore, we need to keep the whole subtree drained in
stream_prepare(), so that the graph modification it performs is
effectively atomic, i.e. that the base node it fetches is still the base
node when bdrv_set_backing_hd() sets it as the top node's backing node.

Verify this by asserting in said 030's test case that the base node is
always the top node's immediate backing child when both jobs are done.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
---
 block/stream.c         | 11 ++++++++++-
 tests/qemu-iotests/030 |  5 +++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/stream.c b/block/stream.c
index 3acb59fe6a..5c08208db9 100644
--- a/block/stream.c
+++ b/block/stream.c
@@ -64,7 +64,11 @@ static int stream_prepare(Job *job)
     bdrv_cor_filter_drop(s->cor_filter_bs);
     s->cor_filter_bs = NULL;
 
+    bdrv_subtree_drained_begin(s->above_base);
+
     base = bdrv_filter_or_cow_bs(s->above_base);
+    bdrv_ref(base);
+
     unfiltered_base = bdrv_skip_filters(base);
 
     if (bdrv_cow_child(unfiltered_bs)) {
@@ -75,14 +79,19 @@ static int stream_prepare(Job *job)
                 base_fmt = unfiltered_base->drv->format_name;
             }
         }
+
         bdrv_set_backing_hd(unfiltered_bs, base, &local_err);
         ret = bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);
         if (local_err) {
             error_report_err(local_err);
-            return -EPERM;
+            ret = -EPERM;
+            goto out;
         }
     }
 
+out:
+    bdrv_unref(base);
+    bdrv_subtree_drained_end(s->above_base);
     return ret;
 }
 
diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 567bf1da67..14112835ed 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -436,6 +436,11 @@ class TestParallelOps(iotests.QMPTestCase):
         self.vm.run_job(job='node4', auto_dismiss=True)
         self.assert_no_active_block_jobs()
 
+        # Assert that node0 is now the backing node of node4
+        result = self.vm.qmp('query-named-block-nodes')
+        node4 = next(node for node in result['return'] if node['node-name'] == 'node4')
+        self.assertEqual(node4['image']['backing-image']['filename'], self.imgs[0])
+
     # Test a block-stream and a block-commit job in parallel
     # Here the stream job is supposed to finish quickly in order to reproduce
     # the scenario that triggers the bug fixed in 3d5d319e1221 and 1a63a907507
-- 
2.35.1




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-03-24 12:57 [PATCH] block/stream: Drain subtree around graph change Hanna Reitz
@ 2022-04-05 10:14 ` Kevin Wolf
  2022-04-05 11:47   ` Hanna Reitz
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Kevin Wolf @ 2022-04-05 10:14 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: eesposit, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block

Am 24.03.2022 um 13:57 hat Hanna Reitz geschrieben:
> When the stream block job cuts out the nodes between top and base in
> stream_prepare(), it does not drain the subtree manually; it fetches the
> base node, and tries to insert it as the top node's backing node with
> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
> the actual base node might change (because the base node is actually not
> part of the stream job) before the old base node passed to
> bdrv_set_backing_hd() is installed.
> 
> This has two implications:
> 
> First, the stream job does not keep a strong reference to the base node.
> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
> because some other block job is drained to finish), we will get a
> use-after-free.  We should keep a strong reference to that node.
> 
> Second, even with such a strong reference, the problem remains that the
> base node might change before bdrv_set_backing_hd() actually runs and as
> a result the wrong base node is installed.
> 
> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
> case, which has five nodes, and simultaneously streams from the middle
> node to the top node, and commits the middle node down to the base node.
> As it is, this will sometimes crash, namely when we encounter the
> above-described use-after-free.
> 
> Taking a strong reference to the base node, we no longer get a crash,
> but the resulting block graph is less than ideal: The expected result is
> obviously that all middle nodes are cut out and the base node is the
> immediate backing child of the top node.  However, if stream_prepare()
> takes a strong reference to its base node (the middle node), and then
> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
> that middle node, the stream job will just reinstall it again.
> 
> Therefore, we need to keep the whole subtree drained in
> stream_prepare()

That doesn't sound right. I think in reality it's "if we take the really
big hammer and drain the whole subtree, then the bit that we really need
usually happens to be covered, too".

When you have a long backing chain and merge the two topmost overlays
with streaming, then it's none of the stream job's business whether
there is I/O going on for the base image way down the chain. Subtree
drains do much more than they should in this case.

At the same time they probably do too little, because what you're
describing you're protecting against is not I/O, but graph modifications
done by callbacks invoked in the AIO_WAIT_WHILE() when replacing the
backing file. The callback could be invoked by I/O on an entirely
different subgraph (maybe if the other thing is a mirror job) or it
could be a BH or anything else really. bdrv_drain_all() would increase
your chances, but I'm not sure if even that would be guaranteed to be
enough - because it's really another instance of abusing drain for
locking, we're not really interested in the _I/O_ of the node.

> so that the graph modification it performs is effectively atomic,
> i.e. that the base node it fetches is still the base node when
> bdrv_set_backing_hd() sets it as the top node's backing node.

I think the way to keep graph modifications atomic is to avoid polling in
the middle. Not even running any callbacks is a lot safer than trying to
make sure there can't be undesired callbacks that want to run.

So probably adding drain (or anything else that polls) in
bdrv_set_backing_hd() was a bad idea. It could assert that the parent
node is drained, but it should be the caller's responsibility to do so.

What streaming completion should look like is probably something like
this:

    1. Drain above_base, this also drains all parents up to the top node
       (needed because in-flight I/O using an edge that is removed isn't
       going to end well)

    2. Without any polling involved:
        a. Find base (it can't change without polling)
        b. Update top->backing to point to base

    3. End of drain.

You don't have to keep extra references or deal with surprise removals
of nodes because the whole thing is atomic when you don't poll. Other
threads can't interfere either because graph modification requires the
BQL.

There is no reason to keep base drained because its I/O doesn't
interfere with the incoming edge that we're changing.
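
In code, a very rough sketch (reusing the names from stream_prepare() and
assuming a bdrv_set_backing_hd() that doesn't poll internally, which we
don't have today):

    /* 1. also quiesces the parent chain up to the top node */
    bdrv_drained_begin(s->above_base);

    /* 2. nothing polls from here on, so base can't change under us */
    base = bdrv_filter_or_cow_bs(s->above_base);
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);

    /* 3. */
    bdrv_drained_end(s->above_base);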

I think all of this is really relevant for Emanuele's work, which
involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
fully expect that we would see very similar problems, and just stacking
drain sections over drain sections that might happen to usually fix
things, but aren't guaranteed to, doesn't look like a good solution.

Kevin




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 10:14 ` Kevin Wolf
@ 2022-04-05 11:47   ` Hanna Reitz
  2022-04-05 12:05     ` Hanna Reitz
  2022-04-05 14:12     ` Kevin Wolf
  2022-04-05 12:12   ` Vladimir Sementsov-Ogievskiy
  2022-04-05 13:09   ` Emanuele Giuseppe Esposito
  2 siblings, 2 replies; 12+ messages in thread
From: Hanna Reitz @ 2022-04-05 11:47 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: eesposit, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block

On 05.04.22 12:14, Kevin Wolf wrote:
> Am 24.03.2022 um 13:57 hat Hanna Reitz geschrieben:
>> When the stream block job cuts out the nodes between top and base in
>> stream_prepare(), it does not drain the subtree manually; it fetches the
>> base node, and tries to insert it as the top node's backing node with
>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>> the actual base node might change (because the base node is actually not
>> part of the stream job) before the old base node passed to
>> bdrv_set_backing_hd() is installed.
>>
>> This has two implications:
>>
>> First, the stream job does not keep a strong reference to the base node.
>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>> because some other block job is drained to finish), we will get a
>> use-after-free.  We should keep a strong reference to that node.
>>
>> Second, even with such a strong reference, the problem remains that the
>> base node might change before bdrv_set_backing_hd() actually runs and as
>> a result the wrong base node is installed.
>>
>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>> case, which has five nodes, and simultaneously streams from the middle
>> node to the top node, and commits the middle node down to the base node.
>> As it is, this will sometimes crash, namely when we encounter the
>> above-described use-after-free.
>>
>> Taking a strong reference to the base node, we no longer get a crash,
>> but the resulting block graph is less than ideal: The expected result is
>> obviously that all middle nodes are cut out and the base node is the
>> immediate backing child of the top node.  However, if stream_prepare()
>> takes a strong reference to its base node (the middle node), and then
>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>> that middle node, the stream job will just reinstall it again.
>>
>> Therefore, we need to keep the whole subtree drained in
>> stream_prepare()
> That doesn't sound right. I think in reality it's "if we take the really
> big hammer and drain the whole subtree, then the bit that we really need
> usually happens to be covered, too".
>
> When you have a long backing chain and merge the two topmost overlays
> with streaming, then it's none of the stream job's business whether
> there is I/O going on for the base image way down the chain. Subtree
> drains do much more than they should in this case.

Yes, see the discussion I had with Vladimir.  He convinced me that this 
can’t be an indefinite solution, but that we need locking for graph 
changes that’s separate from draining, because (1) those are different 
things, and (2) changing the graph should influence I/O as little as 
possible.

I found this the best solution to fix a known case of a use-after-free 
for 7.1, though.

> At the same time they probably do too little, because what you're
> describing you're protecting against is not I/O, but graph modifications
> done by callbacks invoked in the AIO_WAIT_WHILE() when replacing the
> backing file. The callback could be invoked by I/O on an entirely
> different subgraph (maybe if the other thing is a mirror job) or it
> could be a BH or anything else really. bdrv_drain_all() would increase
> your chances, but I'm not sure if even that would be guaranteed to be
> enough - because it's really another instance of abusing drain for
> locking, we're not really interested in the _I/O_ of the node.

The most common instances of graph modification I see are QMP and block 
jobs finishing.  The former will not be deterred by draining, and we do 
know of one instance where that is a problem (see the bdrv_next() 
discussion).  Generally, it isn’t though.  (If it is, this case here 
won’t be the only thing that breaks.)

As for the latter, most block jobs are parents of the nodes they touch 
(stream is one notable exception with how it handles its base, and 
changing that did indeed cause us headache before), and so will at least 
be paused when a drain occurs on a node they touch.  Since pausing 
doesn’t affect jobs that have exited their main loop, there might be 
some problem with concurrent jobs that are also finished but yielding, 
but I couldn’t find such a case.

I’m not sure what you’re arguing for, so I can only assume.  Perhaps 
you’re arguing for reverting this patch, which I wouldn’t want to do, 
because at least it fixes the one known use-after-free case. Perhaps 
you’re arguing that we need something better, and then I completely agree.

>> so that the graph modification it performs is effectively atomic,
>> i.e. that the base node it fetches is still the base node when
>> bdrv_set_backing_hd() sets it as the top node's backing node.
> I think the way to keep graph modifications atomic is to avoid polling in
> the middle. Not even running any callbacks is a lot safer than trying to
> make sure there can't be undesired callbacks that want to run.
>
> So probably adding drain (or anything else that polls) in
> bdrv_set_backing_hd() was a bad idea. It could assert that the parent
> node is drained, but it should be the caller's responsibility to do so.
>
> What streaming completion should look like is probably something like
> this:
>
>      1. Drain above_base, this also drains all parents up to the top node
>         (needed because in-flight I/O using an edge that is removed isn't
>         going to end well)
>
>      2. Without any polling involved:
>          a. Find base (it can't change without polling)
>          b. Update top->backing to point to base
>
>      3. End of drain.
>
> You don't have to keep extra references or deal with surprise removals
> of nodes because the whole thing is atomic when you don't poll. Other
> threads can't interfere either because graph modification requires the
> BQL.
>
> There is no reason to keep base drained because its I/O doesn't
> interfere with the incoming edge that we're changing.
>
> I think all of this is really relevant for Emanuele's work, which
> involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
> fully expect that we would see very similar problems, and just stacking
> drain sections over drain sections that might happen to usually fix
> things, but aren't guaranteed to, doesn't look like a good solution.

I don’t disagree.  Well, I agree, actually.  But I don’t know what 
you’re proposing to actually do.  There is active discussion on how 
block graph changes should be handled on Emanuele’s series.

Hanna




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 11:47   ` Hanna Reitz
@ 2022-04-05 12:05     ` Hanna Reitz
  2022-04-05 14:12     ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Hanna Reitz @ 2022-04-05 12:05 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: eesposit, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block

On 05.04.22 13:47, Hanna Reitz wrote:
> On 05.04.22 12:14, Kevin Wolf wrote:

[...]

>> At the same time they probably do too little, because what you're
>> describing you're protecting against is not I/O, but graph modifications
>> done by callbacks invoked in the AIO_WAIT_WHILE() when replacing the
>> backing file. The callback could be invoked by I/O on an entirely
>> different subgraph (maybe if the other thing is a mirror job) or it
>> could be a BH or anything else really. bdrv_drain_all() would increase
>> your chances, but I'm not sure if even that would be guaranteed to be
>> enough - because it's really another instance of abusing drain for
>> locking, we're not really interested in the _I/O_ of the node.

[...]

> I’m not sure what you’re arguing for, so I can only assume. Perhaps 
> you’re arguing for reverting this patch, which I wouldn’t want to do, 
> because at least it fixes the one known use-after-free case. Perhaps 
> you’re arguing that we need something better, and then I completely agree.

Perhaps I should also note that what actually fixes the use-after-free
is the bdrv_ref()/unref() pair.  The drained section is just there to
ensure that the graph is actually correct (i.e. that if a concurrently
finishing job removes @base before the stream job’s
bdrv_set_backing_hd() can set it as the top node’s backing node, we
won’t reinstate this @base that the other job just removed).  So even if
this does too little, at least there won’t be a use-after-free.

OTOH, if it does much too much, we can drop the drain and keep the 
ref/unref.  I don’t want to have a release with a use-after-free that I 
know of, but I’d be fine if the block graph is “just” outdated.

Hanna




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 10:14 ` Kevin Wolf
  2022-04-05 11:47   ` Hanna Reitz
@ 2022-04-05 12:12   ` Vladimir Sementsov-Ogievskiy
  2022-04-05 14:41     ` Kevin Wolf
  2022-04-05 13:09   ` Emanuele Giuseppe Esposito
  2 siblings, 1 reply; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-04-05 12:12 UTC (permalink / raw)
  To: Kevin Wolf, Hanna Reitz; +Cc: qemu-block, qemu-devel, John Snow, eesposit

05.04.2022 13:14, Kevin Wolf wrote:
> Am 24.03.2022 um 13:57 hat Hanna Reitz geschrieben:
>> When the stream block job cuts out the nodes between top and base in
>> stream_prepare(), it does not drain the subtree manually; it fetches the
>> base node, and tries to insert it as the top node's backing node with
>> bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
>> the actual base node might change (because the base node is actually not
>> part of the stream job) before the old base node passed to
>> bdrv_set_backing_hd() is installed.
>>
>> This has two implications:
>>
>> First, the stream job does not keep a strong reference to the base node.
>> Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
>> because some other block job is drained to finish), we will get a
>> use-after-free.  We should keep a strong reference to that node.
>>
>> Second, even with such a strong reference, the problem remains that the
>> base node might change before bdrv_set_backing_hd() actually runs and as
>> a result the wrong base node is installed.
>>
>> Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
>> case, which has five nodes, and simultaneously streams from the middle
>> node to the top node, and commits the middle node down to the base node.
>> As it is, this will sometimes crash, namely when we encounter the
>> above-described use-after-free.
>>
>> Taking a strong reference to the base node, we no longer get a crash,
>> but the resulting block graph is less than ideal: The expected result is
>> obviously that all middle nodes are cut out and the base node is the
>> immediate backing child of the top node.  However, if stream_prepare()
>> takes a strong reference to its base node (the middle node), and then
>> the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
>> that middle node, the stream job will just reinstall it again.
>>
>> Therefore, we need to keep the whole subtree drained in
>> stream_prepare()
> 
> That doesn't sound right. I think in reality it's "if we take the really
> big hammer and drain the whole subtree, then the bit that we really need
> usually happens to be covered, too".
> 
> When you have a long backing chain and merge the two topmost overlays
> with streaming, then it's none of the stream job's business whether
> there is I/O going on for the base image way down the chain. Subtree
> drains do much more than they should in this case.
> 
> At the same time they probably do too little, because what you're
> describing you're protecting against is not I/O, but graph modifications
> done by callbacks invoked in the AIO_WAIT_WHILE() when replacing the
> backing file. The callback could be invoked by I/O on an entirely
> different subgraph (maybe if the other thing is a mirror job) or it
> could be a BH or anything else really. bdrv_drain_all() would increase
> your chances, but I'm not sure if even that would be guaranteed to be
> enough - because it's really another instance of abusing drain for
> locking, we're not really interested in the _I/O_ of the node.
> 
>> so that the graph modification it performs is effectively atomic,
>> i.e. that the base node it fetches is still the base node when
>> bdrv_set_backing_hd() sets it as the top node's backing node.
> 
> I think the way to keep graph modifications atomic is to avoid polling in
> the middle. Not even running any callbacks is a lot safer than trying to
> make sure there can't be undesired callbacks that want to run.
> 
> So probably adding drain (or anything else that polls) in
> bdrv_set_backing_hd() was a bad idea. It could assert that the parent
> node is drained, but it should be the caller's responsibility to do so.
> 
> What streaming completion should look like is probably something like
> this:
> 
>      1. Drain above_base, this also drains all parents up to the top node
>         (needed because in-flight I/O using an edge that is removed isn't
>         going to end well)
> 
>      2. Without any polling involved:
>          a. Find base (it can't change without polling)
>          b. Update top->backing to point to base
> 
>      3. End of drain.
> 
> You don't have to keep extra references or deal with surprise removals
> of nodes because the whole thing is atomic when you don't poll. Other
> threads can't interfere either because graph modification requires the
> BQL.
> 
> There is no reason to keep base drained because its I/O doesn't
> interfere with the incoming edge that we're changing.
> 
> I think all of this is really relevant for Emanuele's work, which
> involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
> fully expect that we would see very similar problems, and just stacking
> drain sections over drain sections that might happen to usually fix
> things, but aren't guaranteed to, doesn't look like a good solution.
> 

Thanks Kevin! I have already run out of arguments in the battle against using subtree-drains to isolate graph modification operations from each other in different threads in the mailing list)

(Note also, that the top-most version of this patch is "[PATCH v2] block/stream: Drain subtree around graph change")


About avoiding polling during graph-modifying operations, there is a problem: some I/O operations are involved in block-graph-modifying operations. At least there is rewriting the "backing_file_offset" and "backing_file_size" fields in the qcow2 header.

We can't just separate rewriting the metadata from the graph-modifying operation: that way, another graph-modifying operation may interleave and we'll write outdated metadata.


So I still think we need a kind of global lock for graph-modifying operations. Or a kind of per-BDS lock, as you propose. But in that case we need to be sure that, by taking all the needed per-BDS locks, we'll avoid deadlocking.



-- 
Best regards,
Vladimir



* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 10:14 ` Kevin Wolf
  2022-04-05 11:47   ` Hanna Reitz
  2022-04-05 12:12   ` Vladimir Sementsov-Ogievskiy
@ 2022-04-05 13:09   ` Emanuele Giuseppe Esposito
  2022-04-05 15:04     ` Kevin Wolf
  2 siblings, 1 reply; 12+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-04-05 13:09 UTC (permalink / raw)
  To: Kevin Wolf, Hanna Reitz
  Cc: Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel, qemu-block



Am 05/04/2022 um 12:14 schrieb Kevin Wolf:
> I think all of this is really relevant for Emanuele's work, which
> involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
> fully expect that we would see very similar problems, and just stacking
> drain sections over drain sections that might happen to usually fix
> things, but aren't guaranteed to, doesn't look like a good solution.

Yes, I think at this point we all agreed to drop subtree_drain as
replacement for AioContext.

The alternative is what Paolo proposed in the other thread "Removal of
AioContext lock, bs->parents and ->children: proof of concept".
I am not sure which thread you replied to first :)

I think that proposal is not far from your idea, and it avoids
introducing or even using drains at all.
Not sure why you called it a "step backwards even from AioContext locks".

Emanuele




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 11:47   ` Hanna Reitz
  2022-04-05 12:05     ` Hanna Reitz
@ 2022-04-05 14:12     ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Kevin Wolf @ 2022-04-05 14:12 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: eesposit, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block

Am 05.04.2022 um 13:47 hat Hanna Reitz geschrieben:
> On 05.04.22 12:14, Kevin Wolf wrote:
> > Am 24.03.2022 um 13:57 hat Hanna Reitz geschrieben:
> > > When the stream block job cuts out the nodes between top and base in
> > > stream_prepare(), it does not drain the subtree manually; it fetches the
> > > base node, and tries to insert it as the top node's backing node with
> > > bdrv_set_backing_hd().  bdrv_set_backing_hd() however will drain, and so
> > > the actual base node might change (because the base node is actually not
> > > part of the stream job) before the old base node passed to
> > > bdrv_set_backing_hd() is installed.
> > > 
> > > This has two implications:
> > > 
> > > First, the stream job does not keep a strong reference to the base node.
> > > Therefore, if it is deleted in bdrv_set_backing_hd()'s drain (e.g.
> > > because some other block job is drained to finish), we will get a
> > > use-after-free.  We should keep a strong reference to that node.
> > > 
> > > Second, even with such a strong reference, the problem remains that the
> > > base node might change before bdrv_set_backing_hd() actually runs and as
> > > a result the wrong base node is installed.
> > > 
> > > Both effects can be seen in 030's TestParallelOps.test_overlapping_5()
> > > case, which has five nodes, and simultaneously streams from the middle
> > > node to the top node, and commits the middle node down to the base node.
> > > As it is, this will sometimes crash, namely when we encounter the
> > > above-described use-after-free.
> > > 
> > > Taking a strong reference to the base node, we no longer get a crash,
> > > but the resulting block graph is less than ideal: The expected result is
> > > obviously that all middle nodes are cut out and the base node is the
> > > immediate backing child of the top node.  However, if stream_prepare()
> > > takes a strong reference to its base node (the middle node), and then
> > > the commit job finishes in bdrv_set_backing_hd(), supposedly dropping
> > > that middle node, the stream job will just reinstall it again.
> > > 
> > > Therefore, we need to keep the whole subtree drained in
> > > stream_prepare()
> > That doesn't sound right. I think in reality it's "if we take the really
> > big hammer and drain the whole subtree, then the bit that we really need
> > usually happens to be covered, too".
> > 
> > When you have a long backing chain and merge the two topmost overlays
> > with streaming, then it's none of the stream job's business whether
> > there is I/O going on for the base image way down the chain. Subtree
> > drains do much more than they should in this case.
> 
> Yes, see the discussion I had with Vladimir.  He convinced me that this
> can’t be an indefinite solution, but that we need locking for graph changes
> that’s separate from draining, because (1) those are different things, and
> (2) changing the graph should influence I/O as little as possible.
> 
> I found this the best solution to fix a known case of a use-after-free for
> 7.1, though.

I'm not arguing against a short-term band-aid solution (I assume you
mean for 7.0?) as long as we agree that this is what it is. The commit
message just sounded as if this were the right solution rather than a
hack, so I wanted to make the point.

> > At the same time they probably do too little, because what you're
> > describing you're protecting against is not I/O, but graph modifications
> > done by callbacks invoked in the AIO_WAIT_WHILE() when replacing the
> > backing file. The callback could be invoked by I/O on an entirely
> > different subgraph (maybe if the other thing is a mirror job) or it
> > could be a BH or anything else really. bdrv_drain_all() would increase
> > your chances, but I'm not sure if even that would be guaranteed to be
> > enough - because it's really another instance of abusing drain for
> > locking, we're not really interested in the _I/O_ of the node.
> 
> The most common instances of graph modification I see are QMP and block jobs
> finishing.  The former will not be deterred by draining, and we do know of
> one instance where that is a problem (see the bdrv_next() discussion). 
> Generally, it isn’t though.  (If it is, this case here won’t be the only
> thing that breaks.)

To be honest, I would be surprised if other things weren't broken if QMP
commands come in with unfortunate timing.

> As for the latter, most block jobs are parents of the nodes they touch
> (stream is one notable exception with how it handles its base, and changing
> that did indeed cause us headache before), and so will at least be paused
> when a drain occurs on a node they touch.  Since pausing doesn’t affect jobs
> that have exited their main loop, there might be some problem with
> concurrent jobs that are also finished but yielding, but I couldn’t find
> such a case.

True, the way that we implement drain in block jobs actually means
that they fully pause and therefore can't complete even if they wouldn't
actually access the drained node. The interface isn't really defined to
guarantee this, but the implementation happens to avoid such problems.

> I’m not sure what you’re arguing for, so I can only assume.  Perhaps you’re
> arguing for reverting this patch, which I wouldn’t want to do, because at
> least it fixes the one known use-after-free case. Perhaps you’re arguing
> that we need something better, and then I completely agree.

Definitely more the latter.

Maybe I really just want to hammer home that if you think that subtree
drains are the solution, you should think again, because the cases where
they are the right solution (as opposed to a hack that happens to do the
intended job as a side effect in 99% of cases) are the big exception.

I actually kind of regret introducing them. They were only used in
reopen, and they are kind of correct there because we know that we
_will_ touch every node and can't have concurrent I/O there. They are
still somewhat lazy and individual drains for each node might have been
the better solution.

> > > so that the graph modification it performs is effectively atomic,
> > > i.e. that the base node it fetches is still the base node when
> > > bdrv_set_backing_hd() sets it as the top node's backing node.
> > I think the way to keep graph modifications atomic is to avoid polling in
> > the middle. Not even running any callbacks is a lot safer than trying to
> > make sure there can't be undesired callbacks that want to run.
> > 
> > So probably adding drain (or anything else that polls) in
> > bdrv_set_backing_hd() was a bad idea. It could assert that the parent
> > node is drained, but it should be the caller's responsibility to do so.
> > 
> > What streaming completion should look like is probably something like
> > this:
> > 
> >      1. Drain above_base, this also drains all parents up to the top node
> >         (needed because in-flight I/O using an edge that is removed isn't
> >         going to end well)
> > 
> >      2. Without any polling involved:
> >          a. Find base (it can't change without polling)
> >          b. Update top->backing to point to base
> > 
> >      3. End of drain.
> > 
> > You don't have to keep extra references or deal with surprise removals
> > of nodes because the whole thing is atomic when you don't poll. Other
> > threads can't interfere either because graph modification requires the
> > BQL.
> > 
> > There is no reason to keep base drained because its I/O doesn't
> > interfere with the incoming edge that we're changing.
> > 
> > I think all of this is really relevant for Emanuele's work, which
> > involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
> > fully expect that we would see very similar problems, and just stacking
> > drain sections over drain sections that might happen to usually fix
> > things, but aren't guaranteed to, doesn't look like a good solution.
> 
> I don’t disagree.  Well, I agree, actually.  But I don’t know what you’re
> proposing to actually do.  There is active discussion on how block graph
> changes should be handled on Emanuele’s series.

If you think we should keep the discussion there, that's fine with me.
But this seems like a good and relevant example for the problems we need
to keep in mind there.

Kevin




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 12:12   ` Vladimir Sementsov-Ogievskiy
@ 2022-04-05 14:41     ` Kevin Wolf
  2022-04-05 21:48       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2022-04-05 14:41 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: eesposit, Hanna Reitz, John Snow, qemu-devel, qemu-block

Am 05.04.2022 um 14:12 hat Vladimir Sementsov-Ogievskiy geschrieben:
> Thanks Kevin! I have already run out of arguments in the battle
> against using subtree-drains to isolate graph modification operations
> from each other in different threads in the mailing list)
> 
> (Note also, that the top-most version of this patch is "[PATCH v2]
> block/stream: Drain subtree around graph change")

Oops, I completely missed the v2. Thanks!

> About avoiding polling during graph-modifying operations, there is a
> problem: some IO operations are involved into block-graph modifying
> operations. At least it's rewriting "backing_file_offset" and
> "backing_file_size" fields in qcow2 header.
> 
> We can't just separate rewriting metadata from graph modifying
> operation: this way another graph-modifying operation may interleave
> and we'll write outdated metadata.

Hm, generally we don't update image metadata when we reconfigure the
graph. Most changes are temporary (like insertion of filter nodes) and
the image header only contains a "default configuration" to be used on
the next start.

There are only a few places that update the image header; I think it's
generally block job completions. They obviously update the in-memory
graph, too, but they don't write to the image file (and therefore
potentially poll) in the middle of updating the in-memory graph, but
they do both in separate steps.

I think this is okay. We must just avoid polling in the middle of graph
updates because if something else changes the graph there, it's not
clear any more that we're really doing what the caller had in mind.
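
For the stream example, with the names from the patch, that split looks
roughly like this (illustrative only, error handling omitted):

    /* in-memory graph update; ideally nothing in here polls */
    base = bdrv_filter_or_cow_bs(s->above_base);
    bdrv_set_backing_hd(unfiltered_bs, base, &local_err);

    /* separate step: persist the new backing file in the image header,
     * which is where the actual I/O (and potential polling) happens */
    bdrv_change_backing_file(unfiltered_bs, base_id, base_fmt, false);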

> So I still think, we need a kind of global lock for graph modifying
> operations. Or a kind per-BDS locks as you propose. But in this case
> we need to be sure that taking all needed per-BDS locks we'll avoid
> deadlocking.

I guess this depends on the exact granularity of the locks we're using.
If you take the lock only while updating a single edge, I don't think
you could easily deadlock. If you hold it for more complex operations,
it becomes harder to tell without checking the code.

Kevin




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 13:09   ` Emanuele Giuseppe Esposito
@ 2022-04-05 15:04     ` Kevin Wolf
  2022-04-05 17:53       ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2022-04-05 15:04 UTC (permalink / raw)
  To: Emanuele Giuseppe Esposito
  Cc: Hanna Reitz, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block

Am 05.04.2022 um 15:09 hat Emanuele Giuseppe Esposito geschrieben:
> Am 05/04/2022 um 12:14 schrieb Kevin Wolf:
> > I think all of this is really relevant for Emanuele's work, which
> > involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
> > fully expect that we would see very similar problems, and just stacking
> > drain sections over drain sections that might happen to usually fix
> > things, but aren't guaranteed to, doesn't look like a good solution.
> 
> Yes, I think at this point we all agreed to drop subtree_drain as
> replacement for AioContext.
> 
> The alternative is what Paolo proposed in the other thread " Removal of
> AioContext lock, bs->parents and ->children: proof of concept"
> I am not sure which thread you replied first :)

This one, I think. :-)

> I think that proposal is not far from your idea, and it avoids to
> introduce or even use drains at all.
> Not sure why you called it a "step backwards even from AioContext locks".

I was only referring to the lock locality there. AioContext locks are
really coarse, but still a finer granularity than a single global lock.

In the big picture, it would still be better than the AioContext lock, but
that's because it's a different type of lock, not because it has better
locality.

So I was just wondering if we can't have the different type of lock and
make it local to the BDS, too.

Kevin




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 15:04     ` Kevin Wolf
@ 2022-04-05 17:53       ` Emanuele Giuseppe Esposito
  2022-04-05 18:24         ` Emanuele Giuseppe Esposito
  0 siblings, 1 reply; 12+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-04-05 17:53 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Hanna Reitz, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block



Am 05/04/2022 um 17:04 schrieb Kevin Wolf:
> Am 05.04.2022 um 15:09 hat Emanuele Giuseppe Esposito geschrieben:
>> Am 05/04/2022 um 12:14 schrieb Kevin Wolf:
>>> I think all of this is really relevant for Emanuele's work, which
>>> involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
>>> fully expect that we would see very similar problems, and just stacking
>>> drain sections over drain sections that might happen to usually fix
>>> things, but aren't guaranteed to, doesn't look like a good solution.
>>
>> Yes, I think at this point we all agreed to drop subtree_drain as
>> replacement for AioContext.
>>
>> The alternative is what Paolo proposed in the other thread " Removal of
>> AioContext lock, bs->parents and ->children: proof of concept"
>> I am not sure which thread you replied first :)
> 
> This one, I think. :-)
> 
>> I think that proposal is not far from your idea, and it avoids to
>> introduce or even use drains at all.
>> Not sure why you called it a "step backwards even from AioContext locks".
> 
> I was only referring to the lock locality there. AioContext locks are
> really coarse, but still a finer granularity than a single global lock.
> 
> In the big picture, it's still be better than the AioContext lock, but
> that's because it's a different type of lock, not because it has better
> locality.
> 
> So I was just wondering if we can't have the different type of lock and
> make it local to the BDS, too.

I guess this is the right time to discuss this.

I think that a global lock will be easier to handle, and we already have
a concrete implementation (cpus-common).

I think that the reads in some sense are already BDS-specific, because
each BDS that is reading has an internal flag.
Writes, on the other hand, are global. If a write is happening, no other
read at all can run, even if it has nothing to do with it.

The question then is: how difficult would it be to implement a BDS-specific
write?
From the API perspective, change
bdrv_graph_wrlock(void);
into
bdrv_graph_wrlock(BlockDriverState *parent, BlockDriverState *child);
I am not sure if/how complicated it will be. For sure all the global
variables would end up in the BDS struct.

On the other side, making the read side generic instead could also be interesting.
Think about drain: it is a recursive function, and it doesn't really
make sense to take the rdlock for each node it traverses.
Even though I don't know an easy way to replace ->has_waiter and
->reading_graph flags...
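
Very roughly, and purely hypothetical (the signature is just the one
proposed above, none of this exists), the per-BDS write side could look
like:

    /* writer: lock only the edge that is being updated */
    bdrv_graph_wrlock(parent_bs, child_bs);
    bdrv_replace_child_noperm(child, new_child_bs);  /* pure in-memory update */
    bdrv_graph_wrunlock(parent_bs, child_bs);

with readers (I/O paths, drain recursion, ...) taking the corresponding
rdlock only on the nodes they actually traverse.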

Emanuele




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 17:53       ` Emanuele Giuseppe Esposito
@ 2022-04-05 18:24         ` Emanuele Giuseppe Esposito
  0 siblings, 0 replies; 12+ messages in thread
From: Emanuele Giuseppe Esposito @ 2022-04-05 18:24 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Hanna Reitz, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	qemu-block



Am 05/04/2022 um 19:53 schrieb Emanuele Giuseppe Esposito:
> 
> 
> Am 05/04/2022 um 17:04 schrieb Kevin Wolf:
>> Am 05.04.2022 um 15:09 hat Emanuele Giuseppe Esposito geschrieben:
>>> Am 05/04/2022 um 12:14 schrieb Kevin Wolf:
>>>> I think all of this is really relevant for Emanuele's work, which
>>>> involves adding AIO_WAIT_WHILE() deep inside graph update functions. I
>>>> fully expect that we would see very similar problems, and just stacking
>>>> drain sections over drain sections that might happen to usually fix
>>>> things, but aren't guaranteed to, doesn't look like a good solution.
>>>
>>> Yes, I think at this point we all agreed to drop subtree_drain as
>>> replacement for AioContext.
>>>
>>> The alternative is what Paolo proposed in the other thread " Removal of
>>> AioContext lock, bs->parents and ->children: proof of concept"
>>> I am not sure which thread you replied first :)
>>
>> This one, I think. :-)
>>
>>> I think that proposal is not far from your idea, and it avoids to
>>> introduce or even use drains at all.
>>> Not sure why you called it a "step backwards even from AioContext locks".
>>
>> I was only referring to the lock locality there. AioContext locks are
>> really coarse, but still a finer granularity than a single global lock.
>>
>> In the big picture, it's still be better than the AioContext lock, but
>> that's because it's a different type of lock, not because it has better
>> locality.
>>
>> So I was just wondering if we can't have the different type of lock and
>> make it local to the BDS, too.
> 
> I guess this is the right time to discuss this.
> 
> I think that a global lock will be easier to handle, and we already have
> a concrete implementation (cpus-common).
> 
> I think that the reads in some sense are already BDS-specific, because
> each BDS that is reading has an internal a flag.
> Writes, on the other hand, are global. If a write is happening, no other
> read at all can run, even if it has nothing to do with it.
> 
> The question then is: how difficult would be to implement a BDS-specific
> write?
> From the API prospective, change
> bdrv_graph_wrlock(void);
> into
> bdrv_graph_wrlock(BlockDriverState *parent, BlockDriverState *child);
> I am not sure if/how complicated it will be. For sure all the global
> variables would end up in the BDS struct.
> 
> On the other side, also making instead read generic could be interesting.
> Think about drain: it is a recursive function, and it doesn't really
> make sense to take the rdlock for each node it traverses.

Otherwise, a simple solution for drains that requires no change at all is
to just take the rdlock on the bs calling drain, and since each write
waits for all reads to complete, it will work anyway.

The only detail is that assert_bdrv_graph_readable() will then need to
iterate through all nodes to be sure that at least one of them is
actually reading.
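
Something like this rough sketch, reusing the per-node reading_graph flag
and block.c's all_bdrv_states list (details certainly off):

    static void assert_bdrv_graph_readable(void)
    {
        BlockDriverState *bs;
        bool any_reader = false;

        QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) {
            any_reader |= bs->reading_graph;
        }
        assert(any_reader);
    }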

So yeah I know this might be hard to realize without an implementation,
but my conclusion is to leave the lock as it is for now.

> Even though I don't know an easy way to replace ->has_waiter and
> ->reading_graph flags...
> 
> Emanuele
> 




* Re: [PATCH] block/stream: Drain subtree around graph change
  2022-04-05 14:41     ` Kevin Wolf
@ 2022-04-05 21:48       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2022-04-05 21:48 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Hanna Reitz, qemu-block, qemu-devel, John Snow, eesposit

05.04.2022 17:41, Kevin Wolf wrote:
> Am 05.04.2022 um 14:12 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> Thanks Kevin! I have already run out of arguments in the battle
>> against using subtree-drains to isolate graph modification operations
>> from each other in different threads in the mailing list)
>>
>> (Note also, that the top-most version of this patch is "[PATCH v2]
>> block/stream: Drain subtree around graph change")
> 
> Oops, I completely missed the v2. Thanks!
> 
>> About avoiding polling during graph-modifying operations, there is a
>> problem: some IO operations are involved into block-graph modifying
>> operations. At least it's rewriting "backing_file_offset" and
>> "backing_file_size" fields in qcow2 header.
>>
>> We can't just separate rewriting metadata from graph modifying
>> operation: this way another graph-modifying operation may interleave
>> and we'll write outdated metadata.
> 
> Hm, generally we don't update image metadata when we reconfigure the
> graph. Most changes are temporary (like insertion of filter nodes) and
> the image header only contains a "default configuration" to be used on
> the next start.
> 
> There are only a few places that update the image header; I think it's
> generally block job completions. They obviously update the in-memory
> graph, too, but they don't write to the image file (and therefore
> potentially poll) in the middle of updating the in-memory graph, but
> they do both in separate steps.
> 
> I think this is okay. We must just avoid polling in the middle of graph
> updates because if something else changes the graph there, it's not
> clear any more that we're really doing what the caller had in mind.

Hmm, interesting: where is the polling in the described case?

The first possible place I can find is bdrv_parent_drained_begin_single() in bdrv_replace_child_noperm().

Another is bdrv_apply_subtree_drain() in bdrv_child_cb_attach().

No idea how to get rid of them. Hmm.

I think the core problem here is that when we wait in drained_begin(), nobody protects us from attaching one more node to the drained subgraph. And we should handle this; that's the complexity.

> 
>> So I still think, we need a kind of global lock for graph modifying
>> operations. Or a kind per-BDS locks as you propose. But in this case
>> we need to be sure that taking all needed per-BDS locks we'll avoid
>> deadlocking.
> 
> I guess this depends on the exact granularity of the locks we're using.
> If you take the lock only while updating a single edge, I don't think
> you could easily deadlock. If you hold it for more complex operations,
> it becomes harder to tell without checking the code.
> 

I think keeping the whole operation (like reopen_multiple, some job's .prepare(), etc.) under one critical section is simplest to analyze.

Could this be something like this?

   uint8_t graph_locked;

   void graph_lock(AioContext *ctx) {
     /* poll until we win the 0 -> 1 transition */
     AIO_WAIT_WHILE(ctx, qatomic_cmpxchg(&graph_locked, 0, 1) == 1);
   }

   void graph_unlock(void) {
     qatomic_set(&graph_locked, 0);
     aio_wait_kick();
   }
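
A graph-modifying operation (reopen_multiple, a job's .prepare(), ...)
would then just wrap its whole body, roughly:

   graph_lock(qemu_get_aio_context());
   /* fetch base, bdrv_set_backing_hd(), rewrite the image header, ... */
   graph_unlock();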

-- 
Best regards,
Vladimir



