* [Qemu-devel] Drainage in bdrv_replace_child_noperm()
From: Max Reitz @ 2017-11-06 18:49 UTC
  To: Qemu-block, Qemu-devel

Hi everyone,

On my quest to fix some flaky iotests, I came to a bit of a halt on 129.
 (Details: Its issue is that block jobs now generally ignore throttling
in a BB (because they use their own), so we have to add a throttle node
instead.  However, when I added it, I got an abort.)

My issue can be reproduced as follows:

$ x86_64-softmmu/qemu-system-x86_64 \
    -qmp stdio \
    -object throttle-group,id=tg0 \
    -blockdev "{'driver':'throttle','node-name':'drive0',
                'throttle-group':'tg0','file':{'driver':'null-co'}}" \
    -blockdev node-name=target,driver=null-co
{"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2},
"package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}}
{'execute':'qmp_capabilities'}
{"return": {}}
{'execute':'blockdev-mirror','arguments':{
    'device':'drive0','job-id':'job0','target':'target','sync':'full',
    'filter-node-name':'mirror-node' }}
qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end:
Assertion `tgm->io_limits_disabled' failed.
[1]    3524 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
stdio -object throttle-group,id=tg0

Here's what happens:

(1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0.

(2) bdrv_append(...) puts mirror-node above drive0.  Through
bdrv_replace_child_noperm(), this will invoke
bdrv_child_cb_drained_begin() on mirror-node.  This is necessary because
drive0 is drained, so the new parent needs to be drained as well.
However, note that drive0 is not yet attached to mirror-node.
Therefore, mirror-node cannot drain drive0 recursively.

This is seemingly fine because drive0 is drained anyway.  However, it is
different from what would have happened if we had drained drive0 with
mirror-node already attached as its parent: then we would have drained
drive0 twice; once by itself, and a second time recursively through
mirror-node.

This will be important in a second...

(3) ...and this second is now: We invoke bdrv_drained_end() on drive0.
Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end()
that goes up to mirror-node which recursively un-drains drive0.  Fine so
far.  But once that parent un-drain is done, we un-drain drive0 by
itself: And this fails the assertion in the throttle driver because we
attempt to un-drain it twice, although we've drained it only once.
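
To spell out the imbalance as the throttle node sees it (rough trace,
simplified, call chain reconstructed from what I can tell):

  bdrv_drained_begin(drive0)
    -> throttle_co_drain_begin()           io_limits_disabled: 0 -> 1

  bdrv_append() / bdrv_replace_child_noperm()
    -> bdrv_child_cb_drained_begin(mirror-node)
       (drive0 not attached yet, so nothing recurses back down to it)

  bdrv_drained_end(drive0)
    -> bdrv_parent_drained_end()
       -> bdrv_child_cb_drained_end(mirror-node)
          -> recursively un-drains drive0
             -> throttle_co_drain_end()    io_limits_disabled: 1 -> 0
    -> throttle_co_drain_end()             counter already 0, assertion fails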


So the issue has two parts:

(A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain
that child, too.  This seems a bit superfluous to me and I would guess
that it results in worst-case O(n^2) function calls to drain a block
graph consisting of n nodes.

(B) In bdrv_replace_child_noperm() we try to drain the parent if the new
child is drained; specifically, we want it to be in a state as if it had
been a parent when the child was originally drained.  However, we fail
at this because we drain the parent without the child attached, so we
don't drain the child twice.  This bites us when we undrain everything.

(Most importantly, ideally we'd want to attach the new child to the
parent and then drain the parent: This would give us exactly the state
we want.  However, attaching the child first and then draining the
parent is unsafe, so we cannot do it...)

[1] Whether the parent (un-)drains the child depends on the
BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
We cannot say it generally.

OK, so how to fix it?  I don't know, so I'm asking you. :-)

I have two ideas:

One is to assume that (un-)draining a parent will always (un-)drain all
children, including the one the (un-)drain comes from.  This assumption
seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
explicitly drain the new child in bdrv_replace_child_noperm() after
having drained the parent and thus get a consistent state again.

The other is to declare (A) wrong.  Maybe when
BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
child because we can declare it the caller's responsibility to make sure
it's drained.  This seems logical to me because usually those methods
are invoked when the child is drained anyway.  But maybe I'm wrong. :-)


So, any ideas?

Max



* Re: [Qemu-devel] Drainage in bdrv_replace_child_noperm()
From: Fam Zheng @ 2017-11-07  5:21 UTC
  To: Max Reitz; +Cc: Qemu-block, Qemu-devel

On Mon, 11/06 19:49, Max Reitz wrote:
> Hi everyone,
> 
> On my quest to fix some flaky iotests, I came to a bit of a halt on 129.
>  (Details: Its issue is that block jobs now generally ignore throttling
> in a BB (because they use their own), so we have to add a throttle node
> instead.  However, when I added it, I got an abort.)
> 
> My issue can be reproduced as follows:
> 
> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -object throttle-group,id=tg0 \
>     -blockdev "{'driver':'throttle','node-name':'drive0',
>                 'throttle-group':'tg0','file':{'driver':'null-co'}}" \
>     -blockdev node-name=target,driver=null-co
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2},
> "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}}
> {'execute':'qmp_capabilities'}
> {"return": {}}
> {'execute':'blockdev-mirror','arguments':{
>     'device':'drive0','job-id':'job0','target':'target','sync':'full',
>     'filter-node-name':'mirror-node' }}
> qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end:
> Assertion `tgm->io_limits_disabled' failed.
> [1]    3524 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
> stdio -object throttle-group,id=tg0
> 
> Here's what happens:
> 
> (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0.
> 
> (2) bdrv_append(...) puts mirror-node above drive0.  Through
> bdrv_replace_child_noperm(), this will invoke
> bdrv_child_cb_drained_begin() on mirror-node.  This is necessary because
> drive0 is drained, so the new parent needs to be drained as well.
> However, note that drive0 is not yet attached to mirror-node.
> Therefore, mirror-node cannot drain drive0 recursively.
> 
> This is seemingly fine because drive0 is drained anyway.  However, it is
> different from what would have happened if we had drained drive0 with
> mirror-node already attached as its parent: then we would have drained
> drive0 twice; once by itself, and a second time recursively through
> mirror-node.
> 
> This will be important in a second...
> 
> (3) ...and this second is now: We invoke bdrv_drained_end() on drive0.
> Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end()
> that goes up to mirror-node which recursively un-drains drive0.  Fine so
> far.  But once that parent un-drain is done, we un-drain drive0 by
> itself: And this fails the assertion in the throttle driver because we
> attempt to un-drain it twice, although we've drained it only once.

So it is not a problem specific to throttle; the general problem is that
the drain/undrain calls issued by bdrv_drained_begin and bdrv_drained_end
are unbalanced.  The throttle filter just happens to assert that they are
balanced, so we get an abort.

> 
> 
> So the issue has two parts:
> 
> (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain
> that child, too.  This seems a bit superfluous to me and I would guess
> that it results in worst-case O(n^2) function calls to drain a block
> graph consisting of n nodes.
> 
> (B) In bdrv_replace_child_noperm() we try to drain the parent if the new
> child is drained; specifically, we want it to be in a state as if it had
> been a parent when the child was originally drained.  However, we fail
> at this because we drain the parent without the child attached, so we
> don't drain the child twice.  This bites us when we undrain everything.
> 
> (Most importantly, ideally we'd want to attach the new child to the
> parent and then drain the parent: This would give us exactly the state
> we want.  However, attaching the child first and then draining the
> parent is unsafe, so we cannot do it...)
> 
> [1] Whether the parent (un-)drains the child depends on the
> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
> We cannot say it generally.
> 
> OK, so how to fix it?  I don't know, so I'm asking you. :-)
> 
> I have two ideas:
> 
> One is to assume that (un-)draining a parent will always (un-)drain all
> children, including the one the (un-)drain comes from.  This assumption
> seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
> explicitly drain the new child in bdrv_replace_child_noperm() after
> having drained the parent and thus get a consistent state again.
> 
> The other is to declare (A) wrong.  Maybe when
> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
> child because we can declare it the caller's responsibility to make sure
> it's drained.  This seems logical to me because usually those methods
> are invoked when the child is drained anyway.  But maybe I'm wrong. :-)

I'm in favor of asking the caller to make sure all nodes involved in the
graph manipulation are drained; that feels comparatively easier than
fixing the problem in bdrv_append().
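
E.g. for the mirror case the caller side could look roughly like this
(pseudo code, untested, variable names made up):

    bdrv_drained_begin(bs);              /* node being mirrored */
    bdrv_drained_begin(mirror_top_bs);   /* new filter node, drained by
                                          * the caller before wiring it in */
    bdrv_append(mirror_top_bs, bs, errp);
    ...
    bdrv_drained_end(mirror_top_bs);
    bdrv_drained_end(bs);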

Fam


* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm()
From: Kevin Wolf @ 2017-11-07 14:22 UTC
  To: Max Reitz; +Cc: Qemu-block, Qemu-devel

On 2017-11-06 at 19:49, Max Reitz wrote:
> Hi everyone,
> 
> On my quest to fix some flaky iotests, I came to a bit of a halt on 129.
>  (Details: Its issue is that block jobs now generally ignore throttling
> in a BB (because they use their own), so we have to add a throttle node
> instead.  However, when I added it, I got an abort.)
> 
> My issue can be reproduced as follows:
> 
> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -object throttle-group,id=tg0 \
>     -blockdev "{'driver':'throttle','node-name':'drive0',
>                 'throttle-group':'tg0','file':{'driver':'null-co'}}" \
>     -blockdev node-name=target,driver=null-co
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2},
> "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}}
> {'execute':'qmp_capabilities'}
> {"return": {}}
> {'execute':'blockdev-mirror','arguments':{
>     'device':'drive0','job-id':'job0','target':'target','sync':'full',
>     'filter-node-name':'mirror-node' }}
> qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end:
> Assertion `tgm->io_limits_disabled' failed.
> [1]    3524 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
> stdio -object throttle-group,id=tg0
> 
> Here's what happens:
> 
> (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0.
> 
> (2) bdrv_append(...) puts mirror-node above drive0.  Through
> bdrv_replace_child_noperm(), this will invoke
> bdrv_child_cb_drained_begin() on mirror-node.  This is necessary because
> drive0 is drained, so the new parent needs to be drained as well.
> However, note that drive0 is not yet attached to mirror-node.
> Therefore, mirror-node cannot drain drive0 recursively.

Important context: We're talking about bdrv_set_backing_hd() here.

It's also not quite correct to say that drive0 is not yet attached, but
we're in a weird half-attached state. The BdrvChild is already
initialised and in the parent list of drive0, but it's not yet assigned
to mirror_node->backing nor in mirror_node's child list.

For this specific case it looks like this is indeed the same as not
being attached, but I wouldn't be surprised if we saw stranger effects
at some point.

> This is seemingly fine because drive0 is drained anyway.  However, it is
> different from what would have happened if we had drained drive0 with
> mirror-node already attached as its parent: then we would have drained
> drive0 twice; once by itself, and a second time recursively through
> mirror-node.
> 
> This will be important in a second...
> 
> (3) ...and this second is now: We invoke bdrv_drained_end() on drive0.
> Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end()
> that goes up to mirror-node which recursively un-drains drive0.  Fine so
> far.  But once that parent un-drain is done, we un-drain drive0 by
> itself: And this fails the assertion in the throttle driver because we
> attempt to un-drain it twice, although we've drained it only once.
> 
> 
> So the issue has two parts:
> 
> (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain
> that child, too.  This seems a bit superfluous to me and I would guess
> that it results in worst-case O(n^2) function calls to drain a block
> graph consisting of n nodes.
> 
> (B) In bdrv_replace_child_noperm() we try to drain the parent if the new
> child is drained; specifically, we want it to be in a state as if it had
> been a parent when the child was originally drained.  However, we fail
> at this because we drain the parent without the child attached, so we
> don't drain the child twice.  This bites us when we undrain everything.

I think the issue is much simpler, even though it still has two parts.
It's the old story of bdrv_drain mixing two separate concepts:

1. Wait synchronously for the completion of all my requests to this
   node. This needs to be propagated down the graph to the children.

2. Make sure that nobody else sends new requests to this node. This
   needs to be propagated up the graph to the parents.

Some callers want only 1. (usually callers of bdrv_drain_all() without a
begin/end pair), some callers want both 1. and 2. (that's the begin/end
construction for getting exclusive access). Not sure if there are
callers that want only 2., but possibly.

If we actually take this separation seriously, the first step to a fix
could be that BdrvChildRole.drained_begin shouldn't be propagated to the
children. We may still need to drain the requests at the node itself:

Imagine a drained backing file of a qcow2 node. Making sure that the qcow2
node doesn't get new requests isn't enough; we also must make sure that
in-flight requests don't access the backing file any more. This means
draining the qcow2 node, though we don't care whether its child nodes
still have requests in flight.

The big question is whether bdrv_drain() would still work for a single
node without recursing to the children, but as it uses bs->in_flight, I
think that should be okay these days.
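
I.e. the "wait for this node only" part could boil down to something
like this (sketch from memory, ignoring the coroutine case and the
BDRV_POLL_WHILE details):

    /* wait only for requests in flight on bs itself, don't recurse */
    while (atomic_read(&bs->in_flight) > 0) {
        aio_poll(bdrv_get_aio_context(bs), true);
    }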

> (Most importantly, ideally we'd want to attach the new child to the
> parent and then drain the parent: This would give us exactly the state
> we want.  However, attaching the child first and then draining the
> parent is unsafe, so we cannot do it...)
> 
> [1] Whether the parent (un-)drains the child depends on the
> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
> We cannot say it generally.
> 
> OK, so how to fix it?  I don't know, so I'm asking you. :-)

The conclusion from what I wrote above would be to add a non-recursive
drain function (probably a version of bdrv_drained_begin/end with a bool
parameter) and call that from bdrv_child_cb_drained_begin/end.

This would still only be a partial solution because we still maintain
the single interface for two different purposes, but it should be a step
in the right direction and fix the problem at hand.
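
Interface-wise I'm thinking of something like this (sketch only, names
invented and may end up different):

    /* recursive=false: quiesce only bs itself, leave its children alone */
    void bdrv_do_drained_begin(BlockDriverState *bs, bool recursive);
    void bdrv_do_drained_end(BlockDriverState *bs, bool recursive);

    static void bdrv_child_cb_drained_begin(BdrvChild *child)
    {
        BlockDriverState *bs = child->opaque;
        bdrv_do_drained_begin(bs, false);
    }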

> I have two ideas:
> 
> One is to assume that (un-)draining a parent will always (un-)drain all
> children, including the one the (un-)drain comes from.  This assumption
> seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
> explicitly drain the new child in bdrv_replace_child_noperm() after
> having drained the parent and thus get a consistent state again.

I agree that this is wrong.

> The other is to declare (A) wrong.  Maybe when
> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
> child because we can declare it the caller's responsibility to make sure
> it's drained.  This seems logical to me because usually those methods
> are invoked when the child is drained anyway.  But maybe I'm wrong. :-)

Looks like a similar resolution as I suggest, though I like my reasoning
for it better. ;-)

> So, any ideas?

Just an additional thought (aka "alles kaputt", i.e. "everything is
broken"): The throttle driver will respond to
BlockDriver.bdrv_drained_begin() by completing all of the queued
requests (ignoring the I/O limits) before it returns. This is great when
the drain request comes from a parent because it just wants to get
everything completed. It's kind of a problem when the drain request
comes from a child which is already drained...

To be more specific, you pointed at bdrv_replace_child_noperm(). This
replaces child->bs first and only then calls the .drained_begin callback
of the parent. So if the parent wants to implement draining by just
submitting its queue, we're submitting requests to a child where this
isn't allowed.

If we had separated the two operations, we could have two BlockDriver
callbacks, one triggering the queue flush, and the other one requiring
that no new requests from the queue be submitted.
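
In BlockDriver terms that could look something like this (sketch, both
callback names invented):

    /* 2.: stop issuing new requests, e.g. keep the throttle queue parked */
    void coroutine_fn (*bdrv_co_quiesce_begin)(BlockDriverState *bs);

    /* 1.: actively get the queued requests completed now */
    void coroutine_fn (*bdrv_co_drain_queue)(BlockDriverState *bs);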

Kevin


* Re: [Qemu-devel] Drainage in bdrv_replace_child_noperm()
From: Max Reitz @ 2017-11-08 19:39 UTC
  To: Fam Zheng; +Cc: Qemu-block, Qemu-devel

On 2017-11-07 06:21, Fam Zheng wrote:
> On Mon, 11/06 19:49, Max Reitz wrote:

[...]

>> I have two ideas:
>>
>> One is to assume that (un-)draining a parent will always (un-)drain all
>> children, including the one the (un-)drain comes from.  This assumption
>> seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
>> explicitly drain the new child in bdrv_replace_child_noperm() after
>> having drained the parent and thus get a consistent state again.
>>
>> The other is to declare (A) wrong.  Maybe when
>> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
>> child because we can declare it the caller's responsibility to make sure
>> it's drained.  This seems logical to me because usually those methods
>> are invoked when the child is drained anyway.  But maybe I'm wrong. :-)
> 
> I'm in favor of asking the caller to make sure all nodes involved in the
> graph manipulation are drained; that feels comparatively easier than
> fixing the problem in bdrv_append().

I guess my main question was whether either of the two approaches is
evidently wrong in some way, but if you don't say that (and Kevin says
the first is wrong), I'm going to assume doing the second is OK. :-)

Max



* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm()
From: Max Reitz @ 2017-11-08 20:16 UTC
  To: Kevin Wolf; +Cc: Qemu-block, Qemu-devel

On 2017-11-07 15:22, Kevin Wolf wrote:
> On 2017-11-06 at 19:49, Max Reitz wrote:
>> Hi everyone,
>>
>> On my quest to fix some flaky iotests, I came to a bit of a halt on 129.
>>  (Details: Its issue is that block jobs now generally ignore throttling
>> in a BB (because they use their own), so we have to add a throttle node
>> instead.  However, when I added it, I got an abort.)
>>
>> My issue can be reproduced as follows:
>>
>> $ x86_64-softmmu/qemu-system-x86_64 \
>>     -qmp stdio \
>>     -object throttle-group,id=tg0 \
>>     -blockdev "{'driver':'throttle','node-name':'drive0',
>>                 'throttle-group':'tg0','file':{'driver':'null-co'}}" \
>>     -blockdev node-name=target,driver=null-co
>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2},
>> "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}}
>> {'execute':'qmp_capabilities'}
>> {"return": {}}
>> {'execute':'blockdev-mirror','arguments':{
>>     'device':'drive0','job-id':'job0','target':'target','sync':'full',
>>     'filter-node-name':'mirror-node' }}
>> qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end:
>> Assertion `tgm->io_limits_disabled' failed.
>> [1]    3524 abort (core dumped)  x86_64-softmmu/qemu-system-x86_64 -qmp
>> stdio -object throttle-group,id=tg0
>>
>> Here's what happens:
>>
>> (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0.
>>
>> (2) bdrv_append(...) puts mirror-node above drive0.  Through
>> bdrv_replace_child_noperm(), this will invoke
>> bdrv_child_cb_drained_begin() on mirror-node.  This is necessary because
>> drive0 is drained, so the new parent needs to be drained as well.
>> However, note that drive0 is not yet attached to mirror-node.
>> Therefore, mirror-node cannot drain drive0 recursively.
> 
> Important context: We're talking about bdrv_set_backing_hd() here.
> 
> It's also not quite correct to say that drive0 is not yet attached, but
> we're in a weird half-attached state. The BdrvChild is already
> initialised and in the parent list of drive0, but it's not yet assigned
> to mirror_node->backing nor in mirror_node's child list.
> 
> For this specific case it looks like this is indeed the same as not
> being attached, but I wouldn't be surprised if we saw stranger effects
> at some point.
> 
>> This is seemingly fine because drive0 is drained anyway.  However, it is
>> different from what would have happened if we had drained drive0 with
>> mirror-node already attached as its parent: then we would have drained
>> drive0 twice; once by itself, and a second time recursively through
>> mirror-node.
>>
>> This will be important in a second...
>>
>> (3) ...and this second is now: We invoke bdrv_drained_end() on drive0.
>> Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end()
>> that goes up to mirror-node which recursively un-drains drive0.  Fine so
>> far.  But once that parent un-drain is done, we un-drain drive0 by
>> itself: And this fails the assertion in the throttle driver because we
>> attempt to un-drain it twice, although we've drained it only once.
>>
>>
>> So the issue has two parts:
>>
>> (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain
>> that child, too.  This seems a bit superfluous to me and I would guess
>> that it results in worst-case O(n^2) function calls to drain a block
>> graph consisting of n nodes.
>>
>> (B) In bdrv_replace_child_noperm() we try to drain the parent if the new
>> child is drained; specifically, we want it to be in a state as if it had
>> been a parent when the child was originally drained.  However, we fail
>> at this because we drain the parent without the child attached, so we
>> don't drain the child twice.  This bites us when we undrain everything.
> 
> I think the issue is much simpler, even though it still has two parts.
> It's the old story of bdrv_drain mixing two separate concepts:
> 
> 1. Wait synchronously for the completion of all my requests to this
>    node. This needs to be propagated down the graph to the children.

So, flush without flushing protocol drivers? :-)

> 2. Make sure that nobody else sends new requests to this node. This
>    needs to be propagated up the graph to the parents.
> 
> Some callers want only 1. (usually callers of bdrv_drain_all() without a
> begin/end pair), some callers want both 1. and 2. (that's the begin/end
> construction for getting exclusive access). Not sure if there are
> callers that want only 2., but possibly.
> 
> If we actually take this separation seriously, the first step to a fix
> could be that BdrvChildRole.drained_begin shouldn't be propagated to the
> children.

That seems good to me, but considering that the default was to just
propagate it to every child, I thought that I was missing something.

>           We may still need to drain the requests at the node itself:
>
> Imagine a drained backing file of a qcow2 node. Making sure that the qcow2
> node doesn't get new requests isn't enough; we also must make sure that
> in-flight requests don't access the backing file any more. This means
> draining the qcow2 node, though we don't care whether its child nodes
> still have requests in flight.

If you want to stop the qcow2 node from generating further requests, you
can either flush them (as you said) or pause them (whatever that means...).

However, if you flush them, you also need to flush the children you have
flushed them to...  So what I wrote was technically wrong ("don't pass
parent drainage back to the child because the child is already
drained"), instead it should be "don't pass parent drainage back to the
child because the child is going to be drained (flushed) anyway".

So, pause requests first (either by parking them or flushing them,
driver's choice), then flush the tree.

Of course that's easier said than done...  First, we would need a
non-recursive flush.  Then, every node that is visited by the drain
would (*after* recursing to its parents) need to flush itself.

(Note that I'm completely disregarding nodes which are below the
original node that is supposed to be drained, and which therefore do not
drain their parents (or do they?) because I'd like to retain at least
some of my sanity for now.)

Secondly we have exactly the issue you describe below...

> The big question is whether bdrv_drain() would still work for a single
> node without recursing to the children, but as it uses bs->in_flight, I
> think that should be okay these days.
> 
>> (Most importantly, ideally we'd want to attach the new child to the
>> parent and then drain the parent: This would give us exactly the state
>> we want.  However, attaching the child first and then draining the
>> parent is unsafe, so we cannot do it...)
>>
>> [1] Whether the parent (un-)drains the child depends on the
>> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
>> We cannot say it generally.
>>
>> OK, so how to fix it?  I don't know, so I'm asking you. :-)
> 
> The conclusion from what I wrote above would be to add a non-recursive
> drain function (probably a version of bdrv_drained_begin/end with a bool
> parameter) and call that from bdrv_child_cb_drained_begin/end.
> 
> This would still only be a partial solution because we still maintain
> the single interface for two different purposes, but it should be a step
> in the right direction and fix the problem at hand.

Well, except...

>> I have two ideas:
>>
>> One is to assume that (un-)draining a parent will always (un-)drain all
>> children, including the one the (un-)drain comes from.  This assumption
>> seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
>> explicitly drain the new child in bdrv_replace_child_noperm() after
>> having drained the parent and thus get a consistent state again.
> 
> I agree that this is wrong.
> 
>> The other is to declare (A) wrong.  Maybe when
>> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
>> child because we can declare it the caller's responsibility to make sure
>> it's drained.  This seems logical to me because usually those methods
>> are invoked when the child is drained anyway.  But maybe I'm wrong. :-)
> 
> Looks like a similar resolution as I suggest, though I like my reasoning
> for it better. ;-)
> 
>> So, any ideas?
> 
> Just an additional thought (aka "alles kaputt"): The throttle driver
> will respond to BlockDriver.bdrv_drained_begin() by completing all of
> the queued requests (ignoring the I/O limits) before it returns. This is
> great when the drain request comes from a parent because it just wants to
> get everything completed. It's kind of a problem when the drain request
> comes from a child which is already drained...
> 
> To be more specific, you pointed at bdrv_replace_child_noperm(). This
> replaces child->bs first and only then calls the .drained_begin callback
> of the parent. So if the parent wants to implement draining by just
> submitting its queue, we're submitting requests to a child where this
> isn't allowed.

...exactly this. :-/

So a non-recursive drain would just fix the assert, but not this issue.
And I'm not sure whether this is reasonable.  Fixing the assert but
allowing the driver to submit requests to a drained node is just wrong.

> If we had separated the two operations, we could have two BlockDriver
> callbacks, one triggering the queue flush, and the other one requiring
> that no new requests from the queue be submitted.

Yes.  Then we could require that every driver implements a way to park
requests instead of only pausing request submissions through flushing.

...I guess I'll leave 129 just broken for now. :-/

Max



* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm()
From: Kevin Wolf @ 2017-11-09 16:25 UTC
  To: Max Reitz; +Cc: Qemu-block, Qemu-devel

On 2017-11-08 at 21:16, Max Reitz wrote:
> On 2017-11-07 15:22, Kevin Wolf wrote:
> > I think the issue is much simpler, even though it still has two parts.
> > It's the old story of bdrv_drain mixing two separate concepts:
> > 
> > 1. Wait synchronously for the completion of all my requests to this
> >    node. This needs to be propagated down the graph to the children.
> 
> So, flush without flushing protocol drivers? :-)

No, flush is a completely different thing. Drain is about completing all
in-flight requests, whereas flush is about writing out all caches. You
can do either one without the other.

In particular, a flush doesn't involve a drain; you can still submit
requests while a flush request is in flight. The semantics are that a
flush makes sure that all requests which had already completed when the
flush request was started are stable on disk. Later requests may or may
not be stable on disk yet.
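
In API terms, just to illustrate the distinction:

    bdrv_drained_begin(bs);  /* no new requests, wait for in-flight ones;
                              * says nothing about what is on disk */
    ...
    bdrv_drained_end(bs);

    bdrv_flush(bs);          /* completed writes become stable on disk;
                              * other requests can keep running meanwhile */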

> > 2. Make sure that nobody else sends new requests to this node. This
> >    needs to be propagated up the graph to the parents.
> > 
> > Some callers want only 1. (usually callers of bdrv_drain_all() without a
> > begin/end pair), some callers want both 1. and 2. (that's the begin/end
> > construction for getting exclusive access). Not sure if there are
> > callers that want only 2., but possibly.
> > 
> > If we actually take this separation seriously, the first step to a fix
> > could be that BdrvChildRole.drained_begin shouldn't be propagated to the
> > children.
> 
> That seems good to me, but considering that the default was to just
> propagate it to every child, I thought that I was missing something.

bdrv_child_cb_drained_begin() calls bdrv_drained_begin() to wait for the
completion of all running requests on its current node. The propagation
to every child is an unwanted side effect, but so far it didn't seem to
hurt anyone, so we didn't care about preventing it.
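
Roughly (simplified, from memory):

  bdrv_child_cb_drained_begin(child)
    -> bdrv_drained_begin(child->opaque)    /* the parent node */
         -> bdrv_parent_drained_begin()     /* up to its parents */
         -> wait for bs->in_flight == 0     /* what we actually want */
         -> recurse into bs->children       /* the unwanted side effect */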

> >           We may still need to drain the requests at the node itself:
> >
> > Imagine a drained backing file of a qcow2 node. Making sure that the qcow2
> > node doesn't get new requests isn't enough; we also must make sure that
> > in-flight requests don't access the backing file any more. This means
> > draining the qcow2 node, though we don't care whether its child nodes
> > still have requests in flight.
> 
> If you want to stop the qcow2 node from generating further requests, you
> can either flush them (as you said) or pause them (whatever that means...).

Pausing is probably way too complicated. qcow2 never issues requests by
itself. It only has requests running if someone else has a request
running on the qcow2 node. So it's enough to just drain the request
queue of the qcow2 node to get to the state we want.

The only thing we need to make sure is that we drain it _before_ its
child node is replaced with a drained one, so that its requests can
complete.

In fact, I think graph modifications should only be done with drained
nodes anyway. Otherwise you could switch over in the middle of a single
high-level operation and possibly get callbacks from a node that isn't a
child any more. Maybe we should add some assertions to that effect and
clean up whatever code breaks after it.
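
Something like this in bdrv_replace_child_noperm(), I guess (sketch,
untested, assuming bs->quiesce_counter is the right thing to check):

    /* graph changes only between drained nodes */
    assert(!child->bs || child->bs->quiesce_counter > 0);
    assert(!new_bs || new_bs->quiesce_counter > 0);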

> However, if you flush them, you also need to flush the children you have
> flushed them to...  So what I wrote was technically wrong ("don't pass
> parent drainage back to the child because the child is already
> drained"), instead it should be "don't pass parent drainage back to the
> child because the child is going to be drained (flushed) anyway".
> 
> So, pause requests first (either by parking them or flushing them,
> driver's choice), then flush the tree.
> 
> Of course that's easier said than done...  First, we would need a
> non-recursive flush.  Then, every node that is visited by the drain
> would (*after* recursing to its parents) need to flush itself.
> 
> (Note that I'm completely disregarding nodes which are below the
> original node that is supposed to be drained, and which therefore do not
> drain their parents (or do they?) because I'd like to retain at least
> some of my sanity for now.)

I don't follow, but I suppose this is related to the flush/drain
confusion.

> Secondly we have exactly the issue you describe below...
> 
> > The big question is whether bdrv_drain() would still work for a single
> > node without recursing to the children, but as it uses bs->in_flight, I
> > think that should be okay these days.
> > 
> >> (Most importantly, ideally we'd want to attach the new child to the
> >> parent and then drain the parent: This would give us exactly the state
> >> we want.  However, attaching the child first and then draining the
> >> parent is unsafe, so we cannot do it...)
> >>
> >> [1] Whether the parent (un-)drains the child depends on the
> >> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
> >> We cannot say it generally.
> >>
> >> OK, so how to fix it?  I don't know, so I'm asking you. :-)
> > 
> > The conclusion from what I wrote above would be to add a non-recursive
> > drain function (probably a version of bdrv_drained_begin/end with a bool
> > parameter) and call that from bdrv_child_cb_drained_begin/end.
> > 
> > This would still only be a partial solution because we still maintain
> > the single interface for two different purposes, but it should be a step
> > in the right direction and fix the problem at hand.
> 
> Well, except...

Yes, bug 1 would be fixed, but not yet bug 2.

> >> I have two ideas:
> >>
> >> One is to assume that (un-)draining a parent will always (un-)drain all
> >> children, including the one the (un-)drain comes from.  This assumption
> >> seems wrong, see [1], but maybe it isn't.  Anyway, if so, we could just
> >> explicitly drain the new child in bdrv_replace_child_noperm() after
> >> having drained the parent and thus get a consistent state again.
> > 
> > I agree that this is wrong.
> > 
> >> The other is to declare (A) wrong.  Maybe when
> >> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
> >> child because we can declare it the caller's responsibility to make sure
> >> it's drained.  This seems logical to me because usually those methods
> >> are invoked when the child is drained anyway.  But maybe I'm wrong. :-)
> > 
> > Looks like a similar resolution as I suggest, though I like my reasoning
> > for it better. ;-)
> > 
> >> So, any ideas?
> > 
> > Just an additional thought (aka "alles kaputt"): The throttle driver
> > will respond to BlockDriver.bdrv_drained_begin() by completing all of
> > the queued requests (ignoring the I/O limits) before it returns. This is
> > great when the drain request comes from a parent because it just wants to
> > get everything completed. It's kind of a problem when the drain request
> > comes from a child which is already drained...
> > 
> > To be more specific, you pointed at bdrv_replace_child_noperm(). This
> > replaces child->bs first and only then calls the .drained_begin callback
> > of the parent. So if the parent wants to implement draining by just
> > submitting its queue, we're submitting requests to a child where this
> > isn't allowed.
> 
> ...exactly this. :-/
> 
> So a non-recursive drain would just fix the assert, but not this issue.
> And I'm not sure whether this is reasonable.  Fixing the assert but
> allowing the driver to submit requests to a drained node is just wrong.

Well, apart from what I described below, I see two more solutions:

1. Just drain before replacing the child node. Then the requests still
   go to the old child, which wasn't drained. (Rough sketch below.)

2. Require that the parent node be drained before it can be involved in
   a bdrv_replace_child_noperm() operation. This is what I suggested
   above. Right now it seems like a good idea. Not sure if it still does
   after actually trying it out.
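
For 1. the relevant part of bdrv_replace_child_noperm() would be
reordered roughly like this (sketch, untested):

    /* quiesce the parent while it is still attached to the old,
     * undrained child; only then switch the child pointer over */
    if (new_bs && new_bs->quiesce_counter && child->role->drained_begin) {
        child->role->drained_begin(child);
    }

    child->bs = new_bs;
    if (new_bs) {
        QLIST_INSERT_HEAD(&new_bs->parents, child, next_parent);
    }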

> > If we had separated the two operations, we could have two BlockDriver
> > callbacks, one triggering the queue flush, and the other one requiring
> > that no new requests from the queue be submitted.
> 
> Yes.  Then we could require that every driver implements a way to park
> requests instead of only pausing request submissions through flushing.

I think in the long run we'll need this, but not in the strict form that
this sounds like. We don't need a function to "park" already running
requests, just one to avoid issuing new requests. So not every driver
must implement this, just every driver that can issue requests by
itself.

Kevin

