* [Qemu-devel] Drainage in bdrv_replace_child_noperm() @ 2017-11-06 18:49 Max Reitz 2017-11-07 5:21 ` Fam Zheng 2017-11-07 14:22 ` [Qemu-devel] [Qemu-block] " Kevin Wolf 0 siblings, 2 replies; 6+ messages in thread From: Max Reitz @ 2017-11-06 18:49 UTC (permalink / raw) To: Qemu-block, Qemu-devel [-- Attachment #1: Type: text/plain, Size: 4291 bytes --] Hi everyone, On my quest to fix some flaky iotests, I came to a bit of a halt on 129. (Details: Its issue is that block jobs now generally ignore throttling in a BB (because they use their own), so we have to add a throttle node instead. However, when I added it, I got an abort.) My issue can be reproduced as follows: $ x86_64-softmmu/qemu-system-x86_64 \ -qmp stdio \ -object throttle-group,id=tg0 \ -blockdev "{'driver':'throttle','node-name':'drive0', 'throttle-group':'tg0','file':{'driver':'null-co'}}" \ -blockdev node-name=target,driver=null-co {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2}, "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}} {'execute':'qmp_capabilities'} {"return": {}} {'execute':'blockdev-mirror','arguments':{ 'device':'drive0','job-id':'job0','target':'target','sync':'full', 'filter-node-name':'mirror-node' }} qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end: Assertion `tgm->io_limits_disabled' failed. [1] 3524 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp stdio -object throttle-group,id=tg0 Here's what happens: (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0. (2) bdrv_append(...) puts mirror-node above drive0. Through bdrv_replace_child_noperm(), this will invoke bdrv_child_cb_drained_begin() on mirror-node. This is necessary because drive0 is drained, so the new parent needs to be drained as well. However, note that drive0 is not yet attached to mirror-node. Therefore, mirror-node cannot drain drive0 recursively. This is seemingly fine because drive0 is drained anyway. However, this is different from what would happen if we would have drained drive0 with mirror-node already attached to it as its parent: Then, we would have drained drive0 twice; once by itself, and another time recursively through mirror-node. This will be important in a second... (3) ...and this second is now: We invoke bdrv_drained_end() on drive0. Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end() that goes up to mirror-node which recursively un-drains drive0. Fine so far. But once that parent un-drain is done, we un-drain drive0 by itself: And this fails the assertion in the throttle driver because we attempt to un-drain it twice, although we've drained it only once. So the issue has two parts: (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain that child, too. This seems a bit superfluous to me and I would guess that it results in worst-case O(n^2) function calls to drain a block graph consisting of n nodes. (B) In bdrv_replace_child_noperm() we try to drain the parent if the new child is drained; specifically, we want it to be in a state as if it had been a parent when the child was originally drained. However, we fail at this because we drain the parent without the child attached, so we don't drain the child twice. This bites us when we undrain everything. (Most importantly, ideally we'd want to attach the new child to the parent and then drain the parent: This would give us exactly the state we want. However, attaching the child first and then draining the parent is unsafe, so we cannot do it...) [1] Whether the parent (un-)drains the child depends on the BdrvChildRole.drained_{begin,end}() implementation, strictly speaking. We cannot say it generally. OK, so how to fix it? I don't know, so I'm asking you. :-) I have two ideas: One is to assume that (un-)draining a parent will always (un-)drain all children, including the one the (un-)drain comes from. This assumption seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just explicitly drain the new child in bdrv_replace_child_noperm() after having drained the parent and thus get a consistent state again. The other is to declare (A) wrong. Maybe when BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that child because we can declare it the caller's responsibility to make sure it's drained. This seems logical to me because usually those methods are invoked when the child is drained anyway. But maybe I'm wrong. :-) So, any ideas? Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Drainage in bdrv_replace_child_noperm() 2017-11-06 18:49 [Qemu-devel] Drainage in bdrv_replace_child_noperm() Max Reitz @ 2017-11-07 5:21 ` Fam Zheng 2017-11-08 19:39 ` Max Reitz 2017-11-07 14:22 ` [Qemu-devel] [Qemu-block] " Kevin Wolf 1 sibling, 1 reply; 6+ messages in thread From: Fam Zheng @ 2017-11-07 5:21 UTC (permalink / raw) To: Max Reitz; +Cc: Qemu-block, Qemu-devel On Mon, 11/06 19:49, Max Reitz wrote: > Hi everyone, > > On my quest to fix some flaky iotests, I came to a bit of a halt on 129. > (Details: Its issue is that block jobs now generally ignore throttling > in a BB (because they use their own), so we have to add a throttle node > instead. However, when I added it, I got an abort.) > > My issue can be reproduced as follows: > > $ x86_64-softmmu/qemu-system-x86_64 \ > -qmp stdio \ > -object throttle-group,id=tg0 \ > -blockdev "{'driver':'throttle','node-name':'drive0', > 'throttle-group':'tg0','file':{'driver':'null-co'}}" \ > -blockdev node-name=target,driver=null-co > {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2}, > "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}} > {'execute':'qmp_capabilities'} > {"return": {}} > {'execute':'blockdev-mirror','arguments':{ > 'device':'drive0','job-id':'job0','target':'target','sync':'full', > 'filter-node-name':'mirror-node' }} > qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end: > Assertion `tgm->io_limits_disabled' failed. > [1] 3524 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp > stdio -object throttle-group,id=tg0 > > Here's what happens: > > (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0. > > (2) bdrv_append(...) puts mirror-node above drive0. Through > bdrv_replace_child_noperm(), this will invoke > bdrv_child_cb_drained_begin() on mirror-node. This is necessary because > drive0 is drained, so the new parent needs to be drained as well. > However, note that drive0 is not yet attached to mirror-node. > Therefore, mirror-node cannot drain drive0 recursively. > > This is seemingly fine because drive0 is drained anyway. However, this > is different from what would happen if we would have drained drive0 with > mirror-node already attached to it as its parent: Then, we would have > drained drive0 twice; once by itself, and another time recursively > through mirror-node. > > This will be important in a second... > > (3) ...and this second is now: We invoke bdrv_drained_end() on drive0. > Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end() > that goes up to mirror-node which recursively un-drains drive0. Fine so > far. But once that parent un-drain is done, we un-drain drive0 by > itself: And this fails the assertion in the throttle driver because we > attempt to un-drain it twice, although we've drained it only once. So it is not a problem specific to throttle, but it's a problem that drain/undrain pairs in bdrv_drained_begin and bdrv_drained_end are uneven. Throttle filter just happens to assert it's even, so we get an abort. > > > So the issue has two parts: > > (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain > that child, too. This seems a bit superfluous to me and I would guess > that it results in worst-case O(n^2) function calls to drain a block > graph consisting of n nodes. > > (B) In bdrv_replace_child_noperm() we try to drain the parent if the new > child is drained; specifically, we want it to be in a state as if it had > been a parent when the child was originally drained. However, we fail > at this because we drain the parent without the child attached, so we > don't drain the child twice. This bites us when we undrain everything. > > (Most importantly, ideally we'd want to attach the new child to the > parent and then drain the parent: This would give us exactly the state > we want. However, attaching the child first and then draining the > parent is unsafe, so we cannot do it...) > > [1] Whether the parent (un-)drains the child depends on the > BdrvChildRole.drained_{begin,end}() implementation, strictly speaking. > We cannot say it generally. > > OK, so how to fix it? I don't know, so I'm asking you. :-) > > I have two ideas: > > One is to assume that (un-)draining a parent will always (un-)drain all > children, including the one the (un-)drain comes from. This assumption > seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just > explicitly drain the new child in bdrv_replace_child_noperm() after > having drained the parent and thus get a consistent state again. > > The other is to declare (A) wrong. Maybe when > BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that > child because we can declare it the caller's responsibility to make sure > it's drained. This seems logical to me because usually those methods > are invoked when the child is drained anyway. But maybe I'm wrong. :-) I'm in favor of asking the caller to make sure all nodes involved in the graph manupulation are drained, it feels comparably easier to do, than fixing the problem in bdrv_append(). Fam ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] Drainage in bdrv_replace_child_noperm() 2017-11-07 5:21 ` Fam Zheng @ 2017-11-08 19:39 ` Max Reitz 0 siblings, 0 replies; 6+ messages in thread From: Max Reitz @ 2017-11-08 19:39 UTC (permalink / raw) To: Fam Zheng; +Cc: Qemu-block, Qemu-devel [-- Attachment #1: Type: text/plain, Size: 1259 bytes --] On 2017-11-07 06:21, Fam Zheng wrote: > On Mon, 11/06 19:49, Max Reitz wrote: [...] >> I have two ideas: >> >> One is to assume that (un-)draining a parent will always (un-)drain all >> children, including the one the (un-)drain comes from. This assumption >> seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just >> explicitly drain the new child in bdrv_replace_child_noperm() after >> having drained the parent and thus get a consistent state again. >> >> The other is to declare (A) wrong. Maybe when >> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that >> child because we can declare it the caller's responsibility to make sure >> it's drained. This seems logical to me because usually those methods >> are invoked when the child is drained anyway. But maybe I'm wrong. :-) > > I'm in favor of asking the caller to make sure all nodes involved in the graph > manupulation are drained, it feels comparably easier to do, than fixing the > problem in bdrv_append(). I guess my main question was whether any of the two approaches is evidently wrong in some way, but if you don't say that (and Kevin says the first is wrong), I'm going to assume doing the second is OK. :-) Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm() 2017-11-06 18:49 [Qemu-devel] Drainage in bdrv_replace_child_noperm() Max Reitz 2017-11-07 5:21 ` Fam Zheng @ 2017-11-07 14:22 ` Kevin Wolf 2017-11-08 20:16 ` Max Reitz 1 sibling, 1 reply; 6+ messages in thread From: Kevin Wolf @ 2017-11-07 14:22 UTC (permalink / raw) To: Max Reitz; +Cc: Qemu-block, Qemu-devel [-- Attachment #1: Type: text/plain, Size: 7854 bytes --] Am 06.11.2017 um 19:49 hat Max Reitz geschrieben: > Hi everyone, > > On my quest to fix some flaky iotests, I came to a bit of a halt on 129. > (Details: Its issue is that block jobs now generally ignore throttling > in a BB (because they use their own), so we have to add a throttle node > instead. However, when I added it, I got an abort.) > > My issue can be reproduced as follows: > > $ x86_64-softmmu/qemu-system-x86_64 \ > -qmp stdio \ > -object throttle-group,id=tg0 \ > -blockdev "{'driver':'throttle','node-name':'drive0', > 'throttle-group':'tg0','file':{'driver':'null-co'}}" \ > -blockdev node-name=target,driver=null-co > {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2}, > "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}} > {'execute':'qmp_capabilities'} > {"return": {}} > {'execute':'blockdev-mirror','arguments':{ > 'device':'drive0','job-id':'job0','target':'target','sync':'full', > 'filter-node-name':'mirror-node' }} > qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end: > Assertion `tgm->io_limits_disabled' failed. > [1] 3524 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp > stdio -object throttle-group,id=tg0 > > Here's what happens: > > (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0. > > (2) bdrv_append(...) puts mirror-node above drive0. Through > bdrv_replace_child_noperm(), this will invoke > bdrv_child_cb_drained_begin() on mirror-node. This is necessary because > drive0 is drained, so the new parent needs to be drained as well. > However, note that drive0 is not yet attached to mirror-node. > Therefore, mirror-node cannot drain drive0 recursively. Important context: We're talking about bdrv_set_backing_hd() here. It's also not quite correct to say that drive0 is not yet attached, but we're in a weird half-attached state. The BdrvChild is already initialised and in the parent list of drive0, but it's not yet assigned to mirror_node->backing nor in mirror_node's child list. For this specific case it looks like this is indeed the same as not being attached, but I wouldn't be surprised if we saw stranger effects at some point. > This is seemingly fine because drive0 is drained anyway. However, this > is different from what would happen if we would have drained drive0 with > mirror-node already attached to it as its parent: Then, we would have > drained drive0 twice; once by itself, and another time recursively > through mirror-node. > > This will be important in a second... > > (3) ...and this second is now: We invoke bdrv_drained_end() on drive0. > Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end() > that goes up to mirror-node which recursively un-drains drive0. Fine so > far. But once that parent un-drain is done, we un-drain drive0 by > itself: And this fails the assertion in the throttle driver because we > attempt to un-drain it twice, although we've drained it only once. > > > So the issue has two parts: > > (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain > that child, too. This seems a bit superfluous to me and I would guess > that it results in worst-case O(n^2) function calls to drain a block > graph consisting of n nodes. > > (B) In bdrv_replace_child_noperm() we try to drain the parent if the new > child is drained; specifically, we want it to be in a state as if it had > been a parent when the child was originally drained. However, we fail > at this because we drain the parent without the child attached, so we > don't drain the child twice. This bites us when we undrain everything. I think the issue is much simpler, even though it still has two parts. It's the old story of bdrv_drain mixing two separate concepts: 1. Wait synchronously for the completion of all my requests to this node. This needs to be propagated down the graph to the children. 2. Make sure that nobody else sends new requests to this node. This needs to be propagated up the graph to the parents. Some callers want only 1. (usually callers of bdrv_drain_all() without a begin/end pair), some callers want both 1. and 2. (that's the begin/end construction for getting exclusive access). Not sure if there are callers that want only 2., but possibly. If we actually take this separation serious, the first step to a fix could be that BdrvChildRole.drained_begin shouldn't be propagated to the children. We may still need to drain the requests at the node itself: Imagine a drained backing file of qcow2 node. Making sure that the qcow2 node doesn't get new requests isn't enough, we also must make sure that in-flight requests don't access the backing file any more. This means draining the qcow2 node, though we don't care whether its child nodes still have requests in flight. The big question is whether bdrv_drain() would still work for a single node without recursing to the children, but as it uses bs->in_flight, I think that should be okay these days. > (Most importantly, ideally we'd want to attach the new child to the > parent and then drain the parent: This would give us exactly the state > we want. However, attaching the child first and then draining the > parent is unsafe, so we cannot do it...) > > [1] Whether the parent (un-)drains the child depends on the > BdrvChildRole.drained_{begin,end}() implementation, strictly speaking. > We cannot say it generally. > > OK, so how to fix it? I don't know, so I'm asking you. :-) The conclusion from what I wrote above would be to add a non-recursive drain function (probably a version of bdrv_drained_begin/end with a bool parameter) and call that from bdrv_child_cb_drained_begin/end. This would still only be a partial solution because we still maintain the single interface for two different purposes, but it should be a step in the right direction and fix the problem at hand. > I have two ideas: > > One is to assume that (un-)draining a parent will always (un-)drain all > children, including the one the (un-)drain comes from. This assumption > seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just > explicitly drain the new child in bdrv_replace_child_noperm() after > having drained the parent and thus get a consistent state again. I agree that this is wrong. > The other is to declare (A) wrong. Maybe when > BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that > child because we can declare it the caller's responsibility to make sure > it's drained. This seems logical to me because usually those methods > are invoked when the child is drained anyway. But maybe I'm wrong. :-) Looks like a similar resolution as I suggest, though I like my reasoning for it better. ;-) > So, any ideas? Just an additional thought (aka "alles kaputt"): The throttle driver will respond to BlockDriver.bdrv_drained_begin() by completing all of the queued requests (ignoring the I/O limits) before it returns. This is great when the drain request comes from a parent because it just want to get everything completed. It's kind of a problem when the drain request comes from a child which is already drained... To be more specific, you pointed at bdrv_replace_child_noperm(). This replaces child->bs first and only then calls the .drained_begin callback of the parent. So if the parent wants to implement draining by just submiting its queue, we're submitting requests to a child where this isn't allowed. If we had separated the two operations, we could have two BlockDriver callbacks, one triggering the queue flush, and the other one requiring that no new requests from the queue be submitted. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm() 2017-11-07 14:22 ` [Qemu-devel] [Qemu-block] " Kevin Wolf @ 2017-11-08 20:16 ` Max Reitz 2017-11-09 16:25 ` Kevin Wolf 0 siblings, 1 reply; 6+ messages in thread From: Max Reitz @ 2017-11-08 20:16 UTC (permalink / raw) To: Kevin Wolf; +Cc: Qemu-block, Qemu-devel [-- Attachment #1: Type: text/plain, Size: 9894 bytes --] On 2017-11-07 15:22, Kevin Wolf wrote: > Am 06.11.2017 um 19:49 hat Max Reitz geschrieben: >> Hi everyone, >> >> On my quest to fix some flaky iotests, I came to a bit of a halt on 129. >> (Details: Its issue is that block jobs now generally ignore throttling >> in a BB (because they use their own), so we have to add a throttle node >> instead. However, when I added it, I got an abort.) >> >> My issue can be reproduced as follows: >> >> $ x86_64-softmmu/qemu-system-x86_64 \ >> -qmp stdio \ >> -object throttle-group,id=tg0 \ >> -blockdev "{'driver':'throttle','node-name':'drive0', >> 'throttle-group':'tg0','file':{'driver':'null-co'}}" \ >> -blockdev node-name=target,driver=null-co >> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2}, >> "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}} >> {'execute':'qmp_capabilities'} >> {"return": {}} >> {'execute':'blockdev-mirror','arguments':{ >> 'device':'drive0','job-id':'job0','target':'target','sync':'full', >> 'filter-node-name':'mirror-node' }} >> qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end: >> Assertion `tgm->io_limits_disabled' failed. >> [1] 3524 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp >> stdio -object throttle-group,id=tg0 >> >> Here's what happens: >> >> (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0. >> >> (2) bdrv_append(...) puts mirror-node above drive0. Through >> bdrv_replace_child_noperm(), this will invoke >> bdrv_child_cb_drained_begin() on mirror-node. This is necessary because >> drive0 is drained, so the new parent needs to be drained as well. >> However, note that drive0 is not yet attached to mirror-node. >> Therefore, mirror-node cannot drain drive0 recursively. > > Important context: We're talking about bdrv_set_backing_hd() here. > > It's also not quite correct to say that drive0 is not yet attached, but > we're in a weird half-attached state. The BdrvChild is already > initialised and in the parent list of drive0, but it's not yet assigned > to mirror_node->backing nor in mirror_node's child list. > > For this specific case it looks like this is indeed the same as not > being attached, but I wouldn't be surprised if we saw stranger effects > at some point. > >> This is seemingly fine because drive0 is drained anyway. However, this >> is different from what would happen if we would have drained drive0 with >> mirror-node already attached to it as its parent: Then, we would have >> drained drive0 twice; once by itself, and another time recursively >> through mirror-node. >> >> This will be important in a second... >> >> (3) ...and this second is now: We invoke bdrv_drained_end() on drive0. >> Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end() >> that goes up to mirror-node which recursively un-drains drive0. Fine so >> far. But once that parent un-drain is done, we un-drain drive0 by >> itself: And this fails the assertion in the throttle driver because we >> attempt to un-drain it twice, although we've drained it only once. >> >> >> So the issue has two parts: >> >> (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain >> that child, too. This seems a bit superfluous to me and I would guess >> that it results in worst-case O(n^2) function calls to drain a block >> graph consisting of n nodes. >> >> (B) In bdrv_replace_child_noperm() we try to drain the parent if the new >> child is drained; specifically, we want it to be in a state as if it had >> been a parent when the child was originally drained. However, we fail >> at this because we drain the parent without the child attached, so we >> don't drain the child twice. This bites us when we undrain everything. > > I think the issue is much simpler, even though it still has two parts. > It's the old story of bdrv_drain mixing two separate concepts: > > 1. Wait synchronously for the completion of all my requests to this > node. This needs to be propagated down the graph to the children. So, flush without flushing protocol drivers? :-) > 2. Make sure that nobody else sends new requests to this node. This > needs to be propagated up the graph to the parents. > > Some callers want only 1. (usually callers of bdrv_drain_all() without a > begin/end pair), some callers want both 1. and 2. (that's the begin/end > construction for getting exclusive access). Not sure if there are > callers that want only 2., but possibly. > > If we actually take this separation serious, the first step to a fix > could be that BdrvChildRole.drained_begin shouldn't be propagated to the > children. That seems good to me, but considering that the default was to just propagate it to every child, I thought that I was missing something. > We may still need to drain the requests at the node itself: > > Imagine a drained backing file of qcow2 node. Making sure that the qcow2 > node doesn't get new requests isn't enough, we also must make sure that > in-flight requests don't access the backing file any more. This means > draining the qcow2 node, though we don't care whether its child nodes > still have requests in flight. If you want to stop the qcow2 node from generating further requests, you can either flush them (as you said) or pause them (whatever that means...). However, if you flush them, you also need to flush the children you have flushed them to... So what I wrote was technically wrong ("don't pass parent drainage back to the child because the child is already drained"), instead it should be "don't pass parent drainage back to the child because the child is going to be drained (flushed) anyway". So, pause requests first (either by parking them or flushing them, driver's choice), then flush the tree. Of course that's easier said than done... First, we would need a non-recursive flush. Then, every node that is visited by the drain would (*after* recursing to its parents) need to flush itself. (Note that I'm completely disregarding nodes which are below the original node that is supposed to be drained, and which therefore do not drain their parents (or do they?) because I'd like to retain at least some of my sanity for now.) Secondly we have exactly the issue you describe below... > The big question is whether bdrv_drain() would still work for a single > node without recursing to the children, but as it uses bs->in_flight, I > think that should be okay these days. > >> (Most importantly, ideally we'd want to attach the new child to the >> parent and then drain the parent: This would give us exactly the state >> we want. However, attaching the child first and then draining the >> parent is unsafe, so we cannot do it...) >> >> [1] Whether the parent (un-)drains the child depends on the >> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking. >> We cannot say it generally. >> >> OK, so how to fix it? I don't know, so I'm asking you. :-) > > The conclusion from what I wrote above would be to add a non-recursive > drain function (probably a version of bdrv_drained_begin/end with a bool > parameter) and call that from bdrv_child_cb_drained_begin/end. > > This would still only be a partial solution because we still maintain > the single interface for two different purposes, but it should be a step > in the right direction and fix the problem at hand. Well, except... >> I have two ideas: >> >> One is to assume that (un-)draining a parent will always (un-)drain all >> children, including the one the (un-)drain comes from. This assumption >> seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just >> explicitly drain the new child in bdrv_replace_child_noperm() after >> having drained the parent and thus get a consistent state again. > > I agree that this is wrong. > >> The other is to declare (A) wrong. Maybe when >> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that >> child because we can declare it the caller's responsibility to make sure >> it's drained. This seems logical to me because usually those methods >> are invoked when the child is drained anyway. But maybe I'm wrong. :-) > > Looks like a similar resolution as I suggest, though I like my reasoning > for it better. ;-) > >> So, any ideas? > > Just an additional thought (aka "alles kaputt"): The throttle driver > will respond to BlockDriver.bdrv_drained_begin() by completing all of > the queued requests (ignoring the I/O limits) before it returns. This is > great when the drain request comes from a parent because it just want to > get everything completed. It's kind of a problem when the drain request > comes from a child which is already drained... > > To be more specific, you pointed at bdrv_replace_child_noperm(). This > replaces child->bs first and only then calls the .drained_begin callback > of the parent. So if the parent wants to implement draining by just > submiting its queue, we're submitting requests to a child where this > isn't allowed. ...exactly this. :-/ So a non-recursive drain would just fix the assert, but not this issue. And I'm not sure whether this is reasonable. Fixing the assert but allowing the driver to submit requests to a drained node is just wrong. > If we had separated the two operations, we could have two BlockDriver > callbacks, one triggering the queue flush, and the other one requiring > that no new requests from the queue be submitted. Yes. Then we could require that every driver implements a way to park requests instead of only pausing request submissions through flushing. ...I guess I'll leave 129 just broken for now. :-/ Max [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 512 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm() 2017-11-08 20:16 ` Max Reitz @ 2017-11-09 16:25 ` Kevin Wolf 0 siblings, 0 replies; 6+ messages in thread From: Kevin Wolf @ 2017-11-09 16:25 UTC (permalink / raw) To: Max Reitz; +Cc: Qemu-block, Qemu-devel [-- Attachment #1: Type: text/plain, Size: 8689 bytes --] Am 08.11.2017 um 21:16 hat Max Reitz geschrieben: > On 2017-11-07 15:22, Kevin Wolf wrote: > > I think the issue is much simpler, even though it still has two parts. > > It's the old story of bdrv_drain mixing two separate concepts: > > > > 1. Wait synchronously for the completion of all my requests to this > > node. This needs to be propagated down the graph to the children. > > So, flush without flushing protocol drivers? :-) No, flush is a completely different thing. Drain is about completing all in-flight requests, whereas flush is about writing out all caches. You can do either one without the other. In particular, a flush doesn't involve a drain, you can still requests while a flush request is in flight. The semantics is that a flush makes sure that all requests which had already completed when the flush request was started are stable on disk. Later requests may or may not be stable on disk yet. > > 2. Make sure that nobody else sends new requests to this node. This > > needs to be propagated up the graph to the parents. > > > > Some callers want only 1. (usually callers of bdrv_drain_all() without a > > begin/end pair), some callers want both 1. and 2. (that's the begin/end > > construction for getting exclusive access). Not sure if there are > > callers that want only 2., but possibly. > > > > If we actually take this separation serious, the first step to a fix > > could be that BdrvChildRole.drained_begin shouldn't be propagated to the > > children. > > That seems good to me, but considering that the default was to just > propagate it to every child, I thought that I was missing something. bdrv_child_cb_drained_begin() calls bdrv_drained_begin() to wait for the completion of all running requests on its current node. The propagation to every child is an unwanted side effect, but so far it didn't seem to hurt anyone, so we didn't care about preventing it. > > We may still need to drain the requests at the node itself: > > > Imagine a drained backing file of qcow2 node. Making sure that the qcow2 > > node doesn't get new requests isn't enough, we also must make sure that > > in-flight requests don't access the backing file any more. This means > > draining the qcow2 node, though we don't care whether its child nodes > > still have requests in flight. > > If you want to stop the qcow2 node from generating further requests, you > can either flush them (as you said) or pause them (whatever that means...). Pausing is probably way to complicated. qcow2 never issues requests by itself. It only has requests running if someone else has a request running on the qcow2 node. So it's enough to just drain the request queue of the qcow2 node to get to the state we want. The only thing we need to make sure is that we drain it _before_ its child node is replaced with a drained one, so that its requests can complete. In fact, I think graph modifications should only be done with drained nodes anyway. Otherwise you could switch over in the middle of a single high-level operation and possibly get callbacks from a node that isn't a child any more. Maybe we should add some assertions to that effect and clean up whatever code breaks after it. > However, if you flush them, you also need to flush the children you have > flushed them to... So what I wrote was technically wrong ("don't pass > parent drainage back to the child because the child is already > drained"), instead it should be "don't pass parent drainage back to the > child because the child is going to be drained (flushed) anyway". > > So, pause requests first (either by parking them or flushing them, > driver's choice), then flush the tree. > > Of course that's easier said than done... First, we would need a > non-recursive flush. Then, every node that is visited by the drain > would (*after* recursing to its parents) need to flush itself. > > (Note that I'm completely disregarding nodes which are below the > original node that is supposed to be drained, and which therefore do not > drain their parents (or do they?) because I'd like to retain at least > some of my sanity for now.) I don't follow, but I suppose this is related to the flush/drain confusion. > Secondly we have exactly the issue you describe below... > > > The big question is whether bdrv_drain() would still work for a single > > node without recursing to the children, but as it uses bs->in_flight, I > > think that should be okay these days. > > > >> (Most importantly, ideally we'd want to attach the new child to the > >> parent and then drain the parent: This would give us exactly the state > >> we want. However, attaching the child first and then draining the > >> parent is unsafe, so we cannot do it...) > >> > >> [1] Whether the parent (un-)drains the child depends on the > >> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking. > >> We cannot say it generally. > >> > >> OK, so how to fix it? I don't know, so I'm asking you. :-) > > > > The conclusion from what I wrote above would be to add a non-recursive > > drain function (probably a version of bdrv_drained_begin/end with a bool > > parameter) and call that from bdrv_child_cb_drained_begin/end. > > > > This would still only be a partial solution because we still maintain > > the single interface for two different purposes, but it should be a step > > in the right direction and fix the problem at hand. > > Well, except... Yes, bug 1 would be fixed, but not yet bug 2. > >> I have two ideas: > >> > >> One is to assume that (un-)draining a parent will always (un-)drain all > >> children, including the one the (un-)drain comes from. This assumption > >> seems wrong, see [1], but maybe it isn't. Anyway, if so, we could just > >> explicitly drain the new child in bdrv_replace_child_noperm() after > >> having drained the parent and thus get a consistent state again. > > > > I agree that this is wrong. > > > >> The other is to declare (A) wrong. Maybe when > >> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that > >> child because we can declare it the caller's responsibility to make sure > >> it's drained. This seems logical to me because usually those methods > >> are invoked when the child is drained anyway. But maybe I'm wrong. :-) > > > > Looks like a similar resolution as I suggest, though I like my reasoning > > for it better. ;-) > > > >> So, any ideas? > > > > Just an additional thought (aka "alles kaputt"): The throttle driver > > will respond to BlockDriver.bdrv_drained_begin() by completing all of > > the queued requests (ignoring the I/O limits) before it returns. This is > > great when the drain request comes from a parent because it just want to > > get everything completed. It's kind of a problem when the drain request > > comes from a child which is already drained... > > > > To be more specific, you pointed at bdrv_replace_child_noperm(). This > > replaces child->bs first and only then calls the .drained_begin callback > > of the parent. So if the parent wants to implement draining by just > > submiting its queue, we're submitting requests to a child where this > > isn't allowed. > > ...exactly this. :-/ > > So a non-recursive drain would just fix the assert, but not this issue. > And I'm not sure whether this is reasonable. Fixing the assert but > allowing the driver to submit requests to a drained node is just wrong. Well, apart from what I described below, I see two more solutions: 1. Just drain before replacing the child node. Then the requests still go to the old child, which wasn't drained. 2. Require that the parent node be drained before it can be involved in a bdrv_replace_child_noperm() operation. This is what I suggested above. Right now it seems like a good idea. Not sure if it still does after actually trying it out. > > If we had separated the two operations, we could have two BlockDriver > > callbacks, one triggering the queue flush, and the other one requiring > > that no new requests from the queue be submitted. > > Yes. Then we could require that every driver implements a way to park > requests instead of only pausing request submissions through flushing. I think in the long run we'll need this, but not in the strict form that this sounds like. We don't need a function to "park" already running requests, just one to avoid issuing new requests. So not every driver must implement this, just every driver that can issue requests by itself. Kevin [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 801 bytes --] ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2017-11-09 16:25 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-11-06 18:49 [Qemu-devel] Drainage in bdrv_replace_child_noperm() Max Reitz 2017-11-07 5:21 ` Fam Zheng 2017-11-08 19:39 ` Max Reitz 2017-11-07 14:22 ` [Qemu-devel] [Qemu-block] " Kevin Wolf 2017-11-08 20:16 ` Max Reitz 2017-11-09 16:25 ` Kevin Wolf
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.