From: Kevin Wolf
To: Max Reitz
Cc: Qemu-block, Qemu-devel
Date: Tue, 7 Nov 2017 15:22:18 +0100
Subject: Re: [Qemu-devel] [Qemu-block] Drainage in bdrv_replace_child_noperm()
Message-ID: <20171107142218.GC4706@localhost.localdomain>
In-Reply-To: <8a184b91-49ef-bb52-d190-053c4c0861a1@redhat.com>

On 06.11.2017 at 19:49, Max Reitz wrote:
> Hi everyone,
>
> On my quest to fix some flaky iotests, I came to a bit of a halt on 129.
> (Details: Its issue is that block jobs now generally ignore throttling
> in a BB (because they use their own), so we have to add a throttle node
> instead. However, when I added it, I got an abort.)
>
> My issue can be reproduced as follows:
>
> $ x86_64-softmmu/qemu-system-x86_64 \
>     -qmp stdio \
>     -object throttle-group,id=tg0 \
>     -blockdev "{'driver':'throttle','node-name':'drive0',
>                 'throttle-group':'tg0','file':{'driver':'null-co'}}" \
>     -blockdev node-name=target,driver=null-co
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2},
> "package": " (v2.9.0-632-g4a52d43-dirty)"}, "capabilities": []}}
> {'execute':'qmp_capabilities'}
> {"return": {}}
> {'execute':'blockdev-mirror','arguments':{
>     'device':'drive0','job-id':'job0','target':'target','sync':'full',
>     'filter-node-name':'mirror-node' }}
> qemu-system-x86_64: block/throttle.c:213: throttle_co_drain_end:
> Assertion `tgm->io_limits_disabled' failed.
> [1] 3524 abort (core dumped) x86_64-softmmu/qemu-system-x86_64 -qmp
> stdio -object throttle-group,id=tg0
>
> Here's what happens:
>
> (1) bdrv_drained_begin(bs) in mirror_start_job() starts draining drive0.
>
> (2) bdrv_append(...) puts mirror-node above drive0. Through
> bdrv_replace_child_noperm(), this will invoke
> bdrv_child_cb_drained_begin() on mirror-node. This is necessary because
> drive0 is drained, so the new parent needs to be drained as well.
> However, note that drive0 is not yet attached to mirror-node.
> Therefore, mirror-node cannot drain drive0 recursively.

Important context: We're talking about bdrv_set_backing_hd() here.

It's also not quite correct to say that drive0 is not yet attached;
rather, we're in a weird half-attached state. The BdrvChild is already
initialised and in the parent list of drive0, but it is neither assigned
to mirror_node->backing nor in mirror_node's child list yet.

For this specific case it looks like this is indeed the same as not
being attached, but I wouldn't be surprised if we saw stranger effects
at some point.

> This is seemingly fine because drive0 is drained anyway.
> However, this is different from what would happen if we would have
> drained drive0 with mirror-node already attached to it as its parent:
> Then, we would have drained drive0 twice; once by itself, and another
> time recursively through mirror-node.
>
> This will be important in a second...
>
> (3) ...and this second is now: We invoke bdrv_drained_end() on drive0.
> Now, through bdrv_parent_drained_end() and bdrv_child_cb_drained_end()
> that goes up to mirror-node which recursively un-drains drive0. Fine so
> far. But once that parent un-drain is done, we un-drain drive0 by
> itself: And this fails the assertion in the throttle driver because we
> attempt to un-drain it twice, although we've drained it only once.
>
>
> So the issue has two parts:
>
> (A) (Un-)Draining a parent from a child will always (?[1]) (un-)drain
> that child, too. This seems a bit superfluous to me and I would guess
> that it results in worst-case O(n^2) function calls to drain a block
> graph consisting of n nodes.
>
> (B) In bdrv_replace_child_noperm() we try to drain the parent if the new
> child is drained; specifically, we want it to be in a state as if it had
> been a parent when the child was originally drained. However, we fail
> at this because we drain the parent without the child attached, so we
> don't drain the child twice. This bites us when we undrain everything.

I think the issue is much simpler, even though it still has two parts.
It's the old story of bdrv_drain mixing two separate concepts:

1. Wait synchronously for the completion of all my requests to this
   node. This needs to be propagated down the graph to the children.

2. Make sure that nobody else sends new requests to this node. This
   needs to be propagated up the graph to the parents.

Some callers want only 1. (usually callers of bdrv_drain_all() without a
begin/end pair), some callers want both 1. and 2. (that's the begin/end
construction for getting exclusive access).
Not sure if there are callers that want only 2., but possibly.

If we actually take this separation seriously, the first step to a fix
could be that BdrvChildRole.drained_begin shouldn't be propagated to the
children.

We may still need to drain the requests at the node itself: Imagine a
drained backing file of a qcow2 node. Making sure that the qcow2 node
doesn't get new requests isn't enough; we also must make sure that
in-flight requests don't access the backing file any more. This means
draining the qcow2 node, though we don't care whether its child nodes
still have requests in flight.

The big question is whether bdrv_drain() would still work for a single
node without recursing to the children, but as it uses bs->in_flight, I
think that should be okay these days.

> (Most importantly, ideally we'd want to attach the new child to the
> parent and then drain the parent: This would give us exactly the state
> we want. However, attaching the child first and then draining the
> parent is unsafe, so we cannot do it...)
>
> [1] Whether the parent (un-)drains the child depends on the
> BdrvChildRole.drained_{begin,end}() implementation, strictly speaking.
> We cannot say it generally.
>
> OK, so how to fix it? I don't know, so I'm asking you. :-)

The conclusion from what I wrote above would be to add a non-recursive
drain function (probably a version of bdrv_drained_begin/end with a bool
parameter) and call that from bdrv_child_cb_drained_begin/end.

This would still only be a partial solution because we still maintain
the single interface for two different purposes, but it should be a step
in the right direction and fix the problem at hand.

> I have two ideas:
>
> One is to assume that (un-)draining a parent will always (un-)drain all
> children, including the one the (un-)drain comes from. This assumption
> seems wrong, see [1], but maybe it isn't.
> Anyway, if so, we could just explicitly drain the new child in
> bdrv_replace_child_noperm() after having drained the parent and thus
> get a consistent state again.

I agree that this assumption is wrong.

> The other is to declare (A) wrong. Maybe when
> BdrvChildRole.drained_{begin,end}() is invoked, we should not drain that
> child because we can declare it the caller's responsibility to make sure
> it's drained. This seems logical to me because usually those methods
> are invoked when the child is drained anyway. But maybe I'm wrong. :-)

Looks like a similar resolution to the one I suggest, though I like my
reasoning for it better. ;-)

> So, any ideas?

Just an additional thought (aka "alles kaputt", i.e. everything is
broken): The throttle driver will respond to
BlockDriver.bdrv_drained_begin() by completing all of the queued
requests (ignoring the I/O limits) before it returns. This is great when
the drain request comes from a parent because it just wants to get
everything completed. It's kind of a problem when the drain request
comes from a child which is already drained...

To be more specific, you pointed at bdrv_replace_child_noperm(). This
replaces child->bs first and only then calls the .drained_begin callback
of the parent. So if the parent wants to implement draining by just
submitting its queue, we're submitting requests to a child where this
isn't allowed.

If we had separated the two operations, we could have two BlockDriver
callbacks, one triggering the queue flush, and the other one requiring
that no new requests from the queue be submitted.
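
To make the counting problem and the non-recursive variant suggested
above a bit more concrete, here is a minimal toy model in C. It is only
a sketch under stated assumptions, not QEMU code: ToyNode, toy_invoke(),
toy_drain() and toy_replay() are made-up names, every node has at most
one parent and one child, and the "driver callback" is reduced to a
counter of begins minus ends. It replays steps (1) to (3) and reports
where drive0's counter ends up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and functions are hypothetical and are
 * not QEMU's BlockDriverState or bdrv_* API. */
typedef struct ToyNode {
    struct ToyNode *parent;  /* at most one parent in this toy graph */
    struct ToyNode *child;   /* at most one child */
    int driver_drains;       /* driver's view: drain begins minus ends */
} ToyNode;

/* Whether the parent callback drains recursively (the behaviour
 * described in the thread) or only the parent node itself (the
 * suggested change). */
static bool parent_cb_recursive;

/* Invoke the "driver callback" on a node and optionally on its child. */
static void toy_invoke(ToyNode *n, int delta, bool recursive)
{
    n->driver_drains += delta;
    if (recursive && n->child) {
        toy_invoke(n->child, delta, true);
    }
}

/* Drain (+1) or undrain (-1) a node: notify the parent first, then
 * invoke the callback on the node itself. */
static void toy_drain(ToyNode *n, int delta, bool recursive)
{
    if (n->parent) {
        toy_drain(n->parent, delta, parent_cb_recursive);
    }
    toy_invoke(n, delta, recursive);
}

/* Replay steps (1)-(3); return drive0's final driver drain count.
 * A negative result corresponds to the unbalanced un-drain that trips
 * the assertion in the throttle driver. */
int toy_replay(bool recursive_parent_cb)
{
    ToyNode drive0 = { NULL, NULL, 0 };
    ToyNode mirror = { NULL, NULL, 0 };

    parent_cb_recursive = recursive_parent_cb;

    /* (1) drain drive0 while it has no parent */
    toy_drain(&drive0, +1, true);

    /* (2) the new parent is drained before the child is reachable
     * from it; only afterwards is the link completed */
    toy_drain(&mirror, +1, parent_cb_recursive);
    mirror.child = &drive0;
    drive0.parent = &mirror;

    /* (3) undrain drive0, which now has a parent */
    toy_drain(&drive0, -1, true);

    assert(mirror.driver_drains == 0);
    return drive0.driver_drains;
}
```

In this model toy_replay(true) leaves drive0 at -1 (one begin, two
ends), while toy_replay(false) leaves it balanced at 0. It says nothing
about in-flight requests or the half-attached state, which the toy
deliberately ignores.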

Kevin