From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id CB69BC433F5 for ; Mon, 23 May 2022 16:46:52 +0000 (UTC) Received: from localhost ([::1]:42196 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ntBCl-0001XJ-Px for qemu-devel@archiver.kernel.org; Mon, 23 May 2022 12:46:51 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:50402) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ntBBK-0008SD-EO for qemu-devel@nongnu.org; Mon, 23 May 2022 12:45:22 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:24740) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ntBBF-0008A9-1o for qemu-devel@nongnu.org; Mon, 23 May 2022 12:45:19 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1653324316; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=NwwmKnG1Bfi/vSH1ICA8r7FXR8UZv+2DgOUKXhEQrUA=; b=DbNUbxpkZqQDqZYg2naDdRPfW66Az4d8QZGaiTPgMnep086qry7cDrxllY5TRTaQa6VOtn BB9WikcUNQaLKpwWR/nWK4K4OtHjz3xMfn9YLkvw0Vq5m07GoEUDPwPi4tkskNRrpardXW oZWLbiscxzZbtVYxitWELACitWvGwbI= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-86-a9MM45XTOQWX_bUYTRZJyQ-1; Mon, 23 May 2022 12:45:12 -0400 X-MC-Unique: a9MM45XTOQWX_bUYTRZJyQ-1 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id BEEBC8039D7; Mon, 23 May 2022 16:45:11 +0000 (UTC) Received: from localhost (unknown [10.39.195.54]) by smtp.corp.redhat.com (Postfix) with ESMTP id 7224D2026D6A; Mon, 23 May 2022 16:45:10 +0000 (UTC) Date: Mon, 23 May 2022 17:45:09 +0100 From: Stefan Hajnoczi To: Kevin Wolf Cc: Paolo Bonzini , Emanuele Giuseppe Esposito , qemu-block@nongnu.org, Hanna Reitz , John Snow , Vladimir Sementsov-Ogievskiy , Fam Zheng , qemu-devel@nongnu.org Subject: Re: [RFC PATCH v2 0/8] Removal of AioContext lock, bs->parents and ->children: new rwlock Message-ID: References: <20220426085114.199647-1-eesposit@redhat.com> <6fc3e40e-7682-b9dc-f789-3ca95e0430db@redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="FUa+Hh+FHcSL3+Kp" Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 2.78 on 10.11.54.4 Received-SPF: pass client-ip=170.10.133.124; envelope-from=stefanha@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -28 X-Spam_score: -2.9 X-Spam_bar: -- X-Spam_report: (-2.9 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" --FUa+Hh+FHcSL3+Kp Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, May 23, 2022 at 06:04:55PM +0200, Kevin Wolf wrote: > Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben: > > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote: > > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben: > > > > Hi, > > > > Sorry for the long email. I've included three topics that may help = us discuss > > > > draining and AioContext removal further. > > > >=20 > > > > The shortcomings of drain > > > > ------------------------- > > > > I wanted to explore the logical argument that making graph modifica= tions within > > > > a drained section is correct: > > > > - Graph modifications and BB/BDS lookup are Global State (GS). > > > > - Graph traversal from a BB/BDS pointer is IO. > > > > - Only one thread executes GS code at a time. > > > > - IO is quiesced within a drained section. > > >=20 > > > I think you're mixing two things here and calling both of them IO. > > >=20 > > > If a function is marked as IO, that means that it is safe to call from > > > I/O requests, which may be running in any iothread (they currently on= ly > > > run in the home thread of the node's AioContext, but the function > > > annotations already promise that any thread is safe). > > >=20 > > > However, if a function is marked as IO, this doesn't necessarily mean > > > that it is always only called in the context of an I/O request. It can > > > be called by any other code, too. > > >=20 > > > So while drain does quiesce all I/O requests, this doesn't mean that > > > functions marked as IO won't run any more. > >=20 > > My model is that between bdrv_drained_begin() and bdrv_drained_end() > > only the caller is allowed to invoke BB/BDS functions (including > > functions marked IO). >=20 > Okay, this sounds better. It's not actually related to IO/GS, but to > BB/BDS functions, including both IO and GS functions. >=20 > So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and > BB/BDS functions need to be quiesced within a drained section. >=20 > > The caller isn't strictly one thread and one or no coroutines. The > > caller could use threads and coroutines, but the important thing is > > that everything else that accesses the BB/BDS is suspended due to > > .drained_begin() callbacks (and is_external). > >=20 > > So when you say a function marked IO can be called from outside an I/O > > request, that "outside an I/O request" code must be quiesced too. > > Otherwise drain is definitely not safe for graph modifications. >=20 > If you phrase it as a condition and as a goal to achieve, then I agree > that this is required when you want to use drain for locking. >=20 > My impression was that you were using this to argue that the code is > already doing this and is already safe in this scenario, and this isn't > true. I think I misunderstood you there and you were really describing > the future state that you're envisioning. Sorry, I didn't present it clearly. I'm trying to think through the model needed to make graph modifications under drain safe. > > > > - Therefore a drained section in GS code suspends graph traversal, = other graph > > > > modifications, and BB/BDS lookup. > > > > - Therefore it is safe to modify the graph from a GS drained sectio= n. > > >=20 > > > So unfortunately, I don't think this conclusion is correct. > > >=20 > > > In order to make your assumption true, we would have to require that = all > > > callers of IO functions must have called bdrv_inc_in_flight(). We wou= ld > > > also have to find a way to enforce this preferable at compile time or > > > with static analysis, or at least with runtime assertions, because it > > > would be very easy to get wrong. > >=20 > > Or that they are quiesced by .drained_begin() callbacks or is_external. > >=20 > > Do you have a concrete example? >=20 > Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this. > They just need to make bdrv_drain_poll() return true as long as they are > active so that bdrv_drained_begin() waits for them. .drained_poll() is > another valid way to achieve this. >=20 > However, if we want to rely on static analysis to verify that everything > is covered, always requiring bdrv_inc_in_flight() might make this > easier. So possibly we want to require it even in cases where > .drained_poll() or aio_disable_external() would be enough in theory. >=20 > > >=20 > > > > However, I hit on a problem that I think Emanuele and Paolo have al= ready > > > > pointed out: draining is GS & IO. This might have worked under the = 1 IOThread > > > > model but it does not make sense for multi-queue. It is possible to= submit I/O > > > > requests in drained sections. How can multiple threads be in draine= d sections > > > > simultaneously and possibly submit further I/O requests in their dr= ained > > > > sections? Those sections wouldn't be "drained" in any useful sense = of the word. > > >=20 > > > Drains asks other users not to touch the block node. Currently this o= nly > >=20 > > BTW interesting language choice here: you're using the more general > > definition of "other users" and "touch[ing] the block node" rather than > > the narrow definition of just "I/O requests". That's exactly how I think > > of drain but based on what you wrote above I'm not sure that's how you > > think of it too? >=20 > I hope that my reply above made it clearer: The broader definition is > what is needed if we want to use drain to replace the AioContext lock > for protecting the graph, but the narrow definition is what is > implemented today. Okay. > > > includes, but the desire to use drain for locking means that we need = to > > > extend it to pretty much any operation on the node, which would inclu= de > > > calling drain on that block node. So you should never have two callers > > > in a drain section at the same time, it would be a bug in this model. > >=20 > > Yes! Drain in its current form doesn't make sense for multi-queue since > > multiple threads could be in drained sections at the same time and they > > would all be allowed to submit new I/O requests. >=20 > Nobody would be allowed to submit new requests (because someone else has > drained the node), but they would do so anyway. ;-) >=20 > Actually, only half joking, because this shows how weak the protection > by drain is if we don't have a way to verify that the whole codebase > supports drain correctly. >=20 > > > Of course, we know that currently drain is not respected by all relev= ant > > > callers (most importantly, the monitor). If you want to use drain as a > > > form of locking, you need to solve this first. > > >=20 > > > > It is necessary to adjust how recursive drained sections work: > > > > bdrv_drained_begin() must not return while there are deeper nested = drained > > > > sections. > > > >=20 > > > > This is allowed: > > > >=20 > > > > Monitor command Block job > > > > --------------- --------- > > > > > bdrv_drained_begin() > > > > . > bdrv_drained_begin() > > > > . < bdrv_drained_begin() > > > > . . > > > > . . > > > > . > bdrv_drained_end() > > > > . < bdrv_drained_end() > > > > < bdrv_drained_begin() > > > > . > > > > . > > > > > bdrv_drained_end() > > > > < bdrv_drained_end() > > >=20 > > > Just to make sure I understand the scenario that you have in mind her= e: > > > The monitor command is not what causes the block job to do the draini= ng > > > because this is inside bdrv_drained_begin(), so the block job just > > > randomly got a callback that caused it to do so (maybe completion)? > >=20 > > Yes, exactly that completion scenario. When the mirror block job > > completes exactly when a monitor command calls bdrv_drained_begin(), > > mirror_exit_common() deletes a temporary filter BDS node. It involves > > drain and modifies the graph. > >=20 > > >=20 > > > Before bdrv_drained_begin() returns, anything is still allowed to run, > > > so I agree that this is valid. > >=20 > > Thanks for confirming! > >=20 > > > > This is not allowed: > > > >=20 > > > > Monitor command Block job > > > > --------------- --------- > > > > > bdrv_drained_begin() > > > > . > bdrv_drained_begin() > > > > . < bdrv_drained_begin() > > > > . . > > > > . . > > > > < bdrv_drained_begin() . > > > > . . > > > > . > bdrv_drained_end() > > > > . < bdrv_drained_end() > > > > > bdrv_drained_end() > > > > < bdrv_drained_end() > > > >=20 > > > > This restriction creates an ordering between bdrv_drained_begin() c= allers. In > > > > this example the block job must not depend on the monitor command c= ompleting > > > > first. Otherwise there would be a deadlock just like with two threa= ds wait for > > > > each other while holding a mutex. > > >=20 > > > So essentially drain needs to increase bs->in_flight, so that the out= er > > > drain has to wait for the inner drain section to end before its > > > bdrv_drained_begin() can return. > > >=20 > > > > For sanity I think draining should be GS-only. IO code should not i= nvoke > > > > bdrv_drained_begin() to avoid ordering problems when multiple threa= ds drain > > > > simultaneously and have dependencies on each other. > > > >=20 > > > > block/mirror.c needs to be modified because it currently drains fro= m IO when > > > > mirroring is about to end. > > > >=20 > > > > With this change to draining I think the logical argument for corre= ctness with > > > > graph modifications holds. > > >=20 > > > What is your suggestion how to modify mirror? It drains so that no new > > > requests can sneak in and source and target stay in sync while it > > > switches to the completion code running in the main AioContext. You > > > can't just drop this. > >=20 > > I haven't read the code in enough depth to come up with a solution. >=20 > So this sounds like it needs more thought before we can assume that > we'll have a final state where drain is GS-only. I will look into solutions for mirror but it might be a few days before I have time. > > >=20 > > > > Enforcing GS/IO separation at compile time > > > > ------------------------------------------ > > > > Right now GS/IO asserts check assumptions at runtime. The next step= may be > > > > using the type system to separate GS and IO APIs so it's impossible= for IO code > > > > to accidentally call GS code, for example. > > > >=20 > > > > typedef struct { > > > > BlockDriverState *bs; > > > > } BlockDriverStateIOPtr; > > > >=20 > > > > typedef struct { > > > > BlockDriverState *bs; > > > > } BlockDriverStateGSPtr; > > > >=20 > > > > Then APIs can be protected as follows: > > > >=20 > > > > void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...); > > > >=20 > > > > A function that only has a BlockDriverStateIOPtr cannot call > > > > bdrv_set_aio_context_ignore() by mistake since the compiler will co= mplain that > > > > the first argument must be a BlockDriverStateGSPtr. > > >=20 > > > And then you have a set of functions that cast from GS to IO pointers, > > > but not for the other way around? Or maybe getting as IO pointer would > > > even be coupled with automatically increasing bs->in_flight? > > >=20 > > > The other option that we were looking into for this was static analys= is. > > > I had hacked up a small script to check consistency of these annotati= ons > > > (without covering function pointers, though) to help me with the revi= ew > > > of Emanuele's patches that introduced them. If I understand correctly, > > > Paolo has scripts to do the same a little bit better. > > >=20 > > > As long as we can integrate such a script in 'make check', we wouldn't > > > necessarily have to have the churn in the C code to switch everything= to > > > new wrapper types. > >=20 > > Yes, that's the problem with the typedef Ptr approach. It would involve > > a lot of code changes. Maybe static analysis tools are better. > >=20 > > >=20 > > > > Possible steps for AioContext removal > > > > ------------------------------------- > > > > I also wanted to share my assumptions about multi-queue and AioCont= ext removal. > > > > Please let me know if anything seems wrong or questionable: > > > >=20 > > > > - IO code can execute in any thread that has an AioContext. > > > > - Multiple threads may execute a IO code at the same time. > > > > - GS code only execute under the BQL. > > > >=20 > > > > For AioContext removal this means: > > > >=20 > > > > - bdrv_get_aio_context() becomes mostly meaningless since there is = no need for > > > > a special "home" AioContext. > > > > - bdrv_coroutine_enter() becomes mostly meaningless because there i= s no need to > > > > run a coroutine in the BDS's AioContext. > > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works be= cause many > > > > threads/AioContexts may submit new I/O requests. BlockDevOps.drai= ned_begin() > > > > may be used instead (e.g. to temporarily disable ioeventfds on a = multi-queue > > > > virtio-blk device). > > > > - AIO_WAIT_WHILE() simplifies to > > > >=20 > > > > while ((cond)) { > > > > aio_poll(qemu_get_current_aio_context(), true); > > > > ... > > > > } > > >=20 > > > Probably not exactly, because you still need aio_wait_kick() to wake = up > > > the thread. We use AioWait.num_waiters !=3D 0 to decide whether we ne= ed to > > > schedule a BH in the main thread (because doing so unconditionally for > > > every completed request would be very wasteful). > > >=20 > > > If you want to be able to run this loop from any thread instead of ju= st > > > the main thread, you would have to store somewhere which thread to wa= ke. > >=20 > > qemu_cond_broadcast() can be used since the wakeup is a slow path. >=20 > I don't think we have a QemuCond for waking up a blocking aio_poll()? > Doesn't this usually involve writing to the event notifier file > descriptor of the specific AioContext? You're right, I forgot that the waiter is sitting in their own aio_poll() so we cannot use qemu_cond_wait(). Another mechanism would be necessary. > > > > and the distinction between home AioContext and non-home context = is > > > > eliminated. AioContext unlocking is dropped. > > > >=20 > > > > Does this make sense? I haven't seen these things in recent patch s= eries. > > >=20 > > > Intuitively I would agree with most. There may be details that I'm not > > > aware of at the moment. I'm not surprised that we haven't seen any su= ch > > > things in recent patch series because there is an awful lot of > > > preparational work to be done before we can reach this final state. > >=20 > > Yes. I'm mostly hoping to find out whether my view in fundamentally > > flawed or very different from what you and others think. >=20 > No, not fundamentally. >=20 > The big challenge in my mind is how to verify that the conditions are > actually met. I'm fairly sure that using drain this way is by far too > complex to rely on review only. I agree there needs to some kind of automated checker because relying on code review won't work. Stefan --FUa+Hh+FHcSL3+Kp Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQEzBAEBCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAmKLuhUACgkQnKSrs4Gr c8jVhAf/UfWaYY2VWw8J1VDovx4rkzSGOXAcWb/UR3nPd8eAzrjfl4oGBYIFMHXy 89aQ2Y23KOkY6MW6Nxg6RfECmQhlv/lNOGxrYEUvY+bO5cjcLC+/usq0ZsKxAIX7 7fGCiSpQ5dqkriJ5/TlMZJDbppsCPaBrM3txhneeWhyAk4z8sI0k14w3wgTxZ9Vz WHVO9/kypmfIoJvXQw3qNBY3rMK4ML/HZzLZDvia3Y7PDbJ395iMZdVnTL90Q6ke o8LjxiU8FCMxSZONk1eKCZ+RnaGWp5b4MDIyxh5dE8A1Xfnw8dV739AhCs0LYvfh kQW+8P6QDWGogH/nOdm+xFcygBmXBA== =ab4i -----END PGP SIGNATURE----- --FUa+Hh+FHcSL3+Kp--