On Mon, May 23, 2022 at 06:04:55PM +0200, Kevin Wolf wrote:
> Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben:
> > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > > > Hi,
> > > > Sorry for the long email. I've included three topics that may help us discuss draining and AioContext removal further.
> > > >
> > > > The shortcomings of drain
> > > > -------------------------
> > > > I wanted to explore the logical argument that making graph modifications within a drained section is correct:
> > > > - Graph modifications and BB/BDS lookup are Global State (GS).
> > > > - Graph traversal from a BB/BDS pointer is IO.
> > > > - Only one thread executes GS code at a time.
> > > > - IO is quiesced within a drained section.
> > >
> > > I think you're mixing two things here and calling both of them IO.
> > >
> > > If a function is marked as IO, that means that it is safe to call from I/O requests, which may be running in any iothread (they currently only run in the home thread of the node's AioContext, but the function annotations already promise that any thread is safe).
> > >
> > > However, if a function is marked as IO, this doesn't necessarily mean that it is always only called in the context of an I/O request. It can be called by any other code, too.
> > >
> > > So while drain does quiesce all I/O requests, this doesn't mean that functions marked as IO won't run any more.
> >
> > My model is that between bdrv_drained_begin() and bdrv_drained_end() only the caller is allowed to invoke BB/BDS functions (including functions marked IO).
>
> Okay, this sounds better. It's not actually related to IO/GS, but to BB/BDS functions, including both IO and GS functions.
>
> So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and BB/BDS functions need to be quiesced within a drained section.
>
> > The caller isn't strictly one thread and one or no coroutines. The caller could use threads and coroutines, but the important thing is that everything else that accesses the BB/BDS is suspended due to .drained_begin() callbacks (and is_external).
> >
> > So when you say a function marked IO can be called from outside an I/O request, that "outside an I/O request" code must be quiesced too. Otherwise drain is definitely not safe for graph modifications.
>
> If you phrase it as a condition and as a goal to achieve, then I agree that this is required when you want to use drain for locking.
>
> My impression was that you were using this to argue that the code is already doing this and is already safe in this scenario, and this isn't true. I think I misunderstood you there and you were really describing the future state that you're envisioning.

Sorry, I didn't present it clearly. I'm trying to think through the model needed to make graph modifications under drain safe.
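To make the intended pattern concrete, this is roughly what I want to be able to argue is safe. bdrv_replace_node() stands in for any graph modification; bs, from, and to are placeholders:

  /* GS code, running under the BQL */
  bdrv_drained_begin(bs);
  /* All other BB/BDS users are quiesced at this point, so the graph
   * can be modified without racing against graph traversal in IO code.
   */
  bdrv_replace_node(from, to, &error_abort);
  bdrv_drained_end(bs);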
> > > > - Therefore a drained section in GS code suspends graph traversal, other graph modifications, and BB/BDS lookup.
> > > > - Therefore it is safe to modify the graph from a GS drained section.
> > >
> > > So unfortunately, I don't think this conclusion is correct.
> > >
> > > In order to make your assumption true, we would have to require that all callers of IO functions must have called bdrv_inc_in_flight(). We would also have to find a way to enforce this, preferably at compile time or with static analysis, or at least with runtime assertions, because it would be very easy to get wrong.
> >
> > Or that they are quiesced by .drained_begin() callbacks or is_external.
> >
> > Do you have a concrete example?
>
> Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this. They just need to make bdrv_drain_poll() return true as long as they are active so that bdrv_drained_begin() waits for them. .drained_poll() is another valid way to achieve this.
>
> However, if we want to rely on static analysis to verify that everything is covered, always requiring bdrv_inc_in_flight() might make this easier. So possibly we want to require it even in cases where .drained_poll() or aio_disable_external() would be enough in theory.
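Agreed. Requiring the same bracket everywhere would make the rule mechanically checkable. Something along these lines (schematic, not a specific API proposal):

  /* IO code: make this activity visible to bdrv_drain_poll() so that
   * bdrv_drained_begin() waits for it to finish
   */
  bdrv_inc_in_flight(bs);
  /* ... access the node: submit requests, traverse the graph, ... */
  bdrv_dec_in_flight(bs);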
> > > > However, I hit on a problem that I think Emanuele and Paolo have already pointed out: draining is GS & IO. This might have worked under the 1 IOThread model but it does not make sense for multi-queue. It is possible to submit I/O requests in drained sections. How can multiple threads be in drained sections simultaneously and possibly submit further I/O requests in their drained sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > >
> > > Drain asks other users not to touch the block node. Currently this only
> >
> > BTW interesting language choice here: you're using the more general definition of "other users" and "touch[ing] the block node" rather than the narrow definition of just "I/O requests". That's exactly how I think of drain, but based on what you wrote above I'm not sure that's how you think of it too?
>
> I hope that my reply above made it clearer: The broader definition is what is needed if we want to use drain to replace the AioContext lock for protecting the graph, but the narrow definition is what is implemented today.

Okay.

> > > includes, but the desire to use drain for locking means that we need to extend it to pretty much any operation on the node, which would include calling drain on that block node. So you should never have two callers in a drain section at the same time, it would be a bug in this model.
> >
> > Yes! Drain in its current form doesn't make sense for multi-queue since multiple threads could be in drained sections at the same time and they would all be allowed to submit new I/O requests.
>
> Nobody would be allowed to submit new requests (because someone else has drained the node), but they would do so anyway. ;-)
>
> Actually, only half joking, because this shows how weak the protection by drain is if we don't have a way to verify that the whole codebase supports drain correctly.

> > > Of course, we know that currently drain is not respected by all relevant callers (most importantly, the monitor). If you want to use drain as a form of locking, you need to solve this first.
> > >
> > > > It is necessary to adjust how recursive drained sections work: bdrv_drained_begin() must not return while there are deeper nested drained sections.
> > > >
> > > > This is allowed:
> > > >
> > > >   Monitor command              Block job
> > > >   ---------------              ---------
> > > >   > bdrv_drained_begin()
> > > >   .                            > bdrv_drained_begin()
> > > >   .                            < bdrv_drained_begin()
> > > >   .                            .
> > > >   .                            .
> > > >   .                            > bdrv_drained_end()
> > > >   .                            < bdrv_drained_end()
> > > >   < bdrv_drained_begin()
> > > >   .
> > > >   .
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > >
> > > Just to make sure I understand the scenario that you have in mind here: The monitor command is not what causes the block job to do the draining, because this is inside bdrv_drained_begin(), so the block job just randomly got a callback that caused it to do so (maybe completion)?
> >
> > Yes, exactly that completion scenario. When the mirror block job completes exactly when a monitor command calls bdrv_drained_begin(), mirror_exit_common() deletes a temporary filter BDS node. It involves drain and modifies the graph.
>
> > > Before bdrv_drained_begin() returns, anything is still allowed to run, so I agree that this is valid.
> >
> > Thanks for confirming!
> >
> > > > This is not allowed:
> > > >
> > > >   Monitor command              Block job
> > > >   ---------------              ---------
> > > >   > bdrv_drained_begin()
> > > >   .                            > bdrv_drained_begin()
> > > >   .                            < bdrv_drained_begin()
> > > >   .                            .
> > > >   .                            .
> > > >   < bdrv_drained_begin()       .
> > > >   .                            .
> > > >   .                            > bdrv_drained_end()
> > > >   .                            < bdrv_drained_end()
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > > >
> > > > This restriction creates an ordering between bdrv_drained_begin() callers. In this example the block job must not depend on the monitor command completing first. Otherwise there would be a deadlock, just like with two threads waiting for each other while holding a mutex.
> > >
> > > So essentially drain needs to increase bs->in_flight, so that the outer drain has to wait for the inner drain section to end before its bdrv_drained_begin() can return.
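Yes. As a rough sketch (glossing over the detail that the drain's own polling loop must not wait for the in_flight reference it took itself):

  void bdrv_drained_begin(BlockDriverState *bs)
  {
      /* make this drained section visible to concurrent/outer drains */
      bdrv_inc_in_flight(bs);
      /* existing logic: invoke .drained_begin() callbacks, then poll
       * until bdrv_drain_poll() reports that the node has quiesced
       */
  }

  void bdrv_drained_end(BlockDriverState *bs)
  {
      /* existing logic: invoke .drained_end() callbacks */
      bdrv_dec_in_flight(bs); /* let outer drains make progress */
  }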
> > > > For sanity I think draining should be GS-only. IO code should not invoke bdrv_drained_begin() to avoid ordering problems when multiple threads drain simultaneously and have dependencies on each other.
> > > >
> > > > block/mirror.c needs to be modified because it currently drains from IO when mirroring is about to end.
> > > >
> > > > With this change to draining I think the logical argument for correctness with graph modifications holds.
> > >
> > > What is your suggestion for how to modify mirror? It drains so that no new requests can sneak in and source and target stay in sync while it switches to the completion code running in the main AioContext. You can't just drop this.
> >
> > I haven't read the code in enough depth to come up with a solution.
>
> So this sounds like it needs more thought before we can assume that we'll have a final state where drain is GS-only.

I will look into solutions for mirror but it might be a few days before I have time.

> > > > Enforcing GS/IO separation at compile time
> > > > ------------------------------------------
> > > > Right now GS/IO asserts check assumptions at runtime. The next step may be using the type system to separate GS and IO APIs so it's impossible for IO code to accidentally call GS code, for example:
> > > >
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateIOPtr;
> > > >
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateGSPtr;
> > > >
> > > > Then APIs can be protected as follows:
> > > >
> > > >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > > >
> > > > A function that only has a BlockDriverStateIOPtr cannot call bdrv_set_aio_context_ignore() by mistake since the compiler will complain that the first argument must be a BlockDriverStateGSPtr.
> > >
> > > And then you have a set of functions that cast from GS to IO pointers, but not for the other way around? Or maybe getting an IO pointer would even be coupled with automatically increasing bs->in_flight?
> > >
> > > The other option that we were looking into for this was static analysis. I had hacked up a small script to check consistency of these annotations (without covering function pointers, though) to help me with the review of Emanuele's patches that introduced them. If I understand correctly, Paolo has scripts to do the same a little bit better.
> > >
> > > As long as we can integrate such a script in 'make check', we wouldn't necessarily have to have the churn in the C code to switch everything to new wrapper types.
> >
> > Yes, that's the problem with the typedef Ptr approach. It would involve a lot of code changes. Maybe static analysis tools are better.
> >
> > > > Possible steps for AioContext removal
> > > > -------------------------------------
> > > > I also wanted to share my assumptions about multi-queue and AioContext removal. Please let me know if anything seems wrong or questionable:
> > > >
> > > > - IO code can execute in any thread that has an AioContext.
> > > > - Multiple threads may execute IO code at the same time.
> > > > - GS code only executes under the BQL.
> > > >
> > > > For AioContext removal this means:
> > > >
> > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for a special "home" AioContext.
> > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to run a coroutine in the BDS's AioContext.
> > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue virtio-blk device).
> > > > - AIO_WAIT_WHILE() simplifies to
> > > >
> > > >   while ((cond)) {
> > > >       aio_poll(qemu_get_current_aio_context(), true);
> > > >       ...
> > > >   }
> > >
> > > Probably not exactly, because you still need aio_wait_kick() to wake up the thread. We use AioWait.num_waiters != 0 to decide whether we need to schedule a BH in the main thread (because doing so unconditionally for every completed request would be very wasteful).
> > >
> > > If you want to be able to run this loop from any thread instead of just the main thread, you would have to store somewhere which thread to wake.
> >
> > qemu_cond_broadcast() can be used since the wakeup is a slow path.
>
> I don't think we have a QemuCond for waking up a blocking aio_poll()? Doesn't this usually involve writing to the event notifier file descriptor of the specific AioContext?

You're right, I forgot that the waiter is sitting in their own aio_poll(), so we cannot use qemu_cond_wait(). Another mechanism would be necessary.
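One possibility might be a registry of waiters, so request completion knows which event loops to kick. All names below are invented for illustration; none of this exists today:

  /* Hypothetical: each AIO_WAIT_WHILE() caller registers the
   * AioContext it is polling in, and completions kick all waiters.
   * qemu_mutex_init(&waiter_lock) would happen during startup.
   */
  typedef struct AioWaiter {
      AioContext *ctx;              /* the blocking aio_poll() to wake */
      QLIST_ENTRY(AioWaiter) next;
  } AioWaiter;

  static QemuMutex waiter_lock;
  static QLIST_HEAD(, AioWaiter) waiters = QLIST_HEAD_INITIALIZER(waiters);

  static void wakeup_bh(void *opaque)
  {
      /* no-op: scheduling this BH is enough to make aio_poll() return */
  }

  static void aio_wait_kick_all(void)
  {
      AioWaiter *w;

      qemu_mutex_lock(&waiter_lock);
      QLIST_FOREACH(w, &waiters, next) {
          aio_bh_schedule_oneshot(w->ctx, wakeup_bh, NULL);
      }
      qemu_mutex_unlock(&waiter_lock);
  }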
> > > > and the distinction between home AioContext and non-home context is eliminated. AioContext unlocking is dropped.
> > > >
> > > > Does this make sense? I haven't seen these things in recent patch series.
> > >
> > > Intuitively I would agree with most. There may be details that I'm not aware of at the moment. I'm not surprised that we haven't seen any such things in recent patch series because there is an awful lot of preparatory work to be done before we can reach this final state.
> >
> > Yes. I'm mostly hoping to find out whether my view is fundamentally flawed or very different from what you and others think.
>
> No, not fundamentally.
>
> The big challenge in my mind is how to verify that the conditions are actually met. I'm fairly sure that using drain this way is by far too complex to rely on review only.

I agree there needs to be some kind of automated checker because relying on code review won't work.

Stefan