On Mon, May 23, 2022 at 06:04:55PM +0200, Kevin Wolf wrote:
> Am 23.05.2022 um 17:13 hat Stefan Hajnoczi geschrieben:
> > On Mon, May 23, 2022 at 03:02:05PM +0200, Kevin Wolf wrote:
> > > Am 22.05.2022 um 17:06 hat Stefan Hajnoczi geschrieben:
> > > > Hi,
> > > > Sorry for the long email. I've included three topics that may help us discuss draining and AioContext removal further.
> > > >
> > > > The shortcomings of drain
> > > > -------------------------
> > > > I wanted to explore the logical argument that making graph modifications within a drained section is correct:
> > > > - Graph modifications and BB/BDS lookup are Global State (GS).
> > > > - Graph traversal from a BB/BDS pointer is IO.
> > > > - Only one thread executes GS code at a time.
> > > > - IO is quiesced within a drained section.
> > >
> > > I think you're mixing two things here and calling both of them IO.
> > >
> > > If a function is marked as IO, that means that it is safe to call from I/O requests, which may be running in any iothread (they currently only run in the home thread of the node's AioContext, but the function annotations already promise that any thread is safe).
> > >
> > > However, if a function is marked as IO, this doesn't necessarily mean that it is always only called in the context of an I/O request. It can be called by any other code, too.
> > >
> > > So while drain does quiesce all I/O requests, this doesn't mean that functions marked as IO won't run any more.
> >
> > My model is that between bdrv_drained_begin() and bdrv_drained_end() only the caller is allowed to invoke BB/BDS functions (including functions marked IO).
>
> Okay, this sounds better. It's not actually related to IO/GS, but to BB/BDS functions, including both IO and GS functions.
>
> So graph traversal from a BB/BDS pointer makes it a BB/BDS function, and BB/BDS functions need to be quiesced within a drained section.
>
> > The caller isn't strictly one thread and one or no coroutines. The caller could use threads and coroutines, but the important thing is that everything else that accesses the BB/BDS is suspended due to .drained_begin() callbacks (and is_external).
> >
> > So when you say a function marked IO can be called from outside an I/O request, that "outside an I/O request" code must be quiesced too. Otherwise drain is definitely not safe for graph modifications.
>
> If you phrase it as a condition and as a goal to achieve, then I agree that this is required when you want to use drain for locking.
>
> My impression was that you were using this to argue that the code is already doing this and is already safe in this scenario, and this isn't true. I think I misunderstood you there and you were really describing the future state that you're envisioning.

Sorry, I didn't present it clearly. I'm trying to think through the model needed to make graph modifications under drain safe.
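To make the intended pattern concrete, this is roughly what I want to be able to argue is safe. bdrv_replace_node() stands in for any graph modification; bs, from, and to are placeholders:

  /* GS code, running under the BQL */
  bdrv_drained_begin(bs);
  /* All other BB/BDS users are quiesced at this point, so the graph
   * can be modified without racing against graph traversal in IO code.
   */
  bdrv_replace_node(from, to, &error_abort);
  bdrv_drained_end(bs);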
> > > > - Therefore a drained section in GS code suspends graph traversal, other graph modifications, and BB/BDS lookup.
> > > > - Therefore it is safe to modify the graph from a GS drained section.
> > >
> > > So unfortunately, I don't think this conclusion is correct.
> > >
> > > In order to make your assumption true, we would have to require that all callers of IO functions must have called bdrv_inc_in_flight(). We would also have to find a way to enforce this, preferably at compile time or with static analysis, or at least with runtime assertions, because it would be very easy to get wrong.
> >
> > Or that they are quiesced by .drained_begin() callbacks or is_external.
> >
> > Do you have a concrete example?
>
> Yes, you're right, bdrv_inc_in_flight() is only one way to achieve this. They just need to make bdrv_drain_poll() return true as long as they are active so that bdrv_drained_begin() waits for them. .drained_poll() is another valid way to achieve this.
>
> However, if we want to rely on static analysis to verify that everything is covered, always requiring bdrv_inc_in_flight() might make this easier. So possibly we want to require it even in cases where .drained_poll() or aio_disable_external() would be enough in theory.
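Agreed. Requiring the same bracket everywhere would make the rule mechanically checkable. Something along these lines (schematic, not a specific API proposal):

  /* IO code: make this activity visible to bdrv_drain_poll() so that
   * bdrv_drained_begin() waits for it to finish
   */
  bdrv_inc_in_flight(bs);
  /* ... access the node: submit requests, traverse the graph, ... */
  bdrv_dec_in_flight(bs);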
> > > > However, I hit on a problem that I think Emanuele and Paolo have already pointed out: draining is GS & IO. This might have worked under the 1 IOThread model but it does not make sense for multi-queue. It is possible to submit I/O requests in drained sections. How can multiple threads be in drained sections simultaneously and possibly submit further I/O requests in their drained sections? Those sections wouldn't be "drained" in any useful sense of the word.
> > >
> > > Drain asks other users not to touch the block node. Currently this only
> >
> > BTW interesting language choice here: you're using the more general definition of "other users" and "touch[ing] the block node" rather than the narrow definition of just "I/O requests". That's exactly how I think of drain, but based on what you wrote above I'm not sure that's how you think of it too?
>
> I hope that my reply above made it clearer: The broader definition is what is needed if we want to use drain to replace the AioContext lock for protecting the graph, but the narrow definition is what is implemented today.

Okay.

> > > includes, but the desire to use drain for locking means that we need to extend it to pretty much any operation on the node, which would include calling drain on that block node. So you should never have two callers in a drain section at the same time, it would be a bug in this model.
> >
> > Yes! Drain in its current form doesn't make sense for multi-queue since multiple threads could be in drained sections at the same time and they would all be allowed to submit new I/O requests.
>
> Nobody would be allowed to submit new requests (because someone else has drained the node), but they would do so anyway. ;-)
>
> Actually, only half joking, because this shows how weak the protection by drain is if we don't have a way to verify that the whole codebase supports drain correctly.

> > > Of course, we know that currently drain is not respected by all relevant callers (most importantly, the monitor). If you want to use drain as a form of locking, you need to solve this first.
> > >
> > > > It is necessary to adjust how recursive drained sections work: bdrv_drained_begin() must not return while there are deeper nested drained sections.
> > > >
> > > > This is allowed:
> > > >
> > > >   Monitor command              Block job
> > > >   ---------------              ---------
> > > >   > bdrv_drained_begin()
> > > >   .                            > bdrv_drained_begin()
> > > >   .                            < bdrv_drained_begin()
> > > >   .                            .
> > > >   .                            .
> > > >   .                            > bdrv_drained_end()
> > > >   .                            < bdrv_drained_end()
> > > >   < bdrv_drained_begin()
> > > >   .
> > > >   .
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > >
> > > Just to make sure I understand the scenario that you have in mind here: The monitor command is not what causes the block job to do the draining, because this is inside bdrv_drained_begin(), so the block job just randomly got a callback that caused it to do so (maybe completion)?
> >
> > Yes, exactly that completion scenario. When the mirror block job completes exactly when a monitor command calls bdrv_drained_begin(), mirror_exit_common() deletes a temporary filter BDS node. It involves drain and modifies the graph.
>
> > > Before bdrv_drained_begin() returns, anything is still allowed to run, so I agree that this is valid.
> >
> > Thanks for confirming!
> >
> > > > This is not allowed:
> > > >
> > > >   Monitor command              Block job
> > > >   ---------------              ---------
> > > >   > bdrv_drained_begin()
> > > >   .                            > bdrv_drained_begin()
> > > >   .                            < bdrv_drained_begin()
> > > >   .                            .
> > > >   .                            .
> > > >   < bdrv_drained_begin()       .
> > > >   .                            .
> > > >   .                            > bdrv_drained_end()
> > > >   .                            < bdrv_drained_end()
> > > >   > bdrv_drained_end()
> > > >   < bdrv_drained_end()
> > > >
> > > > This restriction creates an ordering between bdrv_drained_begin() callers. In this example the block job must not depend on the monitor command completing first. Otherwise there would be a deadlock, just like with two threads waiting for each other while holding a mutex.
> > >
> > > So essentially drain needs to increase bs->in_flight, so that the outer drain has to wait for the inner drain section to end before its bdrv_drained_begin() can return.
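Yes. As a rough sketch (glossing over the detail that the drain's own polling loop must not wait for the in_flight reference it took itself):

  void bdrv_drained_begin(BlockDriverState *bs)
  {
      /* make this drained section visible to concurrent/outer drains */
      bdrv_inc_in_flight(bs);
      /* existing logic: invoke .drained_begin() callbacks, then poll
       * until bdrv_drain_poll() reports that the node has quiesced
       */
  }

  void bdrv_drained_end(BlockDriverState *bs)
  {
      /* existing logic: invoke .drained_end() callbacks */
      bdrv_dec_in_flight(bs); /* let outer drains make progress */
  }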
> > > > For sanity I think draining should be GS-only. IO code should not invoke bdrv_drained_begin() to avoid ordering problems when multiple threads drain simultaneously and have dependencies on each other.
> > > >
> > > > block/mirror.c needs to be modified because it currently drains from IO when mirroring is about to end.
> > > >
> > > > With this change to draining I think the logical argument for correctness with graph modifications holds.
> > >
> > > What is your suggestion for how to modify mirror? It drains so that no new requests can sneak in and source and target stay in sync while it switches to the completion code running in the main AioContext. You can't just drop this.
> >
> > I haven't read the code in enough depth to come up with a solution.
>
> So this sounds like it needs more thought before we can assume that we'll have a final state where drain is GS-only.

I will look into solutions for mirror but it might be a few days before I have time.

> > > > Enforcing GS/IO separation at compile time
> > > > ------------------------------------------
> > > > Right now GS/IO asserts check assumptions at runtime. The next step may be using the type system to separate GS and IO APIs so it's impossible for IO code to accidentally call GS code, for example:
> > > >
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateIOPtr;
> > > >
> > > >   typedef struct {
> > > >       BlockDriverState *bs;
> > > >   } BlockDriverStateGSPtr;
> > > >
> > > > Then APIs can be protected as follows:
> > > >
> > > >   void bdrv_set_aio_context_ignore(BlockDriverStateGSPtr bs, ...);
> > > >
> > > > A function that only has a BlockDriverStateIOPtr cannot call bdrv_set_aio_context_ignore() by mistake since the compiler will complain that the first argument must be a BlockDriverStateGSPtr.
> > >
> > > And then you have a set of functions that cast from GS to IO pointers, but not for the other way around? Or maybe getting an IO pointer would even be coupled with automatically increasing bs->in_flight?
> > >
> > > The other option that we were looking into for this was static analysis. I had hacked up a small script to check consistency of these annotations (without covering function pointers, though) to help me with the review of Emanuele's patches that introduced them. If I understand correctly, Paolo has scripts to do the same a little bit better.
> > >
> > > As long as we can integrate such a script in 'make check', we wouldn't necessarily have to have the churn in the C code to switch everything to new wrapper types.
> >
> > Yes, that's the problem with the typedef Ptr approach. It would involve a lot of code changes. Maybe static analysis tools are better.
> >
> > > > Possible steps for AioContext removal
> > > > -------------------------------------
> > > > I also wanted to share my assumptions about multi-queue and AioContext removal. Please let me know if anything seems wrong or questionable:
> > > >
> > > > - IO code can execute in any thread that has an AioContext.
> > > > - Multiple threads may execute IO code at the same time.
> > > > - GS code only executes under the BQL.
> > > >
> > > > For AioContext removal this means:
> > > >
> > > > - bdrv_get_aio_context() becomes mostly meaningless since there is no need for a special "home" AioContext.
> > > > - bdrv_coroutine_enter() becomes mostly meaningless because there is no need to run a coroutine in the BDS's AioContext.
> > > > - aio_disable_external(bdrv_get_aio_context(bs)) no longer works because many threads/AioContexts may submit new I/O requests. BlockDevOps.drained_begin() may be used instead (e.g. to temporarily disable ioeventfds on a multi-queue virtio-blk device).
> > > > - AIO_WAIT_WHILE() simplifies to
> > > >
> > > >   while ((cond)) {
> > > >       aio_poll(qemu_get_current_aio_context(), true);
> > > >       ...
> > > >   }
> > >
> > > Probably not exactly, because you still need aio_wait_kick() to wake up the thread. We use AioWait.num_waiters != 0 to decide whether we need to schedule a BH in the main thread (because doing so unconditionally for every completed request would be very wasteful).
> > >
> > > If you want to be able to run this loop from any thread instead of just the main thread, you would have to store somewhere which thread to wake.
> >
> > qemu_cond_broadcast() can be used since the wakeup is a slow path.
>
> I don't think we have a QemuCond for waking up a blocking aio_poll()? Doesn't this usually involve writing to the event notifier file descriptor of the specific AioContext?

You're right, I forgot that the waiter is sitting in their own aio_poll(), so we cannot use qemu_cond_wait(). Another mechanism would be necessary.
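One possibility might be a registry of waiters, so request completion knows which event loops to kick. All names below are invented for illustration; none of this exists today:

  /* Hypothetical: each AIO_WAIT_WHILE() caller registers the
   * AioContext it is polling in, and completions kick all waiters.
   * qemu_mutex_init(&waiter_lock) would happen during startup.
   */
  typedef struct AioWaiter {
      AioContext *ctx;              /* the blocking aio_poll() to wake */
      QLIST_ENTRY(AioWaiter) next;
  } AioWaiter;

  static QemuMutex waiter_lock;
  static QLIST_HEAD(, AioWaiter) waiters = QLIST_HEAD_INITIALIZER(waiters);

  static void wakeup_bh(void *opaque)
  {
      /* no-op: scheduling this BH is enough to make aio_poll() return */
  }

  static void aio_wait_kick_all(void)
  {
      AioWaiter *w;

      qemu_mutex_lock(&waiter_lock);
      QLIST_FOREACH(w, &waiters, next) {
          aio_bh_schedule_oneshot(w->ctx, wakeup_bh, NULL);
      }
      qemu_mutex_unlock(&waiter_lock);
  }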
> > > > and the distinction between home AioContext and non-home context is eliminated. AioContext unlocking is dropped.
> > > >
> > > > Does this make sense? I haven't seen these things in recent patch series.
> > >
> > > Intuitively I would agree with most. There may be details that I'm not aware of at the moment. I'm not surprised that we haven't seen any such things in recent patch series because there is an awful lot of preparatory work to be done before we can reach this final state.
> >
> > Yes. I'm mostly hoping to find out whether my view is fundamentally flawed or very different from what you and others think.
>
> No, not fundamentally.
>
> The big challenge in my mind is how to verify that the conditions are actually met. I'm fairly sure that using drain this way is by far too complex to rely on review only.

I agree there needs to be some kind of automated checker because relying on code review won't work.

Stefan