On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
> >> On Sep 9, 2021, at 10:25 PM, John Johnson wrote:
> >>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi wrote:
> >>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >>>> 	I did look at coroutines, but they seemed to work when the sender
> >>>> is triggering the coroutine on send, not when request packets are arriving
> >>>> asynchronously to the sends.
> >>>
> >>> This can be done with a receiver coroutine. Its job is to be the only
> >>> thing that reads vfio-user messages from the socket. A receiver
> >>> coroutine reads messages from the socket and wakes up the waiting
> >>> coroutine that yielded from vfio_user_send_recv() or
> >>> vfio_user_pci_process_req().
> >>>
> >>> (Although vfio_user_pci_process_req() could be called directly from the
> >>> receiver coroutine, it seems safer to have a separate coroutine that
> >>> processes requests so that the receiver isn't blocked in case
> >>> vfio_user_pci_process_req() yields while processing a request.)
> >>>
> >>> Going back to what you mentioned above, the receiver coroutine does
> >>> something like this:
> >>>
> >>>   if it's a reply
> >>>       reply = find_reply(...)
> >>>       qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>>   else
> >>>       QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>       if (pending_reqs_was_empty) {
> >>>           qemu_coroutine_enter(process_request_co);
> >>>       }
> >>>
> >>> The pending_reqs queue holds incoming requests that the
> >>> process_request_co coroutine processes.
> >>>
> >>
> >> 	How do coroutines work across threads? There can be multiple vCPU
> >> threads waiting for replies, and I think the receiver coroutine will be
> >> running in the main loop thread. Where would a vCPU block waiting for
> >> a reply?
> >> I think coroutine_yield() returns to its coroutine_enter() caller
> >
> > A vCPU thread holding the BQL can iterate the event loop if it has
> > reached a synchronous point that needs to wait for a reply before
> > returning. I think we have this situation when a MemoryRegion is
> > accessed on the proxy device.
> >
> > For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> > then runs the event loop until the coroutine finishes:
> >
> >   Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >   bdrv_coroutine_enter(blk_bs(blk), co);
> >   BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> >
> > BDRV_POLL_WHILE() boils down to a loop like this:
> >
> >   while ((cond)) {
> >       aio_poll(ctx, true);
> >   }
> >
> 	I think that would make vCPUs sending requests and the
> receiver coroutine all poll on the same socket. If the “wrong”
> routine reads the message, I’d need a second level of synchronization
> to pass the message to the “right” one. e.g., if the vCPU coroutine
> reads a request, it needs to pass it to the receiver; if the receiver
> coroutine reads a reply, it needs to pass it to a vCPU.
>
> 	Avoiding this complexity is one of the reasons I went with
> a separate thread that only reads the socket over the mp-qemu model,
> which does have the sender poll, but doesn’t need to handle incoming
> requests.

Only one coroutine reads from the socket, the "receiver" coroutine. In a
previous reply I sketched what the receiver does:

  if it's a reply
      reply = find_reply(...)
      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
  else
      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
      if (pending_reqs_was_empty) {
          qemu_coroutine_enter(process_request_co);
      }

The qemu_coroutine_enter(reply->co) call re-enters the coroutine that was
created by the vCPU thread. Is this the "second level of synchronization"
that you described? It's very similar to signalling reply->cv in the
existing patch.
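To make the dispatch above concrete, here is a compilable model of the
receiver's reply-vs-request branching. It is only an illustration, not
QEMU code: the Msg/Receiver types, the hand-rolled queue, and wake_reply()
(standing in for qemu_coroutine_enter(reply->co)) are all invented for
this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the real vfio-user message and state types. */
typedef struct Msg {
    int id;
    bool is_reply;
    struct Msg *next;
} Msg;

typedef struct {
    Msg *pending_head, *pending_tail;   /* models the pending_reqs queue */
    int woken_reply_id;                 /* records which reply was "entered" */
    bool process_request_entered;       /* was process_request_co kicked? */
} Receiver;

static void wake_reply(Receiver *r, Msg *reply)
{
    /* stands in for qemu_coroutine_enter(reply->co) */
    r->woken_reply_id = reply->id;
}

/* The receiver's per-message dispatch: wake the waiter for a reply,
 * otherwise queue the request and kick the processing coroutine only
 * when the queue transitions from empty to non-empty. */
static void receiver_dispatch(Receiver *r, Msg *m)
{
    if (m->is_reply) {
        wake_reply(r, m);
    } else {
        bool was_empty = (r->pending_head == NULL);

        m->next = NULL;                 /* QSIMPLEQ_INSERT_TAIL(...) */
        if (was_empty) {
            r->pending_head = r->pending_tail = m;
            r->process_request_entered = true;  /* enter process_request_co */
        } else {
            r->pending_tail->next = m;
            r->pending_tail = m;
        }
    }
}
```

Note the "was_empty" check: the processing coroutine is entered only on the
empty-to-non-empty transition, since it drains the whole queue once running.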
Now I'm actually thinking about whether this can be improved by keeping
the condvar so that the vCPU thread doesn't need to call aio_poll() (which
is awkward because it doesn't drop the BQL and therefore blocks other
vCPUs from making progress). That approach wouldn't require a dedicated
thread for vfio-user.

> > I also want to check that I understand the scenarios in which the
> > vfio-user communication code is used:
> >
> > 1. vhost-user-server
> >
> > The vfio-user communication code should run in a given AioContext (it
> > will be the main loop by default but maybe the user will be able to
> > configure a specific IOThread in the future).
> >
> 	Jag would know more, but I believe it runs off the main loop.
> Running it in an iothread doesn’t gain much, since it needs BQL to
> run the device emulation code.
>
> > 2. vCPU thread vfio-user clients
> >
> > The vfio-user communication code is called from the vCPU thread where
> > the proxy device executes. The MemoryRegion->read()/write() callbacks
> > are synchronous, so the thread needs to wait for a vfio-user reply
> > before it can return.
> >
> > Is this what you had in mind?
> >
> 	The client is also called from the main thread - the GET_*
> messages from vfio_user_pci_realize() as well as MAP/DEMAP messages
> from guest address space change transactions. It is also called by
> the migration thread, which is a separate thread that does not run
> holding BQL.

Thanks for mentioning those additional cases.

Stefan
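P.S. A minimal model of the condvar idea above, in plain pthreads rather
than QEMU's QemuCond/QemuMutex: the vCPU thread sleeps on a per-reply
condition variable instead of iterating the event loop, and the receiver
signals it when the reply arrives. The Reply type, receiver_thread(), and
the hard-coded result are all illustrative; in QEMU the waiter would also
have to drop the BQL around the wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative per-reply wait state; models reply->cv in the patch. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    bool complete;
    int result;
} Reply;

/* Models the dedicated receiver: completes the reply and signals it. */
static void *receiver_thread(void *opaque)
{
    Reply *reply = opaque;

    pthread_mutex_lock(&reply->lock);
    reply->result = 42;              /* pretend a reply was read off the socket */
    reply->complete = true;
    pthread_cond_signal(&reply->cv);
    pthread_mutex_unlock(&reply->lock);
    return NULL;
}

/* Models the vCPU thread: blocks on the condvar, no aio_poll() needed. */
static int wait_for_reply(Reply *reply)
{
    pthread_t tid;

    pthread_create(&tid, NULL, receiver_thread, reply);
    pthread_mutex_lock(&reply->lock);
    while (!reply->complete) {       /* predicate checked under the lock,
                                        so a wakeup can never be lost */
        pthread_cond_wait(&reply->cv, &reply->lock);
    }
    pthread_mutex_unlock(&reply->lock);
    pthread_join(tid, NULL);
    return reply->result;
}
```

The while-loop around pthread_cond_wait() is the standard guard against
spurious wakeups, and checking the predicate under the lock is what makes
this safe even if the receiver signals before the waiter starts waiting.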