On Mon, Sep 13, 2021 at 05:23:33PM +0000, John Johnson wrote:
> >> On Sep 9, 2021, at 10:25 PM, John Johnson wrote:
> >>> On Sep 8, 2021, at 11:29 PM, Stefan Hajnoczi wrote:
> >>> On Thu, Sep 09, 2021 at 05:11:49AM +0000, John Johnson wrote:
> >>>> 	I did look at coroutines, but they seemed to work when the sender
> >>>> is triggering the coroutine on send, not when request packets are arriving
> >>>> asynchronously to the sends.
> >>>
> >>> This can be done with a receiver coroutine. Its job is to be the only
> >>> thing that reads vfio-user messages from the socket. A receiver
> >>> coroutine reads messages from the socket and wakes up the waiting
> >>> coroutine that yielded from vfio_user_send_recv() or
> >>> vfio_user_pci_process_req().
> >>>
> >>> (Although vfio_user_pci_process_req() could be called directly from the
> >>> receiver coroutine, it seems safer to have a separate coroutine that
> >>> processes requests so that the receiver isn't blocked in case
> >>> vfio_user_pci_process_req() yields while processing a request.)
> >>>
> >>> Going back to what you mentioned above, the receiver coroutine does
> >>> something like this:
> >>>
> >>>   if it's a reply
> >>>       reply = find_reply(...)
> >>>       qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
> >>>   else
> >>>       QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
> >>>       if (pending_reqs_was_empty) {
> >>>           qemu_coroutine_enter(process_request_co);
> >>>       }
> >>>
> >>> The pending_reqs queue holds incoming requests that the
> >>> process_request_co coroutine processes.
> >>>
> >>
> >> 	How do coroutines work across threads? There can be multiple vCPU
> >> threads waiting for replies, and I think the receiver coroutine will be
> >> running in the main loop thread. Where would a vCPU block waiting for
> >> a reply?
> >> I think coroutine_yield() returns to its coroutine_enter() caller
> >
> > A vCPU thread holding the BQL can iterate the event loop if it has
> > reached a synchronous point that needs to wait for a reply before
> > returning. I think we have this situation when a MemoryRegion is
> > accessed on the proxy device.
> >
> > For example, block/block-backend.c:blk_prw() kicks off a coroutine and
> > then runs the event loop until the coroutine finishes:
> >
> >   Coroutine *co = qemu_coroutine_create(co_entry, &rwco);
> >   bdrv_coroutine_enter(blk_bs(blk), co);
> >   BDRV_POLL_WHILE(blk_bs(blk), rwco.ret == NOT_DONE);
> >
> > BDRV_POLL_WHILE() boils down to a loop like this:
> >
> >   while ((cond)) {
> >       aio_poll(ctx, true);
> >   }
> >
> 	I think that would make vCPUs sending requests and the
> receiver coroutine all poll on the same socket. If the “wrong”
> routine reads the message, I’d need a second level of synchronization
> to pass the message to the “right” one. e.g., if the vCPU coroutine
> reads a request, it needs to pass it to the receiver; if the receiver
> coroutine reads a reply, it needs to pass it to a vCPU.
>
> 	Avoiding this complexity is one of the reasons I went with
> a separate thread that only reads the socket over the mp-qemu model,
> which does have the sender poll, but doesn’t need to handle incoming
> requests.

Only one coroutine reads from the socket, the "receiver" coroutine. In a
previous reply I sketched what the receiver does:

  if it's a reply
      reply = find_reply(...)
      qemu_coroutine_enter(reply->co) // instead of signalling reply->cv
  else
      QSIMPLEQ_INSERT_TAIL(&pending_reqs, request, next);
      if (pending_reqs_was_empty) {
          qemu_coroutine_enter(process_request_co);
      }

The qemu_coroutine_enter(reply->co) call re-enters the coroutine that was
created by the vCPU thread. Is this the "second level of synchronization"
that you described? It's very similar to signalling reply->cv in the
existing patch.
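To make the dispatch above concrete, here is a compilable model of the
receiver's reply-vs-request branching. It is only an illustration, not
QEMU code: the Msg/Receiver types, the hand-rolled queue, and wake_reply()
(standing in for qemu_coroutine_enter(reply->co)) are all invented for
this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the real vfio-user message and state types. */
typedef struct Msg {
    int id;
    bool is_reply;
    struct Msg *next;
} Msg;

typedef struct {
    Msg *pending_head, *pending_tail;   /* models the pending_reqs queue */
    int woken_reply_id;                 /* records which reply was "entered" */
    bool process_request_entered;       /* was process_request_co kicked? */
} Receiver;

static void wake_reply(Receiver *r, Msg *reply)
{
    /* stands in for qemu_coroutine_enter(reply->co) */
    r->woken_reply_id = reply->id;
}

/* The receiver's per-message dispatch: wake the waiter for a reply,
 * otherwise queue the request and kick the processing coroutine only
 * when the queue transitions from empty to non-empty. */
static void receiver_dispatch(Receiver *r, Msg *m)
{
    if (m->is_reply) {
        wake_reply(r, m);
    } else {
        bool was_empty = (r->pending_head == NULL);

        m->next = NULL;                 /* QSIMPLEQ_INSERT_TAIL(...) */
        if (was_empty) {
            r->pending_head = r->pending_tail = m;
            r->process_request_entered = true;  /* enter process_request_co */
        } else {
            r->pending_tail->next = m;
            r->pending_tail = m;
        }
    }
}
```

Note the "was_empty" check: the processing coroutine is entered only on the
empty-to-non-empty transition, since it drains the whole queue once running.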
Now I'm actually thinking about whether this can be improved by keeping
the condvar so that the vCPU thread doesn't need to call aio_poll() (which
is awkward because it doesn't drop the BQL and therefore blocks other
vCPUs from making progress). That approach wouldn't require a dedicated
thread for vfio-user.

> > I also want to check that I understand the scenarios in which the
> > vfio-user communication code is used:
> >
> > 1. vhost-user-server
> >
> > The vfio-user communication code should run in a given AioContext (it
> > will be the main loop by default but maybe the user will be able to
> > configure a specific IOThread in the future).
> >
> 	Jag would know more, but I believe it runs off the main loop.
> Running it in an iothread doesn’t gain much, since it needs BQL to
> run the device emulation code.
>
> > 2. vCPU thread vfio-user clients
> >
> > The vfio-user communication code is called from the vCPU thread where
> > the proxy device executes. The MemoryRegion->read()/write() callbacks
> > are synchronous, so the thread needs to wait for a vfio-user reply
> > before it can return.
> >
> > Is this what you had in mind?
> >
> 	The client is also called from the main thread - the GET_*
> messages from vfio_user_pci_realize() as well as MAP/DEMAP messages
> from guest address space change transactions. It is also called by
> the migration thread, which is a separate thread that does not run
> holding BQL.

Thanks for mentioning those additional cases.

Stefan
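P.S. A minimal model of the condvar idea above, in plain pthreads rather
than QEMU's QemuCond/QemuMutex: the vCPU thread sleeps on a per-reply
condition variable instead of iterating the event loop, and the receiver
signals it when the reply arrives. The Reply type, receiver_thread(), and
the hard-coded result are all illustrative; in QEMU the waiter would also
have to drop the BQL around the wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Illustrative per-reply wait state; models reply->cv in the patch. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cv;
    bool complete;
    int result;
} Reply;

/* Models the dedicated receiver: completes the reply and signals it. */
static void *receiver_thread(void *opaque)
{
    Reply *reply = opaque;

    pthread_mutex_lock(&reply->lock);
    reply->result = 42;              /* pretend a reply was read off the socket */
    reply->complete = true;
    pthread_cond_signal(&reply->cv);
    pthread_mutex_unlock(&reply->lock);
    return NULL;
}

/* Models the vCPU thread: blocks on the condvar, no aio_poll() needed. */
static int wait_for_reply(Reply *reply)
{
    pthread_t tid;

    pthread_create(&tid, NULL, receiver_thread, reply);
    pthread_mutex_lock(&reply->lock);
    while (!reply->complete) {       /* predicate checked under the lock,
                                        so a wakeup can never be lost */
        pthread_cond_wait(&reply->cv, &reply->lock);
    }
    pthread_mutex_unlock(&reply->lock);
    pthread_join(tid, NULL);
    return reply->result;
}
```

The while-loop around pthread_cond_wait() is the standard guard against
spurious wakeups, and checking the predicate under the lock is what makes
this safe even if the receiver signals before the waiter starts waiting.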