qemu-devel.nongnu.org archive mirror
* 9p: requests efficiency
@ 2019-11-15  1:10 Christian Schoenebeck
  2019-11-15 13:26 ` Greg Kurz
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Schoenebeck @ 2019-11-15  1:10 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz

I'm currently reading up on how client requests (T messages) are dispatched 
by 9pfs in general, to understand where the potential inefficiencies I am 
encountering come from.

9pfs is actually pretty fast on raw I/O (read/write requests), provided that the 
message payload size on guest side is chosen large enough (e.g. 
trans=virtio,version=9p2000.L,msize=4194304,...); with that I already come close 
to my test disk's theoretical maximum performance on read/write tests. But 
obviously these are huge 9p requests.
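
(For reference, the guest side mount I am using for these tests looks roughly 
like the following; the share tag "hostshare" and the mount point are just 
placeholders:)

    mount -t 9p -o trans=virtio,version=9p2000.L,msize=4194304 hostshare /mnt/9p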

However, when there is a large number of (i.e. small) 9p requests, no matter 
what the actual request type is, I am encountering severe performance issues 
with 9pfs, and I am trying to understand whether this could be improved with 
reasonable effort.

If I understand it correctly, each incoming request (T message) is dispatched 
to its own qemu coroutine queue. So individual requests should already be 
processed in parallel, right?

Best regards,
Christian Schoenebeck





* Re: 9p: requests efficiency
  2019-11-15  1:10 9p: requests efficiency Christian Schoenebeck
@ 2019-11-15 13:26 ` Greg Kurz
  2019-11-20 23:37   ` Christian Schoenebeck
  0 siblings, 1 reply; 4+ messages in thread
From: Greg Kurz @ 2019-11-15 13:26 UTC (permalink / raw)
  To: Christian Schoenebeck; +Cc: qemu-devel

On Fri, 15 Nov 2019 02:10:50 +0100
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> I'm currently reading up on how client requests (T messages) are dispatched 
> by 9pfs in general, to understand where the potential inefficiencies I am 
> encountering come from.
> 
> 9pfs is actually pretty fast on raw I/O (read/write requests), provided that the 
> message payload size on guest side is chosen large enough (e.g. 
> trans=virtio,version=9p2000.L,msize=4194304,...); with that I already come close 
> to my test disk's theoretical maximum performance on read/write tests. But 
> obviously these are huge 9p requests.
> 
> However, when there is a large number of (i.e. small) 9p requests, no matter 
> what the actual request type is, I am encountering severe performance issues 
> with 9pfs, and I am trying to understand whether this could be improved with 
> reasonable effort.
> 

Thanks for doing that. This is typically the kind of effort I never
dared to start on my own.

> If I understand it correctly, each incoming request (T message) is dispatched 
> to its own qemu coroutine queue. So individual requests should already be 
> processed in parallel, right?
> 

Sort of, but not exactly. The real parallelization, i.e. doing parallel
processing with concurrent threads, doesn't take place on a per-request
basis. A typical request is broken down into several calls to the backend
which may block because the backend itself calls a syscall that may block
in the kernel. Each backend call is thus handled by its own thread from the
mainloop thread pool (see hw/9pfs/coth.[ch] for details). The rest of the
9p code, basically everything in 9p.c, is serialized in the mainloop thread.
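
To give you the shape of it (simplified sketch only, not the literal source; 
unmarshaling and error handling omitted): a request handler in 9p.c runs as a 
coroutine on the mainloop thread and only hops to a worker thread for the 
duration of each backend call, via one of the v9fs_co_*() wrappers:

    /* Simplified sketch of a request handler in 9p.c -- not the literal
     * source. The coroutine runs on the mainloop thread; only the
     * v9fs_co_*() backend call below executes on a worker thread. */
    static void coroutine_fn v9fs_getattr_sketch(void *opaque)
    {
        V9fsPDU *pdu = opaque;
        V9fsFidState *fidp;
        struct stat stbuf;
        int32_t fid = 0;   /* would be unmarshaled from the T message */
        ssize_t err;

        /* decode the T message and look up the fid (mainloop thread) */
        fidp = get_fid(pdu, fid);

        /* yields, re-enters on a worker thread, performs the potentially
         * blocking lstat() there, then bounces back to the mainloop */
        err = v9fs_co_lstat(pdu, &fidp->path, &stbuf);

        /* marshal the R message and complete (mainloop thread again) */
        pdu_complete(pdu, err);
    }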

Cheers,

--
Greg

> Best regards,
> Christian Schoenebeck
> 
> 




* Re: 9p: requests efficiency
  2019-11-15 13:26 ` Greg Kurz
@ 2019-11-20 23:37   ` Christian Schoenebeck
  2019-11-23 19:26     ` Christian Schoenebeck
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Schoenebeck @ 2019-11-20 23:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Greg Kurz, Christian Schoenebeck

On Friday, 15 November 2019 14:26:56 CET Greg Kurz wrote:
> > However, when there is a large number of (i.e. small) 9p requests, no
> > matter what the actual request type is, I am encountering severe
> > performance issues with 9pfs, and I am trying to understand whether this
> > could be improved with reasonable effort.
> 
> Thanks for doing that. This is typically the kind of effort I never
> dared to start on my own.

If you don't mind, I'll still ask some more questions, just in case you can 
answer them off the top of your head.

> > If I understand it correctly, each incoming request (T message) is
> > dispatched to its own qemu coroutine queue. So individual requests should
> > already be processed in parallel, right?
> 
> Sort of but not exactly. The real parallelization, ie. doing parallel
> processing with concurrent threads, doesn't take place on a per-request
> basis. 

Ok, I see. I had just been reading that each request causes this call sequence:

	handle_9p_output() -> pdu_submit() -> qemu_co_queue_init(&pdu->complete)

and I was misinterpreting specifically that latter call as an implied thread 
creation, because that's what happens in other, somewhat similar task dispatch 
frameworks like "Grand Central Dispatch" or std::async.

But now I realize that the QEMU coroutine framework itself really just manages 
execution stacks, and does not do anything with threads per se. The QEMU docs 
often use the term "threads", which is IMO misleading for what the framework 
really does.
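
I.e. (if I got that right) something like this toy example never leaves the 
calling thread, it merely switches stacks on it:

    /* Toy example (not from the QEMU sources): everything below runs on
     * the one thread that calls qemu_coroutine_enter(); no thread is
     * ever created, the coroutine just switches stacks back and forth. */
    #include "qemu/osdep.h"
    #include "qemu/coroutine.h"

    static void coroutine_fn my_co(void *opaque)
    {
        int *step = opaque;
        *step = 1;
        qemu_coroutine_yield();    /* suspend; control returns to the caller */
        *step = 2;
    }

    static void demo(void)
    {
        int step = 0;
        Coroutine *co = qemu_coroutine_create(my_co, &step);
        qemu_coroutine_enter(co);  /* runs my_co until the yield; step == 1 */
        qemu_coroutine_enter(co);  /* resumes after the yield; step == 2 */
    }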

> A typical request is broken down into several calls to the backend
> which may block because the backend itself calls a syscall that may block
> in the kernel. Each backend call is thus handled by its own thread from the
> mainloop thread pool (see hw/9pfs/coth.[ch] for details). The rest of the
> 9p code, basically everything in 9p.c, is serialized in the mainloop thread.

So the precise parallelism fork points in 9pfs (where tasks are dispatched to 
other threads) are the *_co_*() functions, and more precisely the places where 
they use v9fs_co_run_in_worker( X ), correct? Or are there more fork points 
than those?

If so, I haven't yet understood how precisely v9fs_co_run_in_worker() works. I 
mean, I understand now how QEMU coroutines work, and that the idea of 
v9fs_co_run_in_worker() is to dispatch the passed code block to a worker 
thread while immediately returning to the main thread and continuing there 
with other coroutines until the code dispatched to the worker thread has 
finished. But how exactly that happens inside v9fs_co_run_in_worker() is not 
yet clear to me.

Also, where are the worker threads actually spawned?

Best regards,
Christian Schoenebeck





* Re: 9p: requests efficiency
  2019-11-20 23:37   ` Christian Schoenebeck
@ 2019-11-23 19:26     ` Christian Schoenebeck
  0 siblings, 0 replies; 4+ messages in thread
From: Christian Schoenebeck @ 2019-11-23 19:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: Christian Schoenebeck, Greg Kurz

On Thursday, 21 November 2019 00:37:36 CET Christian Schoenebeck wrote:
> If so, I haven't yet understood how precisely v9fs_co_run_in_worker() works.
> I mean, I understand now how QEMU coroutines work, and that the idea of
> v9fs_co_run_in_worker() is to dispatch the passed code block to a worker
> thread while immediately returning to the main thread and continuing there
> with other coroutines until the code dispatched to the worker thread has
> finished. But how exactly that happens inside v9fs_co_run_in_worker() is
> not yet clear to me.
> 
> Also, where are the worker threads actually spawned?

Never mind about these questions; I figured them out by myself in the 
meantime.
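
For the archives, the mechanism boils down to roughly this (paraphrased from 
hw/9pfs/coth.[ch] from memory, so not the literal source), and as far as I can 
see the worker threads come from the mainloop's thread pool, spawned on demand:

    /* Paraphrased from hw/9pfs/coth.h (from memory, not the literal
     * source): hand code_block over to a worker thread and come back. */
    #define v9fs_co_run_in_worker(code_block)                              \
        do {                                                               \
            /* still on the mainloop thread: schedule a BH that hands      \
             * this coroutine to the thread pool, then suspend */          \
            QEMUBH *co_bh = qemu_bh_new(co_run_in_worker_bh,               \
                                        qemu_coroutine_self());            \
            qemu_bh_schedule(co_bh);                                       \
            qemu_coroutine_yield();                                        \
            qemu_bh_delete(co_bh);                                         \
            /* re-entered from a worker thread: the blocking backend       \
             * call(s) in code_block run off the mainloop here */          \
            code_block;                                                    \
            /* suspend again; a completion callback re-enters the          \
             * coroutine on the mainloop thread, where 9p.c continues */   \
            qemu_coroutine_yield();                                        \
        } while (0)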

Another question though: I see that you introduced those readdir mutex locks 
with commit 7cde47d4a89. Do you remember the scenario where this concurrency 
issue happened? 

A V9fsDir exists per fid (file ID). So I think concurrency should only happen 
there if the same fid (of a directory) was shared and used concurrently on 
guest side, because otherwise, if the fid was used by only one thread on guest 
side, all 9p requests on that fid should end up being completely serialized 
end to end.
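
Just to make explicit what kind of race I have in mind (stand-alone toy 
program, not 9pfs code): two threads sharing one DIR* stream and doing 
seekdir() + readdir() without a lock can interleave between the seek and the 
read, which is essentially what could happen if two requests for the same 
directory fid were served by different worker threads concurrently:

    /* Toy illustration (not 9pfs code): unsynchronized seekdir()+readdir()
     * on a shared DIR* from two threads -- each thread may end up reading
     * an entry belonging to the other thread's position. */
    #include <dirent.h>
    #include <pthread.h>
    #include <stdio.h>

    static DIR *shared_dir;

    static void *reader(void *arg)
    {
        long pos = *(long *)arg;
        for (int i = 0; i < 100; i++) {
            /* set the position, then read -- but the other thread may
             * move the shared stream in between these two calls */
            seekdir(shared_dir, pos);
            struct dirent *e = readdir(shared_dir);
            if (e) {
                printf("pos %ld -> %s\n", pos, e->d_name);
            }
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        long pos1, pos2;

        shared_dir = opendir(".");
        pos1 = telldir(shared_dir);
        readdir(shared_dir);           /* advance by one entry */
        pos2 = telldir(shared_dir);

        pthread_create(&t1, NULL, reader, &pos1);
        pthread_create(&t2, NULL, reader, &pos2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        closedir(shared_dir);
        return 0;
    }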

Best regards,
Christian Schoenebeck




