From: Christian Schoenebeck <qemu_oss@crudebyte.com>
To: qemu-devel@nongnu.org
Cc: Greg Kurz <groug@kaod.org>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Subject: Re: 9pfs: scope of rename_lock?
Date: Wed, 26 May 2021 15:41:48 +0200 [thread overview]
Message-ID: <2286290.ABruit1yvu@silver> (raw)
In-Reply-To: <3482348.c50Dd4UUo8@silver>
On Tuesday, May 25th 2021 13:41:22 CEST Christian Schoenebeck wrote:
> I started to work on a patch set on this.
>
> I try to get rid of that rename_lock entirely by having the worker
> threads access only temporary copies (e.g. of the fid path) instead of
> letting them access main-thread-owned structures directly, as is the
> case ATM.
>
> I also noticed that the rename_lock scheme is quite inconsistent right
> now anyway. E.g. some of the v9fs_co_*() functions that access
> main-thread-owned structures don't take the lock at all; some for good
> reasons, some not.
>
> So this week I will give the approach described above a test spin, and
> then we will see how this impacts performance in practice before
> actually posting any patch set here.
Ok, I made some test runs. The results were actually quite disappointing: it
made literally no difference in performance whether or not that global lock is
there. That surprises me, because I was expecting it to be a bottleneck for
parallel background worker tasks, and that removing the global lock would
provide a performance gain of at least a few percent.
I performed two tests:
1. Measuring the boot time of a guest OS on top of 9pfs, where the guest OS
uses a parallel init system. Zero difference here.
2. A simple shell script using 'find' to scan the entire filesystem in one
guest process, while a shell loop in another guest process constantly renames
an arbitrary directory in parallel, trying to hit that global 9p lock. In this
test the parallel renaming slowed down the find process by about 20%, which is
definitely too much, but removing the lock did not improve this either.
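Test 2 roughly corresponds to the following sketch (directory names and the
scratch-directory default are placeholders; in the real run MNT pointed at the
guest's 9p mount):

```shell
#!/bin/sh
# Sketch of test 2: one process scans the whole tree while another
# renames a directory in a tight loop, to contend on the 9p rename
# path (and thus on the global rename_lock).
MNT=${MNT:-$(mktemp -d)}   # point this at the guest's 9p mount
STOP=$(mktemp)             # the renamer stops once this file exists
rm -f "$STOP"

mkdir -p "$MNT/renameme.a"

# Renamer: toggle a directory name as fast as possible.
while [ ! -e "$STOP" ]; do
    mv "$MNT/renameme.a" "$MNT/renameme.b" 2>/dev/null
    mv "$MNT/renameme.b" "$MNT/renameme.a" 2>/dev/null
done &
RENAMER=$!

# Scanner: walk the entire tree (this is the part that was timed).
find "$MNT" -xdev >/dev/null 2>&1

touch "$STOP"              # tell the renamer to stop
wait "$RENAMER"
```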
I had some doubts about my results and wondered whether the actual number of
QEMU worker threads used by 9p was simply too small for some reason, so I
added a little debug hack to verify how many workers were actually running for
9p:
diff --git a/hw/9pfs/coth.c b/hw/9pfs/coth.c
index 2802d41cce..a5340db28b 100644
--- a/hw/9pfs/coth.c
+++ b/hw/9pfs/coth.c
@@ -22,6 +22,28 @@
#include "qemu/coroutine.h"
#include "qemu/main-loop.h"
#include "coth.h"
+#include <stdatomic.h>
+
+atomic_int p9_workers;
+atomic_int p9_workers_max;
+
+void p9_worker_starts(void) {
+    int now = atomic_fetch_add_explicit(&p9_workers, 1, memory_order_seq_cst) + 1;
+    int prev = p9_workers_max;
+    bool isNewMax = false;
+    /* raise the recorded maximum if we just exceeded it */
+    while (now > prev) {
+        if (atomic_compare_exchange_weak(&p9_workers_max, &prev, now)) {
+            isNewMax = true;
+            break;
+        }
+    }
+    if (isNewMax) {
+        printf("9P: NEW MAX WORKER THREADS : %d <-------- !!!\n", now);
+        fflush(stdout);
+    }
+}
+
+void p9_worker_ends(void) {
+    atomic_fetch_add_explicit(&p9_workers, -1, memory_order_seq_cst);
+}
/* Called from QEMU I/O thread. */
static void coroutine_enter_cb(void *opaque, int ret)
diff --git a/hw/9pfs/coth.h b/hw/9pfs/coth.h
index c51289903d..e82f92af5e 100644
--- a/hw/9pfs/coth.h
+++ b/hw/9pfs/coth.h
@@ -51,12 +51,16 @@
*/ \
qemu_coroutine_yield(); \
qemu_bh_delete(co_bh); \
+ p9_worker_starts(); \
code_block; \
+ p9_worker_ends(); \
/* re-enter back to qemu thread */ \
qemu_coroutine_yield(); \
} while (0)
void co_run_in_worker_bh(void *);
+void p9_worker_starts(void);
+void p9_worker_ends(void);
int coroutine_fn v9fs_co_readlink(V9fsPDU *, V9fsPath *, V9fsString *);
int coroutine_fn v9fs_co_readdir(V9fsPDU *, V9fsFidState *, struct dirent **);
int coroutine_fn v9fs_co_readdir_many(V9fsPDU *, V9fsFidState *,
The result: I got about 8 to 12 worker threads doing their job for 9pfs
simultaneously at most. Strictly speaking, if you look at this debug code, I
am actually counting the number of worker *coroutines*, not the number of
worker *threads*; but as all v9fs_co_*() functions dispatch back to the main
thread exactly once (i.e. once the worker task completed or errored out), that
count should in fact always equal the number of parallel 9p worker threads.
I actually wonder where the number of worker threads allocated for the QEMU
thread pool is determined, whether that's a hard-coded amount somewhere or
perhaps configurable from the CLI.
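One way to double-check this from the host side would be to list the thread
names of the running QEMU process (that the pool threads carry a common,
distinctive name is an assumption on my part; the fallback to the current
shell is just so the snippet runs standalone):

```shell
#!/bin/sh
# List the thread names of a running qemu process, with a count per
# name; thread-pool workers should show up under a common name.
pid=$(pgrep -o -f qemu-system 2>/dev/null)
pid=${pid:-$$}   # fall back to this shell for a standalone dry run

sort /proc/"$pid"/task/*/comm | uniq -c | sort -rn
```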
So for now I guess I'll postpone this global lock issue, unless somebody has
an idea of what I might still be missing here. I still think the global lock
should be removed in the long term, but considering that I saw zero
performance gain and that my current patch set's delta stats aren't exactly
small:
fsdev/9p-marshal.c | 7 +++++++
fsdev/9p-marshal.h | 1 +
hw/9pfs/9p.c | 32 +++++++-------------------------
hw/9pfs/9p.h | 27 +--------------------------
hw/9pfs/codir.c | 26 ++++++++++++++++----------
hw/9pfs/cofile.c | 53 +++++++++++++++++++++++++++++------------------------
hw/9pfs/cofs.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------
hw/9pfs/coth.c | 22 ++++++++++++++++++++++
hw/9pfs/coth.h | 4 ++++
hw/9pfs/coxattr.c | 28 ++++++++++++++++------------
10 files changed, 162 insertions(+), 136 deletions(-)
Hence I'll probably concentrate on other things for now. If somebody wants to
have a glimpse at my patch set, I can post it at any time, of course.
Best regards,
Christian Schoenebeck
Thread overview:
2021-05-16 17:06 9pfs: scope of rename_lock? Christian Schoenebeck
2021-05-21 11:59 ` Greg Kurz
2021-05-25 11:41 ` Christian Schoenebeck
2021-05-26 13:41 ` Christian Schoenebeck [this message]