RFC fuse waitq latency

* RFC fuse waitq latency
@ 2022-03-28 13:21 Bernd Schubert
  2022-04-22 12:25 ` Miklos Szeredi
  0 siblings, 1 reply; 5+ messages in thread
From: Bernd Schubert @ 2022-03-28 13:21 UTC (permalink / raw)
  To: Linux-FSDevel, Linux Kernel
  Cc: Dharmendra Singh, Miklos Szeredi, Boaz Harrosh, Sagi Manole

I would like to discuss the user thread wake up latency in
fuse_dev_do_read(). Profiling fuse shows there is room for improvement
regarding memory copies and splice. The basic profiling with flame graphs
didn't reveal, though, why fuse is so much
slower (with an overlay file system) than just accessing the underlying
file system directly and also didn't reveal why a single threaded fuse
uses less than 100% cpu, with the application on top of use also using
less than 100% cpu (simple bonnie++ runs with 1B files).
So I started to suspect the wait queues and indeed, keeping the thread
that reads the fuse device for work running for some time gives quite
some improvements.

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 592730fd6e42..20b7cf296fb0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1034,7 +1034,7 @@ static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
  
  static int forget_pending(struct fuse_iqueue *fiq)
  {
-       return fiq->forget_list_head.next != NULL;
+       return READ_ONCE(fiq->forget_list_head.next) != NULL;
  }
  
  static int request_pending(struct fuse_iqueue *fiq)
@@ -1237,18 +1237,25 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
                 return -EINVAL;
  
   restart:
+       expires = jiffies + 10;
         for (;;) {
-               spin_lock(&fiq->lock);
-               if (!fiq->connected || request_pending(fiq))
-                       break;
-               spin_unlock(&fiq->lock);
  
+               if (!READ_ONCE(fiq->connected) || request_pending(fiq)) {
+                       spin_lock(&fiq->lock);
+                       if (!fiq->connected || request_pending(fiq))
+                               break;
+                       spin_unlock(&fiq->lock);
+               }
                 if (file->f_flags & O_NONBLOCK)
                         return -EAGAIN;
-               err = wait_event_interruptible_exclusive(fiq->waitq,
-                               !fiq->connected || request_pending(fiq));
+
+               if (time_after_eq(jiffies, expires))
+                       err = wait_event_interruptible_exclusive(fiq->waitq,
+                                       !fiq->connected || request_pending(fiq));
                 if (err)
                         return err;
+
+               cond_resched();
         }
  
         if (!fiq->connected) {



Without patch above

>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
> imesrv1   30:1:1:10  5568  28  7784  40  9737  23  5756  29  5709  39  7573  25
> Latency             26813us     654us     965us     261us     550us     336ms



Patch above applied

                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
       files:max:min  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   30:1:1:10  8043  30 12100  44 14024  22  7791  28  9238  43  9871  25
Latency               235us    3982us    3201us     240us     277us     355ms


So there is quite some improvement by 'just' preventing the thread going to
sleep, with the disadvantage that the thread now spins. This also does not
work that well when libfuse creates multiple threads (still with a single
threaded bonnie++), as wakeup then wakes up different threads, multiple
of them start to spin and without having profiled it, I guess
fiq->lock then might become a problem.

I had also tried to use swaitq instead of waitq, there is a little improvement
with it, but it does not solve the major issue.


Now if the wakeup is an issue, how did zufs avoid it? Looking at its code,
zufs also has a thread wakeup. Same for Miklos' fuse2. On our side
Dharmendra was just read to start to port Miklos' fuse2 to more recent
kernels and to add support into libfuse for it. Now with the waitq
latency we are not sure if this is actually the right approach.



Results above are done with bonnie++
bonnie++ -x 4 -q -s0  -d /scratch/dest/ -n 30:1:1:10 -r 0

using passthrough_hp. This is with additional patches (kernel side
has Dharmendras atomic-open optimizations, libfuse has additional
patches for atomic-open, a libfuse thread creation fix and more
fixes and options for passthrough_hp.cc).

passthrough_hp --foreground --nosplice --nocache --num_threads=1\
     /scratch/source /scratch/dest


Thanks,
Bernd


^ permalink raw reply related	[flat|nested] 5+ messages in thread