qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3
@ 2019-08-01 16:54 Stefan Hajnoczi
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool Stefan Hajnoczi
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-01 16:54 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert, Stefan Hajnoczi

This patch series introduces the virtiofsd --thread-pool-size=NUM option and
sets the default value to 64.  Each virtqueue has its own thread pool for
processing requests.  Blocking requests no longer pause virtqueue processing,
and I/O performance should be greatly improved when the queue depth is greater
than 1.

Linux boot and pjdfstest have been tested with these patches and the default
thread pool size of 64.

I have now concluded the thread-safety code audit.  Please let me know if you
have concerns about things I missed.

Performance
-----------
Please try these patches out and share your results.

Scalability
-----------
Several synchronization primitives are taken by the virtqueue processing
thread or by thread pool workers.  Unfortunately they are necessary to protect
shared state, which means thread pool workers contend on, or at least access,
shared synchronization primitives.  If anyone has suggestions for improving
this situation, please discuss them.

1. vu_dispatch_rwlock - protects libvhost-user from races between the
   vhost-user protocol thread and the virtqueue processing and thread pool
   worker threads.

2. vq_lock - protects the virtqueue from races between the virtqueue processing
   thread and thread pool workers.

3. init_rwlock - protects FUSE_INIT/FUSE_DESTROY from races with other
   requests.

4. se->lock - protects se->list and the FUSE_INTERRUPT shared state.

Ideally we could avoid hitting all of these locks on each request.  That would
make the code scale better.

Future work
-----------
This series does not complete the multithreading effort yet.  Two items are
still missing:
1. Multiqueue support
2. CPU affinity for virtqueue processing threads and thread pools

Stefan Hajnoczi (4):
  virtiofsd: process requests in a thread pool
  virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
  virtiofsd: fix lo_destroy() resource leaks
  virtiofsd: add --thread-pool-size=NUM option

 contrib/virtiofsd/fuse_i.h         |   2 +
 contrib/virtiofsd/fuse_lowlevel.c  |  25 +-
 contrib/virtiofsd/fuse_virtio.c    | 491 ++++++++++++++++-------------
 contrib/virtiofsd/passthrough_ll.c |  43 ++-
 contrib/virtiofsd/seccomp.c        |   1 +
 5 files changed, 318 insertions(+), 244 deletions(-)

-- 
2.21.0




* [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
@ 2019-08-01 16:54 ` Stefan Hajnoczi
  2019-08-05 12:02   ` Dr. David Alan Gilbert
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races Stefan Hajnoczi
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-01 16:54 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert, Stefan Hajnoczi

Introduce a thread pool so that fv_queue_thread() just pops
VuVirtqElements and hands them to the thread pool.  For the time being
only one worker thread is allowed since passthrough_ll.c is not
thread-safe yet.  Future patches will lift this restriction so that
multiple FUSE requests can be processed in parallel.

The main new concept is struct FVRequest, which contains both
VuVirtqElement and struct fuse_chan.  We now have fv_VuDev for a device,
fv_QueueInfo for a virtqueue, and FVRequest for a request.  Some of
fv_QueueInfo's fields are moved into FVRequest because they are
per-request.  The name FVRequest conforms to QEMU coding style and I
expect the struct fv_* types will be renamed in a future refactoring.

This patch is not optimal.  fbuf reuse is dropped, so each request
does malloc(se->bufsize), but there is no clean and cheap way to keep
this with a thread pool.  The vq_lock mutex is held for longer than
necessary, especially during the eventfd_write() syscall.  Performance
can be improved in the future.

prctl(2) had to be added to the seccomp whitelist because glib invokes
it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/fuse_virtio.c | 491 ++++++++++++++++++--------------
 contrib/virtiofsd/seccomp.c     |   1 +
 2 files changed, 273 insertions(+), 219 deletions(-)

diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index d543c6d30f..0c52911144 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -29,26 +29,40 @@
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <sys/un.h>
+#include <glib.h>
 
 #include "contrib/libvhost-user/libvhost-user.h"
 
 struct fv_VuDev;
 struct fv_QueueInfo {
         pthread_t thread;
+
+        /* This lock protects the VuVirtq preventing races between
+         * fv_queue_thread() and fv_queue_worker().
+         */
+        pthread_mutex_t vq_lock;
+
         struct fv_VuDev *virtio_dev;
 
         /* Our queue index, corresponds to array position */
         int       qidx;
         int       kick_fd;
         int       kill_fd; /* For killing the thread */
-
-        /* The element for the command currently being processed */
-        VuVirtqElement *qe;
-        /* If any of the qe vec elements (towards vmm) are unmappable */
-        unsigned int elem_bad_in;
-        bool      reply_sent;
 };
 
+/* A FUSE request */
+typedef struct {
+        VuVirtqElement elem;
+        struct fuse_chan ch;
+
+        /* Number of unmappable iovecs */
+        unsigned int bad_in_num;
+        unsigned int bad_out_num;
+
+        /* Used to complete requests that involve no reply */
+        bool reply_sent;
+} FVRequest;
+
 /* We pass the dev element into libvhost-user
  * and then use it to get back to the outer
  * container for other data.
@@ -186,8 +200,11 @@ static void copy_iov(struct iovec *src_iov, int src_count,
 int virtio_send_msg(struct fuse_session *se, struct fuse_chan *ch,
                     struct iovec *iov, int count)
 {
-        VuVirtqElement *elem;
-        VuVirtq *q;
+        FVRequest *req = container_of(ch, FVRequest, ch);
+        struct fv_QueueInfo *qi = ch->qi;
+        VuDev *dev = &se->virtio_dev->dev;
+        VuVirtq *q = vu_get_queue(dev, qi->qidx);
+        VuVirtqElement *elem = &req->elem;
         int ret = 0;
 
         assert(count >= 1);
@@ -200,11 +217,7 @@ int virtio_send_msg(struct fuse_session *se, struct fuse_chan *ch,
 
         /* unique == 0 is notification, which we don't support */
         assert (out->unique);
-        /* For virtio we always have ch */
-        assert(ch);
-        assert(!ch->qi->reply_sent);
-        elem = ch->qi->qe;
-        q = &ch->qi->virtio_dev->dev.vq[ch->qi->qidx];
+        assert(!req->reply_sent);
 
         /* The 'in' part of the elem is to qemu */
         unsigned int in_num = elem->in_num;
@@ -231,9 +244,15 @@ int virtio_send_msg(struct fuse_session *se, struct fuse_chan *ch,
         }
 
         copy_iov(iov, count, in_sg, in_num, tosend_len);
-        vu_queue_push(&se->virtio_dev->dev, q, elem, tosend_len);
-        vu_queue_notify(&se->virtio_dev->dev, q);
-        ch->qi->reply_sent = true;
+
+        pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
+        pthread_mutex_lock(&qi->vq_lock);
+        vu_queue_push(dev, q, elem, tosend_len);
+        vu_queue_notify(dev, q);
+        pthread_mutex_unlock(&qi->vq_lock);
+        pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
+
+        req->reply_sent = true;
 
 err:
 
@@ -249,9 +268,12 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
                          struct iovec *iov, int count,
                          struct fuse_bufvec *buf, size_t len)
 {
+        FVRequest *req = container_of(ch, FVRequest, ch);
+        struct fv_QueueInfo *qi = ch->qi;
+        VuDev *dev = &se->virtio_dev->dev;
+        VuVirtq *q = vu_get_queue(dev, qi->qidx);
+        VuVirtqElement *elem = &req->elem;
         int ret = 0;
-        VuVirtqElement *elem;
-        VuVirtq *q;
 
         assert(count >= 1);
         assert(iov[0].iov_len >= sizeof(struct fuse_out_header));
@@ -271,15 +293,11 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
         /* unique == 0 is notification which we don't support */
         assert (out->unique);
 
-        /* For virtio we always have ch */
-        assert(ch);
-        assert(!ch->qi->reply_sent);
-        elem = ch->qi->qe;
-        q = &ch->qi->virtio_dev->dev.vq[ch->qi->qidx];
+        assert(!req->reply_sent);
 
         /* The 'in' part of the elem is to qemu */
         unsigned int in_num = elem->in_num;
-        unsigned int bad_in_num = ch->qi->elem_bad_in;
+        unsigned int bad_in_num = req->bad_in_num;
         struct iovec *in_sg = elem->in_sg;
         size_t in_len = iov_length(in_sg, in_num);
         size_t in_len_writeable = iov_length(in_sg, in_num - bad_in_num);
@@ -423,16 +441,219 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
 
         ret = 0;
 
-        vu_queue_push(&se->virtio_dev->dev, q, elem, tosend_len);
-        vu_queue_notify(&se->virtio_dev->dev, q);
+        pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
+        pthread_mutex_lock(&qi->vq_lock);
+        vu_queue_push(dev, q, elem, tosend_len);
+        vu_queue_notify(dev, q);
+        pthread_mutex_unlock(&qi->vq_lock);
+        pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
 
 err:
         if (ret == 0)
-                ch->qi->reply_sent = true;
+                req->reply_sent = true;
 
         return ret;
 }
 
+/* Process one FVRequest in a thread pool */
+static void fv_queue_worker(gpointer data, gpointer user_data)
+{
+        struct fv_QueueInfo *qi = user_data;
+        struct fuse_session *se = qi->virtio_dev->se;
+        struct VuDev *dev = &qi->virtio_dev->dev;
+        FVRequest *req = data;
+        VuVirtqElement *elem = &req->elem;
+        struct fuse_buf fbuf = {};
+        bool allocated_bufv = false;
+        struct fuse_bufvec bufv;
+        struct fuse_bufvec *pbufv;
+
+        assert(se->bufsize > sizeof(struct fuse_in_header));
+
+        /* An element contains one request and the space to send our response
+         * They're spread over multiple descriptors in a scatter/gather set
+         * and we can't trust the guest to keep them still; so copy in/out.
+         */
+        fbuf.mem = malloc(se->bufsize);
+        assert(fbuf.mem);
+
+        fuse_mutex_init(&req->ch.lock);
+        req->ch.fd = (int)0xdaff0d111;
+        req->ch.ctr = 1;
+        req->ch.qi = qi;
+
+        /* The 'out' part of the elem is from qemu */
+        unsigned int out_num = elem->out_num;
+        unsigned int out_num_readable = out_num - req->bad_out_num;
+        struct iovec *out_sg = elem->out_sg;
+        size_t out_len = iov_length(out_sg, out_num);
+        size_t out_len_readable = iov_length(out_sg, out_num_readable);
+        if (se->debug)
+                fuse_debug("%s: elem %d: with %d out desc of length %zd"
+                           " bad_in_num=%u bad_out_num=%u\n",
+                           __func__, elem->index, out_num,
+                           out_len, req->bad_in_num, req->bad_out_num);
+
+        /* The elem should contain a 'fuse_in_header' (in to fuse)
+         * plus the data based on the len in the header.
+         */
+        if (out_len_readable < sizeof(struct fuse_in_header)) {
+                fuse_err("%s: elem %d too short for in_header\n",
+                                __func__, elem->index);
+                assert(0); // TODO
+        }
+        if (out_len > se->bufsize) {
+                fuse_err("%s: elem %d too large for buffer\n",
+                                __func__, elem->index);
+                assert(0); // TODO
+        }
+        // Copy just the first element and look at it
+        copy_from_iov(&fbuf, 1, out_sg);
+
+        pbufv = NULL; /* Compiler thinks an uninitialised path */
+        if (req->bad_in_num || req->bad_out_num) {
+                bool handled_unmappable = false;
+
+                if (out_num > 2 && out_num_readable >= 2 && !req->bad_in_num &&
+                                out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
+                                ((struct fuse_in_header *)fbuf.mem)->opcode ==
+                                FUSE_WRITE &&
+                                out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
+                        handled_unmappable = true;
+
+                        // copy the fuse_write_in header after the fuse_in_header
+                        fbuf.mem += out_sg->iov_len;
+                        copy_from_iov(&fbuf, 1, out_sg + 1);
+                        fbuf.mem -= out_sg->iov_len;
+                        fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
+
+                        // Allocate the bufv, with space for the rest of the iov
+                        allocated_bufv = true;
+                        pbufv = malloc(sizeof(struct fuse_bufvec) +
+                                        sizeof(struct fuse_buf) * (out_num - 2));
+
+                        pbufv->count = 1;
+                        pbufv->buf[0] = fbuf;
+
+                        size_t iovindex, pbufvindex;
+                        iovindex = 2; // 2 headers, separate iovs
+                        pbufvindex = 1; // 2 headers, 1 fusebuf
+
+                        for(; iovindex < out_num; iovindex++, pbufvindex++) {
+                                pbufv->count++;
+                                pbufv->buf[pbufvindex].pos = ~0; // Dummy
+                                pbufv->buf[pbufvindex].flags =
+                                        (iovindex < out_num_readable) ?
+                                        0 : FUSE_BUF_PHYS_ADDR;
+                                pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
+                                pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
+                        }
+                }
+
+                if (out_num == 2 && out_num_readable == 2 && req->bad_in_num &&
+                                out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
+                                ((struct fuse_in_header *)fbuf.mem)->opcode ==
+                                FUSE_READ &&
+                                out_sg[1].iov_len == sizeof(struct fuse_read_in)) {
+                        if (se->debug) {
+                                fuse_debug("Unmappable read case "
+                                           "in_num=%d bad_in_num=%d\n",
+                                           elem->in_num, req->bad_in_num);
+                        }
+                        handled_unmappable = true;
+                }
+
+                if (!handled_unmappable) {
+                        fuse_err("Unhandled unmappable element: out: %d(b:%d) in: %d(b:%d)",
+                                 out_num, req->bad_out_num,
+                                 elem->in_num, req->bad_in_num);
+                        fv_panic(dev, "Unhandled unmappable element");
+                }
+        }
+
+        if (!req->bad_out_num) {
+                if (out_num > 2 &&
+                                out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
+                                ((struct fuse_in_header *)fbuf.mem)->opcode ==
+                                FUSE_WRITE &&
+                                out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
+                        // For a write we don't actually need to copy the
+                        // data, we can just do it straight out of guest memory
+                        // but we must still copy the headers in case the guest
+                        // was nasty and changed them while we were using them.
+                        if (se->debug)
+                                fuse_debug("%s: Write special case\n", __func__);
+
+                        // copy the fuse_write_in header after the fuse_in_header
+                        fbuf.mem += out_sg->iov_len;
+                        copy_from_iov(&fbuf, 1, out_sg + 1);
+                        fbuf.mem -= out_sg->iov_len;
+                        fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
+
+                        // Allocate the bufv, with space for the rest of the iov
+                        allocated_bufv = true;
+                        pbufv = malloc(sizeof(struct fuse_bufvec) +
+                                        sizeof(struct fuse_buf) * (out_num - 2));
+
+                        pbufv->count = 1;
+                        pbufv->buf[0] = fbuf;
+
+                        size_t iovindex, pbufvindex;
+                        iovindex = 2; // 2 headers, separate iovs
+                        pbufvindex = 1; // 2 headers, 1 fusebuf
+
+                        for(; iovindex < out_num; iovindex++, pbufvindex++) {
+                                pbufv->count++;
+                                pbufv->buf[pbufvindex].pos = ~0; // Dummy
+                                pbufv->buf[pbufvindex].flags = 0;
+                                pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
+                                pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
+                        }
+                } else {
+                        // Normal (non fast write) path
+
+                        // Copy the rest of the buffer
+                        fbuf.mem += out_sg->iov_len;
+                        copy_from_iov(&fbuf, out_num - 1, out_sg + 1);
+                        fbuf.mem -= out_sg->iov_len;
+                        fbuf.size = out_len;
+
+                        // TODO! Endianness of header
+
+                        // TODO: Add checks for fuse_session_exited
+                        bufv.buf[0] = fbuf;
+                        bufv.count = 1;
+                        pbufv = &bufv;
+                }
+        }
+        pbufv->idx = 0;
+        pbufv->off = 0;
+        fuse_session_process_buf_int(se, pbufv, &req->ch);
+
+        if (allocated_bufv) free(pbufv);
+
+        /* If the request has no reply, still recycle the virtqueue element */
+        if (!req->reply_sent) {
+                struct VuVirtq *q = vu_get_queue(dev, qi->qidx);
+
+                if (se->debug) {
+                        fuse_debug("%s: elem %d no reply sent\n",
+                                   __func__, elem->index);
+                }
+
+                pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
+                pthread_mutex_lock(&qi->vq_lock);
+                vu_queue_push(dev, q, elem, 0);
+                vu_queue_notify(dev, q);
+                pthread_mutex_unlock(&qi->vq_lock);
+                pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
+        }
+
+        pthread_mutex_destroy(&req->ch.lock);
+        free(fbuf.mem);
+        free(req);
+}
+
 /* Thread function for individual queues, created when a queue is 'started' */
 static void *fv_queue_thread(void *opaque)
 {
@@ -440,16 +661,14 @@ static void *fv_queue_thread(void *opaque)
         struct VuDev        *dev = &qi->virtio_dev->dev;
         struct VuVirtq      *q = vu_get_queue(dev, qi->qidx);
         struct fuse_session *se = qi->virtio_dev->se;
-        struct fuse_chan    ch;
-        struct fuse_buf     fbuf;
+        GThreadPool *pool;
 
-        fbuf.mem = NULL;
-        fbuf.flags = 0;
-
-        fuse_mutex_init(&ch.lock);
-        ch.fd = (int)0xdaff0d111;
-        ch.ctr = 1;
-        ch.qi = qi;
+        pool = g_thread_pool_new(fv_queue_worker, qi, 1 /* TODO max_threads */,
+                        TRUE, NULL);
+        if (!pool) {
+                fuse_err("%s: g_thread_pool_new failed\n", __func__);
+                return NULL;
+        }
 
         fuse_info("%s: Start for queue %d kick_fd %d\n",
                   __func__, qi->qidx, qi->kick_fd);
@@ -507,6 +726,8 @@ static void *fv_queue_thread(void *opaque)
                ret = pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
                assert(ret == 0); /* there is no possible error case */
 
+               pthread_mutex_lock(&qi->vq_lock);
+
                if (se->debug) {
                        /* out is from guest, in is too guest */
                        unsigned int in_bytes, out_bytes;
@@ -518,198 +739,26 @@ static void *fv_queue_thread(void *opaque)
                }
 
                while (1) {
-                       bool allocated_bufv = false;
-                       struct fuse_bufvec bufv;
-                       struct fuse_bufvec *pbufv;
                        unsigned int bad_in_num = 0, bad_out_num = 0;
-
-                       /* An element contains one request and the space to send our response
-                        * They're spread over multiple descriptors in a scatter/gather set
-                        * and we can't trust the guest to keep them still; so copy in/out.
-                        */
-                       VuVirtqElement *elem = vu_queue_pop(dev, q, sizeof(VuVirtqElement),
-                                                           &bad_in_num, &bad_out_num);
-                       if (!elem) {
+                       FVRequest *req = vu_queue_pop(dev, q, sizeof(FVRequest),
+                                                     &bad_in_num,
+                                                     &bad_out_num);
+                       if (!req) {
                                break;
                        }
 
-                       qi->qe = elem;
-                       qi->reply_sent = false;
-                       qi->elem_bad_in = bad_in_num;
+                       req->reply_sent = false;
+                       req->bad_in_num = bad_in_num;
+                       req->bad_out_num = bad_out_num;
 
-                       if (!fbuf.mem) {
-                               fbuf.mem = malloc(se->bufsize);
-                               assert(fbuf.mem);
-                               assert(se->bufsize > sizeof(struct fuse_in_header));
-                       }
-                       /* The 'out' part of the elem is from qemu */
-                       unsigned int out_num = elem->out_num;
-                       unsigned int out_num_readable = out_num - bad_out_num;
-                       struct iovec *out_sg = elem->out_sg;
-                       size_t out_len = iov_length(out_sg, out_num);
-                       size_t out_len_readable = iov_length(out_sg, out_num_readable);
-                       if (se->debug)
-                               fuse_debug("%s: elem %d: with %d out desc of length %zd"
-                                          " bad_in_num=%u bad_out_num=%u\n",
-					  __func__, elem->index, out_num,
-					  out_len, bad_in_num, bad_out_num);
-
-                       /* The elem should contain a 'fuse_in_header' (in to fuse)
-                        * plus the data based on the len in the header.
-                        */
-                       if (out_len_readable < sizeof(struct fuse_in_header)) {
-                               fuse_err("%s: elem %d too short for in_header\n",
-                                        __func__, elem->index);
-                               assert(0); // TODO
-                       }
-                       if (out_len > se->bufsize) {
-                               fuse_err("%s: elem %d too large for buffer\n",
-                                        __func__, elem->index);
-                               assert(0); // TODO
-                       }
-                       // Copy just the first element and look at it
-                       copy_from_iov(&fbuf, 1, out_sg);
-
-                       pbufv = NULL; /* Compiler thinks an unitialised path */
-                       if (bad_in_num || bad_out_num) {
-                           bool handled_unmappable = false;
-
-                           if (out_num > 2 && out_num_readable >= 2 && !bad_in_num &&
-                               out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
-                               ((struct fuse_in_header *)fbuf.mem)->opcode ==
-                                   FUSE_WRITE &&
-                               out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
-                               handled_unmappable = true;
-
-                               // copy the fuse_write_in header after the fuse_in_header
-                               fbuf.mem += out_sg->iov_len;
-                               copy_from_iov(&fbuf, 1, out_sg + 1);
-                               fbuf.mem -= out_sg->iov_len;
-                               fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
-
-                               // Allocate the bufv, with space for the rest of the iov
-                               allocated_bufv = true;
-                               pbufv = malloc(sizeof(struct fuse_bufvec) +
-                                              sizeof(struct fuse_buf) * (out_num - 2));
-
-                               pbufv->count = 1;
-                               pbufv->buf[0] = fbuf;
-
-                               size_t iovindex, pbufvindex;
-                               iovindex = 2; // 2 headers, separate iovs
-                               pbufvindex = 1; // 2 headers, 1 fusebuf
-
-                               for(; iovindex < out_num; iovindex++, pbufvindex++) {
-                                       pbufv->count++;
-                                       pbufv->buf[pbufvindex].pos = ~0; // Dummy
-                                       pbufv->buf[pbufvindex].flags =
-                                               (iovindex < out_num_readable) ?
-                                               0 : FUSE_BUF_PHYS_ADDR;
-                                       pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
-                                       pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
-                               }
-                           }
-
-                           if (out_num == 2 && out_num_readable == 2 && bad_in_num &&
-                               out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
-                               ((struct fuse_in_header *)fbuf.mem)->opcode ==
-                                   FUSE_READ &&
-                               out_sg[1].iov_len == sizeof(struct fuse_read_in)) {
-                               if (se->debug) {
-                                   fuse_debug("Unmappable read case "
-                                              "in_num=%d bad_in_num=%d\n",
-                                              elem->in_num, bad_in_num);
-                               }
-                               handled_unmappable = true;
-                           }
-
-                           if (!handled_unmappable) {
-                               fuse_err("Unhandled unmappable element: out: %d(b:%d) in: %d(b:%d)",
-                                        out_num, bad_out_num,
-                                        elem->in_num, bad_in_num);
-                               fv_panic(dev, "Unhandled unmappable element");
-                           }
-                       }
-
-                       if (!bad_out_num) {
-                           if (out_num > 2 &&
-                               out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
-                               ((struct fuse_in_header *)fbuf.mem)->opcode ==
-                                   FUSE_WRITE &&
-                               out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
-                                   // For a write we don't actually need to copy the
-                                   // data, we can just do it straight out of guest memory
-                                   // but we must sitll copy the headers in case the guest
-                                   // was nasty and changed them while we were using them.
-                                   if (se->debug)
-                                           fuse_debug("%s: Write special case\n", __func__);
-
-                                   // copy the fuse_write_in header afte rthe fuse_in_header
-                                   fbuf.mem += out_sg->iov_len;
-                                   copy_from_iov(&fbuf, 1, out_sg + 1);
-                                   fbuf.mem -= out_sg->iov_len;
-                                   fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
-
-                                   // Allocate the bufv, with space for the rest of the iov
-                                   allocated_bufv = true;
-                                   pbufv = malloc(sizeof(struct fuse_bufvec) +
-                                                  sizeof(struct fuse_buf) * (out_num - 2));
-
-                                   pbufv->count = 1;
-                                   pbufv->buf[0] = fbuf;
-
-                                   size_t iovindex, pbufvindex;
-                                   iovindex = 2; // 2 headers, separate iovs
-                                   pbufvindex = 1; // 2 headers, 1 fusebuf
-
-                                   for(; iovindex < out_num; iovindex++, pbufvindex++) {
-                                           pbufv->count++;
-                                           pbufv->buf[pbufvindex].pos = ~0; // Dummy
-                                           pbufv->buf[pbufvindex].flags = 0;
-                                           pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
-                                           pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
-                                   }
-                           } else {
-                                   // Normal (non fast write) path
-
-                                   // Copy the rest of the buffer
-                                   fbuf.mem += out_sg->iov_len;
-                                   copy_from_iov(&fbuf, out_num - 1, out_sg + 1);
-                                   fbuf.mem -= out_sg->iov_len;
-                                   fbuf.size = out_len;
-
-                                   // TODO! Endianness of header
-
-                                   // TODO: Add checks for fuse_session_exited
-                                   bufv.buf[0] = fbuf;
-                                   bufv.count = 1;
-                                   pbufv = &bufv;
-                           }
-                       }
-                       pbufv->idx = 0;
-                       pbufv->off = 0;
-                       fuse_session_process_buf_int(se, pbufv, &ch);
-
-                       if (allocated_bufv) free(pbufv);
-
-                       if (!qi->reply_sent) {
-			       if (se->debug) {
-				       fuse_debug("%s: elem %d no reply sent\n",
-					          __func__, elem->index);
-			       }
-                               /* I think we've still got to recycle the element */
-                               vu_queue_push(dev, q, elem, 0);
-                               vu_queue_notify(dev, q);
-                       }
-                       qi->qe = NULL;
-                       free(elem);
-                       elem = NULL;
+                       g_thread_pool_push(pool, req, NULL);
                 }
 
+                pthread_mutex_unlock(&qi->vq_lock);
                 pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
         }
-        pthread_mutex_destroy(&ch.lock);
-        free(fbuf.mem);
+
+        g_thread_pool_free(pool, FALSE, TRUE);
 
         return NULL;
 }
@@ -760,6 +809,9 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
 
                 ourqi->kill_fd = eventfd(0, EFD_CLOEXEC | EFD_SEMAPHORE);
                 assert(ourqi->kill_fd != -1);
+
+                pthread_mutex_init(&ourqi->vq_lock, NULL);
+
                 if (pthread_create(&ourqi->thread, NULL,  fv_queue_thread,
                                    ourqi)) {
                         fuse_err("%s: Failed to create thread for queue %d\n",
@@ -780,6 +832,7 @@ static void fv_queue_set_started(VuDev *dev, int qidx, bool started)
                        fuse_err("%s: Failed to join thread idx %d err %d\n",
                                 __func__, qidx, ret);
                 }
+                pthread_mutex_destroy(&ourqi->vq_lock);
                 close(ourqi->kill_fd);
                 ourqi->kick_fd = -1;
                 free(vud->qi[qidx]);
diff --git a/contrib/virtiofsd/seccomp.c b/contrib/virtiofsd/seccomp.c
index cea4cc5f60..5f1c873b82 100644
--- a/contrib/virtiofsd/seccomp.c
+++ b/contrib/virtiofsd/seccomp.c
@@ -58,6 +58,7 @@ static const int syscall_whitelist[] = {
 	SCMP_SYS(open),
 	SCMP_SYS(openat),
 	SCMP_SYS(ppoll),
+	SCMP_SYS(prctl), /* TODO restrict to just PR_SET_NAME? */
 	SCMP_SYS(preadv),
 	SCMP_SYS(pwrite64),
 	SCMP_SYS(read),
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool Stefan Hajnoczi
@ 2019-08-01 16:54 ` Stefan Hajnoczi
  2019-08-05 12:26   ` Dr. David Alan Gilbert
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks Stefan Hajnoczi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-01 16:54 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert, Stefan Hajnoczi

When running with multiple threads it can be tricky to handle
FUSE_INIT/FUSE_DESTROY in parallel with other request types or in
parallel with themselves.  Serialize FUSE_INIT and FUSE_DESTROY so that
malicious clients cannot trigger race conditions.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/fuse_i.h        |  1 +
 contrib/virtiofsd/fuse_lowlevel.c | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/contrib/virtiofsd/fuse_i.h b/contrib/virtiofsd/fuse_i.h
index ff6e1b75ba..dcde9feb97 100644
--- a/contrib/virtiofsd/fuse_i.h
+++ b/contrib/virtiofsd/fuse_i.h
@@ -63,6 +63,7 @@ struct fuse_session {
 	struct fuse_req list;
 	struct fuse_req interrupts;
 	pthread_mutex_t lock;
+	pthread_rwlock_t init_rwlock;
 	int got_destroy;
 	int broken_splice_nonblock;
 	uint64_t notify_ctr;
diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
index 6ebf237baa..483a1bc9be 100644
--- a/contrib/virtiofsd/fuse_lowlevel.c
+++ b/contrib/virtiofsd/fuse_lowlevel.c
@@ -2473,6 +2473,18 @@ void fuse_session_process_buf_int(struct fuse_session *se,
 	req->ctx.pid = in->pid;
 	req->ch = ch ? fuse_chan_get(ch) : NULL;
 
+	/* INIT and DESTROY requests are serialized, all other request types
+	 * run in parallel.  This prevents races between FUSE_INIT and ordinary
+	 * requests, FUSE_INIT and FUSE_INIT, FUSE_INIT and FUSE_DESTROY, and
+	 * FUSE_DESTROY and FUSE_DESTROY.
+	 */
+	if (in->opcode == FUSE_INIT || in->opcode == CUSE_INIT ||
+	    in->opcode == FUSE_DESTROY) {
+		pthread_rwlock_wrlock(&se->init_rwlock);
+	} else {
+		pthread_rwlock_rdlock(&se->init_rwlock);
+	}
+
 	err = EIO;
 	if (!se->got_init) {
 		enum fuse_opcode expected;
@@ -2524,10 +2536,13 @@ void fuse_session_process_buf_int(struct fuse_session *se,
 		do_write_buf(req, in->nodeid, &iter, bufv);
 	else
 		fuse_ll_ops[in->opcode].func(req, in->nodeid, &iter);
+
+	pthread_rwlock_unlock(&se->init_rwlock);
 	return;
 
 reply_err:
 	fuse_reply_err(req, err);
+	pthread_rwlock_unlock(&se->init_rwlock);
 }
 
 #define LL_OPTION(n,o,v) \
@@ -2569,6 +2584,7 @@ void fuse_session_destroy(struct fuse_session *se)
 		if (se->op.destroy)
 			se->op.destroy(se->userdata, se);
 	}
+	pthread_rwlock_destroy(&se->init_rwlock);
 	pthread_mutex_destroy(&se->lock);
 	free(se->cuse_data);
 	if (se->fd != -1)
@@ -2656,6 +2672,7 @@ struct fuse_session *fuse_session_new(struct fuse_args *args,
 	list_init_req(&se->list);
 	list_init_req(&se->interrupts);
 	fuse_mutex_init(&se->lock);
+	pthread_rwlock_init(&se->init_rwlock, NULL);
 
 	memcpy(&se->op, op, op_size);
 	se->owner = getuid();
-- 
2.21.0




* [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool Stefan Hajnoczi
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races Stefan Hajnoczi
@ 2019-08-01 16:54 ` Stefan Hajnoczi
  2019-08-05 15:17   ` Dr. David Alan Gilbert
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 4/4] virtiofsd: add --thread-pool-size=NUM option Stefan Hajnoczi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-01 16:54 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert, Stefan Hajnoczi

Now that lo_destroy() is serialized we can call unref_inode() so that
all inode resources are freed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/passthrough_ll.c | 43 ++++++++++++++----------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
index a81c01d0d1..02a5e97326 100644
--- a/contrib/virtiofsd/passthrough_ll.c
+++ b/contrib/virtiofsd/passthrough_ll.c
@@ -1340,28 +1340,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
 	}
 }
 
-static int unref_all_inodes_cb(gpointer key, gpointer value,
-			       gpointer user_data)
-{
-	struct lo_inode *inode  = value;
-	struct lo_data *lo = user_data;
-
-	inode->nlookup = 0;
-	lo_map_remove(&lo->ino_map, inode->fuse_ino);
-	close(inode->fd);
-	lo_inode_put(lo, &inode); /* Drop our refcount from lo_do_lookup() */
-
-	return TRUE;
-}
-
-static void unref_all_inodes(struct lo_data *lo)
-{
-	pthread_mutex_lock(&lo->mutex);
-	g_hash_table_foreach_remove(lo->inodes, unref_all_inodes_cb, lo);
-	pthread_mutex_unlock(&lo->mutex);
-
-}
-
 static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
 {
 	struct lo_data *lo = lo_data(req);
@@ -2462,6 +2440,18 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
 	fuse_reply_err(req, ret);
 }
 
+static int destroy_inode_cb(gpointer key, gpointer value, gpointer user_data)
+{
+        struct lo_inode *inode = value;
+        struct lo_data *lo = user_data;
+
+        /* inode->nlookup is normally protected by lo->mutex but see the
+         * comment in lo_destroy().
+         */
+        unref_inode(lo, inode, inode->nlookup);
+        return TRUE;
+}
+
 static void lo_destroy(void *userdata, struct fuse_session *se)
 {
 	struct lo_data *lo = (struct lo_data*) userdata;
@@ -2475,7 +2465,14 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
                         fuse_err("%s: unmap during destroy failed\n", __func__);
                 }
         }
-	unref_all_inodes(lo);
+
+        /* Normally lo->mutex must be taken when traversing lo->inodes but
+         * lo_destroy() is a serialized request so no races are possible here.
+         *
+         * In addition, we cannot acquire lo->mutex since destroy_inode_cb() takes it
+         * too and this would result in a recursive lock.
+         */
+        g_hash_table_foreach_remove(lo->inodes, destroy_inode_cb, lo);
 }
 
 static struct fuse_lowlevel_ops lo_oper = {
-- 
2.21.0




* [Qemu-devel] [PATCH 4/4] virtiofsd: add --thread-pool-size=NUM option
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks Stefan Hajnoczi
@ 2019-08-01 16:54 ` Stefan Hajnoczi
  2019-08-05  2:52 ` [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3 piaojun
  2019-08-07 18:03 ` [Qemu-devel] " Stefan Hajnoczi
  5 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-01 16:54 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert, Stefan Hajnoczi

Add an option to control the size of the thread pool.  Requests are now
processed in parallel by default.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 contrib/virtiofsd/fuse_i.h        | 1 +
 contrib/virtiofsd/fuse_lowlevel.c | 8 ++++++--
 contrib/virtiofsd/fuse_virtio.c   | 4 ++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/contrib/virtiofsd/fuse_i.h b/contrib/virtiofsd/fuse_i.h
index dcde9feb97..5c82cf2eac 100644
--- a/contrib/virtiofsd/fuse_i.h
+++ b/contrib/virtiofsd/fuse_i.h
@@ -74,6 +74,7 @@ struct fuse_session {
         int   vu_listen_fd;
         int   vu_socketfd;
         struct fv_VuDev *virtio_dev;
+        int thread_pool_size;
 };
 
 struct fuse_chan {
diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
index 483a1bc9be..b692791fbc 100644
--- a/contrib/virtiofsd/fuse_lowlevel.c
+++ b/contrib/virtiofsd/fuse_lowlevel.c
@@ -26,7 +26,7 @@
 #include <assert.h>
 #include <sys/file.h>
 
-
+#define THREAD_POOL_SIZE 64
 
 #define OFFSET_MAX 0x7fffffffffffffffLL
 
@@ -2556,6 +2556,7 @@ static const struct fuse_opt fuse_ll_opts[] = {
 	LL_OPTION("--socket-path=%s", vu_socket_path, 0),
         LL_OPTION("vhost_user_socket=%s", vu_socket_path, 0),
 	LL_OPTION("--fd=%d", vu_listen_fd, 0),
+	LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0),
 	FUSE_OPT_END
 };
 
@@ -2575,7 +2576,9 @@ void fuse_lowlevel_help(void)
 "    --socket-path=PATH         path for the vhost-user socket\n"
 "    -o vhost_user_socket=PATH  path for the vhost-user socket\n"
 "    --fd=FDNUM                 fd number of vhost-user socket\n"
-"    -o auto_unmount            auto unmount on process termination\n");
+"    -o auto_unmount            auto unmount on process termination\n"
+"    --thread-pool-size=NUM     thread pool size limit (default %d)\n",
+	       THREAD_POOL_SIZE);
 }
 
 void fuse_session_destroy(struct fuse_session *se)
@@ -2629,6 +2632,7 @@ struct fuse_session *fuse_session_new(struct fuse_args *args,
 	}
 	se->fd = -1;
 	se->vu_listen_fd = -1;
+	se->thread_pool_size = THREAD_POOL_SIZE;
 	se->conn.max_write = UINT_MAX;
 	se->conn.max_readahead = UINT_MAX;
 
diff --git a/contrib/virtiofsd/fuse_virtio.c b/contrib/virtiofsd/fuse_virtio.c
index 0c52911144..e13b3cc2e6 100644
--- a/contrib/virtiofsd/fuse_virtio.c
+++ b/contrib/virtiofsd/fuse_virtio.c
@@ -663,8 +663,8 @@ static void *fv_queue_thread(void *opaque)
         struct fuse_session *se = qi->virtio_dev->se;
         GThreadPool *pool;
 
-        pool = g_thread_pool_new(fv_queue_worker, qi, 1 /* TODO max_threads */,
-                        TRUE, NULL);
+        pool = g_thread_pool_new(fv_queue_worker, qi, se->thread_pool_size,
+                                 TRUE, NULL);
         if (!pool) {
                 fuse_err("%s: g_thread_pool_new failed\n", __func__);
                 return NULL;
-- 
2.21.0




* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 4/4] virtiofsd: add --thread-pool-size=NUM option Stefan Hajnoczi
@ 2019-08-05  2:52 ` piaojun
  2019-08-05  8:01   ` Stefan Hajnoczi
  2019-08-07 18:03 ` [Qemu-devel] " Stefan Hajnoczi
  5 siblings, 1 reply; 30+ messages in thread
From: piaojun @ 2019-08-05  2:52 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtio-fs, qemu-devel

Hi Stefan,

In my tests, 9p achieves better bandwidth than virtiofs, as shown below:

---
9p Test:
# mount -t 9p -o trans=virtio,version=9p2000.L,rw,nodev,msize=1000000000,access=client 9pshare /mnt/9pshare

# fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M -size=1G -numjob=1 -runtime=30 -group_reporting -name=file -filename=/mnt/9pshare/file
file: (g=0): rw=randwrite, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
fio-2.13
Starting 1 process
file: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/1091MB/0KB /s] [0/1091/0 iops] [eta 00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6187: Mon Aug  5 17:55:44 2019
  write: io=35279MB, bw=1175.1MB/s, iops=1175, runt= 30001msec
    slat (usec): min=589, max=4211, avg=844.04, stdev=124.04
    clat (usec): min=1, max=24, avg= 2.53, stdev= 1.16
     lat (usec): min=591, max=4214, avg=846.57, stdev=124.14
    clat percentiles (usec):
     |  1.00th=[    2],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    3],
     | 70.00th=[    3], 80.00th=[    3], 90.00th=[    3], 95.00th=[    3],
     | 99.00th=[    4], 99.50th=[   13], 99.90th=[   18], 99.95th=[   20],
     | 99.99th=[   22]
    lat (usec) : 2=0.04%, 4=98.27%, 10=1.15%, 20=0.48%, 50=0.06%
  cpu          : usr=23.83%, sys=5.24%, ctx=105843, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=35279/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
---

---
virtiofs Test:
# ./virtiofsd -o vhost_user_socket=/tmp/vhostqemu -o source=/mnt/virtiofs/ -o cache=none

# mount -t virtio_fs myfs /mnt/virtiofs -o rootmode=040000,user_id=0,group_id=0

# fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M -size=1G -numjob=1 -runtime=30 -group_reporting -name=file -filename=/mnt/virtiofs/file
file: (g=0): rw=randwrite, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=1
fio-2.13
Starting 1 process
file: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/895.1MB/0KB /s] [0/895/0 iops] [eta 00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6046: Mon Aug  5 17:54:58 2019
  write: io=23491MB, bw=801799KB/s, iops=783, runt= 30001msec
    slat (usec): min=93, max=390, avg=233.40, stdev=64.22
    clat (usec): min=849, max=4083, avg=1039.32, stdev=178.98
     lat (usec): min=971, max=4346, avg=1272.72, stdev=200.34
    clat percentiles (usec):
     |  1.00th=[  972],  5.00th=[  980], 10.00th=[  988], 20.00th=[  988],
     | 30.00th=[  996], 40.00th=[ 1004], 50.00th=[ 1012], 60.00th=[ 1012],
     | 70.00th=[ 1020], 80.00th=[ 1032], 90.00th=[ 1032], 95.00th=[ 1384],
     | 99.00th=[ 1560], 99.50th=[ 1768], 99.90th=[ 3664], 99.95th=[ 4016],
     | 99.99th=[ 4048]
    lat (usec) : 1000=37.21%
    lat (msec) : 2=62.39%, 4=0.34%, 10=0.06%
  cpu          : usr=15.39%, sys=4.03%, ctx=23496, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=23491/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
---

The backend filesystem is ext4 on a ramdisk, and iostat shows that 9p
achieves a deeper queue depth than virtiofs.  Checking the code, I found
that 9p uses pwritev() while virtiofs uses pwrite().  I wonder whether
virtiofs could also use an iovec to improve its performance.
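
The difference can be sketched as follows (a simplified, hypothetical
comparison using a scratch file; this is not the actual virtiofsd write
path, and the function names are made up for illustration):

```c
#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
#include <assert.h>

/* One pwrite() per segment: one syscall for every buffer in the
 * scatter list */
static ssize_t write_segments(int fd, const struct iovec *iov, int iovcnt,
			      off_t off)
{
	ssize_t total = 0;

	for (int i = 0; i < iovcnt; i++) {
		ssize_t n = pwrite(fd, iov[i].iov_base, iov[i].iov_len,
				   off + total);
		if (n < 0)
			return total ? total : -1;
		total += n;
		if ((size_t)n < iov[i].iov_len)
			break;	/* short write */
	}
	return total;
}

/* Single pwritev(): the whole scatter list in one syscall */
static ssize_t write_vectored(int fd, const struct iovec *iov, int iovcnt,
			      off_t off)
{
	return pwritev(fd, iov, iovcnt, off);
}
```

Both produce the same bytes on disk; the vectored version simply avoids
one syscall per segment.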

I'd like to help contribute such a patch in the future.

Thanks,
Jun

On 2019/8/2 0:54, Stefan Hajnoczi wrote:
> This patch series introduces the virtiofsd --thread-pool-size=NUM and sets the
> default value to 64.  Each virtqueue has its own thread pool for processing
> requests.  Blocking requests no longer pause virtqueue processing and I/O
> performance should be greatly improved when the queue depth is greater than 1.
> 
> Linux boot and pjdfstest have been tested with these patches and the default
> thread pool size of 64.
> 
> I have now concluded the thread-safety code audit.  Please let me know if you
> have concerns about things I missed.
> 
> Performance
> -----------
> Please try these patches out and share your results.
> 
> Scalability
> -----------
> There are several synchronization primitives that are taken by the virtqueue
> processing thread or the thread pool worker.  Unfortunately this is necessary
> to protect shared state.  It means that thread pool workers contend on or at
> least access thread synchronization primitives.  If anyone has suggestions for
> improving this situation, please discuss.
> 
> 1. vu_dispatch_rwlock - protects libvhost-user from races between the vhost-user
>    protocol thread and the virtqueue processing and thread pool worker threads.
> 
> 2. vq_lock - protects the virtqueue from races between the virtqueue processing
>    thread and thread pool workers.
> 
> 3. init_rwlock - protects FUSE_INIT/FUSE_DESTROY from races with other
>    requests.
> 
> 4. se->lock - protects se->list and the FUSE_INTERRUPT shared state.
> 
> Ideally we could avoid hitting all of these locks on each request.  That would
> make the code scale better.
> 
> Future work
> -----------
> This series does not complete the multithreading effort yet.  Two items are
> still missing:
> 1. Multiqueue support
> 2. CPU affinity for virtqueue processing threads and thread pools
> 
> Stefan Hajnoczi (4):
>   virtiofsd: process requests in a thread pool
>   virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
>   virtiofsd: fix lo_destroy() resource leaks
>   virtiofsd: add --thread-pool-size=NUM option
> 
>  contrib/virtiofsd/fuse_i.h         |   2 +
>  contrib/virtiofsd/fuse_lowlevel.c  |  25 +-
>  contrib/virtiofsd/fuse_virtio.c    | 491 ++++++++++++++++-------------
>  contrib/virtiofsd/passthrough_ll.c |  43 ++-
>  contrib/virtiofsd/seccomp.c        |   1 +
>  5 files changed, 318 insertions(+), 244 deletions(-)
> 



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-05  2:52 ` [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3 piaojun
@ 2019-08-05  8:01   ` Stefan Hajnoczi
  2019-08-05  9:40     ` piaojun
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-05  8:01 UTC (permalink / raw)
  To: piaojun; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi

On Mon, Aug 05, 2019 at 10:52:21AM +0800, piaojun wrote:
> # fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M -size=1G -numjob=1 -runtime=30 -group_reporting -name=file -filename=/mnt/9pshare/file

This benchmark configuration (--iodepth=1 --numjobs=1) cannot benefit
from virtiofsd multithreading.  Have you measured a regression when this
patch series is applied?

If your benchmark is unrelated to this patch series, let's discuss it in
a separate email thread.  fuse_buf_copy() can be optimized to use
pwritev().

Thanks,
Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-05  8:01   ` Stefan Hajnoczi
@ 2019-08-05  9:40     ` piaojun
  0 siblings, 0 replies; 30+ messages in thread
From: piaojun @ 2019-08-05  9:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi

Hi Stefan,

On 2019/8/5 16:01, Stefan Hajnoczi wrote:
> On Mon, Aug 05, 2019 at 10:52:21AM +0800, piaojun wrote:
>> # fio -direct=1 -time_based -iodepth=1 -rw=randwrite -ioengine=libaio -bs=1M -size=1G -numjob=1 -runtime=30 -group_reporting -name=file -filename=/mnt/9pshare/file
> 
> This benchmark configuration (--iodepth=1 --numjobs=1) cannot benefit
> from virtiofsd multithreading.  Have you measured a regression when this
> patch series is applied?
> 
> If your benchmark is unrelated to this patch series, let's discuss it in
> a separate email thread.  fuse_buf_copy() can be optimized to use
> pwritev().
> 
> Thanks,
> Stefan

Yes, my test was run without your patch series applied.  As you suggested,
I will discuss *pwritev* in a separate email thread.

Thanks,
Jun

> 



* Re: [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool Stefan Hajnoczi
@ 2019-08-05 12:02   ` Dr. David Alan Gilbert
  2019-08-07  9:35     ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-05 12:02 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Liu Bo, qemu-devel

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> Introduce a thread pool so that fv_queue_thread() just pops
> VuVirtqElements and hands them to the thread pool.  For the time being
> only one worker thread is allowed since passthrough_ll.c is not
> thread-safe yet.  Future patches will lift this restriction so that
> multiple FUSE requests can be processed in parallel.
> 
> The main new concept is struct FVRequest, which contains both
> VuVirtqElement and struct fuse_chan.  We now have fv_VuDev for a device,
> fv_QueueInfo for a virtqueue, and FVRequest for a request.  Some of
> fv_QueueInfo's fields are moved into FVRequest because they are
> per-request.  The name FVRequest conforms to QEMU coding style and I
> expect the struct fv_* types will be renamed in a future refactoring.
> 
> This patch series is not optimal.  fbuf reuse is dropped so each request
> does malloc(se->bufsize), but there is no clean and cheap way to keep
> this with a thread pool.  The vq_lock mutex is held for longer than
> necessary, especially during the eventfd_write() syscall.  Performance
> can be improved in the future.
> 
> prctl(2) had to be added to the seccomp whitelist because glib invokes
> it.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Thanks, applied; some comments below.

> +
> +        pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
> +        pthread_mutex_lock(&qi->vq_lock);
> +        vu_queue_push(dev, q, elem, tosend_len);
> +        vu_queue_notify(dev, q);
> +        pthread_mutex_unlock(&qi->vq_lock);
> +        pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);

It surprises me that these paired locks are so common.
>  
>  err:
>  
> @@ -249,9 +268,12 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
>                           struct iovec *iov, int count,
>                           struct fuse_bufvec *buf, size_t len)
>  {
> +        FVRequest *req = container_of(ch, FVRequest, ch);

I can't think of a better answer than the container_of but it does make
me a bit nervous; I guess we need it because 'ch' comes from the generic
fs code so can't be FVRequest*
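
The pattern works because ch is embedded by value in the request.  A
minimal stand-alone sketch (the structs here are simplified stand-ins,
not the real definitions):

```c
#include <stddef.h>
#include <assert.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for the generic FUSE channel struct */
struct fuse_chan {
	int refcount;
};

typedef struct {
	int index;		/* stand-in for the embedded VuVirtqElement */
	struct fuse_chan ch;	/* embedded, not pointed-to */
} FVRequest;

/* The generic FUSE code only ever sees &req->ch; because ch is embedded
 * in FVRequest, its address can be mapped back to the whole request. */
static FVRequest *request_from_chan(struct fuse_chan *ch)
{
	return container_of(ch, FVRequest, ch);
}
```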
> +        struct VuDev *dev = &qi->virtio_dev->dev;
> +        FVRequest *req = data;
> +        VuVirtqElement *elem = &req->elem;
> +        struct fuse_buf fbuf = {};
> +        bool allocated_bufv = false;
> +        struct fuse_bufvec bufv;
> +        struct fuse_bufvec *pbufv;
> +
> +        assert(se->bufsize > sizeof(struct fuse_in_header));
> +
> +        /* An element contains one request and the space to send our response
> +         * They're spread over multiple descriptors in a scatter/gather set
> +         * and we can't trust the guest to keep them still; so copy in/out.
> +         */
> +        fbuf.mem = malloc(se->bufsize);
> +        assert(fbuf.mem);

Now we're using thread pools etc, maybe it's time to switch to glib's
g_new/g_malloc etc that never return NULL?

> +        if (se->debug)
> +                fuse_debug("%s: elem %d: with %d out desc of length %zd"
> +                           " bad_in_num=%u bad_out_num=%u\n",
> +                           __func__, elem->index, out_num,
> +                           out_len, req->bad_in_num, req->bad_out_num);

Are the debug/logging calls thread safe?


> diff --git a/contrib/virtiofsd/seccomp.c b/contrib/virtiofsd/seccomp.c
> index cea4cc5f60..5f1c873b82 100644
> --- a/contrib/virtiofsd/seccomp.c
> +++ b/contrib/virtiofsd/seccomp.c
> @@ -58,6 +58,7 @@ static const int syscall_whitelist[] = {
>  	SCMP_SYS(open),
>  	SCMP_SYS(openat),
>  	SCMP_SYS(ppoll),
> +	SCMP_SYS(prctl), /* TODO restrict to just PR_SET_NAME? */

Would seem a good idea because prctl can do lots of crazy things.
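
A minimal sketch of the PR_SET_NAME usage in question (stand-alone; the
wrapper names are illustrative, and per the commit message this is the
prctl use glib's worker threads rely on):

```c
#include <sys/prctl.h>
#include <string.h>
#include <assert.h>

/* PR_SET_NAME labels the calling thread (up to 15 chars + NUL) */
static int set_thread_name(const char *name)
{
	return prctl(PR_SET_NAME, name, 0, 0, 0);
}

static void get_thread_name(char buf[16])
{
	buf[0] = '\0';
	prctl(PR_GET_NAME, buf, 0, 0, 0);
}
```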

Dave

>  	SCMP_SYS(preadv),
>  	SCMP_SYS(pwrite64),
>  	SCMP_SYS(read),
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races Stefan Hajnoczi
@ 2019-08-05 12:26   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-05 12:26 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Liu Bo, qemu-devel

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> When running with multiple threads it can be tricky to handle
> FUSE_INIT/FUSE_DESTROY in parallel with other request types or in
> parallel with themselves.  Serialize FUSE_INIT and FUSE_DESTROY so that
> malicious clients cannot trigger race conditions.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

(It took me a few minutes getting my head around the different _destroy
functions)

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  contrib/virtiofsd/fuse_i.h        |  1 +
>  contrib/virtiofsd/fuse_lowlevel.c | 17 +++++++++++++++++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/contrib/virtiofsd/fuse_i.h b/contrib/virtiofsd/fuse_i.h
> index ff6e1b75ba..dcde9feb97 100644
> --- a/contrib/virtiofsd/fuse_i.h
> +++ b/contrib/virtiofsd/fuse_i.h
> @@ -63,6 +63,7 @@ struct fuse_session {
>  	struct fuse_req list;
>  	struct fuse_req interrupts;
>  	pthread_mutex_t lock;
> +	pthread_rwlock_t init_rwlock;
>  	int got_destroy;
>  	int broken_splice_nonblock;
>  	uint64_t notify_ctr;
> diff --git a/contrib/virtiofsd/fuse_lowlevel.c b/contrib/virtiofsd/fuse_lowlevel.c
> index 6ebf237baa..483a1bc9be 100644
> --- a/contrib/virtiofsd/fuse_lowlevel.c
> +++ b/contrib/virtiofsd/fuse_lowlevel.c
> @@ -2473,6 +2473,18 @@ void fuse_session_process_buf_int(struct fuse_session *se,
>  	req->ctx.pid = in->pid;
>  	req->ch = ch ? fuse_chan_get(ch) : NULL;
>  
> +	/* INIT and DESTROY requests are serialized, all other request types
> +	 * run in parallel.  This prevents races between FUSE_INIT and ordinary
> +	 * requests, FUSE_INIT and FUSE_INIT, FUSE_INIT and FUSE_DESTROY, and
> +	 * FUSE_DESTROY and FUSE_DESTROY.
> +	 */
> +	if (in->opcode == FUSE_INIT || in->opcode == CUSE_INIT ||
> +	    in->opcode == FUSE_DESTROY) {
> +		pthread_rwlock_wrlock(&se->init_rwlock);
> +	} else {
> +		pthread_rwlock_rdlock(&se->init_rwlock);
> +	}
> +
>  	err = EIO;
>  	if (!se->got_init) {
>  		enum fuse_opcode expected;
> @@ -2524,10 +2536,13 @@ void fuse_session_process_buf_int(struct fuse_session *se,
>  		do_write_buf(req, in->nodeid, &iter, bufv);
>  	else
>  		fuse_ll_ops[in->opcode].func(req, in->nodeid, &iter);
> +
> +	pthread_rwlock_unlock(&se->init_rwlock);
>  	return;
>  
>  reply_err:
>  	fuse_reply_err(req, err);
> +	pthread_rwlock_unlock(&se->init_rwlock);
>  }
>  
>  #define LL_OPTION(n,o,v) \
> @@ -2569,6 +2584,7 @@ void fuse_session_destroy(struct fuse_session *se)
>  		if (se->op.destroy)
>  			se->op.destroy(se->userdata, se);
>  	}
> +	pthread_rwlock_destroy(&se->init_rwlock);
>  	pthread_mutex_destroy(&se->lock);
>  	free(se->cuse_data);
>  	if (se->fd != -1)
> @@ -2656,6 +2672,7 @@ struct fuse_session *fuse_session_new(struct fuse_args *args,
>  	list_init_req(&se->list);
>  	list_init_req(&se->interrupts);
>  	fuse_mutex_init(&se->lock);
> +	pthread_rwlock_init(&se->init_rwlock, NULL);
>  
>  	memcpy(&se->op, op, op_size);
>  	se->owner = getuid();
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks
  2019-08-01 16:54 ` [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks Stefan Hajnoczi
@ 2019-08-05 15:17   ` Dr. David Alan Gilbert
  2019-08-05 18:57     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-05 15:17 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Liu Bo, qemu-devel

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> Now that lo_destroy() is serialized we can call unref_inode() so that
> all inode resources are freed.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  contrib/virtiofsd/passthrough_ll.c | 43 ++++++++++++++----------------
>  1 file changed, 20 insertions(+), 23 deletions(-)
> 
> diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> index a81c01d0d1..02a5e97326 100644
> --- a/contrib/virtiofsd/passthrough_ll.c
> +++ b/contrib/virtiofsd/passthrough_ll.c
> @@ -1340,28 +1340,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
>  	}
>  }
>  
> -static int unref_all_inodes_cb(gpointer key, gpointer value,
> -			       gpointer user_data)
> -{
> -	struct lo_inode *inode  = value;
> -	struct lo_data *lo = user_data;
> -
> -	inode->nlookup = 0;
> -	lo_map_remove(&lo->ino_map, inode->fuse_ino);
> -	close(inode->fd);
> -	lo_inode_put(lo, &inode); /* Drop our refcount from lo_do_lookup() */
> -
> -	return TRUE;
> -}
> -
> -static void unref_all_inodes(struct lo_data *lo)
> -{
> -	pthread_mutex_lock(&lo->mutex);
> -	g_hash_table_foreach_remove(lo->inodes, unref_all_inodes_cb, lo);
> -	pthread_mutex_unlock(&lo->mutex);
> -
> -}
> -
>  static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
>  {
>  	struct lo_data *lo = lo_data(req);
> @@ -2462,6 +2440,18 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
>  	fuse_reply_err(req, ret);
>  }
>  
> +static int destroy_inode_cb(gpointer key, gpointer value, gpointer user_data)
> +{
> +        struct lo_inode *inode = value;
> +        struct lo_data *lo = user_data;
> +
> +        /* inode->nlookup is normally protected by lo->mutex but see the
> +         * comment in lo_destroy().
> +         */
> +        unref_inode(lo, inode, inode->nlookup);
> +        return TRUE;
> +}
> +
>  static void lo_destroy(void *userdata, struct fuse_session *se)
>  {
>  	struct lo_data *lo = (struct lo_data*) userdata;
> @@ -2475,7 +2465,14 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
>                          fuse_err("%s: unmap during destroy failed\n", __func__);
>                  }
>          }
> -	unref_all_inodes(lo);
> +
> +        /* Normally lo->mutex must be taken when traversing lo->inodes but
> +         * lo_destroy() is a serialized request so no races are possible here.
> +         *
> +         * In addition, we cannot acquire lo->mutex since destroy_inode_cb() takes it
> +         * too and this would result in a recursive lock.
> +         */
> +        g_hash_table_foreach_remove(lo->inodes, destroy_inode_cb, lo);
>  }
>  
>  static struct fuse_lowlevel_ops lo_oper = {
> -- 
> 2.21.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks
  2019-08-05 15:17   ` Dr. David Alan Gilbert
@ 2019-08-05 18:57     ` Dr. David Alan Gilbert
  2019-08-06 18:58       ` Dr. David Alan Gilbert
  2019-08-07  9:41       ` Stefan Hajnoczi
  0 siblings, 2 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-05 18:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Liu Bo, qemu-devel

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > Now that lo_destroy() is serialized we can call unref_inode() so that
> > all inode resources are freed.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> > ---
> >  contrib/virtiofsd/passthrough_ll.c | 43 ++++++++++++++----------------
> >  1 file changed, 20 insertions(+), 23 deletions(-)
> > 
> > diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> > index a81c01d0d1..02a5e97326 100644
> > --- a/contrib/virtiofsd/passthrough_ll.c
> > +++ b/contrib/virtiofsd/passthrough_ll.c
> > @@ -1340,28 +1340,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
> >  	}
> >  }
> >  
> > -static int unref_all_inodes_cb(gpointer key, gpointer value,
> > -			       gpointer user_data)
> > -{
> > -	struct lo_inode *inode  = value;
> > -	struct lo_data *lo = user_data;
> > -
> > -	inode->nlookup = 0;
> > -	lo_map_remove(&lo->ino_map, inode->fuse_ino);
> > -	close(inode->fd);
> > -	lo_inode_put(lo, &inode); /* Drop our refcount from lo_do_lookup() */
> > -
> > -	return TRUE;
> > -}
> > -
> > -static void unref_all_inodes(struct lo_data *lo)
> > -{
> > -	pthread_mutex_lock(&lo->mutex);
> > -	g_hash_table_foreach_remove(lo->inodes, unref_all_inodes_cb, lo);
> > -	pthread_mutex_unlock(&lo->mutex);
> > -
> > -}
> > -
> >  static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
> >  {
> >  	struct lo_data *lo = lo_data(req);
> > @@ -2462,6 +2440,18 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
> >  	fuse_reply_err(req, ret);
> >  }
> >  
> > +static int destroy_inode_cb(gpointer key, gpointer value, gpointer user_data)
> > +{
> > +        struct lo_inode *inode = value;
> > +        struct lo_data *lo = user_data;
> > +
> > +        /* inode->nlookup is normally protected by lo->mutex but see the
> > +         * comment in lo_destroy().
> > +         */
> > +        unref_inode(lo, inode, inode->nlookup);
> > +        return TRUE;
> > +}
> > +
> >  static void lo_destroy(void *userdata, struct fuse_session *se)
> >  {
> >  	struct lo_data *lo = (struct lo_data*) userdata;
> > @@ -2475,7 +2465,14 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
> >                          fuse_err("%s: unmap during destroy failed\n", __func__);
> >                  }
> >          }
> > -	unref_all_inodes(lo);
> > +
> > +        /* Normally lo->mutex must be taken when traversing lo->inodes but
> > +         * lo_destroy() is a serialized request so no races are possible here.
> > +         *
> > +         * In addition, we cannot acquire lo->mutex since destroy_inode_cb() takes it
> > +         * too and this would result in a recursive lock.
> > +         */
> > +        g_hash_table_foreach_remove(lo->inodes, destroy_inode_cb, lo);

I'm seeing a crash here if I ctrl-c the virtiofsd after it's got an
active mount:

(process:3219): GLib-CRITICAL **: 18:42:08.334: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed

(I only get this debug output if I allow both getpeername and ioctl in
the seccomp policy; I think glib is trying to reach syslog, which needs
getpeername, and I'm guessing it uses ioctl to do something funky with
the terminal.)

Dave

> >  }
> >  
> >  static struct fuse_lowlevel_ops lo_oper = {
> > -- 
> > 2.21.0
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks
  2019-08-05 18:57     ` Dr. David Alan Gilbert
@ 2019-08-06 18:58       ` Dr. David Alan Gilbert
  2019-08-07  9:41       ` Stefan Hajnoczi
  1 sibling, 0 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-06 18:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, Liu Bo, qemu-devel

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > Now that lo_destroy() is serialized we can call unref_inode() so that
> > > all inode resources are freed.
> > > 
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > > ---
> > >  contrib/virtiofsd/passthrough_ll.c | 43 ++++++++++++++----------------
> > >  1 file changed, 20 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> > > index a81c01d0d1..02a5e97326 100644
> > > --- a/contrib/virtiofsd/passthrough_ll.c
> > > +++ b/contrib/virtiofsd/passthrough_ll.c
> > > @@ -1340,28 +1340,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
> > >  	}
> > >  }
> > >  
> > > -static int unref_all_inodes_cb(gpointer key, gpointer value,
> > > -			       gpointer user_data)
> > > -{
> > > -	struct lo_inode *inode  = value;
> > > -	struct lo_data *lo = user_data;
> > > -
> > > -	inode->nlookup = 0;
> > > -	lo_map_remove(&lo->ino_map, inode->fuse_ino);
> > > -	close(inode->fd);
> > > -	lo_inode_put(lo, &inode); /* Drop our refcount from lo_do_lookup() */
> > > -
> > > -	return TRUE;
> > > -}
> > > -
> > > -static void unref_all_inodes(struct lo_data *lo)
> > > -{
> > > -	pthread_mutex_lock(&lo->mutex);
> > > -	g_hash_table_foreach_remove(lo->inodes, unref_all_inodes_cb, lo);
> > > -	pthread_mutex_unlock(&lo->mutex);
> > > -
> > > -}
> > > -
> > >  static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
> > >  {
> > >  	struct lo_data *lo = lo_data(req);
> > > @@ -2462,6 +2440,18 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
> > >  	fuse_reply_err(req, ret);
> > >  }
> > >  
> > > +static int destroy_inode_cb(gpointer key, gpointer value, gpointer user_data)
> > > +{
> > > +        struct lo_inode *inode = value;
> > > +        struct lo_data *lo = user_data;
> > > +
> > > +        /* inode->nlookup is normally protected by lo->mutex but see the
> > > +         * comment in lo_destroy().
> > > +         */
> > > +        unref_inode(lo, inode, inode->nlookup);
> > > +        return TRUE;
> > > +}
> > > +
> > >  static void lo_destroy(void *userdata, struct fuse_session *se)
> > >  {
> > >  	struct lo_data *lo = (struct lo_data*) userdata;
> > > @@ -2475,7 +2465,14 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
> > >                          fuse_err("%s: unmap during destroy failed\n", __func__);
> > >                  }
> > >          }
> > > -	unref_all_inodes(lo);
> > > +
> > > +        /* Normally lo->mutex must be taken when traversing lo->inodes but
> > > +         * lo_destroy() is a serialized request so no races are possible here.
> > > +         *
> > > +         * In addition, we cannot acquire lo->mutex since destroy_inode_cb() takes it
> > > +         * too and this would result in a recursive lock.
> > > +         */
> > > +        g_hash_table_foreach_remove(lo->inodes, destroy_inode_cb, lo);
> 
> I'm seeing a crash here if I ctrl-c the virtiofsd after it's got an
> active mount:
> 
> (process:3219): GLib-CRITICAL **: 18:42:08.334: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed
> 
> (I only get the debug if I give seccomp both getpeername and ioctl;
> I think glib is trying to get to syslog and wants getpeername
> and I'm guessing ioctl to do something funky with the terminal).

That's also the culprit for a crash on umount that only happens with
-o cache=auto; reverting this makes it go away.

Dave

> Dave
> 
> > >  }
> > >  
> > >  static struct fuse_lowlevel_ops lo_oper = {
> > > -- 
> > > 2.21.0
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool
  2019-08-05 12:02   ` Dr. David Alan Gilbert
@ 2019-08-07  9:35     ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-07  9:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, Liu Bo, qemu-devel

On Mon, Aug 05, 2019 at 01:02:31PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > +        if (se->debug)
> > +                fuse_debug("%s: elem %d: with %d out desc of length %zd"
> > +                           " bad_in_num=%u bad_out_num=%u\n",
> > +                           __func__, elem->index, out_num,
> > +                           out_len, req->bad_in_num, req->bad_out_num);
> 
> Are the debug/logging calls thread safe?

Yes, vsyslog(3) and vfprintf(3) are thread-safe.

Stefan



* Re: [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks
  2019-08-05 18:57     ` Dr. David Alan Gilbert
  2019-08-06 18:58       ` Dr. David Alan Gilbert
@ 2019-08-07  9:41       ` Stefan Hajnoczi
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-07  9:41 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, Liu Bo, qemu-devel

On Mon, Aug 05, 2019 at 07:57:51PM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > Now that lo_destroy() is serialized we can call unref_inode() so that
> > > all inode resources are freed.
> > > 
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > > ---
> > >  contrib/virtiofsd/passthrough_ll.c | 43 ++++++++++++++----------------
> > >  1 file changed, 20 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/contrib/virtiofsd/passthrough_ll.c b/contrib/virtiofsd/passthrough_ll.c
> > > index a81c01d0d1..02a5e97326 100644
> > > --- a/contrib/virtiofsd/passthrough_ll.c
> > > +++ b/contrib/virtiofsd/passthrough_ll.c
> > > @@ -1340,28 +1340,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
> > >  	}
> > >  }
> > >  
> > > -static int unref_all_inodes_cb(gpointer key, gpointer value,
> > > -			       gpointer user_data)
> > > -{
> > > -	struct lo_inode *inode  = value;
> > > -	struct lo_data *lo = user_data;
> > > -
> > > -	inode->nlookup = 0;
> > > -	lo_map_remove(&lo->ino_map, inode->fuse_ino);
> > > -	close(inode->fd);
> > > -	lo_inode_put(lo, &inode); /* Drop our refcount from lo_do_lookup() */
> > > -
> > > -	return TRUE;
> > > -}
> > > -
> > > -static void unref_all_inodes(struct lo_data *lo)
> > > -{
> > > -	pthread_mutex_lock(&lo->mutex);
> > > -	g_hash_table_foreach_remove(lo->inodes, unref_all_inodes_cb, lo);
> > > -	pthread_mutex_unlock(&lo->mutex);
> > > -
> > > -}
> > > -
> > >  static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
> > >  {
> > >  	struct lo_data *lo = lo_data(req);
> > > @@ -2462,6 +2440,18 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
> > >  	fuse_reply_err(req, ret);
> > >  }
> > >  
> > > +static int destroy_inode_cb(gpointer key, gpointer value, gpointer user_data)
> > > +{
> > > +        struct lo_inode *inode = value;
> > > +        struct lo_data *lo = user_data;
> > > +
> > > +        /* inode->nlookup is normally protected by lo->mutex but see the
> > > +         * comment in lo_destroy().
> > > +         */
> > > +        unref_inode(lo, inode, inode->nlookup);
> > > +        return TRUE;
> > > +}
> > > +
> > >  static void lo_destroy(void *userdata, struct fuse_session *se)
> > >  {
> > >  	struct lo_data *lo = (struct lo_data*) userdata;
> > > @@ -2475,7 +2465,14 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
> > >                          fuse_err("%s: unmap during destroy failed\n", __func__);
> > >                  }
> > >          }
> > > -	unref_all_inodes(lo);
> > > +
> > > +        /* Normally lo->mutex must be taken when traversing lo->inodes but
> > > +         * lo_destroy() is a serialized request so no races are possible here.
> > > +         *
> > > +         * In addition, we cannot acquire lo->mutex since destroy_inode_cb() takes it
> > > +         * too and this would result in a recursive lock.
> > > +         */
> > > +        g_hash_table_foreach_remove(lo->inodes, destroy_inode_cb, lo);
> 
> I'm seeing a crash here if I ctrl-c the virtiofsd after it's got an
> active mount:
> 
> (process:3219): GLib-CRITICAL **: 18:42:08.334: g_hash_table_foreach_remove_or_steal: assertion 'version == hash_table->version' failed

The hash table was modified by unref_inode() during iteration, so
g_hash_table_foreach_remove() hit GLib's concurrent-modification
assertion.

I'll come up with a different way of doing this.

Stefan



* Re: [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2019-08-05  2:52 ` [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3 piaojun
@ 2019-08-07 18:03 ` Stefan Hajnoczi
  2019-08-07 20:57   ` [Qemu-devel] [Virtio-fs] " Vivek Goyal
  2019-08-08  8:10   ` piaojun
  5 siblings, 2 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-07 18:03 UTC (permalink / raw)
  To: virtio-fs, qemu-devel; +Cc: Liu Bo, Dr. David Alan Gilbert


On Thu, Aug 01, 2019 at 05:54:05PM +0100, Stefan Hajnoczi wrote:
> Performance
> -----------
> Please try these patches out and share your results.

Here are the performance numbers:

  Threadpool | iodepth | iodepth
     size    |    1    |   64
  -----------+---------+--------
  None       |   4451  |  4876
  1          |   4360  |  4858
  64         |   4359  | 33,266

A graph is available here:
https://vmsplice.net/~stefan/virtiofsd-threadpool-performance.png

Summary:

 * iodepth=64 performance is increased by 6.8 times.
 * iodepth=1 performance degrades by 2%.
 * DAX is bottlenecked by QEMU's single-threaded
   VHOST_USER_SLAVE_FS_MAP/UNMAP handler.

Threadpool size "none" is virtiofsd commit 813a824b707 ("virtiofsd: use
fuse_lowlevel_is_virtio() in fuse_session_destroy()") without any of the
multithreading preparation patches.  I benchmarked this to check whether
the patches introduce a regression for iodepth=1.  They do, but it's
only around 2%.

I also ran with DAX but found there was not much difference between
iodepth=1 and iodepth=64.  This might be because the host mmap(2)
syscall becomes the bottleneck and a serialization point.  QEMU only
processes one VHOST_USER_SLAVE_FS_MAP/UNMAP at a time.  If we want to
accelerate DAX it may be necessary to parallelize mmap, assuming the
host kernel can do them in parallel on a single file.  This performance
optimization is future work and not directly related to this patch
series.

The following fio job was run with cache=none and no DAX:

  [global]
  runtime=60
  ramp_time=30
  filename=/var/tmp/fio.dat
  direct=1
  rw=randread
  bs=4k
  size=4G
  ioengine=libaio
  iodepth=1

  [read]

Guest configuration:
1 vCPU
4 GB RAM
Linux 5.1 (vivek-aug-06-2019)

Host configuration:
Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz (2 cores x 2 threads)
8 GB RAM
Linux 5.1.20-300.fc30.x86_64
XFS + dm-thin + dm-crypt
Toshiba THNSFJ256GDNU (256 GB SATA SSD)

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-07 18:03 ` [Qemu-devel] " Stefan Hajnoczi
@ 2019-08-07 20:57   ` Vivek Goyal
  2019-08-08  9:02     ` Stefan Hajnoczi
  2019-08-08  8:10   ` piaojun
  1 sibling, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2019-08-07 20:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel

On Wed, Aug 07, 2019 at 07:03:55PM +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 01, 2019 at 05:54:05PM +0100, Stefan Hajnoczi wrote:
> > Performance
> > -----------
> > Please try these patches out and share your results.
> 
> Here are the performance numbers:
> 
>   Threadpool | iodepth | iodepth
>      size    |    1    |   64
>   -----------+---------+--------
>   None       |   4451  |  4876
>   1          |   4360  |  4858
>   64         |   4359  | 33,266
> 
> A graph is available here:
> https://vmsplice.net/~stefan/virtiofsd-threadpool-performance.png
> 
> Summary:
> 
>  * iodepth=64 performance is increased by 6.8 times.
>  * iodepth=1 performance degrades by 2%.
>  * DAX is bottlenecked by QEMU's single-threaded
>    VHOST_USER_SLAVE_FS_MAP/UNMAP handler.
> 
> Threadpool size "none" is virtiofsd commit 813a824b707 ("virtiofsd: use
> fuse_lowlevel_is_virtio() in fuse_session_destroy()") without any of the
> multithreading preparation patches.  I benchmarked this to check whether
> the patches introduce a regression for iodepth=1.  They do, but it's
> only around 2%.
> 
> I also ran with DAX but found there was not much difference between
> iodepth=1 and iodepth=64.  This might be because the host mmap(2)
> syscall becomes the bottleneck and a serialization point.  QEMU only
> processes one VHOST_USER_SLAVE_FS_MAP/UNMAP at a time.  If we want to
> accelerate DAX it may be necessary to parallelize mmap, assuming the
> host kernel can do them in parallel on a single file.  This performance
> optimization is future work and not directly related to this patch
> series.

Good to see nice improvement with higher queue depth.

The kernel also serializes MAP/UNMAP on a single inode, so you will
need to run multiple jobs operating on different inodes to see parallel
MAP/UNMAP (at least from the kernel's point of view).
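A variant of the fio job from earlier in the thread that would exercise this: multiple jobs on different files, hence different inodes. This is a hedged sketch; the file names and job count are illustrative, not from the original benchmark.

```ini
[global]
runtime=60
ramp_time=30
direct=1
rw=randread
bs=4k
size=1G
ioengine=libaio
iodepth=64

# One file per job so each job hits a distinct inode and the kernel
# can process MAP/UNMAP for them in parallel.
[read0]
filename=/var/tmp/fio0.dat

[read1]
filename=/var/tmp/fio1.dat

[read2]
filename=/var/tmp/fio2.dat

[read3]
filename=/var/tmp/fio3.dat
```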

Thanks
Vivek



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-07 18:03 ` [Qemu-devel] " Stefan Hajnoczi
  2019-08-07 20:57   ` [Qemu-devel] [Virtio-fs] " Vivek Goyal
@ 2019-08-08  8:10   ` piaojun
  2019-08-08  9:53     ` Stefan Hajnoczi
  1 sibling, 1 reply; 30+ messages in thread
From: piaojun @ 2019-08-08  8:10 UTC (permalink / raw)
  To: Stefan Hajnoczi, virtio-fs, qemu-devel

Hi Stefan,

From my test, your multithreading patch set improves IOPS greatly, as
shown below:

Guest configuration:
8 vCPU
8GB RAM
Linux 5.1 (vivek-aug-06-2019)

Host configuration:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores x 4 threads)
32GB RAM
Linux 3.10.0
EXT4 + LVM + local HDD

---
Before:
# fio -direct=1 -time_based -iodepth=64 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjob=1 -runtime=30 -group_reporting -name=file -filename=/mnt/virtiofs/file
Jobs: 1 (f=1): [r(1)] [100.0% done] [1177KB/0KB/0KB /s] [294/0/0 iops] [eta 00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=6037: Thu Aug  8 23:18:59 2019
  read : io=35148KB, bw=1169.9KB/s, iops=292, runt= 30045msec

After:
Jobs: 1 (f=1): [r(1)] [100.0% done] [6246KB/0KB/0KB /s] [1561/0/0 iops] [eta 00m:00s]
file: (groupid=0, jobs=1): err= 0: pid=5850: Thu Aug  8 23:21:22 2019
  read : io=191216KB, bw=6370.7KB/s, iops=1592, runt= 30015msec
---

But there is no IOPS improvement when I change from HDD to ramdisk. I
guess this is because the ramdisk completes requests synchronously, so
there is no queue depth to benefit from.

Thanks,
Jun

On 2019/8/8 2:03, Stefan Hajnoczi wrote:
> On Thu, Aug 01, 2019 at 05:54:05PM +0100, Stefan Hajnoczi wrote:
>> Performance
>> -----------
>> Please try these patches out and share your results.
> 
> Here are the performance numbers:
> 
>   Threadpool | iodepth | iodepth
>      size    |    1    |   64
>   -----------+---------+--------
>   None       |   4451  |  4876
>   1          |   4360  |  4858
>   64         |   4359  | 33,266
> 
> A graph is available here:
> https://vmsplice.net/~stefan/virtiofsd-threadpool-performance.png
> 
> Summary:
> 
>  * iodepth=64 performance is increased by 6.8 times.
>  * iodepth=1 performance degrades by 2%.
>  * DAX is bottlenecked by QEMU's single-threaded
>    VHOST_USER_SLAVE_FS_MAP/UNMAP handler.
> 
> Threadpool size "none" is virtiofsd commit 813a824b707 ("virtiofsd: use
> fuse_lowlevel_is_virtio() in fuse_session_destroy()") without any of the
> multithreading preparation patches.  I benchmarked this to check whether
> the patches introduce a regression for iodepth=1.  They do, but it's
> only around 2%.
> 
> I also ran with DAX but found there was not much difference between
> iodepth=1 and iodepth=64.  This might be because the host mmap(2)
> syscall becomes the bottleneck and a serialization point.  QEMU only
> processes one VHOST_USER_SLAVE_FS_MAP/UNMAP at a time.  If we want to
> accelerate DAX it may be necessary to parallelize mmap, assuming the
> host kernel can do them in parallel on a single file.  This performance
> optimization is future work and not directly related to this patch
> series.
> 
> The following fio job was run with cache=none and no DAX:
> 
>   [global]
>   runtime=60
>   ramp_time=30
>   filename=/var/tmp/fio.dat
>   direct=1
>   rw=randread
>   bs=4k
>   size=4G
>   ioengine=libaio
>   iodepth=1
> 
>   [read]
> 
> Guest configuration:
> 1 vCPU
> 4 GB RAM
> Linux 5.1 (vivek-aug-06-2019)
> 
> Host configuration:
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz (2 cores x 2 threads)
> 8 GB RAM
> Linux 5.1.20-300.fc30.x86_64
> XFS + dm-thin + dm-crypt
> Toshiba THNSFJ256GDNU (256 GB SATA SSD)
> 
> Stefan
> 
> 
> 
> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs
> 



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-07 20:57   ` [Qemu-devel] [Virtio-fs] " Vivek Goyal
@ 2019-08-08  9:02     ` Stefan Hajnoczi
  2019-08-08  9:53       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-08  9:02 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, qemu-devel


On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> multiple jobs operating on different inodes to see parallel MAP/UNMAP
> (at least from the kernel's point of view).

Okay, there is still room to experiment with how MAP and UNMAP are
handled by virtiofsd and QEMU even if the host kernel ultimately becomes
the bottleneck.

One possible optimization is to eliminate REMOVEMAPPING requests when
the guest driver knows a SETUPMAPPING will follow immediately.  I see
the following request pattern in a fio randread iodepth=64 job:

  unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, moffset=859832320, flags=0)
     unique: 995348, success, outsize: 16
  unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
     unique: 995350, success, outsize: 16
  unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, moffset=861929472, flags=0)
     unique: 995352, success, outsize: 16
  unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
     unique: 995354, success, outsize: 16
  virtio_send_msg: elem 9: with 1 in desc of length 16
  unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
  lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, moffset=864026624, flags=0)
     unique: 995356, success, outsize: 16
  unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12

The REMOVEMAPPING requests are unnecessary since we can map over the top
of the old mapping instead of taking the extra step of removing it
first.

Some more questions to consider for DAX performance optimization:

1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
   to eliminate the bad address problem?
4. Can OPEN+MAP be fused into a single request for small files, avoiding
   the 2nd request?

I'm not going to tackle DAX optimization myself right now but wanted to
share these ideas.

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08  9:02     ` Stefan Hajnoczi
@ 2019-08-08  9:53       ` Dr. David Alan Gilbert
  2019-08-08 12:53         ` Vivek Goyal
  2019-08-09  8:21         ` Stefan Hajnoczi
  0 siblings, 2 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-08  9:53 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > (at least from the kernel's point of view).
> 
> Okay, there is still room to experiment with how MAP and UNMAP are
> handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> the bottleneck.
> 
> One possible optimization is to eliminate REMOVEMAPPING requests when
> the guest driver knows a SETUPMAPPING will follow immediately.  I see
> the following request pattern in a fio randread iodepth=64 job:
> 
>   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, moffset=859832320, flags=0)
>      unique: 995348, success, outsize: 16
>   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
>      unique: 995350, success, outsize: 16
>   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, moffset=861929472, flags=0)
>      unique: 995352, success, outsize: 16
>   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
>      unique: 995354, success, outsize: 16
>   virtio_send_msg: elem 9: with 1 in desc of length 16
>   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
>   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, moffset=864026624, flags=0)
>      unique: 995356, success, outsize: 16
>   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> 
> The REMOVEMAPPING requests are unnecessary since we can map over the top
> of the old mapping instead of taking the extra step of removing it
> first.

Yep, those should go - I think Vivek likes to keep them for testing
since they make things fail more completely if there's a screwup.

> Some more questions to consider for DAX performance optimization:
> 
> 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?

Probably for cases where the data is only accessed once, and you can't
preemptively map.
Another variant on (1) is whether we could do read/writes while the mmap
is happening to absorb the latency.

> 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?

I think there's two things to solve here that I don't currently know the
answer to:
  2a) We'd need to get the fd to QEMU for it to mmap;
      we might be able to cache the fd on the QEMU side for existing
      mappings, so that when a new mapping is requested for an existing
      file the fd would already be available.

  2b) Running a device with a mix of queues inside QEMU and on
      vhost-user; I don't think we have anything with that mix.
 
> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>    to eliminate the bad address problem?

Are you thinking of doing all read/writes that way, or just the corner
cases? It doesn't seem worth it for the corner cases unless you're
finding them cropping up in real workloads.

> 4. Can OPEN+MAP be fused into a single request for small files, avoiding
>    the 2nd request?

Sounds possible.

> I'm not going to tackle DAX optimization myself right now but wanted to
> share these ideas.

One idea I was thinking about that feels easier than (2) was to change
the vhost slave protocol to use split transactions; it wouldn't do
anything for latency, but it would allow some requests to proceed in
parallel if we can get the kernel to feed it.

Dave

> Stefan



> _______________________________________________
> Virtio-fs mailing list
> Virtio-fs@redhat.com
> https://www.redhat.com/mailman/listinfo/virtio-fs

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08  8:10   ` piaojun
@ 2019-08-08  9:53     ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-08  9:53 UTC (permalink / raw)
  To: piaojun; +Cc: virtio-fs, qemu-devel


On Thu, Aug 08, 2019 at 04:10:00PM +0800, piaojun wrote:
> From my test, your patch set of multithreading improves iops greatly as
> below:

Thank you for sharing your results!

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08  9:53       ` Dr. David Alan Gilbert
@ 2019-08-08 12:53         ` Vivek Goyal
  2019-08-09  8:23           ` Stefan Hajnoczi
  2019-08-10 21:35           ` Liu Bo
  2019-08-09  8:21         ` Stefan Hajnoczi
  1 sibling, 2 replies; 30+ messages in thread
From: Vivek Goyal @ 2019-08-08 12:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi

On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > (at least from the kernel's point of view).
> > 
> > Okay, there is still room to experiment with how MAP and UNMAP are
> > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > the bottleneck.
> > 
> > One possible optimization is to eliminate REMOVEMAPPING requests when
> > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > the following request pattern in a fio randread iodepth=64 job:
> > 
> >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, moffset=859832320, flags=0)
> >      unique: 995348, success, outsize: 16
> >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> >      unique: 995350, success, outsize: 16
> >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, moffset=861929472, flags=0)
> >      unique: 995352, success, outsize: 16
> >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> >      unique: 995354, success, outsize: 16
> >   virtio_send_msg: elem 9: with 1 in desc of length 16
> >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, moffset=864026624, flags=0)
> >      unique: 995356, success, outsize: 16
> >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > 
> > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > of the old mapping instead of taking the extra step of removing it
> > first.
> 
> Yep, those should go - I think Vivek likes to keep them for testing
> since they make things fail more completely if there's a screwup.

I like to keep them because otherwise they keep the resources busy
on the host. If the DAX range is going to be reused immediately, then
this optimization makes more sense. I will keep this in mind.

> 
> > Some more questions to consider for DAX performance optimization:
> > 
> > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> 
> Probably for cases where the data is only accessed once, and you can't
> preemptively map.
> Another variant on (1) is whether we could do read/writes while the mmap
> is happening to absorb the latency.

For small random I/O, DAX might not be very effective. The overhead of
setting up a mapping and tearing it down is significant.

Vivek



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08  9:53       ` Dr. David Alan Gilbert
  2019-08-08 12:53         ` Vivek Goyal
@ 2019-08-09  8:21         ` Stefan Hajnoczi
  2019-08-10 21:34           ` Liu Bo
  2019-08-11  2:26           ` piaojun
  1 sibling, 2 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-09  8:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Vivek Goyal

On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
> 
> I think there's two things to solve here that I don't currently know the
> answer to:
>   2a) We'd need to get the fd to qemu for the thing to mmap;
>       we might be able to cache the fd on the qemu side for existing
>       mappings, so when asking for a new mapping for an existing file then
>       it would already have the fd.
> 
>   2b) Running a device with a mix of queues inside QEMU and on
>       vhost-user; I don't think we have anything with that mix

vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
and the rx/tx queues by the vhost device.  This is in fact how vhost was
initially designed: the vhost device is not a full virtio device, only
the dataplane.

> > 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >    to eliminate the bad address problem?
> 
> Are you thinking of doing all read/writes that way, or just the corner
> cases? It doesn't seem worth it for the corner cases unless you're
> finding them cropping up in real work loads.

Send all READ/WRITE requests to QEMU instead of virtiofsd.

Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
MKDIR, etc).

> > I'm not going to tackle DAX optimization myself right now but wanted to
> > share these ideas.
> 
> One I was thinking about that feels easier than (2) was to change the
> vhost slave protocol to be split transaction; it wouldn't do anything
> for the latency but it would be able to do some in parallel if we can
> get the kernel to feed it.

There are two cases:
1. mmapping multiple inodes.  This should benefit from parallelism,
   although mmap is still expensive because it involves a TLB shootdown
   for all other threads running in this process.
2. mmapping the same inode.  Here the host kernel is likely to serialize
   mmaps even more, making it hard to gain performance.

It's probably worth writing a tiny benchmark first to evaluate the
potential gains.

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08 12:53         ` Vivek Goyal
@ 2019-08-09  8:23           ` Stefan Hajnoczi
  2019-08-10 21:35           ` Liu Bo
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-09  8:23 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, Dr. David Alan Gilbert, qemu-devel

On Thu, Aug 08, 2019 at 08:53:20AM -0400, Vivek Goyal wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > > (at least from the kernel's point of view).
> > > 
> > > Okay, there is still room to experiment with how MAP and UNMAP are
> > > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > > the bottleneck.
> > > 
> > > One possible optimization is to eliminate REMOVEMAPPING requests when
> > > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > > the following request pattern in a fio randread iodepth=64 job:
> > > 
> > >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, moffset=859832320, flags=0)
> > >      unique: 995348, success, outsize: 16
> > >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > >      unique: 995350, success, outsize: 16
> > >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, moffset=861929472, flags=0)
> > >      unique: 995352, success, outsize: 16
> > >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > >      unique: 995354, success, outsize: 16
> > >   virtio_send_msg: elem 9: with 1 in desc of length 16
> > >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, moffset=864026624, flags=0)
> > >      unique: 995356, success, outsize: 16
> > >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > > 
> > > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > > of the old mapping instead of taking the extra step of removing it
> > > first.
> > 
> > Yep, those should go - I think Vivek likes to keep them for testing
> > since they make things fail more completely if there's a screwup.
> 
> I like to keep them because otherwise they keep the resources busy
> on host. If DAX range is being used immediately, then this optimization
> makes more sense. I will keep this in mind.

Skipping all unmaps has drawbacks, as you've said.  I'm just thinking
about the case where a mapping is replaced with a new one.

> > 
> > > Some more questions to consider for DAX performance optimization:
> > > 
> > > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> > 
> > Probably for cases where the data is only accessed once, and you can't
> > preemptively map.
> > Another variant on (1) is whether we could do read/writes while the mmap
> > is happening to absorb the latency.
> 
> For small random I/O, dax might not be very effective. Overhead of
> setting up mapping and tearing it down is significant.

Plus there is still an EPT violation and the host page cache needs to be
filled if we haven't prefetched it.  So I imagine FUSE_READ/FUSE_WRITE
will be faster than DAX here.  DAX will be better for repeated,
long-lived accesses.

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-09  8:21         ` Stefan Hajnoczi
@ 2019-08-10 21:34           ` Liu Bo
  2019-08-11  2:26           ` piaojun
  1 sibling, 0 replies; 30+ messages in thread
From: Liu Bo @ 2019-08-10 21:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, Dr. David Alan Gilbert, Vivek Goyal, qemu-devel

On Fri, Aug 09, 2019 at 09:21:02AM +0100, Stefan Hajnoczi wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
> > 
> > I think there's two things to solve here that I don't currently know the
> > answer to:
> >   2a) We'd need to get the fd to qemu for the thing to mmap;
> >       we might be able to cache the fd on the qemu side for existing
> >       mappings, so when asking for a new mapping for an existing file then
> >       it would already have the fd.
> > 
> >   2b) Running a device with a mix of queues inside QEMU and on
> >       vhost-user; I don't think we have anything with that mix
> 
> vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
> and the rx/tx queues by the vhost device.  This is in fact how vhost was
> initially designed: the vhost device is not a full virtio device, only
> the dataplane.

> > > 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> > >    to eliminate the bad address problem?
> > 
> > Are you thinking of doing all read/writes that way, or just the corner
> > cases? It doesn't seem worth it for the corner cases unless you're
> > finding them cropping up in real work loads.
> 
> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> 
> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> MKDIR, etc).

For now QEMU is not aware of virtio-fs's fd info, but I think it's
doable. I like the idea.

thanks,
-liubo
> 
> > > I'm not going to tackle DAX optimization myself right now but wanted to
> > > share these ideas.
> > 
> > One I was thinking about that feels easier than (2) was to change the
> > vhost slave protocol to be split transaction; it wouldn't do anything
> > for the latency but it would be able to do some in parallel if we can
> > get the kernel to feed it.
> 
> There are two cases:
> 1. mmapping multiple inodes.  This should benefit from parallelism,
>    although mmap is still expensive because it involves a TLB shootdown
>    for all other threads running in this process.
> 2. mmapping the same inode.  Here the host kernel is likely to serialize
>    mmaps even more, making it hard to gain performance.
> 
> It's probably worth writing a tiny benchmark first to evaluate the
> potential gains.
> 
> Stefan






* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-08 12:53         ` Vivek Goyal
  2019-08-09  8:23           ` Stefan Hajnoczi
@ 2019-08-10 21:35           ` Liu Bo
  1 sibling, 0 replies; 30+ messages in thread
From: Liu Bo @ 2019-08-10 21:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, Dr. David Alan Gilbert, qemu-devel

On Thu, Aug 08, 2019 at 08:53:20AM -0400, Vivek Goyal wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> > > > Kernel also serializes MAP/UNMAP on one inode. So you will need to run
> > > > multiple jobs operating on different inodes to see parallel MAP/UNMAP
> > > > (at least from the kernel's point of view).
> > > 
> > > Okay, there is still room to experiment with how MAP and UNMAP are
> > > handled by virtiofsd and QEMU even if the host kernel ultimately becomes
> > > the bottleneck.
> > > 
> > > One possible optimization is to eliminate REMOVEMAPPING requests when
> > > the guest driver knows a SETUPMAPPING will follow immediately.  I see
> > > the following request pattern in a fio randread iodepth=64 job:
> > > 
> > >   unique: 995348, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=3860856832, len=2097152, moffset=859832320, flags=0)
> > >      unique: 995348, success, outsize: 16
> > >   unique: 995350, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > >      unique: 995350, success, outsize: 16
> > >   unique: 995352, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=16777216, len=2097152, moffset=861929472, flags=0)
> > >      unique: 995352, success, outsize: 16
> > >   unique: 995354, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > >      unique: 995354, success, outsize: 16
> > >   virtio_send_msg: elem 9: with 1 in desc of length 16
> > >   unique: 995356, opcode: SETUPMAPPING (48), nodeid: 135, insize: 80, pid: 1351
> > >   lo_setupmapping(ino=135, fi=0x(nil), foffset=383778816, len=2097152, moffset=864026624, flags=0)
> > >      unique: 995356, success, outsize: 16
> > >   unique: 995358, opcode: REMOVEMAPPING (49), nodeid: 135, insize: 60, pid: 12
> > > 
> > > The REMOVEMAPPING requests are unnecessary since we can map over the top
> > > of the old mapping instead of taking the extra step of removing it
> > > first.
> > 
> > Yep, those should go - I think Vivek likes to keep them for testing
> > since they make things fail more completely if there's a screwup.
> 
> I like to keep them because otherwise they keep the resources busy
> on host. If DAX range is being used immediately, then this optimization
> makes more sense. I will keep this in mind.
>

Other than resources not being released, do you think there would be
any stale data problem if we don't do REMOVEMAPPING at all, with
neither background reclaim nor inline reclaim?
(truncate/punch_hole/evict_inode would still need to remove mappings,
though)

thanks,
-liubo

> > 
> > > Some more questions to consider for DAX performance optimization:
> > > 
> > > 1. Is FUSE_READ/FUSE_WRITE more efficient than DAX for some I/O patterns?
> > 
> > Probably for cases where the data is only accessed once, and you can't
> > preemptively map.
> > Another variant on (1) is whether we could do read/writes while the mmap
> > is happening to absorb the latency.
> 
> For small random I/O, dax might not be very effective. Overhead of
> setting up mapping and tearing it down is significant.
> 
> Vivek
> 



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-09  8:21         ` Stefan Hajnoczi
  2019-08-10 21:34           ` Liu Bo
@ 2019-08-11  2:26           ` piaojun
  2019-08-12 10:05             ` Stefan Hajnoczi
  1 sibling, 1 reply; 30+ messages in thread
From: piaojun @ 2019-08-11  2:26 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert
  Cc: virtio-fs, qemu-devel, Vivek Goyal



On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
>> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
>>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
>>> 2. Can MAP/UNMAP be performed directly in QEMU via a separate virtqueue?
>>
>> I think there's two things to solve here that I don't currently know the
>> answer to:
>>   2a) We'd need to get the fd to qemu for the thing to mmap;
>>       we might be able to cache the fd on the qemu side for existing
>>       mappings, so when asking for a new mapping for an existing file then
>>       it would already have the fd.
>>
>>   2b) Running a device with a mix of queues inside QEMU and on
>>       vhost-user; I don't think we have anything with that mix
> 
> vhost-user-net works in the same way.  The ctrl queue is handled by QEMU
> and the rx/tx queues by the vhost device.  This is in fact how vhost was
> initially designed: the vhost device is not a full virtio device, only
> the dataplane.

Agreed.

> 
>>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>>>    to eliminate the bad address problem?
>>
>> Are you thinking of doing all read/writes that way, or just the corner
>> cases? It doesn't seem worth it for the corner cases unless you're
>> finding them cropping up in real work loads.
> 
> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> 
> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> MKDIR, etc).
> 

Sorry for not catching your point. I would prefer virtiofsd to handle
READ/WRITE requests and QEMU to handle the metadata requests, as
virtiofsd is good at processing dataplane work thanks to its thread
pool and (maybe in the future) CPU affinity. As you said, virtiofsd is
just acting as a vhost-user device, which should care less about
control requests.

If our concern is improving mmap/read/write performance, why not add a
delayed worker for unmap, which could reduce the number of unmap
operations? That way virtiofsd could still handle both data and
metadata requests.

Jun



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-11  2:26           ` piaojun
@ 2019-08-12 10:05             ` Stefan Hajnoczi
  2019-08-12 11:58               ` piaojun
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2019-08-12 10:05 UTC (permalink / raw)
  To: piaojun; +Cc: virtio-fs, Dr. David Alan Gilbert, Vivek Goyal, qemu-devel

On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> > On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> >> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> >>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> >>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >>>    to eliminate the bad address problem?
> >>
> >> Are you thinking of doing all read/writes that way, or just the corner
> >> cases? It doesn't seem worth it for the corner cases unless you're
> >> finding them cropping up in real work loads.
> > 
> > Send all READ/WRITE requests to QEMU instead of virtiofsd.
> > 
> > Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> > MKDIR, etc).
> > 
> 
> Sorry for not catching your point, and I like the virtiofsd to do
> READ/WRITE requests and qemu handle metadata requests, as virtiofsd is
> good at processing dataplane things due to thread-pool and CPU
> affinity(maybe in the future). As you said, virtiofsd is just acting as
> a vhost-user device which should care less about ctrl request.
> 
> If our concern is improving mmap/write/read performance, why not adding
> a delay worker for unmmap which could decrease the ummap times. Maybe
> virtiofsd could still handle both data and meta requests by this way.

Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
have access to guest RAM regions.  If a guest transfers other memory,
like an address in the DAX Window, to/from the vhost-user device then
virtqueue buffer address translation fails.

Dave added a code path that bounces such accesses through the QEMU
process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
be simpler, faster, and cleaner to do I/O in QEMU in the first place.

What I don't like about moving READ/WRITE into QEMU is that we need to
use even more virtqueues for multiqueue operation :).

Stefan



* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-12 10:05             ` Stefan Hajnoczi
@ 2019-08-12 11:58               ` piaojun
  2019-08-12 12:51                 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 30+ messages in thread
From: piaojun @ 2019-08-12 11:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, Dr. David Alan Gilbert, Vivek Goyal, qemu-devel



On 2019/8/12 18:05, Stefan Hajnoczi wrote:
> On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
>> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
>>> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
>>>> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
>>>>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
>>>>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
>>>>>    to eliminate the bad address problem?
>>>>
>>>> Are you thinking of doing all read/writes that way, or just the corner
>>>> cases? It doesn't seem worth it for the corner cases unless you're
>>>> finding them cropping up in real work loads.
>>>
>>> Send all READ/WRITE requests to QEMU instead of virtiofsd.
>>>
>>> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
>>> MKDIR, etc).
>>>
>>
>> Sorry for not catching your point, and I like the virtiofsd to do
>> READ/WRITE requests and qemu handle metadata requests, as virtiofsd is
>> good at processing dataplane things due to thread-pool and CPU
>> affinity(maybe in the future). As you said, virtiofsd is just acting as
>> a vhost-user device which should care less about ctrl request.
>>
>> If our concern is improving mmap/write/read performance, why not adding
>> a delay worker for unmmap which could decrease the ummap times. Maybe
>> virtiofsd could still handle both data and meta requests by this way.
> 
> Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
> have access to guest RAM regions.  If a guest transfers other memory,
> like an address in the DAX Window, to/from the vhost-user device then
> virtqueue buffer address translation fails.
> 
> Dave added a code path that bounces such accesses through the QEMU
> process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
> be simpler, faster, and cleaner to do I/O in QEMU in the first place.
> 
> What I don't like about moving READ/WRITE into QEMU is that we need to
> use even more virtqueues for multiqueue operation :).
> 
> Stefan

Thanks for your detailed explanation. If DAX is not good for small
files, shall we just let users choose the I/O path according to their
use cases?




* Re: [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3
  2019-08-12 11:58               ` piaojun
@ 2019-08-12 12:51                 ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2019-08-12 12:51 UTC (permalink / raw)
  To: piaojun; +Cc: virtio-fs, qemu-devel, Stefan Hajnoczi, Vivek Goyal

* piaojun (piaojun@huawei.com) wrote:
> 
> 
> On 2019/8/12 18:05, Stefan Hajnoczi wrote:
> > On Sun, Aug 11, 2019 at 10:26:18AM +0800, piaojun wrote:
> >> On 2019/8/9 16:21, Stefan Hajnoczi wrote:
> >>> On Thu, Aug 08, 2019 at 10:53:16AM +0100, Dr. David Alan Gilbert wrote:
> >>>> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> >>>>> On Wed, Aug 07, 2019 at 04:57:15PM -0400, Vivek Goyal wrote:
> >>>>> 3. Can READ/WRITE be performed directly in QEMU via a separate virtqueue
> >>>>>    to eliminate the bad address problem?
> >>>>
> >>>> Are you thinking of doing all read/writes that way, or just the corner
> >>>> cases? It doesn't seem worth it for the corner cases unless you're
> >>>> finding them cropping up in real work loads.
> >>>
> >>> Send all READ/WRITE requests to QEMU instead of virtiofsd.
> >>>
> >>> Only handle metadata requests in virtiofsd (OPEN, RELEASE, READDIR,
> >>> MKDIR, etc).
> >>>
> >>
> >> Sorry for not catching your point, and I like the virtiofsd to do
> >> READ/WRITE requests and qemu handle metadata requests, as virtiofsd is
> >> good at processing dataplane things due to thread-pool and CPU
> >> affinity(maybe in the future). As you said, virtiofsd is just acting as
> >> a vhost-user device which should care less about ctrl request.
> >>
> >> If our concern is improving mmap/write/read performance, why not adding
> >> a delay worker for unmmap which could decrease the ummap times. Maybe
> >> virtiofsd could still handle both data and meta requests by this way.
> > 
> > Doing READ/WRITE in QEMU solves the problem that vhost-user slaves only
> > have access to guest RAM regions.  If a guest transfers other memory,
> > like an address in the DAX Window, to/from the vhost-user device then
> > virtqueue buffer address translation fails.
> > 
> > Dave added a code path that bounces such accesses through the QEMU
> > process using the VHOST_USER_SLAVE_FS_IO slavefd request, but it would
> > be simpler, faster, and cleaner to do I/O in QEMU in the first place.
> > 
> > What I don't like about moving READ/WRITE into QEMU is that we need to
> > use even more virtqueues for multiqueue operation :).
> > 
> > Stefan
> 
> Thanks for your detailed explanation. If DAX is not good for small
> files, shall we just let users choose the I/O path according to their
> use cases?

The problem is how and when to decide, and where to keep policy like
that.  My understanding is that it's also tricky for the kernel to flip
any one file between DAX and non-DAX.

So without knowing the access patterns it's tricky.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



end of thread, other threads:[~2019-08-12 12:51 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-01 16:54 [Qemu-devel] [PATCH 0/4] virtiofsd: multithreading preparation part 3 Stefan Hajnoczi
2019-08-01 16:54 ` [Qemu-devel] [PATCH 1/4] virtiofsd: process requests in a thread pool Stefan Hajnoczi
2019-08-05 12:02   ` Dr. David Alan Gilbert
2019-08-07  9:35     ` Stefan Hajnoczi
2019-08-01 16:54 ` [Qemu-devel] [PATCH 2/4] virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races Stefan Hajnoczi
2019-08-05 12:26   ` Dr. David Alan Gilbert
2019-08-01 16:54 ` [Qemu-devel] [PATCH 3/4] virtiofsd: fix lo_destroy() resource leaks Stefan Hajnoczi
2019-08-05 15:17   ` Dr. David Alan Gilbert
2019-08-05 18:57     ` Dr. David Alan Gilbert
2019-08-06 18:58       ` Dr. David Alan Gilbert
2019-08-07  9:41       ` Stefan Hajnoczi
2019-08-01 16:54 ` [Qemu-devel] [PATCH 4/4] virtiofsd: add --thread-pool-size=NUM option Stefan Hajnoczi
2019-08-05  2:52 ` [Qemu-devel] [Virtio-fs] [PATCH 0/4] virtiofsd: multithreading preparation part 3 piaojun
2019-08-05  8:01   ` Stefan Hajnoczi
2019-08-05  9:40     ` piaojun
2019-08-07 18:03 ` [Qemu-devel] " Stefan Hajnoczi
2019-08-07 20:57   ` [Qemu-devel] [Virtio-fs] " Vivek Goyal
2019-08-08  9:02     ` Stefan Hajnoczi
2019-08-08  9:53       ` Dr. David Alan Gilbert
2019-08-08 12:53         ` Vivek Goyal
2019-08-09  8:23           ` Stefan Hajnoczi
2019-08-10 21:35           ` Liu Bo
2019-08-09  8:21         ` Stefan Hajnoczi
2019-08-10 21:34           ` Liu Bo
2019-08-11  2:26           ` piaojun
2019-08-12 10:05             ` Stefan Hajnoczi
2019-08-12 11:58               ` piaojun
2019-08-12 12:51                 ` Dr. David Alan Gilbert
2019-08-08  8:10   ` piaojun
2019-08-08  9:53     ` Stefan Hajnoczi
