All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 00/26] virtiofs dax patches
@ 2021-04-28 11:00 Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
                   ` (27 more replies)
  0 siblings, 28 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

  This series adds support for acceleration of virtiofs via DAX
mapping, using features added in the 5.11 Linux kernel.

  DAX originally existed in the kernel for mapping real storage
devices directly into memory, so that reads/writes turn into
reads/writes directly mapped into the storage device.

  virtiofs's DAX support is similar; a PCI BAR is exposed on the
virtiofs device corresponding to a DAX 'cache' of a user defined size.
The guest daemon then requests files to be mapped into that cache;
when that happens the virtiofsd sends filedescriptors and commands back
to the QEMU that mmap's those files directly into the memory slot
exposed to kvm.  The guest can then directly read/write to the files
exposed by virtiofs by reading/writing into the BAR.

  A typical invocation would be:
     -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,cache-size=4G

and then the guest must mount with -o dax

  Note that the cache doesn't really take VM up on the host, because
everything placed there is just an mmap of a file, so you can afford
to use quite a large cache size.

  Unlike a real DAX device, the cache is a finite size that's
potentially smaller than the underlying filesystem (especially when
mapping granuality is taken into account).  Mapping, unmapping and
remapping must take place to juggle files into the cache if it's too
small.  Some workloads benefit more than others.

Gotchas:
  a) If something else on the host truncates an mmap'd file,
kvm gets rather upset;  for this reason it's advised that DAX is
currently only suitable for use on non-shared filesystems.

(This series, with a couple of other patches, is at:
https://gitlab.com/virtio-fs/qemu/-/tree/dgilbert-dax-2021-04-28 )

Dave

v3
  Review cleanups
  Fix some printf formating issues

Dr. David Alan Gilbert (21):
  virtiofs: Fixup printf args
  virtiofsd: Don't assume header layout
  DAX: vhost-user: Rework slave return values
  DAX: libvhost-user: Route slave message payload
  DAX: libvhost-user: Allow popping a queue element with bad pointers
  DAX subprojects/libvhost-user: Add virtio-fs slave types
  DAX: virtio: Add shared memory capability
  DAX: virtio-fs: Add cache BAR
  DAX: virtio-fs: Add vhost-user slave commands for mapping
  DAX: virtio-fs: Fill in slave commands for mapping
  DAX: virtiofsd Add cache accessor functions
  DAX: virtiofsd: Add setup/remove mappings fuse commands
  DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll
  DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping
  DAX: virtiofsd: route se down to destroy method
  DAX: virtiofsd: Perform an unmap on destroy
  DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO
  DAX/unmap virtiofsd: Parse unmappable elements
  DAX/unmap virtiofsd: Route unmappable reads
  DAX/unmap virtiofsd: route unmappable write to slave command

Stefan Hajnoczi (1):
  DAX:virtiofsd: implement FUSE_INIT map_alignment field

Vivek Goyal (4):
  DAX: virtiofsd: Make lo_removemapping() work
  vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info
  vhost-user-fs: Implement drop CAP_FSETID functionality
  virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it

 block/export/vhost-user-blk-server.c      |   2 +-
 contrib/vhost-user-blk/vhost-user-blk.c   |   3 +-
 contrib/vhost-user-gpu/vhost-user-gpu.c   |   5 +-
 contrib/vhost-user-input/main.c           |   4 +-
 contrib/vhost-user-scsi/vhost-user-scsi.c |   2 +-
 docs/interop/vhost-user.rst               |  37 ++
 hw/virtio/meson.build                     |   1 +
 hw/virtio/trace-events                    |   6 +
 hw/virtio/vhost-backend.c                 |   6 +-
 hw/virtio/vhost-user-fs-pci.c             |  32 ++
 hw/virtio/vhost-user-fs.c                 | 395 ++++++++++++++++++++++
 hw/virtio/vhost-user.c                    |  62 +++-
 hw/virtio/virtio-pci.c                    |  20 ++
 hw/virtio/virtio-pci.h                    |   4 +
 include/hw/virtio/vhost-backend.h         |   2 +-
 include/hw/virtio/vhost-user-fs.h         |  43 +++
 meson.build                               |   6 +
 subprojects/libvhost-user/libvhost-user.c | 113 ++++++-
 subprojects/libvhost-user/libvhost-user.h |  57 +++-
 tests/vhost-user-bridge.c                 |   4 +-
 tools/virtiofsd/buffer.c                  |  22 +-
 tools/virtiofsd/fuse_common.h             |  17 +-
 tools/virtiofsd/fuse_lowlevel.c           |  92 ++++-
 tools/virtiofsd/fuse_lowlevel.h           |  78 ++++-
 tools/virtiofsd/fuse_virtio.c             | 372 ++++++++++++++++----
 tools/virtiofsd/passthrough_ll.c          | 138 +++++++-
 26 files changed, 1393 insertions(+), 130 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 01/26] virtiofs: Fixup printf args
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-04 14:54   ` Stefan Hajnoczi
  2021-05-06 15:56   ` Dr. David Alan Gilbert
  2021-04-28 11:00 ` [PATCH v3 02/26] virtiofsd: Don't assume header layout Dr. David Alan Gilbert (git)
                   ` (26 subsequent siblings)
  27 siblings, 2 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Fixup some fuse_log printf args for 32bit compatibility.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 1553d2ef45..110f85a701 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2011,10 +2011,10 @@ static void lo_getlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
 
     fuse_log(FUSE_LOG_DEBUG,
              "lo_getlk(ino=%" PRIu64 ", flags=%d)"
-             " owner=0x%lx, l_type=%d l_start=0x%lx"
-             " l_len=0x%lx\n",
-             ino, fi->flags, fi->lock_owner, lock->l_type, lock->l_start,
-             lock->l_len);
+             " owner=0x%" PRIx64 ", l_type=%d l_start=0x%" PRIx64
+             " l_len=0x%" PRIx64 "\n",
+             ino, fi->flags, fi->lock_owner, lock->l_type,
+             (uint64_t)lock->l_start, (uint64_t)lock->l_len);
 
     if (!lo->posix_lock) {
         fuse_reply_err(req, ENOSYS);
@@ -2061,10 +2061,10 @@ static void lo_setlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
 
     fuse_log(FUSE_LOG_DEBUG,
              "lo_setlk(ino=%" PRIu64 ", flags=%d)"
-             " cmd=%d pid=%d owner=0x%lx sleep=%d l_whence=%d"
-             " l_start=0x%lx l_len=0x%lx\n",
+             " cmd=%d pid=%d owner=0x%" PRIx64 " sleep=%d l_whence=%d"
+             " l_start=0x%" PRIx64 " l_len=0x%" PRIx64 "\n",
              ino, fi->flags, lock->l_type, lock->l_pid, fi->lock_owner, sleep,
-             lock->l_whence, lock->l_start, lock->l_len);
+             lock->l_whence, (uint64_t)lock->l_start, (uint64_t)lock->l_len);
 
     if (!lo->posix_lock) {
         fuse_reply_err(req, ENOSYS);
@@ -3097,9 +3097,10 @@ static void lo_copy_file_range(fuse_req_t req, fuse_ino_t ino_in, off_t off_in,
 
     fuse_log(FUSE_LOG_DEBUG,
              "lo_copy_file_range(ino=%" PRIu64 "/fd=%d, "
-             "off=%lu, ino=%" PRIu64 "/fd=%d, "
-             "off=%lu, size=%zd, flags=0x%x)\n",
-             ino_in, in_fd, off_in, ino_out, out_fd, off_out, len, flags);
+             "off=%ju, ino=%" PRIu64 "/fd=%d, "
+             "off=%ju, size=%zd, flags=0x%x)\n",
+             ino_in, in_fd, (intmax_t)off_in,
+             ino_out, out_fd, (intmax_t)off_out, len, flags);
 
     res = copy_file_range(in_fd, &off_in, out_fd, &off_out, len, flags);
     if (res < 0) {
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 02/26] virtiofsd: Don't assume header layout
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-04 15:12   ` Stefan Hajnoczi
  2021-05-06 15:56   ` Dr. David Alan Gilbert
  2021-04-28 11:00 ` [PATCH v3 03/26] DAX: vhost-user: Rework slave return values Dr. David Alan Gilbert (git)
                   ` (25 subsequent siblings)
  27 siblings, 2 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

virtiofsd incorrectly assumed a fixed set of header layout in the virt
queue; assuming that the fuse and write headers were conveniently
separated from the data;  the spec doesn't allow us to take that
convenience, so fix it up to deal with it the hard way.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 94 +++++++++++++++++++++++++++--------
 1 file changed, 73 insertions(+), 21 deletions(-)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 3e13997406..6dd73c9b72 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -129,18 +129,55 @@ static void fv_panic(VuDev *dev, const char *err)
  * Copy from an iovec into a fuse_buf (memory only)
  * Caller must ensure there is space
  */
-static void copy_from_iov(struct fuse_buf *buf, size_t out_num,
-                          const struct iovec *out_sg)
+static size_t copy_from_iov(struct fuse_buf *buf, size_t out_num,
+                            const struct iovec *out_sg,
+                            size_t max)
 {
     void *dest = buf->mem;
+    size_t copied = 0;
 
-    while (out_num) {
+    while (out_num && max) {
         size_t onelen = out_sg->iov_len;
+        onelen = MIN(onelen, max);
         memcpy(dest, out_sg->iov_base, onelen);
         dest += onelen;
+        copied += onelen;
         out_sg++;
         out_num--;
+        max -= onelen;
     }
+
+    return copied;
+}
+
+/*
+ * Skip 'skip' bytes in the iov; 'sg_1stindex' is set as
+ * the index for the 1st iovec to read data from, and
+ * 'sg_1stskip' is the number of bytes to skip in that entry.
+ *
+ * Returns True if there are at least 'skip' bytes in the iovec
+ *
+ */
+static bool skip_iov(const struct iovec *sg, size_t sg_size,
+                     size_t skip,
+                     size_t *sg_1stindex, size_t *sg_1stskip)
+{
+    size_t vec;
+
+    for (vec = 0; vec < sg_size; vec++) {
+        if (sg[vec].iov_len > skip) {
+            *sg_1stskip = skip;
+            *sg_1stindex = vec;
+
+            return true;
+        }
+
+        skip -= sg[vec].iov_len;
+    }
+
+    *sg_1stindex = vec;
+    *sg_1stskip = 0;
+    return skip == 0;
 }
 
 /*
@@ -457,6 +494,7 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
     bool allocated_bufv = false;
     struct fuse_bufvec bufv;
     struct fuse_bufvec *pbufv;
+    struct fuse_in_header inh;
 
     assert(se->bufsize > sizeof(struct fuse_in_header));
 
@@ -505,14 +543,15 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
                  elem->index);
         assert(0); /* TODO */
     }
-    /* Copy just the first element and look at it */
-    copy_from_iov(&fbuf, 1, out_sg);
+    /* Copy just the fuse_in_header and look at it */
+    copy_from_iov(&fbuf, out_num, out_sg,
+                  sizeof(struct fuse_in_header));
+    memcpy(&inh, fbuf.mem, sizeof(struct fuse_in_header));
 
     pbufv = NULL; /* Compiler thinks an unitialised path */
-    if (out_num > 2 &&
-        out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
-        ((struct fuse_in_header *)fbuf.mem)->opcode == FUSE_WRITE &&
-        out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
+    if (inh.opcode == FUSE_WRITE &&
+        out_len >= (sizeof(struct fuse_in_header) +
+                    sizeof(struct fuse_write_in))) {
         /*
          * For a write we don't actually need to copy the
          * data, we can just do it straight out of guest memory
@@ -521,15 +560,15 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
          */
         fuse_log(FUSE_LOG_DEBUG, "%s: Write special case\n", __func__);
 
-        /* copy the fuse_write_in header afte rthe fuse_in_header */
-        fbuf.mem += out_sg->iov_len;
-        copy_from_iov(&fbuf, 1, out_sg + 1);
-        fbuf.mem -= out_sg->iov_len;
-        fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
+        fbuf.size = copy_from_iov(&fbuf, out_num, out_sg,
+                                  sizeof(struct fuse_in_header) +
+                                  sizeof(struct fuse_write_in));
+        /* That copy reread the in_header, make sure we use the original */
+        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
 
         /* Allocate the bufv, with space for the rest of the iov */
         pbufv = malloc(sizeof(struct fuse_bufvec) +
-                       sizeof(struct fuse_buf) * (out_num - 2));
+                       sizeof(struct fuse_buf) * out_num);
         if (!pbufv) {
             fuse_log(FUSE_LOG_ERR, "%s: pbufv malloc failed\n",
                     __func__);
@@ -540,24 +579,37 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
         pbufv->count = 1;
         pbufv->buf[0] = fbuf;
 
-        size_t iovindex, pbufvindex;
-        iovindex = 2; /* 2 headers, separate iovs */
+        size_t iovindex, pbufvindex, iov_bytes_skip;
         pbufvindex = 1; /* 2 headers, 1 fusebuf */
 
+        if (!skip_iov(out_sg, out_num,
+                      sizeof(struct fuse_in_header) +
+                      sizeof(struct fuse_write_in),
+                      &iovindex, &iov_bytes_skip)) {
+            fuse_log(FUSE_LOG_ERR, "%s: skip failed\n",
+                    __func__);
+            goto out;
+        }
+
         for (; iovindex < out_num; iovindex++, pbufvindex++) {
             pbufv->count++;
             pbufv->buf[pbufvindex].pos = ~0; /* Dummy */
             pbufv->buf[pbufvindex].flags = 0;
             pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
             pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
+
+            if (iov_bytes_skip) {
+                pbufv->buf[pbufvindex].mem += iov_bytes_skip;
+                pbufv->buf[pbufvindex].size -= iov_bytes_skip;
+                iov_bytes_skip = 0;
+            }
         }
     } else {
         /* Normal (non fast write) path */
 
-        /* Copy the rest of the buffer */
-        fbuf.mem += out_sg->iov_len;
-        copy_from_iov(&fbuf, out_num - 1, out_sg + 1);
-        fbuf.mem -= out_sg->iov_len;
+        copy_from_iov(&fbuf, out_num, out_sg, se->bufsize);
+        /* That copy reread the in_header, make sure we use the original */
+        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
         fbuf.size = out_len;
 
         /* TODO! Endianness of header */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 03/26] DAX: vhost-user: Rework slave return values
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 02/26] virtiofsd: Don't assume header layout Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-04 15:23   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload Dr. David Alan Gilbert (git)
                   ` (24 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

All the current slave handlers on the qemu side generate an 'int'
return value that's squashed down to a bool (!!ret) and stuffed into
a uint64_t (field of a union) to be returned.

Move the uint64_t type back up through the individual handlers so
that we can make one actually return a full uint64_t.

Note that the definition in the interop spec says most of these
cases are defined as returning 0 on success and non-0 for failure,
so it's OK to change from a bool to another non-0.

Vivek:
This is needed because upcoming patches in series will add new functions
which want to return full error code. Existing functions continue to
return true/false so, it should not lead to change of behavior for
existing users.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/virtio/vhost-backend.c         |  6 +++---
 hw/virtio/vhost-user.c            | 31 ++++++++++++++++---------------
 include/hw/virtio/vhost-backend.h |  2 +-
 3 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
index 31b33bde37..1686c94767 100644
--- a/hw/virtio/vhost-backend.c
+++ b/hw/virtio/vhost-backend.c
@@ -401,8 +401,8 @@ int vhost_backend_invalidate_device_iotlb(struct vhost_dev *dev,
     return -ENODEV;
 }
 
-int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
-                                          struct vhost_iotlb_msg *imsg)
+uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
+                                        struct vhost_iotlb_msg *imsg)
 {
     int ret = 0;
 
@@ -429,5 +429,5 @@ int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
         break;
     }
 
-    return ret;
+    return !!ret;
 }
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index ded0c10453..4a7d2786c6 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -1405,24 +1405,25 @@ static int vhost_user_reset_device(struct vhost_dev *dev)
     return 0;
 }
 
-static int vhost_user_slave_handle_config_change(struct vhost_dev *dev)
+static uint64_t vhost_user_slave_handle_config_change(struct vhost_dev *dev)
 {
     int ret = -1;
 
     if (!dev->config_ops) {
-        return -1;
+        return true;
     }
 
     if (dev->config_ops->vhost_dev_config_notifier) {
         ret = dev->config_ops->vhost_dev_config_notifier(dev);
     }
 
-    return ret;
+    return !!ret;
 }
 
-static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
-                                                       VhostUserVringArea *area,
-                                                       int fd)
+static uint64_t vhost_user_slave_handle_vring_host_notifier(
+                    struct vhost_dev *dev,
+                    VhostUserVringArea *area,
+                    int fd)
 {
     int queue_idx = area->u64 & VHOST_USER_VRING_IDX_MASK;
     size_t page_size = qemu_real_host_page_size;
@@ -1436,7 +1437,7 @@ static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
     if (!virtio_has_feature(dev->protocol_features,
                             VHOST_USER_PROTOCOL_F_HOST_NOTIFIER) ||
         vdev == NULL || queue_idx >= virtio_get_num_queues(vdev)) {
-        return -1;
+        return true;
     }
 
     n = &user->notifier[queue_idx];
@@ -1449,18 +1450,18 @@ static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
     }
 
     if (area->u64 & VHOST_USER_VRING_NOFD_MASK) {
-        return 0;
+        return false;
     }
 
     /* Sanity check. */
     if (area->size != page_size) {
-        return -1;
+        return true;
     }
 
     addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                 fd, area->offset);
     if (addr == MAP_FAILED) {
-        return -1;
+        return true;
     }
 
     name = g_strdup_printf("vhost-user/host-notifier@%p mmaps[%d]",
@@ -1471,13 +1472,13 @@ static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
 
     if (virtio_queue_set_host_notifier_mr(vdev, queue_idx, &n->mr, true)) {
         munmap(addr, page_size);
-        return -1;
+        return true;
     }
 
     n->addr = addr;
     n->set = true;
 
-    return 0;
+    return false;
 }
 
 static void close_slave_channel(struct vhost_user *u)
@@ -1498,7 +1499,7 @@ static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
     VhostUserPayload payload = { 0, };
     Error *local_err = NULL;
     gboolean rc = G_SOURCE_CONTINUE;
-    int ret = 0;
+    uint64_t ret = 0;
     struct iovec iov;
     g_autofree int *fd = NULL;
     size_t fdsize = 0;
@@ -1539,7 +1540,7 @@ static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
         break;
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
-        ret = -EINVAL;
+        ret = true;
     }
 
     /*
@@ -1553,7 +1554,7 @@ static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
         hdr.flags &= ~VHOST_USER_NEED_REPLY_MASK;
         hdr.flags |= VHOST_USER_REPLY_MASK;
 
-        payload.u64 = !!ret;
+        payload.u64 = ret;
         hdr.size = sizeof(payload.u64);
 
         iovec[0].iov_base = &hdr;
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 8a6f8e2a7a..64ac6b6444 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -186,7 +186,7 @@ int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
 int vhost_backend_invalidate_device_iotlb(struct vhost_dev *dev,
                                                  uint64_t iova, uint64_t len);
 
-int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
+uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
                                           struct vhost_iotlb_msg *imsg);
 
 int vhost_user_gpu_set_socket(struct vhost_dev *dev, int fd);
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (2 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 03/26] DAX: vhost-user: Rework slave return values Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-04 15:26   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 05/26] DAX: libvhost-user: Allow popping a queue element with bad pointers Dr. David Alan Gilbert (git)
                   ` (23 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Route the uint64 payload from message replies on the slave back up
through vu_process_message_reply and to the callers.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 subprojects/libvhost-user/libvhost-user.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index fab7ca17ee..937f64480d 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -403,9 +403,11 @@ vu_send_reply(VuDev *dev, int conn_fd, VhostUserMsg *vmsg)
  * Processes a reply on the slave channel.
  * Entered with slave_mutex held and releases it before exit.
  * Returns true on success.
+ * *payload is written on success
  */
 static bool
-vu_process_message_reply(VuDev *dev, const VhostUserMsg *vmsg)
+vu_process_message_reply(VuDev *dev, const VhostUserMsg *vmsg,
+                         uint64_t *payload)
 {
     VhostUserMsg msg_reply;
     bool result = false;
@@ -425,7 +427,8 @@ vu_process_message_reply(VuDev *dev, const VhostUserMsg *vmsg)
         goto out;
     }
 
-    result = msg_reply.payload.u64 == 0;
+    *payload = msg_reply.payload.u64;
+    result = true;
 
 out:
     pthread_mutex_unlock(&dev->slave_mutex);
@@ -1312,6 +1315,8 @@ bool vu_set_queue_host_notifier(VuDev *dev, VuVirtq *vq, int fd,
 {
     int qidx = vq - dev->vq;
     int fd_num = 0;
+    bool res;
+    uint64_t payload = 0;
     VhostUserMsg vmsg = {
         .request = VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG,
         .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
@@ -1342,7 +1347,10 @@ bool vu_set_queue_host_notifier(VuDev *dev, VuVirtq *vq, int fd,
     }
 
     /* Also unlocks the slave_mutex */
-    return vu_process_message_reply(dev, &vmsg);
+    res = vu_process_message_reply(dev, &vmsg, &payload);
+    res = res && (payload == 0);
+
+    return res;
 }
 
 static bool
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 05/26] DAX: libvhost-user: Allow popping a queue element with bad pointers
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (3 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types Dr. David Alan Gilbert (git)
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Allow a daemon implemented with libvhost-user to accept an
element with pointers to memory that aren't in the mapping table.
The daemon might have some special way to deal with some special
cases of this.

The default behaviour doesn't change.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/export/vhost-user-blk-server.c      |  2 +-
 contrib/vhost-user-blk/vhost-user-blk.c   |  3 +-
 contrib/vhost-user-gpu/vhost-user-gpu.c   |  5 ++-
 contrib/vhost-user-input/main.c           |  4 +-
 contrib/vhost-user-scsi/vhost-user-scsi.c |  2 +-
 subprojects/libvhost-user/libvhost-user.c | 51 ++++++++++++++++++-----
 subprojects/libvhost-user/libvhost-user.h |  8 +++-
 tests/vhost-user-bridge.c                 |  4 +-
 tools/virtiofsd/fuse_virtio.c             |  3 +-
 9 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/block/export/vhost-user-blk-server.c b/block/export/vhost-user-blk-server.c
index fa06996d37..84c6432325 100644
--- a/block/export/vhost-user-blk-server.c
+++ b/block/export/vhost-user-blk-server.c
@@ -293,7 +293,7 @@ static void vu_blk_process_vq(VuDev *vu_dev, int idx)
     while (1) {
         VuBlkReq *req;
 
-        req = vu_queue_pop(vu_dev, vq, sizeof(VuBlkReq));
+        req = vu_queue_pop(vu_dev, vq, sizeof(VuBlkReq), NULL, NULL);
         if (!req) {
             break;
         }
diff --git a/contrib/vhost-user-blk/vhost-user-blk.c b/contrib/vhost-user-blk/vhost-user-blk.c
index d14b2896bf..01193552e9 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -235,7 +235,8 @@ static int vub_virtio_process_req(VubDev *vdev_blk,
     unsigned out_num;
     VubReq *req;
 
-    elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) + sizeof(VubReq));
+    elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement) + sizeof(VubReq),
+                        NULL, NULL);
     if (!elem) {
         return -1;
     }
diff --git a/contrib/vhost-user-gpu/vhost-user-gpu.c b/contrib/vhost-user-gpu/vhost-user-gpu.c
index f73f292c9f..827d15af00 100644
--- a/contrib/vhost-user-gpu/vhost-user-gpu.c
+++ b/contrib/vhost-user-gpu/vhost-user-gpu.c
@@ -840,7 +840,8 @@ vg_handle_ctrl(VuDev *dev, int qidx)
             return;
         }
 
-        cmd = vu_queue_pop(dev, vq, sizeof(struct virtio_gpu_ctrl_command));
+        cmd = vu_queue_pop(dev, vq, sizeof(struct virtio_gpu_ctrl_command),
+                           NULL, NULL);
         if (!cmd) {
             break;
         }
@@ -949,7 +950,7 @@ vg_handle_cursor(VuDev *dev, int qidx)
     struct virtio_gpu_update_cursor cursor;
 
     for (;;) {
-        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             break;
         }
diff --git a/contrib/vhost-user-input/main.c b/contrib/vhost-user-input/main.c
index c15d18c33f..d5c435605c 100644
--- a/contrib/vhost-user-input/main.c
+++ b/contrib/vhost-user-input/main.c
@@ -57,7 +57,7 @@ static void vi_input_send(VuInput *vi, struct virtio_input_event *event)
 
     /* ... then check available space ... */
     for (i = 0; i < vi->qindex; i++) {
-        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             while (--i >= 0) {
                 vu_queue_unpop(dev, vq, vi->queue[i].elem, 0);
@@ -141,7 +141,7 @@ static void vi_handle_sts(VuDev *dev, int qidx)
     g_debug("%s", G_STRFUNC);
 
     for (;;) {
-        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             break;
         }
diff --git a/contrib/vhost-user-scsi/vhost-user-scsi.c b/contrib/vhost-user-scsi/vhost-user-scsi.c
index 4f6e3e2a24..7564d6ab2d 100644
--- a/contrib/vhost-user-scsi/vhost-user-scsi.c
+++ b/contrib/vhost-user-scsi/vhost-user-scsi.c
@@ -252,7 +252,7 @@ static void vus_proc_req(VuDev *vu_dev, int idx)
         VirtIOSCSICmdReq *req;
         VirtIOSCSICmdResp *rsp;
 
-        elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(vu_dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             g_debug("No more elements pending on vq[%d]@%p", idx, vq);
             break;
diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 937f64480d..68eb165755 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -2469,7 +2469,8 @@ vu_queue_set_notification(VuDev *dev, VuVirtq *vq, int enable)
 
 static bool
 virtqueue_map_desc(VuDev *dev,
-                   unsigned int *p_num_sg, struct iovec *iov,
+                   unsigned int *p_num_sg, unsigned int *p_bad_sg,
+                   struct iovec *iov,
                    unsigned int max_num_sg, bool is_write,
                    uint64_t pa, size_t sz)
 {
@@ -2490,10 +2491,35 @@ virtqueue_map_desc(VuDev *dev,
             return false;
         }
 
-        iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
-        if (iov[num_sg].iov_base == NULL) {
-            vu_panic(dev, "virtio: invalid address for buffers");
-            return false;
+        if (p_bad_sg && *p_bad_sg) {
+            /* A previous mapping was bad, we won't try and map this either */
+            *p_bad_sg = *p_bad_sg + 1;
+        }
+        if (!p_bad_sg || !*p_bad_sg) {
+            /* No bad mappings so far, lets try mapping this one */
+            iov[num_sg].iov_base = vu_gpa_to_va(dev, &len, pa);
+            if (iov[num_sg].iov_base == NULL) {
+                /*
+                 * OK, it won't map, either panic or if the caller can handle
+                 * it, then count it.
+                 */
+                if (!p_bad_sg) {
+                    vu_panic(dev, "virtio: invalid address for buffers");
+                    return false;
+                } else {
+                    *p_bad_sg = *p_bad_sg + 1;
+                }
+            }
+        }
+        if (p_bad_sg && *p_bad_sg) {
+            /*
+             * There was a bad mapping, either now or previously, since
+             * the caller set p_bad_sg it means it's prepared to deal with
+             * it, so give it the pa in the iov
+             * Note: In this case len will be the whole sz, so we won't
+             * go around again for this descriptor
+             */
+            iov[num_sg].iov_base = (void *)(uintptr_t)pa;
         }
         iov[num_sg].iov_len = len;
         num_sg++;
@@ -2524,7 +2550,8 @@ virtqueue_alloc_element(size_t sz,
 }
 
 static void *
-vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
+vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz,
+                  unsigned int *p_bad_in, unsigned int *p_bad_out)
 {
     struct vring_desc *desc = vq->vring.desc;
     uint64_t desc_addr, read_len;
@@ -2568,7 +2595,7 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
     /* Collect all the descriptors */
     do {
         if (le16toh(desc[i].flags) & VRING_DESC_F_WRITE) {
-            if (!virtqueue_map_desc(dev, &in_num, iov + out_num,
+            if (!virtqueue_map_desc(dev, &in_num, p_bad_in, iov + out_num,
                                VIRTQUEUE_MAX_SIZE - out_num, true,
                                le64toh(desc[i].addr),
                                le32toh(desc[i].len))) {
@@ -2579,7 +2606,7 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
                 vu_panic(dev, "Incorrect order for descriptors");
                 return NULL;
             }
-            if (!virtqueue_map_desc(dev, &out_num, iov,
+            if (!virtqueue_map_desc(dev, &out_num, p_bad_out, iov,
                                VIRTQUEUE_MAX_SIZE, false,
                                le64toh(desc[i].addr),
                                le32toh(desc[i].len))) {
@@ -2669,7 +2696,8 @@ vu_queue_inflight_post_put(VuDev *dev, VuVirtq *vq, int desc_idx)
 }
 
 void *
-vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz,
+             unsigned int *p_bad_in, unsigned int *p_bad_out)
 {
     int i;
     unsigned int head;
@@ -2682,7 +2710,8 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 
     if (unlikely(vq->resubmit_list && vq->resubmit_num > 0)) {
         i = (--vq->resubmit_num);
-        elem = vu_queue_map_desc(dev, vq, vq->resubmit_list[i].index, sz);
+        elem = vu_queue_map_desc(dev, vq, vq->resubmit_list[i].index, sz,
+                                 p_bad_in, p_bad_out);
 
         if (!vq->resubmit_num) {
             free(vq->resubmit_list);
@@ -2714,7 +2743,7 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
         vring_set_avail_event(vq, vq->last_avail_idx);
     }
 
-    elem = vu_queue_map_desc(dev, vq, head, sz);
+    elem = vu_queue_map_desc(dev, vq, head, sz, p_bad_in, p_bad_out);
 
     if (!elem) {
         return NULL;
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index 3d13dfadde..330b61c005 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -589,11 +589,17 @@ void vu_queue_notify_sync(VuDev *dev, VuVirtq *vq);
  * @dev: a VuDev context
  * @vq: a VuVirtq queue
  * @sz: the size of struct to return (must be >= VuVirtqElement)
+ * @p_bad_in: If none NULL, a pointer to an integer count of
+ *            unmappable regions in input descriptors
+ * @p_bad_out: If none NULL, a pointer to an integer count of
+ *            unmappable regions in output descriptors
+ *
  *
  * Returns: a VuVirtqElement filled from the queue or NULL. The
  * returned element must be free()-d by the caller.
  */
-void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz);
+void *vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz,
+                   unsigned int *p_bad_in, unsigned int *p_bad_out);
 
 
 /**
diff --git a/tests/vhost-user-bridge.c b/tests/vhost-user-bridge.c
index 24815920b2..4f6829e6c3 100644
--- a/tests/vhost-user-bridge.c
+++ b/tests/vhost-user-bridge.c
@@ -184,7 +184,7 @@ vubr_handle_tx(VuDev *dev, int qidx)
         unsigned int out_num;
         struct iovec sg[VIRTQUEUE_MAX_SIZE], *out_sg;
 
-        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             break;
         }
@@ -299,7 +299,7 @@ vubr_backend_recv_cb(int sock, void *ctx)
         ssize_t ret, total = 0;
         unsigned int num;
 
-        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement));
+        elem = vu_queue_pop(dev, vq, sizeof(VuVirtqElement), NULL, NULL);
         if (!elem) {
             break;
         }
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 6dd73c9b72..2604e7f418 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -732,7 +732,8 @@ static void *fv_queue_thread(void *opaque)
                  __func__, qi->qidx, (size_t)evalue, in_bytes, out_bytes);
 
         while (1) {
-            FVRequest *req = vu_queue_pop(dev, q, sizeof(FVRequest));
+            FVRequest *req = vu_queue_pop(dev, q, sizeof(FVRequest),
+                                          NULL, NULL);
             if (!req) {
                 break;
             }
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (4 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 05/26] DAX: libvhost-user: Allow popping a queue element with bad pointers Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-29 15:48   ` Dr. David Alan Gilbert
  2021-04-28 11:00 ` [PATCH v3 07/26] DAX: virtio: Add shared memory capability Dr. David Alan Gilbert (git)
                   ` (21 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add virtio-fs definitions to libvhost-user

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 subprojects/libvhost-user/libvhost-user.c | 48 +++++++++++++++++++++++
 subprojects/libvhost-user/libvhost-user.h | 40 +++++++++++++++++++
 2 files changed, 88 insertions(+)

diff --git a/subprojects/libvhost-user/libvhost-user.c b/subprojects/libvhost-user/libvhost-user.c
index 68eb165755..97c909c6a8 100644
--- a/subprojects/libvhost-user/libvhost-user.c
+++ b/subprojects/libvhost-user/libvhost-user.c
@@ -2918,3 +2918,51 @@ vu_queue_push(VuDev *dev, VuVirtq *vq,
     vu_queue_flush(dev, vq, 1);
     vu_queue_inflight_post_put(dev, vq, elem->index);
 }
+
+int64_t vu_fs_cache_request(VuDev *dev, VhostUserSlaveRequest req, int fd,
+                            VhostUserFSSlaveMsg *fsm)
+{
+    int fd_num = 0;
+    bool res;
+    uint64_t payload = 0;
+    VhostUserMsg vmsg = {
+        .request = req,
+        .flags = VHOST_USER_VERSION | VHOST_USER_NEED_REPLY_MASK,
+        .payload.fs = *fsm,
+    };
+
+    if (fsm->count > VHOST_USER_FS_SLAVE_MAX_ENTRIES) {
+        return -EINVAL;
+    }
+
+    vmsg.size = sizeof(VhostUserFSSlaveMsg) +
+                fsm->count * sizeof(VhostUserFSSlaveMsgEntry);
+    memcpy(&vmsg.payload.fs, fsm, vmsg.size);
+
+    if (fd != -1) {
+        vmsg.fds[fd_num++] = fd;
+    }
+
+    vmsg.fd_num = fd_num;
+
+    if (!vu_has_protocol_feature(dev, VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD)) {
+        return -EINVAL;
+    }
+
+    pthread_mutex_lock(&dev->slave_mutex);
+    if (!vu_message_write(dev, dev->slave_fd, &vmsg)) {
+        pthread_mutex_unlock(&dev->slave_mutex);
+        return -EIO;
+    }
+
+    /* Also unlocks the slave_mutex */
+    res = vu_process_message_reply(dev, &vmsg, &payload);
+    if (!res) {
+        return -EIO;
+    }
+    /*
+     * Payload is delivered as uint64_t but is actually signed for
+     * errors.
+     */
+    return (int64_t)payload;
+}
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index 330b61c005..70fc61171f 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -122,6 +122,33 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
+/* Structures carried over the slave channel back to QEMU */
+#define VHOST_USER_FS_SLAVE_MAX_ENTRIES 32
+
+/* For the flags field of VhostUserFSSlaveMsg */
+#define VHOST_USER_FS_FLAG_MAP_R (1u << 0)
+#define VHOST_USER_FS_FLAG_MAP_W (1u << 1)
+
+typedef struct {
+    /* Offsets within the file being mapped */
+    uint64_t fd_offset;
+    /* Offsets within the cache */
+    uint64_t c_offset;
+    /* Lengths of sections */
+    uint64_t len;
+    /* Flags, from VHOST_USER_FS_FLAG_* */
+    uint64_t flags;
+} VhostUserFSSlaveMsgEntry;
+
+typedef struct {
+    /* Number of entries */
+    uint16_t count;
+    /* Spare */
+    uint16_t align;
+
+    VhostUserFSSlaveMsgEntry entries[];
+} VhostUserFSSlaveMsg;
+
 typedef struct VhostUserMemoryRegion {
     uint64_t guest_phys_addr;
     uint64_t memory_size;
@@ -197,6 +224,7 @@ typedef struct VhostUserMsg {
         VhostUserConfig config;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserFSSlaveMsg fs;
     } payload;
 
     int fds[VHOST_MEMORY_BASELINE_NREGIONS];
@@ -693,4 +721,16 @@ void vu_queue_get_avail_bytes(VuDev *vdev, VuVirtq *vq, unsigned int *in_bytes,
 bool vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 
+/**
+ * vu_fs_cache_request: Send a slave message for an fs client
+ * @dev: a VuDev context
+ * @req: The request type (map, unmap, sync)
+ * @fd: an fd (only required for map, else must be -1)
+ * @fsm: The body of the message
+ *
+ * Returns: 0 or above for success, nevative errno on error
+ */
+int64_t vu_fs_cache_request(VuDev *dev, VhostUserSlaveRequest req, int fd,
+                            VhostUserFSSlaveMsg *fsm);
+
 #endif /* LIBVHOST_USER_H */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 07/26] DAX: virtio: Add shared memory capability
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (5 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR Dr. David Alan Gilbert (git)
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Define a new capability type 'VIRTIO_PCI_CAP_SHARED_MEMORY_CFG'
and the data structure 'virtio_pci_cap64' to go with it.
They allow defining shared memory regions with sizes and offsets
of 2^32 and more.
Multiple instances of the capability are allowed and distinguished
by the 'id' field in the base capability.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/virtio/virtio-pci.c | 20 ++++++++++++++++++++
 hw/virtio/virtio-pci.h |  4 ++++
 2 files changed, 24 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index b321604d9b..493014fdf7 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1138,6 +1138,26 @@ static int virtio_pci_add_mem_cap(VirtIOPCIProxy *proxy,
     return offset;
 }
 
+int virtio_pci_add_shm_cap(VirtIOPCIProxy *proxy,
+                           uint8_t bar, uint64_t offset, uint64_t length,
+                           uint8_t id)
+{
+    struct virtio_pci_cap64 cap = {
+        .cap.cap_len = sizeof cap,
+        .cap.cfg_type = VIRTIO_PCI_CAP_SHARED_MEMORY_CFG,
+    };
+    uint32_t mask32 = ~0;
+
+    cap.cap.bar = bar;
+    cap.cap.id = id;
+    cap.cap.length = cpu_to_le32(length & mask32);
+    cap.length_hi = cpu_to_le32((length >> 32) & mask32);
+    cap.cap.offset = cpu_to_le32(offset & mask32);
+    cap.offset_hi = cpu_to_le32((offset >> 32) & mask32);
+
+    return virtio_pci_add_mem_cap(proxy, &cap.cap);
+}
+
 static uint64_t virtio_pci_common_read(void *opaque, hwaddr addr,
                                        unsigned size)
 {
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index 2446dcd9ae..5e5c4a4c6d 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -252,4 +252,8 @@ void virtio_pci_types_register(const VirtioPCIDeviceTypeInfo *t);
  */
 unsigned virtio_pci_optimal_num_queues(unsigned fixed_queues);
 
+int virtio_pci_add_shm_cap(VirtIOPCIProxy *proxy,
+                           uint8_t bar, uint64_t offset, uint64_t length,
+                           uint8_t id);
+
 #endif
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (6 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 07/26] DAX: virtio: Add shared memory capability Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-05 12:12   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping Dr. David Alan Gilbert (git)
                   ` (19 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a cache BAR into which files will be directly mapped.
The size can be set with the cache-size= property, e.g.
   -device vhost-user-fs-pci,chardev=char0,tag=myfs,cache-size=16G

The default is no cache.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
with PPC fixes by:
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
 hw/virtio/vhost-user-fs-pci.c     | 32 +++++++++++++++++++++++++++++++
 hw/virtio/vhost-user-fs.c         | 32 +++++++++++++++++++++++++++++++
 include/hw/virtio/vhost-user-fs.h |  2 ++
 3 files changed, 66 insertions(+)

diff --git a/hw/virtio/vhost-user-fs-pci.c b/hw/virtio/vhost-user-fs-pci.c
index 2ed8492b3f..20e447631f 100644
--- a/hw/virtio/vhost-user-fs-pci.c
+++ b/hw/virtio/vhost-user-fs-pci.c
@@ -12,14 +12,19 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include "hw/qdev-properties.h"
 #include "hw/virtio/vhost-user-fs.h"
 #include "virtio-pci.h"
 #include "qom/object.h"
+#include "standard-headers/linux/virtio_fs.h"
+
+#define VIRTIO_FS_PCI_CACHE_BAR 2
 
 struct VHostUserFSPCI {
     VirtIOPCIProxy parent_obj;
     VHostUserFS vdev;
+    MemoryRegion cachebar;
 };
 
 typedef struct VHostUserFSPCI VHostUserFSPCI;
@@ -38,7 +43,9 @@ static Property vhost_user_fs_pci_properties[] = {
 static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
 {
     VHostUserFSPCI *dev = VHOST_USER_FS_PCI(vpci_dev);
+    bool modern_pio = vpci_dev->flags & VIRTIO_PCI_FLAG_MODERN_PIO_NOTIFY;
     DeviceState *vdev = DEVICE(&dev->vdev);
+    uint64_t cachesize;
 
     if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
         /* Also reserve config change and hiprio queue vectors */
@@ -46,6 +53,31 @@ static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
     }
 
     qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
+    cachesize = dev->vdev.conf.cache_size;
+
+    if (cachesize && modern_pio) {
+        error_setg(errp, "DAX Cache can not be used together with modern_pio");
+        return;
+    }
+
+    /*
+     * The bar starts with the data/DAX cache
+     * Others will be added later.
+     */
+    memory_region_init(&dev->cachebar, OBJECT(vpci_dev),
+                       "vhost-user-fs-pci-cachebar", cachesize);
+    if (cachesize) {
+        memory_region_add_subregion(&dev->cachebar, 0, &dev->vdev.cache);
+        virtio_pci_add_shm_cap(vpci_dev, VIRTIO_FS_PCI_CACHE_BAR, 0, cachesize,
+                               VIRTIO_FS_SHMCAP_ID_CACHE);
+
+        /* After 'realized' so the memory region exists */
+        pci_register_bar(&vpci_dev->pci_dev, VIRTIO_FS_PCI_CACHE_BAR,
+                         PCI_BASE_ADDRESS_SPACE_MEMORY |
+                         PCI_BASE_ADDRESS_MEM_PREFETCH |
+                         PCI_BASE_ADDRESS_MEM_TYPE_64,
+                         &dev->cachebar);
+    }
 }
 
 static void vhost_user_fs_pci_class_init(ObjectClass *klass, void *data)
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 6f7f91533d..dd0a02aa99 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -35,6 +35,16 @@ static const int user_feature_bits[] = {
     VHOST_INVALID_FEATURE_BIT
 };
 
+/*
+ * The powerpc kernel code expects the memory to be accessible during
+ * addition/removal.
+ */
+#if defined(TARGET_PPC64) && defined(CONFIG_LINUX)
+#define DAX_WINDOW_PROT PROT_READ
+#else
+#define DAX_WINDOW_PROT PROT_NONE
+#endif
+
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
@@ -175,6 +185,7 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserFS *fs = VHOST_USER_FS(dev);
+    void *cache_ptr;
     unsigned int i;
     size_t len;
     int ret;
@@ -214,6 +225,26 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
                    VIRTQUEUE_MAX_SIZE);
         return;
     }
+    if (fs->conf.cache_size &&
+        (!is_power_of_2(fs->conf.cache_size) ||
+          fs->conf.cache_size < qemu_real_host_page_size)) {
+        error_setg(errp, "cache-size property must be a power of 2 "
+                         "no smaller than the page size");
+        return;
+    }
+    if (fs->conf.cache_size) {
+        /* Anonymous, private memory is not counted as overcommit */
+        cache_ptr = mmap(NULL, fs->conf.cache_size, DAX_WINDOW_PROT,
+                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+        if (cache_ptr == MAP_FAILED) {
+            error_setg(errp, "Unable to mmap blank cache");
+            return;
+        }
+
+        memory_region_init_ram_ptr(&fs->cache, OBJECT(vdev),
+                                   "virtio-fs-cache",
+                                   fs->conf.cache_size, cache_ptr);
+    }
 
     if (!vhost_user_init(&fs->vhost_user, &fs->conf.chardev, errp)) {
         return;
@@ -289,6 +320,7 @@ static Property vuf_properties[] = {
     DEFINE_PROP_UINT16("num-request-queues", VHostUserFS,
                        conf.num_request_queues, 1),
     DEFINE_PROP_UINT16("queue-size", VHostUserFS, conf.queue_size, 128),
+    DEFINE_PROP_SIZE("cache-size", VHostUserFS, conf.cache_size, 0),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 0d62834c25..04596799e3 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -28,6 +28,7 @@ typedef struct {
     char *tag;
     uint16_t num_request_queues;
     uint16_t queue_size;
+    uint64_t cache_size;
 } VHostUserFSConf;
 
 struct VHostUserFS {
@@ -42,6 +43,7 @@ struct VHostUserFS {
     int32_t bootindex;
 
     /*< public >*/
+    MemoryRegion cache;
 };
 
 #endif /* _QEMU_VHOST_USER_FS_H */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (7 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-05 14:15   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 10/26] DAX: virtio-fs: Fill in " Dr. David Alan Gilbert (git)
                   ` (18 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The daemon may request that fd's be mapped into the virtio-fs cache
visible to the guest.
These mappings are triggered by commands sent over the slave fd
from the daemon.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/interop/vhost-user.rst               | 21 ++++++++
 hw/virtio/vhost-user-fs.c                 | 66 +++++++++++++++++++++++
 hw/virtio/vhost-user.c                    | 26 +++++++++
 include/hw/virtio/vhost-user-fs.h         | 33 ++++++++++++
 subprojects/libvhost-user/libvhost-user.h |  2 +
 5 files changed, 148 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index d6085f7045..09aee3565d 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1432,6 +1432,27 @@ Slave message types
 
   The state.num field is currently reserved and must be set to 0.
 
+``VHOST_USER_SLAVE_FS_MAP``
+  :id: 6
+  :equivalent ioctl: N/A
+  :slave payload: ``struct VhostUserFSSlaveMsg``
+  :master payload: N/A
+
+  Requests that an fd, provided in the ancillary data, be mmapped
+  into the virtio-fs cache; multiple chunks can be mapped in one
+  command.
+  A reply is generated indicating whether mapping succeeded.
+
+``VHOST_USER_SLAVE_FS_UNMAP``
+  :id: 7
+  :equivalent ioctl: N/A
+  :slave payload: ``struct VhostUserFSSlaveMsg``
+  :master payload: N/A
+
+  Requests that the range in the virtio-fs cache is unmapped;
+  multiple chunks can be unmapped in one command.
+  A reply is generated indicating whether unmapping succeeded.
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index dd0a02aa99..169a146e72 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -45,6 +45,72 @@ static const int user_feature_bits[] = {
 #define DAX_WINDOW_PROT PROT_NONE
 #endif
 
+/*
+ * The message apparently had 'received_size' bytes, check this
+ * matches the count in the message.
+ *
+ * Returns true if the size matches.
+ */
+static bool check_slave_message_entries(const VhostUserFSSlaveMsg *sm,
+                                        int received_size)
+{
+    int tmp;
+
+    /*
+     * VhostUserFSSlaveMsg consists of a body followed by 'n' entries,
+     * (each VhostUserFSSlaveMsgEntry).  There's a maximum of
+     * VHOST_USER_FS_SLAVE_MAX_ENTRIES of these.
+     */
+    if (received_size <= sizeof(VhostUserFSSlaveMsg)) {
+        error_report("%s: Short VhostUserFSSlaveMsg size, %d", __func__,
+                     received_size);
+        return false;
+    }
+
+    tmp = received_size - sizeof(VhostUserFSSlaveMsg);
+    if (tmp % sizeof(VhostUserFSSlaveMsgEntry)) {
+        error_report("%s: Non-multiple VhostUserFSSlaveMsg size, %d", __func__,
+                     received_size);
+        return false;
+    }
+
+    tmp /= sizeof(VhostUserFSSlaveMsgEntry);
+    if (tmp != sm->count) {
+        error_report("%s: VhostUserFSSlaveMsg count mismatch, %d count: %d",
+                     __func__, tmp, sm->count);
+        return false;
+    }
+
+    if (sm->count > VHOST_USER_FS_SLAVE_MAX_ENTRIES) {
+        error_report("%s: VhostUserFSSlaveMsg too many entries: %d",
+                     __func__, sm->count);
+        return false;
+    }
+    return true;
+}
+
+uint64_t vhost_user_fs_slave_map(struct vhost_dev *dev, int message_size,
+                                 VhostUserFSSlaveMsg *sm, int fd)
+{
+    if (!check_slave_message_entries(sm, message_size)) {
+        return (uint64_t)-1;
+    }
+
+    /* TODO */
+    return (uint64_t)-1;
+}
+
+uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
+                                   VhostUserFSSlaveMsg *sm)
+{
+    if (!check_slave_message_entries(sm, message_size)) {
+        return (uint64_t)-1;
+    }
+
+    /* TODO */
+    return (uint64_t)-1;
+}
+
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 4a7d2786c6..7d9b0ad45d 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -12,6 +12,7 @@
 #include "qapi/error.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-user.h"
+#include "hw/virtio/vhost-user-fs.h"
 #include "hw/virtio/vhost-backend.h"
 #include "hw/virtio/virtio.h"
 #include "hw/virtio/virtio-net.h"
@@ -133,6 +134,10 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_IOTLB_MSG = 1,
     VHOST_USER_SLAVE_CONFIG_CHANGE_MSG = 2,
     VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG = 3,
+    VHOST_USER_SLAVE_VRING_CALL = 4,
+    VHOST_USER_SLAVE_VRING_ERR = 5,
+    VHOST_USER_SLAVE_FS_MAP = 6,
+    VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -205,6 +210,16 @@ typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/*
+ * VhostUserFSSlaveMsg is special since it has a variable entry count,
+ * but it does have a maximum, so make a type for that to fit in our union
+ * for max size.
+ */
+typedef struct {
+    VhostUserFSSlaveMsg fs;
+    VhostUserFSSlaveMsgEntry entries[VHOST_USER_FS_SLAVE_MAX_ENTRIES];
+} QEMU_PACKED VhostUserFSSlaveMsgMax;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1<<8)
@@ -219,6 +234,8 @@ typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserFSSlaveMsg fs;
+        VhostUserFSSlaveMsg fs_max; /* Never actually used */
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1538,6 +1555,15 @@ static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
         ret = vhost_user_slave_handle_vring_host_notifier(dev, &payload.area,
                                                           fd ? fd[0] : -1);
         break;
+#ifdef CONFIG_VHOST_USER_FS
+    case VHOST_USER_SLAVE_FS_MAP:
+        ret = vhost_user_fs_slave_map(dev, hdr.size, &payload.fs,
+                                      fd ? fd[0] : -1);
+        break;
+    case VHOST_USER_SLAVE_FS_UNMAP:
+        ret = vhost_user_fs_slave_unmap(dev, hdr.size, &payload.fs);
+        break;
+#endif
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
         ret = true;
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 04596799e3..0766f17548 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -23,6 +23,33 @@
 #define TYPE_VHOST_USER_FS "vhost-user-fs-device"
 OBJECT_DECLARE_SIMPLE_TYPE(VHostUserFS, VHOST_USER_FS)
 
+/* Structures carried over the slave channel back to QEMU */
+#define VHOST_USER_FS_SLAVE_MAX_ENTRIES 32
+
+/* For the flags field of VhostUserFSSlaveMsg */
+#define VHOST_USER_FS_FLAG_MAP_R (1u << 0)
+#define VHOST_USER_FS_FLAG_MAP_W (1u << 1)
+
+typedef struct {
+    /* Offsets within the file being mapped */
+    uint64_t fd_offset;
+    /* Offsets within the cache */
+    uint64_t c_offset;
+    /* Lengths of sections */
+    uint64_t len;
+    /* Flags, from VHOST_USER_FS_FLAG_* */
+    uint64_t flags;
+} VhostUserFSSlaveMsgEntry;
+
+typedef struct {
+    /* Number of entries */
+    uint16_t count;
+    /* Spare */
+    uint16_t align;
+
+    VhostUserFSSlaveMsgEntry entries[];
+} VhostUserFSSlaveMsg;
+
 typedef struct {
     CharBackend chardev;
     char *tag;
@@ -46,4 +73,10 @@ struct VHostUserFS {
     MemoryRegion cache;
 };
 
+/* Callbacks from the vhost-user code for slave commands */
+uint64_t vhost_user_fs_slave_map(struct vhost_dev *dev, int message_size,
+                                 VhostUserFSSlaveMsg *sm, int fd);
+uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
+                                   VhostUserFSSlaveMsg *sm);
+
 #endif /* _QEMU_VHOST_USER_FS_H */
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index 70fc61171f..a98c5f5c11 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -119,6 +119,8 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG = 3,
     VHOST_USER_SLAVE_VRING_CALL = 4,
     VHOST_USER_SLAVE_VRING_ERR = 5,
+    VHOST_USER_SLAVE_FS_MAP = 6,
+    VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 10/26] DAX: virtio-fs: Fill in slave commands for mapping
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (8 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-05 16:43   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 11/26] DAX: virtiofsd Add cache accessor functions Dr. David Alan Gilbert (git)
                   ` (17 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Fill in definitions for map, unmap and sync commands.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
with fix by misono.tomohiro@fujitsu.com
---
 hw/virtio/vhost-user-fs.c | 117 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 113 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 169a146e72..963f694435 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -92,23 +92,132 @@ static bool check_slave_message_entries(const VhostUserFSSlaveMsg *sm,
 uint64_t vhost_user_fs_slave_map(struct vhost_dev *dev, int message_size,
                                  VhostUserFSSlaveMsg *sm, int fd)
 {
+    VHostUserFS *fs = (VHostUserFS *)object_dynamic_cast(OBJECT(dev->vdev),
+                          TYPE_VHOST_USER_FS);
+    if (!fs) {
+        error_report("%s: Bad fs ptr", __func__);
+        return (uint64_t)-1;
+    }
     if (!check_slave_message_entries(sm, message_size)) {
         return (uint64_t)-1;
     }
 
-    /* TODO */
-    return (uint64_t)-1;
+    size_t cache_size = fs->conf.cache_size;
+    if (!cache_size) {
+        error_report("map called when DAX cache not present");
+        return (uint64_t)-1;
+    }
+    void *cache_host = memory_region_get_ram_ptr(&fs->cache);
+
+    unsigned int i;
+    int res = 0;
+
+    if (fd < 0) {
+        error_report("Bad fd for map");
+        return (uint64_t)-1;
+    }
+
+    for (i = 0; i < sm->count; i++) {
+        VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
+        if (e->len == 0) {
+            continue;
+        }
+
+        if ((e->c_offset + e->len) < e->len ||
+            (e->c_offset + e->len) > cache_size) {
+            error_report("Bad offset/len for map [%d] %" PRIx64 "+%" PRIx64,
+                         i, e->c_offset, e->len);
+            res = -1;
+            break;
+        }
+
+        if (mmap(cache_host + e->c_offset, e->len,
+                 ((e->flags & VHOST_USER_FS_FLAG_MAP_R) ? PROT_READ : 0) |
+                 ((e->flags & VHOST_USER_FS_FLAG_MAP_W) ? PROT_WRITE : 0),
+                 MAP_SHARED | MAP_FIXED,
+                 fd, e->fd_offset) != (cache_host + e->c_offset)) {
+            res = -errno;
+            error_report("map failed err %d [%d] %" PRIx64 "+%" PRIx64 " from %"
+                         PRIx64, errno, i, e->c_offset, e->len,
+                         e->fd_offset);
+            break;
+        }
+    }
+
+    if (res) {
+        /* Something went wrong, unmap them all */
+        vhost_user_fs_slave_unmap(dev, message_size, sm);
+    }
+    return (uint64_t)res;
 }
 
 uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
                                    VhostUserFSSlaveMsg *sm)
 {
+    VHostUserFS *fs = (VHostUserFS *)object_dynamic_cast(OBJECT(dev->vdev),
+                          TYPE_VHOST_USER_FS);
+    if (!fs) {
+        error_report("%s: Bad fs ptr", __func__);
+        return (uint64_t)-1;
+    }
     if (!check_slave_message_entries(sm, message_size)) {
         return (uint64_t)-1;
     }
 
-    /* TODO */
-    return (uint64_t)-1;
+    size_t cache_size = fs->conf.cache_size;
+    if (!cache_size) {
+        /*
+         * Since dax cache is disabled, there should be no unmap request.
+         * Howerver we still receives whole range unmap request during umount
+         * for cleanup. Ignore it.
+         */
+        if (sm->entries[0].len == ~(uint64_t)0) {
+            return 0;
+        }
+
+        error_report("unmap called when DAX cache not present");
+        return (uint64_t)-1;
+    }
+    void *cache_host = memory_region_get_ram_ptr(&fs->cache);
+
+    unsigned int i;
+    int res = 0;
+
+    /*
+     * Note even if one unmap fails we try the rest, since the effect
+     * is to clean up as much as possible.
+     */
+    for (i = 0; i < sm->count; i++) {
+        VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
+        void *ptr;
+        if (e->len == 0) {
+            continue;
+        }
+
+        if (e->len == ~(uint64_t)0) {
+            /* Special case meaning the whole arena */
+            e->len = cache_size;
+        }
+
+        if ((e->c_offset + e->len) < e->len ||
+            (e->c_offset + e->len) > cache_size) {
+            error_report("Bad offset/len for unmap [%d] %" PRIx64 "+%" PRIx64,
+                         i, e->c_offset, e->len);
+            res = -1;
+            continue;
+        }
+
+        ptr = mmap(cache_host + e->c_offset, e->len, DAX_WINDOW_PROT,
+                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
+        if (ptr != (cache_host + e->c_offset)) {
+            res = -errno;
+            error_report("mmap failed (%s) [%d] %" PRIx64 "+%" PRIx64 " from %"
+                         PRIx64 " res: %p", strerror(errno), i, e->c_offset,
+                         e->len, e->fd_offset, ptr);
+        }
+    }
+
+    return (uint64_t)res;
 }
 
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 11/26] DAX: virtiofsd Add cache accessor functions
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (9 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 10/26] DAX: virtio-fs: Fill in " Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands Dr. David Alan Gilbert (git)
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add low level functions that the clients can use to map/unmap cache
areas.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_lowlevel.h | 21 +++++++++++++++++++++
 tools/virtiofsd/fuse_virtio.c   | 18 ++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 3bf786b034..3383e3a8a0 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -29,6 +29,8 @@
 #include <sys/uio.h>
 #include <utime.h>
 
+#include "subprojects/libvhost-user/libvhost-user.h"
+
 /*
  * Miscellaneous definitions
  */
@@ -1971,4 +1973,23 @@ void fuse_session_process_buf(struct fuse_session *se,
  */
 int fuse_session_receive_buf(struct fuse_session *se, struct fuse_buf *buf);
 
+/**
+ * For use with virtio-fs; request an fd be mapped into the cache
+ *
+ * @param req The request that triggered this action
+ * @param msg A set of mapping requests
+ * @param fd The fd to map
+ * @return Zero on success
+ */
+int64_t fuse_virtio_map(fuse_req_t req, VhostUserFSSlaveMsg *msg, int fd);
+
+/**
+ * For use with virtio-fs; request unmapping of part of the cache
+ *
+ * @param se The session this request is on
+ * @param msg A set of unmapping requests
+ * @return Zero on success
+ */
+int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg);
+
 #endif /* FUSE_LOWLEVEL_H_ */
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 2604e7f418..85d90ca595 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -1123,3 +1123,21 @@ void virtio_session_close(struct fuse_session *se)
     free(se->virtio_dev);
     se->virtio_dev = NULL;
 }
+
+int64_t fuse_virtio_map(fuse_req_t req, VhostUserFSSlaveMsg *msg, int fd)
+{
+    if (!req->se->virtio_dev) {
+        return -ENODEV;
+    }
+    return vu_fs_cache_request(&req->se->virtio_dev->dev,
+                               VHOST_USER_SLAVE_FS_MAP, fd, msg);
+}
+
+int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg)
+{
+    if (!se->virtio_dev) {
+        return -ENODEV;
+    }
+    return vu_fs_cache_request(&se->virtio_dev->dev, VHOST_USER_SLAVE_FS_UNMAP,
+                               -1, msg);
+}
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (10 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 11/26] DAX: virtiofsd Add cache accessor functions Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:02   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 13/26] DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll Dr. David Alan Gilbert (git)
                   ` (15 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add commands so that the guest kernel can ask the daemon to map file
sections into a guest kernel visible cache.

Note: Catherine Ho had sent a patch to fix an issue with multiple
removemapping. It was a merge issue though.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Including-fixes: Catherine Ho <catherine.hecx@gmail.com>
Signed-off-by: Catherine Ho <catherine.hecx@gmail.com>
---
 tools/virtiofsd/fuse_lowlevel.c | 69 +++++++++++++++++++++++++++++++++
 tools/virtiofsd/fuse_lowlevel.h | 23 ++++++++++-
 2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 58e32fc963..a1a8730b73 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1870,6 +1870,73 @@ static void do_lseek(fuse_req_t req, fuse_ino_t nodeid,
     }
 }
 
+static void do_setupmapping(fuse_req_t req, fuse_ino_t nodeid,
+                            struct fuse_mbuf_iter *iter)
+{
+    struct fuse_setupmapping_in *arg;
+    struct fuse_file_info fi;
+
+    arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
+    if (!arg) {
+        fuse_reply_err(req, EINVAL);
+        return;
+    }
+
+    memset(&fi, 0, sizeof(fi));
+    fi.fh = arg->fh;
+
+    /*
+     *  TODO: Need to come up with a better definition of flags here; it can't
+     * be the kernel view of the flags, since that's abstracted from the client
+     * similarly, it's not the vhost-user set
+     * for now just use O_ flags
+     */
+    uint64_t genflags;
+
+    genflags = O_RDONLY;
+    if (arg->flags & FUSE_SETUPMAPPING_FLAG_WRITE) {
+        genflags = O_RDWR;
+    }
+
+    if (req->se->op.setupmapping) {
+        req->se->op.setupmapping(req, nodeid, arg->foffset, arg->len,
+                                 arg->moffset, genflags, &fi);
+    } else {
+        fuse_reply_err(req, ENOSYS);
+    }
+}
+
+static void do_removemapping(fuse_req_t req, fuse_ino_t nodeid,
+                             struct fuse_mbuf_iter *iter)
+{
+    struct fuse_removemapping_in *arg;
+    struct fuse_removemapping_one *one;
+
+    arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
+    if (!arg || !arg->count ||
+        (uint64_t)arg->count * sizeof(*one) >= SIZE_MAX) {
+        fuse_log(FUSE_LOG_ERR, "do_removemapping: invalid arg %p\n", arg);
+        fuse_reply_err(req, EINVAL);
+        return;
+    }
+
+    one = fuse_mbuf_iter_advance(iter, arg->count * sizeof(*one));
+    if (!one) {
+        fuse_log(
+            FUSE_LOG_ERR,
+            "do_removemapping: invalid in, expected %d * %zd, has %zd - %zd\n",
+            arg->count, sizeof(*one), iter->size, iter->pos);
+        fuse_reply_err(req, EINVAL);
+        return;
+    }
+
+    if (req->se->op.removemapping) {
+        req->se->op.removemapping(req, req->se, nodeid, arg->count, one);
+    } else {
+        fuse_reply_err(req, ENOSYS);
+    }
+}
+
 static void do_init(fuse_req_t req, fuse_ino_t nodeid,
                     struct fuse_mbuf_iter *iter)
 {
@@ -2267,6 +2334,8 @@ static struct {
     [FUSE_RENAME2] = { do_rename2, "RENAME2" },
     [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
     [FUSE_LSEEK] = { do_lseek, "LSEEK" },
+    [FUSE_SETUPMAPPING] = { do_setupmapping, "SETUPMAPPING" },
+    [FUSE_REMOVEMAPPING] = { do_removemapping, "REMOVEMAPPING" },
 };
 
 #define FUSE_MAXOP (sizeof(fuse_ll_ops) / sizeof(fuse_ll_ops[0]))
diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 3383e3a8a0..0bf206264d 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -24,6 +24,7 @@
 #endif
 
 #include "fuse_common.h"
+#include "standard-headers/linux/fuse.h"
 
 #include <sys/statvfs.h>
 #include <sys/uio.h>
@@ -1171,7 +1172,6 @@ struct fuse_lowlevel_ops {
      */
     void (*readdirplus)(fuse_req_t req, fuse_ino_t ino, size_t size, off_t off,
                         struct fuse_file_info *fi);
-
     /**
      * Copy a range of data from one file to another
      *
@@ -1227,6 +1227,27 @@ struct fuse_lowlevel_ops {
      */
     void (*lseek)(fuse_req_t req, fuse_ino_t ino, off_t off, int whence,
                   struct fuse_file_info *fi);
+
+    /*
+     * Map file sections into kernel visible cache
+     *
+     * Map a section of the file into address space visible to the kernel
+     * mounting the filesystem.
+     * TODO
+     */
+    void (*setupmapping)(fuse_req_t req, fuse_ino_t ino, uint64_t foffset,
+                         uint64_t len, uint64_t moffset, uint64_t flags,
+                         struct fuse_file_info *fi);
+
+    /*
+     * Unmap file sections in kernel visible cache
+     *
+     * Unmap sections previously mapped by setupmapping
+     * TODO
+     */
+    void (*removemapping)(fuse_req_t req, struct fuse_session *se,
+                          fuse_ino_t ino, unsigned num,
+                          struct fuse_removemapping_one *argp);
 };
 
 /**
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 13/26] DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (11 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 14/26] DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping Dr. David Alan Gilbert (git)
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 110f85a701..a16d425b78 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3145,6 +3145,22 @@ static void lo_destroy(void *userdata)
     pthread_mutex_unlock(&lo->mutex);
 }
 
+static void lo_setupmapping(fuse_req_t req, fuse_ino_t ino, uint64_t foffset,
+                            uint64_t len, uint64_t moffset, uint64_t flags,
+                            struct fuse_file_info *fi)
+{
+    /* TODO */
+    fuse_reply_err(req, ENOSYS);
+}
+
+static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
+                             fuse_ino_t ino, unsigned num,
+                             struct fuse_removemapping_one *argp)
+{
+    /* TODO */
+    fuse_reply_err(req, ENOSYS);
+}
+
 static struct fuse_lowlevel_ops lo_oper = {
     .init = lo_init,
     .lookup = lo_lookup,
@@ -3186,6 +3202,8 @@ static struct fuse_lowlevel_ops lo_oper = {
 #endif
     .lseek = lo_lseek,
     .destroy = lo_destroy,
+    .setupmapping = lo_setupmapping,
+    .removemapping = lo_removemapping,
 };
 
 /* Print vhost-user.json backend program capabilities */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 14/26] DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (12 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 13/26] DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 15/26] DAX: virtiofsd: Make lo_removemapping() work Dr. David Alan Gilbert (git)
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Wire up passthrough_ll's setupmapping to allocate, send to virtio
and then reply OK.

Guest might not pass file pointer. In that case using inode info, open
the file again, mmap() and close fd.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
With fix from:
Signed-off-by: Fotis Xenakis <foxen@windowslive.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_lowlevel.c  | 13 ++++++--
 tools/virtiofsd/passthrough_ll.c | 57 ++++++++++++++++++++++++++++++--
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index a1a8730b73..4921f1bbb7 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1899,8 +1899,17 @@ static void do_setupmapping(fuse_req_t req, fuse_ino_t nodeid,
     }
 
     if (req->se->op.setupmapping) {
-        req->se->op.setupmapping(req, nodeid, arg->foffset, arg->len,
-                                 arg->moffset, genflags, &fi);
+        /*
+         * TODO: Add a flag to request which tells if arg->fh is
+         * valid or not.
+         */
+        if (fi.fh == (uint64_t)-1) {
+            req->se->op.setupmapping(req, nodeid, arg->foffset, arg->len,
+                                     arg->moffset, genflags, NULL);
+        } else {
+            req->se->op.setupmapping(req, nodeid, arg->foffset, arg->len,
+                                     arg->moffset, genflags, &fi);
+        }
     } else {
         fuse_reply_err(req, ENOSYS);
     }
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index a16d425b78..6981737389 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3149,8 +3149,61 @@ static void lo_setupmapping(fuse_req_t req, fuse_ino_t ino, uint64_t foffset,
                             uint64_t len, uint64_t moffset, uint64_t flags,
                             struct fuse_file_info *fi)
 {
-    /* TODO */
-    fuse_reply_err(req, ENOSYS);
+    struct lo_data *lo = lo_data(req);
+    int ret = 0, fd;
+    VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
+                                         sizeof(VhostUserFSSlaveMsgEntry));
+    uint64_t vhu_flags;
+    char *buf;
+    bool writable = flags & O_RDWR;
+
+    fuse_log(FUSE_LOG_DEBUG,
+             "lo_setupmapping(ino=%" PRIu64 ", fi=0x%p,"
+             " foffset=%" PRIu64 ", len=%" PRIu64 ", moffset=%" PRIu64
+             ", flags=%" PRIu64 ")\n",
+             ino, (void *)fi, foffset, len, moffset, flags);
+
+    vhu_flags = VHOST_USER_FS_FLAG_MAP_R;
+    if (writable) {
+        vhu_flags |= VHOST_USER_FS_FLAG_MAP_W;
+    }
+
+    msg->count = 1;
+    msg->entries[0].fd_offset = foffset;
+    msg->entries[0].len = len;
+    msg->entries[0].c_offset = moffset;
+    msg->entries[0].flags = vhu_flags;
+
+    if (fi) {
+        fd = lo_fi_fd(req, fi);
+    } else {
+        ret = asprintf(&buf, "%i", lo_fd(req, ino));
+        if (ret == -1) {
+            g_free(msg);
+            return (void)fuse_reply_err(req, errno);
+        }
+
+        fd = openat(lo->proc_self_fd, buf, flags);
+        free(buf);
+        if (fd == -1) {
+            g_free(msg);
+            return (void)fuse_reply_err(req, errno);
+        }
+    }
+
+    ret = fuse_virtio_map(req, msg, fd);
+    if (ret < 0) {
+        fuse_log(FUSE_LOG_ERR,
+                 "%s: map over virtio failed (ino=%" PRId64
+                 "fd=%d moffset=0x%" PRIx64 "). err = %d\n",
+                 __func__, ino, fd, moffset, ret);
+    }
+
+    if (!fi) {
+        close(fd);
+    }
+    fuse_reply_err(req, -ret);
+    g_free(msg);
 }
 
 static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 15/26] DAX: virtiofsd: Make lo_removemapping() work
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (13 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 14/26] DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 16/26] DAX: virtiofsd: route se down to destroy method Dr. David Alan Gilbert (git)
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: Vivek Goyal <vgoyal@redhat.com>

Let guest pass in the offset in dax window a mapping is currently
mapped at and needs to be removed.

Vivek added the initial support to remove single mapping and later Peng
added patch to support removing multiple mappings in single command.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 6981737389..1a86378172 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3210,8 +3210,36 @@ static void lo_removemapping(fuse_req_t req, struct fuse_session *se,
                              fuse_ino_t ino, unsigned num,
                              struct fuse_removemapping_one *argp)
 {
-    /* TODO */
-    fuse_reply_err(req, ENOSYS);
+    VhostUserFSSlaveMsg *msg;
+    size_t alloc_count = (num > VHOST_USER_FS_SLAVE_MAX_ENTRIES) ?
+                              VHOST_USER_FS_SLAVE_MAX_ENTRIES : num;
+    int ret = 0;
+    msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
+                    alloc_count * sizeof(VhostUserFSSlaveMsgEntry));
+
+    for (int i = 0, o = 0; num > 0; i++, argp++) {
+        VhostUserFSSlaveMsgEntry *e = &msg->entries[o];
+
+        e->len = argp->len;
+        e->c_offset = argp->moffset;
+
+        o++;
+        if (--num == 0 || o == VHOST_USER_FS_SLAVE_MAX_ENTRIES) {
+            msg->count = o;
+            ret = fuse_virtio_unmap(se, msg);
+            if (ret < 0) {
+                fuse_log(FUSE_LOG_ERR,
+                         "%s: unmap over virtio failed "
+                         "(offset=0x%" PRIx64 ", len=0x%" PRIx64 "). err=%d\n",
+                         __func__, argp->moffset, argp->len, ret);
+                break;
+            }
+            o = 0;
+        }
+    }
+
+    fuse_reply_err(req, -ret);
+    g_free(msg);
 }
 
 static struct fuse_lowlevel_ops lo_oper = {
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 16/26] DAX: virtiofsd: route se down to destroy method
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (14 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 15/26] DAX: virtiofsd: Make lo_removemapping() work Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 17/26] DAX: virtiofsd: Perform an unmap on destroy Dr. David Alan Gilbert (git)
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

We're going to need to pass the session down to destroy so that it can
pass it back to do the remove mapping.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_lowlevel.c  | 6 +++---
 tools/virtiofsd/fuse_lowlevel.h  | 2 +-
 tools/virtiofsd/passthrough_ll.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 4921f1bbb7..6930574aaf 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2222,7 +2222,7 @@ static void do_destroy(fuse_req_t req, fuse_ino_t nodeid,
     se->got_destroy = 1;
     se->got_init = 0;
     if (se->op.destroy) {
-        se->op.destroy(se->userdata);
+        se->op.destroy(se->userdata, se);
     }
 
     send_reply_ok(req, NULL, 0);
@@ -2449,7 +2449,7 @@ void fuse_session_process_buf_int(struct fuse_session *se,
             se->got_destroy = 1;
             se->got_init = 0;
             if (se->op.destroy) {
-                se->op.destroy(se->userdata);
+                se->op.destroy(se->userdata, se);
             }
         } else {
             goto reply_err;
@@ -2538,7 +2538,7 @@ void fuse_session_destroy(struct fuse_session *se)
 {
     if (se->got_init && !se->got_destroy) {
         if (se->op.destroy) {
-            se->op.destroy(se->userdata);
+            se->op.destroy(se->userdata, se);
         }
     }
     pthread_rwlock_destroy(&se->init_rwlock);
diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 0bf206264d..27b07bfc22 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -209,7 +209,7 @@ struct fuse_lowlevel_ops {
      *
      * @param userdata the user data passed to fuse_session_new()
      */
-    void (*destroy)(void *userdata);
+    void (*destroy)(void *userdata, struct fuse_session *se);
 
     /**
      * Look up a directory entry by name and get its attributes.
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 1a86378172..ed5b6c9e2d 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3125,7 +3125,7 @@ static void lo_lseek(fuse_req_t req, fuse_ino_t ino, off_t off, int whence,
     }
 }
 
-static void lo_destroy(void *userdata)
+static void lo_destroy(void *userdata, struct fuse_session *se)
 {
     struct lo_data *lo = (struct lo_data *)userdata;
 
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 17/26] DAX: virtiofsd: Perform an unmap on destroy
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (15 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 16/26] DAX: virtiofsd: route se down to destroy method Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Force unmap all remaining dax cache entries on a destroy.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/passthrough_ll.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index ed5b6c9e2d..600e102839 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3129,6 +3129,20 @@ static void lo_destroy(void *userdata, struct fuse_session *se)
 {
     struct lo_data *lo = (struct lo_data *)userdata;
 
+    if (fuse_lowlevel_is_virtio(se)) {
+        VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
+                                             sizeof(VhostUserFSSlaveMsgEntry));
+
+        msg->count = 0;
+        msg->entries[0].len = ~(uint64_t)0; /* Special: means 'all' */
+        msg->entries[0].c_offset = 0;
+        if (fuse_virtio_unmap(se, msg)) {
+            fuse_log(FUSE_LOG_ERR, "%s: unmap during destroy failed\n",
+                     __func__);
+        }
+        g_free(msg);
+    }
+
     pthread_mutex_lock(&lo->mutex);
     while (true) {
         GHashTableIter iter;
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (16 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 17/26] DAX: virtiofsd: Perform an unmap on destroy Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:12   ` Stefan Hajnoczi
  2021-05-06 15:16   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
                   ` (9 subsequent siblings)
  27 siblings, 2 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Define a new slave command 'VHOST_USER_SLAVE_FS_IO' for a
client to ask qemu to perform a read/write from an fd directly
to GPA.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/interop/vhost-user.rst               | 16 ++++
 hw/virtio/trace-events                    |  6 ++
 hw/virtio/vhost-user-fs.c                 | 95 +++++++++++++++++++++++
 hw/virtio/vhost-user.c                    |  5 ++
 include/hw/virtio/vhost-user-fs.h         |  2 +
 subprojects/libvhost-user/libvhost-user.h |  1 +
 6 files changed, 125 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 09aee3565d..2fa62ea451 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1453,6 +1453,22 @@ Slave message types
   multiple chunks can be unmapped in one command.
   A reply is generated indicating whether unmapping succeeded.
 
+``VHOST_USER_SLAVE_FS_IO``
+  :id: 9
+  :equivalent ioctl: N/A
+  :slave payload: ``struct VhostUserFSSlaveMsg``
+  :master payload: N/A
+
+  Requests that IO be performed directly from an fd, passed in ancillary
+  data, to guest memory on behalf of the daemon; this is normally for a
+  case where a memory region isn't visible to the daemon. slave payload
+  has flags which determine the direction of IO operation.
+
+  The ``VHOST_USER_FS_FLAG_MAP_R`` flag must be set in the ``flags`` field to
+  read from the file into RAM.
+  The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
+  write to the file from RAM.
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index c62727f879..20557a078e 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -53,6 +53,12 @@ vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRI
 vhost_vdpa_set_owner(void *dev) "dev: %p"
 vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
 
+# vhost-user-fs.c
+
+vhost_user_fs_slave_io_loop(const char *name, uint64_t owr, int is_ram, int is_romd, size_t size) "region %s with internal offset 0x%"PRIx64 " ram=%d romd=%d mrs.size=%zd"
+vhost_user_fs_slave_io_loop_res(ssize_t transferred) "%zd"
+vhost_user_fs_slave_io_exit(int res, size_t done) "res: %d done: %zd"
+
 # virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
 virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 963f694435..ee600ce968 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -23,6 +23,8 @@
 #include "hw/virtio/vhost-user-fs.h"
 #include "monitor/monitor.h"
 #include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
+#include "trace.h"
 
 static const int user_feature_bits[] = {
     VIRTIO_F_VERSION_1,
@@ -220,6 +222,99 @@ uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
     return (uint64_t)res;
 }
 
+uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
+                                VhostUserFSSlaveMsg *sm, int fd)
+{
+    VHostUserFS *fs = (VHostUserFS *)object_dynamic_cast(OBJECT(dev->vdev),
+                          TYPE_VHOST_USER_FS);
+    if (!fs) {
+        error_report("%s: Bad fs ptr", __func__);
+        return (uint64_t)-1;
+    }
+    if (!check_slave_message_entries(sm, message_size)) {
+        return (uint64_t)-1;
+    }
+
+    unsigned int i;
+    int res = 0;
+    size_t done = 0;
+
+    if (fd < 0) {
+        error_report("Bad fd for io");
+        return (uint64_t)-1;
+    }
+
+    for (i = 0; i < sm->count && !res; i++) {
+        VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
+        if (e->len == 0) {
+            continue;
+        }
+
+        size_t len = e->len;
+        uint64_t fd_offset = e->fd_offset;
+        hwaddr gpa = e->c_offset;
+
+        while (len && !res) {
+            hwaddr xlat, xlat_len;
+            bool is_write = e->flags & VHOST_USER_FS_FLAG_MAP_W;
+            MemoryRegion *mr = address_space_translate(dev->vdev->dma_as, gpa,
+                                                       &xlat, &xlat_len,
+                                                       is_write,
+                                                       MEMTXATTRS_UNSPECIFIED);
+            if (!mr || !xlat_len) {
+                error_report("No guest region found for 0x%" HWADDR_PRIx, gpa);
+                res = -EFAULT;
+                break;
+            }
+
+            trace_vhost_user_fs_slave_io_loop(mr->name,
+                                          (uint64_t)xlat,
+                                          memory_region_is_ram(mr),
+                                          memory_region_is_romd(mr),
+                                          (size_t)xlat_len);
+
+            void *hostptr = qemu_map_ram_ptr(mr->ram_block,
+                                             xlat);
+            ssize_t transferred;
+            if (e->flags & VHOST_USER_FS_FLAG_MAP_R) {
+                /* Read from file into RAM */
+                if (mr->readonly) {
+                    res = -EFAULT;
+                    break;
+                }
+                transferred = pread(fd, hostptr, xlat_len, fd_offset);
+            } else if (e->flags & VHOST_USER_FS_FLAG_MAP_W) {
+                /* Write into file from RAM */
+                transferred = pwrite(fd, hostptr, xlat_len, fd_offset);
+            } else {
+                transferred = EINVAL;
+            }
+
+            trace_vhost_user_fs_slave_io_loop_res(transferred);
+            if (transferred < 0) {
+                res = -errno;
+                break;
+            }
+            if (!transferred) {
+                /* EOF */
+                break;
+            }
+
+            done += transferred;
+            fd_offset += transferred;
+            gpa += transferred;
+            len -= transferred;
+        }
+    }
+    close(fd);
+
+    trace_vhost_user_fs_slave_io_exit(res, done);
+    if (res < 0) {
+        return (uint64_t)res;
+    }
+    return (uint64_t)done;
+}
+
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 7d9b0ad45d..58af28cb79 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -138,6 +138,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_VRING_ERR = 5,
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
+    VHOST_USER_SLAVE_FS_IO = 8,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -1563,6 +1564,10 @@ static gboolean slave_read(QIOChannel *ioc, GIOCondition condition,
     case VHOST_USER_SLAVE_FS_UNMAP:
         ret = vhost_user_fs_slave_unmap(dev, hdr.size, &payload.fs);
         break;
+    case VHOST_USER_SLAVE_FS_IO:
+        ret = vhost_user_fs_slave_io(dev, hdr.size, &payload.fs,
+                                     fd ? fd[0] : -1);
+        break;
 #endif
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 0766f17548..2931164e23 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -78,5 +78,7 @@ uint64_t vhost_user_fs_slave_map(struct vhost_dev *dev, int message_size,
                                  VhostUserFSSlaveMsg *sm, int fd);
 uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
                                    VhostUserFSSlaveMsg *sm);
+uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
+                                VhostUserFSSlaveMsg *sm, int fd);
 
 #endif /* _QEMU_VHOST_USER_FS_H */
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index a98c5f5c11..42b0833c4b 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -121,6 +121,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_VRING_ERR = 5,
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
+    VHOST_USER_SLAVE_FS_IO = 8,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (17 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 12:53   ` Dr. David Alan Gilbert
  2021-04-28 11:00 ` [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements Dr. David Alan Gilbert (git)
                   ` (8 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

Add a wrapper to send VHOST_USER_SLAVE_FS_IO commands and a
further wrapper for sending a fuse_buf write using the FS_IO
slave command.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_lowlevel.h | 25 +++++++++++++++++++
 tools/virtiofsd/fuse_virtio.c   | 43 +++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 27b07bfc22..757cdae49b 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -2013,4 +2013,29 @@ int64_t fuse_virtio_map(fuse_req_t req, VhostUserFSSlaveMsg *msg, int fd);
  */
 int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg);
 
+/**
+ * For use with virtio-fs; request IO directly to memory
+ *
+ * @param se The current session
+ * @param msg A set of IO requests
+ * @param fd The fd to map
+ * @return Length on success, negative errno on error
+ */
+int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
+                       int fd);
+
+/**
+ * For use with virtio-fs; wrapper for fuse_virtio_io for writes
+ * from memory to an fd
+ * @param req The request that triggered this action
+ * @param dst The destination (file) memory buffer
+ * @param dst_off Byte offset in the file
+ * @param src The source (memory) buffer
+ * @param src_off The GPA
+ * @param len Length in bytes
+ */
+ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
+                          size_t dst_off, const struct fuse_buf *src,
+                          size_t src_off, size_t len);
+
 #endif /* FUSE_LOWLEVEL_H_ */
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 85d90ca595..91317bade8 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -1141,3 +1141,46 @@ int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg)
     return vu_fs_cache_request(&se->virtio_dev->dev, VHOST_USER_SLAVE_FS_UNMAP,
                                -1, msg);
 }
+
+int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
+                       int fd)
+{
+    if (!se->virtio_dev) {
+        return -ENODEV;
+    }
+    return vu_fs_cache_request(&se->virtio_dev->dev, VHOST_USER_SLAVE_FS_IO,
+                               fd, msg);
+}
+
+/*
+ * Write to a file (dst) from an area of guest GPA (src) that probably
+ * isn't visible to the daemon.
+ */
+ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
+                          size_t dst_off, const struct fuse_buf *src,
+                          size_t src_off, size_t len)
+{
+    VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
+                                         sizeof(VhostUserFSSlaveMsgEntry));
+
+    msg->count = 1;
+
+    if (dst->flags & FUSE_BUF_FD_SEEK) {
+        msg->entries[0].fd_offset = dst->pos + dst_off;
+    } else {
+        off_t cur = lseek(dst->fd, 0, SEEK_CUR);
+        if (cur == (off_t)-1) {
+            g_free(msg);
+            return -errno;
+        }
+        msg->entries[0].fd_offset = cur;
+    }
+    msg->entries[0].c_offset = (uintptr_t)src->mem + src_off;
+    msg->entries[0].len = len;
+    msg->entries[0].flags = VHOST_USER_FS_FLAG_MAP_W;
+
+    int64_t result = fuse_virtio_io(req->se, msg, dst->fd);
+    fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);
+    g_free(msg);
+    return result;
+}
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (18 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:23   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads Dr. David Alan Gilbert (git)
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

For some read/writes the virtio queue elements are unmappable by
the daemon; these are cases where the data is to be read/written
from non-RAM.  In viritofs's case this is typically a direct read/write
into an mmap'd DAX file also on virtiofs (possibly on another instance).

When we receive a virtio queue element, check that we have enough
mappable data to handle the headers.  Make a note of the number of
unmappable 'in' entries (ie. for read data back to the VMM),
and flag the fuse_bufvec for 'out' entries with a new flag
FUSE_BUF_PHYS_ADDR.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
with fix by:
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 tools/virtiofsd/buffer.c      |   4 +-
 tools/virtiofsd/fuse_common.h |   7 ++
 tools/virtiofsd/fuse_virtio.c | 230 ++++++++++++++++++++++++----------
 3 files changed, 173 insertions(+), 68 deletions(-)

diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
index 874f01c488..1a050aa441 100644
--- a/tools/virtiofsd/buffer.c
+++ b/tools/virtiofsd/buffer.c
@@ -77,6 +77,7 @@ static ssize_t fuse_buf_write(const struct fuse_buf *dst, size_t dst_off,
     ssize_t res = 0;
     size_t copied = 0;
 
+    assert(!(src->flags & FUSE_BUF_PHYS_ADDR));
     while (len) {
         if (dst->flags & FUSE_BUF_FD_SEEK) {
             res = pwrite(dst->fd, (char *)src->mem + src_off, len,
@@ -272,7 +273,8 @@ ssize_t fuse_buf_copy(struct fuse_bufvec *dstv, struct fuse_bufvec *srcv)
      * process
      */
     for (i = 0; i < srcv->count; i++) {
-        if (srcv->buf[i].flags & FUSE_BUF_IS_FD) {
+        if ((srcv->buf[i].flags & FUSE_BUF_PHYS_ADDR) ||
+            (srcv->buf[i].flags & FUSE_BUF_IS_FD)) {
             break;
         }
     }
diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index fa9671872e..af43cf19f9 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -626,6 +626,13 @@ enum fuse_buf_flags {
      * detected.
      */
     FUSE_BUF_FD_RETRY = (1 << 3),
+
+    /**
+     * The addresses in the iovec represent guest physical addresses
+     * that can't be mapped by the daemon process.
+     * IO must be bounced back to the VMM to do it.
+     */
+    FUSE_BUF_PHYS_ADDR = (1 << 4),
 };
 
 /**
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 91317bade8..f8fd158bb2 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -49,6 +49,10 @@ typedef struct {
     VuVirtqElement elem;
     struct fuse_chan ch;
 
+    /* Number of unmappable iovecs */
+    unsigned bad_in_num;
+    unsigned bad_out_num;
+
     /* Used to complete requests that involve no reply */
     bool reply_sent;
 } FVRequest;
@@ -353,8 +357,10 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
 
     /* The 'in' part of the elem is to qemu */
     unsigned int in_num = elem->in_num;
+    unsigned int bad_in_num = req->bad_in_num;
     struct iovec *in_sg = elem->in_sg;
     size_t in_len = iov_size(in_sg, in_num);
+    size_t in_len_writeable = iov_size(in_sg, in_num - bad_in_num);
     fuse_log(FUSE_LOG_DEBUG, "%s: elem %d: with %d in desc of length %zd\n",
              __func__, elem->index, in_num, in_len);
 
@@ -362,7 +368,7 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
      * The elem should have room for a 'fuse_out_header' (out from fuse)
      * plus the data based on the len in the header.
      */
-    if (in_len < sizeof(struct fuse_out_header)) {
+    if (in_len_writeable < sizeof(struct fuse_out_header)) {
         fuse_log(FUSE_LOG_ERR, "%s: elem %d too short for out_header\n",
                  __func__, elem->index);
         ret = E2BIG;
@@ -389,7 +395,7 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
     memcpy(in_sg_cpy, in_sg, sizeof(struct iovec) * in_num);
     /* These get updated as we skip */
     struct iovec *in_sg_ptr = in_sg_cpy;
-    int in_sg_cpy_count = in_num;
+    int in_sg_cpy_count = in_num - bad_in_num;
 
     /* skip over parts of in_sg that contained the header iov */
     size_t skip_size = iov_len;
@@ -523,17 +529,21 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
 
     /* The 'out' part of the elem is from qemu */
     unsigned int out_num = elem->out_num;
+    unsigned int out_num_readable = out_num - req->bad_out_num;
     struct iovec *out_sg = elem->out_sg;
     size_t out_len = iov_size(out_sg, out_num);
+    size_t out_len_readable = iov_size(out_sg, out_num_readable);
     fuse_log(FUSE_LOG_DEBUG,
-             "%s: elem %d: with %d out desc of length %zd\n",
-             __func__, elem->index, out_num, out_len);
+             "%s: elem %d: with %d out desc of length %zd"
+             " bad_in_num=%u bad_out_num=%u\n",
+             __func__, elem->index, out_num, out_len, req->bad_in_num,
+             req->bad_out_num);
 
     /*
      * The elem should contain a 'fuse_in_header' (in to fuse)
      * plus the data based on the len in the header.
      */
-    if (out_len < sizeof(struct fuse_in_header)) {
+    if (out_len_readable < sizeof(struct fuse_in_header)) {
         fuse_log(FUSE_LOG_ERR, "%s: elem %d too short for in_header\n",
                  __func__, elem->index);
         assert(0); /* TODO */
@@ -544,80 +554,163 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
         assert(0); /* TODO */
     }
     /* Copy just the fuse_in_header and look at it */
-    copy_from_iov(&fbuf, out_num, out_sg,
+    copy_from_iov(&fbuf, out_num_readable, out_sg,
                   sizeof(struct fuse_in_header));
     memcpy(&inh, fbuf.mem, sizeof(struct fuse_in_header));
 
     pbufv = NULL; /* Compiler thinks an unitialised path */
-    if (inh.opcode == FUSE_WRITE &&
-        out_len >= (sizeof(struct fuse_in_header) +
-                    sizeof(struct fuse_write_in))) {
-        /*
-         * For a write we don't actually need to copy the
-         * data, we can just do it straight out of guest memory
-         * but we must still copy the headers in case the guest
-         * was nasty and changed them while we were using them.
-         */
-        fuse_log(FUSE_LOG_DEBUG, "%s: Write special case\n", __func__);
-
-        fbuf.size = copy_from_iov(&fbuf, out_num, out_sg,
-                                  sizeof(struct fuse_in_header) +
-                                  sizeof(struct fuse_write_in));
-        /* That copy reread the in_header, make sure we use the original */
-        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
-
-        /* Allocate the bufv, with space for the rest of the iov */
-        pbufv = malloc(sizeof(struct fuse_bufvec) +
-                       sizeof(struct fuse_buf) * out_num);
-        if (!pbufv) {
-            fuse_log(FUSE_LOG_ERR, "%s: pbufv malloc failed\n",
-                    __func__);
-            goto out;
-        }
+    if (req->bad_in_num || req->bad_out_num) {
+        bool handled_unmappable = false;
+
+        if (!req->bad_in_num &&
+            inh.opcode == FUSE_WRITE &&
+            out_len_readable >= (sizeof(struct fuse_in_header) +
+                                 sizeof(struct fuse_write_in))) {
+            handled_unmappable = true;
+
+            /* copy the fuse_write_in header after fuse_in_header */
+            fbuf.size = copy_from_iov(&fbuf, out_num_readable, out_sg,
+                                      sizeof(struct fuse_in_header) +
+                                      sizeof(struct fuse_write_in));
+            /* That copy reread the in_header, make sure we use the original */
+            memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
+
+            /* Allocate the bufv, with space for the rest of the iov */
+            pbufv = malloc(sizeof(struct fuse_bufvec) +
+                           sizeof(struct fuse_buf) * out_num);
+            if (!pbufv) {
+                fuse_log(FUSE_LOG_ERR, "%s: pbufv malloc failed\n",
+                        __func__);
+                goto out;
+            }
 
-        allocated_bufv = true;
-        pbufv->count = 1;
-        pbufv->buf[0] = fbuf;
+            allocated_bufv = true;
+            pbufv->count = 1;
+            pbufv->buf[0] = fbuf;
 
-        size_t iovindex, pbufvindex, iov_bytes_skip;
-        pbufvindex = 1; /* 2 headers, 1 fusebuf */
+            size_t iovindex, pbufvindex, iov_bytes_skip;
+            pbufvindex = 1; /* 2 headers, 1 fusebuf */
 
-        if (!skip_iov(out_sg, out_num,
-                      sizeof(struct fuse_in_header) +
-                      sizeof(struct fuse_write_in),
-                      &iovindex, &iov_bytes_skip)) {
-            fuse_log(FUSE_LOG_ERR, "%s: skip failed\n",
-                    __func__);
-            goto out;
-        }
+            if (!skip_iov(out_sg, out_num,
+                          sizeof(struct fuse_in_header) +
+                          sizeof(struct fuse_write_in),
+                          &iovindex, &iov_bytes_skip)) {
+                fuse_log(FUSE_LOG_ERR, "%s: skip failed\n",
+                        __func__);
+                goto out;
+            }
 
-        for (; iovindex < out_num; iovindex++, pbufvindex++) {
-            pbufv->count++;
-            pbufv->buf[pbufvindex].pos = ~0; /* Dummy */
-            pbufv->buf[pbufvindex].flags = 0;
-            pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
-            pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
-
-            if (iov_bytes_skip) {
-                pbufv->buf[pbufvindex].mem += iov_bytes_skip;
-                pbufv->buf[pbufvindex].size -= iov_bytes_skip;
-                iov_bytes_skip = 0;
+            for (; iovindex < out_num; iovindex++, pbufvindex++) {
+                pbufv->count++;
+                pbufv->buf[pbufvindex].pos = ~0; /* Dummy */
+                pbufv->buf[pbufvindex].flags =
+                    (iovindex < out_num_readable) ? 0 :
+                                                    FUSE_BUF_PHYS_ADDR;
+                pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
+                pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
+
+                if (iov_bytes_skip) {
+                    pbufv->buf[pbufvindex].mem += iov_bytes_skip;
+                    pbufv->buf[pbufvindex].size -= iov_bytes_skip;
+                    iov_bytes_skip = 0;
+                }
             }
         }
-    } else {
-        /* Normal (non fast write) path */
 
-        copy_from_iov(&fbuf, out_num, out_sg, se->bufsize);
-        /* That copy reread the in_header, make sure we use the original */
-        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
-        fbuf.size = out_len;
+        if (req->bad_in_num &&
+            inh.opcode == FUSE_READ &&
+            out_len_readable >=
+                (sizeof(struct fuse_in_header) + sizeof(struct fuse_read_in))) {
+            fuse_log(FUSE_LOG_DEBUG,
+                     "Unmappable read case "
+                     "in_num=%d bad_in_num=%d\n",
+                     elem->in_num, req->bad_in_num);
+            handled_unmappable = true;
+        }
+
+        if (!handled_unmappable) {
+            fuse_log(FUSE_LOG_ERR,
+                     "Unhandled unmappable element: out: %d(b:%d) in: "
+                     "%d(b:%d)",
+                     out_num, req->bad_out_num, elem->in_num, req->bad_in_num);
+            fv_panic(dev, "Unhandled unmappable element");
+        }
+    }
+
+    if (!req->bad_out_num) {
+        if (inh.opcode == FUSE_WRITE &&
+            out_len_readable >= (sizeof(struct fuse_in_header) +
+                                 sizeof(struct fuse_write_in))) {
+            /*
+             * For a write we don't actually need to copy the
+             * data, we can just do it straight out of guest memory
+             * but we must still copy the headers in case the guest
+             * was nasty and changed them while we were using them.
+             */
+            fuse_log(FUSE_LOG_DEBUG, "%s: Write special case\n",
+                     __func__);
+
+            fbuf.size = copy_from_iov(&fbuf, out_num, out_sg,
+                                      sizeof(struct fuse_in_header) +
+                                      sizeof(struct fuse_write_in));
+            /* That copy reread the in_header, make sure we use the original */
+            memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
+
+            /* Allocate the bufv, with space for the rest of the iov */
+            pbufv = malloc(sizeof(struct fuse_bufvec) +
+                           sizeof(struct fuse_buf) * out_num);
+            if (!pbufv) {
+                fuse_log(FUSE_LOG_ERR, "%s: pbufv malloc failed\n",
+                        __func__);
+                goto out;
+            }
+
+            allocated_bufv = true;
+            pbufv->count = 1;
+            pbufv->buf[0] = fbuf;
 
-        /* TODO! Endianness of header */
+            size_t iovindex, pbufvindex, iov_bytes_skip;
+            pbufvindex = 1; /* 2 headers, 1 fusebuf */
 
-        /* TODO: Add checks for fuse_session_exited */
-        bufv.buf[0] = fbuf;
-        bufv.count = 1;
-        pbufv = &bufv;
+            if (!skip_iov(out_sg, out_num,
+                          sizeof(struct fuse_in_header) +
+                          sizeof(struct fuse_write_in),
+                          &iovindex, &iov_bytes_skip)) {
+                fuse_log(FUSE_LOG_ERR, "%s: skip failed\n",
+                        __func__);
+                goto out;
+            }
+
+            for (; iovindex < out_num; iovindex++, pbufvindex++) {
+                pbufv->count++;
+                pbufv->buf[pbufvindex].pos = ~0; /* Dummy */
+                pbufv->buf[pbufvindex].flags = 0;
+                pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
+                pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
+
+                if (iov_bytes_skip) {
+                    pbufv->buf[pbufvindex].mem += iov_bytes_skip;
+                    pbufv->buf[pbufvindex].size -= iov_bytes_skip;
+                    iov_bytes_skip = 0;
+                }
+            }
+        } else {
+            /* Normal (non fast write) path */
+
+            /* Copy the rest of the buffer */
+            copy_from_iov(&fbuf, out_num, out_sg, se->bufsize);
+            /* That copy reread the in_header, make sure we use the original */
+            memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
+
+            fbuf.size = out_len;
+
+            /* TODO! Endianness of header */
+
+            /* TODO: Add checks for fuse_session_exited */
+            bufv.buf[0] = fbuf;
+            bufv.count = 1;
+            pbufv = &bufv;
+        }
     }
     pbufv->idx = 0;
     pbufv->off = 0;
@@ -732,13 +825,16 @@ static void *fv_queue_thread(void *opaque)
                  __func__, qi->qidx, (size_t)evalue, in_bytes, out_bytes);
 
         while (1) {
+            unsigned int bad_in_num = 0, bad_out_num = 0;
             FVRequest *req = vu_queue_pop(dev, q, sizeof(FVRequest),
-                                          NULL, NULL);
+                                          &bad_in_num, &bad_out_num);
             if (!req) {
                 break;
             }
 
             req->reply_sent = false;
+            req->bad_in_num = bad_in_num;
+            req->bad_out_num = bad_out_num;
 
             if (!se->thread_pool_size) {
                 req_list = g_list_prepend(req_list, req);
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (19 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:27   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command Dr. David Alan Gilbert (git)
                   ` (6 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When a read with unmappable buffers is found, map it to a slave
read command.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/fuse_virtio.c | 37 +++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index f8fd158bb2..c6ea2bd2a1 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -459,6 +459,43 @@ int virtio_send_data_iov(struct fuse_session *se, struct fuse_chan *ch,
         in_sg_left -= ret;
         len -= ret;
     } while (in_sg_left);
+
+    if (bad_in_num) {
+        /* TODO: Rework to send in fewer messages */
+        VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
+                                             sizeof(VhostUserFSSlaveMsgEntry));
+        while (len && bad_in_num) {
+            msg->count = 1;
+            msg->entries[0].flags = VHOST_USER_FS_FLAG_MAP_R;
+            msg->entries[0].fd_offset = buf->buf[0].pos;
+            msg->entries[0].c_offset =
+                (uint64_t)(uintptr_t)in_sg_ptr[0].iov_base;
+            msg->entries[0].len = in_sg_ptr[0].iov_len;
+            if (len < msg->entries[0].len) {
+                msg->entries[0].len = len;
+             }
+            int64_t req_res = fuse_virtio_io(se, msg, buf->buf[0].fd);
+            fuse_log(FUSE_LOG_DEBUG,
+                     "%s: bad loop; len=%zd bad_in_num=%d fd_offset=%jd "
+                     "c_offset=%p req_res=%" PRId64 "\n",
+                     __func__, len, bad_in_num, (intmax_t)(buf->buf[0].pos),
+                     in_sg_ptr[0].iov_base, req_res);
+            if (req_res > 0) {
+                len -= msg->entries[0].len;
+                buf->buf[0].pos += msg->entries[0].len;
+                in_sg_ptr++;
+                bad_in_num--;
+            } else if (req_res == 0) {
+                break;
+            } else {
+                ret = req_res;
+                free(in_sg_cpy);
+                g_free(msg);
+                goto err;
+            }
+        }
+        g_free(msg);
+    }
     free(in_sg_cpy);
 
     /* Need to fix out->len on EOF */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (20 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:28   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 23/26] DAX:virtiofsd: implement FUSE_INIT map_alignment field Dr. David Alan Gilbert (git)
                   ` (5 subsequent siblings)
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

When a fuse_buf_copy is performed on an element with FUSE_BUF_PHYS_ADDR
route it to a fuse_virtio_write request that does a slave command to
perform the write.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 tools/virtiofsd/buffer.c         | 14 +++++++++++---
 tools/virtiofsd/fuse_common.h    |  6 +++++-
 tools/virtiofsd/fuse_lowlevel.h  |  3 ---
 tools/virtiofsd/passthrough_ll.c |  2 +-
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
index 1a050aa441..8135d52d2a 100644
--- a/tools/virtiofsd/buffer.c
+++ b/tools/virtiofsd/buffer.c
@@ -200,13 +200,20 @@ static ssize_t fuse_buf_fd_to_fd(const struct fuse_buf *dst, size_t dst_off,
     return copied;
 }
 
-static ssize_t fuse_buf_copy_one(const struct fuse_buf *dst, size_t dst_off,
+static ssize_t fuse_buf_copy_one(fuse_req_t req,
+                                 const struct fuse_buf *dst, size_t dst_off,
                                  const struct fuse_buf *src, size_t src_off,
                                  size_t len)
 {
     int src_is_fd = src->flags & FUSE_BUF_IS_FD;
     int dst_is_fd = dst->flags & FUSE_BUF_IS_FD;
+    int src_is_phys = src->flags & FUSE_BUF_PHYS_ADDR;
+    int dst_is_phys = src->flags & FUSE_BUF_PHYS_ADDR;
 
+    if (src_is_phys && !src_is_fd && dst_is_fd) {
+        return fuse_virtio_write(req, dst, dst_off, src, src_off, len);
+    }
+    assert(!src_is_phys && !dst_is_phys);
     if (!src_is_fd && !dst_is_fd) {
         char *dstmem = (char *)dst->mem + dst_off;
         char *srcmem = (char *)src->mem + src_off;
@@ -259,7 +266,8 @@ static int fuse_bufvec_advance(struct fuse_bufvec *bufv, size_t len)
     return 1;
 }
 
-ssize_t fuse_buf_copy(struct fuse_bufvec *dstv, struct fuse_bufvec *srcv)
+ssize_t fuse_buf_copy(fuse_req_t req, struct fuse_bufvec *dstv,
+                      struct fuse_bufvec *srcv)
 {
     size_t copied = 0, i;
 
@@ -301,7 +309,7 @@ ssize_t fuse_buf_copy(struct fuse_bufvec *dstv, struct fuse_bufvec *srcv)
         dst_len = dst->size - dstv->off;
         len = min_size(src_len, dst_len);
 
-        res = fuse_buf_copy_one(dst, dstv->off, src, srcv->off, len);
+        res = fuse_buf_copy_one(req, dst, dstv->off, src, srcv->off, len);
         if (res < 0) {
             if (!copied) {
                 return res;
diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index af43cf19f9..beed03aa93 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -510,6 +510,8 @@ struct fuse_conn_info {
 struct fuse_session;
 struct fuse_pollhandle;
 struct fuse_conn_info_opts;
+struct fuse_req;
+typedef struct fuse_req *fuse_req_t;
 
 /**
  * This function parses several command-line options that can be used
@@ -728,11 +730,13 @@ size_t fuse_buf_size(const struct fuse_bufvec *bufv);
 /**
  * Copy data from one buffer vector to another
  *
+ * @param req The request this copy is part of
  * @param dst destination buffer vector
  * @param src source buffer vector
  * @return actual number of bytes copied or -errno on error
  */
-ssize_t fuse_buf_copy(struct fuse_bufvec *dst, struct fuse_bufvec *src);
+ssize_t fuse_buf_copy(fuse_req_t req,
+                      struct fuse_bufvec *dst, struct fuse_bufvec *src);
 
 /**
  * Memory buffer iterator
diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 757cdae49b..24e580aafe 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -42,9 +42,6 @@
 /** Inode number type */
 typedef uint64_t fuse_ino_t;
 
-/** Request pointer type */
-typedef struct fuse_req *fuse_req_t;
-
 /**
  * Session
  *
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 600e102839..c5b8a1f5b1 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2301,7 +2301,7 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
         }
     }
 
-    res = fuse_buf_copy(&out_buf, in_buf);
+    res = fuse_buf_copy(req, &out_buf, in_buf);
     if (res < 0) {
         fuse_reply_err(req, -res);
     } else {
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 23/26] DAX:virtiofsd: implement FUSE_INIT map_alignment field
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (21 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:00 ` [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info Dr. David Alan Gilbert (git)
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: Stefan Hajnoczi <stefanha@redhat.com>

Communicate the host page size to the FUSE client so that
FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING requests are aware of our alignment
constraints.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tools/virtiofsd/fuse_lowlevel.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 6930574aaf..50fc5c8d5a 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -10,6 +10,7 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/host-utils.h"
 #include "fuse_i.h"
 #include "standard-headers/linux/fuse.h"
 #include "fuse_misc.h"
@@ -2194,6 +2195,12 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
     outarg.max_background = se->conn.max_background;
     outarg.congestion_threshold = se->conn.congestion_threshold;
     outarg.time_gran = se->conn.time_gran;
+    if (arg->flags & FUSE_MAP_ALIGNMENT) {
+        outarg.flags |= FUSE_MAP_ALIGNMENT;
+
+        /* This constraint comes from mmap(2) and munmap(2) */
+        outarg.map_alignment = ctz64(sysconf(_SC_PAGE_SIZE));
+    }
 
     if (se->conn.want & FUSE_CAP_HANDLE_KILLPRIV_V2) {
         outarg.flags |= FUSE_HANDLE_KILLPRIV_V2;
@@ -2207,6 +2214,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
     fuse_log(FUSE_LOG_DEBUG, "   congestion_threshold=%i\n",
              outarg.congestion_threshold);
     fuse_log(FUSE_LOG_DEBUG, "   time_gran=%u\n", outarg.time_gran);
+    fuse_log(FUSE_LOG_DEBUG, "   map_alignment=%u\n", outarg.map_alignment);
 
     send_reply_ok(req, &outarg, outargsize);
 }
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (22 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 23/26] DAX:virtiofsd: implement FUSE_INIT map_alignment field Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:31   ` Stefan Hajnoczi
  2021-05-06 15:32   ` Stefan Hajnoczi
  2021-04-28 11:00 ` [PATCH v3 25/26] vhost-user-fs: Implement drop CAP_FSETID functionality Dr. David Alan Gilbert (git)
                   ` (3 subsequent siblings)
  27 siblings, 2 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: Vivek Goyal <vgoyal@redhat.com>

Extend VhostUserFSSlaveMsg so that slave can ask it to drop CAP_FSETID
before doing I/O on fd.

In some cases, virtiofsd takes the onus of clearing setuid bit on a file
when WRITE happens. Generally virtiofsd does the WRITE to fd (from guest
memory which is mapped in virtiofsd as well), but if this memory is
unmappable in virtiofsd (like cache window), then virtiofsd asks qemu
to do the I/O instead.

To retain the capability to drop suid bit on write, qemu needs to
drop the CAP_FSETID as well before write to fd. Extend VhostUserFSSlaveMsg
so that virtiofsd can specify in message if CAP_FSETID needs to be
dropped.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 hw/virtio/vhost-user-fs.c                 | 5 +++++
 include/hw/virtio/vhost-user-fs.h         | 6 ++++++
 subprojects/libvhost-user/libvhost-user.h | 6 ++++++
 3 files changed, 17 insertions(+)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ee600ce968..036ca17767 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -244,6 +244,11 @@ uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
         return (uint64_t)-1;
     }
 
+    if (sm->flags & VHOST_USER_FS_GENFLAG_DROP_FSETID) {
+        error_report("Dropping CAP_FSETID is not supported");
+        return (uint64_t)-ENOTSUP;
+    }
+
     for (i = 0; i < sm->count && !res; i++) {
         VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
         if (e->len == 0) {
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 2931164e23..bcd797c0cc 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -30,6 +30,10 @@ OBJECT_DECLARE_SIMPLE_TYPE(VHostUserFS, VHOST_USER_FS)
 #define VHOST_USER_FS_FLAG_MAP_R (1u << 0)
 #define VHOST_USER_FS_FLAG_MAP_W (1u << 1)
 
+/* Generic flags for the overall message and not individual ranges */
+/* Drop capability CAP_FSETID during the operation */
+#define VHOST_USER_FS_GENFLAG_DROP_FSETID (1u << 0)
+
 typedef struct {
     /* Offsets within the file being mapped */
     uint64_t fd_offset;
@@ -42,6 +46,8 @@ typedef struct {
 } VhostUserFSSlaveMsgEntry;
 
 typedef struct {
+    /* Generic flags for the overall message */
+    uint32_t flags;
     /* Number of entries */
     uint16_t count;
     /* Spare */
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index 42b0833c4b..4dba4321f4 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -132,6 +132,10 @@ typedef enum VhostUserSlaveRequest {
 #define VHOST_USER_FS_FLAG_MAP_R (1u << 0)
 #define VHOST_USER_FS_FLAG_MAP_W (1u << 1)
 
+/* Generic flags for the overall message and not individual ranges */
+/* Drop capability CAP_FSETID during the operation */
+#define VHOST_USER_FS_GENFLAG_DROP_FSETID (1u << 0)
+
 typedef struct {
     /* Offsets within the file being mapped */
     uint64_t fd_offset;
@@ -144,6 +148,8 @@ typedef struct {
 } VhostUserFSSlaveMsgEntry;
 
 typedef struct {
+    /* Generic flags for the overall message */
+    uint32_t flags;
     /* Number of entries */
     uint16_t count;
     /* Spare */
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 25/26] vhost-user-fs: Implement drop CAP_FSETID functionality
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (23 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info Dr. David Alan Gilbert (git)
@ 2021-04-28 11:00 ` Dr. David Alan Gilbert (git)
  2021-04-28 11:01 ` [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it Dr. David Alan Gilbert (git)
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:00 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: Vivek Goyal <vgoyal@redhat.com>

As part of slave_io message, slave can ask to do I/O on an fd. Additionally
slave can ask for dropping CAP_FSETID (if master has it) before doing I/O.
Implement functionality to drop CAP_FSETID and gain it back after the
operation.

This also creates a dependency on libcap-ng.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 hw/virtio/meson.build     |  1 +
 hw/virtio/vhost-user-fs.c | 92 ++++++++++++++++++++++++++++++++++++++-
 meson.build               |  6 +++
 3 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..bdcdc82e13 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -18,6 +18,7 @@ virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_CRYPTO', if_true: files('virtio-crypto.c'))
 virtio_ss.add(when: ['CONFIG_VIRTIO_CRYPTO', 'CONFIG_VIRTIO_PCI'], if_true: files('virtio-crypto-pci.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER_FS', if_true: files('vhost-user-fs.c'))
+virtio_ss.add(when: 'CONFIG_VHOST_USER_FS', if_true: libcap_ng)
 virtio_ss.add(when: ['CONFIG_VHOST_USER_FS', 'CONFIG_VIRTIO_PCI'], if_true: files('vhost-user-fs-pci.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_PMEM', if_true: files('virtio-pmem.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VSOCK', if_true: files('vhost-vsock.c', 'vhost-vsock-common.c'))
diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 036ca17767..7afd9495c9 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -13,6 +13,8 @@
 
 #include "qemu/osdep.h"
 #include <sys/ioctl.h>
+#include <cap-ng.h>
+#include <sys/syscall.h>
 #include "standard-headers/linux/virtio_fs.h"
 #include "qapi/error.h"
 #include "hw/qdev-properties.h"
@@ -91,6 +93,84 @@ static bool check_slave_message_entries(const VhostUserFSSlaveMsg *sm,
     return true;
 }
 
+/*
+ * Helpers for dropping and regaining effective capabilities. Returns 0
+ * on success, error otherwise
+ */
+static int drop_effective_cap(const char *cap_name, bool *cap_dropped)
+{
+    int cap, ret;
+
+    cap = capng_name_to_capability(cap_name);
+    if (cap < 0) {
+        ret = -errno;
+        error_report("capng_name_to_capability(%s) failed:%s", cap_name,
+                     strerror(errno));
+        goto out;
+    }
+
+    if (capng_get_caps_process()) {
+        ret = -errno;
+        error_report("capng_get_caps_process() failed:%s", strerror(errno));
+        goto out;
+    }
+
+    /* We dont have this capability in effective set already. */
+    if (!capng_have_capability(CAPNG_EFFECTIVE, cap)) {
+        ret = 0;
+        goto out;
+    }
+
+    if (capng_update(CAPNG_DROP, CAPNG_EFFECTIVE, cap)) {
+        ret = -errno;
+        error_report("capng_update(DROP,) failed");
+        goto out;
+    }
+    if (capng_apply(CAPNG_SELECT_CAPS)) {
+        ret = -errno;
+        error_report("drop:capng_apply() failed");
+        goto out;
+    }
+
+    ret = 0;
+    if (cap_dropped) {
+        *cap_dropped = true;
+    }
+
+out:
+    return ret;
+}
+
+static int gain_effective_cap(const char *cap_name)
+{
+    int cap;
+    int ret = 0;
+
+    cap = capng_name_to_capability(cap_name);
+    if (cap < 0) {
+        ret = -errno;
+        error_report("capng_name_to_capability(%s) failed:%s", cap_name,
+                     strerror(errno));
+        goto out;
+    }
+
+    if (capng_update(CAPNG_ADD, CAPNG_EFFECTIVE, cap)) {
+        ret = -errno;
+        error_report("capng_update(ADD,) failed");
+        goto out;
+    }
+
+    if (capng_apply(CAPNG_SELECT_CAPS)) {
+        ret = -errno;
+        error_report("gain:capng_apply() failed");
+        goto out;
+    }
+    ret = 0;
+
+out:
+    return ret;
+}
+
 uint64_t vhost_user_fs_slave_map(struct vhost_dev *dev, int message_size,
                                  VhostUserFSSlaveMsg *sm, int fd)
 {
@@ -238,6 +318,7 @@ uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
     unsigned int i;
     int res = 0;
     size_t done = 0;
+    bool cap_fsetid_dropped = false;
 
     if (fd < 0) {
         error_report("Bad fd for io");
@@ -245,8 +326,10 @@ uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
     }
 
     if (sm->flags & VHOST_USER_FS_GENFLAG_DROP_FSETID) {
-        error_report("Dropping CAP_FSETID is not supported");
-        return (uint64_t)-ENOTSUP;
+        res = drop_effective_cap("FSETID", &cap_fsetid_dropped);
+        if (res != 0) {
+            return (uint64_t)res;
+        }
     }
 
     for (i = 0; i < sm->count && !res; i++) {
@@ -313,6 +396,11 @@ uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
     }
     close(fd);
 
+    if (cap_fsetid_dropped) {
+        if (gain_effective_cap("FSETID")) {
+            error_report("Failed to gain CAP_FSETID");
+        }
+    }
     trace_vhost_user_fs_slave_io_exit(res, done);
     if (res < 0) {
         return (uint64_t)res;
diff --git a/meson.build b/meson.build
index c6f4b0cf5e..71899d0993 100644
--- a/meson.build
+++ b/meson.build
@@ -1081,6 +1081,12 @@ elif get_option('virtfs').disabled()
   have_virtfs = false
 endif
 
+if config_host.has_key('CONFIG_VHOST_USER_FS')
+  if not libcap_ng.found()
+    error('vhost-user-fs requires libcap-ng-devel')
+  endif
+endif
+
 config_host_data.set_quoted('CONFIG_BINDIR', get_option('prefix') / get_option('bindir'))
 config_host_data.set_quoted('CONFIG_PREFIX', get_option('prefix'))
 config_host_data.set_quoted('CONFIG_QEMU_CONFDIR', get_option('prefix') / qemu_confdir)
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (24 preceding siblings ...)
  2021-04-28 11:00 ` [PATCH v3 25/26] vhost-user-fs: Implement drop CAP_FSETID functionality Dr. David Alan Gilbert (git)
@ 2021-04-28 11:01 ` Dr. David Alan Gilbert (git)
  2021-05-06 15:37   ` Stefan Hajnoczi
  2021-04-28 11:27 ` [PATCH v3 00/26] virtiofs dax patches no-reply
  2021-05-06 15:37 ` Stefan Hajnoczi
  27 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2021-04-28 11:01 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

From: Vivek Goyal <vgoyal@redhat.com>

If qemu guest asked to drop CAP_FSETID upon write, send that info
to qemu in SLAVE_FS_IO message so that qemu can drop capability
before WRITE. This is to make sure that any setuid bit is killed
on fd (if there is one set).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 tools/virtiofsd/buffer.c         | 10 ++++++----
 tools/virtiofsd/fuse_common.h    |  6 +++++-
 tools/virtiofsd/fuse_lowlevel.h  |  6 +++++-
 tools/virtiofsd/fuse_virtio.c    |  5 ++++-
 tools/virtiofsd/passthrough_ll.c |  2 +-
 5 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
index 8135d52d2a..b4cda7db9a 100644
--- a/tools/virtiofsd/buffer.c
+++ b/tools/virtiofsd/buffer.c
@@ -203,7 +203,7 @@ static ssize_t fuse_buf_fd_to_fd(const struct fuse_buf *dst, size_t dst_off,
 static ssize_t fuse_buf_copy_one(fuse_req_t req,
                                  const struct fuse_buf *dst, size_t dst_off,
                                  const struct fuse_buf *src, size_t src_off,
-                                 size_t len)
+                                 size_t len, bool dropped_cap_fsetid)
 {
     int src_is_fd = src->flags & FUSE_BUF_IS_FD;
     int dst_is_fd = dst->flags & FUSE_BUF_IS_FD;
@@ -211,7 +211,8 @@ static ssize_t fuse_buf_copy_one(fuse_req_t req,
     int dst_is_phys = src->flags & FUSE_BUF_PHYS_ADDR;
 
     if (src_is_phys && !src_is_fd && dst_is_fd) {
-        return fuse_virtio_write(req, dst, dst_off, src, src_off, len);
+        return fuse_virtio_write(req, dst, dst_off, src, src_off, len,
+                                 dropped_cap_fsetid);
     }
     assert(!src_is_phys && !dst_is_phys);
     if (!src_is_fd && !dst_is_fd) {
@@ -267,7 +268,7 @@ static int fuse_bufvec_advance(struct fuse_bufvec *bufv, size_t len)
 }
 
 ssize_t fuse_buf_copy(fuse_req_t req, struct fuse_bufvec *dstv,
-                      struct fuse_bufvec *srcv)
+                      struct fuse_bufvec *srcv, bool dropped_cap_fsetid)
 {
     size_t copied = 0, i;
 
@@ -309,7 +310,8 @@ ssize_t fuse_buf_copy(fuse_req_t req, struct fuse_bufvec *dstv,
         dst_len = dst->size - dstv->off;
         len = min_size(src_len, dst_len);
 
-        res = fuse_buf_copy_one(req, dst, dstv->off, src, srcv->off, len);
+        res = fuse_buf_copy_one(req, dst, dstv->off, src, srcv->off, len,
+                                dropped_cap_fsetid);
         if (res < 0) {
             if (!copied) {
                 return res;
diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index beed03aa93..8a75729be9 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -733,10 +733,14 @@ size_t fuse_buf_size(const struct fuse_bufvec *bufv);
  * @param req The request this copy is part of
  * @param dst destination buffer vector
  * @param src source buffer vector
+ * @param dropped_cap_fsetid Caller has dropped CAP_FSETID. If work is handed
+ *        over to a different thread/process, CAP_FSETID needs to be dropped
+ *        there as well.
  * @return actual number of bytes copied or -errno on error
  */
 ssize_t fuse_buf_copy(fuse_req_t req,
-                      struct fuse_bufvec *dst, struct fuse_bufvec *src);
+                      struct fuse_bufvec *dst, struct fuse_bufvec *src,
+                      bool dropped_cap_fsetid);
 
 /**
  * Memory buffer iterator
diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
index 24e580aafe..dfd7e1525c 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -2030,9 +2030,13 @@ int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
  * @param src The source (memory) buffer
  * @param src_off The GPA
  * @param len Length in bytes
+ * @param dropped_cap_fsetid Caller dropped CAP_FSETID. If it is being handed
+ *        over to different thread/process, CAP_FSETID needs to be dropped
+ *        before write.
  */
 ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
                           size_t dst_off, const struct fuse_buf *src,
-                          size_t src_off, size_t len);
+                          size_t src_off, size_t len,
+                          bool dropped_cap_fsetid);
 
 #endif /* FUSE_LOWLEVEL_H_ */
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index c6ea2bd2a1..9f3d38942a 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -1291,7 +1291,7 @@ int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
  */
 ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
                           size_t dst_off, const struct fuse_buf *src,
-                          size_t src_off, size_t len)
+                          size_t src_off, size_t len, bool dropped_cap_fsetid)
 {
     VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
                                          sizeof(VhostUserFSSlaveMsgEntry));
@@ -1311,6 +1311,9 @@ ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
     msg->entries[0].c_offset = (uintptr_t)src->mem + src_off;
     msg->entries[0].len = len;
     msg->entries[0].flags = VHOST_USER_FS_FLAG_MAP_W;
+    if (dropped_cap_fsetid) {
+        msg->flags |= VHOST_USER_FS_GENFLAG_DROP_FSETID;
+    }
 
     int64_t result = fuse_virtio_io(req->se, msg, dst->fd);
     fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index c5b8a1f5b1..b76d878509 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2301,7 +2301,7 @@ static void lo_write_buf(fuse_req_t req, fuse_ino_t ino,
         }
     }
 
-    res = fuse_buf_copy(req, &out_buf, in_buf);
+    res = fuse_buf_copy(req, &out_buf, in_buf, fi->kill_priv);
     if (res < 0) {
         fuse_reply_err(req, -res);
     } else {
-- 
2.31.1



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 00/26] virtiofs dax patches
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (25 preceding siblings ...)
  2021-04-28 11:01 ` [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it Dr. David Alan Gilbert (git)
@ 2021-04-28 11:27 ` no-reply
  2021-05-06 15:37 ` Stefan Hajnoczi
  27 siblings, 0 replies; 66+ messages in thread
From: no-reply @ 2021-04-28 11:27 UTC (permalink / raw)
  To: dgilbert; +Cc: virtio-fs, stefanha, qemu-devel, vgoyal, groug

Patchew URL: https://patchew.org/QEMU/20210428110100.27757-1-dgilbert@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210428110100.27757-1-dgilbert@redhat.com
Subject: [PATCH v3 00/26] virtiofs dax patches

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20210415102321.3987935-1-philmd@redhat.com -> patchew/20210415102321.3987935-1-philmd@redhat.com
 - [tag update]      patchew/20210422145335.65814-1-mreitz@redhat.com -> patchew/20210422145335.65814-1-mreitz@redhat.com
 - [tag update]      patchew/20210427135147.111218-1-lvivier@redhat.com -> patchew/20210427135147.111218-1-lvivier@redhat.com
 - [tag update]      patchew/20210427192658.266933-1-f4bug@amsat.org -> patchew/20210427192658.266933-1-f4bug@amsat.org
 * [new tag]         patchew/20210428110100.27757-1-dgilbert@redhat.com -> patchew/20210428110100.27757-1-dgilbert@redhat.com
Switched to a new branch 'test'
ccf0714 virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
dae0067 vhost-user-fs: Implement drop CAP_FSETID functionality
addc004 vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info
bf4ce2a DAX:virtiofsd: implement FUSE_INIT map_alignment field
188f074 DAX/unmap virtiofsd: route unmappable write to slave command
9bcd730 DAX/unmap virtiofsd: Route unmappable reads
b95bb5c DAX/unmap virtiofsd: Parse unmappable elements
e98bda0 DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO
f8bc115 DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
e563a49 DAX: virtiofsd: Perform an unmap on destroy
4977fed DAX: virtiofsd: route se down to destroy method
e86d999 DAX: virtiofsd: Make lo_removemapping() work
b43ed18 DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping
80b6613 DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll
497a551 DAX: virtiofsd: Add setup/remove mappings fuse commands
a68ac68 DAX: virtiofsd Add cache accessor functions
47fd841 DAX: virtio-fs: Fill in slave commands for mapping
e22b4f2 DAX: virtio-fs: Add vhost-user slave commands for mapping
43cefc1 DAX: virtio-fs: Add cache BAR
ee3d0da DAX: virtio: Add shared memory capability
9532b91 DAX subprojects/libvhost-user: Add virtio-fs slave types
7a510ed DAX: libvhost-user: Allow popping a queue element with bad pointers
79824d5 DAX: libvhost-user: Route slave message payload
4d503d0 DAX: vhost-user: Rework slave return values
d18fd97 virtiofsd: Don't assume header layout
301ba24 virtiofs: Fixup printf args

=== OUTPUT BEGIN ===
1/26 Checking commit 301ba247ecca (virtiofs: Fixup printf args)
2/26 Checking commit d18fd977f0c5 (virtiofsd: Don't assume header layout)
3/26 Checking commit 4d503d03bfb6 (DAX: vhost-user: Rework slave return values)
4/26 Checking commit 79824d57ac9c (DAX: libvhost-user: Route slave message payload)
5/26 Checking commit 7a510edf1169 (DAX: libvhost-user: Allow popping a queue element with bad pointers)
6/26 Checking commit 9532b917fec4 (DAX subprojects/libvhost-user: Add virtio-fs slave types)
7/26 Checking commit ee3d0daebc76 (DAX: virtio: Add shared memory capability)
8/26 Checking commit 43cefc18bd9b (DAX: virtio-fs: Add cache BAR)
9/26 Checking commit e22b4f256a22 (DAX: virtio-fs: Add vhost-user slave commands for mapping)
10/26 Checking commit 47fd84136d6f (DAX: virtio-fs: Fill in slave commands for mapping)
11/26 Checking commit a68ac68dcf41 (DAX: virtiofsd Add cache accessor functions)
12/26 Checking commit 497a5518a931 (DAX: virtiofsd: Add setup/remove mappings fuse commands)
13/26 Checking commit 80b661353cd4 (DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll)
14/26 Checking commit b43ed18535c4 (DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping)
15/26 Checking commit e86d999fc18c (DAX: virtiofsd: Make lo_removemapping() work)
16/26 Checking commit 4977fed192e3 (DAX: virtiofsd: route se down to destroy method)
17/26 Checking commit e563a49d4b2a (DAX: virtiofsd: Perform an unmap on destroy)
18/26 Checking commit f8bc115092d8 (DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO)
19/26 Checking commit e98bda0c581d (DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO)
ERROR: unnecessary whitespace before a quoted newline
#100: FILE: tools/virtiofsd/fuse_virtio.c:1183:
+    fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);

total: 1 errors, 0 warnings, 75 lines checked

Patch 19/26 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

20/26 Checking commit b95bb5c22789 (DAX/unmap virtiofsd: Parse unmappable elements)
21/26 Checking commit 9bcd730dfe66 (DAX/unmap virtiofsd: Route unmappable reads)
22/26 Checking commit 188f074554d3 (DAX/unmap virtiofsd: route unmappable write to slave command)
23/26 Checking commit bf4ce2af375c (DAX:virtiofsd: implement FUSE_INIT map_alignment field)
24/26 Checking commit addc0047af4a (vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info)
25/26 Checking commit dae00676a878 (vhost-user-fs: Implement drop CAP_FSETID functionality)
26/26 Checking commit ccf0714484af (virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it)
ERROR: unnecessary whitespace before a quoted newline
#125: FILE: tools/virtiofsd/fuse_virtio.c:1319:
     fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);

total: 1 errors, 0 warnings, 88 lines checked

Patch 26/26 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20210428110100.27757-1-dgilbert@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO
  2021-04-28 11:00 ` [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
@ 2021-04-28 12:53   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-04-28 12:53 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a wrapper to send VHOST_USER_SLAVE_FS_IO commands and a
> further wrapper for sending a fuse_buf write using the FS_IO
> slave command.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tools/virtiofsd/fuse_lowlevel.h | 25 +++++++++++++++++++
>  tools/virtiofsd/fuse_virtio.c   | 43 +++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
> index 27b07bfc22..757cdae49b 100644
> --- a/tools/virtiofsd/fuse_lowlevel.h
> +++ b/tools/virtiofsd/fuse_lowlevel.h
> @@ -2013,4 +2013,29 @@ int64_t fuse_virtio_map(fuse_req_t req, VhostUserFSSlaveMsg *msg, int fd);
>   */
>  int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg);
>  
> +/**
> + * For use with virtio-fs; request IO directly to memory
> + *
> + * @param se The current session
> + * @param msg A set of IO requests
> + * @param fd The fd to map
> + * @return Length on success, negative errno on error
> + */
> +int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
> +                       int fd);
> +
> +/**
> + * For use with virtio-fs; wrapper for fuse_virtio_io for writes
> + * from memory to an fd
> + * @param req The request that triggered this action
> + * @param dst The destination (file) memory buffer
> + * @param dst_off Byte offset in the file
> + * @param src The source (memory) buffer
> + * @param src_off The GPA
> + * @param len Length in bytes
> + */
> +ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
> +                          size_t dst_off, const struct fuse_buf *src,
> +                          size_t src_off, size_t len);
> +
>  #endif /* FUSE_LOWLEVEL_H_ */
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 85d90ca595..91317bade8 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -1141,3 +1141,46 @@ int64_t fuse_virtio_unmap(struct fuse_session *se, VhostUserFSSlaveMsg *msg)
>      return vu_fs_cache_request(&se->virtio_dev->dev, VHOST_USER_SLAVE_FS_UNMAP,
>                                 -1, msg);
>  }
> +
> +int64_t fuse_virtio_io(struct fuse_session *se, VhostUserFSSlaveMsg *msg,
> +                       int fd)
> +{
> +    if (!se->virtio_dev) {
> +        return -ENODEV;
> +    }
> +    return vu_fs_cache_request(&se->virtio_dev->dev, VHOST_USER_SLAVE_FS_IO,
> +                               fd, msg);
> +}
> +
> +/*
> + * Write to a file (dst) from an area of guest GPA (src) that probably
> + * isn't visible to the daemon.
> + */
> +ssize_t fuse_virtio_write(fuse_req_t req, const struct fuse_buf *dst,
> +                          size_t dst_off, const struct fuse_buf *src,
> +                          size_t src_off, size_t len)
> +{
> +    VhostUserFSSlaveMsg *msg = g_malloc0(sizeof(VhostUserFSSlaveMsg) +
> +                                         sizeof(VhostUserFSSlaveMsgEntry));
> +
> +    msg->count = 1;
> +
> +    if (dst->flags & FUSE_BUF_FD_SEEK) {
> +        msg->entries[0].fd_offset = dst->pos + dst_off;
> +    } else {
> +        off_t cur = lseek(dst->fd, 0, SEEK_CUR);
> +        if (cur == (off_t)-1) {
> +            g_free(msg);
> +            return -errno;
> +        }
> +        msg->entries[0].fd_offset = cur;
> +    }
> +    msg->entries[0].c_offset = (uintptr_t)src->mem + src_off;
> +    msg->entries[0].len = len;
> +    msg->entries[0].flags = VHOST_USER_FS_FLAG_MAP_W;
> +
> +    int64_t result = fuse_virtio_io(req->se, msg, dst->fd);
> +    fuse_log(FUSE_LOG_DEBUG, "%s: result=%" PRId64 " \n", __func__, result);

Oops, as the bot spotted, there's an unneeded space before the \n, I'll
sweep that out.

Dave

> +    g_free(msg);
> +    return result;
> +}
> -- 
> 2.31.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types
  2021-04-28 11:00 ` [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types Dr. David Alan Gilbert (git)
@ 2021-04-29 15:48   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-04-29 15:48 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug, chirantan; +Cc: virtio-fs

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add virtio-fs definitions to libvhost-user
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

I'm going to need to rework this

> +/* Structures carried over the slave channel back to QEMU */
> +#define VHOST_USER_FS_SLAVE_MAX_ENTRIES 32
> +
> +/* For the flags field of VhostUserFSSlaveMsg */
> +#define VHOST_USER_FS_FLAG_MAP_R (1u << 0)
> +#define VHOST_USER_FS_FLAG_MAP_W (1u << 1)
> +
> +typedef struct {
> +    /* Offsets within the file being mapped */
> +    uint64_t fd_offset;
> +    /* Offsets within the cache */
> +    uint64_t c_offset;
> +    /* Lengths of sections */
> +    uint64_t len;
> +    /* Flags, from VHOST_USER_FS_FLAG_* */
> +    uint64_t flags;
> +} VhostUserFSSlaveMsgEntry;
> +
> +typedef struct {
> +    /* Number of entries */
> +    uint16_t count;
> +    /* Spare */
> +    uint16_t align;
> +
> +    VhostUserFSSlaveMsgEntry entries[];
> +} VhostUserFSSlaveMsg;
> +
>  typedef struct VhostUserMemoryRegion {
>      uint64_t guest_phys_addr;
>      uint64_t memory_size;
> @@ -197,6 +224,7 @@ typedef struct VhostUserMsg {
>          VhostUserConfig config;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserFSSlaveMsg fs;
>      } payload;
>  
>      int fds[VHOST_MEMORY_BASELINE_NREGIONS];

This fails Clang's build; because 'fs' is part of a union and
given it's entries[] is a variable length type, and is not
at the end of the union.   It's got a good point - I really don't know
how gcc copes here; but also, what are vhost-user's rules
on the length of 'payload' - it looks like me putting
a larger message in there will break other demons.

I'd changed it from a fixed size array to variable size based
on Chirantan's comments on v1; but now I'm not even convinced
the fixed size was right, given that I'm not convinced I'm
allowed to change the length of payload.

Dave

> @@ -693,4 +721,16 @@ void vu_queue_get_avail_bytes(VuDev *vdev, VuVirtq *vq, unsigned int *in_bytes,
>  bool vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes,
>                            unsigned int out_bytes);
>  
> +/**
> + * vu_fs_cache_request: Send a slave message for an fs client
> + * @dev: a VuDev context
> + * @req: The request type (map, unmap, sync)
> + * @fd: an fd (only required for map, else must be -1)
> + * @fsm: The body of the message
> + *
> + * Returns: 0 or above for success, nevative errno on error
> + */
> +int64_t vu_fs_cache_request(VuDev *dev, VhostUserSlaveRequest req, int fd,
> +                            VhostUserFSSlaveMsg *fsm);
> +
>  #endif /* LIBVHOST_USER_H */
> -- 
> 2.31.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 01/26] virtiofs: Fixup printf args
  2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
@ 2021-05-04 14:54   ` Stefan Hajnoczi
  2021-05-05 11:06     ` Dr. David Alan Gilbert
  2021-05-06 15:56   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-04 14:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 803 bytes --]

On Wed, Apr 28, 2021 at 12:00:35PM +0100, Dr. David Alan Gilbert (git) wrote:
> @@ -3097,9 +3097,10 @@ static void lo_copy_file_range(fuse_req_t req, fuse_ino_t ino_in, off_t off_in,
>  
>      fuse_log(FUSE_LOG_DEBUG,
>               "lo_copy_file_range(ino=%" PRIu64 "/fd=%d, "
> -             "off=%lu, ino=%" PRIu64 "/fd=%d, "
> -             "off=%lu, size=%zd, flags=0x%x)\n",
> -             ino_in, in_fd, off_in, ino_out, out_fd, off_out, len, flags);
> +             "off=%ju, ino=%" PRIu64 "/fd=%d, "
> +             "off=%ju, size=%zd, flags=0x%x)\n",
> +             ino_in, in_fd, (intmax_t)off_in,
> +             ino_out, out_fd, (intmax_t)off_out, len, flags);

The rest of the patch used uint64_t. Why intmax_t here?

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 02/26] virtiofsd: Don't assume header layout
  2021-04-28 11:00 ` [PATCH v3 02/26] virtiofsd: Don't assume header layout Dr. David Alan Gilbert (git)
@ 2021-05-04 15:12   ` Stefan Hajnoczi
  2021-05-06 15:56   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-04 15:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 666 bytes --]

On Wed, Apr 28, 2021 at 12:00:36PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> virtiofsd incorrectly assumed a fixed set of header layout in the virt
> queue; assuming that the fuse and write headers were conveniently
> separated from the data;  the spec doesn't allow us to take that
> convenience, so fix it up to deal with it the hard way.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 94 +++++++++++++++++++++++++++--------
>  1 file changed, 73 insertions(+), 21 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 03/26] DAX: vhost-user: Rework slave return values
  2021-04-28 11:00 ` [PATCH v3 03/26] DAX: vhost-user: Rework slave return values Dr. David Alan Gilbert (git)
@ 2021-05-04 15:23   ` Stefan Hajnoczi
  2021-05-27 15:59     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-04 15:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 2172 bytes --]

On Wed, Apr 28, 2021 at 12:00:37PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> All the current slave handlers on the qemu side generate an 'int'
> return value that's squashed down to a bool (!!ret) and stuffed into
> a uint64_t (field of a union) to be returned.
> 
> Move the uint64_t type back up through the individual handlers so
> that we can make one actually return a full uint64_t.
> 
> Note that the definition in the interop spec says most of these
> cases are defined as returning 0 on success and non-0 for failure,
> so it's OK to change from a bool to another non-0.
> 
> Vivek:
> This is needed because upcoming patches in series will add new functions
> which want to return full error code. Existing functions continue to
> return true/false so, it should not lead to change of behavior for
> existing users.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Greg Kurz <groug@kaod.org>
> ---
>  hw/virtio/vhost-backend.c         |  6 +++---
>  hw/virtio/vhost-user.c            | 31 ++++++++++++++++---------------
>  include/hw/virtio/vhost-backend.h |  2 +-
>  3 files changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> index 31b33bde37..1686c94767 100644
> --- a/hw/virtio/vhost-backend.c
> +++ b/hw/virtio/vhost-backend.c
> @@ -401,8 +401,8 @@ int vhost_backend_invalidate_device_iotlb(struct vhost_dev *dev,
>      return -ENODEV;
>  }
>  
> -int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> -                                          struct vhost_iotlb_msg *imsg)
> +uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> +                                        struct vhost_iotlb_msg *imsg)
>  {
>      int ret = 0;

This patch is messy. We want uint64_t but these functions use int ret
internally. The actual return values are true/false instead of int 0 and
1.

Please use uint64_t everywhere: return 0 for success and 1 for failure
instead of bool literals and change the type of the local ret variables
to uint64_t.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload
  2021-04-28 11:00 ` [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload Dr. David Alan Gilbert (git)
@ 2021-05-04 15:26   ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-04 15:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 566 bytes --]

On Wed, Apr 28, 2021 at 12:00:38PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Route the uint64 payload from message replies on the slave back up
> through vu_process_message_reply and to the callers.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  subprojects/libvhost-user/libvhost-user.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 01/26] virtiofs: Fixup printf args
  2021-05-04 14:54   ` Stefan Hajnoczi
@ 2021-05-05 11:06     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-05 11:06 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:35PM +0100, Dr. David Alan Gilbert (git) wrote:
> > @@ -3097,9 +3097,10 @@ static void lo_copy_file_range(fuse_req_t req, fuse_ino_t ino_in, off_t off_in,
> >  
> >      fuse_log(FUSE_LOG_DEBUG,
> >               "lo_copy_file_range(ino=%" PRIu64 "/fd=%d, "
> > -             "off=%lu, ino=%" PRIu64 "/fd=%d, "
> > -             "off=%lu, size=%zd, flags=0x%x)\n",
> > -             ino_in, in_fd, off_in, ino_out, out_fd, off_out, len, flags);
> > +             "off=%ju, ino=%" PRIu64 "/fd=%d, "
> > +             "off=%ju, size=%zd, flags=0x%x)\n",
> > +             ino_in, in_fd, (intmax_t)off_in,
> > +             ino_out, out_fd, (intmax_t)off_out, len, flags);
> 
> The rest of the patch used uint64_t. Why intmax_t here?

Because it seems to be the standard way of doing it for things that may
be off_t.

> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Thanks.

Dave


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR
  2021-04-28 11:00 ` [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR Dr. David Alan Gilbert (git)
@ 2021-05-05 12:12   ` Stefan Hajnoczi
  2021-05-05 18:59     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-05 12:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 2635 bytes --]

On Wed, Apr 28, 2021 at 12:00:42PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add a cache BAR into which files will be directly mapped.
> The size can be set with the cache-size= property, e.g.
>    -device vhost-user-fs-pci,chardev=char0,tag=myfs,cache-size=16G
> 
> The default is no cache.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> with PPC fixes by:
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---
>  hw/virtio/vhost-user-fs-pci.c     | 32 +++++++++++++++++++++++++++++++
>  hw/virtio/vhost-user-fs.c         | 32 +++++++++++++++++++++++++++++++
>  include/hw/virtio/vhost-user-fs.h |  2 ++
>  3 files changed, 66 insertions(+)
> 
> diff --git a/hw/virtio/vhost-user-fs-pci.c b/hw/virtio/vhost-user-fs-pci.c
> index 2ed8492b3f..20e447631f 100644
> --- a/hw/virtio/vhost-user-fs-pci.c
> +++ b/hw/virtio/vhost-user-fs-pci.c
> @@ -12,14 +12,19 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qapi/error.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/vhost-user-fs.h"
>  #include "virtio-pci.h"
>  #include "qom/object.h"
> +#include "standard-headers/linux/virtio_fs.h"
> +
> +#define VIRTIO_FS_PCI_CACHE_BAR 2
>  
>  struct VHostUserFSPCI {
>      VirtIOPCIProxy parent_obj;
>      VHostUserFS vdev;
> +    MemoryRegion cachebar;
>  };
>  
>  typedef struct VHostUserFSPCI VHostUserFSPCI;
> @@ -38,7 +43,9 @@ static Property vhost_user_fs_pci_properties[] = {
>  static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
>  {
>      VHostUserFSPCI *dev = VHOST_USER_FS_PCI(vpci_dev);
> +    bool modern_pio = vpci_dev->flags & VIRTIO_PCI_FLAG_MODERN_PIO_NOTIFY;
>      DeviceState *vdev = DEVICE(&dev->vdev);
> +    uint64_t cachesize;
>  
>      if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
>          /* Also reserve config change and hiprio queue vectors */
> @@ -46,6 +53,31 @@ static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
>      }
>  
>      qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> +    cachesize = dev->vdev.conf.cache_size;
> +
> +    if (cachesize && modern_pio) {
> +        error_setg(errp, "DAX Cache can not be used together with modern_pio");

It's not necessary to respin but it would help to capture the reason for
this limitation either in the error message or at least in a comment.

The problem is that PCI BARs are limited resources and enabling modern
PIO notify conflicts with the DAX Window BAR usage.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping
  2021-04-28 11:00 ` [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping Dr. David Alan Gilbert (git)
@ 2021-05-05 14:15   ` Stefan Hajnoczi
  2021-05-27 16:57     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-05 14:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 1788 bytes --]

On Wed, Apr 28, 2021 at 12:00:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> index dd0a02aa99..169a146e72 100644
> --- a/hw/virtio/vhost-user-fs.c
> +++ b/hw/virtio/vhost-user-fs.c
> @@ -45,6 +45,72 @@ static const int user_feature_bits[] = {
>  #define DAX_WINDOW_PROT PROT_NONE
>  #endif
>  
> +/*
> + * The message apparently had 'received_size' bytes, check this
> + * matches the count in the message.
> + *
> + * Returns true if the size matches.
> + */
> +static bool check_slave_message_entries(const VhostUserFSSlaveMsg *sm,
> +                                        int received_size)
> +{
> +    int tmp;
> +
> +    /*
> +     * VhostUserFSSlaveMsg consists of a body followed by 'n' entries,
> +     * (each VhostUserFSSlaveMsgEntry).  There's a maximum of
> +     * VHOST_USER_FS_SLAVE_MAX_ENTRIES of these.
> +     */
> +    if (received_size <= sizeof(VhostUserFSSlaveMsg)) {

received_size is an int and we risk hitting checking against the coerced
size_t value but then using the signed int received_size later. It's
safer to convert to size_t once and then use that size_t value
everywhere later.

> +typedef struct {
> +    /* Offsets within the file being mapped */
> +    uint64_t fd_offset;
> +    /* Offsets within the cache */
> +    uint64_t c_offset;
> +    /* Lengths of sections */
> +    uint64_t len;
> +    /* Flags, from VHOST_USER_FS_FLAG_* */
> +    uint64_t flags;
> +} VhostUserFSSlaveMsgEntry;
> +
> +typedef struct {
> +    /* Number of entries */
> +    uint16_t count;
> +    /* Spare */
> +    uint16_t align;

VhostUserFSSlaveMsgEntry is aligned to 64 bits, so the 16-bit align
field is not sufficient for full alignment.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 10/26] DAX: virtio-fs: Fill in slave commands for mapping
  2021-04-28 11:00 ` [PATCH v3 10/26] DAX: virtio-fs: Fill in " Dr. David Alan Gilbert (git)
@ 2021-05-05 16:43   ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-05 16:43 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]

On Wed, Apr 28, 2021 at 12:00:44PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Fill in definitions for map, unmap and sync commands.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> with fix by misono.tomohiro@fujitsu.com
> ---
>  hw/virtio/vhost-user-fs.c | 117 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 113 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR
  2021-05-05 12:12   ` Stefan Hajnoczi
@ 2021-05-05 18:59     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-05 18:59 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:42PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Add a cache BAR into which files will be directly mapped.
> > The size can be set with the cache-size= property, e.g.
> >    -device vhost-user-fs-pci,chardev=char0,tag=myfs,cache-size=16G
> > 
> > The default is no cache.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > with PPC fixes by:
> > Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> > ---
> >  hw/virtio/vhost-user-fs-pci.c     | 32 +++++++++++++++++++++++++++++++
> >  hw/virtio/vhost-user-fs.c         | 32 +++++++++++++++++++++++++++++++
> >  include/hw/virtio/vhost-user-fs.h |  2 ++
> >  3 files changed, 66 insertions(+)
> > 
> > diff --git a/hw/virtio/vhost-user-fs-pci.c b/hw/virtio/vhost-user-fs-pci.c
> > index 2ed8492b3f..20e447631f 100644
> > --- a/hw/virtio/vhost-user-fs-pci.c
> > +++ b/hw/virtio/vhost-user-fs-pci.c
> > @@ -12,14 +12,19 @@
> >   */
> >  
> >  #include "qemu/osdep.h"
> > +#include "qapi/error.h"
> >  #include "hw/qdev-properties.h"
> >  #include "hw/virtio/vhost-user-fs.h"
> >  #include "virtio-pci.h"
> >  #include "qom/object.h"
> > +#include "standard-headers/linux/virtio_fs.h"
> > +
> > +#define VIRTIO_FS_PCI_CACHE_BAR 2
> >  
> >  struct VHostUserFSPCI {
> >      VirtIOPCIProxy parent_obj;
> >      VHostUserFS vdev;
> > +    MemoryRegion cachebar;
> >  };
> >  
> >  typedef struct VHostUserFSPCI VHostUserFSPCI;
> > @@ -38,7 +43,9 @@ static Property vhost_user_fs_pci_properties[] = {
> >  static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> >  {
> >      VHostUserFSPCI *dev = VHOST_USER_FS_PCI(vpci_dev);
> > +    bool modern_pio = vpci_dev->flags & VIRTIO_PCI_FLAG_MODERN_PIO_NOTIFY;
> >      DeviceState *vdev = DEVICE(&dev->vdev);
> > +    uint64_t cachesize;
> >  
> >      if (vpci_dev->nvectors == DEV_NVECTORS_UNSPECIFIED) {
> >          /* Also reserve config change and hiprio queue vectors */
> > @@ -46,6 +53,31 @@ static void vhost_user_fs_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
> >      }
> >  
> >      qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
> > +    cachesize = dev->vdev.conf.cache_size;
> > +
> > +    if (cachesize && modern_pio) {
> > +        error_setg(errp, "DAX Cache can not be used together with modern_pio");
> 
> It's not necessary to respin but it would help to capture the reason for
> this limitation either in the error message or at least in a comment.
> 
> The problem is that PCI BARs are limited resources and enabling modern
> PIO notify conflicts with the DAX Window BAR usage.

OK, I've added that as a comment:

    if (cachesize && modern_pio) {
        /*
         * We've not got enough BARs for the one used by the DAX cache
         * and also the one used by modern_pio
         */
        error_setg(errp, "DAX Cache can not be used together with modern_pio");
        return;
    }

> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Thanks.

Dave


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands
  2021-04-28 11:00 ` [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands Dr. David Alan Gilbert (git)
@ 2021-05-06 15:02   ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:02 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

On Wed, Apr 28, 2021 at 12:00:46PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Add commands so that the guest kernel can ask the daemon to map file
> sections into a guest kernel visible cache.
> 
> Note: Catherine Ho had sent a patch to fix an issue with multiple
> removemapping. It was a merge issue though.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
> Including-fixes: Catherine Ho <catherine.hecx@gmail.com>
> Signed-off-by: Catherine Ho <catherine.hecx@gmail.com>
> ---
>  tools/virtiofsd/fuse_lowlevel.c | 69 +++++++++++++++++++++++++++++++++
>  tools/virtiofsd/fuse_lowlevel.h | 23 ++++++++++-
>  2 files changed, 91 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  2021-04-28 11:00 ` [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
@ 2021-05-06 15:12   ` Stefan Hajnoczi
  2021-05-27 17:44     ` Dr. David Alan Gilbert
  2021-05-06 15:16   ` Stefan Hajnoczi
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:12 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 3886 bytes --]

On Wed, Apr 28, 2021 at 12:00:52PM +0100, Dr. David Alan Gilbert (git) wrote:
> @@ -220,6 +222,99 @@ uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
>      return (uint64_t)res;
>  }
>  
> +uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
> +                                VhostUserFSSlaveMsg *sm, int fd)
> +{
> +    VHostUserFS *fs = (VHostUserFS *)object_dynamic_cast(OBJECT(dev->vdev),
> +                          TYPE_VHOST_USER_FS);
> +    if (!fs) {
> +        error_report("%s: Bad fs ptr", __func__);
> +        return (uint64_t)-1;
> +    }
> +    if (!check_slave_message_entries(sm, message_size)) {
> +        return (uint64_t)-1;
> +    }

These early error returns don't close(fd).

> +
> +    unsigned int i;
> +    int res = 0;
> +    size_t done = 0;
> +
> +    if (fd < 0) {
> +        error_report("Bad fd for io");
> +        return (uint64_t)-1;
> +    }
> +
> +    for (i = 0; i < sm->count && !res; i++) {
> +        VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
> +        if (e->len == 0) {
> +            continue;
> +        }
> +
> +        size_t len = e->len;
> +        uint64_t fd_offset = e->fd_offset;
> +        hwaddr gpa = e->c_offset;
> +
> +        while (len && !res) {
> +            hwaddr xlat, xlat_len;
> +            bool is_write = e->flags & VHOST_USER_FS_FLAG_MAP_W;
> +            MemoryRegion *mr = address_space_translate(dev->vdev->dma_as, gpa,
> +                                                       &xlat, &xlat_len,
> +                                                       is_write,
> +                                                       MEMTXATTRS_UNSPECIFIED);
> +            if (!mr || !xlat_len) {
> +                error_report("No guest region found for 0x%" HWADDR_PRIx, gpa);
> +                res = -EFAULT;
> +                break;
> +            }
> +
> +            trace_vhost_user_fs_slave_io_loop(mr->name,
> +                                          (uint64_t)xlat,
> +                                          memory_region_is_ram(mr),
> +                                          memory_region_is_romd(mr),
> +                                          (size_t)xlat_len);
> +
> +            void *hostptr = qemu_map_ram_ptr(mr->ram_block,
> +                                             xlat);
> +            ssize_t transferred;

What happens when the MemoryRegion is not backed by RAM?

> +            if (e->flags & VHOST_USER_FS_FLAG_MAP_R) {
> +                /* Read from file into RAM */
> +                if (mr->readonly) {
> +                    res = -EFAULT;
> +                    break;
> +                }
> +                transferred = pread(fd, hostptr, xlat_len, fd_offset);
> +            } else if (e->flags & VHOST_USER_FS_FLAG_MAP_W) {
> +                /* Write into file from RAM */
> +                transferred = pwrite(fd, hostptr, xlat_len, fd_offset);
> +            } else {
> +                transferred = EINVAL;

I don't see how this is handled below by the error checking code. Should
this be:

  errno = EINVAL;
  transferred = -1;

?

> +            }
> +
> +            trace_vhost_user_fs_slave_io_loop_res(transferred);
> +            if (transferred < 0) {
> +                res = -errno;
> +                break;
> +            }
> +            if (!transferred) {
> +                /* EOF */
> +                break;
> +            }
> +
> +            done += transferred;
> +            fd_offset += transferred;
> +            gpa += transferred;
> +            len -= transferred;
> +        }
> +    }
> +    close(fd);
> +
> +    trace_vhost_user_fs_slave_io_exit(res, done);
> +    if (res < 0) {
> +        return (uint64_t)res;
> +    }
> +    return (uint64_t)done;
> +}

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  2021-04-28 11:00 ` [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
  2021-05-06 15:12   ` Stefan Hajnoczi
@ 2021-05-06 15:16   ` Stefan Hajnoczi
  2021-05-27 17:31     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

On Wed, Apr 28, 2021 at 12:00:52PM +0100, Dr. David Alan Gilbert (git) wrote:
> +    close(fd);

I looked back at the hw/virtio/vhost-user.c slave channel code and it
closes fds for us. Looks like this close(2) call should be removed, but
please double-check in case I missed something.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements
  2021-04-28 11:00 ` [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements Dr. David Alan Gilbert (git)
@ 2021-05-06 15:23   ` Stefan Hajnoczi
  2021-05-27 17:56     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 3086 bytes --]

On Wed, Apr 28, 2021 at 12:00:54PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> For some read/writes the virtio queue elements are unmappable by
> the daemon; these are cases where the data is to be read/written
> from non-RAM.  In viritofs's case this is typically a direct read/write

s/viritofs/virtiofs/

> into an mmap'd DAX file also on virtiofs (possibly on another instance).
> 
> When we receive a virtio queue element, check that we have enough
> mappable data to handle the headers.  Make a note of the number of
> unmappable 'in' entries (ie. for read data back to the VMM),
> and flag the fuse_bufvec for 'out' entries with a new flag
> FUSE_BUF_PHYS_ADDR.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> with fix by:
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  tools/virtiofsd/buffer.c      |   4 +-
>  tools/virtiofsd/fuse_common.h |   7 ++
>  tools/virtiofsd/fuse_virtio.c | 230 ++++++++++++++++++++++++----------
>  3 files changed, 173 insertions(+), 68 deletions(-)
> 
> diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
> index 874f01c488..1a050aa441 100644
> --- a/tools/virtiofsd/buffer.c
> +++ b/tools/virtiofsd/buffer.c
> @@ -77,6 +77,7 @@ static ssize_t fuse_buf_write(const struct fuse_buf *dst, size_t dst_off,
>      ssize_t res = 0;
>      size_t copied = 0;
>  
> +    assert(!(src->flags & FUSE_BUF_PHYS_ADDR));
>      while (len) {
>          if (dst->flags & FUSE_BUF_FD_SEEK) {
>              res = pwrite(dst->fd, (char *)src->mem + src_off, len,
> @@ -272,7 +273,8 @@ ssize_t fuse_buf_copy(struct fuse_bufvec *dstv, struct fuse_bufvec *srcv)
>       * process
>       */
>      for (i = 0; i < srcv->count; i++) {
> -        if (srcv->buf[i].flags & FUSE_BUF_IS_FD) {
> +        if ((srcv->buf[i].flags & FUSE_BUF_PHYS_ADDR) ||
> +            (srcv->buf[i].flags & FUSE_BUF_IS_FD)) {
>              break;
>          }
>      }
> diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
> index fa9671872e..af43cf19f9 100644
> --- a/tools/virtiofsd/fuse_common.h
> +++ b/tools/virtiofsd/fuse_common.h
> @@ -626,6 +626,13 @@ enum fuse_buf_flags {
>       * detected.
>       */
>      FUSE_BUF_FD_RETRY = (1 << 3),
> +
> +    /**
> +     * The addresses in the iovec represent guest physical addresses
> +     * that can't be mapped by the daemon process.
> +     * IO must be bounced back to the VMM to do it.
> +     */
> +    FUSE_BUF_PHYS_ADDR = (1 << 4),

Based on the previous patch this is not a gpa, it's an IOVA. Depending
on the virtiofs device's DMA address space in QEMU this might be the
same as guest physical addresses but there could also be vIOMMU
translation (see the address_space_translate() call in the patch that
implemented the IO slave command).

Maybe virtiofs + vIOMMU has never been tested though... I'm not sure it
works today.

If you want to leave it as is, feel free:
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads
  2021-04-28 11:00 ` [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads Dr. David Alan Gilbert (git)
@ 2021-05-06 15:27   ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:27 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 463 bytes --]

On Wed, Apr 28, 2021 at 12:00:55PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When a read with unmappable buffers is found, map it to a slave
> read command.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  tools/virtiofsd/fuse_virtio.c | 37 +++++++++++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command
  2021-04-28 11:00 ` [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command Dr. David Alan Gilbert (git)
@ 2021-05-06 15:28   ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 687 bytes --]

On Wed, Apr 28, 2021 at 12:00:56PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> When a fuse_buf_copy is performed on an element with FUSE_BUF_PHYS_ADDR
> route it to a fuse_virtio_write request that does a slave command to
> perform the write.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  tools/virtiofsd/buffer.c         | 14 +++++++++++---
>  tools/virtiofsd/fuse_common.h    |  6 +++++-
>  tools/virtiofsd/fuse_lowlevel.h  |  3 ---
>  tools/virtiofsd/passthrough_ll.c |  2 +-
>  4 files changed, 17 insertions(+), 8 deletions(-)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info
  2021-04-28 11:00 ` [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info Dr. David Alan Gilbert (git)
@ 2021-05-06 15:31   ` Stefan Hajnoczi
  2021-05-06 15:32   ` Stefan Hajnoczi
  1 sibling, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 1107 bytes --]

On Wed, Apr 28, 2021 at 12:00:58PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: Vivek Goyal <vgoyal@redhat.com>
> 
> Extend VhostUserFSSlaveMsg so that slave can ask it to drop CAP_FSETID
> before doing I/O on fd.
> 
> In some cases, virtiofsd takes the onus of clearing setuid bit on a file
> when WRITE happens. Generally virtiofsd does the WRITE to fd (from guest
> memory which is mapped in virtiofsd as well), but if this memory is
> unmappable in virtiofsd (like cache window), then virtiofsd asks qemu
> to do the I/O instead.
> 
> To retain the capability to drop suid bit on write, qemu needs to
> drop the CAP_FSETID as well before write to fd. Extend VhostUserFSSlaveMsg
> so that virtiofsd can specify in message if CAP_FSETID needs to be
> dropped.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  hw/virtio/vhost-user-fs.c                 | 5 +++++
>  include/hw/virtio/vhost-user-fs.h         | 6 ++++++
>  subprojects/libvhost-user/libvhost-user.h | 6 ++++++
>  3 files changed, 17 insertions(+)

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info
  2021-04-28 11:00 ` [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info Dr. David Alan Gilbert (git)
  2021-05-06 15:31   ` Stefan Hajnoczi
@ 2021-05-06 15:32   ` Stefan Hajnoczi
  1 sibling, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:32 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 652 bytes --]

On Wed, Apr 28, 2021 at 12:00:58PM +0100, Dr. David Alan Gilbert (git) wrote:
> @@ -144,6 +148,8 @@ typedef struct {
>  } VhostUserFSSlaveMsgEntry;
>  
>  typedef struct {
> +    /* Generic flags for the overall message */
> +    uint32_t flags;
>      /* Number of entries */
>      uint16_t count;
>      /* Spare */

Please introduce the uint32_t field as a reserved field in the earlier
patch that left a hole in the struct. Everything is okay once we get to
this patch but the earlier patches should have avoided the hole too
(they had the confusing uint16_t padding field which was not enough to
avoid the 64-bit alignment hole).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-04-28 11:01 ` [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it Dr. David Alan Gilbert (git)
@ 2021-05-06 15:37   ` Stefan Hajnoczi
  2021-05-06 16:02     ` Vivek Goyal
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: Vivek Goyal <vgoyal@redhat.com>
> 
> If qemu guest asked to drop CAP_FSETID upon write, send that info
> to qemu in SLAVE_FS_IO message so that qemu can drop capability
> before WRITE. This is to make sure that any setuid bit is killed
> on fd (if there is one set).
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
to create setgid files, thereby potentially allowing an attacker to gain
any GID.

I think it's better not to implement QEMU FSETID functionality at all
and to handle it another way. In the worst case I/O requests should just
fail, it seems like a rare case anyway: I/O to a setuid/setgid file with
a memory buffer that is not mapped in virtiofsd.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 00/26] virtiofs dax patches
  2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
                   ` (26 preceding siblings ...)
  2021-04-28 11:27 ` [PATCH v3 00/26] virtiofs dax patches no-reply
@ 2021-05-06 15:37 ` Stefan Hajnoczi
  27 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-06 15:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: virtio-fs, qemu-devel, vgoyal, groug

[-- Attachment #1: Type: text/plain, Size: 2122 bytes --]

On Wed, Apr 28, 2021 at 12:00:34PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
>   This series adds support for acceleration of virtiofs via DAX
> mapping, using features added in the 5.11 Linux kernel.
> 
>   DAX originally existed in the kernel for mapping real storage
> devices directly into memory, so that reads/writes turn into
> reads/writes directly mapped into the storage device.
> 
>   virtiofs's DAX support is similar; a PCI BAR is exposed on the
> virtiofs device corresponding to a DAX 'cache' of a user defined size.
> The guest daemon then requests files to be mapped into that cache;
> when that happens the virtiofsd sends filedescriptors and commands back
> to the QEMU that mmap's those files directly into the memory slot
> exposed to kvm.  The guest can then directly read/write to the files
> exposed by virtiofs by reading/writing into the BAR.
> 
>   A typical invocation would be:
>      -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,cache-size=4G
> 
> and then the guest must mount with -o dax
> 
>   Note that the cache doesn't really take VM up on the host, because
> everything placed there is just an mmap of a file, so you can afford
> to use quite a large cache size.
> 
>   Unlike a real DAX device, the cache is a finite size that's
> potentially smaller than the underlying filesystem (especially when
> mapping granuality is taken into account).  Mapping, unmapping and
> remapping must take place to juggle files into the cache if it's too
> small.  Some workloads benefit more than others.
> 
> Gotchas:
>   a) If something else on the host truncates an mmap'd file,
> kvm gets rather upset;  for this reason it's advised that DAX is
> currently only suitable for use on non-shared filesystems.
> 
> (This series, with a couple of other patches, is at:
> https://gitlab.com/virtio-fs/qemu/-/tree/dgilbert-dax-2021-04-28 )

Overall this looks close but I don't think the FSETID support should be
added to QEMU. Please see my comment on the final patch.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 01/26] virtiofs: Fixup printf args
  2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
  2021-05-04 14:54   ` Stefan Hajnoczi
@ 2021-05-06 15:56   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-06 15:56 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> Fixup some fuse_log printf args for 32bit compatibility.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

queued this 1/26 only

> ---
>  tools/virtiofsd/passthrough_ll.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 1553d2ef45..110f85a701 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2011,10 +2011,10 @@ static void lo_getlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
>  
>      fuse_log(FUSE_LOG_DEBUG,
>               "lo_getlk(ino=%" PRIu64 ", flags=%d)"
> -             " owner=0x%lx, l_type=%d l_start=0x%lx"
> -             " l_len=0x%lx\n",
> -             ino, fi->flags, fi->lock_owner, lock->l_type, lock->l_start,
> -             lock->l_len);
> +             " owner=0x%" PRIx64 ", l_type=%d l_start=0x%" PRIx64
> +             " l_len=0x%" PRIx64 "\n",
> +             ino, fi->flags, fi->lock_owner, lock->l_type,
> +             (uint64_t)lock->l_start, (uint64_t)lock->l_len);
>  
>      if (!lo->posix_lock) {
>          fuse_reply_err(req, ENOSYS);
> @@ -2061,10 +2061,10 @@ static void lo_setlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
>  
>      fuse_log(FUSE_LOG_DEBUG,
>               "lo_setlk(ino=%" PRIu64 ", flags=%d)"
> -             " cmd=%d pid=%d owner=0x%lx sleep=%d l_whence=%d"
> -             " l_start=0x%lx l_len=0x%lx\n",
> +             " cmd=%d pid=%d owner=0x%" PRIx64 " sleep=%d l_whence=%d"
> +             " l_start=0x%" PRIx64 " l_len=0x%" PRIx64 "\n",
>               ino, fi->flags, lock->l_type, lock->l_pid, fi->lock_owner, sleep,
> -             lock->l_whence, lock->l_start, lock->l_len);
> +             lock->l_whence, (uint64_t)lock->l_start, (uint64_t)lock->l_len);
>  
>      if (!lo->posix_lock) {
>          fuse_reply_err(req, ENOSYS);
> @@ -3097,9 +3097,10 @@ static void lo_copy_file_range(fuse_req_t req, fuse_ino_t ino_in, off_t off_in,
>  
>      fuse_log(FUSE_LOG_DEBUG,
>               "lo_copy_file_range(ino=%" PRIu64 "/fd=%d, "
> -             "off=%lu, ino=%" PRIu64 "/fd=%d, "
> -             "off=%lu, size=%zd, flags=0x%x)\n",
> -             ino_in, in_fd, off_in, ino_out, out_fd, off_out, len, flags);
> +             "off=%ju, ino=%" PRIu64 "/fd=%d, "
> +             "off=%ju, size=%zd, flags=0x%x)\n",
> +             ino_in, in_fd, (intmax_t)off_in,
> +             ino_out, out_fd, (intmax_t)off_out, len, flags);
>  
>      res = copy_file_range(in_fd, &off_in, out_fd, &off_out, len, flags);
>      if (res < 0) {
> -- 
> 2.31.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 02/26] virtiofsd: Don't assume header layout
  2021-04-28 11:00 ` [PATCH v3 02/26] virtiofsd: Don't assume header layout Dr. David Alan Gilbert (git)
  2021-05-04 15:12   ` Stefan Hajnoczi
@ 2021-05-06 15:56   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-06 15:56 UTC (permalink / raw)
  To: qemu-devel, vgoyal, stefanha, groug; +Cc: virtio-fs

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> virtiofsd incorrectly assumed a fixed set of header layout in the virt
> queue; assuming that the fuse and write headers were conveniently
> separated from the data;  the spec doesn't allow us to take that
> convenience, so fix it up to deal with it the hard way.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

Queued this 2/26 only

> ---
>  tools/virtiofsd/fuse_virtio.c | 94 +++++++++++++++++++++++++++--------
>  1 file changed, 73 insertions(+), 21 deletions(-)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 3e13997406..6dd73c9b72 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -129,18 +129,55 @@ static void fv_panic(VuDev *dev, const char *err)
>   * Copy from an iovec into a fuse_buf (memory only)
>   * Caller must ensure there is space
>   */
> -static void copy_from_iov(struct fuse_buf *buf, size_t out_num,
> -                          const struct iovec *out_sg)
> +static size_t copy_from_iov(struct fuse_buf *buf, size_t out_num,
> +                            const struct iovec *out_sg,
> +                            size_t max)
>  {
>      void *dest = buf->mem;
> +    size_t copied = 0;
>  
> -    while (out_num) {
> +    while (out_num && max) {
>          size_t onelen = out_sg->iov_len;
> +        onelen = MIN(onelen, max);
>          memcpy(dest, out_sg->iov_base, onelen);
>          dest += onelen;
> +        copied += onelen;
>          out_sg++;
>          out_num--;
> +        max -= onelen;
>      }
> +
> +    return copied;
> +}
> +
> +/*
> + * Skip 'skip' bytes in the iov; 'sg_1stindex' is set as
> + * the index for the 1st iovec to read data from, and
> + * 'sg_1stskip' is the number of bytes to skip in that entry.
> + *
> + * Returns True if there are at least 'skip' bytes in the iovec
> + *
> + */
> +static bool skip_iov(const struct iovec *sg, size_t sg_size,
> +                     size_t skip,
> +                     size_t *sg_1stindex, size_t *sg_1stskip)
> +{
> +    size_t vec;
> +
> +    for (vec = 0; vec < sg_size; vec++) {
> +        if (sg[vec].iov_len > skip) {
> +            *sg_1stskip = skip;
> +            *sg_1stindex = vec;
> +
> +            return true;
> +        }
> +
> +        skip -= sg[vec].iov_len;
> +    }
> +
> +    *sg_1stindex = vec;
> +    *sg_1stskip = 0;
> +    return skip == 0;
>  }
>  
>  /*
> @@ -457,6 +494,7 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
>      bool allocated_bufv = false;
>      struct fuse_bufvec bufv;
>      struct fuse_bufvec *pbufv;
> +    struct fuse_in_header inh;
>  
>      assert(se->bufsize > sizeof(struct fuse_in_header));
>  
> @@ -505,14 +543,15 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
>                   elem->index);
>          assert(0); /* TODO */
>      }
> -    /* Copy just the first element and look at it */
> -    copy_from_iov(&fbuf, 1, out_sg);
> +    /* Copy just the fuse_in_header and look at it */
> +    copy_from_iov(&fbuf, out_num, out_sg,
> +                  sizeof(struct fuse_in_header));
> +    memcpy(&inh, fbuf.mem, sizeof(struct fuse_in_header));
>  
>      pbufv = NULL; /* Compiler thinks an unitialised path */
> -    if (out_num > 2 &&
> -        out_sg[0].iov_len == sizeof(struct fuse_in_header) &&
> -        ((struct fuse_in_header *)fbuf.mem)->opcode == FUSE_WRITE &&
> -        out_sg[1].iov_len == sizeof(struct fuse_write_in)) {
> +    if (inh.opcode == FUSE_WRITE &&
> +        out_len >= (sizeof(struct fuse_in_header) +
> +                    sizeof(struct fuse_write_in))) {
>          /*
>           * For a write we don't actually need to copy the
>           * data, we can just do it straight out of guest memory
> @@ -521,15 +560,15 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
>           */
>          fuse_log(FUSE_LOG_DEBUG, "%s: Write special case\n", __func__);
>  
> -        /* copy the fuse_write_in header afte rthe fuse_in_header */
> -        fbuf.mem += out_sg->iov_len;
> -        copy_from_iov(&fbuf, 1, out_sg + 1);
> -        fbuf.mem -= out_sg->iov_len;
> -        fbuf.size = out_sg[0].iov_len + out_sg[1].iov_len;
> +        fbuf.size = copy_from_iov(&fbuf, out_num, out_sg,
> +                                  sizeof(struct fuse_in_header) +
> +                                  sizeof(struct fuse_write_in));
> +        /* That copy reread the in_header, make sure we use the original */
> +        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
>  
>          /* Allocate the bufv, with space for the rest of the iov */
>          pbufv = malloc(sizeof(struct fuse_bufvec) +
> -                       sizeof(struct fuse_buf) * (out_num - 2));
> +                       sizeof(struct fuse_buf) * out_num);
>          if (!pbufv) {
>              fuse_log(FUSE_LOG_ERR, "%s: pbufv malloc failed\n",
>                      __func__);
> @@ -540,24 +579,37 @@ static void fv_queue_worker(gpointer data, gpointer user_data)
>          pbufv->count = 1;
>          pbufv->buf[0] = fbuf;
>  
> -        size_t iovindex, pbufvindex;
> -        iovindex = 2; /* 2 headers, separate iovs */
> +        size_t iovindex, pbufvindex, iov_bytes_skip;
>          pbufvindex = 1; /* 2 headers, 1 fusebuf */
>  
> +        if (!skip_iov(out_sg, out_num,
> +                      sizeof(struct fuse_in_header) +
> +                      sizeof(struct fuse_write_in),
> +                      &iovindex, &iov_bytes_skip)) {
> +            fuse_log(FUSE_LOG_ERR, "%s: skip failed\n",
> +                    __func__);
> +            goto out;
> +        }
> +
>          for (; iovindex < out_num; iovindex++, pbufvindex++) {
>              pbufv->count++;
>              pbufv->buf[pbufvindex].pos = ~0; /* Dummy */
>              pbufv->buf[pbufvindex].flags = 0;
>              pbufv->buf[pbufvindex].mem = out_sg[iovindex].iov_base;
>              pbufv->buf[pbufvindex].size = out_sg[iovindex].iov_len;
> +
> +            if (iov_bytes_skip) {
> +                pbufv->buf[pbufvindex].mem += iov_bytes_skip;
> +                pbufv->buf[pbufvindex].size -= iov_bytes_skip;
> +                iov_bytes_skip = 0;
> +            }
>          }
>      } else {
>          /* Normal (non fast write) path */
>  
> -        /* Copy the rest of the buffer */
> -        fbuf.mem += out_sg->iov_len;
> -        copy_from_iov(&fbuf, out_num - 1, out_sg + 1);
> -        fbuf.mem -= out_sg->iov_len;
> +        copy_from_iov(&fbuf, out_num, out_sg, se->bufsize);
> +        /* That copy reread the in_header, make sure we use the original */
> +        memcpy(fbuf.mem, &inh, sizeof(struct fuse_in_header));
>          fbuf.size = out_len;
>  
>          /* TODO! Endianness of header */
> -- 
> 2.31.1
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-06 15:37   ` Stefan Hajnoczi
@ 2021-05-06 16:02     ` Vivek Goyal
  2021-05-10  9:05       ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Vivek Goyal @ 2021-05-06 16:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, groug, Dr. David Alan Gilbert (git), qemu-devel

On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: Vivek Goyal <vgoyal@redhat.com>
> > 
> > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > before WRITE. This is to make sure that any setuid bit is killed
> > on fd (if there is one set).
> > 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> 
> I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> to create setgid files, thereby potentially allowing an attacker to gain
> any GID.

Sure, its not recommended to run QEMU as root, but we don't block that
either and I do regularly test with qemu running as root.

> 
> I think it's better not to implement QEMU FSETID functionality at all
> and to handle it another way.

One way could be that virtiofsd tries to clear setuid bit after I/O
has finished. But that will be non-atomic operation and it is filled
with perils as it requires virtiofsd to know what all kernel will
do if this write has been done with CAP_FSETID dropped.

> In the worst case I/O requests should just
> fail, it seems like a rare case anyway:

Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
or not. If there is one, it might be reasonable to error out. If we
don't know, then we can't fail all the operations.

> I/O to a setuid/setgid file with
> a memory buffer that is not mapped in virtiofsd.

With DAX it is easily triggerable. User has to append to a setuid file
in virtiofs and this path will trigger.

I am fine with not supporting this patch but will also need a reaosonable
alternative solution.

Vivek



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-06 16:02     ` Vivek Goyal
@ 2021-05-10  9:05       ` Stefan Hajnoczi
  2021-05-10 15:23         ` Vivek Goyal
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-10  9:05 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, groug, Dr. David Alan Gilbert (git), qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3334 bytes --]

On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > From: Vivek Goyal <vgoyal@redhat.com>
> > > 
> > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > before WRITE. This is to make sure that any setuid bit is killed
> > > on fd (if there is one set).
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > 
> > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > to create setgid files, thereby potentially allowing an attacker to gain
> > any GID.
> 
> Sure, its not recommended to run QEMU as root, but we don't block that
> either and I do regularly test with qemu running as root.
> 
> > 
> > I think it's better not to implement QEMU FSETID functionality at all
> > and to handle it another way.
> 
> One way could be that virtiofsd tries to clear setuid bit after I/O
> has finished. But that will be non-atomic operation and it is filled
> with perils as it requires virtiofsd to know what all kernel will
> do if this write has been done with CAP_FSETID dropped.
> 
> > In the worst case I/O requests should just
> > fail, it seems like a rare case anyway:
> 
> Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> or not. If there is one, it might be reasonable to error out. If we
> don't know, then we can't fail all the operations.
> 
> > I/O to a setuid/setgid file with
> > a memory buffer that is not mapped in virtiofsd.
> 
> With DAX it is easily triggerable. User has to append to a setuid file
> in virtiofs and this path will trigger.
> 
> I am fine with not supporting this patch but will also need a reaosonable
> alternative solution.

One way to avoid this problem is by introducing DMA read/write functions
into the vhost-user protocol that can be used by all device types, not
just virtio-fs.

Today virtio-fs uses the IO slave request when it cannot access a region
of guest memory. It sends the file descriptor to QEMU and QEMU performs
the pread(2)/pwrite(2) on behalf of virtiofsd.

I mentioned in the past that this solution is over-specialized. It
doesn't solve the larger problem that vhost-user processes do not have
full access to the guest memory space (e.g. DAX window).

Instead of sending file I/O requests over to QEMU, the vhost-user
protocol should offer DMA read/write requests so any vhost-user process
can access the guest memory space where vhost's shared memory mechanism
is insufficient.

Here is how it would work:

1. Drop the IO slave request, replace it with DMA read/write slave
   requests.

   Note that these new requests can also be used in environments where
   maximum vIOMMU isolation is needed for security reasons and sharing
   all of guest RAM with the vhost-user process is considered
   unacceptable.

2. When virtqueue buffer mapping fails, send DMA read/write slave
   requests to transfer the data from/to QEMU. virtiofsd calls
   pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-10  9:05       ` Stefan Hajnoczi
@ 2021-05-10 15:23         ` Vivek Goyal
  2021-05-10 15:32           ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Vivek Goyal @ 2021-05-10 15:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: virtio-fs, groug, Dr. David Alan Gilbert (git), qemu-devel

On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > 
> > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > on fd (if there is one set).
> > > > 
> > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > 
> > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > to create setgid files, thereby potentially allowing an attacker to gain
> > > any GID.
> > 
> > Sure, its not recommended to run QEMU as root, but we don't block that
> > either and I do regularly test with qemu running as root.
> > 
> > > 
> > > I think it's better not to implement QEMU FSETID functionality at all
> > > and to handle it another way.
> > 
> > One way could be that virtiofsd tries to clear setuid bit after I/O
> > has finished. But that will be non-atomic operation and it is filled
> > with perils as it requires virtiofsd to know what all kernel will
> > do if this write has been done with CAP_FSETID dropped.
> > 
> > > In the worst case I/O requests should just
> > > fail, it seems like a rare case anyway:
> > 
> > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > or not. If there is one, it might be reasonable to error out. If we
> > don't know, then we can't fail all the operations.
> > 
> > > I/O to a setuid/setgid file with
> > > a memory buffer that is not mapped in virtiofsd.
> > 
> > With DAX it is easily triggerable. User has to append to a setuid file
> > in virtiofs and this path will trigger.
> > 
> > I am fine with not supporting this patch but will also need a reaosonable
> > alternative solution.
> 
> One way to avoid this problem is by introducing DMA read/write functions
> into the vhost-user protocol that can be used by all device types, not
> just virtio-fs.
> 
> Today virtio-fs uses the IO slave request when it cannot access a region
> of guest memory. It sends the file descriptor to QEMU and QEMU performs
> the pread(2)/pwrite(2) on behalf of virtiofsd.
> 
> I mentioned in the past that this solution is over-specialized. It
> doesn't solve the larger problem that vhost-user processes do not have
> full access to the guest memory space (e.g. DAX window).
> 
> Instead of sending file I/O requests over to QEMU, the vhost-user
> protocol should offer DMA read/write requests so any vhost-user process
> can access the guest memory space where vhost's shared memory mechanism
> is insufficient.
> 
> Here is how it would work:
> 
> 1. Drop the IO slave request, replace it with DMA read/write slave
>    requests.
> 
>    Note that these new requests can also be used in environments where
>    maximum vIOMMU isolation is needed for security reasons and sharing
>    all of guest RAM with the vhost-user process is considered
>    unacceptable.
> 
> 2. When virtqueue buffer mapping fails, send DMA read/write slave
>    requests to transfer the data from/to QEMU. virtiofsd calls
>    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.

Can you elaborate a bit more how will this new DMA read/write vhost-user
commands can be implemented. I am assuming its not a real DMA and just
sort of emulation of DMA. Effectively we have two processes and one
process needs to read/write to/from address space of other process.

We were also wondering if we can make use of process_vm_readv()
and process_vm_write() syscalls to achieve this. But this atleast
requires virtiofsd to be more priviliged than qemu and also virtiofsd
needs to know where DAX mapping window is. We briefly discussed this here.

https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/

Vivek



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-10 15:23         ` Vivek Goyal
@ 2021-05-10 15:32           ` Stefan Hajnoczi
  2021-05-27 19:09             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-05-10 15:32 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: virtio-fs, groug, Dr. David Alan Gilbert (git), qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5388 bytes --]

On Mon, May 10, 2021 at 11:23:24AM -0400, Vivek Goyal wrote:
> On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> > On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > > 
> > > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > > on fd (if there is one set).
> > > > > 
> > > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > > 
> > > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > > to create setgid files, thereby potentially allowing an attacker to gain
> > > > any GID.
> > > 
> > > Sure, its not recommended to run QEMU as root, but we don't block that
> > > either and I do regularly test with qemu running as root.
> > > 
> > > > 
> > > > I think it's better not to implement QEMU FSETID functionality at all
> > > > and to handle it another way.
> > > 
> > > One way could be that virtiofsd tries to clear setuid bit after I/O
> > > has finished. But that will be non-atomic operation and it is filled
> > > with perils as it requires virtiofsd to know what all kernel will
> > > do if this write has been done with CAP_FSETID dropped.
> > > 
> > > > In the worst case I/O requests should just
> > > > fail, it seems like a rare case anyway:
> > > 
> > > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > > or not. If there is one, it might be reasonable to error out. If we
> > > don't know, then we can't fail all the operations.
> > > 
> > > > I/O to a setuid/setgid file with
> > > > a memory buffer that is not mapped in virtiofsd.
> > > 
> > > With DAX it is easily triggerable. User has to append to a setuid file
> > > in virtiofs and this path will trigger.
> > > 
> > > I am fine with not supporting this patch but will also need a reaosonable
> > > alternative solution.
> > 
> > One way to avoid this problem is by introducing DMA read/write functions
> > into the vhost-user protocol that can be used by all device types, not
> > just virtio-fs.
> > 
> > Today virtio-fs uses the IO slave request when it cannot access a region
> > of guest memory. It sends the file descriptor to QEMU and QEMU performs
> > the pread(2)/pwrite(2) on behalf of virtiofsd.
> > 
> > I mentioned in the past that this solution is over-specialized. It
> > doesn't solve the larger problem that vhost-user processes do not have
> > full access to the guest memory space (e.g. DAX window).
> > 
> > Instead of sending file I/O requests over to QEMU, the vhost-user
> > protocol should offer DMA read/write requests so any vhost-user process
> > can access the guest memory space where vhost's shared memory mechanism
> > is insufficient.
> > 
> > Here is how it would work:
> > 
> > 1. Drop the IO slave request, replace it with DMA read/write slave
> >    requests.
> > 
> >    Note that these new requests can also be used in environments where
> >    maximum vIOMMU isolation is needed for security reasons and sharing
> >    all of guest RAM with the vhost-user process is considered
> >    unacceptable.
> > 
> > 2. When virtqueue buffer mapping fails, send DMA read/write slave
> >    requests to transfer the data from/to QEMU. virtiofsd calls
> >    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.
> 
> Can you elaborate a bit more how will this new DMA read/write vhost-user
> commands can be implemented. I am assuming its not a real DMA and just
> sort of emulation of DMA. Effectively we have two processes and one
> process needs to read/write to/from address space of other process.
> 
> We were also wondering if we can make use of process_vm_readv()
> and process_vm_write() syscalls to achieve this. But this atleast
> requires virtiofsd to be more priviliged than qemu and also virtiofsd
> needs to know where DAX mapping window is. We briefly discussed this here.
> 
> https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/

I wasn't thinking of directly allowing QEMU virtual memory access via
process_vm_readv/writev(). That would be more efficient but requires
privileges and also exposes internals of QEMU's virtual memory layout
and vIOMMU translation to the vhost-user process.

Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
containing the address (a device IOVA, it could just be a guest physical
memory address in most cases) and the length. The WRITE message would
also contain the data that the vhost-user device wishes to write. The
READ message reply would contain the data that the device read from
QEMU.

QEMU would implement this using QEMU's address_space_read/write() APIs.

So basically just a new vhost-user protocol message to do a memcpy(),
but with guest addresses and vIOMMU support :).

The vhost-user device will need to do bounce buffering so using these
new messages is slower than zero-copy I/O to shared guest RAM.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 03/26] DAX: vhost-user: Rework slave return values
  2021-05-04 15:23   ` Stefan Hajnoczi
@ 2021-05-27 15:59     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 15:59 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:37PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > All the current slave handlers on the qemu side generate an 'int'
> > return value that's squashed down to a bool (!!ret) and stuffed into
> > a uint64_t (field of a union) to be returned.
> > 
> > Move the uint64_t type back up through the individual handlers so
> > that we can make one actually return a full uint64_t.
> > 
> > Note that the definition in the interop spec says most of these
> > cases are defined as returning 0 on success and non-0 for failure,
> > so it's OK to change from a bool to another non-0.
> > 
> > Vivek:
> > This is needed because upcoming patches in series will add new functions
> > which want to return full error code. Existing functions continue to
> > return true/false so, it should not lead to change of behavior for
> > existing users.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reviewed-by: Greg Kurz <groug@kaod.org>
> > ---
> >  hw/virtio/vhost-backend.c         |  6 +++---
> >  hw/virtio/vhost-user.c            | 31 ++++++++++++++++---------------
> >  include/hw/virtio/vhost-backend.h |  2 +-
> >  3 files changed, 20 insertions(+), 19 deletions(-)
> > 
> > diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
> > index 31b33bde37..1686c94767 100644
> > --- a/hw/virtio/vhost-backend.c
> > +++ b/hw/virtio/vhost-backend.c
> > @@ -401,8 +401,8 @@ int vhost_backend_invalidate_device_iotlb(struct vhost_dev *dev,
> >      return -ENODEV;
> >  }
> >  
> > -int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> > -                                          struct vhost_iotlb_msg *imsg)
> > +uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
> > +                                        struct vhost_iotlb_msg *imsg)
> >  {
> >      int ret = 0;
> 
> This patch is messy. We want uint64_t but these functions use int ret
> internally. The actual return values are true/false instead of int 0 and
> 1.

Yeh.

> Please use uint64_t everywhere: return 0 for success and 1 for failure
> instead of bool literals and change the type of the local ret variables
> to uint64_t.


OK, reworked, this part of it now looks like :

int vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
-                                          struct vhost_iotlb_msg *imsg)
+uint64_t vhost_backend_handle_iotlb_msg(struct vhost_dev *dev,
+                                        struct vhost_iotlb_msg *imsg)
 {
-    int ret = 0;
+    uint64_t ret = 0;
 
     if (unlikely(!dev->vdev)) {
         error_report("Unexpected IOTLB message when virtio device is stopped");
-        return -EINVAL;
+        return EINVAL;
     }
 
     switch (imsg->type) {
     case VHOST_IOTLB_MISS:
-        ret = vhost_device_iotlb_miss(dev, imsg->iova,
-                                      imsg->perm != VHOST_ACCESS_RO);
+        ret = -vhost_device_iotlb_miss(dev, imsg->iova,
+                                       imsg->perm != VHOST_ACCESS_RO);
         break;
     case VHOST_IOTLB_ACCESS_FAIL:
         /* FIXME: report device iotlb error */
         error_report("Access failure IOTLB message type not supported");
-        ret = -ENOTSUP;
+        ret = ENOTSUP;
         break;
     case VHOST_IOTLB_UPDATE:
     case VHOST_IOTLB_INVALIDATE:
     default:
         error_report("Unexpected IOTLB message type");
-        ret = -EINVAL;
+        ret = EINVAL;
         break;
     }


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping
  2021-05-05 14:15   ` Stefan Hajnoczi
@ 2021-05-27 16:57     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 16:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:43PM +0100, Dr. David Alan Gilbert (git) wrote:
> > diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
> > index dd0a02aa99..169a146e72 100644
> > --- a/hw/virtio/vhost-user-fs.c
> > +++ b/hw/virtio/vhost-user-fs.c
> > @@ -45,6 +45,72 @@ static const int user_feature_bits[] = {
> >  #define DAX_WINDOW_PROT PROT_NONE
> >  #endif
> >  
> > +/*
> > + * The message apparently had 'received_size' bytes, check this
> > + * matches the count in the message.
> > + *
> > + * Returns true if the size matches.
> > + */
> > +static bool check_slave_message_entries(const VhostUserFSSlaveMsg *sm,
> > +                                        int received_size)
> > +{
> > +    int tmp;
> > +
> > +    /*
> > +     * VhostUserFSSlaveMsg consists of a body followed by 'n' entries,
> > +     * (each VhostUserFSSlaveMsgEntry).  There's a maximum of
> > +     * VHOST_USER_FS_SLAVE_MAX_ENTRIES of these.
> > +     */
> > +    if (received_size <= sizeof(VhostUserFSSlaveMsg)) {
> 
> received_size is an int and we risk hitting checking against the coerced
> size_t value but then using the signed int received_size later. It's
> safer to convert to size_t once and then use that size_t value
> everywhere later.

Done.

> > +typedef struct {
> > +    /* Offsets within the file being mapped */
> > +    uint64_t fd_offset;
> > +    /* Offsets within the cache */
> > +    uint64_t c_offset;
> > +    /* Lengths of sections */
> > +    uint64_t len;
> > +    /* Flags, from VHOST_USER_FS_FLAG_* */
> > +    uint64_t flags;
> > +} VhostUserFSSlaveMsgEntry;
> > +
> > +typedef struct {
> > +    /* Number of entries */
> > +    uint16_t count;
> > +    /* Spare */
> > +    uint16_t align;
> 
> VhostUserFSSlaveMsgEntry is aligned to 64 bits, so the 16-bit align
> field is not sufficient for full alignment.

Ah, interesting; fixed to:

typedef struct {
    /* Generic flags for the overall message */
    uint32_t flags;
    /* Number of entries */
    uint16_t count;
    /* Spare */
    uint16_t align;
} VhostUserFSSlaveMsgHdr;

or do I actually need to have a uint64_t in there?

Dave


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  2021-05-06 15:16   ` Stefan Hajnoczi
@ 2021-05-27 17:31     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 17:31 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:52PM +0100, Dr. David Alan Gilbert (git) wrote:
> > +    close(fd);
> 
> I looked back at the hw/virtio/vhost-user.c slave channel code and it
> closes fds for us. Looks like this close(2) call should be removed, but
> please double-check in case I missed something.

Nice, I think you're right; I've deleted that close(fd)


> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO
  2021-05-06 15:12   ` Stefan Hajnoczi
@ 2021-05-27 17:44     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 17:44 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:52PM +0100, Dr. David Alan Gilbert (git) wrote:
> > @@ -220,6 +222,99 @@ uint64_t vhost_user_fs_slave_unmap(struct vhost_dev *dev, int message_size,
> >      return (uint64_t)res;
> >  }
> >  
> > +uint64_t vhost_user_fs_slave_io(struct vhost_dev *dev, int message_size,
> > +                                VhostUserFSSlaveMsg *sm, int fd)
> > +{
> > +    VHostUserFS *fs = (VHostUserFS *)object_dynamic_cast(OBJECT(dev->vdev),
> > +                          TYPE_VHOST_USER_FS);
> > +    if (!fs) {
> > +        error_report("%s: Bad fs ptr", __func__);
> > +        return (uint64_t)-1;
> > +    }
> > +    if (!check_slave_message_entries(sm, message_size)) {
> > +        return (uint64_t)-1;
> > +    }
> 
> These early error returns don't close(fd).

(as per followup, we don't need it and it's removed)

> > +
> > +    unsigned int i;
> > +    int res = 0;
> > +    size_t done = 0;
> > +
> > +    if (fd < 0) {
> > +        error_report("Bad fd for io");
> > +        return (uint64_t)-1;
> > +    }
> > +
> > +    for (i = 0; i < sm->count && !res; i++) {
> > +        VhostUserFSSlaveMsgEntry *e = &sm->entries[i];
> > +        if (e->len == 0) {
> > +            continue;
> > +        }
> > +
> > +        size_t len = e->len;
> > +        uint64_t fd_offset = e->fd_offset;
> > +        hwaddr gpa = e->c_offset;
> > +
> > +        while (len && !res) {
> > +            hwaddr xlat, xlat_len;
> > +            bool is_write = e->flags & VHOST_USER_FS_FLAG_MAP_W;
> > +            MemoryRegion *mr = address_space_translate(dev->vdev->dma_as, gpa,
> > +                                                       &xlat, &xlat_len,
> > +                                                       is_write,
> > +                                                       MEMTXATTRS_UNSPECIFIED);
> > +            if (!mr || !xlat_len) {
> > +                error_report("No guest region found for 0x%" HWADDR_PRIx, gpa);
> > +                res = -EFAULT;
> > +                break;
> > +            }
> > +
> > +            trace_vhost_user_fs_slave_io_loop(mr->name,
> > +                                          (uint64_t)xlat,
> > +                                          memory_region_is_ram(mr),
> > +                                          memory_region_is_romd(mr),
> > +                                          (size_t)xlat_len);
> > +
> > +            void *hostptr = qemu_map_ram_ptr(mr->ram_block,
> > +                                             xlat);
> > +            ssize_t transferred;
> 
> What happens when the MemoryRegion is not backed by RAM?

I've added a check for mr->ramblock being non-null that I think covers
it.

> > +            if (e->flags & VHOST_USER_FS_FLAG_MAP_R) {
> > +                /* Read from file into RAM */
> > +                if (mr->readonly) {
> > +                    res = -EFAULT;
> > +                    break;
> > +                }
> > +                transferred = pread(fd, hostptr, xlat_len, fd_offset);
> > +            } else if (e->flags & VHOST_USER_FS_FLAG_MAP_W) {
> > +                /* Write into file from RAM */
> > +                transferred = pwrite(fd, hostptr, xlat_len, fd_offset);
> > +            } else {
> > +                transferred = EINVAL;
> 
> I don't see how this is handled below by the error checking code. Should
> this be:
> 
>   errno = EINVAL;
>   transferred = -1;
> 
> ?

Thanks; I've gone with  res = - EINVAL; break;

Dave

> 
> > +            }
> > +
> > +            trace_vhost_user_fs_slave_io_loop_res(transferred);
> > +            if (transferred < 0) {
> > +                res = -errno;
> > +                break;
> > +            }
> > +            if (!transferred) {
> > +                /* EOF */
> > +                break;
> > +            }
> > +
> > +            done += transferred;
> > +            fd_offset += transferred;
> > +            gpa += transferred;
> > +            len -= transferred;
> > +        }
> > +    }
> > +    close(fd);
> > +
> > +    trace_vhost_user_fs_slave_io_exit(res, done);
> > +    if (res < 0) {
> > +        return (uint64_t)res;
> > +    }
> > +    return (uint64_t)done;
> > +}


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements
  2021-05-06 15:23   ` Stefan Hajnoczi
@ 2021-05-27 17:56     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 17:56 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, vgoyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Apr 28, 2021 at 12:00:54PM +0100, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > For some read/writes the virtio queue elements are unmappable by
> > the daemon; these are cases where the data is to be read/written
> > from non-RAM.  In viritofs's case this is typically a direct read/write
> 
> s/viritofs/virtiofs/

Eventually I'll stop making that typo.

> > into an mmap'd DAX file also on virtiofs (possibly on another instance).
> > 
> > When we receive a virtio queue element, check that we have enough
> > mappable data to handle the headers.  Make a note of the number of
> > unmappable 'in' entries (ie. for read data back to the VMM),
> > and flag the fuse_bufvec for 'out' entries with a new flag
> > FUSE_BUF_PHYS_ADDR.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > with fix by:
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > ---
> >  tools/virtiofsd/buffer.c      |   4 +-
> >  tools/virtiofsd/fuse_common.h |   7 ++
> >  tools/virtiofsd/fuse_virtio.c | 230 ++++++++++++++++++++++++----------
> >  3 files changed, 173 insertions(+), 68 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/buffer.c b/tools/virtiofsd/buffer.c
> > index 874f01c488..1a050aa441 100644
> > --- a/tools/virtiofsd/buffer.c
> > +++ b/tools/virtiofsd/buffer.c
> > @@ -77,6 +77,7 @@ static ssize_t fuse_buf_write(const struct fuse_buf *dst, size_t dst_off,
> >      ssize_t res = 0;
> >      size_t copied = 0;
> >  
> > +    assert(!(src->flags & FUSE_BUF_PHYS_ADDR));
> >      while (len) {
> >          if (dst->flags & FUSE_BUF_FD_SEEK) {
> >              res = pwrite(dst->fd, (char *)src->mem + src_off, len,
> > @@ -272,7 +273,8 @@ ssize_t fuse_buf_copy(struct fuse_bufvec *dstv, struct fuse_bufvec *srcv)
> >       * process
> >       */
> >      for (i = 0; i < srcv->count; i++) {
> > -        if (srcv->buf[i].flags & FUSE_BUF_IS_FD) {
> > +        if ((srcv->buf[i].flags & FUSE_BUF_PHYS_ADDR) ||
> > +            (srcv->buf[i].flags & FUSE_BUF_IS_FD)) {
> >              break;
> >          }
> >      }
> > diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
> > index fa9671872e..af43cf19f9 100644
> > --- a/tools/virtiofsd/fuse_common.h
> > +++ b/tools/virtiofsd/fuse_common.h
> > @@ -626,6 +626,13 @@ enum fuse_buf_flags {
> >       * detected.
> >       */
> >      FUSE_BUF_FD_RETRY = (1 << 3),
> > +
> > +    /**
> > +     * The addresses in the iovec represent guest physical addresses
> > +     * that can't be mapped by the daemon process.
> > +     * IO must be bounced back to the VMM to do it.
> > +     */
> > +    FUSE_BUF_PHYS_ADDR = (1 << 4),
> 
> Based on the previous patch this is not a gpa, it's an IOVA. Depending
> on the virtiofs device's DMA address space in QEMU this might be the
> same as guest physical addresses but there could also be vIOMMU
> translation (see the address_space_translate() call in the patch that
> implemented the IO slave command).

I've changed that comment to:
    /**
     * The addresses in the iovec represent guest physical addresses
     * (or IOVA when used with an IOMMU) * that can't be mapped by the
     * daemon process.
     * IO must be bounced back to the VMM to do it.

> Maybe virtiofs + vIOMMU has never been tested though... I'm not sure it
> works today.

It has and it definitely doesn't work yet.

> If you want to leave it as is, feel free:
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>

Thanks.

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-10 15:32           ` Stefan Hajnoczi
@ 2021-05-27 19:09             ` Dr. David Alan Gilbert
  2021-06-10 15:29               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-05-27 19:09 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Mon, May 10, 2021 at 11:23:24AM -0400, Vivek Goyal wrote:
> > On Mon, May 10, 2021 at 10:05:09AM +0100, Stefan Hajnoczi wrote:
> > > On Thu, May 06, 2021 at 12:02:23PM -0400, Vivek Goyal wrote:
> > > > On Thu, May 06, 2021 at 04:37:04PM +0100, Stefan Hajnoczi wrote:
> > > > > On Wed, Apr 28, 2021 at 12:01:00PM +0100, Dr. David Alan Gilbert (git) wrote:
> > > > > > From: Vivek Goyal <vgoyal@redhat.com>
> > > > > > 
> > > > > > If qemu guest asked to drop CAP_FSETID upon write, send that info
> > > > > > to qemu in SLAVE_FS_IO message so that qemu can drop capability
> > > > > > before WRITE. This is to make sure that any setuid bit is killed
> > > > > > on fd (if there is one set).
> > > > > > 
> > > > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > > > 
> > > > > I'm not sure if the QEMU FSETID patches make sense. QEMU shouldn't be
> > > > > running with FSETID because QEMU is untrusted. FSETGID would allow QEMU
> > > > > to create setgid files, thereby potentially allowing an attacker to gain
> > > > > any GID.
> > > > 
> > > > Sure, its not recommended to run QEMU as root, but we don't block that
> > > > either and I do regularly test with qemu running as root.
> > > > 
> > > > > 
> > > > > I think it's better not to implement QEMU FSETID functionality at all
> > > > > and to handle it another way.
> > > > 
> > > > One way could be that virtiofsd tries to clear setuid bit after I/O
> > > > has finished. But that will be non-atomic operation and it is filled
> > > > with perils as it requires virtiofsd to know what all kernel will
> > > > do if this write has been done with CAP_FSETID dropped.
> > > > 
> > > > > In the worst case I/O requests should just
> > > > > fail, it seems like a rare case anyway:
> > > > 
> > > > Is there a way for virtiofsd to know if qemu is running with CAP_FSETID
> > > > or not. If there is one, it might be reasonable to error out. If we
> > > > don't know, then we can't fail all the operations.
> > > > 
> > > > > I/O to a setuid/setgid file with
> > > > > a memory buffer that is not mapped in virtiofsd.
> > > > 
> > > > With DAX it is easily triggerable. User has to append to a setuid file
> > > > in virtiofs and this path will trigger.
> > > > 
> > > > I am fine with not supporting this patch but will also need a reaosonable
> > > > alternative solution.
> > > 
> > > One way to avoid this problem is by introducing DMA read/write functions
> > > into the vhost-user protocol that can be used by all device types, not
> > > just virtio-fs.
> > > 
> > > Today virtio-fs uses the IO slave request when it cannot access a region
> > > of guest memory. It sends the file descriptor to QEMU and QEMU performs
> > > the pread(2)/pwrite(2) on behalf of virtiofsd.
> > > 
> > > I mentioned in the past that this solution is over-specialized. It
> > > doesn't solve the larger problem that vhost-user processes do not have
> > > full access to the guest memory space (e.g. DAX window).
> > > 
> > > Instead of sending file I/O requests over to QEMU, the vhost-user
> > > protocol should offer DMA read/write requests so any vhost-user process
> > > can access the guest memory space where vhost's shared memory mechanism
> > > is insufficient.
> > > 
> > > Here is how it would work:
> > > 
> > > 1. Drop the IO slave request, replace it with DMA read/write slave
> > >    requests.
> > > 
> > >    Note that these new requests can also be used in environments where
> > >    maximum vIOMMU isolation is needed for security reasons and sharing
> > >    all of guest RAM with the vhost-user process is considered
> > >    unacceptable.
> > > 
> > > 2. When virtqueue buffer mapping fails, send DMA read/write slave
> > >    requests to transfer the data from/to QEMU. virtiofsd calls
> > >    pread(2)/pwrite(2) itself with virtiofsd's Linux capabilities.
> > 
> > Can you elaborate a bit more how will this new DMA read/write vhost-user
> > commands can be implemented. I am assuming its not a real DMA and just
> > sort of emulation of DMA. Effectively we have two processes and one
> > process needs to read/write to/from address space of other process.
> > 
> > We were also wondering if we can make use of process_vm_readv()
> > and process_vm_write() syscalls to achieve this. But this atleast
> > requires virtiofsd to be more priviliged than qemu and also virtiofsd
> > needs to know where DAX mapping window is. We briefly discussed this here.
> > 
> > https://lore.kernel.org/qemu-devel/20210421200746.GH1579961@redhat.com/
> 
> I wasn't thinking of directly allowing QEMU virtual memory access via
> process_vm_readv/writev(). That would be more efficient but requires
> privileges and also exposes internals of QEMU's virtual memory layout
> and vIOMMU translation to the vhost-user process.
> 
> Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> containing the address (a device IOVA, it could just be a guest physical
> memory address in most cases) and the length. The WRITE message would
> also contain the data that the vhost-user device wishes to write. The
> READ message reply would contain the data that the device read from
> QEMU.
> 
> QEMU would implement this using QEMU's address_space_read/write() APIs.
> 
> So basically just a new vhost-user protocol message to do a memcpy(),
> but with guest addresses and vIOMMU support :).

This doesn't actually feel that hard - ignoring vIOMMU for a minute
which I know very little about - I'd have to think where the data
actually flows, probably the slave fd.

> The vhost-user device will need to do bounce buffering so using these
> new messages is slower than zero-copy I/O to shared guest RAM.

I guess the theory is it's only in the weird corner cases anyway.

Dave

> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-05-27 19:09             ` Dr. David Alan Gilbert
@ 2021-06-10 15:29               ` Dr. David Alan Gilbert
  2021-06-10 16:23                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-10 15:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:

<snip>

> > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > containing the address (a device IOVA, it could just be a guest physical
> > memory address in most cases) and the length. The WRITE message would
> > also contain the data that the vhost-user device wishes to write. The
> > READ message reply would contain the data that the device read from
> > QEMU.
> > 
> > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > 
> > So basically just a new vhost-user protocol message to do a memcpy(),
> > but with guest addresses and vIOMMU support :).
> 
> This doesn't actually feel that hard - ignoring vIOMMU for a minute
> which I know very little about - I'd have to think where the data
> actually flows, probably the slave fd.
> 
> > The vhost-user device will need to do bounce buffering so using these
> > new messages is slower than zero-copy I/O to shared guest RAM.
> 
> I guess the theory is it's only in the weird corner cases anyway.

The direction I'm going is something like the following;
the idea is that the master will have to handle the requests on a
separate thread, to avoid any problems with side effects from the memory
accesses; the slave will then have to parkt he requests somewhere and
handle them later.


From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Thu, 10 Jun 2021 15:34:04 +0100
Subject: [PATCH] WIP: vhost-user: DMA type interface

A DMA type interface where the slave can ask for a stream of bytes
to be read/written to the guests memory by the master.
The interface is asynchronous, since a request may have side effects
inside the guest.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
 hw/virtio/vhost-user.c                    |  4 +++
 subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 9ebd05e2bf..b9b5322147 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -1347,6 +1347,15 @@ Master message types
   query the backend for its device status as defined in the Virtio
   specification.
 
+``VHOST_USER_MEM_DATA``
+  :id: 41
+  :equivalent ioctl: N/A
+  :slave payload: N/A
+  :master payload: ``struct VhostUserMemReply``
+
+  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
+  message.  Where the request was for the master to read data, this
+  message will be followed by the data that was read.
 
 Slave message types
 -------------------
@@ -1469,6 +1478,30 @@ Slave message types
   The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
   write to the file from RAM.
 
+``VHOST_USER_SLAVE_MEM_ACCESS``
+  :id: 9
+  :equivalent ioctl: N/A
+  :slave payload: ``struct VhostUserMemAccess``
+  :master payload: N/A
+
+  Requests that the master perform a range of memory accesses on behalf
+  of the slave that the slave can't perform itself.
+
+  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
+  field for the slave to write data into the RAM of the master.   In this
+  case the data to write follows the ``VhostUserMemAccess`` on the fd.
+  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
+  field for the slave to read data from the RAM of the master.
+
+  When the master has completed the access it replies on the main fd with
+  a ``VHOST_USER_MEM_DATA`` message.
+
+  The master is allowed to complete part of the request and reply stating
+  the amount completed, leaving it to the slave to resend further components.
+  This may happen to limit memory allocations in the master or to simplify
+  the implementation.
+
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 39a0e55cca..a3fefc4c1d 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_STATUS = 39,
+    VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_MEM_DATA = 41,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_FS_IO = 8,
+    VHOST_USER_SLAVE_MEM_ACCESS = 9,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
index eee611a2f6..b5444f4f6f 100644
--- a/subprojects/libvhost-user/libvhost-user.h
+++ b/subprojects/libvhost-user/libvhost-user.h
@@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_STATUS = 39,
+    VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_MEM_DATA = 41,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_FS_MAP = 6,
     VHOST_USER_SLAVE_FS_UNMAP = 7,
     VHOST_USER_SLAVE_FS_IO = 8,
+    VHOST_USER_SLAVE_MEM_ACCESS = 9,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
     uint16_t queue_size;
 } VhostUserInflight;
 
+/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
+#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
+#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
+typedef struct VhostUserMemAccess {
+    uint32_t id; /* Included in the reply */
+    uint32_t flags;
+    uint64_t addr; /* In the bus address of the device */
+    uint64_t len;  /* In bytes */
+} VhostUserMemAccess;
+
+typedef struct VhostUserMemReply {
+    uint32_t id; /* From the request */
+    uint32_t flags;
+    uint32_t err; /* 0 on success */
+    uint32_t align;
+    uint64_t len;
+} VhostUserMemReply;
+
 #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
 # define VU_PACKED __attribute__((gcc_struct, packed))
 #else
@@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
         VhostUserVringArea area;
         VhostUserInflight inflight;
         VhostUserFSSlaveMsgMax fs_max;
+        VhostUserMemAccess memaccess;
+        VhostUserMemReply  memreply;
     } payload;
 
     int fds[VHOST_MEMORY_BASELINE_NREGIONS];
-- 
2.31.1

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-06-10 15:29               ` Dr. David Alan Gilbert
@ 2021-06-10 16:23                 ` Stefan Hajnoczi
  2021-06-16 12:36                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-06-10 16:23 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

[-- Attachment #1: Type: text/plain, Size: 8365 bytes --]

On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> 
> <snip>
> 
> > > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > > containing the address (a device IOVA, it could just be a guest physical
> > > memory address in most cases) and the length. The WRITE message would
> > > also contain the data that the vhost-user device wishes to write. The
> > > READ message reply would contain the data that the device read from
> > > QEMU.
> > > 
> > > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > > 
> > > So basically just a new vhost-user protocol message to do a memcpy(),
> > > but with guest addresses and vIOMMU support :).
> > 
> > This doesn't actually feel that hard - ignoring vIOMMU for a minute
> > which I know very little about - I'd have to think where the data
> > actually flows, probably the slave fd.
> > 
> > > The vhost-user device will need to do bounce buffering so using these
> > > new messages is slower than zero-copy I/O to shared guest RAM.
> > 
> > I guess the theory is it's only in the weird corner cases anyway.

The feature is also useful if DMA isolation is desirable (i.e.
security/reliability are more important than performance). Once this new
vhost-user protocol feature is available it will be possible to run
vhost-user devices without shared memory or with limited shared memory
(e.g. just the vring).

> The direction I'm going is something like the following;
> the idea is that the master will have to handle the requests on a
> separate thread, to avoid any problems with side effects from the memory
> accesses; the slave will then have to parkt he requests somewhere and
> handle them later.
> 
> 
> From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Date: Thu, 10 Jun 2021 15:34:04 +0100
> Subject: [PATCH] WIP: vhost-user: DMA type interface
> 
> A DMA type interface where the slave can ask for a stream of bytes
> to be read/written to the guests memory by the master.
> The interface is asynchronous, since a request may have side effects
> inside the guest.
> 
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>  docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
>  hw/virtio/vhost-user.c                    |  4 +++
>  subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
>  3 files changed, 61 insertions(+)

Use of the word "RAM" in this patch is a little unclear since we need
these new messages precisely when it's not ordinary guest RAM :-). Maybe
referring to the address space is more general.

> diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> index 9ebd05e2bf..b9b5322147 100644
> --- a/docs/interop/vhost-user.rst
> +++ b/docs/interop/vhost-user.rst
> @@ -1347,6 +1347,15 @@ Master message types
>    query the backend for its device status as defined in the Virtio
>    specification.
>  
> +``VHOST_USER_MEM_DATA``
> +  :id: 41
> +  :equivalent ioctl: N/A
> +  :slave payload: N/A
> +  :master payload: ``struct VhostUserMemReply``
> +
> +  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
> +  message.  Where the request was for the master to read data, this
> +  message will be followed by the data that was read.

Please explain why this message is asynchronous. Implementors will need
to understand the gotchas around deadlocks, etc.

>  
>  Slave message types
>  -------------------
> @@ -1469,6 +1478,30 @@ Slave message types
>    The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
>    write to the file from RAM.
>  
> +``VHOST_USER_SLAVE_MEM_ACCESS``
> +  :id: 9
> +  :equivalent ioctl: N/A
> +  :slave payload: ``struct VhostUserMemAccess``
> +  :master payload: N/A
> +
> +  Requests that the master perform a range of memory accesses on behalf
> +  of the slave that the slave can't perform itself.
> +
> +  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
> +  field for the slave to write data into the RAM of the master.   In this
> +  case the data to write follows the ``VhostUserMemAccess`` on the fd.
> +  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
> +  field for the slave to read data from the RAM of the master.
> +
> +  When the master has completed the access it replies on the main fd with
> +  a ``VHOST_USER_MEM_DATA`` message.
> +
> +  The master is allowed to complete part of the request and reply stating
> +  the amount completed, leaving it to the slave to resend further components.
> +  This may happen to limit memory allocations in the master or to simplify
> +  the implementation.
> +
> +
>  .. _reply_ack:
>  
>  VHOST_USER_PROTOCOL_F_REPLY_ACK
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 39a0e55cca..a3fefc4c1d 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
>      VHOST_USER_ADD_MEM_REG = 37,
>      VHOST_USER_REM_MEM_REG = 38,
> +    VHOST_USER_SET_STATUS = 39,
> +    VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_MEM_DATA = 41,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
>      VHOST_USER_SLAVE_FS_MAP = 6,
>      VHOST_USER_SLAVE_FS_UNMAP = 7,
>      VHOST_USER_SLAVE_FS_IO = 8,
> +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
>      VHOST_USER_SLAVE_MAX
>  }  VhostUserSlaveRequest;
>  
> diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
> index eee611a2f6..b5444f4f6f 100644
> --- a/subprojects/libvhost-user/libvhost-user.h
> +++ b/subprojects/libvhost-user/libvhost-user.h
> @@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
>      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
>      VHOST_USER_ADD_MEM_REG = 37,
>      VHOST_USER_REM_MEM_REG = 38,
> +    VHOST_USER_SET_STATUS = 39,
> +    VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_MEM_DATA = 41,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
>      VHOST_USER_SLAVE_FS_MAP = 6,
>      VHOST_USER_SLAVE_FS_UNMAP = 7,
>      VHOST_USER_SLAVE_FS_IO = 8,
> +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
>      VHOST_USER_SLAVE_MAX
>  }  VhostUserSlaveRequest;
>  
> @@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
>      uint16_t queue_size;
>  } VhostUserInflight;
>  
> +/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
> +#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
> +#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
> +typedef struct VhostUserMemAccess {
> +    uint32_t id; /* Included in the reply */
> +    uint32_t flags;

Is VHOST_USER_MEM_FLAG_TO_MASTER | VHOST_USER_MEM_FLAG_FROM_MASTER
valid?

> +    uint64_t addr; /* In the bus address of the device */

Please check the spec for preferred terminology. "bus address" isn't
used in the spec, so there's probably another term for it.

> +    uint64_t len;  /* In bytes */
> +} VhostUserMemAccess;
> +
> +typedef struct VhostUserMemReply {
> +    uint32_t id; /* From the request */
> +    uint32_t flags;

Are any flags defined?

> +    uint32_t err; /* 0 on success */
> +    uint32_t align;

Is this a reserved padding field? "align" is confusing because it could
refer to some kind of memory alignment value. "reserved" or "padding" is
clearer.

> +    uint64_t len;
> +} VhostUserMemReply;
> +
>  #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
>  # define VU_PACKED __attribute__((gcc_struct, packed))
>  #else
> @@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
>          VhostUserFSSlaveMsgMax fs_max;
> +        VhostUserMemAccess memaccess;
> +        VhostUserMemReply  memreply;
>      } payload;
>  
>      int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> -- 
> 2.31.1
> 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-06-10 16:23                 ` Stefan Hajnoczi
@ 2021-06-16 12:36                   ` Dr. David Alan Gilbert
  2021-06-16 15:29                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-16 12:36 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > 
> > <snip>
> > 
> > > > Instead I was thinking about VHOST_USER_DMA_READ/WRITE messages
> > > > containing the address (a device IOVA, it could just be a guest physical
> > > > memory address in most cases) and the length. The WRITE message would
> > > > also contain the data that the vhost-user device wishes to write. The
> > > > READ message reply would contain the data that the device read from
> > > > QEMU.
> > > > 
> > > > QEMU would implement this using QEMU's address_space_read/write() APIs.
> > > > 
> > > > So basically just a new vhost-user protocol message to do a memcpy(),
> > > > but with guest addresses and vIOMMU support :).
> > > 
> > > This doesn't actually feel that hard - ignoring vIOMMU for a minute
> > > which I know very little about - I'd have to think where the data
> > > actually flows, probably the slave fd.
> > > 
> > > > The vhost-user device will need to do bounce buffering so using these
> > > > new messages is slower than zero-copy I/O to shared guest RAM.
> > > 
> > > I guess the theory is it's only in the weird corner cases anyway.
> 
> The feature is also useful if DMA isolation is desirable (i.e.
> security/reliability are more important than performance). Once this new
> vhost-user protocol feature is available it will be possible to run
> vhost-user devices without shared memory or with limited shared memory
> (e.g. just the vring).

I don't see it ever being efficient, so that case is going to be pretty
limited.

> > The direction I'm going is something like the following;
> > the idea is that the master will have to handle the requests on a
> > separate thread, to avoid any problems with side effects from the memory
> > accesses; the slave will then have to parkt he requests somewhere and
> > handle them later.
> > 
> > 
> > From 07aacff77c50c8a2b588b2513f2dfcfb8f5aa9df Mon Sep 17 00:00:00 2001
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > Date: Thu, 10 Jun 2021 15:34:04 +0100
> > Subject: [PATCH] WIP: vhost-user: DMA type interface
> > 
> > A DMA type interface where the slave can ask for a stream of bytes
> > to be read/written to the guests memory by the master.
> > The interface is asynchronous, since a request may have side effects
> > inside the guest.
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > ---
> >  docs/interop/vhost-user.rst               | 33 +++++++++++++++++++++++
> >  hw/virtio/vhost-user.c                    |  4 +++
> >  subprojects/libvhost-user/libvhost-user.h | 24 +++++++++++++++++
> >  3 files changed, 61 insertions(+)
> 
> Use of the word "RAM" in this patch is a little unclear since we need
> these new messages precisely when it's not ordinary guest RAM :-). Maybe
> referring to the address space is more general.

Yeh, I'll try and spot those.

> > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > index 9ebd05e2bf..b9b5322147 100644
> > --- a/docs/interop/vhost-user.rst
> > +++ b/docs/interop/vhost-user.rst
> > @@ -1347,6 +1347,15 @@ Master message types
> >    query the backend for its device status as defined in the Virtio
> >    specification.
> >  
> > +``VHOST_USER_MEM_DATA``
> > +  :id: 41
> > +  :equivalent ioctl: N/A
> > +  :slave payload: N/A
> > +  :master payload: ``struct VhostUserMemReply``
> > +
> > +  This message is an asynchronous response to a ``VHOST_USER_SLAVE_MEM_ACCESS``
> > +  message.  Where the request was for the master to read data, this
> > +  message will be followed by the data that was read.
> 
> Please explain why this message is asynchronous. Implementors will need
> to understand the gotchas around deadlocks, etc.

I've added:
  Making this a separate asynchronous response message (rather than just a reply
  to the ``VHOST_USER_SLAVE_MEM_ACCESS``) makes it much easier for the master
  to deal with any side effects the access may have, and in particular avoid
  deadlocks they might cause if an access triggers another vhost_user message.

> >  
> >  Slave message types
> >  -------------------
> > @@ -1469,6 +1478,30 @@ Slave message types
> >    The ``VHOST_USER_FS_FLAG_MAP_W`` flag must be set in the ``flags`` field to
> >    write to the file from RAM.
> >  
> > +``VHOST_USER_SLAVE_MEM_ACCESS``
> > +  :id: 9
> > +  :equivalent ioctl: N/A
> > +  :slave payload: ``struct VhostUserMemAccess``
> > +  :master payload: N/A
> > +
> > +  Requests that the master perform a range of memory accesses on behalf
> > +  of the slave that the slave can't perform itself.
> > +
> > +  The ``VHOST_USER_MEM_FLAG_TO_MASTER`` flag must be set in the ``flags``
> > +  field for the slave to write data into the RAM of the master.   In this
> > +  case the data to write follows the ``VhostUserMemAccess`` on the fd.
> > +  The ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flag must be set in the ``flags``
> > +  field for the slave to read data from the RAM of the master.
> > +
> > +  When the master has completed the access it replies on the main fd with
> > +  a ``VHOST_USER_MEM_DATA`` message.
> > +
> > +  The master is allowed to complete part of the request and reply stating
> > +  the amount completed, leaving it to the slave to resend further components.
> > +  This may happen to limit memory allocations in the master or to simplify
> > +  the implementation.
> > +
> > +
> >  .. _reply_ack:
> >  
> >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index 39a0e55cca..a3fefc4c1d 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -126,6 +126,9 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> >      VHOST_USER_ADD_MEM_REG = 37,
> >      VHOST_USER_REM_MEM_REG = 38,
> > +    VHOST_USER_SET_STATUS = 39,
> > +    VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_MEM_DATA = 41,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > @@ -139,6 +142,7 @@ typedef enum VhostUserSlaveRequest {
> >      VHOST_USER_SLAVE_FS_MAP = 6,
> >      VHOST_USER_SLAVE_FS_UNMAP = 7,
> >      VHOST_USER_SLAVE_FS_IO = 8,
> > +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
> >      VHOST_USER_SLAVE_MAX
> >  }  VhostUserSlaveRequest;
> >  
> > diff --git a/subprojects/libvhost-user/libvhost-user.h b/subprojects/libvhost-user/libvhost-user.h
> > index eee611a2f6..b5444f4f6f 100644
> > --- a/subprojects/libvhost-user/libvhost-user.h
> > +++ b/subprojects/libvhost-user/libvhost-user.h
> > @@ -109,6 +109,9 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> >      VHOST_USER_ADD_MEM_REG = 37,
> >      VHOST_USER_REM_MEM_REG = 38,
> > +    VHOST_USER_SET_STATUS = 39,
> > +    VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_MEM_DATA = 41,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >  
> > @@ -122,6 +125,7 @@ typedef enum VhostUserSlaveRequest {
> >      VHOST_USER_SLAVE_FS_MAP = 6,
> >      VHOST_USER_SLAVE_FS_UNMAP = 7,
> >      VHOST_USER_SLAVE_FS_IO = 8,
> > +    VHOST_USER_SLAVE_MEM_ACCESS = 9,
> >      VHOST_USER_SLAVE_MAX
> >  }  VhostUserSlaveRequest;
> >  
> > @@ -220,6 +224,24 @@ typedef struct VhostUserInflight {
> >      uint16_t queue_size;
> >  } VhostUserInflight;
> >  
> > +/* For the flags field of VhostUserMemAccess and VhostUserMemReply */
> > +#define VHOST_USER_MEM_FLAG_TO_MASTER (1u << 0)
> > +#define VHOST_USER_MEM_FLAG_FROM_MASTER (1u << 1)
> > +typedef struct VhostUserMemAccess {
> > +    uint32_t id; /* Included in the reply */
> > +    uint32_t flags;
> 
> Is VHOST_USER_MEM_FLAG_TO_MASTER | VHOST_USER_MEM_FLAG_FROM_MASTER
> valid?

No; I've changed the docs to state:
  One (and only one) of the ``VHOST_USER_MEM_FLAG_TO_MASTER`` and
  ``VHOST_USER_MEM_FLAG_FROM_MASTER`` flags must be set in the ``flags`` field.

> > +    uint64_t addr; /* In the bus address of the device */
> 
> Please check the spec for preferred terminology. "bus address" isn't
> used in the spec, so there's probably another term for it.

I'm not seeing anything useful in the virtio spec; it mostly talks about
guest physical addresses; it does say 'bus addresses' in the definition
of 'VIRTIO_F_ACCESS_PLATFORM' .

> > +    uint64_t len;  /* In bytes */
> > +} VhostUserMemAccess;
> > +
> > +typedef struct VhostUserMemReply {
> > +    uint32_t id; /* From the request */
> > +    uint32_t flags;
> 
> Are any flags defined?

Currently they're a copy of the TO/FROM _MASTER flags that were in the
request; which are useful to the device to make it easy to know whether
there's data following on the stream.

> > +    uint32_t err; /* 0 on success */
> > +    uint32_t align;
> 
> Is this a reserved padding field? "align" is confusing because it could
> refer to some kind of memory alignment value. "reserved" or "padding" is
> clearer.

Changed to 'padding'

Dave

> > +    uint64_t len;
> > +} VhostUserMemReply;
> > +
> >  #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
> >  # define VU_PACKED __attribute__((gcc_struct, packed))
> >  #else
> > @@ -248,6 +270,8 @@ typedef struct VhostUserMsg {
> >          VhostUserVringArea area;
> >          VhostUserInflight inflight;
> >          VhostUserFSSlaveMsgMax fs_max;
> > +        VhostUserMemAccess memaccess;
> > +        VhostUserMemReply  memreply;
> >      } payload;
> >  
> >      int fds[VHOST_MEMORY_BASELINE_NREGIONS];
> > -- 
> > 2.31.1
> > 
> > -- 
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-06-16 12:36                   ` Dr. David Alan Gilbert
@ 2021-06-16 15:29                     ` Stefan Hajnoczi
  2021-06-16 18:35                       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2021-06-16 15:29 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

[-- Attachment #1: Type: text/plain, Size: 776 bytes --]

On Wed, Jun 16, 2021 at 01:36:10PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > +    uint64_t addr; /* In the bus address of the device */
> > 
> > Please check the spec for preferred terminology. "bus address" isn't
> > used in the spec, so there's probably another term for it.
> 
> I'm not seeing anything useful in the virtio spec; it mostly talks about
> guest physical addresses; it does say 'bus addresses' in the definition
> of 'VIRTIO_F_ACCESS_PLATFORM' .

I meant the docs/interop/vhost-user.rst spec.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it
  2021-06-16 15:29                     ` Stefan Hajnoczi
@ 2021-06-16 18:35                       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 66+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-16 18:35 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtio-fs, qemu-devel, Vivek Goyal, groug

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Wed, Jun 16, 2021 at 01:36:10PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > On Thu, Jun 10, 2021 at 04:29:42PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > > * Stefan Hajnoczi (stefanha@redhat.com) wrote:
> > > > +    uint64_t addr; /* In the bus address of the device */
> > > 
> > > Please check the spec for preferred terminology. "bus address" isn't
> > > used in the spec, so there's probably another term for it.
> > 
> > I'm not seeing anything useful in the virtio spec; it mostly talks about
> > guest physical addresses; it does say 'bus addresses' in the definition
> > of 'VIRTIO_F_ACCESS_PLATFORM' .
> 
> I meant the docs/interop/vhost-user.rst spec.

I think they use the phrase 'guest address' so I've changed that to:

    uint64_t guest_addr; 

   Elsewhere in the vhost-user.rst it says:
   
   When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:
    
   * Guest addresses map to the vhost memory region containing that guest
     address.
    
   When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:
    
   * Guest addresses are also called I/O virtual addresses (IOVAs).  They are
     translated to user addresses via the IOTLB.
   
> Stefan


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2021-06-16 18:37 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-28 11:00 [PATCH v3 00/26] virtiofs dax patches Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 01/26] virtiofs: Fixup printf args Dr. David Alan Gilbert (git)
2021-05-04 14:54   ` Stefan Hajnoczi
2021-05-05 11:06     ` Dr. David Alan Gilbert
2021-05-06 15:56   ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 02/26] virtiofsd: Don't assume header layout Dr. David Alan Gilbert (git)
2021-05-04 15:12   ` Stefan Hajnoczi
2021-05-06 15:56   ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 03/26] DAX: vhost-user: Rework slave return values Dr. David Alan Gilbert (git)
2021-05-04 15:23   ` Stefan Hajnoczi
2021-05-27 15:59     ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 04/26] DAX: libvhost-user: Route slave message payload Dr. David Alan Gilbert (git)
2021-05-04 15:26   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 05/26] DAX: libvhost-user: Allow popping a queue element with bad pointers Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 06/26] DAX subprojects/libvhost-user: Add virtio-fs slave types Dr. David Alan Gilbert (git)
2021-04-29 15:48   ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 07/26] DAX: virtio: Add shared memory capability Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 08/26] DAX: virtio-fs: Add cache BAR Dr. David Alan Gilbert (git)
2021-05-05 12:12   ` Stefan Hajnoczi
2021-05-05 18:59     ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 09/26] DAX: virtio-fs: Add vhost-user slave commands for mapping Dr. David Alan Gilbert (git)
2021-05-05 14:15   ` Stefan Hajnoczi
2021-05-27 16:57     ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 10/26] DAX: virtio-fs: Fill in " Dr. David Alan Gilbert (git)
2021-05-05 16:43   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 11/26] DAX: virtiofsd Add cache accessor functions Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 12/26] DAX: virtiofsd: Add setup/remove mappings fuse commands Dr. David Alan Gilbert (git)
2021-05-06 15:02   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 13/26] DAX: virtiofsd: Add setup/remove mapping handlers to passthrough_ll Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 14/26] DAX: virtiofsd: Wire up passthrough_ll's lo_setupmapping Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 15/26] DAX: virtiofsd: Make lo_removemapping() work Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 16/26] DAX: virtiofsd: route se down to destroy method Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 17/26] DAX: virtiofsd: Perform an unmap on destroy Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 18/26] DAX/unmap: virtiofsd: Add VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
2021-05-06 15:12   ` Stefan Hajnoczi
2021-05-27 17:44     ` Dr. David Alan Gilbert
2021-05-06 15:16   ` Stefan Hajnoczi
2021-05-27 17:31     ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 19/26] DAX/unmap virtiofsd: Add wrappers for VHOST_USER_SLAVE_FS_IO Dr. David Alan Gilbert (git)
2021-04-28 12:53   ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 20/26] DAX/unmap virtiofsd: Parse unmappable elements Dr. David Alan Gilbert (git)
2021-05-06 15:23   ` Stefan Hajnoczi
2021-05-27 17:56     ` Dr. David Alan Gilbert
2021-04-28 11:00 ` [PATCH v3 21/26] DAX/unmap virtiofsd: Route unmappable reads Dr. David Alan Gilbert (git)
2021-05-06 15:27   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 22/26] DAX/unmap virtiofsd: route unmappable write to slave command Dr. David Alan Gilbert (git)
2021-05-06 15:28   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 23/26] DAX:virtiofsd: implement FUSE_INIT map_alignment field Dr. David Alan Gilbert (git)
2021-04-28 11:00 ` [PATCH v3 24/26] vhost-user-fs: Extend VhostUserFSSlaveMsg to pass additional info Dr. David Alan Gilbert (git)
2021-05-06 15:31   ` Stefan Hajnoczi
2021-05-06 15:32   ` Stefan Hajnoczi
2021-04-28 11:00 ` [PATCH v3 25/26] vhost-user-fs: Implement drop CAP_FSETID functionality Dr. David Alan Gilbert (git)
2021-04-28 11:01 ` [PATCH v3 26/26] virtiofsd: Ask qemu to drop CAP_FSETID if client asked for it Dr. David Alan Gilbert (git)
2021-05-06 15:37   ` Stefan Hajnoczi
2021-05-06 16:02     ` Vivek Goyal
2021-05-10  9:05       ` Stefan Hajnoczi
2021-05-10 15:23         ` Vivek Goyal
2021-05-10 15:32           ` Stefan Hajnoczi
2021-05-27 19:09             ` Dr. David Alan Gilbert
2021-06-10 15:29               ` Dr. David Alan Gilbert
2021-06-10 16:23                 ` Stefan Hajnoczi
2021-06-16 12:36                   ` Dr. David Alan Gilbert
2021-06-16 15:29                     ` Stefan Hajnoczi
2021-06-16 18:35                       ` Dr. David Alan Gilbert
2021-04-28 11:27 ` [PATCH v3 00/26] virtiofs dax patches no-reply
2021-05-06 15:37 ` Stefan Hajnoczi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.