* [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
@ 2020-12-15 16:21 Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 1/9] vhost-user-fs: Add support for reconnection of vhost-user-fs backend Jiachen Zhang
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

Hi all,

This patchset implements virtio-fs crash reconnection. The reconnection
of virtiofsd is completely transparent to the guest: no remount in the
guest is needed, and even inflight requests can be handled normally after
reconnection. We are looking forward to any comments.

Thanks,
Jiachen


OVERVIEW:

To support virtio-fs crash reconnection, we need to support the recovery
of 1) inflight FUSE requests, and 2) virtiofsd's internal status information.

Fortunately, QEMU's vhost-user reconnection framework already supports
inflight I/O tracking via VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD (see commits 5ad204bf2 and 5f9ff1eff for
details). Since FUSE requests are transferred as virtqueue I/O requests,
this inflight I/O tracking lets us recover the inflight FUSE requests.
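
For reference, the per-queue layout of that shared inflight region is
roughly the following (condensed from the DescStateSplit /
QueueRegionSplit definitions in docs/interop/vhost-user.rst; see there
for the authoritative comments):

    typedef struct DescStateSplit {
        uint8_t  inflight;   /* set while the head descriptor is in flight */
        uint8_t  padding[5];
        uint16_t next;       /* links the last batch of used descriptors */
        uint64_t counter;    /* preserves fetch order for resubmission */
    } DescStateSplit;

    typedef struct QueueRegionSplit {
        uint64_t features;        /* region feature flags, currently 0 */
        uint16_t version;         /* region format version, currently 1 */
        uint16_t desc_num;        /* equals the virtqueue size */
        uint16_t last_batch_head; /* head of the last used batch */
        uint16_t used_idx;        /* shadow of the used ring index */
        DescStateSplit desc[];    /* one entry per descriptor */
    } QueueRegionSplit;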

To support virtiofsd internal status recovery, we introduce 4 new
vhost-user message types. As shown in the following diagram, two of them
are used to persist the shared lo_maps and opened fds to QEMU, and the
other two are used to restore that status when reconnecting.

                               VHOST_USER_SLAVE_SHM
                               VHOST_USER_SLAVE_FD
    +--------------+       Persist       +--------------------+
    |              <---------------------+                    |
    |     QEMU     |                     |  Virtio-fs Daemon  |
    |              +--------------------->                    |
    +--------------+       Restore       +--------------------+
            VHOST_USER_SET_SHM
            VHOST_USER_SET_FD

Although the 4 newly added message types are introduced to support
virtiofsd reconnection in this patchset, they may also be useful in other
situations, so we have tried to keep them general when adding them to the
vhost-related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM
can be used for memory sharing between a vhost-user daemon and QEMU, and
VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful for sharing
opened fds between the QEMU process and a vhost-user daemon process.
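
For instance, with the libvhost-user helpers added in patch 4, a backend
could push its state to QEMU along these lines (a minimal sketch; error
handling and the daemon context are elided, and persist_state() itself is
a hypothetical caller, not part of this series):

    #include <stdio.h>
    #include <stdint.h>
    #include "libvhost-user.h"

    /* Persist a memfd-backed table plus one opened fd to QEMU. map_type
     * and fd_key are backend-chosen identifiers that must stay stable
     * across daemon restarts. */
    static void persist_state(VuDev *dev, int table_memfd,
                              uint64_t table_size, int map_type,
                              int file_fd, int fd_key)
    {
        /* VHOST_USER_SLAVE_SHM: QEMU keeps a dup of the memfd */
        if (!vu_slave_send_shm(dev, table_memfd, table_size, map_type)) {
            fprintf(stderr, "failed to persist shm region %d\n", map_type);
        }

        /* VHOST_USER_SLAVE_FD: QEMU keeps a dup of the fd under fd_key */
        if (!vu_slave_send_fd_add(dev, file_fd, fd_key)) {
            fprintf(stderr, "failed to persist fd with key %d\n", fd_key);
        }
    }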


USAGE and NOTES:

- The commits are rebased onto the recent QEMU master commit b4d939133dca0fa2b.

- ",reconnect=1" should be added to the "-chardev socket" of vhost-user-fs-pci
in the QEMU command line, for example:

    qemu-system-x86_64 ... \
    -chardev socket,id=char0,path=/tmp/vhostqemu,reconnect=1 \
    -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
    ...

- We add new options to virtiofsd to enable or disable crash reconnection.
Some existing options are not compatible with crash reconnection, so pass
the following options to virtiofsd to enable reconnection:

    virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_lock
    -o no_xattr ...

- The reasons why virtiofsd-side locking, extended attributes, and mount
namespaces are not supported are explained in the commit message of the 6th
patch (virtiofsd: Add two new options for crash reconnection).

- The 9th patch is a workaround that does not affect overall correctness.
We removed the qsort-related code because we found that seccomp kills the
virtiofsd process when resubmit_num is larger than 64.

- Support for the DAX version of virtiofsd is quite feasible and requires
almost no additional changes to this patchset.


Jiachen Zhang (9):
  vhost-user-fs: Add support for reconnection of vhost-user-fs backend
  vhost: Add vhost-user message types for sending shared memory and file
    fds
  vhost-user-fs: Support virtiofsd crash reconnection
  libvhost-user: Add vhost-user message types for sending shared memory
    and file fds
  virtiofsd: Convert the struct lo_map array to a more flattened layout
  virtiofsd: Add two new options for crash reconnection
  virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
  virtiofsd: Ensure crash consistency after reconnection
  virtiofsd: (work around) Comment qsort in inflight I/O tracking

 contrib/libvhost-user/libvhost-user.c | 106 +++-
 contrib/libvhost-user/libvhost-user.h |  70 +++
 docs/interop/vhost-user.rst           |  41 ++
 hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
 hw/virtio/vhost-user.c                | 123 ++++
 hw/virtio/vhost.c                     |  42 ++
 include/hw/virtio/vhost-backend.h     |   6 +
 include/hw/virtio/vhost-user-fs.h     |  16 +-
 include/hw/virtio/vhost.h             |  42 ++
 tools/virtiofsd/fuse_lowlevel.c       |  24 +-
 tools/virtiofsd/fuse_virtio.c         |  44 ++
 tools/virtiofsd/fuse_virtio.h         |   1 +
 tools/virtiofsd/helper.c              |   9 +
 tools/virtiofsd/passthrough_helpers.h |   2 +-
 tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 16 files changed, 1413 insertions(+), 278 deletions(-)

-- 
2.20.1




* [RFC PATCH 1/9] vhost-user-fs: Add support for reconnection of vhost-user-fs backend
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 2/9] vhost: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

Based on vhost-user's inflight I/O tracking infrastructure, we now add
support for the vhost-user-fs backend (or virtiofs daemon) reconnection.

Note that, as of this patch, the state information of virtiofsd is not
yet saved, so virtiofsd will lose its state after reconnecting to QEMU.
The following patches in this patchset focus on state persistence and
restoration, with the help of some new vhost-user message types.
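
In outline, the reconnection flow added below hangs off the chardev event
callback; a condensed view of the vuf_event() / vuf_connect() /
vuf_disconnect() code in the diff:

    static void vuf_event(void *opaque, QEMUChrEvent event)
    {
        switch (event) {
        case CHR_EVENT_OPENED:
            /* vuf_connect(): vhost_dev_init(), then vuf_start() if the
             * device was already started; vuf_start() also restores the
             * inflight region via vhost_dev_get/set_inflight() */
            break;
        case CHR_EVENT_CLOSED:
            /* vuf_disconnect(), deferred to a bottom half while the VM
             * is running: vuf_stop() + vhost_dev_cleanup() */
            break;
        default:
            break;
        }
    }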

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 hw/virtio/vhost-user-fs.c         | 218 +++++++++++++++++++++++++++---
 include/hw/virtio/vhost-user-fs.h |   2 +
 2 files changed, 200 insertions(+), 20 deletions(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index 1bc5d03a00..ce343101d4 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -21,6 +21,7 @@
 #include "qemu/error-report.h"
 #include "hw/virtio/vhost-user-fs.h"
 #include "monitor/monitor.h"
+#include "sysemu/runstate.h"
 
 static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
 {
@@ -35,7 +36,7 @@ static void vuf_get_config(VirtIODevice *vdev, uint8_t *config)
     memcpy(config, &fscfg, sizeof(fscfg));
 }
 
-static void vuf_start(VirtIODevice *vdev)
+static int vuf_start(VirtIODevice *vdev)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
@@ -45,13 +46,13 @@ static void vuf_start(VirtIODevice *vdev)
 
     if (!k->set_guest_notifiers) {
         error_report("binding does not support guest notifiers");
-        return;
+        return -ENOSYS;
     }
 
     ret = vhost_dev_enable_notifiers(&fs->vhost_dev, vdev);
     if (ret < 0) {
         error_report("Error enabling host notifiers: %d", -ret);
-        return;
+        return ret;
     }
 
     ret = k->set_guest_notifiers(qbus->parent, fs->vhost_dev.nvqs, true);
@@ -61,6 +62,22 @@ static void vuf_start(VirtIODevice *vdev)
     }
 
     fs->vhost_dev.acked_features = vdev->guest_features;
+
+    if (!fs->inflight->addr) {
+        ret = vhost_dev_get_inflight(&fs->vhost_dev, fs->conf.queue_size,
+                                     fs->inflight);
+        if (ret < 0) {
+            error_report("Error get inflight: %d", -ret);
+            goto err_guest_notifiers;
+        }
+    }
+
+    ret = vhost_dev_set_inflight(&fs->vhost_dev, fs->inflight);
+    if (ret < 0) {
+        error_report("Error set inflight: %d", -ret);
+        goto err_guest_notifiers;
+    }
+
     ret = vhost_dev_start(&fs->vhost_dev, vdev);
     if (ret < 0) {
         error_report("Error starting vhost: %d", -ret);
@@ -76,12 +93,14 @@ static void vuf_start(VirtIODevice *vdev)
         vhost_virtqueue_mask(&fs->vhost_dev, vdev, i, false);
     }
 
-    return;
+    return ret;
 
 err_guest_notifiers:
     k->set_guest_notifiers(qbus->parent, fs->vhost_dev.nvqs, false);
 err_host_notifiers:
     vhost_dev_disable_notifiers(&fs->vhost_dev, vdev);
+
+    return ret;
 }
 
 static void vuf_stop(VirtIODevice *vdev)
@@ -110,17 +129,27 @@ static void vuf_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
     bool should_start = status & VIRTIO_CONFIG_S_DRIVER_OK;
+    int ret;
 
     if (!vdev->vm_running) {
         should_start = false;
     }
 
+    if (!fs->connected) {
+        return;
+    }
+
     if (fs->vhost_dev.started == should_start) {
         return;
     }
 
     if (should_start) {
-        vuf_start(vdev);
+        ret = vuf_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-fs: vhost start failed: %s",
+                         strerror(-ret));
+            qemu_chr_fe_disconnect(&fs->conf.chardev);
+        }
     } else {
         vuf_stop(vdev);
     }
@@ -140,30 +169,161 @@ static void vuf_handle_output(VirtIODevice *vdev, VirtQueue *vq)
      * Not normally called; it's the daemon that handles the queue;
      * however virtio's cleanup path can call this.
      */
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    int i, ret;
+
+    if (!vdev->start_on_kick) {
+        return;
+    }
+
+    if (!fs->connected) {
+        return;
+    }
+
+    if (fs->vhost_dev.started) {
+        return;
+    }
+
+    /*
+     * Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
+     * vhost here instead of waiting for .set_status().
+     */
+    ret = vuf_start(vdev);
+    if (ret < 0) {
+        error_report("vhost-user-fs: vhost start failed: %s",
+                     strerror(-ret));
+        qemu_chr_fe_disconnect(&fs->conf.chardev);
+        return;
+    }
+
+    /* Kick right away to begin processing requests already in vring */
+    for (i = 0; i < fs->vhost_dev.nvqs; i++) {
+        VirtQueue *kick_vq = virtio_get_queue(vdev, i);
+
+        if (!virtio_queue_get_desc_addr(vdev, i)) {
+            continue;
+        }
+        event_notifier_set(virtio_queue_get_host_notifier(kick_vq));
+    }
 }
 
-static void vuf_guest_notifier_mask(VirtIODevice *vdev, int idx,
-                                            bool mask)
+static bool vuf_guest_notifier_pending(VirtIODevice *vdev, int idx)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
 
-    vhost_virtqueue_mask(&fs->vhost_dev, vdev, idx, mask);
+    return vhost_virtqueue_pending(&fs->vhost_dev, idx);
 }
 
-static bool vuf_guest_notifier_pending(VirtIODevice *vdev, int idx)
+static void vuf_reset(VirtIODevice *vdev)
 {
     VHostUserFS *fs = VHOST_USER_FS(vdev);
+    vhost_dev_free_inflight(fs->inflight);
+}
 
-    return vhost_virtqueue_pending(&fs->vhost_dev, idx);
+static int vuf_connect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+    int ret = 0;
+
+    if (fs->connected) {
+        return 0;
+    }
+    fs->connected = true;
+
+    /* 1 high prio queue, plus the number configured */
+    fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
+    fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, fs->vhost_dev.nvqs);
+    ret = vhost_dev_init(&fs->vhost_dev, &fs->vhost_user,
+                         VHOST_BACKEND_TYPE_USER, 0);
+    if (ret < 0) {
+        error_report("vhost-user-fs: vhost initialization failed: %s",
+                     strerror(-ret));
+        return ret;
+    }
+
+    /* restore vhost state */
+    if (vdev->started) {
+        ret = vuf_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-fs: vhost start failed: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void vuf_disconnect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+    if (!fs->connected) {
+        return;
+    }
+    fs->connected = false;
+
+    if (fs->vhost_dev.started) {
+        vuf_stop(vdev);
+    }
+
+    vhost_dev_cleanup(&fs->vhost_dev);
+}
+
+static void vuf_event(void *opaque, QEMUChrEvent event);
+
+static void vuf_chr_closed_bh(void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+    vuf_disconnect(dev);
+    qemu_chr_fe_set_handlers(&fs->conf.chardev, NULL, NULL, vuf_event,
+            NULL, opaque, NULL, true);
+}
+
+static void vuf_event(void *opaque, QEMUChrEvent event)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserFS *fs = VHOST_USER_FS(vdev);
+
+    switch (event) {
+    case CHR_EVENT_OPENED:
+        if (vuf_connect(dev) < 0) {
+            qemu_chr_fe_disconnect(&fs->conf.chardev);
+            return;
+        }
+        break;
+    case CHR_EVENT_CLOSED:
+        /* delay disconnecting; see commit 4bcad76f4c390f */
+        if (runstate_is_running()) {
+            AioContext *ctx = qemu_get_current_aio_context();
+
+            qemu_chr_fe_set_handlers(&fs->conf.chardev, NULL, NULL, NULL, NULL,
+                    NULL, NULL, false);
+            aio_bh_schedule_oneshot(ctx, vuf_chr_closed_bh, opaque);
+        }
+        break;
+    case CHR_EVENT_BREAK:
+    case CHR_EVENT_MUX_IN:
+    case CHR_EVENT_MUX_OUT:
+        /* Ignore */
+        break;
+    }
 }
 
+
 static void vuf_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserFS *fs = VHOST_USER_FS(dev);
     unsigned int i;
     size_t len;
-    int ret;
+    Error *err = NULL;
 
     if (!fs->conf.chardev.chr) {
         error_setg(errp, "missing chardev");
@@ -217,16 +377,24 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
         fs->req_vqs[i] = virtio_add_queue(vdev, fs->conf.queue_size, vuf_handle_output);
     }
 
-    /* 1 high prio queue, plus the number configured */
-    fs->vhost_dev.nvqs = 1 + fs->conf.num_request_queues;
-    fs->vhost_dev.vqs = g_new0(struct vhost_virtqueue, fs->vhost_dev.nvqs);
-    ret = vhost_dev_init(&fs->vhost_dev, &fs->vhost_user,
-                         VHOST_BACKEND_TYPE_USER, 0);
-    if (ret < 0) {
-        error_setg_errno(errp, -ret, "vhost_dev_init failed");
+    /* init reconnection related variables */
+    fs->inflight = g_new0(struct vhost_inflight, 1);
+    fs->connected = false;
+
+    qemu_chr_fe_set_handlers(&fs->conf.chardev,  NULL, NULL, vuf_event,
+                                 NULL, (void *)dev, NULL, true);
+
+reconnect:
+    if (qemu_chr_fe_wait_connected(&fs->conf.chardev, &err) < 0) {
+        error_report_err(err);
         goto err_virtio;
     }
 
+    /* check whether vuf_connect() failed or not */
+    if (!fs->connected) {
+        goto reconnect;
+    }
+
     return;
 
 err_virtio:
@@ -236,6 +404,9 @@ err_virtio:
         virtio_delete_queue(fs->req_vqs[i]);
     }
     g_free(fs->req_vqs);
+    fs->req_vqs = NULL;
+    g_free(fs->inflight);
+    fs->inflight = NULL;
     virtio_cleanup(vdev);
     g_free(fs->vhost_dev.vqs);
     return;
@@ -248,7 +419,7 @@ static void vuf_device_unrealize(DeviceState *dev)
     int i;
 
     /* This will stop vhost backend if appropriate. */
-    vuf_set_status(vdev, 0);
+    virtio_set_status(vdev, 0);
 
     vhost_dev_cleanup(&fs->vhost_dev);
 
@@ -259,9 +430,16 @@ static void vuf_device_unrealize(DeviceState *dev)
         virtio_delete_queue(fs->req_vqs[i]);
     }
     g_free(fs->req_vqs);
+    fs->req_vqs = NULL;
+    qemu_chr_fe_set_handlers(&fs->conf.chardev,  NULL, NULL, NULL,
+                             NULL, NULL, NULL, false);
+
     virtio_cleanup(vdev);
+    vhost_dev_free_inflight(fs->inflight);
     g_free(fs->vhost_dev.vqs);
     fs->vhost_dev.vqs = NULL;
+    g_free(fs->inflight);
+    fs->inflight = NULL;
 }
 
 static const VMStateDescription vuf_vmstate = {
@@ -291,8 +469,8 @@ static void vuf_class_init(ObjectClass *klass, void *data)
     vdc->get_features = vuf_get_features;
     vdc->get_config = vuf_get_config;
     vdc->set_status = vuf_set_status;
-    vdc->guest_notifier_mask = vuf_guest_notifier_mask;
     vdc->guest_notifier_pending = vuf_guest_notifier_pending;
+    vdc->reset = vuf_reset;
 }
 
 static const TypeInfo vuf_info = {
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 6985752771..9ef47568e7 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -39,6 +39,8 @@ struct VHostUserFS {
     VhostUserState vhost_user;
     VirtQueue **req_vqs;
     VirtQueue *hiprio_vq;
+    struct vhost_inflight *inflight;
+    bool connected;
 
     /*< public >*/
 };
-- 
2.20.1




* [RFC PATCH 2/9] vhost: Add vhost-user message types for sending shared memory and file fds
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 1/9] vhost-user-fs: Add support for reconnection of vhost-user-fs backend Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 3/9] vhost-user-fs: Support virtiofsd crash reconnection Jiachen Zhang
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

This commit adds 4 new vhost-user message types,
    VHOST_USER_SET_SHM (41),
    VHOST_USER_SET_FD (42),
    VHOST_USER_SLAVE_SHM (slave type, 6),
    VHOST_USER_SLAVE_FD (slave type, 7),
and a vhost-user protocol feature,
    VHOST_USER_PROTOCOL_F_MAP_SHMFD (17).

They can be used by vhost-user devices and backend daemons to persist/restore
shared memory regions and opened file descriptors to/from QEMU.

This commit first implements 2 handlers for the 2 new slave message types
(vhost_user_slave_handle_shm and vhost_user_slave_handle_fd), then
implements some common interfaces for devices to use VHOST_USER_SET_SHM and
VHOST_USER_SET_FD (vhost_dev_set_shm and vhost_dev_set_fd).

This commit also defines some callback interfaces, which can be registered by
arbitrary vhost devices. (VhostDevShmOps is for VHOST_USER_SET_SHM and
VHOST_USER_SLAVE_SHM, and VhostDevFdOps is for VHOST_USER_SET_FD and
VHOST_USER_SLAVE_FD.)

The following commits will use the 4 new message types to implement
virtiofsd crash reconnection.
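
On the device side, the callbacks are registered once the vhost_dev is
initialized. A sketch of the wiring (patch 3 does this for vhost-user-fs;
the my_* handlers here are hypothetical placeholders):

    static const VhostDevShmOps my_shm_ops = {
        .vhost_dev_slave_shm = my_handle_shm, /* store VHOST_USER_SLAVE_SHM */
        .vhost_dev_shm_info  = my_shm_info,   /* feed VHOST_USER_SET_SHM */
    };

    static const VhostDevFdOps my_fd_ops = {
        .vhost_dev_slave_fd = my_handle_fd,   /* store VHOST_USER_SLAVE_FD */
        .vhost_dev_fd_info  = my_fd_info,     /* feed VHOST_USER_SET_FD */
    };

    /* after a successful vhost_dev_init(): */
    vhost_dev_set_shm_ops(&s->vhost_dev, &my_shm_ops);
    vhost_dev_set_fd_ops(&s->vhost_dev, &my_fd_ops);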

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 docs/interop/vhost-user.rst       |  41 ++++++++++
 hw/virtio/vhost-user.c            | 123 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  42 ++++++++++
 include/hw/virtio/vhost-backend.h |   6 ++
 include/hw/virtio/vhost.h         |  42 ++++++++++
 5 files changed, 254 insertions(+)

diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index 988f154144..515c879bd3 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -817,6 +817,7 @@ Protocol features
   #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
   #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
   #define VHOST_USER_PROTOCOL_F_STATUS               16
+  #define VHOST_USER_PROTOCOL_F_MAP_SHMFD            17
 
 Master message types
 --------------------
@@ -1330,6 +1331,26 @@ Master message types
   query the backend for its device status as defined in the Virtio
   specification.
 
+``VHOST_USER_SET_SHM``
+  :id: 41
+  :equivalent ioctl: N/A
+  :master payload: shared memory description
+  :slave payload: N/A
+
+  When the slave has the ``VHOST_USER_PROTOCOL_F_MAP_SHMFD`` protocol
+  feature, a memfd is provided in the ancillary data of the
+  ``VHOST_USER_SET_SHM`` message, and the shared memory description
+  (ID, size and offset) is provided in the message payload.
+
+``VHOST_USER_SET_FD``
+  :id: 42
+  :equivalent ioctl: N/A
+  :master payload: fd description
+  :slave payload: N/A
+
+  An fd is provided in the ancillary data of the ``VHOST_USER_SET_FD``
+  message, and the fd description (a unique key and an operation flag) is
+  provided in the message payload.
 
 Slave message types
 -------------------
@@ -1415,6 +1436,26 @@ Slave message types
 
   The state.num field is currently reserved and must be set to 0.
 
+``VHOST_USER_SLAVE_SHM``
+  :id: 6
+  :equivalent ioctl: N/A
+  :slave payload: shared memory description
+  :master payload: N/A
+
+  A memfd is provided in the ancillary data of the ``VHOST_USER_SLAVE_SHM``
+  message, and the shared memory description (ID, size and offset) is
+  provided in the message payload.
+
+``VHOST_USER_SLAVE_FD``
+  :id: 7
+  :equivalent ioctl: N/A
+  :slave payload: fd description
+  :master payload: N/A
+
+  An fd is provided in the ancillary data of the ``VHOST_USER_SLAVE_FD``
+  message, and the fd description (a unique key and an operation flag) is
+  provided in the message payload.
+
 .. _reply_ack:
 
 VHOST_USER_PROTOCOL_F_REPLY_ACK
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 9c5b4f7fbc..2b5170e921 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -79,6 +79,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_RESET_DEVICE = 13,
     /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+    VHOST_USER_PROTOCOL_F_MAP_SHMFD = 17,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -124,6 +125,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_SHM = 41,
+    VHOST_USER_SET_FD = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -132,6 +135,10 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_IOTLB_MSG = 1,
     VHOST_USER_SLAVE_CONFIG_CHANGE_MSG = 2,
     VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG = 3,
+    VHOST_USER_SLAVE_VRING_CALL = 4,
+    VHOST_USER_SLAVE_VRING_ERR = 5,
+    VHOST_USER_SLAVE_SHM = 6,
+    VHOST_USER_SLAVE_FD = 7,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -218,6 +225,8 @@ typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserShm shm;
+        VhostUserFd fdinfo;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1393,6 +1402,36 @@ static int vhost_user_slave_handle_vring_host_notifier(struct vhost_dev *dev,
     return 0;
 }
 
+static int vhost_user_slave_handle_shm(struct vhost_dev *dev,
+                                       VhostUserShm *shm, int fd)
+{
+    int ret;
+
+    if (!dev->shm_ops) {
+        return -1;
+    }
+
+    if (dev->shm_ops->vhost_dev_slave_shm) {
+        ret = dev->shm_ops->vhost_dev_slave_shm(dev, shm, fd);
+    }
+    return ret;
+}
+
+static int vhost_user_slave_handle_fd(struct vhost_dev *dev,
+                                      VhostUserFd *fdinfo, int fd)
+{
+    int ret = -1;
+    if (!dev->fd_ops) {
+        return -1;
+    }
+
+    if (dev->fd_ops->vhost_dev_slave_fd) {
+        ret = dev->fd_ops->vhost_dev_slave_fd(dev, fdinfo, fd);
+    }
+
+    return ret;
+}
+
 static void slave_read(void *opaque)
 {
     struct vhost_dev *dev = opaque;
@@ -1471,6 +1510,12 @@ static void slave_read(void *opaque)
         ret = vhost_user_slave_handle_vring_host_notifier(dev, &payload.area,
                                                           fd[0]);
         break;
+    case VHOST_USER_SLAVE_SHM:
+        ret = vhost_user_slave_handle_shm(dev, &payload.shm, fd[0]);
+        break;
+    case VHOST_USER_SLAVE_FD:
+        ret = vhost_user_slave_handle_fd(dev, &payload.fdinfo, fd[0]);
+        break;
     default:
         error_report("Received unexpected msg type: %d.", hdr.request);
         ret = -EINVAL;
@@ -2325,6 +2370,82 @@ static int vhost_user_set_inflight_fd(struct vhost_dev *dev,
     return 0;
 }
 
+
+/* The maximum shm number of a vhost-user device is MAX_SHM_NUM */
+#define MAX_SHM_NUM 128
+
+static int vhost_user_set_shm(struct vhost_dev *dev)
+{
+    int i;
+    int ret;
+    int memfd;
+    if (!dev->shm_ops) {
+        return -1;
+    }
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_MAP_SHMFD)) {
+        return 0;
+    }
+
+    if (dev->shm_ops->vhost_dev_shm_info) {
+        for (i = 0; i < MAX_SHM_NUM; i++) {
+            VhostUserMsg msg = {
+                .hdr.request = VHOST_USER_SET_SHM,
+                .hdr.flags = VHOST_USER_VERSION,
+                .hdr.size = sizeof(msg.payload.shm),
+                .payload.shm.id = i,
+            };
+            ret = dev->shm_ops->vhost_dev_shm_info(dev, i,
+                                                   &msg.payload.shm.size,
+                                                   &msg.payload.shm.offset,
+                                                   &memfd);
+            if (ret == -1) {
+                continue;
+            }
+            if (vhost_user_write(dev, &msg, &memfd, 1) < 0) {
+                return -1;
+            }
+        }
+    }
+
+    return 0;
+}
+
+static void send_each_fd(gpointer key, gpointer value, gpointer opaque)
+{
+    int fd_key = GPOINTER_TO_INT(key);
+    int fd = GPOINTER_TO_INT(value);
+    struct vhost_dev *dev = opaque;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_SET_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .hdr.size = sizeof(msg.payload.fdinfo),
+    };
+    msg.payload.fdinfo.key = fd_key;
+    vhost_user_write(dev, &msg, &fd, 1);
+}
+
+static int vhost_user_set_fd(struct vhost_dev *dev)
+{
+    int ret;
+    GHashTable *fd_ht = NULL;
+
+    if (!dev->fd_ops || !dev->fd_ops->vhost_dev_fd_info) {
+        return -1;
+    }
+
+    ret = dev->fd_ops->vhost_dev_fd_info(dev, &fd_ht);
+    if (ret) {
+        return 0;
+    }
+
+    if (fd_ht != NULL) {
+        g_hash_table_foreach(fd_ht, send_each_fd, dev);
+    }
+    return 0;
+}
+
 bool vhost_user_init(VhostUserState *user, CharBackend *chr, Error **errp)
 {
     if (user->chr) {
@@ -2387,4 +2508,6 @@ const VhostOps user_ops = {
         .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
         .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
+        .vhost_set_shm = vhost_user_set_shm,
+        .vhost_set_fd = vhost_user_set_fd,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 614ccc2bcb..9acda4d69f 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1581,6 +1581,18 @@ void vhost_dev_set_config_notifier(struct vhost_dev *hdev,
     hdev->config_ops = ops;
 }
 
+void vhost_dev_set_shm_ops(struct vhost_dev *hdev,
+                           const VhostDevShmOps *ops)
+{
+    hdev->shm_ops = ops;
+}
+
+void vhost_dev_set_fd_ops(struct vhost_dev *hdev,
+                          const VhostDevFdOps *ops)
+{
+    hdev->fd_ops = ops;
+}
+
 void vhost_dev_free_inflight(struct vhost_inflight *inflight)
 {
     if (inflight && inflight->addr) {
@@ -1590,6 +1602,36 @@ void vhost_dev_free_inflight(struct vhost_inflight *inflight)
     }
 }
 
+int vhost_dev_set_shm(struct vhost_dev *dev)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_set_shm) {
+        r = dev->vhost_ops->vhost_set_shm(dev);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_dev_set_shm failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
+int vhost_dev_set_fd(struct vhost_dev *dev)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_set_fd) {
+        r = dev->vhost_ops->vhost_set_fd(dev);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_dev_set_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
 static int vhost_dev_resize_inflight(struct vhost_inflight *inflight,
                                      uint64_t new_size)
 {
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 8a6f8e2a7a..af55b62133 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -125,6 +125,10 @@ typedef int (*vhost_get_device_id_op)(struct vhost_dev *dev, uint32_t *dev_id);
 
 typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
 
+typedef int (*vhost_set_shm_op)(struct vhost_dev *dev);
+
+typedef int (*vhost_set_fd_op)(struct vhost_dev *dev);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -170,6 +174,8 @@ typedef struct VhostOps {
     vhost_vq_get_addr_op  vhost_vq_get_addr;
     vhost_get_device_id_op vhost_get_device_id;
     vhost_force_iommu_op vhost_force_iommu;
+    vhost_set_shm_op vhost_set_shm;
+    vhost_set_fd_op vhost_set_fd;
 } VhostOps;
 
 extern const VhostOps user_ops;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 4a8bc75415..c1e6f32d13 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -60,6 +60,42 @@ typedef struct VhostDevConfigOps {
     int (*vhost_dev_config_notifier)(struct vhost_dev *dev);
 } VhostDevConfigOps;
 
+#ifndef VU_PERSIST_STRUCTS
+#define VU_PERSIST_STRUCTS
+
+typedef struct VhostUserShm {
+    int id;
+    uint64_t size;
+    uint64_t offset;
+} VhostUserShm;
+
+typedef enum VhostUserFdFlag {
+    VU_FD_FLAG_ADD = 0,
+    VU_FD_FLAG_DEL = 1,
+    VU_FD_FLAG_RESTORE = 2,
+    VU_FD_FLAG_MAX
+} VhostUserFdFlag;
+
+typedef struct VhostUserFd {
+    int key;
+    VhostUserFdFlag flag;
+} VhostUserFd;
+
+#endif
+
+typedef struct VhostDevShmOps {
+    int (*vhost_dev_slave_shm)(struct vhost_dev *dev,
+                               struct VhostUserShm *shm, int fd);
+    int (*vhost_dev_shm_info)(struct vhost_dev *dev, int shm_type,
+                              uint64_t *size, uint64_t *offset, int *memfd);
+} VhostDevShmOps;
+
+typedef struct VhostDevFdOps {
+    int (*vhost_dev_slave_fd)(struct vhost_dev *dev,
+                              struct VhostUserFd *fdinfo, int fd);
+    int (*vhost_dev_fd_info)(struct vhost_dev *dev, GHashTable **fd_ht_p);
+} VhostDevFdOps;
+
 struct vhost_memory;
 struct vhost_dev {
     VirtIODevice *vdev;
@@ -91,6 +127,8 @@ struct vhost_dev {
     QLIST_HEAD(, vhost_iommu) iommu_list;
     IOMMUNotifier n;
     const VhostDevConfigOps *config_ops;
+    const VhostDevShmOps *shm_ops;
+    const VhostDevFdOps *fd_ops;
 };
 
 struct vhost_net {
@@ -136,6 +174,8 @@ int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
  */
 void vhost_dev_set_config_notifier(struct vhost_dev *dev,
                                    const VhostDevConfigOps *ops);
+void vhost_dev_set_shm_ops(struct vhost_dev *dev, const VhostDevShmOps *ops);
+void vhost_dev_set_fd_ops(struct vhost_dev *dev, const VhostDevFdOps *ops);
 
 void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
 void vhost_dev_free_inflight(struct vhost_inflight *inflight);
@@ -146,4 +186,6 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
                            struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
                            struct vhost_inflight *inflight);
+int vhost_dev_set_shm(struct vhost_dev *dev);
+int vhost_dev_set_fd(struct vhost_dev *dev);
 #endif
-- 
2.20.1




* [RFC PATCH 3/9] vhost-user-fs: Support virtiofsd crash reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 1/9] vhost-user-fs: Add support for reconnection of vhost-user-fs backend Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 2/9] vhost: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 4/9] libvhost-user: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

This commit adds device-side support in vhost-user-fs for virtio-fs crash
reconnection, mainly the vhost-user callbacks. The VhostUserFSPersist
structure is also added to save the virtiofsd status on the QEMU side:

    typedef struct {
        bool need_restore;

        /* for persistent of lo_maps */
        VhostUserShm maps[MAP_TYPE_NUM];
        int map_fds[MAP_TYPE_NUM];

        /* for persistent of fds */
        GHashTable *fd_ht;
    } VhostUserFSPersist;

The fd_ht GHashTable saves the opened file descriptors sent from
virtiofsd, and the maps and map_fds fields save the shared-memory-related
status sent from virtiofsd.
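
On reconnect, QEMU walks fd_ht and replays every saved fd back to the
daemon via VHOST_USER_SET_FD (see vhost_user_set_fd() and send_each_fd()
in patch 2). Both keys and fds are plain ints packed with GLib's pointer
macros; roughly (error paths omitted):

    GHashTable *fd_ht = g_hash_table_new(NULL, NULL);

    /* VHOST_USER_SLAVE_FD + VU_FD_FLAG_ADD: keep a dup of the fd */
    g_hash_table_insert(fd_ht, GINT_TO_POINTER(key),
                        GINT_TO_POINTER(dup(fd)));

    /* VHOST_USER_SLAVE_FD + VU_FD_FLAG_DEL: close and drop the fd */
    close(GPOINTER_TO_INT(g_hash_table_lookup(fd_ht,
                                              GINT_TO_POINTER(key))));
    g_hash_table_remove(fd_ht, GINT_TO_POINTER(key));

    /* on reconnect: one VHOST_USER_SET_FD message per surviving entry */
    g_hash_table_foreach(fd_ht, send_each_fd, dev);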

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 hw/virtio/vhost-user-fs.c         | 118 +++++++++++++++++++++++++++++-
 include/hw/virtio/vhost-user-fs.h |  14 +++-
 2 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/vhost-user-fs.c b/hw/virtio/vhost-user-fs.c
index ce343101d4..a4f58821b3 100644
--- a/hw/virtio/vhost-user-fs.c
+++ b/hw/virtio/vhost-user-fs.c
@@ -72,6 +72,18 @@ static int vuf_start(VirtIODevice *vdev)
         }
     }
 
+    ret = vhost_dev_set_shm(&fs->vhost_dev);
+    if (ret < 0) {
+        error_report("Error set fs maps: %d", -ret);
+        goto err_guest_notifiers;
+    }
+
+    ret = vhost_dev_set_fd(&fs->vhost_dev);
+    if (ret < 0) {
+        error_report("Error set fs proc fds: %d", -ret);
+        goto err_guest_notifiers;
+    }
+
     ret = vhost_dev_set_inflight(&fs->vhost_dev, fs->inflight);
     if (ret < 0) {
         error_report("Error set inflight: %d", -ret);
@@ -220,6 +232,102 @@ static void vuf_reset(VirtIODevice *vdev)
     vhost_dev_free_inflight(fs->inflight);
 }
 
+static int vhost_user_fs_persist_map(struct vhost_dev *dev,
+                                     struct VhostUserShm *shm, int fd)
+{
+    VHostUserFS *fs = container_of(dev, VHostUserFS, vhost_dev);
+    VhostUserFSPersist *persist = &fs->persist;
+
+    if (persist->map_fds[shm->id] != -1) {
+        close(persist->map_fds[shm->id]);
+    }
+
+    persist->need_restore = true;
+    memcpy(&persist->maps[shm->id], shm, sizeof(VhostUserShm));
+    persist->map_fds[shm->id] = dup(fd);
+
+    return 0;
+}
+
+static int vhost_user_fs_map_info(struct vhost_dev *dev, int id,
+                                  uint64_t *size, uint64_t *offset,
+                                  int *memfd)
+{
+    if (!dev) {
+        return -1;
+    }
+
+    if (id >= MAP_TYPE_NUM) {
+        return -1;
+    }
+
+    VHostUserFS *fs = container_of(dev, VHostUserFS, vhost_dev);
+    VhostUserFSPersist *persist = &fs->persist;
+    if (!persist->need_restore || (persist->map_fds[id] == -1)) {
+        return -1;
+    }
+
+    *size = persist->maps[id].size;
+    *offset = persist->maps[id].offset;
+    *memfd = persist->map_fds[id];
+
+    return 0;
+}
+
+static int vhost_user_fs_persist_fd(struct vhost_dev *dev,
+                                    struct VhostUserFd *fdinfo, int fd)
+{
+    VHostUserFS *fs = container_of(dev, VHostUserFS, vhost_dev);
+    VhostUserFSPersist *persist = &fs->persist;
+
+    persist->need_restore = true;
+
+    if (fdinfo->flag == VU_FD_FLAG_ADD) {
+        assert(persist->fd_ht != NULL);
+        int newfd = dup(fd);
+        g_hash_table_insert(persist->fd_ht, GINT_TO_POINTER(fdinfo->key),
+                                                    GINT_TO_POINTER(newfd));
+    } else if (fdinfo->flag == VU_FD_FLAG_DEL) {
+        gpointer fd_p = g_hash_table_lookup(persist->fd_ht,
+                                            GINT_TO_POINTER(fdinfo->key));
+        if (fd_p != NULL) {
+            int fd = GPOINTER_TO_INT(fd_p);
+            close(fd);
+            g_hash_table_remove(persist->fd_ht,
+                                        GINT_TO_POINTER(fdinfo->key));
+        }
+    }
+
+    return 0;
+}
+
+static int vhost_user_fs_fd_info(struct vhost_dev *dev, GHashTable **fd_ht_p)
+{
+    if (!dev) {
+        return -1;
+    }
+
+    VHostUserFS *fs = container_of(dev, VHostUserFS, vhost_dev);
+    VhostUserFSPersist *persist = &fs->persist;
+    if (!persist->need_restore) {
+        return -1;
+    }
+
+    *fd_ht_p = persist->fd_ht;
+    return 0;
+}
+
+
+const VhostDevShmOps fs_shm_ops = {
+        .vhost_dev_slave_shm = vhost_user_fs_persist_map,
+        .vhost_dev_shm_info = vhost_user_fs_map_info,
+};
+
+const VhostDevFdOps fs_fd_ops = {
+        .vhost_dev_slave_fd = vhost_user_fs_persist_fd,
+        .vhost_dev_fd_info = vhost_user_fs_fd_info,
+};
+
 static int vuf_connect(DeviceState *dev)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -242,6 +350,9 @@ static int vuf_connect(DeviceState *dev)
         return ret;
     }
 
+    vhost_dev_set_shm_ops(&fs->vhost_dev, &fs_shm_ops);
+    vhost_dev_set_fd_ops(&fs->vhost_dev, &fs_fd_ops);
+
     /* restore vhost state */
     if (vdev->started) {
         ret = vuf_start(vdev);
@@ -380,7 +491,11 @@ static void vuf_device_realize(DeviceState *dev, Error **errp)
     /* init reconnection related variables */
     fs->inflight = g_new0(struct vhost_inflight, 1);
     fs->connected = false;
-
+    fs->persist.need_restore = false;
+    for (i = 0; i < MAP_TYPE_NUM; i++) {
+        fs->persist.map_fds[i] = -1;
+    }
+    fs->persist.fd_ht = g_hash_table_new(NULL, NULL);
     qemu_chr_fe_set_handlers(&fs->conf.chardev,  NULL, NULL, vuf_event,
                                  NULL, (void *)dev, NULL, true);
 
@@ -440,6 +555,7 @@ static void vuf_device_unrealize(DeviceState *dev)
     fs->vhost_dev.vqs = NULL;
     g_free(fs->inflight);
     fs->inflight = NULL;
+    g_hash_table_destroy(fs->persist.fd_ht);
 }
 
 static const VMStateDescription vuf_vmstate = {
diff --git a/include/hw/virtio/vhost-user-fs.h b/include/hw/virtio/vhost-user-fs.h
index 9ef47568e7..c2b77ffc53 100644
--- a/include/hw/virtio/vhost-user-fs.h
+++ b/include/hw/virtio/vhost-user-fs.h
@@ -30,6 +30,18 @@ typedef struct {
     uint16_t queue_size;
 } VHostUserFSConf;
 
+#define MAP_TYPE_NUM 3
+typedef struct {
+    bool need_restore;
+
+    /* for persistent of lo_maps */
+    VhostUserShm maps[MAP_TYPE_NUM];
+    int map_fds[MAP_TYPE_NUM];
+
+    /* for persistent of fds */
+    GHashTable *fd_ht;
+} VhostUserFSPersist;
+
 struct VHostUserFS {
     /*< private >*/
     VirtIODevice parent;
@@ -41,7 +53,7 @@ struct VHostUserFS {
     VirtQueue *hiprio_vq;
     struct vhost_inflight *inflight;
     bool connected;
-
+    VhostUserFSPersist persist;
     /*< public >*/
 };
 
-- 
2.20.1




* [RFC PATCH 4/9] libvhost-user: Add vhost-user message types for sending shared memory and file fds
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (2 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 3/9] vhost-user-fs: Support virtiofsd crash reconnection Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 5/9] virtiofsd: Convert the struct lo_map array to a more flattened layout Jiachen Zhang
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

Add libvhost-user support for the 4 new vhost-user messages types:

    VHOST_USER_SET_SHM
    VHOST_USER_SET_FD
    VHOST_USER_SLAVE_SHM
    VHOST_USER_SLAVE_FD

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 contrib/libvhost-user/libvhost-user.c | 88 +++++++++++++++++++++++++++
 contrib/libvhost-user/libvhost-user.h | 70 +++++++++++++++++++++
 2 files changed, 158 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index bfec8a881a..8c97013e59 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -140,6 +140,8 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_ADD_MEM_REG),
         REQ(VHOST_USER_REM_MEM_REG),
         REQ(VHOST_USER_MAX),
+        REQ(VHOST_USER_SET_SHM),
+        REQ(VHOST_USER_SET_FD),
     };
 #undef REQ
 
@@ -1718,6 +1720,77 @@ vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
     return false;
 }
 
+bool vu_slave_send_shm(VuDev *dev, int memfd, uint64_t size, int map_type)
+{
+    VhostUserMsg vmsg = {
+        .request = VHOST_USER_SLAVE_SHM,
+        .flags = VHOST_USER_VERSION,
+        .size = sizeof(VhostUserShm),
+        .payload.shm = {
+            .id = map_type,
+            .size = size,
+            .offset = 0,
+        },
+    };
+
+    vmsg.fd_num = 1;
+    vmsg.fds[0] = memfd;
+
+    if (!vu_has_protocol_feature(dev, VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD)) {
+        return false;
+    }
+
+    pthread_mutex_lock(&dev->slave_mutex);
+    if (!vu_message_write(dev, dev->slave_fd, &vmsg)) {
+        pthread_mutex_unlock(&dev->slave_mutex);
+        return false;
+    }
+
+    /* Also unlocks the slave_mutex */
+    return vu_process_message_reply(dev, &vmsg);
+}
+
+static bool vu_slave_send_fd(VuDev *dev, int fd, int fd_key, int flag)
+{
+    VhostUserMsg vmsg = {
+        .request = VHOST_USER_SLAVE_FD,
+        .flags = VHOST_USER_VERSION,
+        .size = sizeof(vmsg.payload.fdinfo),
+    };
+
+    vmsg.payload.fdinfo.key = fd_key;
+    vmsg.payload.fdinfo.flag = flag;
+    if (flag == VU_FD_FLAG_ADD) {
+        vmsg.fds[0] = fd;
+    }
+    vmsg.fd_num = 1;
+
+    if (!vu_has_protocol_feature(dev, VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD)) {
+        return false;
+    }
+
+    pthread_mutex_lock(&dev->slave_mutex);
+    if (!vu_message_write(dev, dev->slave_fd, &vmsg)) {
+        pthread_mutex_unlock(&dev->slave_mutex);
+        return false;
+    }
+
+    /* Also unlocks the slave_mutex */
+    bool ret =
+    vu_process_message_reply(dev, &vmsg);
+    return ret;
+}
+
+bool vu_slave_send_fd_add(VuDev *dev, int fd, int fd_key)
+{
+    return vu_slave_send_fd(dev, fd, fd_key, VU_FD_FLAG_ADD);
+}
+
+bool vu_slave_send_fd_del(VuDev *dev, int fd_key)
+{
+    return vu_slave_send_fd(dev, -1, fd_key, VU_FD_FLAG_DEL);
+}
+
 static bool
 vu_handle_vring_kick(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1762,6 +1835,9 @@ static bool vu_handle_get_max_memslots(VuDev *dev, VhostUserMsg *vmsg)
     return false;
 }
 
+bool (*vu_set_shm_cb)(VuDev *dev, VhostUserMsg *vmsg);
+bool (*vu_set_fd_cb)(VuDev *dev, VhostUserMsg *vmsg);
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1852,6 +1928,18 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_add_mem_reg(dev, vmsg);
     case VHOST_USER_REM_MEM_REG:
         return vu_rem_mem_reg(dev, vmsg);
+    case VHOST_USER_SET_SHM:
+        if (vu_set_shm_cb) {
+            return vu_set_shm_cb(dev, vmsg);
+        } else {
+            return false;
+        }
+    case VHOST_USER_SET_FD:
+        if (vu_set_fd_cb) {
+            return vu_set_fd_cb(dev, vmsg);
+        } else {
+            return false;
+        }
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index a1539dbb69..5448dc5818 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -64,6 +64,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
     VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS = 14,
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
+    VHOST_USER_PROTOCOL_F_MAP_SHMFD = 17,
 
     VHOST_USER_PROTOCOL_F_MAX
 };
@@ -109,6 +110,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_GET_MAX_MEM_SLOTS = 36,
     VHOST_USER_ADD_MEM_REG = 37,
     VHOST_USER_REM_MEM_REG = 38,
+    VHOST_USER_SET_SHM = 41,
+    VHOST_USER_SET_FD = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -119,6 +122,8 @@ typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG = 3,
     VHOST_USER_SLAVE_VRING_CALL = 4,
     VHOST_USER_SLAVE_VRING_ERR = 5,
+    VHOST_USER_SLAVE_SHM = 6,
+    VHOST_USER_SLAVE_FD = 7,
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -170,6 +175,29 @@ typedef struct VhostUserInflight {
     uint16_t queue_size;
 } VhostUserInflight;
 
+#ifndef VU_PERSIST_STRUCTS
+#define VU_PERSIST_STRUCTS
+
+typedef struct VhostUserShm {
+    int id;
+    uint64_t size;
+    uint64_t offset;
+} VhostUserShm;
+
+typedef enum VhostUserFdFlag {
+    VU_FD_FLAG_ADD = 0,
+    VU_FD_FLAG_DEL = 1,
+    VU_FD_FLAG_RESTORE = 2,
+    VU_FD_FLAG_MAX
+} VhostUserFdFlag;
+
+typedef struct VhostUserFd {
+    int key;
+    VhostUserFdFlag flag;
+} VhostUserFd;
+#endif
+
+
 #if defined(_WIN32) && (defined(__x86_64__) || defined(__i386__))
 # define VU_PACKED __attribute__((gcc_struct, packed))
 #else
@@ -197,6 +225,8 @@ typedef struct VhostUserMsg {
         VhostUserConfig config;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserShm shm;
+        VhostUserFd fdinfo;
     } payload;
 
     int fds[VHOST_MEMORY_BASELINE_NREGIONS];
@@ -687,4 +717,44 @@ void vu_queue_get_avail_bytes(VuDev *vdev, VuVirtq *vq, unsigned int *in_bytes,
 bool vu_queue_avail_bytes(VuDev *dev, VuVirtq *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 
+/**
+ * vu_slave_send_shm:
+ * @dev: a VuDev context
+ * @memfd: the shared memory fd to sync with QEMU
+ * @size: shared memory length
+ * @map_type: the lo_map type number
+ *
+ * Sync the map_type region that is shared with QEMU when the memfd or
+ * its size changes.
+ *
+ * Returns: true on success.
+ */
+bool vu_slave_send_shm(VuDev *dev, int memfd, uint64_t size, int map_type);
+
+/**
+ * vu_slave_send_fd_add:
+ * @dev: a VuDev context
+ * @fd: the fd to send to QEMU
+ * @fd_key: the fingerprint of the fd
+ *
+ * Send an opened file fd to QEMU.
+ *
+ * Returns: true on success.
+ */
+bool vu_slave_send_fd_add(VuDev *dev, int fd, int fd_key);
+
+/**
+ * vu_slave_send_fd_del:
+ * @dev: a VuDev context
+ * @fd_key: the fingerprint of the fd
+ *
+ * Remove a file fd from QEMU.
+ *
+ * Returns: true on success.
+ */
+bool vu_slave_send_fd_del(VuDev *dev, int fd_key);
+
+extern bool (*vu_set_shm_cb)(VuDev *dev, VhostUserMsg *vmsg);
+extern bool (*vu_set_fd_cb)(VuDev *dev, VhostUserMsg *vmsg);
+
 #endif /* LIBVHOST_USER_H */
-- 
2.20.1




* [RFC PATCH 5/9] virtiofsd: Convert the struct lo_map array to a more flattened layout
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (3 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 4/9] libvhost-user: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection Jiachen Zhang
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

In the original implementation, lo_inodes and lo_dirps are dynamically
allocated by calloc() and attached to the pointers in their corresponding
lo_map_elem entries.

However, as we need to share the memory regions of the lo_maps with QEMU,
dynamic allocation would cause virtiofsd to share a large number of
memory regions with QEMU, and whenever a memory region was allocated or
deallocated, virtiofsd would need to send a slave message to QEMU. That
would lead to complex code logic and poor performance.

Therefore, in this commit, we embed the inode and dirp structures into
lo_map_elem directly. This touches many lines in passthrough_ll.c, but
the changes fall into several distinct categories:
    1. Use memfd_create() and mmap() to initialize the lo_maps, and
    replace realloc() with ftruncate() + mmap() to grow a lo_map (see
    the standalone sketch after this list).
    2. Change the root field of struct lo_data to a pointer into the
    lo_inode map.
    3. Change the input arguments of lo_inode_put and lo_dirp_put to use
    struct lo_inode* and struct lo_dirp* instead of struct lo_inode** and
    struct lo_dirp**.
    4. Use refcount alone and drop the in_use flag in lo_map_elem.
    Previously, a lo_map slot could be marked not in_use (its dirp/inode
    pointer set to NULL) while the already-unhooked lo_inode / lo_dirp
    object had not yet been freed. Now that lo_inode / lo_dirp are
    embedded directly in lo_map_elem, releasing a slot and freeing the
    lo_inode / lo_dirp happen at the same time, so two counters are no
    longer needed to indicate the status of a lo_map slot. The refcount
    is also moved from struct lo_inode into struct lo_map_elem, so that
    every type of lo_map element gets a refcount in place of the
    deprecated in_use flag.
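
The memfd-backed growth pattern from point 1 can be exercised standalone;
a minimal self-contained sketch (not virtiofsd code; needs glibc >= 2.27
for memfd_create(), and unlike the real code it does not remap at a fixed
address):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t nelems = 4, new_nelems = 8;
        int memfd = memfd_create("demo_map", 0);

        if (memfd < 0 || ftruncate(memfd, nelems * sizeof(long)) < 0) {
            return 1;
        }
        long *elems = mmap(NULL, nelems * sizeof(long),
                           PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
        if (elems == MAP_FAILED) {
            return 1;
        }
        elems[0] = 1;

        /* grow: extend the file first, then map the larger size */
        if (ftruncate(memfd, new_nelems * sizeof(long)) < 0) {
            return 1;
        }
        munmap(elems, nelems * sizeof(long));
        elems = mmap(NULL, new_nelems * sizeof(long),
                     PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
        if (elems == MAP_FAILED) {
            return 1;
        }
        printf("elems[0] survived the remap: %ld\n", elems[0]);
        return 0;
    }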

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 tools/virtiofsd/passthrough_ll.c      | 468 +++++++++++++++-----------
 tools/virtiofsd/passthrough_seccomp.c |   1 +
 2 files changed, 278 insertions(+), 191 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 97485b22b4..815b001076 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -68,6 +68,7 @@
 #include "qemu/cutils.h"
 #include "passthrough_helpers.h"
 #include "passthrough_seccomp.h"
+#include "qemu/memfd.h"
 
 /* Keep track of inode posix locks for each owner. */
 struct lo_inode_plock {
@@ -75,23 +76,6 @@ struct lo_inode_plock {
     int fd; /* fd for OFD locks */
 };
 
-struct lo_map_elem {
-    union {
-        struct lo_inode *inode;
-        struct lo_dirp *dirp;
-        int fd;
-        ssize_t freelist;
-    };
-    bool in_use;
-};
-
-/* Maps FUSE fh or ino values to internal objects */
-struct lo_map {
-    struct lo_map_elem *elems;
-    size_t nelems;
-    ssize_t freelist;
-};
-
 struct lo_key {
     ino_t ino;
     dev_t dev;
@@ -101,12 +85,6 @@ struct lo_key {
 struct lo_inode {
     int fd;
 
-    /*
-     * Atomic reference count for this object.  The nlookup field holds a
-     * reference and release it when nlookup reaches 0.
-     */
-    gint refcount;
-
     struct lo_key key;
 
     /*
@@ -127,6 +105,38 @@ struct lo_inode {
     GHashTable *posix_locks; /* protected by lo_inode->plock_mutex */
 
     mode_t filetype;
+} __attribute__((packed));
+
+struct lo_dirp {
+    int fd;
+    DIR *dp;
+    struct dirent *entry;
+    off_t offset;
+} __attribute__((packed));
+
+struct lo_map_elem {
+    union {
+        struct lo_inode inode;
+        struct lo_dirp dirp;
+        int fd;
+    };
+    ssize_t freelist;
+    gint refcount;
+} __attribute__((packed));
+
+/* Maps FUSE fh or ino values to internal objects */
+struct lo_map {
+    size_t nelems;
+    ssize_t freelist;
+    int map_type;
+    int memfd;
+    struct lo_map_elem elems[];
+} __attribute__((packed));
+
+const char *MAP_NAME[] = {
+    "fd_map",
+    "ino_map",
+    "dirp_map"
 };
 
 struct lo_cred {
@@ -151,6 +161,11 @@ typedef struct xattr_map_entry {
     unsigned int flags;
 } XattrMapEntry;
 
+#define FD_MAP 0
+#define INO_MAP 1
+#define DIRP_MAP 2
+#define MAP_TYPE_NUM 3
+
 struct lo_data {
     pthread_mutex_t mutex;
     int sandbox;
@@ -170,11 +185,9 @@ struct lo_data {
     int allow_direct_io;
     int announce_submounts;
     bool use_statx;
-    struct lo_inode root;
+    struct lo_inode *root;
     GHashTable *inodes; /* protected by lo->mutex */
-    struct lo_map ino_map; /* protected by lo->mutex */
-    struct lo_map dirp_map; /* protected by lo->mutex */
-    struct lo_map fd_map; /* protected by lo->mutex */
+    struct lo_map *maps[MAP_TYPE_NUM]; /* protected by lo->mutex */
     XattrMapEntry *xattr_map_list;
     size_t xattr_map_nentries;
 
@@ -182,6 +195,8 @@ struct lo_data {
     int proc_self_fd;
 };
 
+static struct lo_data lo;
+
 static const struct fuse_opt lo_opts[] = {
     { "sandbox=namespace",
       offsetof(struct lo_data, sandbox),
@@ -245,7 +260,7 @@ static int is_safe_path_component(const char *path)
 
 static struct lo_data *lo_data(fuse_req_t req)
 {
-    return (struct lo_data *)fuse_req_userdata(req);
+    return &lo;
 }
 
 /*
@@ -365,65 +380,104 @@ out:
     return ret;
 }
 
-static void lo_map_init(struct lo_map *map)
+
+static void lo_map_init(int map_type)
 {
-    map->elems = NULL;
-    map->nelems = 0;
-    map->freelist = -1;
+    int memfd;
+    int ret;
+
+    memfd = memfd_create(MAP_NAME[map_type], 0);
+    ret = ftruncate(memfd, sizeof(struct lo_map) + sizeof(struct lo_map_elem));
+    if (-1 == ret) {
+        exit(1);
+    }
+
+    int page_size = getpagesize();
+    void *fake_addr = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
+                                                MAP_SHARED | MAP_ANON, -1, 0);
+    unsigned long tgt_addr = ((unsigned long)fake_addr / page_size + 1) *
+                                                                    page_size;
+    lo.maps[map_type] = mmap((void *)tgt_addr,
+                    sizeof(struct lo_map) + sizeof(struct lo_map_elem),
+                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, memfd, 0);
+
+    if (lo.maps[map_type] == MAP_FAILED) {
+        fuse_log(FUSE_LOG_ERR, "map %s init failed!\n", MAP_NAME[map_type]);
+        exit(1);
+    }
+    lo.maps[map_type]->nelems = 0;
+    lo.maps[map_type]->freelist = -1;
+    lo.maps[map_type]->map_type = map_type;
+    lo.maps[map_type]->memfd = memfd;
+
+    printf("memfd: %d\n", lo.maps[map_type]->memfd);
+    ret = system("ls -lsh /proc/self/fd");
+    fuse_log(FUSE_LOG_DEBUG, "map %s init succeed!\n",
+                             MAP_NAME[lo.maps[map_type]->map_type]);
 }
 
-static void lo_map_destroy(struct lo_map *map)
+static void lo_map_destroy(int map_type)
 {
-    free(map->elems);
+    munmap(lo.maps[map_type], sizeof(struct lo_map) +
+           lo.maps[map_type]->nelems * sizeof(struct lo_map_elem));
+
 }
 
-static int lo_map_grow(struct lo_map *map, size_t new_nelems)
+static int lo_map_grow(int map_type, size_t new_nelems)
 {
-    struct lo_map_elem *new_elems;
     size_t i;
+    size_t new_size;
+    int ret;
 
-    if (new_nelems <= map->nelems) {
+    if (new_nelems <= lo.maps[map_type]->nelems) {
         return 1;
     }
 
-    new_elems = realloc(map->elems, sizeof(map->elems[0]) * new_nelems);
-    if (!new_elems) {
+    new_size = sizeof(struct lo_map) + sizeof(struct lo_map_elem) * new_nelems;
+    ret = ftruncate(lo.maps[map_type]->memfd, new_size);
+    if (-1 == ret) {
         return 0;
     }
 
-    for (i = map->nelems; i < new_nelems; i++) {
-        new_elems[i].freelist = i + 1;
-        new_elems[i].in_use = false;
+    lo.maps[map_type] = mmap(lo.maps[map_type], new_size,
+                             PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
+                             lo.maps[map_type]->memfd, 0);
+    if (lo.maps[map_type] == MAP_FAILED) {
+        return 0;
     }
-    new_elems[new_nelems - 1].freelist = -1;
 
-    map->elems = new_elems;
-    map->freelist = map->nelems;
-    map->nelems = new_nelems;
+    for (i = lo.maps[map_type]->nelems; i < new_nelems; i++) {
+        lo.maps[map_type]->elems[i].freelist = i + 1;
+        g_atomic_int_set(&lo.maps[map_type]->elems[i].refcount, -1);
+    }
+    lo.maps[map_type]->elems[new_nelems - 1].freelist = -1;
+
+    lo.maps[map_type]->freelist = lo.maps[map_type]->nelems;
+    lo.maps[map_type]->nelems = new_nelems;
     return 1;
 }
 
-static struct lo_map_elem *lo_map_alloc_elem(struct lo_map *map)
+static struct lo_map_elem *lo_map_alloc_elem(int map_type)
 {
     struct lo_map_elem *elem;
+    struct lo_map *map = lo.maps[map_type];
 
-    if (map->freelist == -1 && !lo_map_grow(map, map->nelems + 256)) {
+    if (map->freelist == -1 && !lo_map_grow(map_type, map->nelems + 256)) {
         return NULL;
     }
 
-    elem = &map->elems[map->freelist];
+    elem = &lo.maps[map_type]->elems[map->freelist];
     map->freelist = elem->freelist;
 
-    elem->in_use = true;
-
     return elem;
 }
 
-static struct lo_map_elem *lo_map_reserve(struct lo_map *map, size_t key)
+static struct lo_map_elem *lo_map_reserve(int map_type, size_t key)
 {
     ssize_t *prev;
+    struct lo_map *map = lo.maps[map_type];
 
-    if (!lo_map_grow(map, key + 1)) {
+    if (!lo_map_grow(map_type, key + 1)) {
         return NULL;
     }
 
@@ -433,7 +487,6 @@ static struct lo_map_elem *lo_map_reserve(struct lo_map *map, size_t key)
             struct lo_map_elem *elem = &map->elems[key];
 
             *prev = elem->freelist;
-            elem->in_use = true;
             return elem;
         }
     }
@@ -445,7 +498,7 @@ static struct lo_map_elem *lo_map_get(struct lo_map *map, size_t key)
     if (key >= map->nelems) {
         return NULL;
     }
-    if (!map->elems[key].in_use) {
+    if (g_atomic_int_get(&map->elems[key].refcount) == -1) {
         return NULL;
     }
     return &map->elems[key];
@@ -454,17 +507,29 @@ static struct lo_map_elem *lo_map_get(struct lo_map *map, size_t key)
 static void lo_map_remove(struct lo_map *map, size_t key)
 {
     struct lo_map_elem *elem;
+    int to_del_fd;
 
     if (key >= map->nelems) {
         return;
     }
 
     elem = &map->elems[key];
-    if (!elem->in_use) {
+    if (map->map_type == FD_MAP) {
+        to_del_fd = elem->fd;
+        close(to_del_fd);
+    } else if (map->map_type == INO_MAP) {
+        to_del_fd = elem->inode.fd;
+        close(to_del_fd);
+    } else if (map->map_type == DIRP_MAP) {
+        to_del_fd = elem->dirp.fd;
+        closedir(elem->dirp.dp);
+    }
+
+    if (g_atomic_int_get(&elem->refcount) == -1) {
         return;
     }
 
-    elem->in_use = false;
+    g_atomic_int_set(&elem->refcount, -1);
 
     elem->freelist = map->freelist;
     map->freelist = key;
@@ -475,59 +540,73 @@ static ssize_t lo_add_fd_mapping(fuse_req_t req, int fd)
 {
     struct lo_map_elem *elem;
 
-    elem = lo_map_alloc_elem(&lo_data(req)->fd_map);
+    elem = lo_map_alloc_elem(FD_MAP);
     if (!elem) {
         return -1;
     }
 
+    g_atomic_int_set(&elem->refcount, 1);
+
     elem->fd = fd;
-    return elem - lo_data(req)->fd_map.elems;
+    return ((unsigned long)elem -
+            (unsigned long)lo_data(req)->maps[FD_MAP]->elems) /
+            sizeof(struct lo_map_elem);
 }
 
 /* Assumes lo->mutex is held */
-static ssize_t lo_add_dirp_mapping(fuse_req_t req, struct lo_dirp *dirp)
+static struct lo_map_elem *lo_add_dirp_mapping(fuse_req_t req,
+                                               struct lo_dirp *dirp)
 {
     struct lo_map_elem *elem;
 
-    elem = lo_map_alloc_elem(&lo_data(req)->dirp_map);
+    elem = lo_map_alloc_elem(DIRP_MAP);
     if (!elem) {
-        return -1;
+        return NULL;
     }
 
-    elem->dirp = dirp;
-    return elem - lo_data(req)->dirp_map.elems;
+    memcpy(&elem->dirp, dirp, sizeof(struct lo_dirp));
+    return elem;
 }
 
+static void posix_locks_value_destroy(gpointer data);
+
 /* Assumes lo->mutex is held */
-static ssize_t lo_add_inode_mapping(fuse_req_t req, struct lo_inode *inode)
+static struct lo_map_elem *lo_add_inode_mapping(fuse_req_t req,
+                                                struct lo_inode *inode)
 {
     struct lo_map_elem *elem;
 
-    elem = lo_map_alloc_elem(&lo_data(req)->ino_map);
+    elem = lo_map_alloc_elem(INO_MAP);
     if (!elem) {
-        return -1;
+        return NULL;
     }
 
-    elem->inode = inode;
-    return elem - lo_data(req)->ino_map.elems;
+    memcpy(&elem->inode, inode, sizeof(struct lo_inode));
+    pthread_mutex_init(&elem->inode.plock_mutex, NULL);
+    elem->inode.posix_locks = g_hash_table_new_full(
+        g_direct_hash, g_direct_equal, NULL, posix_locks_value_destroy);
+    return elem;
 }
 
-static void lo_inode_put(struct lo_data *lo, struct lo_inode **inodep)
+static void lo_inode_put_nolock(struct lo_data *lo, struct lo_inode *inode)
 {
-    struct lo_inode *inode = *inodep;
-
     if (!inode) {
         return;
     }
 
-    *inodep = NULL;
-
-    if (g_atomic_int_dec_and_test(&inode->refcount)) {
-        close(inode->fd);
-        free(inode);
+    struct lo_map_elem *elem = container_of(inode, struct lo_map_elem, inode);
+    if (g_atomic_int_dec_and_test(&elem->refcount)) {
+        lo_map_remove(lo->maps[INO_MAP], inode->fuse_ino);
     }
 }
 
+static void lo_inode_put(struct lo_data *lo, struct lo_inode *inode)
+{
+    pthread_mutex_lock(&lo->mutex);
+    lo_inode_put_nolock(lo, inode);
+    pthread_mutex_unlock(&lo->mutex);
+}
+
 /* Caller must release refcount using lo_inode_put() */
 static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino)
 {
@@ -535,9 +614,9 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino)
     struct lo_map_elem *elem;
 
     pthread_mutex_lock(&lo->mutex);
-    elem = lo_map_get(&lo->ino_map, ino);
+    elem = lo_map_get(lo->maps[INO_MAP], ino);
     if (elem) {
-        g_atomic_int_inc(&elem->inode->refcount);
+        g_atomic_int_inc(&elem->refcount);
     }
     pthread_mutex_unlock(&lo->mutex);
 
@@ -545,7 +624,7 @@ static struct lo_inode *lo_inode(fuse_req_t req, fuse_ino_t ino)
         return NULL;
     }
 
-    return elem->inode;
+    return &elem->inode;
 }
 
 /*
@@ -563,7 +642,7 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino)
     }
 
     fd = inode->fd;
-    lo_inode_put(lo_data(req), &inode);
+    lo_inode_put(lo_data(req), inode);
     return fd;
 }
 
@@ -636,7 +715,7 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info *fi)
     struct lo_map_elem *elem;
 
     pthread_mutex_lock(&lo->mutex);
-    elem = lo_map_get(&lo->fd_map, fi->fh);
+    elem = lo_map_get(lo->maps[FD_MAP], fi->fh);
     pthread_mutex_unlock(&lo->mutex);
 
     if (!elem) {
@@ -743,13 +822,13 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
             goto out_err;
         }
     }
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
 
     return lo_getattr(req, ino, fi);
 
 out_err:
     saverr = errno;
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
     fuse_reply_err(req, saverr);
 }
 
@@ -768,7 +847,8 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
     if (p) {
         assert(p->nlookup > 0);
         p->nlookup++;
-        g_atomic_int_inc(&p->refcount);
+        struct lo_map_elem *elem = container_of(p, struct lo_map_elem, inode);
+        g_atomic_int_inc(&elem->refcount);
     }
     pthread_mutex_unlock(&lo->mutex);
 
@@ -871,7 +951,7 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
     e->entry_timeout = lo->timeout;
 
     /* Do not allow escaping root directory */
-    if (dir == &lo->root && strcmp(name, "..") == 0) {
+    if (dir == lo->root && strcmp(name, "..") == 0) {
         name = ".";
     }
 
@@ -903,12 +983,6 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
         /* cache only filetype */
         inode->filetype = (e->attr.st_mode & S_IFMT);
 
-        /*
-         * One for the caller and one for nlookup (released in
-         * unref_inode_lolocked())
-         */
-        g_atomic_int_set(&inode->refcount, 2);
-
         inode->nlookup = 1;
         inode->fd = newfd;
         inode->key.ino = e->attr.st_ino;
@@ -918,14 +992,28 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
         inode->posix_locks = g_hash_table_new_full(
             g_direct_hash, g_direct_equal, NULL, posix_locks_value_destroy);
 
+        struct lo_map_elem *elem;
         pthread_mutex_lock(&lo->mutex);
-        inode->fuse_ino = lo_add_inode_mapping(req, inode);
+        elem = lo_add_inode_mapping(req, inode);
+
+        /*
+         * One for the caller and one for nlookup (released in
+         * unref_inode_lolocked())
+         */
+        g_atomic_int_set(&elem->refcount, 2);
+
+        free(inode);
+        inode = &elem->inode;
+        inode->fuse_ino = ((unsigned long)elem -
+                        (unsigned long)lo_data(req)->maps[INO_MAP]->elems) /
+                                                sizeof(struct lo_map_elem);
+
         g_hash_table_insert(lo->inodes, &inode->key, inode);
         pthread_mutex_unlock(&lo->mutex);
     }
     e->ino = inode->fuse_ino;
-    lo_inode_put(lo, &inode);
-    lo_inode_put(lo, &dir);
+    lo_inode_put(lo, inode);
+    lo_inode_put(lo, dir);
 
     fuse_log(FUSE_LOG_DEBUG, "  %lli/%s -> %lli\n", (unsigned long long)parent,
              name, (unsigned long long)e->ino);
@@ -937,8 +1025,8 @@ out_err:
     if (newfd != -1) {
         close(newfd);
     }
-    lo_inode_put(lo, &inode);
-    lo_inode_put(lo, &dir);
+    lo_inode_put(lo, inode);
+    lo_inode_put(lo, dir);
     return saverr;
 }
 
@@ -1076,11 +1164,11 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent,
              name, (unsigned long long)e.ino);
 
     fuse_reply_entry(req, &e);
-    lo_inode_put(lo, &dir);
+    lo_inode_put(lo, dir);
     return;
 
 out:
-    lo_inode_put(lo, &dir);
+    lo_inode_put(lo, dir);
     fuse_reply_err(req, saverr);
 }
 
@@ -1132,6 +1220,7 @@ static void lo_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t parent,
     sprintf(procname, "%i", inode->fd);
     res = linkat(lo->proc_self_fd, procname, parent_inode->fd, name,
                  AT_SYMLINK_FOLLOW);
+
     if (res == -1) {
         goto out_err;
     }
@@ -1150,14 +1239,14 @@ static void lo_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t parent,
              name, (unsigned long long)e.ino);
 
     fuse_reply_entry(req, &e);
-    lo_inode_put(lo, &parent_inode);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, parent_inode);
+    lo_inode_put(lo, inode);
     return;
 
 out_err:
     saverr = errno;
-    lo_inode_put(lo, &parent_inode);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, parent_inode);
+    lo_inode_put(lo, inode);
     fuse_reply_err(req, saverr);
 }
 
@@ -1177,7 +1266,7 @@ static struct lo_inode *lookup_name(fuse_req_t req, fuse_ino_t parent,
 
     res = do_statx(lo, dir->fd, name, &attr,
                    AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, &mnt_id);
-    lo_inode_put(lo, &dir);
+    lo_inode_put(lo, dir);
     if (res == -1) {
         return NULL;
     }
@@ -1206,7 +1295,7 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name)
 
     fuse_reply_err(req, res == -1 ? errno : 0);
     unref_inode_lolocked(lo, inode, 1);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
 }
 
 static void lo_rename(fuse_req_t req, fuse_ino_t parent, const char *name,
@@ -1261,10 +1350,10 @@ static void lo_rename(fuse_req_t req, fuse_ino_t parent, const char *name,
 out:
     unref_inode_lolocked(lo, oldinode, 1);
     unref_inode_lolocked(lo, newinode, 1);
-    lo_inode_put(lo, &oldinode);
-    lo_inode_put(lo, &newinode);
-    lo_inode_put(lo, &parent_inode);
-    lo_inode_put(lo, &newparent_inode);
+    lo_inode_put(lo, oldinode);
+    lo_inode_put(lo, newinode);
+    lo_inode_put(lo, parent_inode);
+    lo_inode_put(lo, newparent_inode);
 }
 
 static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
@@ -1288,7 +1377,7 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
 
     fuse_reply_err(req, res == -1 ? errno : 0);
     unref_inode_lolocked(lo, inode, 1);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
 }
 
 /* To be called with lo->mutex held */
@@ -1301,7 +1390,6 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
     assert(inode->nlookup >= n);
     inode->nlookup -= n;
     if (!inode->nlookup) {
-        lo_map_remove(&lo->ino_map, inode->fuse_ino);
         g_hash_table_remove(lo->inodes, &inode->key);
         if (g_hash_table_size(inode->posix_locks)) {
             fuse_log(FUSE_LOG_WARNING, "Hash table is not empty\n");
@@ -1310,7 +1398,7 @@ static void unref_inode(struct lo_data *lo, struct lo_inode *inode, uint64_t n)
         pthread_mutex_destroy(&inode->plock_mutex);
 
         /* Drop our refcount from lo_do_lookup() */
-        lo_inode_put(lo, &inode);
+        lo_inode_put_nolock(lo, inode);
     }
 }
 
@@ -1340,8 +1428,10 @@ static void lo_forget_one(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
              (unsigned long long)ino, (unsigned long long)inode->nlookup,
              (unsigned long long)nlookup);
 
-    unref_inode_lolocked(lo, inode, nlookup);
-    lo_inode_put(lo, &inode);
+    pthread_mutex_lock(&lo->mutex);
+    unref_inode(lo, inode, nlookup);
+    pthread_mutex_unlock(&lo->mutex);
+    lo_inode_put(lo, inode);
 }
 
 static void lo_forget(fuse_req_t req, fuse_ino_t ino, uint64_t nlookup)
@@ -1380,25 +1470,17 @@ static void lo_readlink(fuse_req_t req, fuse_ino_t ino)
     fuse_reply_readlink(req, buf);
 }
 
-struct lo_dirp {
-    gint refcount;
-    DIR *dp;
-    struct dirent *entry;
-    off_t offset;
-};
-
-static void lo_dirp_put(struct lo_dirp **dp)
+static void lo_dirp_put(struct lo_dirp *d, ssize_t fh)
 {
-    struct lo_dirp *d = *dp;
-
     if (!d) {
         return;
     }
-    *dp = NULL;
 
-    if (g_atomic_int_dec_and_test(&d->refcount)) {
-        closedir(d->dp);
-        free(d);
+    struct lo_map_elem *elem = container_of(d, struct lo_map_elem, dirp);
+    if (g_atomic_int_dec_and_test(&elem->refcount)) {
+        pthread_mutex_lock(&lo.mutex);
+        lo_map_remove(lo.maps[DIRP_MAP], fh);
+        pthread_mutex_unlock(&lo.mutex);
     }
 }
 
@@ -1409,16 +1491,16 @@ static struct lo_dirp *lo_dirp(fuse_req_t req, struct fuse_file_info *fi)
     struct lo_map_elem *elem;
 
     pthread_mutex_lock(&lo->mutex);
-    elem = lo_map_get(&lo->dirp_map, fi->fh);
+    elem = lo_map_get(lo->maps[DIRP_MAP], fi->fh);
     if (elem) {
-        g_atomic_int_inc(&elem->dirp->refcount);
+        g_atomic_int_inc(&elem->refcount);
     }
     pthread_mutex_unlock(&lo->mutex);
     if (!elem) {
         return NULL;
     }
 
-    return elem->dirp;
+    return &elem->dirp;
 }
 
 static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
@@ -1435,12 +1517,15 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
         goto out_err;
     }
 
-    fd = openat(lo_fd(req, ino), ".", O_RDONLY);
+    int to_open_fd = lo_fd(req, ino);
+    fd = openat(to_open_fd, ".", O_RDONLY);
+
     if (fd == -1) {
         goto out_errno;
     }
 
     d->dp = fdopendir(fd);
+    d->fd = dirfd(d->dp);
     if (d->dp == NULL) {
         goto out_errno;
     }
@@ -1448,18 +1533,24 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
     d->offset = 0;
     d->entry = NULL;
 
-    g_atomic_int_set(&d->refcount, 1); /* paired with lo_releasedir() */
+    struct lo_map_elem *elem;
     pthread_mutex_lock(&lo->mutex);
-    fh = lo_add_dirp_mapping(req, d);
-    pthread_mutex_unlock(&lo->mutex);
-    if (fh == -1) {
+    elem = lo_add_dirp_mapping(req, d);
+    if (!elem) {
         goto out_err;
     }
 
+    free(d);
+    d = &elem->dirp;
+    fh = ((unsigned long)elem -
+          (unsigned long)&lo_data(req)->maps[DIRP_MAP]->elems[0]) /
+          sizeof(struct lo_map_elem);
     fi->fh = fh;
     if (lo->cache == CACHE_ALWAYS) {
         fi->cache_readdir = 1;
     }
+    g_atomic_int_set(&elem->refcount, 1); /* paired with lo_releasedir() */
+    pthread_mutex_unlock(&lo->mutex);
     fuse_reply_open(req, fi);
     return;
 
@@ -1472,8 +1563,8 @@ out_err:
         } else if (fd != -1) {
             close(fd);
         }
-        free(d);
     }
+    pthread_mutex_unlock(&lo->mutex);
     fuse_reply_err(req, error);
 }
 
@@ -1537,8 +1628,8 @@ static void lo_do_readdir(fuse_req_t req, fuse_ino_t ino, size_t size,
         };
 
         /* Hide root's parent directory */
-        if (dinode == &lo->root && strcmp(name, "..") == 0) {
-            e.attr.st_ino = lo->root.key.ino;
+        if (dinode == lo->root && strcmp(name, "..") == 0) {
+            e.attr.st_ino = lo->root->key.ino;
             e.attr.st_mode = DT_DIR << 12;
         }
 
@@ -1571,8 +1662,8 @@ static void lo_do_readdir(fuse_req_t req, fuse_ino_t ino, size_t size,
 
     err = 0;
 error:
-    lo_dirp_put(&d);
-    lo_inode_put(lo, &dinode);
+    lo_dirp_put(d, fi->fh);
+    lo_inode_put(lo, dinode);
 
     /*
      * If there's an error, we can only signal it if we haven't stored
@@ -1610,18 +1701,17 @@ static void lo_releasedir(fuse_req_t req, fuse_ino_t ino,
     (void)ino;
 
     pthread_mutex_lock(&lo->mutex);
-    elem = lo_map_get(&lo->dirp_map, fi->fh);
+    elem = lo_map_get(lo->maps[DIRP_MAP], fi->fh);
     if (!elem) {
         pthread_mutex_unlock(&lo->mutex);
         fuse_reply_err(req, EBADF);
         return;
     }
 
-    d = elem->dirp;
-    lo_map_remove(&lo->dirp_map, fi->fh);
+    d = &elem->dirp;
     pthread_mutex_unlock(&lo->mutex);
 
-    lo_dirp_put(&d); /* paired with lo_opendir() */
+    lo_dirp_put(d, fi->fh); /* paired with lo_opendir() */
 
     fuse_reply_err(req, 0);
 }
@@ -1719,7 +1809,7 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
     }
 
 out:
-    lo_inode_put(lo, &parent_inode);
+    lo_inode_put(lo, parent_inode);
 
     if (err) {
         fuse_reply_err(req, err);
@@ -1805,7 +1895,7 @@ static void lo_getlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
 
 out:
     pthread_mutex_unlock(&inode->plock_mutex);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
 
     if (saverr) {
         fuse_reply_err(req, saverr);
@@ -1858,7 +1948,7 @@ static void lo_setlk(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi,
 
 out:
     pthread_mutex_unlock(&inode->plock_mutex);
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
 
     fuse_reply_err(req, saverr);
 }
@@ -1885,7 +1975,7 @@ static void lo_fsyncdir(fuse_req_t req, fuse_ino_t ino, int datasync,
         res = fsync(fd);
     }
 
-    lo_dirp_put(&d);
+    lo_dirp_put(d, fi->fh);
 
     fuse_reply_err(req, res == -1 ? errno : 0);
 }
@@ -1931,20 +2021,16 @@ static void lo_release(fuse_req_t req, fuse_ino_t ino,
 {
     struct lo_data *lo = lo_data(req);
     struct lo_map_elem *elem;
-    int fd = -1;
 
     (void)ino;
 
     pthread_mutex_lock(&lo->mutex);
-    elem = lo_map_get(&lo->fd_map, fi->fh);
+    elem = lo_map_get(lo->maps[FD_MAP], fi->fh);
     if (elem) {
-        fd = elem->fd;
-        elem = NULL;
-        lo_map_remove(&lo->fd_map, fi->fh);
+        lo_map_remove(lo->maps[FD_MAP], fi->fh);
     }
     pthread_mutex_unlock(&lo->mutex);
 
-    close(fd);
     fuse_reply_err(req, 0);
 }
 
@@ -1966,7 +2052,7 @@ static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
     pthread_mutex_unlock(&inode->plock_mutex);
 
     res = close(dup(lo_fi_fd(req, fi)));
-    lo_inode_put(lo_data(req), &inode);
+    lo_inode_put(lo_data(req), inode);
     fuse_reply_err(req, res == -1 ? errno : 0);
 }
 
@@ -2529,7 +2615,7 @@ static void lo_getxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
         /* fchdir should not fail here */
         assert(fchdir(lo->proc_self_fd) == 0);
         ret = getxattr(procname, name, value, size);
-        assert(fchdir(lo->root.fd) == 0);
+        assert(fchdir(lo->root->fd) == 0);
     }
 
     if (ret == -1) {
@@ -2551,7 +2637,7 @@ out_free:
         close(fd);
     }
 
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
     return;
 
 out_err:
@@ -2604,7 +2690,7 @@ static void lo_listxattr(fuse_req_t req, fuse_ino_t ino, size_t size)
         /* fchdir should not fail here */
         assert(fchdir(lo->proc_self_fd) == 0);
         ret = listxattr(procname, value, size);
-        assert(fchdir(lo->root.fd) == 0);
+        assert(fchdir(lo->root->fd) == 0);
     }
 
     if (ret == -1) {
@@ -2678,7 +2764,7 @@ out_free:
         close(fd);
     }
 
-    lo_inode_put(lo, &inode);
+    lo_inode_put(lo, inode);
     return;
 
 out_err:
@@ -2740,7 +2826,7 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, const char *in_name,
         /* fchdir should not fail here */
         assert(fchdir(lo->proc_self_fd) == 0);
         ret = setxattr(procname, name, value, size, flags);
-        assert(fchdir(lo->root.fd) == 0);
+        assert(fchdir(lo->root->fd) == 0);
     }
 
     saverr = ret == -1 ? errno : 0;
@@ -2750,8 +2836,8 @@ out:
         close(fd);
     }
 
-    lo_inode_put(lo, &inode);
     g_free(mapped_name);
+    lo_inode_put(lo, inode);
     fuse_reply_err(req, saverr);
 }
 
@@ -2806,7 +2892,7 @@ static void lo_removexattr(fuse_req_t req, fuse_ino_t ino, const char *in_name)
         /* fchdir should not fail here */
         assert(fchdir(lo->proc_self_fd) == 0);
         ret = removexattr(procname, name);
-        assert(fchdir(lo->root.fd) == 0);
+        assert(fchdir(lo->root->fd) == 0);
     }
 
     saverr = ret == -1 ? errno : 0;
@@ -2816,8 +2902,8 @@ out:
         close(fd);
     }
 
-    lo_inode_put(lo, &inode);
     g_free(mapped_name);
+    lo_inode_put(lo, inode);
     fuse_reply_err(req, saverr);
 }
 
@@ -3371,7 +3457,8 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
     root->key.dev = stat.st_dev;
     root->key.mnt_id = mnt_id;
     root->nlookup = 2;
-    g_atomic_int_set(&root->refcount, 2);
+    struct lo_map_elem *elem = container_of(root, struct lo_map_elem, inode);
+    g_atomic_int_set(&elem->refcount, 2);
 }
 
 static guint lo_key_hash(gconstpointer key)
@@ -3394,16 +3481,17 @@ static void fuse_lo_data_cleanup(struct lo_data *lo)
     if (lo->inodes) {
         g_hash_table_destroy(lo->inodes);
     }
-    lo_map_destroy(&lo->fd_map);
-    lo_map_destroy(&lo->dirp_map);
-    lo_map_destroy(&lo->ino_map);
 
-    if (lo->proc_self_fd >= 0) {
-        close(lo->proc_self_fd);
+    if (lo->root->fd >= 0) {
+        close(lo->root->fd);
     }
 
-    if (lo->root.fd >= 0) {
-        close(lo->root.fd);
+    lo_map_destroy(FD_MAP);
+    lo_map_destroy(DIRP_MAP);
+    lo_map_destroy(INO_MAP);
+
+    if (lo->proc_self_fd >= 0) {
+        close(lo->proc_self_fd);
     }
 
     free(lo->xattrmap);
@@ -3416,18 +3504,17 @@ int main(int argc, char *argv[])
     struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
     struct fuse_session *se;
     struct fuse_cmdline_opts opts;
-    struct lo_data lo = {
-        .sandbox = SANDBOX_NAMESPACE,
-        .debug = 0,
-        .writeback = 0,
-        .posix_lock = 0,
-        .allow_direct_io = 0,
-        .proc_self_fd = -1,
-    };
     struct lo_map_elem *root_elem;
     struct lo_map_elem *reserve_elem;
     int ret = -1;
 
+    lo.sandbox = SANDBOX_NAMESPACE;
+    lo.debug = 0;
+    lo.writeback = 0;
+    lo.posix_lock = 1;
+    lo.allow_direct_io = 0;
+    lo.proc_self_fd = -1;
+
     /* Don't mask creation mode, kernel already did that */
     umask(0);
 
@@ -3435,8 +3522,6 @@ int main(int argc, char *argv[])
 
     pthread_mutex_init(&lo.mutex, NULL);
     lo.inodes = g_hash_table_new(lo_key_hash, lo_key_equal);
-    lo.root.fd = -1;
-    lo.root.fuse_ino = FUSE_ROOT_ID;
     lo.cache = CACHE_AUTO;
 
     /*
@@ -3444,22 +3529,23 @@ int main(int argc, char *argv[])
      * [0] Reserved (will not be used)
      * [1] Root inode
      */
-    lo_map_init(&lo.ino_map);
-    reserve_elem = lo_map_reserve(&lo.ino_map, 0);
+    lo_map_init(FD_MAP);
+    lo_map_init(INO_MAP);
+    reserve_elem = lo_map_reserve(INO_MAP, 0);
     if (!reserve_elem) {
         fuse_log(FUSE_LOG_ERR, "failed to alloc reserve_elem.\n");
         goto err_out1;
     }
-    reserve_elem->in_use = false;
-    root_elem = lo_map_reserve(&lo.ino_map, lo.root.fuse_ino);
+    root_elem = lo_map_reserve(INO_MAP, FUSE_ROOT_ID);
     if (!root_elem) {
         fuse_log(FUSE_LOG_ERR, "failed to alloc root_elem.\n");
         goto err_out1;
     }
-    root_elem->inode = &lo.root;
-
-    lo_map_init(&lo.dirp_map);
-    lo_map_init(&lo.fd_map);
+    lo.root = &root_elem->inode;
+    pthread_mutex_init(&lo.root->plock_mutex, NULL);
+    lo.root->fd = -1;
+    lo.root->fuse_ino = FUSE_ROOT_ID;
+    lo_map_init(DIRP_MAP);
 
     if (fuse_parse_cmdline(&args, &opts) != 0) {
         goto err_out1;
@@ -3570,7 +3656,7 @@ int main(int argc, char *argv[])
 
     setup_sandbox(&lo, se, opts.syslog);
 
-    setup_root(&lo, &lo.root);
+    setup_root(&lo, lo.root);
     /* Block until ctrl+c or fusermount -u */
     ret = virtio_loop(se);
 
diff --git a/tools/virtiofsd/passthrough_seccomp.c b/tools/virtiofsd/passthrough_seccomp.c
index 11623f56f2..3a0dc60f73 100644
--- a/tools/virtiofsd/passthrough_seccomp.c
+++ b/tools/virtiofsd/passthrough_seccomp.c
@@ -115,6 +115,7 @@ static const int syscall_whitelist[] = {
     SCMP_SYS(utimensat),
     SCMP_SYS(write),
     SCMP_SYS(writev),
+    SCMP_SYS(memfd_create),
 };
 
 /* Syscalls used when --syslog is enabled */
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (4 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 5/9] virtiofsd: Convert the struct lo_map array to a more flatten layout Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2021-02-04 12:08   ` Dr. David Alan Gilbert
  2020-12-15 16:21 ` [RFC PATCH 7/9] virtiofsd: Persist/restore lo_map and opened fds to/from QEMU Jiachen Zhang
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

We add two options for virtiofsd crash reconnection: reconnect |
no_reconnect (default: no_reconnect) and mount_ns | no_mount_ns (default:
mount_ns).

Users of virtiofsd should set "-o reconnect" to enable crash reconnection.
Note that, when "-o reconnect" is set, some features are not supported and
must be disabled:

  - mount namespace: When the mount namespace is enabled, a reconnected
    virtiofsd will fail to link/rename with EXDEV (a minimal demonstration
    follows this list). The reason is that, when virtiofsd is sandboxed by
    a mount namespace, attempts to link/rename through fds opened before
    crashing (and recovered from QEMU) are treated as cross-mount
    operations, which the host kernel does not allow.

    So we introduce another option, "-o mount_ns|no_mount_ns" (mount_ns
    by default; it takes no effect when sandbox=chroot is specified).
    The option "-o no_mount_ns" should be used with "-o reconnect".

  - remote locking: As it is hard to determine whether a file is locked
    when handling inflight locking requests, we should specify "-o
    no_posix_lock" (default) and "-o no_flock" (default).

  - extended attributes: When handling inflight removexattr requests after
    reconnecting, it is hard to determine whether an attr has already been
    removed. Therefore, we should also use "-o no_xattr" (default) with
    "-o reconnect".

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 tools/virtiofsd/helper.c         |   9 +++
 tools/virtiofsd/passthrough_ll.c | 112 ++++++++++++++++++++++---------
 2 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index 75ac48dec2..e0d6525b1a 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -174,6 +174,10 @@ void fuse_cmdline_help(void)
            "                               - chroot: chroot(2) into shared\n"
            "                                 directory (use in containers)\n"
            "                               default: namespace\n"
+           "    -o mount_ns|no_mount_ns    enable/disable mount namespace\n"
+           "                               default: mount_ns\n"
+           "                               note: if chroot sandbox mode is used,\n"
+           "                               mount_ns will not take effect.\n"
            "    -o timeout=<number>        I/O timeout (seconds)\n"
            "                               default: depends on cache= option.\n"
            "    -o writeback|no_writeback  enable/disable writeback cache\n"
@@ -191,6 +195,11 @@ void fuse_cmdline_help(void)
            "                               to virtiofsd from guest applications.\n"
            "                               default: no_allow_direct_io\n"
            "    -o announce_submounts      Announce sub-mount points to the guest\n"
+           "    -o reconnect|no_reconnect  enable/disable crash reconnection\n"
+           "                               to enable crash reconnection, the options\n"
+           "                               no_mount_ns, no_flock, no_posix_lock, and\n"
+           "                               no_xattr should also be set.\n"
+           "                               default: no_reconnect\n"
            );
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 815b001076..73a50bd7a3 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -170,6 +170,8 @@ struct lo_data {
     pthread_mutex_t mutex;
     int sandbox;
     int debug;
+    int mount_ns;
+    int reconnect;
     int writeback;
     int flock;
     int posix_lock;
@@ -204,8 +206,12 @@ static const struct fuse_opt lo_opts[] = {
     { "sandbox=chroot",
       offsetof(struct lo_data, sandbox),
       SANDBOX_CHROOT },
+    { "reconnect", offsetof(struct lo_data, reconnect), 1 },
+    { "no_reconnect", offsetof(struct lo_data, reconnect), 0 },
     { "writeback", offsetof(struct lo_data, writeback), 1 },
     { "no_writeback", offsetof(struct lo_data, writeback), 0 },
+    { "mount_ns", offsetof(struct lo_data, mount_ns), 1 },
+    { "no_mount_ns", offsetof(struct lo_data, mount_ns), 0 },
     { "source=%s", offsetof(struct lo_data, source), 0 },
     { "flock", offsetof(struct lo_data, flock), 1 },
     { "no_flock", offsetof(struct lo_data, flock), 0 },
@@ -3047,8 +3053,14 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
      * an empty network namespace to prevent TCP/IP and other network
      * activity in case this process is compromised.
      */
-    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
-        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
+    int unshare_flag;
+    if (lo->mount_ns) {
+        unshare_flag = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
+    } else {
+        unshare_flag = CLONE_NEWPID | CLONE_NEWNET;
+    }
+    if (unshare(unshare_flag) != 0) {
+        fuse_log(FUSE_LOG_ERR, "unshare(): %m\n");
         exit(1);
     }
 
@@ -3083,38 +3095,47 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
     /* Send us SIGTERM when the parent thread terminates, see prctl(2) */
     prctl(PR_SET_PDEATHSIG, SIGTERM);
 
-    /*
-     * If the mounts have shared propagation then we want to opt out so our
-     * mount changes don't affect the parent mount namespace.
-     */
-    if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
-        fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
-        exit(1);
-    }
+    if (lo->mount_ns) {
+        /*
+         * If the mounts have shared propagation then we want to opt out so our
+         * mount changes don't affect the parent mount namespace.
+         */
+        if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
+            fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
+            exit(1);
+        }
 
-    /* The child must remount /proc to use the new pid namespace */
-    if (mount("proc", "/proc", "proc",
-              MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
-        fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
-        exit(1);
-    }
+        /* The child must remount /proc to use the new pid namespace */
+        if (mount("proc", "/proc", "proc",
+                  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
+            fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
+            exit(1);
+        }
 
-    /*
-     * We only need /proc/self/fd. Prevent ".." from accessing parent
-     * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
-     * previously remounted with MS_REC | MS_SLAVE this mount change only
-     * affects our process.
-     */
-    if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
-        fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
-        exit(1);
-    }
+        /*
+         * We only need /proc/self/fd. Prevent ".." from accessing parent
+         * directories of /proc/self/fd by bind-mounting it over /proc. Since
+         * / was previously remounted with MS_REC | MS_SLAVE this mount change
+         * only affects our process.
+         */
+        if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
+            fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
+            exit(1);
+        }
 
-    /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
-    lo->proc_self_fd = open("/proc", O_PATH);
-    if (lo->proc_self_fd == -1) {
-        fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
-        exit(1);
+        /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
+        lo->proc_self_fd = open("/proc", O_PATH);
+        if (lo->proc_self_fd == -1) {
+            fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
+            exit(1);
+        }
+    } else {
+        /* Now we can get our /proc/self/fd directory file descriptor */
+        lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+        if (lo->proc_self_fd == -1) {
+            fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
+            exit(1);
+        }
     }
 }
 
@@ -3347,7 +3368,9 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
 {
     if (lo->sandbox == SANDBOX_NAMESPACE) {
         setup_namespaces(lo, se);
-        setup_mounts(lo->source);
+        if (lo->mount_ns) {
+            setup_mounts(lo->source);
+        }
     } else {
         setup_chroot(lo);
     }
@@ -3438,7 +3461,11 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
     struct stat stat;
     uint64_t mnt_id;
 
-    fd = open("/", O_PATH);
+    if (lo->mount_ns) {
+        fd = open("/", O_PATH);
+    } else {
+        fd = open(lo->source, O_PATH);
+    }
     if (fd == -1) {
         fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", lo->source);
         exit(1);
@@ -3515,6 +3542,9 @@ int main(int argc, char *argv[])
     lo.allow_direct_io = 0;
     lo.proc_self_fd = -1;
 
+    lo.reconnect = 0;
+    lo.mount_ns = 1;
+
     /* Don't mask creation mode, kernel already did that */
     umask(0);
 
@@ -3577,6 +3607,22 @@ int main(int argc, char *argv[])
         goto err_out1;
     }
 
+    /* For sandbox mode "chroot", set the mount_ns option to 0. */
+    if (lo.sandbox == SANDBOX_CHROOT) {
+        lo.mount_ns = 0;
+    }
+
+    if (lo.reconnect) {
+        if ((lo.sandbox == SANDBOX_NAMESPACE && lo.mount_ns) || lo.flock
+                                               || lo.posix_lock || lo.xattr) {
+            printf("Mount namespace, remote lock, and extended attributes"
+                   " are not supported by reconnection (-o reconnect). Please "
+                   "specify  -o no_mount_ns, -o no_flock (default), -o"
+                   "no_posix_lock (default), and -o no_xattr (default).\n");
+            exit(1);
+        }
+    }
+
     if (opts.log_level != 0) {
         current_log_level = opts.log_level;
     } else {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 7/9] virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (5 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 8/9] virtiofsd: Ensure crash consistency after reconnection Jiachen Zhang
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

With the help of the 4 new vhost-user message types and the previous
commits, this commit enables crash reconnection of virtiofsd.

For the three kinds of lo_maps, INO_MAP, FD_MAP, and DIRP_MAP, a
VHOST_USER_SLAVE_SHM message is sent from virtiofsd to QEMU whenever a
map grows. When reconnecting, QEMU sends the lo_map memfd back to
virtiofsd with a VHOST_USER_SET_SHM message.
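
Assuming glibc's memfd_create() (available since glibc 2.27), here is a
condensed, standalone sketch of the pattern behind lo_map_grow: the map
lives in a memfd so the same pages can be handed to QEMU as a file
descriptor, and growing it is just ftruncate() plus a MAP_FIXED remap
inside a pre-reserved address window (sizes and the "demo_map" name are
illustrative):

    #define _GNU_SOURCE
    #include <assert.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* Reserve a large window up front so the map can grow in place,
         * similar to what lo_map_init does with its anonymous mapping. */
        size_t reserve = 1 << 20, sz = 4096;
        void *win = mmap(NULL, reserve, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int memfd = memfd_create("demo_map", 0);
        int ok = ftruncate(memfd, sz);
        assert(win != MAP_FAILED && memfd != -1 && ok == 0);

        /* Overlay the memfd at the start of the window; these pages can
         * be shared with QEMU via the fd while remaining mapped here. */
        char *map = mmap(win, sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_FIXED, memfd, 0);
        assert(map == win);
        strcpy(map, "state");

        /* Grow: extend the file, remap the larger size at the same
         * address; contents and pointers into the map stay valid. */
        ok = ftruncate(memfd, 2 * sz);
        assert(ok == 0);
        map = mmap(win, 2 * sz, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, memfd, 0);
        assert(map == win && strcmp(map, "state") == 0);
        return 0;
    }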

For the file descriptors, whenever virtiofsd opens or closes a file or a
directory, it also sends a VHOST_USER_SLAVE_FD message to QEMU to persist
(or drop) the opened fd in the QEMU process. When reconnecting, QEMU
sends all the fds back to virtiofsd with VHOST_USER_SET_FD messages.
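
The key carried by each VHOST_USER_SLAVE_FD message packs the lo_map slot
index and the map type into one integer, using the MAP_TYPE_SHIFT scheme
from the patch below. A minimal round-trip sketch (the helper name
encode_fd_key is ours, for illustration):

    #include <assert.h>
    #include <stddef.h>

    typedef enum { FD_MAP = 0, INO_MAP = 1, DIRP_MAP = 2 } map_t;
    #define MAP_TYPE_SHIFT 2 /* 3 map types fit into the low 2 bits */

    /* Illustrative: build the key that identifies a map slot to QEMU. */
    static int encode_fd_key(size_t slot, map_t type)
    {
        return (int)(slot << MAP_TYPE_SHIFT) | type;
    }

    int main(void)
    {
        int key = encode_fd_key(42, INO_MAP);
        map_t type = key & ((1 << MAP_TYPE_SHIFT) - 1); /* low bits: type */
        size_t slot = key >> MAP_TYPE_SHIFT;            /* high bits: slot */
        assert(type == INO_MAP && slot == 42);
        return 0;
    }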

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 tools/virtiofsd/fuse_lowlevel.c       |  21 ++-
 tools/virtiofsd/fuse_virtio.c         |  16 ++
 tools/virtiofsd/fuse_virtio.h         |   1 +
 tools/virtiofsd/passthrough_helpers.h |   2 +-
 tools/virtiofsd/passthrough_ll.c      | 213 +++++++++++++++++++++++---
 5 files changed, 226 insertions(+), 27 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index c70fb16a9a..f3358561e2 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2136,6 +2136,22 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
     send_reply_ok(req, &outarg, outargsize);
 }
 
+static void do_init_reconn(fuse_req_t req, fuse_ino_t nodeid,
+                    struct fuse_mbuf_iter *iter)
+{
+    struct fuse_session *se = req->se;
+
+    (void)nodeid;
+
+    se->got_init = 1;
+    se->got_destroy = 0;
+    if (se->op.init) {
+        se->op.init(se->userdata, &se->conn);
+    }
+
+    fuse_log(FUSE_LOG_DEBUG, "   Virtiofsd reconnected");
+}
+
 static void do_destroy(fuse_req_t req, fuse_ino_t nodeid,
                        struct fuse_mbuf_iter *iter)
 {
@@ -2444,8 +2460,7 @@ void fuse_session_process_buf_int(struct fuse_session *se,
      * requests, FUSE_INIT and FUSE_INIT, FUSE_INIT and FUSE_DESTROY, and
      * FUSE_DESTROY and FUSE_DESTROY.
      */
-    if (in->opcode == FUSE_INIT || in->opcode == CUSE_INIT ||
-        in->opcode == FUSE_DESTROY) {
+    if (!se->got_init || in->opcode == FUSE_DESTROY) {
         pthread_rwlock_wrlock(&se->init_rwlock);
     } else {
         pthread_rwlock_rdlock(&se->init_rwlock);
@@ -2457,7 +2472,7 @@ void fuse_session_process_buf_int(struct fuse_session *se,
 
         expected = se->cuse_data ? CUSE_INIT : FUSE_INIT;
         if (in->opcode != expected) {
-            goto reply_err;
+            do_init_reconn(req, 0, &iter);
         }
     } else if (in->opcode == FUSE_INIT || in->opcode == CUSE_INIT) {
         if (fuse_lowlevel_is_virtio(se)) {
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 83ba07c6cd..03b03ddacf 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -109,6 +109,12 @@ static void fv_set_features(VuDev *dev, uint64_t features)
 {
 }
 
+static uint64_t fv_get_protocol_features(VuDev *dev)
+{
+    return 1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD |
+            1ull << VHOST_USER_PROTOCOL_F_MAP_SHMFD;
+}
+
 /*
  * Callback from libvhost-user if there's a new fd we're supposed to listen
  * to, typically a queue kick?
@@ -773,6 +779,7 @@ static bool fv_queue_order(VuDev *dev, int qidx)
 static const VuDevIface fv_iface = {
     .get_features = fv_get_features,
     .set_features = fv_set_features,
+    .get_protocol_features = fv_get_protocol_features,
 
     /* Don't need process message, we've not got any at vhost-user level */
     .queue_set_started = fv_queue_set_started,
@@ -1035,3 +1042,12 @@ void virtio_session_close(struct fuse_session *se)
     free(se->virtio_dev);
     se->virtio_dev = NULL;
 }
+
+struct VuDev *virtio_get_dev(struct fuse_session *se)
+{
+    if (se == NULL || se->virtio_dev == NULL) {
+        return NULL;
+    }
+
+    return &se->virtio_dev->dev;
+}
diff --git a/tools/virtiofsd/fuse_virtio.h b/tools/virtiofsd/fuse_virtio.h
index 111684032c..3d9884e894 100644
--- a/tools/virtiofsd/fuse_virtio.h
+++ b/tools/virtiofsd/fuse_virtio.h
@@ -21,6 +21,7 @@ struct fuse_session;
 int virtio_session_mount(struct fuse_session *se);
 void virtio_session_close(struct fuse_session *se);
 int virtio_loop(struct fuse_session *se);
+struct VuDev *virtio_get_dev(struct fuse_session *se);
 
 
 int virtio_send_msg(struct fuse_session *se, struct fuse_chan *ch,
diff --git a/tools/virtiofsd/passthrough_helpers.h b/tools/virtiofsd/passthrough_helpers.h
index 0b98275ed5..aa82990a71 100644
--- a/tools/virtiofsd/passthrough_helpers.h
+++ b/tools/virtiofsd/passthrough_helpers.h
@@ -33,7 +33,7 @@ static int mknod_wrapper(int dirfd, const char *path, const char *link,
     int res;
 
     if (S_ISREG(mode)) {
-        res = openat(dirfd, path, O_CREAT | O_EXCL | O_WRONLY, mode);
+        res = openat(dirfd, path, O_CREAT | O_WRONLY, mode);
         if (res >= 0) {
             res = close(res);
         }
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 73a50bd7a3..8fee635396 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -69,6 +69,20 @@
 #include "passthrough_helpers.h"
 #include "passthrough_seccomp.h"
 #include "qemu/memfd.h"
+#include "contrib/libvhost-user/libvhost-user.h"
+
+static struct fuse_session *se;
+
+struct lo_data lo;
+
+typedef enum {
+    FD_MAP = 0,
+    INO_MAP = 1,
+    DIRP_MAP = 2,
+    MAP_TYPE_NUM = 3
+} map_t;
+/* We have 3 type of lo_maps, 2 bits is enough */
+#define MAP_TYPE_SHIFT 2
 
 /* Keep track of inode posix locks for each owner. */
 struct lo_inode_plock {
@@ -128,7 +142,7 @@ struct lo_map_elem {
 struct lo_map {
     size_t nelems;
     ssize_t freelist;
-    int map_type;
+    map_t map_type;
     int memfd;
     struct lo_map_elem elems[];
 } __attribute__((packed));
@@ -161,11 +175,6 @@ typedef struct xattr_map_entry {
     unsigned int flags;
 } XattrMapEntry;
 
-#define FD_MAP 0
-#define INO_MAP 1
-#define DIRP_MAP 2
-#define MAP_TYPE_NUM 3
-
 struct lo_data {
     pthread_mutex_t mutex;
     int sandbox;
@@ -187,6 +196,7 @@ struct lo_data {
     int allow_direct_io;
     int announce_submounts;
     bool use_statx;
+    int connected;
     struct lo_inode *root;
     GHashTable *inodes; /* protected by lo->mutex */
     struct lo_map *maps[MAP_TYPE_NUM]; /* protected by lo->mutex */
@@ -197,8 +207,6 @@ struct lo_data {
     int proc_self_fd;
 };
 
-static struct lo_data lo;
-
 static const struct fuse_opt lo_opts[] = {
     { "sandbox=namespace",
       offsetof(struct lo_data, sandbox),
@@ -237,6 +245,9 @@ static bool use_syslog = false;
 static int current_log_level;
 static void unref_inode_lolocked(struct lo_data *lo, struct lo_inode *inode,
                                  uint64_t n);
+static void try_reclaim_elems(void);
+static void restore_ino_map(void);
+static void restore_dirp_map(void);
 
 static struct {
     pthread_mutex_t mutex;
@@ -266,7 +277,7 @@ static int is_safe_path_component(const char *path)
 
 static struct lo_data *lo_data(fuse_req_t req)
 {
-    return &lo;
+    return (struct lo_data *)fuse_req_userdata(req);
 }
 
 /*
@@ -386,8 +397,7 @@ out:
     return ret;
 }
 
-
-static void lo_map_init(int map_type)
+static void lo_map_init(map_t map_type)
 {
     int memfd;
     int ret;
@@ -415,21 +425,15 @@ static void lo_map_init(int map_type)
     lo.maps[map_type]->freelist = -1;
     lo.maps[map_type]->map_type = map_type;
     lo.maps[map_type]->memfd = memfd;
-
-    printf("memfd: %d\n", lo.maps[map_type]->memfd);
-    ret = system("ls -lsh /proc/self/fd");
-    fuse_log(FUSE_LOG_DEBUG, "map %s init succeed!\n",
-                             MAP_NAME[lo.maps[map_type]->map_type]);
 }
 
-static void lo_map_destroy(int map_type)
+static void lo_map_destroy(map_t map_type)
 {
     munmap(lo.maps[map_type], sizeof(struct lo_map) +
            lo.maps[map_type]->nelems * sizeof(struct lo_map_elem));
-
 }
 
-static int lo_map_grow(int map_type, size_t new_nelems)
+static int lo_map_grow(map_t map_type, size_t new_nelems)
 {
     size_t i;
     size_t new_size;
@@ -460,10 +464,19 @@ static int lo_map_grow(int map_type, size_t new_nelems)
 
     lo.maps[map_type]->freelist = lo.maps[map_type]->nelems;
     lo.maps[map_type]->nelems = new_nelems;
+
+    if (lo.reconnect && lo.connected) {
+        /* sync the lo_map growing with QEMU */
+        uint64_t map_size = sizeof(struct lo_map) +
+                                sizeof(struct lo_map_elem) * new_nelems;
+        vu_slave_send_shm(virtio_get_dev(se), lo.maps[map_type]->memfd,
+                          map_size, map_type);
+    }
+
     return 1;
 }
 
-static struct lo_map_elem *lo_map_alloc_elem(int map_type)
+static struct lo_map_elem *lo_map_alloc_elem(map_t map_type)
 {
     struct lo_map_elem *elem;
     struct lo_map *map = lo.maps[map_type];
@@ -478,7 +491,7 @@ static struct lo_map_elem *lo_map_alloc_elem(int map_type)
     return elem;
 }
 
-static struct lo_map_elem *lo_map_reserve(int map_type, size_t key)
+static struct lo_map_elem *lo_map_reserve(map_t map_type, size_t key)
 {
     ssize_t *prev;
     struct lo_map *map = lo.maps[map_type];
@@ -531,6 +544,12 @@ static void lo_map_remove(struct lo_map *map, size_t key)
         closedir(elem->dirp.dp);
     }
 
+    if (lo.reconnect) {
+        /* remove the opened fd from QEMU */
+        int fd_key = key << MAP_TYPE_SHIFT | map->map_type;
+        vu_slave_send_fd_del(virtio_get_dev(se), fd_key);
+    }
+
     if (g_atomic_int_get(&elem->refcount) == -1) {
         return;
     }
@@ -539,6 +558,7 @@ static void lo_map_remove(struct lo_map *map, size_t key)
 
     elem->freelist = map->freelist;
     map->freelist = key;
+
 }
 
 /* Assumes lo->mutex is held */
@@ -554,6 +574,17 @@ static ssize_t lo_add_fd_mapping(fuse_req_t req, int fd)
     g_atomic_int_set(&elem->refcount, 1);
 
     elem->fd = fd;
+
+    ssize_t fh = ((unsigned long)elem -
+                 (unsigned long)lo_data(req)->maps[FD_MAP]->elems) /
+                 sizeof(struct lo_map_elem);
+
+    if (lo.reconnect) {
+        /* Send the newly opened fd to QEMU */
+        int key = fh << MAP_TYPE_SHIFT | FD_MAP;
+        vu_slave_send_fd_add(virtio_get_dev(se), fd, key);
+    }
+
     return ((unsigned long)elem -
             (unsigned long)lo_data(req)->maps[FD_MAP]->elems) /
             sizeof(struct lo_map_elem);
@@ -695,6 +726,13 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
                  "does not support it\n");
         lo->announce_submounts = false;
     }
+    lo->connected = 1;
+
+    /*
+     * Try to reclaim as many elems as possible, then restore the maps
+     * (including the root inode) from the recovered shared memory.
+     */
+    try_reclaim_elems();
+    restore_ino_map();
+    restore_dirp_map();
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -1016,6 +1054,12 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
 
         g_hash_table_insert(lo->inodes, &inode->key, inode);
         pthread_mutex_unlock(&lo->mutex);
+
+        if (lo->reconnect) {
+            /* Send the newly opened fd to QEMU */
+            int key = inode->fuse_ino << MAP_TYPE_SHIFT | INO_MAP;
+            vu_slave_send_fd_add(virtio_get_dev(se), newfd, key);
+        }
     }
     e->ino = inode->fuse_ino;
     lo_inode_put(lo, inode);
@@ -1557,6 +1601,13 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
     }
     g_atomic_int_set(&elem->refcount, 1); /* paired with lo_releasedir() */
     pthread_mutex_unlock(&lo->mutex);
+
+    if (lo->reconnect) {
+        /* Send the newly opened fd to QEMU */
+        int key = fh << MAP_TYPE_SHIFT | DIRP_MAP;
+        vu_slave_send_fd_add(virtio_get_dev(se), fd, key);
+    }
+
     fuse_reply_open(req, fi);
     return;
 
@@ -2972,6 +3023,8 @@ static void lo_destroy(void *userdata)
         unref_inode(lo, inode, inode->nlookup);
     }
     pthread_mutex_unlock(&lo->mutex);
+
+    lo->connected = 0;
 }
 
 static struct fuse_lowlevel_ops lo_oper = {
@@ -3025,6 +3078,113 @@ static void print_capabilities(void)
     printf("}\n");
 }
 
+static void try_reclaim_elems(void)
+{
+    /*
+     * Crash consistency: check whether virtiofsd crashed in an
+     * inconsistent state while removing an element, and try to fix it by
+     * removing again.
+     */
+    map_t map_type;
+    int i;
+    for (map_type = 0; map_type < MAP_TYPE_NUM; map_type++) {
+        if (map_type == INO_MAP || map_type == DIRP_MAP) {
+            for (i = 0; i < lo.maps[map_type]->nelems; i++) {
+                if (g_atomic_int_get(&lo.maps[map_type]->elems[i].refcount)
+                                                                     == 0) {
+                    pthread_mutex_lock(&lo.mutex);
+                    lo_map_remove(lo.maps[map_type], i);
+                    pthread_mutex_unlock(&lo.mutex);
+                }
+            }
+        }
+    }
+}
+
+static void restore_ino_map(void)
+{
+    int i;
+
+    lo.root = &lo_map_get(lo.maps[INO_MAP], FUSE_ROOT_ID)->inode;
+    /* Restore the ino_map and lo.inodes hash table */
+    for (i = 0; i < lo.maps[INO_MAP]->nelems; i++) {
+        struct lo_map_elem *elem = &lo.maps[INO_MAP]->elems[i];
+        if (g_atomic_int_get(&elem->refcount) != -1) {
+            struct lo_inode *inode = &elem->inode;
+            pthread_mutex_init(&inode->plock_mutex, NULL);
+            elem->inode.posix_locks = g_hash_table_new_full(
+                g_direct_hash, g_direct_equal, NULL, posix_locks_value_destroy);
+            if (inode->nlookup > 0) {
+                g_hash_table_insert(lo.inodes, &inode->key, inode);
+            }
+            if (i > FUSE_ROOT_ID) {
+                g_atomic_int_set(&elem->refcount, 1);
+            } else {
+                g_atomic_int_set(&elem->refcount, 2);
+            }
+        }
+    }
+}
+
+static void restore_dirp_map(void)
+{
+    int i;
+
+    /* Restore the fields of lo_dirp. */
+    for (i = 0; i < lo.maps[DIRP_MAP]->nelems; i++) {
+        struct lo_map_elem *elem = &lo.maps[DIRP_MAP]->elems[i];
+        if (g_atomic_int_get(&elem->refcount) != -1) {
+            struct lo_dirp *d = &elem->dirp;
+            d->dp = fdopendir(d->fd);
+            d->fd = dirfd(d->dp);
+            d->entry = NULL;
+            g_atomic_int_set(&elem->refcount, 1);
+        }
+    }
+}
+
+static bool vu_set_fs_map(VuDev *dev, VhostUserMsg *vmsg)
+{
+    map_t map_type = vmsg->payload.shm.id;
+    int cur_map_memfd = lo.maps[map_type]->memfd;
+    int tmp_memfd = vmsg->fds[0];
+
+    lo.maps[map_type]->map_type = map_type;
+
+    uint64_t cur_map_size = sizeof(struct lo_map) +
+                    sizeof(struct lo_map_elem) * lo.maps[map_type]->nelems;
+    munmap(lo.maps[map_type], cur_map_size);
+    close(cur_map_memfd);
+    lo.maps[map_type] = mmap(lo.maps[map_type], vmsg->payload.shm.size,
+                        PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
+                                                        tmp_memfd, 0);
+    lo.maps[map_type]->memfd = tmp_memfd;
+
+    return false;
+}
+
+static bool vu_set_fs_proc_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int fd_num = vmsg->fd_num;
+    int fd, fd_key;
+
+    assert(fd_num == 1);
+    fd = vmsg->fds[0];
+    fd_key = vmsg->payload.fdinfo.key;
+
+    map_t map_type = fd_key & ((1 << MAP_TYPE_SHIFT) - 1);
+    int key = fd_key >> MAP_TYPE_SHIFT;
+
+    if (map_type == INO_MAP) {
+        lo.maps[INO_MAP]->elems[key].inode.fd = fd;
+    } else if (map_type == DIRP_MAP) {
+        lo.maps[DIRP_MAP]->elems[key].dirp.fd = fd;
+    } else if (map_type == FD_MAP) {
+        lo.maps[FD_MAP]->elems[key].fd = fd;
+    }
+
+    return false;
+}
+
 /*
  * Drop all Linux capabilities because the wait parent process only needs to
  * sit in waitpid(2) and terminate.
@@ -3529,7 +3689,6 @@ static void fuse_lo_data_cleanup(struct lo_data *lo)
 int main(int argc, char *argv[])
 {
     struct fuse_args args = FUSE_ARGS_INIT(argc, argv);
-    struct fuse_session *se;
     struct fuse_cmdline_opts opts;
     struct lo_map_elem *root_elem;
     struct lo_map_elem *reserve_elem;
@@ -3541,7 +3700,7 @@ int main(int argc, char *argv[])
     lo.posix_lock = 1;
     lo.allow_direct_io = 0,
     lo.proc_self_fd = -1;
-
+    lo.connected = 0;
     lo.reconnect = 0;
     lo.mount_ns = 1;
 
@@ -3621,6 +3780,9 @@ int main(int argc, char *argv[])
                    "no_posix_lock (default), and -o no_xattr (default).\n");
             exit(1);
         }
+
+        vu_set_shm_cb = vu_set_fs_map;
+        vu_set_fd_cb = vu_set_fs_proc_fd;
     }
 
     if (opts.log_level != 0) {
@@ -3706,6 +3868,11 @@ int main(int argc, char *argv[])
     /* Block until ctrl+c or fusermount -u */
     ret = virtio_loop(se);
 
+    /* If reconnection is enabled, we directly return without destroy status */
+    if (lo.reconnect) {
+        return 0;
+    }
+
     fuse_session_unmount(se);
     cleanup_capng();
 err_out3:
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 8/9] virtiofsd: Ensure crash consistency after reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (6 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 7/9] virtiofsd: Persist/restore lo_map and opened fds to/from QEMU Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2020-12-15 16:21 ` [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking Jiachen Zhang
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

Now the lo_maps are linked lists built on contiguous arrays that are
shared between QEMU and virtiofsd. Adding an element, removing one, or
growing a lo_map all involve multi-step updates of the shared memory
region.

So we need to ensure the fault-atomicity (crash-consistency) of these
operations.

We could use redo/undo logging to handle this, but that is rather complex
and might serialize the concurrent handling of virtiofsd requests.
Therefore we employ a more relaxed scheme, specifically:

    - Ensure the execution order of the code involved in consistency.
      We reorder some operations to make sure that the lo_maps remain
      usable after reconnection.

    - Relax some error handling. For example, lo_link, lo_symlink, lo_mkdir
      and lo_mknod should reply success when the errno of the underlying FS
      syscall (such as linkat or mknodat) is EEXIST; lo_rmdir, lo_unlink
      and lo_rename should succeed (rather than return EIO) when lookup_name
      fails with ENOENT. A sketch of this idea follows the list.
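
As a hedged sketch of the relaxed error handling (the helper name and the
/tmp path are ours, not from the patch): a create-style operation that may
be replayed after reconnection treats EEXIST as success, because the
syscall may already have completed before the crash:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /*
     * Illustrative replay-tolerant create: an inflight FUSE_MKNOD replayed
     * after reconnection must not fail just because the node was already
     * created before virtiofsd crashed.
     */
    static int mknod_replay_tolerant(int dirfd, const char *name,
                                     mode_t mode, dev_t rdev)
    {
        if (mknodat(dirfd, name, mode, rdev) == -1 && errno != EEXIST) {
            return -errno; /* a genuine failure */
        }
        return 0; /* created now, or already created before the crash */
    }

    int main(void)
    {
        /* Calling the helper twice mimics a replayed inflight request. */
        int r1 = mknod_replay_tolerant(AT_FDCWD, "/tmp/reconn_demo",
                                       S_IFREG | 0644, 0);
        int r2 = mknod_replay_tolerant(AT_FDCWD, "/tmp/reconn_demo",
                                       S_IFREG | 0644, 0);
        printf("first: %d, replay: %d\n", r1, r2);
        return 0;
    }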

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 tools/virtiofsd/fuse_lowlevel.c  |   3 +
 tools/virtiofsd/fuse_virtio.c    |  28 +++++++
 tools/virtiofsd/passthrough_ll.c | 123 +++++++++++++++++++++++--------
 3 files changed, 123 insertions(+), 31 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index f3358561e2..ddd8667e16 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2521,6 +2521,9 @@ void fuse_session_process_buf_int(struct fuse_session *se,
     if (in->opcode == FUSE_WRITE && se->op.write_buf) {
         do_write_buf(req, in->nodeid, &iter, bufv);
     } else {
+        if (in->opcode == FUSE_FORGET || in->opcode == FUSE_BATCH_FORGET) {
+            virtio_send_msg(req->se, req->ch, NULL, 0);
+        }
         fuse_ll_ops[in->opcode].func(req, in->nodeid, &iter);
     }
 
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 03b03ddacf..8e8e9f48f3 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -203,6 +203,30 @@ static void copy_iov(struct iovec *src_iov, int src_count,
     }
 }
 
+static int virtio_no_msg(struct fuse_session *se, struct fuse_chan *ch)
+{
+    FVRequest *req = container_of(ch, FVRequest, ch);
+    struct fv_QueueInfo *qi = ch->qi;
+    VuDev *dev = &se->virtio_dev->dev;
+    VuVirtq *q = vu_get_queue(dev, qi->qidx);
+    VuVirtqElement *elem = &req->elem;
+    int ret = 0;
+
+    fuse_log(FUSE_LOG_DEBUG, "%s: elem %d no reply sent\n", __func__,
+             elem->index);
+
+    pthread_rwlock_rdlock(&qi->virtio_dev->vu_dispatch_rwlock);
+    pthread_mutex_lock(&qi->vq_lock);
+    vu_queue_push(dev, q, elem, 0);
+    vu_queue_notify(dev, q);
+    pthread_mutex_unlock(&qi->vq_lock);
+    pthread_rwlock_unlock(&qi->virtio_dev->vu_dispatch_rwlock);
+
+    req->reply_sent = true;
+
+    return ret;
+}
+
 /*
  * Called back by ll whenever it wants to send a reply/message back
  * The 1st element of the iov starts with the fuse_out_header
@@ -218,6 +242,10 @@ int virtio_send_msg(struct fuse_session *se, struct fuse_chan *ch,
     VuVirtqElement *elem = &req->elem;
     int ret = 0;
 
+    if (iov == NULL && count == 0) {
+        return virtio_no_msg(se, ch);
+    }
+
     assert(count >= 1);
     assert(iov[0].iov_len >= sizeof(struct fuse_out_header));
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 8fee635396..b5750ef179 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -462,17 +462,22 @@ static int lo_map_grow(map_t map_type, size_t new_nelems)
     }
     lo.maps[map_type]->elems[new_nelems - 1].freelist = -1;
 
-    lo.maps[map_type]->freelist = lo.maps[map_type]->nelems;
-    lo.maps[map_type]->nelems = new_nelems;
-
      if (lo.reconnect && lo.connected) {
         /* sync the lo_map growing with QEMU */
         uint64_t map_size = sizeof(struct lo_map) +
                                 sizeof(struct lo_map_elem) * new_nelems;
         vu_slave_send_shm(virtio_get_dev(se), lo.maps[map_type]->memfd,
                           map_size, map_type);
+        /*
+         * A compiler barrier to ensure map_size is sent to QEMU
+         * before we update lo.maps[map_type]->nelems.
+         */
+        barrier();
     }
 
+    lo.maps[map_type]->freelist = lo.maps[map_type]->nelems;
+    lo.maps[map_type]->nelems = new_nelems;
+
     return 1;
 }
 
@@ -554,11 +559,15 @@ static void lo_map_remove(struct lo_map *map, size_t key)
         return;
     }
 
+    /*
+     * Crash consistency: set refcount to -1 (not in use) first, so the
+     * slot is invalidated immediately. A crash here can then only waste
+     * a slot, never leave the map in an inconsistent state.
+     */
     g_atomic_int_set(&elem->refcount, -1);
 
     elem->freelist = map->freelist;
     map->freelist = key;
-
 }
 
 /* Assumes lo->mutex is held */
@@ -571,10 +580,6 @@ static ssize_t lo_add_fd_mapping(fuse_req_t req, int fd)
         return -1;
     }
 
-    g_atomic_int_set(&elem->refcount, 1);
-
-    elem->fd = fd;
-
     ssize_t fh = ((unsigned long)elem -
                  (unsigned long)lo_data(req)->maps[FD_MAP]->elems) /
                  sizeof(struct lo_map_elem);
@@ -585,6 +590,9 @@ static ssize_t lo_add_fd_mapping(fuse_req_t req, int fd)
         vu_slave_send_fd_add(virtio_get_dev(se), fd, key);
     }
 
+    g_atomic_int_set(&elem->refcount, 1);
+    elem->fd = fd;
+
     return ((unsigned long)elem -
             (unsigned long)lo_data(req)->maps[FD_MAP]->elems) /
             sizeof(struct lo_map_elem);
@@ -1039,27 +1047,27 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t parent, const char *name,
         struct lo_map_elem *elem;
         pthread_mutex_lock(&lo->mutex);
         elem = lo_add_inode_mapping(req, inode);
-
-        /*
-         * One for the caller and one for nlookup (released in
-         * unref_inode_lolocked())
-         */
-        g_atomic_int_set(&elem->refcount, 2);
-
         free(inode);
         inode = &elem->inode;
         inode->fuse_ino = ((unsigned long)elem -
                         (unsigned long)lo_data(req)->maps[INO_MAP]->elems) /
                                                 sizeof(struct lo_map_elem);
 
-        g_hash_table_insert(lo->inodes, &inode->key, inode);
-        pthread_mutex_unlock(&lo->mutex);
 
         if (lo->reconnect) {
             /* Send the newly opened fd to QEMU */
             int key = inode->fuse_ino << MAP_TYPE_SHIFT | INO_MAP;
             vu_slave_send_fd_add(virtio_get_dev(se), newfd, key);
         }
+
+        /*
+         * One for the caller and one for nlookup (released in
+         * unref_inode_lolocked())
+         */
+        g_atomic_int_set(&elem->refcount, 2);
+
+        g_hash_table_insert(lo->inodes, &inode->key, inode);
+        pthread_mutex_unlock(&lo->mutex);
     }
     e->ino = inode->fuse_ino;
     lo_inode_put(lo, inode);
@@ -1202,7 +1210,9 @@ static void lo_mknod_symlink(fuse_req_t req, fuse_ino_t parent,
     lo_restore_cred(&old);
 
     if (res == -1) {
-        goto out;
+        if (!(errno == EEXIST && lo->reconnect)) {
+            goto out;
+        }
     }
 
     saverr = lo_do_lookup(req, parent, name, &e);
@@ -1272,7 +1282,9 @@ static void lo_link(fuse_req_t req, fuse_ino_t ino, fuse_ino_t parent,
                  AT_SYMLINK_FOLLOW);
 
     if (res == -1) {
-        goto out_err;
+        if (!(errno == EEXIST && lo->reconnect)) {
+            goto out_err;
+        }
     }
 
     res = fstatat(inode->fd, "", &e.attr, AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW);
@@ -1337,13 +1349,22 @@ static void lo_rmdir(fuse_req_t req, fuse_ino_t parent, const char *name)
 
     inode = lookup_name(req, parent, name);
     if (!inode) {
-        fuse_reply_err(req, EIO);
-        return;
+        if (lo->reconnect && errno == ENOENT) {
+            /*
+             * The entry is already gone: we presumably crashed after the
+             * unlinkat(2), so skip it and reply 0 to complete the request.
+             */
+            fuse_reply_err(req, 0);
+        } else {
+            fuse_reply_err(req, EIO);
+        }
+        goto out;
     }
 
     res = unlinkat(lo_fd(req, parent), name, AT_REMOVEDIR);
 
     fuse_reply_err(req, res == -1 ? errno : 0);
+out:
     unref_inode_lolocked(lo, inode, 1);
     lo_inode_put(lo, inode);
 }
@@ -1371,11 +1392,19 @@ static void lo_rename(fuse_req_t req, fuse_ino_t parent, const char *name,
         goto out;
     }
 
-    oldinode = lookup_name(req, parent, name);
     newinode = lookup_name(req, newparent, newname);
+    oldinode = lookup_name(req, parent, name);
 
     if (!oldinode) {
-        fuse_reply_err(req, EIO);
+        if (lo->reconnect && errno == ENOENT) {
+            /*
+             * The old name is already gone: we presumably crashed after the
+             * renameat(2), so skip it and reply 0 to complete the request.
+             */
+            fuse_reply_err(req, 0);
+        } else {
+            fuse_reply_err(req, EIO);
+        }
         goto out;
     }
 
@@ -1419,13 +1448,22 @@ static void lo_unlink(fuse_req_t req, fuse_ino_t parent, const char *name)
 
     inode = lookup_name(req, parent, name);
     if (!inode) {
-        fuse_reply_err(req, EIO);
-        return;
+        if (lo->reconnect && errno == ENOENT) {
+            /*
+             * The entry is already gone: we presumably crashed after the
+             * unlinkat(2), so skip it and reply 0 to complete the request.
+             */
+            fuse_reply_err(req, 0);
+        } else {
+            fuse_reply_err(req, EIO);
+        }
+        goto out;
     }
 
     res = unlinkat(lo_fd(req, parent), name, 0);
 
     fuse_reply_err(req, res == -1 ? errno : 0);
+out:
     unref_inode_lolocked(lo, inode, 1);
     lo_inode_put(lo, inode);
 }
@@ -1599,8 +1637,6 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
     if (lo->cache == CACHE_ALWAYS) {
         fi->cache_readdir = 1;
     }
-    g_atomic_int_set(&elem->refcount, 1); /* paired with lo_releasedir() */
-    pthread_mutex_unlock(&lo->mutex);
 
     if (lo->reconnect) {
         /* Send the newly opened fd to QEMU */
@@ -1608,7 +1644,10 @@ static void lo_opendir(fuse_req_t req, fuse_ino_t ino,
         vu_slave_send_fd_add(virtio_get_dev(se), fd, key);
     }
 
+    g_atomic_int_set(&elem->refcount, 1); /* paired with lo_releasedir() */
+    pthread_mutex_unlock(&lo->mutex);
     fuse_reply_open(req, fi);
+
     return;
 
 out_errno:
@@ -1768,9 +1807,15 @@ static void lo_releasedir(fuse_req_t req, fuse_ino_t ino,
     d = &elem->dirp;
     pthread_mutex_unlock(&lo->mutex);
 
-    lo_dirp_put(d, fi->fh); /* paired with lo_opendir() */
-
+    /*
+     * Reply to the request before releasing the lo_map slot. The fi->fh
+     * slot may be re-allocated to a valid elem right after it is
+     * released, so replying after releasing could double-remove the slot
+     * after reconnection and mistakenly free the new, valid elem.
+     */
     fuse_reply_err(req, 0);
+
+    lo_dirp_put(d, fi->fh); /* paired with lo_opendir() */
 }
 
 static void update_open_flags(int writeback, int allow_direct_io,
@@ -2084,11 +2129,14 @@ static void lo_release(fuse_req_t req, fuse_ino_t ino,
     pthread_mutex_lock(&lo->mutex);
     elem = lo_map_get(lo->maps[FD_MAP], fi->fh);
     if (elem) {
+        /*
+         * Reply to the request before releasing the lo_map slot. As in
+         * the lo_releasedir case, this avoids a false double-remove.
+         */
+        fuse_reply_err(req, 0);
         lo_map_remove(lo->maps[FD_MAP], fi->fh);
     }
     pthread_mutex_unlock(&lo->mutex);
-
-    fuse_reply_err(req, 0);
 }
 
 static void lo_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
@@ -3159,6 +3207,19 @@ static bool vu_set_fs_map(VuDev *dev, VhostUserMsg *vmsg)
                                                         tmp_memfd, 0);
     lo.maps[map_type]->memfd = tmp_memfd;
 
+    /*
+     * Crash consistency: check whether virtiofsd crashed in an
+     * inconsistent state while growing the map, and fix it up if needed.
+     */
+    cur_map_size = sizeof(struct lo_map) +
+                        sizeof(struct lo_map_elem) * lo.maps[map_type]->nelems;
+    if (cur_map_size != vmsg->payload.shm.size) {
+        size_t new_nelems = (vmsg->payload.shm.size - cur_map_size) /
+                                                    sizeof(struct lo_map_elem);
+        lo_map_grow(map_type, lo.maps[map_type]->nelems + new_nelems);
+
+    }
+
     return false;
 }
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (7 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 8/9] virtiofsd: Ensure crash consistency after reconnection Jiachen Zhang
@ 2020-12-15 16:21 ` Jiachen Zhang
  2021-02-04 12:15   ` Dr. David Alan Gilbert
  2020-12-15 22:51 ` [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection no-reply
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-15 16:21 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs, Jiachen Zhang, qemu-devel

This is a workaround. qsort() mallocs memory instead of using stack
memory when resubmit_num is larger than 64 (total size larger than
1024 bytes). The allocation makes seccomp kill virtiofsd, so we comment
out the qsort call. This workaround does not affect the correctness of
inflight I/O tracking.
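
If the ordering does turn out to matter, an allocation-free alternative
would be to keep inflight_desc_compare() and replace qsort() with a
simple in-place insertion sort, roughly (an untested sketch, not part of
this patch):

    /* Sketch: sorts resubmit_list without ever touching the heap. */
    static void
    inflight_desc_isort(VuVirtqInflightDesc *list, int num)
    {
        for (int i = 1; i < num; i++) {
            VuVirtqInflightDesc key = list[i];
            int j = i - 1;

            /* shift elements that sort after 'key' one slot right */
            while (j >= 0 && inflight_desc_compare(&list[j], &key) > 0) {
                list[j + 1] = list[j];
                j--;
            }
            list[j + 1] = key;
        }
    }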

Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
---
 contrib/libvhost-user/libvhost-user.c | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 8c97013e59..c226d5d915 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -1167,20 +1167,6 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
     return true;
 }
 
-static int
-inflight_desc_compare(const void *a, const void *b)
-{
-    VuVirtqInflightDesc *desc0 = (VuVirtqInflightDesc *)a,
-                        *desc1 = (VuVirtqInflightDesc *)b;
-
-    if (desc1->counter > desc0->counter &&
-        (desc1->counter - desc0->counter) < VIRTQUEUE_MAX_SIZE * 2) {
-        return 1;
-    }
-
-    return -1;
-}
-
 static int
 vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
 {
@@ -1236,10 +1222,6 @@ vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
             }
         }
 
-        if (vq->resubmit_num > 1) {
-            qsort(vq->resubmit_list, vq->resubmit_num,
-                  sizeof(VuVirtqInflightDesc), inflight_desc_compare);
-        }
         vq->counter = vq->resubmit_list[0].counter + 1;
     }
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (8 preceding siblings ...)
  2020-12-15 16:21 ` [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking Jiachen Zhang
@ 2020-12-15 22:51 ` no-reply
  2020-12-16 15:36 ` Marc-André Lureau
  2021-05-10 14:38 ` Jiachen Zhang
  11 siblings, 0 replies; 30+ messages in thread
From: no-reply @ 2020-12-15 22:51 UTC (permalink / raw)
  To: zhangjiachen.jaycee
  Cc: mst, dgilbert, qemu-devel, virtio-fs, xieyongji,
	zhangjiachen.jaycee, stefanha

Patchew URL: https://patchew.org/QEMU/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
Subject: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com -> patchew/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com
 * [new tag]         patchew/20201215174824.76017-1-richard.henderson@linaro.org -> patchew/20201215174824.76017-1-richard.henderson@linaro.org
Switched to a new branch 'test'
c6015f6 virtiofsd: (work around) Comment qsort in inflight I/O tracking
a285357 virtiofsd: Ensure crash consistency after reconnection
d5f69c8 virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
4e0a56d virtiofsd: Add two new options for crash reconnection
d4b638e virtiofsd: Convert the struct lo_map array to a more flatten layout
6dadec2 libvhost-user: Add vhost-user message types for sending shared memory and file fds
91f265b vhost-user-fs: Support virtiofsd crash reconnection
bb587ea vhost: Add vhost-user message types for sending shared memory and file fds
ccb8eca vhost-user-fs: Add support for reconnection of vhost-user-fs backend

=== OUTPUT BEGIN ===
1/9 Checking commit ccb8eca38f55 (vhost-user-fs: Add support for reconnection of vhost-user-fs backend)
2/9 Checking commit bb587eaf556e (vhost: Add vhost-user message types for sending shared memory and file fds)
3/9 Checking commit 91f265b55b12 (vhost-user-fs: Support virtiofsd crash reconnection)
4/9 Checking commit 6dadec226d05 (libvhost-user: Add vhost-user message types for sending shared memory and file fds)
5/9 Checking commit d4b638e34242 (virtiofsd: Convert the struct lo_map array to a more flatten layout)
ERROR: use qemu_real_host_page_size instead of getpagesize()
#195: FILE: tools/virtiofsd/passthrough_ll.c:383:
+    int page_size = getpagesize();

total: 1 errors, 0 warnings, 997 lines checked

Patch 5/9 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

6/9 Checking commit 4e0a56d0217d (virtiofsd: Add two new options for crash reconnection)
7/9 Checking commit d5f69c8465ac (virtiofsd: Persist/restore lo_map and opened fds to/from QEMU)
8/9 Checking commit a285357c7bff (virtiofsd: Ensure crash consistency after reconnection)
9/9 Checking commit c6015f602fb0 (virtiofsd: (work around) Comment qsort in inflight I/O tracking)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20201215162119.27360-1-zhangjiachen.jaycee@bytedance.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (9 preceding siblings ...)
  2020-12-15 22:51 ` [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection no-reply
@ 2020-12-16 15:36 ` Marc-André Lureau
  2020-12-18  9:39   ` [External] " Jiachen Zhang
  2021-05-10 14:38 ` Jiachen Zhang
  11 siblings, 1 reply; 30+ messages in thread
From: Marc-André Lureau @ 2020-12-16 15:36 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: Daniel P. Berrange, Michael S . Tsirkin, QEMU,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 6213 bytes --]

Hi

On Tue, Dec 15, 2020 at 8:22 PM Jiachen Zhang <
zhangjiachen.jaycee@bytedance.com> wrote:

> Hi, all
>
> We implement virtio-fs crash reconnection in this patchset. The crash
> reconnection of virtiofsd here is completely transparent to guest, no
> remount in guest is needed, even the inflight requests can be handled
> normally after reconnection. We are looking forward to any comments.
>
> Thanks,
> Jiachen
>
>
> OVERVIEW:
>
> To support virtio-fs crash reconnection, we need to support the recovery
> of 1) inflight FUSE request, and 2) virtiofsd internal status information.
>
> Fortunately, QEMU's vhost-user reconnection framework already supports
> inflight I/O tracking by using VHOST_USER_GET_INFLIGHT_FD and
> VHOST_USER_SET_INFLIGHT_FD (see 5ad204bf2 and 5f9ff1eff for details).
> As the FUSE requests are transferred by virtqueue I/O requests, by using
> the vhost-user inflight I/O tracking, we can recover the inflight FUSE
> requests.
>
> To support virtiofsd internal status recovery, we introduce 4 new
> vhost-user message types. As shown in the following diagram, two of them
> are used to persist shared lo_maps and opened fds to QEMU, the other two
> message types are used to restore the status when reconnecting.
>
>                                VHOST_USER_SLAVE_SHM
>                                VHOST_USER_SLAVE_FD
>     +--------------+       Persist       +--------------------+
>     |              <---------------------+                    |
>     |     QEMU     |                     |  Virtio-fs Daemon  |
>     |              +--------------------->                    |
>     +--------------+       Restore       +--------------------+
>             VHOST_USER_SET_SHM
>             VHOST_USER_SET_FD
>
> Although the 4 newly added message types are to support virtiofsd
> reconnection in this patchset, it might be potential in other situation.
> So we keep in mind to make them more general when add them to vhost
> related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be
> used for memory sharing between a vhost-user daemon and QEMU,
> VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to
> shared opened fds between QEMU process and vhost-user daemon process.
>

Before adding new messages to the already complex vhost-user protocol, can
we evaluate other options?

First thing that came to my mind is that the memory state could be saved to
disk or with a POSIX shared memory object.
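
For example, the state could live in a named shm object that outlives
the daemon, roughly like this (a rough sketch; the object name is made
up, and state_size stands for the size of the region to persist):

    /* Sketch: keep state in a named POSIX shm object across crashes.
     * Needs <sys/mman.h> and <fcntl.h>. */
    int fd = shm_open("/virtiofsd-state", O_CREAT | O_RDWR, 0600);
    if (fd >= 0 && ftruncate(fd, state_size) == 0) {
        void *state = mmap(NULL, state_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        /* ... place the lo_maps here; shm_open() + mmap() after a
         * restart recover the same contents ... */
    }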

Eventually, the protocol could just pass around the fds, and not give
shared memory special treatment.

Then I remember systemd has a pretty good API & protocol for this sort of
thing: sd_notify(3) (afaik, it is quite easy to implement a minimal handler)

You can store fds with FDSTORE=1 (with an optional associated FDNAME).
sd_listen_fds() & friends get them back (note: passed by inheritance only,
I think). systemd seems not to make shm a special case either, just
treating it like any other opened fd to restore.
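
A minimal sketch of that flow (assuming libsystemd's sd-daemon API;
the FDNAME value is arbitrary, and error handling is omitted):

    #include <systemd/sd-daemon.h>

    /* while running: hand a state fd to systemd for safe keeping */
    sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=lo-map", &memfd, 1);

    /* after a restart: stored fds come back via the environment,
     * numbered upwards from SD_LISTEN_FDS_START (3) */
    int n = sd_listen_fds(0);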

If we expect backend processes to be managed by libvirt or even a systemd
service, is that a better alternative? sd_notify() also offers a number of
interesting features for monitoring services.


>
> USAGE and NOTES:
>
> - The commits are rebased to a recent QEMU master commit b4d939133dca0fa2b.
>
> - ",reconnect=1" should be added to the "-chardev socket" of
> vhost-user-fs-pci
> in the QEMU command line, for example:
>
>     qemu-system-x86_64 ... \
>     -chardev socket,id=char0,path=/tmp/vhostqemu,reconnect=1 \
>     -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
>     ...
>
> - We add new options for virtiofsd to enable or disable crash reconnection.
> And some options are not supported by crash reconnection. So add following
> options to virtiofsd to enable reconnection:
>
>     virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_lock
>     -o no_xattr ...
>
> - The reasons why virtiofsd-side locking, extended attributes, and mount
> namespace are not supported is explained in the commit message of the 6th
> patch (virtiofsd: Add two new options for crash reconnection).
>
> - The 9th patch is a work-around that will not affect the overall
> correctness.
> We remove the qsort related codes because we found that when resubmit_num
> is
> larger than 64, seccomp will kill the virtiofsd process.
>
> - Support for dax version virtiofsd is very possible and requires almost no
> additional change to this patchset.
>
>
> Jiachen Zhang (9):
>   vhost-user-fs: Add support for reconnection of vhost-user-fs backend
>   vhost: Add vhost-user message types for sending shared memory and file
>     fds
>   vhost-user-fs: Support virtiofsd crash reconnection
>   libvhost-user: Add vhost-user message types for sending shared memory
>     and file fds
>   virtiofsd: Convert the struct lo_map array to a more flatten layout
>   virtiofsd: Add two new options for crash reconnection
>   virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
>   virtiofsd: Ensure crash consistency after reconnection
>   virtiofsd: (work around) Comment qsort in inflight I/O tracking
>
>  contrib/libvhost-user/libvhost-user.c | 106 +++-
>  contrib/libvhost-user/libvhost-user.h |  70 +++
>  docs/interop/vhost-user.rst           |  41 ++
>  hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
>  hw/virtio/vhost-user.c                | 123 ++++
>  hw/virtio/vhost.c                     |  42 ++
>  include/hw/virtio/vhost-backend.h     |   6 +
>  include/hw/virtio/vhost-user-fs.h     |  16 +-
>  include/hw/virtio/vhost.h             |  42 ++
>  tools/virtiofsd/fuse_lowlevel.c       |  24 +-
>  tools/virtiofsd/fuse_virtio.c         |  44 ++
>  tools/virtiofsd/fuse_virtio.h         |   1 +
>  tools/virtiofsd/helper.c              |   9 +
>  tools/virtiofsd/passthrough_helpers.h |   2 +-
>  tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>  16 files changed, 1413 insertions(+), 278 deletions(-)
>
> --
> 2.20.1
>
>
>

-- 
Marc-André Lureau

[-- Attachment #2: Type: text/html, Size: 7528 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2020-12-16 15:36 ` Marc-André Lureau
@ 2020-12-18  9:39   ` Jiachen Zhang
  2021-03-17 10:05     ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2020-12-18  9:39 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Daniel P. Berrange, Michael S . Tsirkin, QEMU,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 6943 bytes --]

On Wed, Dec 16, 2020 at 11:36 PM Marc-André Lureau <
marcandre.lureau@gmail.com> wrote:

> Hi
>
> On Tue, Dec 15, 2020 at 8:22 PM Jiachen Zhang <
> zhangjiachen.jaycee@bytedance.com> wrote:
>
>> Hi, all
>>
>> We implement virtio-fs crash reconnection in this patchset. The crash
>> reconnection of virtiofsd here is completely transparent to guest, no
>> remount in guest is needed, even the inflight requests can be handled
>> normally after reconnection. We are looking forward to any comments.
>>
>> Thanks,
>> Jiachen
>>
>>
>> OVERVIEW:
>>
>> To support virtio-fs crash reconnection, we need to support the recovery
>> of 1) inflight FUSE request, and 2) virtiofsd internal status information.
>>
>> Fortunately, QEMU's vhost-user reconnection framework already supports
>> inflight I/O tracking by using VHOST_USER_GET_INFLIGHT_FD and
>> VHOST_USER_SET_INFLIGHT_FD (see 5ad204bf2 and 5f9ff1eff for details).
>> As the FUSE requests are transferred by virtqueue I/O requests, by using
>> the vhost-user inflight I/O tracking, we can recover the inflight FUSE
>> requests.
>>
>> To support virtiofsd internal status recovery, we introduce 4 new
>> vhost-user message types. As shown in the following diagram, two of them
>> are used to persist shared lo_maps and opened fds to QEMU, the other two
>> message types are used to restore the status when reconnecting.
>>
>>                                VHOST_USER_SLAVE_SHM
>>                                VHOST_USER_SLAVE_FD
>>     +--------------+       Persist       +--------------------+
>>     |              <---------------------+                    |
>>     |     QEMU     |                     |  Virtio-fs Daemon  |
>>     |              +--------------------->                    |
>>     +--------------+       Restore       +--------------------+
>>             VHOST_USER_SET_SHM
>>             VHOST_USER_SET_FD
>>
>> Although the 4 newly added message types are to support virtiofsd
>> reconnection in this patchset, it might be potential in other situation.
>> So we keep in mind to make them more general when add them to vhost
>> related source files. VHOST_USER_SLAVE_SHM and VHOST_USER_SET_SHM can be
>> used for memory sharing between a vhost-user daemon and QEMU,
>> VHOST_USER_SLAVE_FD and VHOST_USER_SET_FD would be useful if we want to
>> shared opened fds between QEMU process and vhost-user daemon process.
>>
>
> Before adding new messages to the already complex vhost-user protocol, can
> we evaluate other options?
>
> First thing that came to my mind is that the memory state could be saved
> to disk or with a POSIX shared memory object.
>
>
> Eventually, the protocol could just pass around the fds, and not give
> shared memory special treatment.
>
> Then I remember systemd has a pretty good API & protocol for this sort of
> thing: sd_notify(3) (afaik, it is quite easy to implement a minimal handler)
>
> You can store fds with FDSTORE=1 (with an optional associated FDNAME).
> sd_listen_fds() & friends get them back (note: passed by inheritance only,
> I think). systemd seems not to make shm a special case either, just
> treating it like any other opened fd to restore.
>
> If we expect backend processes to be managed by libvirt or even a systemd
> service, is that a better alternative? sd_notify() also offers a number of
> interesting features for monitoring services.
>
>

Thanks for the suggestions. We chose to save all state information to QEMU
because a virtiofsd instance has the same lifecycle as its QEMU master.
However, saving things to a file does avoid the communication with QEMU,
and we would no longer need to increase the complexity of the vhost-user
protocol. The suggestion to store the fds in systemd is also very
reasonable if we set the lifecycle issues aside; we will try it.

All the best,
Jiachen



>>
>> USAGE and NOTES:
>>
>> - The commits are rebased to a recent QEMU master commit
>> b4d939133dca0fa2b.
>>
>> - ",reconnect=1" should be added to the "-chardev socket" of
>> vhost-user-fs-pci
>> in the QEMU command line, for example:
>>
>>     qemu-system-x86_64 ... \
>>     -chardev socket,id=char0,path=/tmp/vhostqemu,reconnect=1 \
>>     -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs \
>>     ...
>>
>> - We add new options for virtiofsd to enable or disable crash
>> reconnection.
>> And some options are not supported by crash reconnection. So add following
>> options to virtiofsd to enable reconnection:
>>
>>     virtiofsd ... -o reconnect -o no_mount_ns -o no_flock -o no_posix_lock
>>     -o no_xattr ...
>>
>> - The reasons why virtiofsd-side locking, extended attributes, and mount
>> namespace are not supported is explained in the commit message of the 6th
>> patch (virtiofsd: Add two new options for crash reconnection).
>>
>> - The 9th patch is a work-around that will not affect the overall
>> correctness.
>> We remove the qsort related codes because we found that when resubmit_num
>> is
>> larger than 64, seccomp will kill the virtiofsd process.
>>
>> - Support for dax version virtiofsd is very possible and requires almost
>> no
>> additional change to this patchset.
>>
>>
>> Jiachen Zhang (9):
>>   vhost-user-fs: Add support for reconnection of vhost-user-fs backend
>>   vhost: Add vhost-user message types for sending shared memory and file
>>     fds
>>   vhost-user-fs: Support virtiofsd crash reconnection
>>   libvhost-user: Add vhost-user message types for sending shared memory
>>     and file fds
>>   virtiofsd: Convert the struct lo_map array to a more flatten layout
>>   virtiofsd: Add two new options for crash reconnection
>>   virtiofsd: Persist/restore lo_map and opened fds to/from QEMU
>>   virtiofsd: Ensure crash consistency after reconnection
>>   virtiofsd: (work around) Comment qsort in inflight I/O tracking
>>
>>  contrib/libvhost-user/libvhost-user.c | 106 +++-
>>  contrib/libvhost-user/libvhost-user.h |  70 +++
>>  docs/interop/vhost-user.rst           |  41 ++
>>  hw/virtio/vhost-user-fs.c             | 334 ++++++++++-
>>  hw/virtio/vhost-user.c                | 123 ++++
>>  hw/virtio/vhost.c                     |  42 ++
>>  include/hw/virtio/vhost-backend.h     |   6 +
>>  include/hw/virtio/vhost-user-fs.h     |  16 +-
>>  include/hw/virtio/vhost.h             |  42 ++
>>  tools/virtiofsd/fuse_lowlevel.c       |  24 +-
>>  tools/virtiofsd/fuse_virtio.c         |  44 ++
>>  tools/virtiofsd/fuse_virtio.h         |   1 +
>>  tools/virtiofsd/helper.c              |   9 +
>>  tools/virtiofsd/passthrough_helpers.h |   2 +-
>>  tools/virtiofsd/passthrough_ll.c      | 830 ++++++++++++++++++--------
>>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>>  16 files changed, 1413 insertions(+), 278 deletions(-)
>>
>> --
>> 2.20.1
>>
>>
>>
>
> --
> Marc-André Lureau
>

[-- Attachment #2: Type: text/html, Size: 8799 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection
  2020-12-15 16:21 ` [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection Jiachen Zhang
@ 2021-02-04 12:08   ` Dr. David Alan Gilbert
  2021-02-04 14:16     ` [External] " Jiachen Zhang
  0 siblings, 1 reply; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2021-02-04 12:08 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: virtio-fs, Xie Yongji, qemu-devel, Stefan Hajnoczi, Michael S . Tsirkin

* Jiachen Zhang (zhangjiachen.jaycee@bytedance.com) wrote:
> We add two options for virtiofsd crash reconnection: reconnect |
> no_reconnect (default).
> 
> Users of virtiofsd should set "-o reconnect" for crash reconnection. Note
> that, when "-o reconnect" is set, some options are not supported and we
> need to disable them:
> 
>   - mount namespace: When the mount namespace is enabled, a reconnected
>     virtiofsd will fail to link/rename with EXDEV. The reason is, when
>     virtiofsd is sandboxed by a mount namespace, attempts to link/rename
>     via the fds opened before crashing (recovered from QEMU) are treated
>     as cross-mount operations, which the host kernel does not allow.
> 
>     So we introduce another option "-o mount_ns|no_mount_ns" (mount_ns
>     by default; it has no effect when sandbox=chroot is specified).
>     The option "-o no_mount_ns" should be used with "-o reconnect".

Why not just use -o sandbox=chroot?

>   - remote locking: As it is hard to determine whether a file is locked or
>     not when handling inflight locking requests, we should specify "-o
>     no_posix_lock" (default) and "-o no_flock" (default).
> 
>   - extended attributes: When handling inflight removexattr requests after
>     reconnecting, it is hard to determine whether an attr is already removed
>     or not. Therefore, we should also use "-o noxattr" (default) with "-o
>     reconnect".

Can you explain a bit more about why removexattr is hard to restart?

Dave
> 
> Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  tools/virtiofsd/helper.c         |   9 +++
>  tools/virtiofsd/passthrough_ll.c | 112 ++++++++++++++++++++++---------
>  2 files changed, 88 insertions(+), 33 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 75ac48dec2..e0d6525b1a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -174,6 +174,10 @@ void fuse_cmdline_help(void)
>             "                               - chroot: chroot(2) into shared\n"
>             "                                 directory (use in containers)\n"
>             "                               default: namespace\n"
> +           "    -o mount_ns|no_mount_ns    enable/disable mount namespace\n"
> +           "                               default: mount_ns\n"
> +           "                               note: if chroot sandbox mode is used,\n"
> +           "                               mount_ns will not take effect.\n"
>             "    -o timeout=<number>        I/O timeout (seconds)\n"
>             "                               default: depends on cache= option.\n"
>             "    -o writeback|no_writeback  enable/disable writeback cache\n"
> @@ -191,6 +195,11 @@ void fuse_cmdline_help(void)
>             "                               to virtiofsd from guest applications.\n"
>             "                               default: no_allow_direct_io\n"
>             "    -o announce_submounts      Announce sub-mount points to the guest\n"
> +           "    -o reconnect|no_reconnect  enable/disable crash reconnection\n"
> +           "                               to enable crash reconnection, the options\n"
> +           "                               no_mount_ns, no_flock, no_posix_lock, and\n"
> +           "                               no_xattr should also be set.\n"
> +           "                               default: no_reconnect\n"
>             );
>  }
>  
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 815b001076..73a50bd7a3 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -170,6 +170,8 @@ struct lo_data {
>      pthread_mutex_t mutex;
>      int sandbox;
>      int debug;
> +    int mount_ns;
> +    int reconnect;
>      int writeback;
>      int flock;
>      int posix_lock;
> @@ -204,8 +206,12 @@ static const struct fuse_opt lo_opts[] = {
>      { "sandbox=chroot",
>        offsetof(struct lo_data, sandbox),
>        SANDBOX_CHROOT },
> +    { "reconnect", offsetof(struct lo_data, reconnect), 1 },
> +    { "no_reconnect", offsetof(struct lo_data, reconnect), 0 },
>      { "writeback", offsetof(struct lo_data, writeback), 1 },
>      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
> +    { "mount_ns", offsetof(struct lo_data, mount_ns), 1 },
> +    { "no_mount_ns", offsetof(struct lo_data, mount_ns), 0 },
>      { "source=%s", offsetof(struct lo_data, source), 0 },
>      { "flock", offsetof(struct lo_data, flock), 1 },
>      { "no_flock", offsetof(struct lo_data, flock), 0 },
> @@ -3047,8 +3053,14 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>       * an empty network namespace to prevent TCP/IP and other network
>       * activity in case this process is compromised.
>       */
> -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
> +    int unshare_flag;
> +    if (lo->mount_ns) {
> +        unshare_flag = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> +    } else {
> +        unshare_flag = CLONE_NEWPID | CLONE_NEWNET;
> +    }
> +    if (unshare(unshare_flag) != 0) {
> +        fuse_log(FUSE_LOG_ERR, "unshare(): %m\n");
>          exit(1);
>      }
>  
> @@ -3083,38 +3095,47 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
>      /* Send us SIGTERM when the parent thread terminates, see prctl(2) */
>      prctl(PR_SET_PDEATHSIG, SIGTERM);
>  
> -    /*
> -     * If the mounts have shared propagation then we want to opt out so our
> -     * mount changes don't affect the parent mount namespace.
> -     */
> -    if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
> -        fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> -        exit(1);
> -    }
> +    if (lo->mount_ns) {
> +        /*
> +         * If the mounts have shared propagation then we want to opt out so our
> +         * mount changes don't affect the parent mount namespace.
> +         */
> +        if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
> +            fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> +            exit(1);
> +        }
>  
> -    /* The child must remount /proc to use the new pid namespace */
> -    if (mount("proc", "/proc", "proc",
> -              MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> -        fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> -        exit(1);
> -    }
> +        /* The child must remount /proc to use the new pid namespace */
> +        if (mount("proc", "/proc", "proc",
> +                  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> +            fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> +            exit(1);
> +        }
>  
> -    /*
> -     * We only need /proc/self/fd. Prevent ".." from accessing parent
> -     * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
> -     * previously remounted with MS_REC | MS_SLAVE this mount change only
> -     * affects our process.
> -     */
> -    if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
> -        fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
> -        exit(1);
> -    }
> +        /*
> +         * We only need /proc/self/fd. Prevent ".." from accessing parent
> +         * directories of /proc/self/fd by bind-mounting it over /proc. Since
> +         * / was previously remounted with MS_REC | MS_SLAVE this mount change
> +         * only affects our process.
> +         */
> +        if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
> +            fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
> +            exit(1);
> +        }
>  
> -    /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
> -    lo->proc_self_fd = open("/proc", O_PATH);
> -    if (lo->proc_self_fd == -1) {
> -        fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
> -        exit(1);
> +        /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
> +        lo->proc_self_fd = open("/proc", O_PATH);
> +        if (lo->proc_self_fd == -1) {
> +            fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
> +            exit(1);
> +        }
> +    } else {
> +        /* Now we can get our /proc/self/fd directory file descriptor */
> +        lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> +        if (lo->proc_self_fd == -1) {
> +            fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> +            exit(1);
> +        }
>      }
>  }
>  
> @@ -3347,7 +3368,9 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>  {
>      if (lo->sandbox == SANDBOX_NAMESPACE) {
>          setup_namespaces(lo, se);
> -        setup_mounts(lo->source);
> +        if (lo->mount_ns) {
> +            setup_mounts(lo->source);
> +        }
>      } else {
>          setup_chroot(lo);
>      }
> @@ -3438,7 +3461,11 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
>      struct stat stat;
>      uint64_t mnt_id;
>  
> -    fd = open("/", O_PATH);
> +    if (lo->mount_ns) {
> +        fd = open("/", O_PATH);
> +    } else {
> +        fd = open(lo->source, O_PATH);
> +    }
>      if (fd == -1) {
>          fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", lo->source);
>          exit(1);
> @@ -3515,6 +3542,9 @@ int main(int argc, char *argv[])
>      lo.allow_direct_io = 0,
>      lo.proc_self_fd = -1;
>  
> +    lo.reconnect = 0;
> +    lo.mount_ns = 1;
> +
>      /* Don't mask creation mode, kernel already did that */
>      umask(0);
>  
> @@ -3577,6 +3607,22 @@ int main(int argc, char *argv[])
>          goto err_out1;
>      }
>  
> +    /* For sandbox mode "chroot", set the mount_ns option to 0. */
> +    if (lo.sandbox == SANDBOX_CHROOT) {
> +        lo.mount_ns = 0;
> +    }
> +
> +    if (lo.reconnect) {
> +        if ((lo.sandbox == SANDBOX_NAMESPACE && lo.mount_ns) || lo.flock
> +                                               || lo.posix_lock || lo.xattr) {
> +            printf("Mount namespace, remote lock, and extended attributes"
> +                   " are not supported by reconnection (-o reconnect). Please "
> +                   "specify  -o no_mount_ns, -o no_flock (default), -o"
> +                   "no_posix_lock (default), and -o no_xattr (default).\n");
> +            exit(1);
> +        }
> +    }
> +
>      if (opts.log_level != 0) {
>          current_log_level = opts.log_level;
>      } else {
> -- 
> 2.20.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking
  2020-12-15 16:21 ` [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking Jiachen Zhang
@ 2021-02-04 12:15   ` Dr. David Alan Gilbert
  2021-02-04 14:20     ` [External] " Jiachen Zhang
  0 siblings, 1 reply; 30+ messages in thread
From: Dr. David Alan Gilbert @ 2021-02-04 12:15 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: virtio-fs, Xie Yongji, qemu-devel, Stefan Hajnoczi, Michael S . Tsirkin

* Jiachen Zhang (zhangjiachen.jaycee@bytedance.com) wrote:
> This is a workaround. qsort() mallocs memory instead of using stack
> memory when resubmit_num is larger than 64 (total size larger than
> 1024 bytes). The allocation makes seccomp kill virtiofsd, so we comment
> out the qsort call. This workaround does not affect the correctness of
> inflight I/O tracking.
> 
> Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>

That's an odd hack!   Just follow the audit log to see what seccomp was
upset by and add the right syscall.

Dave

> ---
>  contrib/libvhost-user/libvhost-user.c | 18 ------------------
>  1 file changed, 18 deletions(-)
> 
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index 8c97013e59..c226d5d915 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -1167,20 +1167,6 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
>      return true;
>  }
>  
> -static int
> -inflight_desc_compare(const void *a, const void *b)
> -{
> -    VuVirtqInflightDesc *desc0 = (VuVirtqInflightDesc *)a,
> -                        *desc1 = (VuVirtqInflightDesc *)b;
> -
> -    if (desc1->counter > desc0->counter &&
> -        (desc1->counter - desc0->counter) < VIRTQUEUE_MAX_SIZE * 2) {
> -        return 1;
> -    }
> -
> -    return -1;
> -}
> -
>  static int
>  vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
>  {
> @@ -1236,10 +1222,6 @@ vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
>              }
>          }
>  
> -        if (vq->resubmit_num > 1) {
> -            qsort(vq->resubmit_list, vq->resubmit_num,
> -                  sizeof(VuVirtqInflightDesc), inflight_desc_compare);
> -        }
>          vq->counter = vq->resubmit_list[0].counter + 1;
>      }
>  
> -- 
> 2.20.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [External] Re: [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection
  2021-02-04 12:08   ` Dr. David Alan Gilbert
@ 2021-02-04 14:16     ` Jiachen Zhang
  0 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2021-02-04 14:16 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, Xie Yongji, QEMU, Stefan Hajnoczi, Michael S . Tsirkin

[-- Attachment #1: Type: text/plain, Size: 12224 bytes --]

On Thu, Feb 4, 2021 at 8:09 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Jiachen Zhang (zhangjiachen.jaycee@bytedance.com) wrote:
> > We add two options for virtiofsd crash reconnection: reconnect |
> > no_reconnect (default).
> >
> > User of virtiofsd should set "-o reconnect" for crash reconnection. Note
> > that, when "-o reconnect" is set, some options will not be supported and
> we
> > need to disable them:
> >
> >   - mount namespace: When mount namespace is enabled, reconnected
> >     virtiofsd will failed to link/rename for EXDEV. The reason is, when
> >     virtiofsd is sandboxed by mount namespace, attempts to link/rename
> >     the fds opened before crashing (also recovered from QEMU) will be
> >     considered as across mount operations between mounts, which is not
> >     allowed by host kernel.
> >
> >     So we introduce another option "-o mount_ns|no_mount_ns" (mount_ns
> >     by default, and takes no effect when sandbox=chroot is specified).
> >     The option "-o no_mount_ns" should be used with "-o reconnect".
>
> Why not just use -o sandbox=chroot?
>
>
Yes, thanks. I think this is a good idea, and maybe better than adding the
new options (mount_ns | no_mount_ns). Actually, "-o sandbox=" had not been
merged upstream when we added the new options.


> >   - remote locking: As it is hard to determine whether a file is locked or
> >     not when handling inflight locking requests, we should specify "-o
> >     no_posix_lock" (default) and "-o no_flock" (default).
> >
> >   - extended attributes: When handling inflight removexattr requests
> after
> >     reconnecting, it is hard to determine whether an attr is already
> removed
> >     or not. Therefore, we should also use "-o noxattr" (default) with "-o
> >     reconnect".
>
> Can you explain a bit more about why removexattr is hard to restart?
>
>
Consider the following removexattr handling procedure:

    lo_removexattr() {
        (1) // ......
        (2) fremovexattr / removexattr
        (3) // ......
        (4) fuse_reply_err(req, saverr);
    }

If virtiofsd successfully executes (2) but crashes at (3), then (4) is never
executed, so after the new virtiofsd reconnects this FUSE request is
re-executed from (1) and (2) runs a second time. As the xattr was already
removed by the first execution of (2), (2) now returns ENODATA, which is
mistakenly replied to FUSE by (4).
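
Hypothetically one could relax this like the EEXIST cases, e.g.:

    if (res == -1 && !(errno == ENODATA && lo->reconnect)) {
        goto out;
    }

but then a removexattr on a genuinely missing attribute would also be
reported as success, so we disable xattr support under "-o reconnect"
instead.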

Jiachen


> Dave
> >
> > Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> > ---
> >  tools/virtiofsd/helper.c         |   9 +++
> >  tools/virtiofsd/passthrough_ll.c | 112 ++++++++++++++++++++++---------
> >  2 files changed, 88 insertions(+), 33 deletions(-)
> >
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index 75ac48dec2..e0d6525b1a 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -174,6 +174,10 @@ void fuse_cmdline_help(void)
> >             "                               - chroot: chroot(2) into
> shared\n"
> >             "                                 directory (use in
> containers)\n"
> >             "                               default: namespace\n"
> > +           "    -o mount_ns|no_mount_ns    enable/disable mount
> namespace\n"
> > +           "                               default: mount_ns\n"
> > +           "                               note: if chroot sandbox mode
> is used,\n"
> > +           "                               mount_ns will not take
> effect.\n"
> >             "    -o timeout=<number>        I/O timeout (seconds)\n"
> >             "                               default: depends on cache=
> option.\n"
> >             "    -o writeback|no_writeback  enable/disable writeback
> cache\n"
> > @@ -191,6 +195,11 @@ void fuse_cmdline_help(void)
> >             "                               to virtiofsd from guest
> applications.\n"
> >             "                               default:
> no_allow_direct_io\n"
> >             "    -o announce_submounts      Announce sub-mount points to
> the guest\n"
> > +           "    -o reconnect|no_reconnect  enable/disable crash
> reconnection\n"
> > +           "                               to enable crash
> reconnection, the options\n"
> > +           "                               no_mount_ns, no_flock,
> no_posix_lock, and\n"
> > +           "                               no_xattr should also be
> set.\n"
> > +           "                               default: no_reconnect\n"
> >             );
> >  }
> >
> > diff --git a/tools/virtiofsd/passthrough_ll.c
> b/tools/virtiofsd/passthrough_ll.c
> > index 815b001076..73a50bd7a3 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -170,6 +170,8 @@ struct lo_data {
> >      pthread_mutex_t mutex;
> >      int sandbox;
> >      int debug;
> > +    int mount_ns;
> > +    int reconnect;
> >      int writeback;
> >      int flock;
> >      int posix_lock;
> > @@ -204,8 +206,12 @@ static const struct fuse_opt lo_opts[] = {
> >      { "sandbox=chroot",
> >        offsetof(struct lo_data, sandbox),
> >        SANDBOX_CHROOT },
> > +    { "reconnect", offsetof(struct lo_data, reconnect), 1 },
> > +    { "no_reconnect", offsetof(struct lo_data, reconnect), 0 },
> >      { "writeback", offsetof(struct lo_data, writeback), 1 },
> >      { "no_writeback", offsetof(struct lo_data, writeback), 0 },
> > +    { "mount_ns", offsetof(struct lo_data, mount_ns), 1 },
> > +    { "no_mount_ns", offsetof(struct lo_data, mount_ns), 0 },
> >      { "source=%s", offsetof(struct lo_data, source), 0 },
> >      { "flock", offsetof(struct lo_data, flock), 1 },
> >      { "no_flock", offsetof(struct lo_data, flock), 0 },
> > @@ -3047,8 +3053,14 @@ static void setup_namespaces(struct lo_data *lo,
> struct fuse_session *se)
> >       * an empty network namespace to prevent TCP/IP and other network
> >       * activity in case this process is compromised.
> >       */
> > -    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
> > -        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS):
> %m\n");
> > +    int unshare_flag;
> > +    if (lo->mount_ns) {
> > +        unshare_flag = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
> > +    } else {
> > +        unshare_flag = CLONE_NEWPID | CLONE_NEWNET;
> > +    }
> > +    if (unshare(unshare_flag) != 0) {
> > +        fuse_log(FUSE_LOG_ERR, "unshare(): %m\n");
> >          exit(1);
> >      }
> >
> > @@ -3083,38 +3095,47 @@ static void setup_namespaces(struct lo_data *lo,
> struct fuse_session *se)
> >      /* Send us SIGTERM when the parent thread terminates, see prctl(2)
> */
> >      prctl(PR_SET_PDEATHSIG, SIGTERM);
> >
> > -    /*
> > -     * If the mounts have shared propagation then we want to opt out so
> our
> > -     * mount changes don't affect the parent mount namespace.
> > -     */
> > -    if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
> > -        fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> > -        exit(1);
> > -    }
> > +    if (lo->mount_ns) {
> > +        /*
> > +         * If the mounts have shared propagation then we want to opt
> out so our
> > +         * mount changes don't affect the parent mount namespace.
> > +         */
> > +        if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
> > +            fuse_log(FUSE_LOG_ERR, "mount(/, MS_REC|MS_SLAVE): %m\n");
> > +            exit(1);
> > +        }
> >
> > -    /* The child must remount /proc to use the new pid namespace */
> > -    if (mount("proc", "/proc", "proc",
> > -              MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> > -        fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> > -        exit(1);
> > -    }
> > +        /* The child must remount /proc to use the new pid namespace */
> > +        if (mount("proc", "/proc", "proc",
> > +                  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
> > +            fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
> > +            exit(1);
> > +        }
> >
> > -    /*
> > -     * We only need /proc/self/fd. Prevent ".." from accessing parent
> > -     * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
> > -     * previously remounted with MS_REC | MS_SLAVE this mount change only
> > -     * affects our process.
> > -     */
> > -    if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
> > -        fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
> > -        exit(1);
> > -    }
> > +        /*
> > +         * We only need /proc/self/fd. Prevent ".." from accessing parent
> > +         * directories of /proc/self/fd by bind-mounting it over /proc. Since
> > +         * / was previously remounted with MS_REC | MS_SLAVE this mount change
> > +         * only affects our process.
> > +         */
> > +        if (mount("/proc/self/fd", "/proc", NULL, MS_BIND, NULL) < 0) {
> > +            fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, MS_BIND): %m\n");
> > +            exit(1);
> > +        }
> >
> > -    /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
> > -    lo->proc_self_fd = open("/proc", O_PATH);
> > -    if (lo->proc_self_fd == -1) {
> > -        fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
> > -        exit(1);
> > +        /* Get the /proc (actually /proc/self/fd, see above) file descriptor */
> > +        lo->proc_self_fd = open("/proc", O_PATH);
> > +        if (lo->proc_self_fd == -1) {
> > +            fuse_log(FUSE_LOG_ERR, "open(/proc, O_PATH): %m\n");
> > +            exit(1);
> > +        }
> > +    } else {
> > +        /* Now we can get our /proc/self/fd directory file descriptor */
> > +        lo->proc_self_fd = open("/proc/self/fd", O_PATH);
> > +        if (lo->proc_self_fd == -1) {
> > +            fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
> > +            exit(1);
> > +        }
> >      }
> >  }
> >
> > @@ -3347,7 +3368,9 @@ static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
> >  {
> >      if (lo->sandbox == SANDBOX_NAMESPACE) {
> >          setup_namespaces(lo, se);
> > -        setup_mounts(lo->source);
> > +        if (lo->mount_ns) {
> > +            setup_mounts(lo->source);
> > +        }
> >      } else {
> >          setup_chroot(lo);
> >      }
> > @@ -3438,7 +3461,11 @@ static void setup_root(struct lo_data *lo, struct lo_inode *root)
> >      struct stat stat;
> >      uint64_t mnt_id;
> >
> > -    fd = open("/", O_PATH);
> > +    if (lo->mount_ns) {
> > +        fd = open("/", O_PATH);
> > +    } else {
> > +        fd = open(lo->source, O_PATH);
> > +    }
> >      if (fd == -1) {
> >          fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", lo->source);
> >          exit(1);
> > @@ -3515,6 +3542,9 @@ int main(int argc, char *argv[])
> >      lo.allow_direct_io = 0,
> >      lo.proc_self_fd = -1;
> >
> > +    lo.reconnect = 0;
> > +    lo.mount_ns = 1;
> > +
> >      /* Don't mask creation mode, kernel already did that */
> >      umask(0);
> >
> > @@ -3577,6 +3607,22 @@ int main(int argc, char *argv[])
> >          goto err_out1;
> >      }
> >
> > +    /* For sandbox mode "chroot", set the mount_ns option to 0. */
> > +    if (lo.sandbox == SANDBOX_CHROOT) {
> > +        lo.mount_ns = 0;
> > +    }
> > +
> > +    if (lo.reconnect) {
> > +        if ((lo.sandbox == SANDBOX_NAMESPACE && lo.mount_ns) || lo.flock
> > +                                               || lo.posix_lock || lo.xattr) {
> > +            printf("Mount namespace, remote lock, and extended attributes"
> > +                   " are not supported by reconnection (-o reconnect). Please "
> > +                   "specify -o no_mount_ns, -o no_flock (default), -o "
> > +                   "no_posix_lock (default), and -o no_xattr (default).\n");
> > +            exit(1);
> > +        }
> > +    }
> > +
> > +
> >      if (opts.log_level != 0) {
> >          current_log_level = opts.log_level;
> >      } else {
> > --
> > 2.20.1
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


* Re: [External] Re: [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking
  2021-02-04 12:15   ` Dr. David Alan Gilbert
@ 2021-02-04 14:20     ` Jiachen Zhang
  0 siblings, 0 replies; 30+ messages in thread
From: Jiachen Zhang @ 2021-02-04 14:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: virtio-fs, Xie Yongji, QEMU, Stefan Hajnoczi, Michael S . Tsirkin

On Thu, Feb 4, 2021 at 8:16 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Jiachen Zhang (zhangjiachen.jaycee@bytedance.com) wrote:
> > This is a workaround. The qsort function will malloc memory instead of using
> > stack memory when resubmit_num is larger than 64 (total size larger than
> > 1024 bytes). This causes seccomp to kill virtiofsd, so we comment out qsort.
> > This workaround will not affect the correctness of inflight I/O tracking.
> >
> > Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
> > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>
> That's an odd hack!   Just follow the audit log to see what seccomp was
> upset by and add the right syscall.
>
> Dave
>
>
We recently found the cause is sysinfo(2). We will revert this workaround and
add sysinfo to the seccomp whitelist in the v2 patchset. Thanks!
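
For reference, a minimal sketch of what that allowlist change might look like
(assuming the syscall_whitelist[] array in
tools/virtiofsd/passthrough_seccomp.c; the surrounding entries are
illustrative):

    --- a/tools/virtiofsd/passthrough_seccomp.c
    +++ b/tools/virtiofsd/passthrough_seccomp.c
    @@ static const int syscall_whitelist[] = {
         SCMP_SYS(symlinkat),
    +    SCMP_SYS(sysinfo), /* glibc qsort() checks free memory via sysinfo(2) */
         SCMP_SYS(tgkill),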

Jiachen



> > ---
> >  contrib/libvhost-user/libvhost-user.c | 18 ------------------
> >  1 file changed, 18 deletions(-)
> >
> > diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> > index 8c97013e59..c226d5d915 100644
> > --- a/contrib/libvhost-user/libvhost-user.c
> > +++ b/contrib/libvhost-user/libvhost-user.c
> > @@ -1167,20 +1167,6 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
> >      return true;
> >  }
> >
> > -static int
> > -inflight_desc_compare(const void *a, const void *b)
> > -{
> > -    VuVirtqInflightDesc *desc0 = (VuVirtqInflightDesc *)a,
> > -                        *desc1 = (VuVirtqInflightDesc *)b;
> > -
> > -    if (desc1->counter > desc0->counter &&
> > -        (desc1->counter - desc0->counter) < VIRTQUEUE_MAX_SIZE * 2) {
> > -        return 1;
> > -    }
> > -
> > -    return -1;
> > -}
> > -
> >  static int
> >  vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
> >  {
> > @@ -1236,10 +1222,6 @@ vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
> >              }
> >          }
> >
> > -        if (vq->resubmit_num > 1) {
> > -            qsort(vq->resubmit_list, vq->resubmit_num,
> > -                  sizeof(VuVirtqInflightDesc), inflight_desc_compare);
> > -        }
> >          vq->counter = vq->resubmit_list[0].counter + 1;
> >      }
> >
> > --
> > 2.20.1
> >
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2020-12-18  9:39   ` [External] " Jiachen Zhang
@ 2021-03-17 10:05     ` Stefan Hajnoczi
  2021-03-17 11:49       ` Christian Schoenebeck
  2021-03-17 12:32       ` Jiachen Zhang
  0 siblings, 2 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-17 10:05 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin, QEMU,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji,
	Marc-André Lureau

On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> Thanks for the suggestions. Actually, we choose to save all state
> information to QEMU because a virtiofsd has the same lifecycle as its
> QEMU master. However, saving things to a file do avoid communication with
> QEMU, and we no longer need to increase the complexity of vhost-user
> protocol. The suggestion to save fds to the systemd is also very reasonable
> if we don't consider the lifecycle issues, we will try it.

Hi,
We recently discussed crash recovery in the virtio-fs bi-weekly call and
I read some of this email thread because it's a topic I'm interested in.

I agree with Marc-André that storing file descriptors does not need to
be in the vhost-user protocol. The lifetime of a vhost-user device
backend is not controlled by the VMM since the device backend is
launched separately. Therefore it's reasonable for the component that
launched the device backend to also have the responsibility of cleaning
up the vhost-user device backend.

Using the sd_notify(3) interface is a neat idea. It's supported natively
by systemd but you can also implement a compatible interface in your own
software. This way the vhost-user device backend can be launched using
systemd or your own software.
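
For illustration, storing an fd across a crash and getting it back after a
restart might look roughly like this sketch (the libsystemd calls are real;
the fd name, the proc_self_fd variable, and the omitted error handling are
simplifications):

    #include <systemd/sd-daemon.h>

    /* Before a crash: ask the manager to hold the fd for us. */
    sd_pid_notify_with_fds(0, 0, "FDSTORE=1\nFDNAME=proc-self-fd",
                           &proc_self_fd, 1);

    /* After restart: stored fds are passed back in, starting at fd 3
     * (SD_LISTEN_FDS_START), with their names in matching order. */
    char **names = NULL;
    int n = sd_listen_fds_with_names(0, &names);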

That said, if people find it more convenient to store fds using the
vhost-user protocol, then I think that is enough justification to add a
generic message to the vhost-user protocol. The important thing is to
make the message generic so it solves all crash recovery use cases. The
inflight fd messages were too specific and now we're having to think
about adding more messages again.

Stefan


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-17 10:05     ` Stefan Hajnoczi
@ 2021-03-17 11:49       ` Christian Schoenebeck
  2021-03-17 12:57         ` Jiachen Zhang
  2021-03-17 12:32       ` Jiachen Zhang
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Schoenebeck @ 2021-03-17 11:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Jiachen Zhang, Daniel P. Berrange, slp,
	Michael S . Tsirkin, Dr . David Alan Gilbert, virtio-fs,
	Xie Yongji, Marc-André Lureau

On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > Thanks for the suggestions. Actually, we choose to save all state
> > information to QEMU because a virtiofsd has the same lifecycle as its
> > QEMU master. However, saving things to a file do avoid communication with
> > QEMU, and we no longer need to increase the complexity of vhost-user
> protocol. The suggestion to save fds to the systemd is also very
> reasonable if we don't consider the lifecycle issues, we will try it.
> 
> Hi,
> We recently discussed crash recovery in the virtio-fs bi-weekly call and
> I read some of this email thread because it's a topic I'm interested in.

I just had a quick fly over the patches so far. Shouldn't there be some kind 
of constraint for an automatic reconnection feature after a crash to prevent 
this being exploited by ROP brute force attacks?

E.g. adding some (maybe continuously increasing) delay and/or limiting the 
amount of reconnects within a certain time frame would come to my mind.
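
A sketch of what such a constraint could look like, wherever the reconnect
loop ends up living (try_reconnect() and the limits here are hypothetical):

    #include <unistd.h>

    static int reconnect_with_backoff(int (*try_reconnect)(void))
    {
        unsigned delay_ms = 100;

        for (int attempt = 0; attempt < 16; attempt++) {
            if (try_reconnect() == 0) {
                return 0;                   /* connection re-established */
            }
            usleep((useconds_t)delay_ms * 1000);
            if (delay_ms < 30000) {
                delay_ms *= 2;              /* continuously increasing delay */
            }
        }
        return -1;   /* give up and accept downtime for investigation */
    }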

Best regards,
Christian Schoenebeck





* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-17 10:05     ` Stefan Hajnoczi
  2021-03-17 11:49       ` Christian Schoenebeck
@ 2021-03-17 12:32       ` Jiachen Zhang
  2021-03-22 11:00         ` Stefan Hajnoczi
  1 sibling, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2021-03-17 12:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin, QEMU,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji,
	Marc-André Lureau

On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > Thanks for the suggestions. Actually, we choose to save all state
> > information to QEMU because a virtiofsd has the same lifecycle as its
> > QEMU master. However, saving things to a file do avoid communication with
> > QEMU, and we no longer need to increase the complexity of vhost-user
> > protocol. The suggestion to save fds to the systemd is also very
> > reasonable if we don't consider the lifecycle issues, we will try it.
>
> Hi,
> We recently discussed crash recovery in the virtio-fs bi-weekly call and
> I read some of this email thread because it's a topic I'm interested in.
>
> I agree with Marc-André that storing file descriptors does not need to
> be in the vhost-user protocol. The lifetime of a vhost-user device
> backend is not controlled by the VMM since the device backend is
> launched separately. Therefore it's reasonable for the component that
> launched the device backend to also have the responsibility of cleaning
> up the vhost-user device backend.
>
> Using the sd_notify(3) interface is a neat idea. It's supported natively
> by systemd but you can also implement a compatible interface in your own
> software. This way the vhost-user device backend can be launched using
> systemd or your own software.
>
> That said, if people find it more convenient to store fds using the
> vhost-user protocol, then I think that is enough justification to add a
> generic message to the vhost-user protocol. The important thing is to
> make the message generic so it solves all crash recovery use cases. The
> inflight fd messages were too specific and now we're having to think
> about adding more messages again.
>
> Stefan
>


Hi, Stefan,

I agree with you that a virtiofsd must be launched by software like
systemd. So we are planning to define more generic persist/restore
interfaces (callbacks); anyone can then implement their own persist/restore
callbacks to store state in a proper place. In the next version we will
also implement default callbacks for these interfaces: instead of
vhost-user messages, systemd's sd_notify(3) will be the default method for
storing fds, and several tmpfs files will be the default place to store the
shm regions.
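
A hypothetical shape for those callbacks (the names here are invented for
illustration, not the actual interface):

    #include <stddef.h>

    struct lo_state_ops {
        int   (*persist_fds)(const int *fds, size_t nfds);  /* default: sd_notify(3) */
        int   (*restore_fds)(int *fds, size_t nfds);
        void *(*map_state)(const char *name, size_t size);  /* default: tmpfs file */
    };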

Jiachen


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-17 11:49       ` Christian Schoenebeck
@ 2021-03-17 12:57         ` Jiachen Zhang
  2021-03-18 11:58           ` Christian Schoenebeck
  0 siblings, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2021-03-17 12:57 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin,
	Dr . David Alan Gilbert, QEMU, virtio-fs, Xie Yongji,
	Marc-André Lureau, Stefan Hajnoczi

On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > Thanks for the suggestions. Actually, we choose to save all state
> > > information to QEMU because a virtiofsd has the same lifecycle as its
> > > > QEMU master. However, saving things to a file do avoid communication with
> > > > QEMU, and we no longer need to increase the complexity of vhost-user
> > > > protocol. The suggestion to save fds to the systemd is also very
> > > > reasonable if we don't consider the lifecycle issues, we will try it.
> >
> > Hi,
> > We recently discussed crash recovery in the virtio-fs bi-weekly call and
> > I read some of this email thread because it's a topic I'm interested in.
>
> I just had a quick fly over the patches so far. Shouldn't there be some
> kind of constraint for an automatic reconnection feature after a crash to
> prevent this being exploited by ROP brute force attacks?
>
> E.g. adding some (maybe continuously increasing) delay and/or limiting the
> amount of reconnects within a certain time frame would come to my mind.
>
> Best regards,
> Christian Schoenebeck
>
>
>

Thanks, Christian. I am still trying to figure out the details of the ROP
attacks.

However, QEMU's vhost-user reconnection is based on chardev socket
reconnection. Socket reconnection can be enabled with "--chardev
socket,...,reconnect=N" in the QEMU command-line options, where N means QEMU
will try to reconnect the disconnected socket every N seconds. We can
increase N to increase the reconnect delay. If we want to change the
reconnect delay dynamically, I think we should change the chardev socket
reconnection code, as it is a more generic mechanism than vhost-user-fs or
the vhost-user backend.

By the way, I have also considered the socket reconnection delay from the
performance angle. As the reconnection delay increases, an application in
the guest that is doing I/O will suffer larger tail latency. And for now,
the smallest delay is 1 second, which is rather large for high-performance
virtual I/O devices today. I think a more performant and safer
reconnect-delay adjustment mechanism should be considered in the future.
What are your thoughts?


Jiachen


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-17 12:57         ` Jiachen Zhang
@ 2021-03-18 11:58           ` Christian Schoenebeck
  2021-03-22 10:54             ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Schoenebeck @ 2021-03-18 11:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: Jiachen Zhang, Daniel P. Berrange, slp, Michael S . Tsirkin,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji,
	Marc-André Lureau, Stefan Hajnoczi

On Wednesday, 17 March 2021 13:57:47 CET Jiachen Zhang wrote:
> On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > > Thanks for the suggestions. Actually, we choose to save all state
> > > > information to QEMU because a virtiofsd has the same lifecycle as its
> > > > QEMU master. However, saving things to a file do avoid communication with
> > > > QEMU, and we no longer need to increase the complexity of vhost-user
> > > > protocol. The suggestion to save fds to the systemd is also very
> > > > reasonable if we don't consider the lifecycle issues, we will try it.
> > > 
> > > Hi,
> > > We recently discussed crash recovery in the virtio-fs bi-weekly call and
> > > I read some of this email thread because it's a topic I'm interested in.
> > 
> > I just had a quick fly over the patches so far. Shouldn't there be some
> > kind of constraint for an automatic reconnection feature after a crash to
> > prevent this being exploited by ROP brute force attacks?
> > 
> > E.g. adding some (maybe continuously increasing) delay and/or limiting the
> > amount of reconnects within a certain time frame would come to my mind.
> > 
> > Best regards,
> > Christian Schoenebeck
> 
> Thanks, Christian. I am still trying to figure out the details of the ROP
> attacks.
> 
> However, QEMU's vhost-user reconnection is based on chardev socket
> reconnection. The socket reconnection can be enabled by the "--chardev
> socket,...,reconnect=N" in QEMU command options, in which N means QEMU will
> try to connect the disconnected socket every N seconds. We can increase N
> to increase the reconnect delay. If we want to change the reconnect delay
> dynamically, I think we should change the chardev socket reconnection code.
> It is a more generic mechanism than vhost-user-fs and vhost-user backend.
> 
> By the way, I also considered the socket reconnection delay time in the
> performance aspect. As the reconnection delay increase, if an application
> in the guest is doing I/Os, it will suffer larger tail latency. And for
> now, the smallest delay is 1 second, which is rather large for
> high-performance virtual I/O devices today. I think maybe a more performant
> and safer reconnect delay adjustment mechanism should be considered in the
> future. What are your thoughts?

So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute force in ~18
hours (e.g. on Arm if PAC + MTE was enabled). With a 24-bit PAC (no MTE) it
would be ~194 days. Independent of which architecture and defense mechanism
is used, there is always the possibility that some kind of side-channel
attack exists that requires a much lower number of attempts. So in an
untrusted environment I would personally limit the number of automatic
reconnects and rather accept downtime for further investigation if a
suspiciously high number of crashes happened.
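
(For the arithmetic: at one attempt per second, exhausting a 16-bit space
takes 2^16 s = 65536 s, about 18.2 hours, while a 24-bit space takes
2^24 s = 16777216 s, about 194 days.)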

And yes, if a dynamic delay scheme was deployed in future then starting with a 
value smaller than 1 second would make sense.

Best regards,
Christian Schoenebeck





* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-18 11:58           ` Christian Schoenebeck
@ 2021-03-22 10:54             ` Stefan Hajnoczi
  2021-03-23 12:54               ` Christian Schoenebeck
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22 10:54 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, virtio-fs, Xie Yongji,
	Jiachen Zhang, Marc-André Lureau

On Thu, Mar 18, 2021 at 12:58:46PM +0100, Christian Schoenebeck wrote:
> On Wednesday, 17 March 2021 13:57:47 CET Jiachen Zhang wrote:
> > On Wed, Mar 17, 2021 at 7:50 PM Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Wednesday, 17 March 2021 11:05:32 CET Stefan Hajnoczi wrote:
> > > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > > > Thanks for the suggestions. Actually, we choose to save all state
> > > > > information to QEMU because a virtiofsd has the same lifecycle as its
> > > > > QEMU master. However, saving things to a file do avoid communication with
> > > > > QEMU, and we no longer need to increase the complexity of vhost-user
> > > > > protocol. The suggestion to save fds to the systemd is also very
> > > > > reasonable if we don't consider the lifecycle issues, we will try it.
> > > > 
> > > > Hi,
> > > > We recently discussed crash recovery in the virtio-fs bi-weekly call and
> > > > I read some of this email thread because it's a topic I'm interested in.
> > > 
> > > I just had a quick fly over the patches so far. Shouldn't there be some
> > > kind of constraint for an automatic reconnection feature after a crash
> > > to prevent this being exploited by ROP brute force attacks?
> > > 
> > > E.g. adding some (maybe continuously increasing) delay and/or limiting the
> > > amount of reconnects within a certain time frame would come to my mind.
> > > 
> > > Best regards,
> > > Christian Schoenebeck
> > 
> > Thanks, Christian. I am still trying to figure out the details of the ROP
> > attacks.
> > 
> > However, QEMU's vhost-user reconnection is based on chardev socket
> > reconnection. The socket reconnection can be enabled by the "--chardev
> > socket,...,reconnect=N" in QEMU command options, in which N means QEMU will
> > try to connect the disconnected socket every N seconds. We can increase N
> > to increase the reconnect delay. If we want to change the reconnect delay
> > dynamically, I think we should change the chardev socket reconnection code.
> > It is a more generic mechanism than vhost-user-fs and vhost-user backend.
> > 
> > By the way, I also considered the socket reconnection delay time in the
> > performance aspect. As the reconnection delay increase, if an application
> > in the guest is doing I/Os, it will suffer larger tail latency. And for
> > now, the smallest delay is 1 second, which is rather large for
> > high-performance virtual I/O devices today. I think maybe a more performant
> > and safer reconnect delay adjustment mechanism should be considered in the
> > future. What are your thoughts?
> 
> So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute-force in ~18 
> hours (e.g. on Arm if PAC + MTE was enabled). With 24-bit PAC (no MTE) it 
> would be ~194 days. Independent of what architecture and defend mechanism is 
> used, there is always the possibility though that some kind of side channel 
> attack exists that might require a much lower amount of attempts. So in an 
> untrusted environment I would personally limit the amount of automatic 
> reconnects and rather accept a down time for further investigation if a 
> suspicious high amount of crashes happened.
> 
> And yes, if a dynamic delay scheme was deployed in future then starting with a 
> value smaller than 1 second would make sense.

If we're talking about repeatedly crashing the process to find out its
memory map, shouldn't each process have a different randomized memory
layout?

Stefan


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-17 12:32       ` Jiachen Zhang
@ 2021-03-22 11:00         ` Stefan Hajnoczi
  2021-03-22 20:13           ` [Virtio-fs] " Vivek Goyal
  0 siblings, 1 reply; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-22 11:00 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin, QEMU,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji,
	Marc-André Lureau

On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> I agree with you that a virtiofsd must be launched by software like
> systemd. So we are planning to define more generic persist/restore
> interfaces (callbacks); anyone can then implement their own persist/restore
> callbacks to store state in a proper place. In the next version we will
> also implement default callbacks for these interfaces: instead of
> vhost-user messages, systemd's sd_notify(3) will be the default method for
> storing fds, and several tmpfs files will be the default place to store the
> shm regions.

Okay, great!

I was thinking about how to make the crash recovery mechanism reusable
as a C library or Rust crate. The mechanism is a combination of:
1. sd_listen_fds(3) for restoring the fds on restart.
2. sd_notify(3) for storing the fds.
3. memfd or tmpfs for storing state (could be mmapped).
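
A minimal sketch of item 3, assuming the state fits in one fixed-size region
and the memfd itself survives via the fd store of items 1 and 2 (the size
and name are illustrative, error handling omitted):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    #define STATE_SIZE (1024 * 1024)

    int state_fd = memfd_create("crash-recovery-state", 0);
    ftruncate(state_fd, STATE_SIZE);
    void *state = mmap(NULL, STATE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, state_fd, 0);
    /* Writes to *state persist across a daemon crash as long as the
     * memfd is kept open somewhere (e.g. in the fd store). */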

I'm not sure if there is enough common behavior to create a reusable API
or if this is quite application-specific.

Stefan


* Re: [Virtio-fs] [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-22 11:00         ` Stefan Hajnoczi
@ 2021-03-22 20:13           ` Vivek Goyal
  2021-03-23 13:45             ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Vivek Goyal @ 2021-03-22 20:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrange, Michael S . Tsirkin, QEMU, virtio-fs,
	Xie Yongji, Jiachen Zhang, Marc-André Lureau

On Mon, Mar 22, 2021 at 11:00:52AM +0000, Stefan Hajnoczi wrote:
> On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> > On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > I agree with you that a virtiofsd must be launched by software like
> > systemd. So we are planning to define more generic persist/restore
> > interfaces (callbacks); anyone can then implement their own persist/restore
> > callbacks to store state in a proper place. In the next version we will
> > also implement default callbacks for these interfaces: instead of
> > vhost-user messages, systemd's sd_notify(3) will be the default method for
> > storing fds, and several tmpfs files will be the default place to store the
> > shm regions.
> 
> Okay, great!
> 
> I was thinking about how to make the crash recovery mechanism reusable
> as a C library or Rust crate. The mechanism is a combination of:
> 1. sd_listen_fds(3) for restoring the fds on restart.
> 2. sd_notify(3) for storing the fds.
> 3. memfd or tmpfs for storing state (could be mmapped).
> 
> I'm not sure if there is enough common behavior to create a reusable API
> or if this is quite application-specific.

I am wondering what will happen for use cases where virtiofsd is running
inside a container (with no systemd inside the container).

Do container managers offer systemd-like services to save and restore
state?

Vivek




* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-22 10:54             ` Stefan Hajnoczi
@ 2021-03-23 12:54               ` Christian Schoenebeck
  2021-03-23 14:25                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Schoenebeck @ 2021-03-23 12:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Stefan Hajnoczi, Daniel P. Berrange, slp, Michael S . Tsirkin,
	Dr . David Alan Gilbert, virtio-fs, Xie Yongji, Jiachen Zhang,
	Marc-André Lureau

On Monday, 22 March 2021 11:54:56 CET Stefan Hajnoczi wrote:
> > > Thanks, Christian. I am still trying to figure out the details of the
> > > ROP attacks.
> > > 
> > > However, QEMU's vhost-user reconnection is based on chardev socket
> > > reconnection. The socket reconnection can be enabled by the "--chardev
> > > socket,...,reconnect=N" in QEMU command options, in which N means QEMU
> > > will try to connect the disconnected socket every N seconds. We can
> > > increase N to increase the reconnect delay. If we want to change the
> > > reconnect delay dynamically, I think we should change the chardev socket
> > > reconnection code. It is a more generic mechanism than vhost-user-fs and
> > > vhost-user backend.
> > > 
> > > By the way, I also considered the socket reconnection delay time in the
> > > performance aspect. As the reconnection delay increase, if an
> > > application in the guest is doing I/Os, it will suffer larger tail
> > > latency. And for now, the smallest delay is 1 second, which is rather
> > > large for high-performance virtual I/O devices today. I think maybe a
> > > more performant and safer reconnect delay adjustment mechanism should be
> > > considered in the future. What are your thoughts?
> > 
> > So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute-force in
> > ~18 hours (e.g. on Arm if PAC + MTE was enabled). With 24-bit PAC (no
> > MTE) it would be ~194 days. Independent of what architecture and defend
> > mechanism is used, there is always the possibility though that some kind
> > of side channel attack exists that might require a much lower amount of
> > attempts. So in an untrusted environment I would personally limit the
> > amount of automatic reconnects and rather accept a down time for further
> > investigation if a suspicious high amount of crashes happened.
> > 
> > And yes, if a dynamic delay scheme was deployed in future then starting
> > with a value smaller than 1 second would make sense.
> 
> If we're talking about repeatedly crashing the process to find out its
> memory map, shouldn't each process have a different randomized memory
> layout?
> 
> Stefan

Yes, ASLR has been enabled by default on Linux and other OSes for more than
10 years. But ASLR does not prevent ROP attacks, which commonly use relative
offsets, stack tweaking, indirect jumps, as well as heap spraying. Plus,
side channels exist to gain access to direct addresses.

The situation might improve significantly when shadow stacks (e.g. Intel
CET) become widely used in the future. But in the meantime I would be
cautious if something is crashing too often in a certain time frame.

Best regards,
Christian Schoenebeck





* Re: [Virtio-fs] [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-22 20:13           ` [Virtio-fs] " Vivek Goyal
@ 2021-03-23 13:45             ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-23 13:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: vrothber, Daniel P. Berrange, Michael S . Tsirkin, dwalsh, QEMU,
	virtio-fs, Xie Yongji, Jiachen Zhang, Marc-André Lureau

On Mon, Mar 22, 2021 at 04:13:26PM -0400, Vivek Goyal wrote:
> On Mon, Mar 22, 2021 at 11:00:52AM +0000, Stefan Hajnoczi wrote:
> > On Wed, Mar 17, 2021 at 08:32:31PM +0800, Jiachen Zhang wrote:
> > > On Wed, Mar 17, 2021 at 6:05 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Fri, Dec 18, 2020 at 05:39:34PM +0800, Jiachen Zhang wrote:
> > > I agree with you that a virtiofsd must be launched by software like
> > > systemd. So we are planning to define more generic persist/restore
> > > interfaces (callbacks); anyone can then implement their own persist/restore
> > > callbacks to store state in a proper place. In the next version we will
> > > also implement default callbacks for these interfaces: instead of
> > > vhost-user messages, systemd's sd_notify(3) will be the default method for
> > > storing fds, and several tmpfs files will be the default place to store the
> > > shm regions.
> > 
> > Okay, great!
> > 
> > I was thinking about how to make the crash recovery mechanism reusable
> > as a C library or Rust crate. The mechanism is a combination of:
> > 1. sd_listen_fds(3) for restoring the fds on restart.
> > 2. sd_notify(3) for storing the fds.
> > 3. memfd or tmpfs for storing state (could be mmapped).
> > 
> > I'm not sure if there is enough common behavior to create a reusable API
> > or if this is quite application-specific.
> 
> I am wondering what will happen for use cases where virtiofsd is running
> inside a container (with no systemd inside containers).
> 
> Do container managers offer systemd like services to save and restore
> state.

It appears so, at least for Podman where sd_notify is explicitly
mentioned:
https://www.redhat.com/sysadmin/improved-systemd-podman

As mentioned in this email thread, the socket activation and
sd_notify(3) interface can also be implemented by non-systemd software.
Anyone running a custom container runtime or orchestration software
could add these interfaces (they are reasonably simple and the protocols
are documented in the systemd documentation).

Stefan


* Re: [External] Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-03-23 12:54               ` Christian Schoenebeck
@ 2021-03-23 14:25                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-03-23 14:25 UTC (permalink / raw)
  To: Christian Schoenebeck
  Cc: Daniel P. Berrange, slp, Michael S . Tsirkin,
	Dr . David Alan Gilbert, qemu-devel, virtio-fs, Xie Yongji,
	Jiachen Zhang, Marc-André Lureau

On Tue, Mar 23, 2021 at 01:54:46PM +0100, Christian Schoenebeck wrote:
> On Monday, 22 March 2021 11:54:56 CET Stefan Hajnoczi wrote:
> > > > Thanks, Christian. I am still trying to figure out the details of the
> > > > ROP attacks.
> > > > 
> > > > However, QEMU's vhost-user reconnection is based on chardev socket
> > > > reconnection. The socket reconnection can be enabled by the "--chardev
> > > > socket,...,reconnect=N" in QEMU command options, in which N means QEMU
> > > > will try to connect the disconnected socket every N seconds. We can
> > > > increase N to increase the reconnect delay. If we want to change the
> > > > reconnect delay dynamically, I think we should change the chardev
> > > > socket reconnection code. It is a more generic mechanism than
> > > > vhost-user-fs and vhost-user backend.
> > > > 
> > > > By the way, I also considered the socket reconnection delay time in
> > > > the performance aspect. As the reconnection delay increase, if an
> > > > application in the guest is doing I/Os, it will suffer larger tail
> > > > latency. And for now, the smallest delay is 1 second, which is rather
> > > > large for high-performance virtual I/O devices today. I think maybe a
> > > > more performant and safer reconnect delay adjustment mechanism should
> > > > be considered in the future. What are your thoughts?
> > > 
> > > So with N=1 an attacker could e.g. bypass a 16-bit PAC by brute-force in
> > > ~18 hours (e.g. on Arm if PAC + MTE was enabled). With 24-bit PAC (no
> > > MTE) it would be ~194 days. Independent of what architecture and defend
> > > mechanism is used, there is always the possibility though that some kind
> > > of side channel attack exists that might require a much lower amount of
> > > attempts. So in an untrusted environment I would personally limit the
> > > amount of automatic reconnects and rather accept a down time for further
> > > investigation if a suspicious high amount of crashes happened.
> > > 
> > > And yes, if a dynamic delay scheme was deployed in future then starting
> > > with a value smaller than 1 second would make sense.
> > 
> > If we're talking about repeatedly crashing the process to find out its
> > memory map, shouldn't each process have a different randomized memory
> > layout?
> > 
> > Stefan
> 
> Yes, ASLR is enabled on Linux and other OSes by default for more than 10 
> years. But ASLR does not prevent ROP attacks which are commonly using relative 
> offsets, tweaking the stack, indirect jumps, as well as heap spraying. Plus 
> side channels exist to gain access to direct addresses.
> 
> The situation might improve significantly when shadow stacks (e.g. Intel CET) 
> become widely used in future. But in the meantime I would be cautious if 
> something is crashing too often in a certain time frame.

It's a good point for performance as well. A broken service should not
hog CPU by constantly restarting itself. If it's broken badly it may
never come back up and should be throttled.

Stefan


* Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
                   ` (10 preceding siblings ...)
  2020-12-16 15:36 ` Marc-André Lureau
@ 2021-05-10 14:38 ` Jiachen Zhang
  2021-05-13 15:17   ` Stefan Hajnoczi
  11 siblings, 1 reply; 30+ messages in thread
From: Jiachen Zhang @ 2021-05-10 14:38 UTC (permalink / raw)
  To: Dr . David Alan Gilbert, Michael S . Tsirkin, Stefan Hajnoczi,
	Xie Yongji
  Cc: virtio-fs-list, QEMU

Hi all,


We are going to develop the v2 patchset for virtio-fs crash reconnection. As
suggested by Marc-André and Stefan, except for the inflight I/O tracking
log area, all the other internal state of virtiofsd will be saved somewhere
other than QEMU. Specifically, the three lo_maps (ino_map, dirp_map, and
fd_map) could be saved to several mmapped files, and the opened fds could
be saved to systemd. I'd like to get some feedback on our further thoughts
before we work on the revision.


1. What about saving the opened fds as file handles to the host kernel by
default, instead of saving them to systemd? After some internal discussion,
we think introducing systemd may introduce more uncertainty into the
system, as we need to create one service for each daemon, and all the
daemons may suffer from a single point of failure in the systemd process.
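
"File handles" here means the name_to_handle_at(2) / open_by_handle_at(2)
pair; a rough sketch (path and mount_fd are placeholders, and restoring via
open_by_handle_at() needs CAP_DAC_READ_SEARCH):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>

    struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    fh->handle_bytes = MAX_HANDLE_SZ;
    int mount_id;

    /* crash-safe: the handle is just bytes, no open fd has to survive */
    name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);

    /* on restore: re-open the inode from the saved handle */
    int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);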


2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
about splitting the flattened lo_map implementation, which supports being
persisted to files, out of passthrough_ll.c into a new, separate source
file. This way, maybe we can more easily wrap it with some Rust-compatible
interfaces, and enable crash recovery for virtiofsd-rs based on it.


3. What about dropping the dirp_map, and integrating the opened directory
fds into fd_map? The virtiofsd-rs implementation only has two maps (inodes
and handles). In the C version, dirp_map may also be unnecessary.


All the best,

Jiachen

On Wed, Dec 16, 2020 at 12:21 AM Jiachen Zhang <zhangjiachen.jaycee@bytedance.com> wrote:

> Hi, all
>
> We implement virtio-fs crash reconnection in this patchset. The crash
> reconnection of virtiofsd here is completely transparent to guest, no
> remount in guest is needed, even the inflight requests can be handled
> normally after reconnection. We are looking forward to any comments.
>
> [...]


* Re: [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection
  2021-05-10 14:38 ` Jiachen Zhang
@ 2021-05-13 15:17   ` Stefan Hajnoczi
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Hajnoczi @ 2021-05-13 15:17 UTC (permalink / raw)
  To: Jiachen Zhang
  Cc: QEMU, Xie Yongji, virtio-fs-list, Dr . David Alan Gilbert,
	Michael S . Tsirkin

On Mon, May 10, 2021 at 10:38:05PM +0800, Jiachen Zhang wrote:
> Hi all,
> 
> 
> We are going to develop the v2 patchset for virtio-fs crash reconnection. As
> suggested by Marc-André and Stefan, except for the inflight I/O tracking
> log area, all the other internal state of virtiofsd will be saved somewhere
> other than QEMU. Specifically, the three lo_maps (ino_map, dirp_map, and
> fd_map) could be saved to several mmapped files, and the opened fds could
> be saved to systemd. I'd like to get some feedback on our further thoughts
> before we work on the revision.
> 
> 
> 1. What about saving the opened fds as file handles to the host kernel by
> default, instead of saving them to systemd? After some internal discussion,
> we think introducing systemd may introduce more uncertainty into the
> system, as we need to create one service for each daemon, and all the
> daemons may suffer from a single point of failure in the systemd process.

I don't think saving file handles works 100%. The difference between an
open fd and a file handle is what happens when the inode is deleted: if
an external process deletes the inode during the restart, an open fd
keeps it alive, while a file handle becomes stale and the inode is gone.

Regarding systemd, it's pid 1 and cannot die - otherwise the system is
broken.

But in any case I think there are multiple options here. Whether you
choose to use systemd, implement the sd_notify(3) protocol in your own
parent process, or take a different approach like a parent process with
clone(2) CLONE_FILES to avoid the communication overhead for saving
every fd, I think all of those approaches would be reasonable.

> 2. Like the btree map implementation (multikey.rs) of virtiofsd-rs, what
> about splitting the flattened lo_map implementation, which supports being
> persisted to files, out of passthrough_ll.c into a new, separate source
> file. This way, maybe we can more easily wrap it with some Rust-compatible
> interfaces, and enable crash recovery for virtiofsd-rs based on it.

In the past two months I've noticed the number of virtiofsd-rs merge
requests has increased and I think the trend is that new development is
focussing on virtiofsd-rs.

If it fits into your plans then focussing on virtiofsd-rs would be fine
and then there is no need to worry about Rust compatible interfaces for
C virtiofsd.

> 3. What about dropping the dirp_map, and integrating the opened directory
> fds into fd_map? The virtiofsd-rs implementation only has two maps (inodes
> and handles). In the C version, dirp_map may also be unnecessary.

Maybe, but carefully:

The purpose of the maps is to safely isolate the client from the
virtiofsd's internal objects. The way I remember it is that C virtiofsd
has a separate dirp map to prevent type confusion between regular open
files and directories. The client must not trick the server into calling
readdir(3) on something that is not actually a directory stream, because
that could be a security issue.

However, it's possible that virtiofsd-rs is able to combine the two
because it uses syscall APIs on file descriptors instead of libc
opendir(3) so there is no possibility of type confusion. The syscall
would simply fail if the file descriptor is not O_DIRECTORY.
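
A sketch of that syscall-level pattern (parent_fd and name are placeholders
for the server's own inode bookkeeping):

    int dfd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_NOFOLLOW);
    if (dfd == -1) {
        /* fails with ENOTDIR if the inode is not a directory, so the
         * open itself enforces the type and no confusion is possible */
        return -errno;
    }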

Stefan


end of thread, other threads:[~2021-05-13 15:18 UTC | newest]

Thread overview: 30+ messages
2020-12-15 16:21 [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 1/9] vhost-user-fs: Add support for reconnection of vhost-user-fs backend Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 2/9] vhost: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 3/9] vhost-user-fs: Support virtiofsd crash reconnection Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 4/9] libvhost-user: Add vhost-user message types for sending shared memory and file fds Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 5/9] virtiofsd: Convert the struct lo_map array to a more flatten layout Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 6/9] virtiofsd: Add two new options for crash reconnection Jiachen Zhang
2021-02-04 12:08   ` Dr. David Alan Gilbert
2021-02-04 14:16     ` [External] " Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 7/9] virtiofsd: Persist/restore lo_map and opened fds to/from QEMU Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 8/9] virtiofsd: Ensure crash consistency after reconnection Jiachen Zhang
2020-12-15 16:21 ` [RFC PATCH 9/9] virtiofsd: (work around) Comment qsort in inflight I/O tracking Jiachen Zhang
2021-02-04 12:15   ` Dr. David Alan Gilbert
2021-02-04 14:20     ` [External] " Jiachen Zhang
2020-12-15 22:51 ` [RFC PATCH 0/9] Support for Virtio-fs daemon crash reconnection no-reply
2020-12-16 15:36 ` Marc-André Lureau
2020-12-18  9:39   ` [External] " Jiachen Zhang
2021-03-17 10:05     ` Stefan Hajnoczi
2021-03-17 11:49       ` Christian Schoenebeck
2021-03-17 12:57         ` Jiachen Zhang
2021-03-18 11:58           ` Christian Schoenebeck
2021-03-22 10:54             ` Stefan Hajnoczi
2021-03-23 12:54               ` Christian Schoenebeck
2021-03-23 14:25                 ` Stefan Hajnoczi
2021-03-17 12:32       ` Jiachen Zhang
2021-03-22 11:00         ` Stefan Hajnoczi
2021-03-22 20:13           ` [Virtio-fs] " Vivek Goyal
2021-03-23 13:45             ` Stefan Hajnoczi
2021-05-10 14:38 ` Jiachen Zhang
2021-05-13 15:17   ` Stefan Hajnoczi
