* [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting
@ 2019-02-18 10:27 elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patchset aims to support qemu reconnecting to the
vhost-user-blk backend after the backend crashes or
restarts.

Patch 1 introduces two new messages, VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD, to support transferring a shared
buffer between qemu and the backend.

Patch 2 deletes some redundant checks in contrib/libvhost-user.c.

Patches 3 and 4 are the libvhost-user counterparts of patch 1:
they make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD.

Patch 5 allows vhost-user-blk to use the two new messages
to get/set the inflight buffer from/to the backend.

Patch 6 makes vhost-user-blk reconnect to the backend when
the connection is closed.

Patch 7 enables VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD in the
vhost-user-blk backend, which is used to tell qemu that
reconnecting is now supported.

To use it, we could start qemu with:

qemu-system-x86_64 \
        -chardev socket,id=char0,path=/path/vhost.socket,reconnect=1 \
        -device vhost-user-blk-pci,chardev=char0

and start vhost-user-blk backend with:

vhost-user-blk -b /path/file -s /path/vhost.socket

Then we can restart vhost-user-blk at any time while the VM is running.

V5 to V6:
- Document the layout in inflight buffer for packed virtqueue
- Rework the layout in inflight buffer for split virtqueue
- Remove version field in VhostUserInflight
- Add a patch to remove some redundant check in
  contrib/libvhost-user.c
- Document more details in vhost-user.txt

V4 to V5:
- Drop patch that enables "nowait" option on client sockets
- Support resubmitting inflight I/O in order
- Make inflight I/O tracking more robust
- Remove align field and add queue size field in VhostUserInflight
- Document more details in vhost-user.txt

V3 to V4:
- Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
- Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
  and VHOST_USER_SET_INFLIGHT_FD
- Allocate inflight buffer in backend rather than in qemu
- Document a recommended format for inflight buffer

V2 to V3:
- Use existing wait/nowait options to control connection on client
  sockets instead of introducing a "disconnected" option.
- Support the case that the vhost-user backend restarts during
  initialization of the vhost-user-blk device.

V1 to V2:
- Introduce "disconnected" option for chardev instead of reuse "wait"
  option
- Support the case that QEMU starts before vhost-user backend
- Drop message VHOST_USER_SET_VRING_INFLIGHT
- Introduce two new messages VHOST_USER_GET_SHM_SIZE
  and VHOST_USER_SET_SHM_FD

Xie Yongji (7):
  vhost-user: Support transferring inflight buffer between qemu and
    backend
  libvhost-user: Remove unnecessary FD flag check for event file
    descriptors
  libvhost-user: Introduce vu_queue_map_desc()
  libvhost-user: Support tracking inflight I/O in shared memory
  vhost-user-blk: Add support to get/set inflight buffer
  vhost-user-blk: Add support to reconnect backend
  contrib/vhost-user-blk: enable inflight I/O tracking

 Makefile                                |   2 +-
 contrib/libvhost-user/libvhost-user.c   | 400 ++++++++++++++++++++----
 contrib/libvhost-user/libvhost-user.h   |  58 ++++
 contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
 docs/interop/vhost-user.txt             | 264 ++++++++++++++++
 hw/block/vhost-user-blk.c               | 229 +++++++++++---
 hw/virtio/vhost-user.c                  | 107 +++++++
 hw/virtio/vhost.c                       |  96 ++++++
 include/hw/virtio/vhost-backend.h       |  10 +
 include/hw/virtio/vhost-user-blk.h      |   5 +
 include/hw/virtio/vhost.h               |  18 ++
 11 files changed, 1084 insertions(+), 108 deletions(-)

-- 
2.17.1


* [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-21 17:27   ` Michael S. Tsirkin
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors elohimes
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
buffer between qemu and the backend.

First, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the shared
buffer from the backend. Then qemu sends it back through
VHOST_USER_SET_INFLIGHT_FD each time vhost-user is started.

The backend uses this shared buffer to track inflight I/O.
Qemu should retrieve a new one on vm reset.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Chai Wen <chaiwen@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 docs/interop/vhost-user.txt       | 264 ++++++++++++++++++++++++++++++
 hw/virtio/vhost-user.c            | 107 ++++++++++++
 hw/virtio/vhost.c                 |  96 +++++++++++
 include/hw/virtio/vhost-backend.h |  10 ++
 include/hw/virtio/vhost.h         |  18 ++
 5 files changed, 495 insertions(+)
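
For illustration, a vhost-user device in qemu is expected to drive the
new messages at start time roughly as follows. This is a minimal sketch
built on the vhost_dev_* helpers added below (patch 5 wires this up for
vhost-user-blk); error handling is shortened:

static int example_dev_start(struct vhost_dev *dev, uint16_t queue_size,
                             struct vhost_inflight *inflight)
{
    int ret;

    /* First start (or after a vm reset freed the buffer): ask the
     * slave to allocate and share the inflight buffer via
     * VHOST_USER_GET_INFLIGHT_FD. */
    if (!inflight->addr) {
        ret = vhost_dev_get_inflight(dev, queue_size, inflight);
        if (ret < 0) {
            return ret;
        }
    }

    /* Every start: hand the buffer back via VHOST_USER_SET_INFLIGHT_FD
     * so the slave can resubmit any I/O left inflight by a previous
     * incarnation. */
    return vhost_dev_set_inflight(dev, inflight);
}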

diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index c2194711d9..61c6d0e415 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -142,6 +142,17 @@ Depending on the request type, payload can be:
    Offset: a 64-bit offset of this area from the start of the
        supplied file descriptor
 
+ * Inflight description
+   -----------------------------------------------------
+   | mmap size | mmap offset | num queues | queue size |
+   -----------------------------------------------------
+
+   mmap size: a 64-bit size of area to track inflight I/O
+   mmap offset: a 64-bit offset of this area from the start
+                of the supplied file descriptor
+   num queues: a 16-bit number of virtqueues
+   queue size: a 16-bit size of virtqueues
+
 In QEMU the vhost-user message is implemented with the following struct:
 
 typedef struct VhostUserMsg {
@@ -157,6 +168,7 @@ typedef struct VhostUserMsg {
         struct vhost_iotlb_msg iotlb;
         VhostUserConfig config;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
     };
 } QEMU_PACKED VhostUserMsg;
 
@@ -175,6 +187,7 @@ the ones that do:
  * VHOST_USER_GET_PROTOCOL_FEATURES
  * VHOST_USER_GET_VRING_BASE
  * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+ * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 [ Also see the section on REPLY_ACK protocol extension. ]
 
@@ -188,6 +201,7 @@ in the ancillary data:
  * VHOST_USER_SET_VRING_CALL
  * VHOST_USER_SET_VRING_ERR
  * VHOST_USER_SET_SLAVE_REQ_FD
+ * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 If Master is unable to send the full message or receives a wrong reply it will
 close the connection. An optional reconnection mechanism can be implemented.
@@ -382,6 +396,235 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
 slave can send file descriptors (at most 8 descriptors in each message)
 to master via ancillary data using this fd communication channel.
 
+Inflight I/O tracking
+---------------------
+
+To support reconnecting after restart or crash, the slave may need to
+resubmit inflight I/Os. If a virtqueue is processed in order, we can
+easily achieve that by getting the inflight descriptors from the
+descriptor table (split virtqueue) or descriptor ring (packed virtqueue).
+However, this does not work when descriptors are processed out of order,
+because some entries which store the information of inflight descriptors
+in the available ring (split virtqueue) or descriptor ring (packed
+virtqueue) might be overwritten by new entries. To solve this problem,
+the slave needs to allocate an extra buffer to store the information of
+inflight descriptors and share it with the master for persistence.
+VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD are used to
+transfer this buffer between master and slave. The format of this
+buffer is described below:
+
+-------------------------------------------------------
+| queue0 region | queue1 region | ... | queueN region |
+-------------------------------------------------------
+
+N is the number of available virtqueues. The slave can get it from the
+num queues field of VhostUserInflight.
+
+For split virtqueue, queue region can be implemented as:
+
+typedef struct DescStateSplit {
+    /* Indicate whether this descriptor is inflight or not.
+     * Only available for head-descriptor. */
+    uint8_t inflight;
+
+    /* Padding */
+    uint8_t padding;
+
+    /* Link to the last processed entry */
+    uint16_t next;
+} DescStateSplit;
+
+typedef struct QueueRegionSplit {
+    /* The feature flags of this region. Now it's initialized to 0. */
+    uint64_t features;
+
+    /* The version of this region. It's 1 currently.
+     * Zero value indicates an uninitialized buffer */
+    uint16_t version;
+
+    /* The size of DescStateSplit array. It's equal to the virtqueue
+     * size. Slave could get it from queue size field of VhostUserInflight. */
+    uint16_t desc_num;
+
+    /* The head of processed DescStateSplit entry list */
+    uint16_t process_head;
+
+    /* Storing the idx value of used ring */
+    uint16_t used_idx;
+
+    /* Used to track the state of each descriptor in descriptor table */
+    DescStateSplit desc[0];
+} QueueRegionSplit;
+
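+As an illustration, a slave that has mapped the buffer could carve it
+into per-queue regions as sketched below. The 64-byte per-region
+alignment mirrors the libvhost-user implementation in this series; the
+helper names are made up for this example:
+
+static uint64_t example_region_size(uint16_t queue_size)
+{
+    /* Region header plus one DescStateSplit per descriptor,
+     * rounded up to a 64-byte boundary. */
+    uint64_t sz = sizeof(QueueRegionSplit) +
+                  sizeof(DescStateSplit) * queue_size;
+
+    return (sz + 63) & ~(uint64_t)63;
+}
+
+static QueueRegionSplit *example_queue_region(void *addr, int queue_idx,
+                                              uint16_t queue_size)
+{
+    return (QueueRegionSplit *)((char *)addr +
+           example_region_size(queue_size) * queue_idx);
+}
+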
+To track inflight I/O, the queue region should be processed as follows:
+
+When receiving available buffers from the driver:
+
+    1. Get the next available head-descriptor index from available ring, i
+
+    2. Set desc[i].inflight to 1
+
+When supplying used buffers to the driver:
+
+    1. Get corresponding used head-descriptor index, i
+
+    2. Set desc[i].next to process_head
+
+    3. Set process_head to i
+
+    4. Steps 1,2,3 may be performed repeatedly if batching is possible
+
+    5. Increase the idx value of used ring by the size of the batch
+
+    6. Set the inflight field of each DescStateSplit entry in the batch to 0
+
+    7. Set used_idx to the idx value of used ring
+
+When reconnecting:
+
+    1. If the value of used_idx does not match the idx value of used ring,
+
+        (a) Subtract the value of used_idx from the idx value of used ring to get
+        the number of in-progress DescStateSplit entries
+
+        (b) Set the inflight field of the in-progress DescStateSplit entries which
+        start from process_head to 0
+
+        (c) Set used_idx to the idx value of used ring
+
+    2. Resubmit each inflight DescStateSplit entry
+
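+The steps above translate into code roughly as follows. This is a
+condensed, illustrative sketch (the working libvhost-user version
+appears later in this series); real code also needs write barriers
+between the marked steps:
+
+/* When receiving an available buffer with head-descriptor index i */
+void example_track_avail(QueueRegionSplit *q, uint16_t i)
+{
+    q->desc[i].inflight = 1;
+}
+
+/* When supplying a batch of used head-descriptor indexes */
+void example_track_used(QueueRegionSplit *q, uint16_t *used_ring_idx,
+                        const uint16_t *batch, uint16_t n)
+{
+    uint16_t j;
+
+    for (j = 0; j < n; j++) {            /* steps 1-4 */
+        q->desc[batch[j]].next = q->process_head;
+        q->process_head = batch[j];
+    }
+    *used_ring_idx += n;                 /* step 5 */
+    for (j = 0; j < n; j++) {            /* step 6 */
+        q->desc[batch[j]].inflight = 0;
+    }
+    q->used_idx = *used_ring_idx;        /* step 7 */
+}
+
+/* When reconnecting */
+void example_recover_split(QueueRegionSplit *q, uint16_t used_ring_idx)
+{
+    uint16_t head = q->process_head;
+    uint16_t n = used_ring_idx - q->used_idx;
+
+    while (n--) {                        /* steps 1(a) and 1(b) */
+        q->desc[head].inflight = 0;
+        head = q->desc[head].next;
+    }
+    q->used_idx = used_ring_idx;         /* step 1(c) */
+    /* Step 2: resubmit every entry with desc[i].inflight == 1 */
+}
+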
+For packed virtqueue, queue region can be implemented as:
+
+typedef struct DescStatePacked {
+    /* Indicate whether this descriptor is inflight or not.
+     * Only available for head-descriptor. */
+    uint8_t inflight;
+
+    /* Padding */
+    uint8_t padding;
+
+    /* Link to the next free entry */
+    uint16_t next;
+
+    /* Link to the last entry of descriptor list.
+     * Only available for head-descriptor. */
+    uint16_t last;
+
+    /* The length of descriptor list.
+     * Only available for head-descriptor. */
+    uint16_t num;
+
+    /* The buffer id */
+    uint16_t id;
+
+    /* The descriptor flags */
+    uint16_t flags;
+
+    /* The buffer length */
+    uint32_t len;
+
+    /* The buffer address */
+    uint64_t addr;
+} DescStatePacked;
+
+typedef struct QueueRegionPacked {
+    /* The feature flags of this region. Now it's initialized to 0. */
+    uint64_t features;
+
+    /* The version of this region. It's 1 currently.
+     * Zero value indicates an uninitialized buffer */
+    uint16_t version;
+
+    /* The size of DescStatePacked array. It's equal to the virtqueue
+     * size. Slave could get it from queue size field of VhostUserInflight. */
+    uint16_t desc_num;
+
+    /* The head of free DescStatePacked entry list */
+    uint16_t free_head;
+
+    /* The old head of free DescStatePacked entry list */
+    uint16_t old_free_head;
+
+    /* The used index of descriptor ring */
+    uint16_t used_idx;
+
+    /* The old used index of descriptor ring */
+    uint16_t old_used_idx;
+
+    /* Device ring wrap counter */
+    uint8_t used_wrap_counter;
+
+    /* The old device ring wrap counter */
+    uint8_t old_used_wrap_counter;
+
+    /* Padding */
+    uint8_t padding[7];
+
+    /* Used to track the state of each descriptor fetched from descriptor ring */
+    DescStatePacked desc[0];
+} QueueRegionPacked;
+
+To track inflight I/O, the queue region should be processed as follows:
+
+When receiving available buffers from the driver:
+
+    1. Get the next available descriptor entry from descriptor ring, d
+
+    2. If d is head descriptor,
+
+        (a) Set desc[old_free_head].num to 0
+
+        (b) Set desc[old_free_head].inflight to 1
+
+    3. If d is last descriptor, set desc[old_free_head].last to free_head
+
+    4. Increase desc[old_free_head].num by 1
+
+    5. Set desc[free_head].addr, desc[free_head].len, desc[free_head].flags,
+    desc[free_head].id to d.addr, d.len, d.flags, d.id
+
+    6. Set free_head to desc[free_head].next
+
+    7. If d is last descriptor, set old_free_head to free_head
+
+When supplying used buffers to the driver:
+
+    1. Get corresponding used head-descriptor entry from descriptor ring, d
+
+    2. Get corresponding DescStatePacked entry, e
+
+    3. Set desc[e.last].next to free_head
+
+    4. Set free_head to the index of e
+
+    5. Steps 1,2,3,4 may be performed repeatedly if batching is possible
+
+    6. Increase used_idx by the size of the batch and update used_wrap_counter if needed
+
+    7. Update d.flags
+
+    8. Set the inflight field of each head DescStatePacked entry in the batch to 0
+
+    9. Set old_free_head, old_used_idx, old_used_wrap_counter to free_head, used_idx,
+    used_wrap_counter
+
+When reconnecting:
+
+    1. If used_idx does not match old_used_idx,
+
+        (a) Get the next descriptor ring entry through old_used_idx, d
+
+        (b) Use old_used_wrap_counter to calculate the available flags
+
+        (c) If d.flags is not equal to the calculated flags value, set old_free_head,
+        old_used_idx, old_used_wrap_counter to free_head, used_idx, used_wrap_counter
+
+    2. Set free_head, used_idx, used_wrap_counter to old_free_head, old_used_idx,
+    old_used_wrap_counter
+
+    3. Set the inflight field of each free DescStatePacked entry to 0
+
+    4. Resubmit each inflight DescStatePacked entry
+
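+The reconnect recovery above can be sketched as follows. This is
+illustrative only; vring_packed_desc and avail_flags() are placeholder
+names for the descriptor ring entry type and a helper computing the
+expected VRING_DESC_F_AVAIL/VRING_DESC_F_USED bits from a wrap counter:
+
+void example_recover_packed(QueueRegionPacked *q,
+                            struct vring_packed_desc *ring)
+{
+    if (q->used_idx != q->old_used_idx) {
+        struct vring_packed_desc *d = &ring[q->old_used_idx];
+
+        /* If the entry's flags no longer match the available flags
+         * for the old wrap counter, step 7 already reached the ring:
+         * roll the old_* snapshot forward (step 1(c)). */
+        if (d->flags != avail_flags(q->old_used_wrap_counter)) {
+            q->old_free_head = q->free_head;
+            q->old_used_idx = q->used_idx;
+            q->old_used_wrap_counter = q->used_wrap_counter;
+        }
+    }
+
+    /* Restore the last consistent snapshot (step 2). */
+    q->free_head = q->old_free_head;
+    q->used_idx = q->old_used_idx;
+    q->used_wrap_counter = q->old_used_wrap_counter;
+
+    /* Then clear the inflight field of each free DescStatePacked
+     * entry (step 3) and resubmit each inflight entry (step 4). */
+}
+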
 Protocol features
 -----------------
 
@@ -397,6 +640,7 @@ Protocol features
 #define VHOST_USER_PROTOCOL_F_CONFIG         9
 #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD  10
 #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER  11
+#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
 
 Master message types
 --------------------
@@ -761,6 +1005,26 @@ Master message types
       was previously sent.
       The value returned is an error indication; 0 is success.
 
+ * VHOST_USER_GET_INFLIGHT_FD
+      Id: 31
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to get
+      a shared buffer from slave. The slave will use the shared buffer to
+      track inflight I/O. QEMU should retrieve a new one on vm reset.
+
+ * VHOST_USER_SET_INFLIGHT_FD
+      Id: 32
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to send
+      the shared inflight buffer back to the slave so that the slave can
+      resubmit inflight I/O after a crash or restart.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 564a31d12c..21a81998ba 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -52,6 +52,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_CONFIG = 9,
     VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
     VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -89,6 +90,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_POSTCOPY_ADVISE  = 28,
     VHOST_USER_POSTCOPY_LISTEN  = 29,
     VHOST_USER_POSTCOPY_END     = 30,
+    VHOST_USER_GET_INFLIGHT_FD = 31,
+    VHOST_USER_SET_INFLIGHT_FD = 32,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -147,6 +150,13 @@ typedef struct VhostUserVringArea {
     uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserInflight {
+    uint64_t mmap_size;
+    uint64_t mmap_offset;
+    uint16_t num_queues;
+    uint16_t queue_size;
+} VhostUserInflight;
+
 typedef struct {
     VhostUserRequest request;
 
@@ -169,6 +179,7 @@ typedef union {
         VhostUserConfig config;
         VhostUserCryptoSession session;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1739,6 +1750,100 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
     return result;
 }
 
+static int vhost_user_get_inflight_fd(struct vhost_dev *dev,
+                                      uint16_t queue_size,
+                                      struct vhost_inflight *inflight)
+{
+    void *addr;
+    int fd;
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->user->chr;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_GET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.num_queues = dev->nvqs,
+        .payload.inflight.queue_size = queue_size,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        return -1;
+    }
+
+    if (vhost_user_read(dev, &msg) < 0) {
+        return -1;
+    }
+
+    if (msg.hdr.request != VHOST_USER_GET_INFLIGHT_FD) {
+        error_report("Received unexpected msg type. "
+                     "Expected %d received %d",
+                     VHOST_USER_GET_INFLIGHT_FD, msg.hdr.request);
+        return -1;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.inflight)) {
+        error_report("Received bad msg size.");
+        return -1;
+    }
+
+    if (!msg.payload.inflight.mmap_size) {
+        return 0;
+    }
+
+    fd = qemu_chr_fe_get_msgfd(chr);
+    if (fd < 0) {
+        error_report("Failed to get mem fd");
+        return -1;
+    }
+
+    addr = mmap(0, msg.payload.inflight.mmap_size, PROT_READ | PROT_WRITE,
+                MAP_SHARED, fd, msg.payload.inflight.mmap_offset);
+
+    if (addr == MAP_FAILED) {
+        error_report("Failed to mmap mem fd");
+        close(fd);
+        return -1;
+    }
+
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = msg.payload.inflight.mmap_size;
+    inflight->offset = msg.payload.inflight.mmap_offset;
+    inflight->queue_size = queue_size;
+
+    return 0;
+}
+
+static int vhost_user_set_inflight_fd(struct vhost_dev *dev,
+                                      struct vhost_inflight *inflight)
+{
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_SET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.mmap_size = inflight->size,
+        .payload.inflight.mmap_offset = inflight->offset,
+        .payload.inflight.num_queues = dev->nvqs,
+        .payload.inflight.queue_size = inflight->queue_size,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, &inflight->fd, 1) < 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
 VhostUserState *vhost_user_init(void)
 {
     VhostUserState *user = g_new0(struct VhostUserState, 1);
@@ -1790,4 +1895,6 @@ const VhostOps user_ops = {
         .vhost_crypto_create_session = vhost_user_crypto_create_session,
         .vhost_crypto_close_session = vhost_user_crypto_close_session,
         .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
+        .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
+        .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 569c4053ea..8db1a855eb 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1481,6 +1481,102 @@ void vhost_dev_set_config_notifier(struct vhost_dev *hdev,
     hdev->config_ops = ops;
 }
 
+void vhost_dev_free_inflight(struct vhost_inflight *inflight)
+{
+    if (inflight->addr) {
+        qemu_memfd_free(inflight->addr, inflight->size, inflight->fd);
+        inflight->addr = NULL;
+        inflight->fd = -1;
+    }
+}
+
+static int vhost_dev_resize_inflight(struct vhost_inflight *inflight,
+                                     uint64_t new_size)
+{
+    Error *err = NULL;
+    int fd = -1;
+    void *addr = qemu_memfd_alloc("vhost-inflight", new_size,
+                                  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+                                  &fd, &err);
+
+    if (err) {
+        error_report_err(err);
+        return -1;
+    }
+
+    vhost_dev_free_inflight(inflight);
+    inflight->offset = 0;
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = new_size;
+
+    return 0;
+}
+
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    if (inflight->addr) {
+        qemu_put_be64(f, inflight->size);
+        qemu_put_be16(f, inflight->queue_size);
+        qemu_put_buffer(f, inflight->addr, inflight->size);
+    } else {
+        qemu_put_be64(f, 0);
+    }
+}
+
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    uint64_t size;
+
+    size = qemu_get_be64(f);
+    if (!size) {
+        return 0;
+    }
+
+    if (inflight->size != size) {
+        if (vhost_dev_resize_inflight(inflight, size)) {
+            return -1;
+        }
+    }
+    inflight->queue_size = qemu_get_be16(f);
+
+    qemu_get_buffer(f, inflight->addr, size);
+
+    return 0;
+}
+
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_set_inflight_fd && inflight->addr) {
+        r = dev->vhost_ops->vhost_set_inflight_fd(dev, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_set_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
+int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_get_inflight_fd) {
+        r = dev->vhost_ops->vhost_get_inflight_fd(dev, queue_size, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_get_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
 /* Host notifiers must be enabled at this point. */
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 81283ec50f..d6632a18e6 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -25,6 +25,7 @@ typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
 struct vhost_memory;
@@ -104,6 +105,13 @@ typedef int (*vhost_crypto_close_session_op)(struct vhost_dev *dev,
 typedef bool (*vhost_backend_mem_section_filter_op)(struct vhost_dev *dev,
                                                 MemoryRegionSection *section);
 
+typedef int (*vhost_get_inflight_fd_op)(struct vhost_dev *dev,
+                                        uint16_t queue_size,
+                                        struct vhost_inflight *inflight);
+
+typedef int (*vhost_set_inflight_fd_op)(struct vhost_dev *dev,
+                                        struct vhost_inflight *inflight);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -142,6 +150,8 @@ typedef struct VhostOps {
     vhost_crypto_create_session_op vhost_crypto_create_session;
     vhost_crypto_close_session_op vhost_crypto_close_session;
     vhost_backend_mem_section_filter_op vhost_backend_mem_section_filter;
+    vhost_get_inflight_fd_op vhost_get_inflight_fd;
+    vhost_set_inflight_fd_op vhost_set_inflight_fd;
 } VhostOps;
 
 extern const VhostOps user_ops;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a7f449fa87..619498c8f4 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -7,6 +7,15 @@
 #include "exec/memory.h"
 
 /* Generic structures common for any vhost based device. */
+
+struct vhost_inflight {
+    int fd;
+    void *addr;
+    uint64_t size;
+    uint64_t offset;
+    uint16_t queue_size;
+};
+
 struct vhost_virtqueue {
     int kick;
     int call;
@@ -120,4 +129,13 @@ int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
  */
 void vhost_dev_set_config_notifier(struct vhost_dev *dev,
                                    const VhostDevConfigOps *ops);
+
+void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
+void vhost_dev_free_inflight(struct vhost_inflight *inflight);
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight);
+int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
+                           struct vhost_inflight *inflight);
 #endif
-- 
2.17.1


* [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

vu_check_queue_msg_file() has already checked the NOFD flag. So
let's delete the redundant checks after it.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 contrib/libvhost-user/libvhost-user.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)
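
For reference, the check that makes these removals safe lives in
vu_check_queue_msg_file(), which every affected handler calls first.
From memory of contrib/libvhost-user it looks roughly like the sketch
below (not the exact code):

static bool
vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
{
    int index = vmsg->payload.u64 & VHOST_USER_VRING_IDX_MASK;

    if (index >= VHOST_MAX_NR_VIRTQUEUE) {
        vu_panic(dev, "Invalid queue index: %u", index);
        return false;
    }

    /* Panics unless exactly one fd was passed and NOFD is clear, so
     * handlers past this point can use vmsg->fds[0] directly. */
    if (vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK ||
        vmsg->fd_num != 1) {
        vu_panic(dev, "Invalid fds in request: %d", vmsg->request);
        return false;
    }

    return true;
}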

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 3f14b4138b..16fec3a3fd 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -907,10 +907,8 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
         dev->vq[index].kick_fd = -1;
     }
 
-    if (!(vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
-        dev->vq[index].kick_fd = vmsg->fds[0];
-        DPRINT("Got kick_fd: %d for vq: %d\n", vmsg->fds[0], index);
-    }
+    dev->vq[index].kick_fd = vmsg->fds[0];
+    DPRINT("Got kick_fd: %d for vq: %d\n", vmsg->fds[0], index);
 
     dev->vq[index].started = true;
     if (dev->iface->queue_set_started) {
@@ -995,9 +993,7 @@ vu_set_vring_call_exec(VuDev *dev, VhostUserMsg *vmsg)
         dev->vq[index].call_fd = -1;
     }
 
-    if (!(vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
-        dev->vq[index].call_fd = vmsg->fds[0];
-    }
+    dev->vq[index].call_fd = vmsg->fds[0];
 
     DPRINT("Got call_fd: %d for vq: %d\n", vmsg->fds[0], index);
 
@@ -1020,9 +1016,7 @@ vu_set_vring_err_exec(VuDev *dev, VhostUserMsg *vmsg)
         dev->vq[index].err_fd = -1;
     }
 
-    if (!(vmsg->payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
-        dev->vq[index].err_fd = vmsg->fds[0];
-    }
+    dev->vq[index].err_fd = vmsg->fds[0];
 
     return false;
 }
-- 
2.17.1


* [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc()
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

Introduce vu_queue_map_desc(), which is independent of
vu_queue_pop().

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 contrib/libvhost-user/libvhost-user.c | 88 ++++++++++++++++-----------
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 16fec3a3fd..ea0f414b6d 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -1847,49 +1847,20 @@ virtqueue_alloc_element(size_t sz,
     return elem;
 }
 
-void *
-vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+static void *
+vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
 {
-    unsigned int i, head, max, desc_len;
+    struct vring_desc *desc = vq->vring.desc;
     uint64_t desc_addr, read_len;
+    unsigned int desc_len;
+    unsigned int max = vq->vring.num;
+    unsigned int i = idx;
     VuVirtqElement *elem;
-    unsigned out_num, in_num;
+    unsigned int out_num = 0, in_num = 0;
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
     struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
-    struct vring_desc *desc;
     int rc;
 
-    if (unlikely(dev->broken) ||
-        unlikely(!vq->vring.avail)) {
-        return NULL;
-    }
-
-    if (vu_queue_empty(dev, vq)) {
-        return NULL;
-    }
-    /* Needed after virtio_queue_empty(), see comment in
-     * virtqueue_num_heads(). */
-    smp_rmb();
-
-    /* When we start there are none of either input nor output. */
-    out_num = in_num = 0;
-
-    max = vq->vring.num;
-    if (vq->inuse >= vq->vring.num) {
-        vu_panic(dev, "Virtqueue size exceeded");
-        return NULL;
-    }
-
-    if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
-        return NULL;
-    }
-
-    if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_set_avail_event(vq, vq->last_avail_idx);
-    }
-
-    i = head;
-    desc = vq->vring.desc;
     if (desc[i].flags & VRING_DESC_F_INDIRECT) {
         if (desc[i].len % sizeof(struct vring_desc)) {
             vu_panic(dev, "Invalid size for indirect buffer table");
@@ -1941,12 +1912,13 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
     } while (rc == VIRTQUEUE_READ_DESC_MORE);
 
     if (rc == VIRTQUEUE_READ_DESC_ERROR) {
+        vu_panic(dev, "read descriptor error");
         return NULL;
     }
 
     /* Now copy what we have collected and mapped */
     elem = virtqueue_alloc_element(sz, out_num, in_num);
-    elem->index = head;
+    elem->index = idx;
     for (i = 0; i < out_num; i++) {
         elem->out_sg[i] = iov[i];
     }
@@ -1954,6 +1926,48 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
         elem->in_sg[i] = iov[out_num + i];
     }
 
+    return elem;
+}
+
+void *
+vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+{
+    unsigned int head;
+    VuVirtqElement *elem;
+
+    if (unlikely(dev->broken) ||
+        unlikely(!vq->vring.avail)) {
+        return NULL;
+    }
+
+    if (vu_queue_empty(dev, vq)) {
+        return NULL;
+    }
+    /*
+     * Needed after virtio_queue_empty(), see comment in
+     * virtqueue_num_heads().
+     */
+    smp_rmb();
+
+    if (vq->inuse >= vq->vring.num) {
+        vu_panic(dev, "Virtqueue size exceeded");
+        return NULL;
+    }
+
+    if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
+        return NULL;
+    }
+
+    if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+        vring_set_avail_event(vq, vq->last_avail_idx);
+    }
+
+    elem = vu_queue_map_desc(dev, vq, head, sz);
+
+    if (!elem) {
+        return NULL;
+    }
+
     vq->inuse++;
 
     return elem;
-- 
2.17.1


* [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (2 preceding siblings ...)
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch adds support for the VHOST_USER_GET_INFLIGHT_FD and
VHOST_USER_SET_INFLIGHT_FD messages, which share a buffer between
the backend and qemu. The backend can then track inflight I/O in
this buffer.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 Makefile                              |   2 +-
 contrib/libvhost-user/libvhost-user.c | 300 ++++++++++++++++++++++++--
 contrib/libvhost-user/libvhost-user.h |  58 +++++
 3 files changed, 339 insertions(+), 21 deletions(-)

diff --git a/Makefile b/Makefile
index 3658310b95..8469bd94fb 100644
--- a/Makefile
+++ b/Makefile
@@ -477,7 +477,7 @@ Makefile: $(version-obj-y)
 # Build libraries
 
 libqemuutil.a: $(util-obj-y) $(trace-obj-y) $(stub-obj-y)
-libvhost-user.a: $(libvhost-user-obj-y)
+libvhost-user.a: $(libvhost-user-obj-y) $(util-obj-y) $(stub-obj-y)
 
 ######################################################################
 
diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index ea0f414b6d..c20850e890 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -41,6 +41,8 @@
 #endif
 
 #include "qemu/atomic.h"
+#include "qemu/osdep.h"
+#include "qemu/memfd.h"
 
 #include "libvhost-user.h"
 
@@ -53,6 +55,18 @@
             _min1 < _min2 ? _min1 : _min2; })
 #endif
 
+/* Round number down to multiple */
+#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
+
+/* Round number up to multiple */
+#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
+
+/* Align each region to cache line size in inflight buffer */
+#define INFLIGHT_ALIGNMENT 64
+
+/* The version of inflight buffer */
+#define INFLIGHT_VERSION 1
+
 #define VHOST_USER_HDR_SIZE offsetof(VhostUserMsg, payload.u64)
 
 /* The version of the protocol we support */
@@ -66,6 +80,20 @@
         }                                       \
     } while (0)
 
+static inline
+bool has_feature(uint64_t features, unsigned int fbit)
+{
+    assert(fbit < 64);
+    return !!(features & (1ULL << fbit));
+}
+
+static inline
+bool vu_has_feature(VuDev *dev,
+                    unsigned int fbit)
+{
+    return has_feature(dev->features, fbit);
+}
+
 static const char *
 vu_request_to_string(unsigned int req)
 {
@@ -100,6 +128,8 @@ vu_request_to_string(unsigned int req)
         REQ(VHOST_USER_POSTCOPY_ADVISE),
         REQ(VHOST_USER_POSTCOPY_LISTEN),
         REQ(VHOST_USER_POSTCOPY_END),
+        REQ(VHOST_USER_GET_INFLIGHT_FD),
+        REQ(VHOST_USER_SET_INFLIGHT_FD),
         REQ(VHOST_USER_MAX),
     };
 #undef REQ
@@ -890,6 +920,55 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
     return true;
 }
 
+static int
+vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
+{
+    int i = 0;
+
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    if (unlikely(!vq->inflight->version)) {
+        /* initialize the buffer */
+        vq->inflight->version = INFLIGHT_VERSION;
+        return 0;
+    }
+
+    vq->used_idx = vq->vring.used->idx;
+    vq->inflight_num = 0;
+
+    if (unlikely(vq->inflight->used_idx != vq->used_idx)) {
+        vq->inflight->desc[vq->inflight->process_head].inflight = 0;
+
+        barrier();
+
+        vq->inflight->used_idx = vq->used_idx;
+    }
+
+    for (i = 0; i < vq->inflight->desc_num; i++) {
+        if (vq->inflight->desc[i].inflight == 0) {
+            continue;
+        }
+
+        vq->inflight_desc[vq->inflight_num++] = i;
+        vq->inuse++;
+    }
+    vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
+
+    /* in case of I/O hang after reconnecting */
+    if (eventfd_write(vq->kick_fd, 1)) {
+        return -1;
+    }
+
+    return 0;
+}
+
 static bool
 vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -923,6 +1002,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
                dev->vq[index].kick_fd, index);
     }
 
+    if (vu_check_queue_inflights(dev, &dev->vq[index])) {
+        vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
+    }
+
     return false;
 }
 
@@ -995,6 +1078,11 @@ vu_set_vring_call_exec(VuDev *dev, VhostUserMsg *vmsg)
 
     dev->vq[index].call_fd = vmsg->fds[0];
 
+    /* in case of I/O hang after reconnecting */
+    if (eventfd_write(vmsg->fds[0], 1)) {
+        return -1;
+    }
+
     DPRINT("Got call_fd: %d for vq: %d\n", vmsg->fds[0], index);
 
     return false;
@@ -1209,6 +1297,116 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
     return true;
 }
 
+static inline uint64_t
+vu_inflight_queue_size(uint16_t queue_size)
+{
+    return ALIGN_UP(sizeof(VuDescStateSplit) * queue_size +
+           sizeof(uint16_t), INFLIGHT_ALIGNMENT);
+}
+
+static bool
+vu_get_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int fd;
+    void *addr;
+    uint64_t mmap_size;
+    uint16_t num_queues, queue_size;
+
+    if (vmsg->size != sizeof(vmsg->payload.inflight)) {
+        vu_panic(dev, "Invalid get_inflight_fd message:%d", vmsg->size);
+        vmsg->payload.inflight.mmap_size = 0;
+        return true;
+    }
+
+    num_queues = vmsg->payload.inflight.num_queues;
+    queue_size = vmsg->payload.inflight.queue_size;
+
+    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
+    DPRINT("set_inflight_fd queue_size: %"PRId16"\n", queue_size);
+
+    mmap_size = vu_inflight_queue_size(queue_size) * num_queues;
+
+    addr = qemu_memfd_alloc("vhost-inflight", mmap_size,
+                            F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+                            &fd, NULL);
+
+    if (!addr) {
+        vu_panic(dev, "Failed to alloc vhost inflight area");
+        vmsg->payload.inflight.mmap_size = 0;
+        return true;
+    }
+
+    memset(addr, 0, mmap_size);
+
+    dev->inflight_info.addr = addr;
+    dev->inflight_info.size = vmsg->payload.inflight.mmap_size = mmap_size;
+    dev->inflight_info.fd = vmsg->fds[0] = fd;
+    vmsg->fd_num = 1;
+    vmsg->payload.inflight.mmap_offset = 0;
+
+    DPRINT("send inflight mmap_size: %"PRId64"\n",
+           vmsg->payload.inflight.mmap_size);
+    DPRINT("send inflight mmap offset: %"PRId64"\n",
+           vmsg->payload.inflight.mmap_offset);
+
+    return true;
+}
+
+static bool
+vu_set_inflight_fd(VuDev *dev, VhostUserMsg *vmsg)
+{
+    int fd, i;
+    uint64_t mmap_size, mmap_offset;
+    uint16_t num_queues, queue_size;
+    void *rc;
+
+    if (vmsg->fd_num != 1 ||
+        vmsg->size != sizeof(vmsg->payload.inflight)) {
+        vu_panic(dev, "Invalid set_inflight_fd message size:%d fds:%d",
+                 vmsg->size, vmsg->fd_num);
+        return false;
+    }
+
+    fd = vmsg->fds[0];
+    mmap_size = vmsg->payload.inflight.mmap_size;
+    mmap_offset = vmsg->payload.inflight.mmap_offset;
+    num_queues = vmsg->payload.inflight.num_queues;
+    queue_size = vmsg->payload.inflight.queue_size;
+
+    DPRINT("set_inflight_fd mmap_size: %"PRId64"\n", mmap_size);
+    DPRINT("set_inflight_fd mmap_offset: %"PRId64"\n", mmap_offset);
+    DPRINT("set_inflight_fd num_queues: %"PRId16"\n", num_queues);
+    DPRINT("set_inflight_fd queue_size: %"PRId16"\n", queue_size);
+
+    rc = mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+              fd, mmap_offset);
+
+    if (rc == MAP_FAILED) {
+        vu_panic(dev, "set_inflight_fd mmap error: %s", strerror(errno));
+        return false;
+    }
+
+    if (dev->inflight_info.fd) {
+        close(dev->inflight_info.fd);
+    }
+
+    if (dev->inflight_info.addr) {
+        munmap(dev->inflight_info.addr, dev->inflight_info.size);
+    }
+
+    dev->inflight_info.fd = fd;
+    dev->inflight_info.addr = rc;
+    dev->inflight_info.size = mmap_size;
+
+    for (i = 0; i < num_queues; i++) {
+        dev->vq[i].inflight = (VuVirtqInflight *)rc;
+        dev->vq[i].inflight->desc_num = queue_size;
+        rc = (void *)((char *)rc + vu_inflight_queue_size(queue_size));
+    }
+
+    return false;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1286,6 +1484,10 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
         return vu_set_postcopy_listen(dev, vmsg);
     case VHOST_USER_POSTCOPY_END:
         return vu_set_postcopy_end(dev, vmsg);
+    case VHOST_USER_GET_INFLIGHT_FD:
+        return vu_get_inflight_fd(dev, vmsg);
+    case VHOST_USER_SET_INFLIGHT_FD:
+        return vu_set_inflight_fd(dev, vmsg);
     default:
         vmsg_close_fds(vmsg);
         vu_panic(dev, "Unhandled request: %d", vmsg->request);
@@ -1353,8 +1555,18 @@ vu_deinit(VuDev *dev)
             close(vq->err_fd);
             vq->err_fd = -1;
         }
+        vq->inflight = NULL;
+    }
+
+    if (dev->inflight_info.addr) {
+        munmap(dev->inflight_info.addr, dev->inflight_info.size);
+        dev->inflight_info.addr = NULL;
     }
 
+    if (dev->inflight_info.fd > 0) {
+        close(dev->inflight_info.fd);
+        dev->inflight_info.fd = -1;
+    }
 
     vu_close_log(dev);
     if (dev->slave_fd != -1) {
@@ -1681,20 +1893,6 @@ vu_queue_empty(VuDev *dev, VuVirtq *vq)
     return vring_avail_idx(vq) == vq->last_avail_idx;
 }
 
-static inline
-bool has_feature(uint64_t features, unsigned int fbit)
-{
-    assert(fbit < 64);
-    return !!(features & (1ULL << fbit));
-}
-
-static inline
-bool vu_has_feature(VuDev *dev,
-                    unsigned int fbit)
-{
-    return has_feature(dev->features, fbit);
-}
-
 static bool
 vring_notify(VuDev *dev, VuVirtq *vq)
 {
@@ -1823,12 +2021,6 @@ virtqueue_map_desc(VuDev *dev,
     *p_num_sg = num_sg;
 }
 
-/* Round number down to multiple */
-#define ALIGN_DOWN(n, m) ((n) / (m) * (m))
-
-/* Round number up to multiple */
-#define ALIGN_UP(n, m) ALIGN_DOWN((n) + (m) - 1, (m))
-
 static void *
 virtqueue_alloc_element(size_t sz,
                                      unsigned out_num, unsigned in_num)
@@ -1929,9 +2121,67 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
     return elem;
 }
 
+static int
+vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    vq->inflight->desc[desc_idx].inflight = 1;
+
+    return 0;
+}
+
+static int
+vu_queue_inflight_pre_put(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    vq->inflight->process_head = desc_idx;
+
+    return 0;
+}
+
+static int
+vu_queue_inflight_post_put(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+    if (!has_feature(dev->protocol_features,
+        VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (unlikely(!vq->inflight)) {
+        return -1;
+    }
+
+    barrier();
+
+    vq->inflight->desc[desc_idx].inflight = 0;
+
+    barrier();
+
+    vq->inflight->used_idx = vq->used_idx;
+
+    return 0;
+}
+
 void *
 vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 {
+    int i;
     unsigned int head;
     VuVirtqElement *elem;
 
@@ -1940,6 +2190,12 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
         return NULL;
     }
 
+    if (unlikely(vq->inflight_num > 0)) {
+        i = (--vq->inflight_num);
+        elem = vu_queue_map_desc(dev, vq, vq->inflight_desc[i], sz);
+        return elem;
+    }
+
     if (vu_queue_empty(dev, vq)) {
         return NULL;
     }
@@ -1970,6 +2226,8 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 
     vq->inuse++;
 
+    vu_queue_inflight_get(dev, vq, head);
+
     return elem;
 }
 
@@ -2114,5 +2372,7 @@ vu_queue_push(VuDev *dev, VuVirtq *vq,
               const VuVirtqElement *elem, unsigned int len)
 {
     vu_queue_fill(dev, vq, elem, len, 0);
+    vu_queue_inflight_pre_put(dev, vq, elem->index);
     vu_queue_flush(dev, vq, 1);
+    vu_queue_inflight_post_put(dev, vq, elem->index);
 }
diff --git a/contrib/libvhost-user/libvhost-user.h b/contrib/libvhost-user/libvhost-user.h
index 4aa55b4d2d..b1ca7fc5c1 100644
--- a/contrib/libvhost-user/libvhost-user.h
+++ b/contrib/libvhost-user/libvhost-user.h
@@ -53,6 +53,7 @@ enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_CONFIG = 9,
     VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
     VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 
     VHOST_USER_PROTOCOL_F_MAX
 };
@@ -91,6 +92,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_POSTCOPY_ADVISE  = 28,
     VHOST_USER_POSTCOPY_LISTEN  = 29,
     VHOST_USER_POSTCOPY_END     = 30,
+    VHOST_USER_GET_INFLIGHT_FD = 31,
+    VHOST_USER_SET_INFLIGHT_FD = 32,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -138,6 +141,13 @@ typedef struct VhostUserVringArea {
     uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserInflight {
+    uint64_t mmap_size;
+    uint64_t mmap_offset;
+    uint16_t num_queues;
+    uint16_t queue_size;
+} VhostUserInflight;
+
 #if defined(_WIN32)
 # define VU_PACKED __attribute__((gcc_struct, packed))
 #else
@@ -163,6 +173,7 @@ typedef struct VhostUserMsg {
         VhostUserLog log;
         VhostUserConfig config;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
     } payload;
 
     int fds[VHOST_MEMORY_MAX_NREGIONS];
@@ -234,9 +245,49 @@ typedef struct VuRing {
     uint32_t flags;
 } VuRing;
 
+typedef struct VuDescStateSplit {
+    /* Indicate whether this descriptor is inflight or not.
+     * Only available for head-descriptor. */
+    uint8_t inflight;
+
+    /* Padding */
+    uint8_t padding;
+
+    /* Link to the last processed entry */
+    uint16_t next;
+} VuDescStateSplit;
+
+typedef struct VuVirtqInflight {
+    /* The feature flags of this region. Now it's initialized to 0. */
+    uint64_t features;
+
+    /* The version of this region. It's 1 currently.
+     * Zero value indicates a vm reset happened. */
+    uint16_t version;
+
+    /* The size of VuDescStateSplit array. It's equal to the virtqueue
+     * size. Slave could get it from queue size field of VhostUserInflight. */
+    uint16_t desc_num;
+
+    /* The head of processed VuDescStateSplit entry list */
+    uint16_t process_head;
+
+    /* Storing the idx value of used ring */
+    uint16_t used_idx;
+
+    /* Used to track the state of each descriptor in descriptor table */
+    VuDescStateSplit desc[0];
+} VuVirtqInflight;
+
 typedef struct VuVirtq {
     VuRing vring;
 
+    VuVirtqInflight *inflight;
+
+    uint16_t inflight_desc[VIRTQUEUE_MAX_SIZE];
+
+    uint16_t inflight_num;
+
     /* Next head to pop */
     uint16_t last_avail_idx;
 
@@ -279,11 +330,18 @@ typedef void (*vu_set_watch_cb) (VuDev *dev, int fd, int condition,
                                  vu_watch_cb cb, void *data);
 typedef void (*vu_remove_watch_cb) (VuDev *dev, int fd);
 
+typedef struct VuDevInflightInfo {
+    int fd;
+    void *addr;
+    uint64_t size;
+} VuDevInflightInfo;
+
 struct VuDev {
     int sock;
     uint32_t nregions;
     VuDevRegion regions[VHOST_MEMORY_MAX_NREGIONS];
     VuVirtq vq[VHOST_MAX_NR_VIRTQUEUE];
+    VuDevInflightInfo inflight_info;
     int log_call_fd;
     int slave_fd;
     uint64_t log_size;
-- 
2.17.1


* [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (3 preceding siblings ...)
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend elohimes
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch adds support for the vhost-user-blk device to get/set
the inflight buffer from/to the backend.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 hw/block/vhost-user-blk.c          | 28 ++++++++++++++++++++++++++++
 include/hw/virtio/vhost-user-blk.h |  1 +
 2 files changed, 29 insertions(+)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 44ac814016..9682df1a7b 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -128,6 +128,21 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
     }
 
     s->dev.acked_features = vdev->guest_features;
+
+    if (!s->inflight->addr) {
+        ret = vhost_dev_get_inflight(&s->dev, s->queue_size, s->inflight);
+        if (ret < 0) {
+            error_report("Error get inflight: %d", -ret);
+            goto err_guest_notifiers;
+        }
+    }
+
+    ret = vhost_dev_set_inflight(&s->dev, s->inflight);
+    if (ret < 0) {
+        error_report("Error set inflight: %d", -ret);
+        goto err_guest_notifiers;
+    }
+
     ret = vhost_dev_start(&s->dev, vdev);
     if (ret < 0) {
         error_report("Error starting vhost: %d", -ret);
@@ -249,6 +264,13 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     }
 }
 
+static void vhost_user_blk_reset(VirtIODevice *vdev)
+{
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    vhost_dev_free_inflight(s->inflight);
+}
+
 static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -289,6 +311,8 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
                          vhost_user_blk_handle_output);
     }
 
+    s->inflight = g_new0(struct vhost_inflight, 1);
+
     s->dev.nvqs = s->num_queues;
     s->dev.vqs = g_new(struct vhost_virtqueue, s->dev.nvqs);
     s->dev.vq_index = 0;
@@ -321,6 +345,7 @@ vhost_err:
     vhost_dev_cleanup(&s->dev);
 virtio_err:
     g_free(vqs);
+    g_free(s->inflight);
     virtio_cleanup(vdev);
 
     vhost_user_cleanup(user);
@@ -336,7 +361,9 @@ static void vhost_user_blk_device_unrealize(DeviceState *dev, Error **errp)
 
     vhost_user_blk_set_status(vdev, 0);
     vhost_dev_cleanup(&s->dev);
+    vhost_dev_free_inflight(s->inflight);
     g_free(vqs);
+    g_free(s->inflight);
     virtio_cleanup(vdev);
 
     if (s->vhost_user) {
@@ -386,6 +413,7 @@ static void vhost_user_blk_class_init(ObjectClass *klass, void *data)
     vdc->set_config = vhost_user_blk_set_config;
     vdc->get_features = vhost_user_blk_get_features;
     vdc->set_status = vhost_user_blk_set_status;
+    vdc->reset = vhost_user_blk_reset;
 }
 
 static const TypeInfo vhost_user_blk_info = {
diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
index d52944aeeb..445516604a 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -36,6 +36,7 @@ typedef struct VHostUserBlk {
     uint32_t queue_size;
     uint32_t config_wce;
     struct vhost_dev dev;
+    struct vhost_inflight *inflight;
     VhostUserState *vhost_user;
 } VHostUserBlk;
 
-- 
2.17.1


* [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (4 preceding siblings ...)
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
  2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

Since we now support the messages VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD, the backend is able to restart
safely because it can track inflight I/O in shared memory.
This patch allows qemu to reconnect to the backend after the
connection is closed.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Ni Xun <nixun@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 hw/block/vhost-user-blk.c          | 205 +++++++++++++++++++++++------
 include/hw/virtio/vhost-user-blk.h |   4 +
 2 files changed, 167 insertions(+), 42 deletions(-)
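
For context, reconnecting relies on handling chardev connect and
disconnect events. A minimal sketch of such an event handler follows
(illustrative only, built on the generic qemu chardev event constants
rather than this patch's exact code):

static void example_blk_event(void *opaque, int event)
{
    VHostUserBlk *s = opaque;

    switch (event) {
    case CHR_EVENT_OPENED:
        /* Connection (re)established: set s->connected and
         * reinitialize the vhost device. Inflight I/O is recovered
         * from the shared buffer once the device starts again. */
        break;
    case CHR_EVENT_CLOSED:
        /* Backend went away: stop vhost and clear s->connected.
         * With reconnect=N on the chardev, qemu keeps retrying the
         * connection in the background. */
        break;
    }
}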

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 9682df1a7b..539ea2e571 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -103,7 +103,7 @@ const VhostDevConfigOps blk_ops = {
     .vhost_dev_config_notifier = vhost_user_blk_handle_config_change,
 };
 
-static void vhost_user_blk_start(VirtIODevice *vdev)
+static int vhost_user_blk_start(VirtIODevice *vdev)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
@@ -112,13 +112,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
 
     if (!k->set_guest_notifiers) {
         error_report("binding does not support guest notifiers");
-        return;
+        return -ENOSYS;
     }
 
     ret = vhost_dev_enable_notifiers(&s->dev, vdev);
     if (ret < 0) {
         error_report("Error enabling host notifiers: %d", -ret);
-        return;
+        return ret;
     }
 
     ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, true);
@@ -157,12 +157,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
         vhost_virtqueue_mask(&s->dev, vdev, i, false);
     }
 
-    return;
+    return ret;
 
 err_guest_notifiers:
     k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
 err_host_notifiers:
     vhost_dev_disable_notifiers(&s->dev, vdev);
+    return ret;
 }
 
 static void vhost_user_blk_stop(VirtIODevice *vdev)
@@ -181,7 +182,6 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
     ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
     if (ret < 0) {
         error_report("vhost guest notifier cleanup failed: %d", ret);
-        return;
     }
 
     vhost_dev_disable_notifiers(&s->dev, vdev);
@@ -191,21 +191,43 @@ static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     bool should_start = status & VIRTIO_CONFIG_S_DRIVER_OK;
+    int ret;
 
     if (!vdev->vm_running) {
         should_start = false;
     }
 
-    if (s->dev.started == should_start) {
+    if (s->should_start == should_start) {
+        return;
+    }
+
+    if (!s->connected || s->dev.started == should_start) {
+        s->should_start = should_start;
         return;
     }
 
     if (should_start) {
-        vhost_user_blk_start(vdev);
+        s->should_start = true;
+        /*
+         * make sure vhost_user_blk_handle_output() ignores fake
+         * guest kick by vhost_dev_enable_notifiers()
+         */
+        barrier();
+        ret = vhost_user_blk_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-blk: vhost start failed: %s",
+                         strerror(-ret));
+            qemu_chr_fe_disconnect(&s->chardev);
+        }
     } else {
         vhost_user_blk_stop(vdev);
+        /*
+         * make sure vhost_user_blk_handle_output() ignores fake
+         * guest kick by vhost_dev_disable_notifiers()
+         */
+        barrier();
+        s->should_start = false;
     }
-
 }
 
 static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
@@ -237,13 +259,22 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
 static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
-    int i;
+    int i, ret;
 
     if (!(virtio_host_has_feature(vdev, VIRTIO_F_VERSION_1) &&
         !virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1))) {
         return;
     }
 
+    if (s->should_start) {
+        return;
+    }
+    s->should_start = true;
+
+    if (!s->connected) {
+        return;
+    }
+
     if (s->dev.started) {
         return;
     }
@@ -251,7 +282,13 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
      * vhost here instead of waiting for .set_status().
      */
-    vhost_user_blk_start(vdev);
+    ret = vhost_user_blk_start(vdev);
+    if (ret < 0) {
+        error_report("vhost-user-blk: vhost start failed: %s",
+                     strerror(-ret));
+        qemu_chr_fe_disconnect(&s->chardev);
+        return;
+    }
 
     /* Kick right away to begin processing requests already in vring */
     for (i = 0; i < s->dev.nvqs; i++) {
@@ -271,13 +308,106 @@ static void vhost_user_blk_reset(VirtIODevice *vdev)
     vhost_dev_free_inflight(s->inflight);
 }
 
+static int vhost_user_blk_connect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+    int ret = 0;
+
+    if (s->connected) {
+        return 0;
+    }
+    s->connected = true;
+
+    s->dev.nvqs = s->num_queues;
+    s->dev.vqs = s->vqs;
+    s->dev.vq_index = 0;
+    s->dev.backend_features = 0;
+
+    vhost_dev_set_config_notifier(&s->dev, &blk_ops);
+
+    ret = vhost_dev_init(&s->dev, s->vhost_user, VHOST_BACKEND_TYPE_USER, 0);
+    if (ret < 0) {
+        error_report("vhost-user-blk: vhost initialization failed: %s",
+                     strerror(-ret));
+        return ret;
+    }
+
+    /* restore vhost state */
+    if (s->should_start) {
+        ret = vhost_user_blk_start(vdev);
+        if (ret < 0) {
+            error_report("vhost-user-blk: vhost start failed: %s",
+                         strerror(-ret));
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void vhost_user_blk_disconnect(DeviceState *dev)
+{
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    if (!s->connected) {
+        return;
+    }
+    s->connected = false;
+
+    if (s->dev.started) {
+        vhost_user_blk_stop(vdev);
+    }
+
+    vhost_dev_cleanup(&s->dev);
+}
+
+static gboolean vhost_user_blk_watch(GIOChannel *chan, GIOCondition cond,
+                                     void *opaque)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    qemu_chr_fe_disconnect(&s->chardev);
+
+    return true;
+}
+
+static void vhost_user_blk_event(void *opaque, int event)
+{
+    DeviceState *dev = opaque;
+    VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+    VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+    switch (event) {
+    case CHR_EVENT_OPENED:
+        if (vhost_user_blk_connect(dev) < 0) {
+            qemu_chr_fe_disconnect(&s->chardev);
+            return;
+        }
+        s->watch = qemu_chr_fe_add_watch(&s->chardev, G_IO_HUP,
+                                         vhost_user_blk_watch, dev);
+        break;
+    case CHR_EVENT_CLOSED:
+        vhost_user_blk_disconnect(dev);
+        if (s->watch) {
+            g_source_remove(s->watch);
+            s->watch = 0;
+        }
+        break;
+    }
+}
+
+
 static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserBlk *s = VHOST_USER_BLK(vdev);
     VhostUserState *user;
-    struct vhost_virtqueue *vqs = NULL;
     int i, ret;
+    Error *err = NULL;
 
     if (!s->chardev.chr) {
         error_setg(errp, "vhost-user-blk: chardev is mandatory");
@@ -312,27 +442,28 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     s->inflight = g_new0(struct vhost_inflight, 1);
-
-    s->dev.nvqs = s->num_queues;
-    s->dev.vqs = g_new(struct vhost_virtqueue, s->dev.nvqs);
-    s->dev.vq_index = 0;
-    s->dev.backend_features = 0;
-    vqs = s->dev.vqs;
-
-    vhost_dev_set_config_notifier(&s->dev, &blk_ops);
-
-    ret = vhost_dev_init(&s->dev, s->vhost_user, VHOST_BACKEND_TYPE_USER, 0);
-    if (ret < 0) {
-        error_setg(errp, "vhost-user-blk: vhost initialization failed: %s",
-                   strerror(-ret));
-        goto virtio_err;
-    }
+    s->vqs = g_new(struct vhost_virtqueue, s->num_queues);
+    s->watch = 0;
+    s->should_start = false;
+    s->connected = false;
+
+    qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, vhost_user_blk_event,
+                             NULL, (void *)dev, NULL, true);
+
+reconnect:
+    do {
+        if (qemu_chr_fe_wait_connected(&s->chardev, &err) < 0) {
+            error_report_err(err);
+            err = NULL;
+            sleep(1);
+        }
+    } while (!s->connected);
 
     ret = vhost_dev_get_config(&s->dev, (uint8_t *)&s->blkcfg,
-                              sizeof(struct virtio_blk_config));
+                               sizeof(struct virtio_blk_config));
     if (ret < 0) {
-        error_setg(errp, "vhost-user-blk: get block config failed");
-        goto vhost_err;
+        error_report("vhost-user-blk: get block config failed");
+        goto reconnect;
     }
 
     if (s->blkcfg.num_queues != s->num_queues) {
@@ -340,29 +471,19 @@ static void vhost_user_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     return;
-
-vhost_err:
-    vhost_dev_cleanup(&s->dev);
-virtio_err:
-    g_free(vqs);
-    g_free(s->inflight);
-    virtio_cleanup(vdev);
-
-    vhost_user_cleanup(user);
-    g_free(user);
-    s->vhost_user = NULL;
 }
 
 static void vhost_user_blk_device_unrealize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VHostUserBlk *s = VHOST_USER_BLK(dev);
-    struct vhost_virtqueue *vqs = s->dev.vqs;
 
     vhost_user_blk_set_status(vdev, 0);
+    qemu_chr_fe_set_handlers(&s->chardev,  NULL, NULL, NULL,
+                             NULL, NULL, NULL, false);
     vhost_dev_cleanup(&s->dev);
     vhost_dev_free_inflight(s->inflight);
-    g_free(vqs);
+    g_free(s->vqs);
     g_free(s->inflight);
     virtio_cleanup(vdev);
 
diff --git a/include/hw/virtio/vhost-user-blk.h b/include/hw/virtio/vhost-user-blk.h
index 445516604a..4849aa5eb5 100644
--- a/include/hw/virtio/vhost-user-blk.h
+++ b/include/hw/virtio/vhost-user-blk.h
@@ -38,6 +38,10 @@ typedef struct VHostUserBlk {
     struct vhost_dev dev;
     struct vhost_inflight *inflight;
     VhostUserState *vhost_user;
+    struct vhost_virtqueue *vqs;
+    guint watch;
+    bool should_start;
+    bool connected;
 } VHostUserBlk;
 
 #endif
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (5 preceding siblings ...)
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend elohimes
@ 2019-02-18 10:27 ` elohimes
  2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
  7 siblings, 0 replies; 18+ messages in thread
From: elohimes @ 2019-02-18 10:27 UTC (permalink / raw)
  To: mst, stefanha, marcandre.lureau, berrange, jasowang,
	maxime.coquelin, yury-kotov, wrfsh
  Cc: qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

From: Xie Yongji <xieyongji@baidu.com>

This patch enables inflight I/O tracking for the
vhost-user-blk backend so that it can be restarted safely.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 contrib/vhost-user-blk/vhost-user-blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/vhost-user-blk/vhost-user-blk.c b/contrib/vhost-user-blk/vhost-user-blk.c
index 43583f2659..86a3987744 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -398,7 +398,8 @@ vub_get_features(VuDev *dev)
 static uint64_t
 vub_get_protocol_features(VuDev *dev)
 {
-    return 1ull << VHOST_USER_PROTOCOL_F_CONFIG;
+    return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
+           1ull << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
 }
 
 static int
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
                   ` (6 preceding siblings ...)
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
@ 2019-02-20 19:59 ` Michael S. Tsirkin
  2019-02-21  1:30   ` Yongji Xie
  7 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-02-20 19:59 UTC (permalink / raw)
  To: elohimes
  Cc: stefanha, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh, qemu-devel, zhangyu31, chaiwen, nixun,
	lilin24, Xie Yongji

On Mon, Feb 18, 2019 at 06:27:41PM +0800, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
> 
> This patchset is aimed at supporting qemu to reconnect
> vhost-user-blk backend after vhost-user-blk backend crash or
> restart.
> 
> The patch 1 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD to support transferring shared
> buffer between qemu and backend.
> 
> The patch 2 deletes some redundant check in contrib/libvhost-user.c.
> 
> The patch 3,4 are the corresponding libvhost-user patches of
> patch 1. Make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD.
> 
> The patch 5 allows vhost-user-blk to use the two new messages
> to get/set inflight buffer from/to backend.
> 
> The patch 6 supports vhost-user-blk to reconnect backend when
> connection closed.
> 
> The patch 7 introduces VHOST_USER_PROTOCOL_F_SLAVE_SHMFD
> to vhost-user-blk backend which is used to tell qemu that
> we support reconnecting now.
> 
> To use it, we could start qemu with:
> 
> qemu-system-x86_64 \
>         -chardev socket,id=char0,path=/path/vhost.socket,reconnect=1, \
>         -device vhost-user-blk-pci,chardev=char0 \
> 
> and start vhost-user-blk backend with:
> 
> vhost-user-blk -b /path/file -s /path/vhost.socket
> 
> Then we can restart vhost-user-blk at any time during VM running.

Sorry, is elohimes@gmail.com also an address that belongs to
Xie Yongji?

If not, we need a Signed-off-by from that address's
owner as well.

Thanks!



> V5 to V6:
> - Document the layout in inflight buffer for packed virtqueue
> - Rework the layout in inflight buffer for split virtqueue
> - Remove version field in VhostUserInflight
> - Add a patch to remove some redundant check in
>   contrib/libvhost-user.c
> - Document more details in vhost-user.txt
> 
> V4 to V5:
> - Drop patch that enables "nowait" option on client sockets
> - Support resubmitting inflight I/O in order
> - Make inflight I/O tracking more robust
> - Remove align field and add queue size field in VhostUserInflight
> - Document more details in vhost-user.txt
> 
> V3 to V4:
> - Drop messages VHOST_USER_GET_SHM_SIZE and VHOST_USER_SET_SHM_FD
> - Introduce two new messages VHOST_USER_GET_INFLIGHT_FD
>   and VHOST_USER_SET_INFLIGHT_FD
> - Allocate inflight buffer in backend rather than in qemu
> - Document a recommended format for inflight buffer
> 
> V2 to V3:
> - Use existing wait/nowait options to control connection on
>   client sockets instead of introducing "disconnected" option.
> - Support the case that vhost-user backend restarts during initialization
>   of vhost-user-blk device.
> 
> V1 to V2:
> - Introduce "disconnected" option for chardev instead of reuse "wait"
>   option
> - Support the case that QEMU starts before vhost-user backend
> - Drop message VHOST_USER_SET_VRING_INFLIGHT
> - Introduce two new messages VHOST_USER_GET_SHM_SIZE
>   and VHOST_USER_SET_SHM_FD
> 
> Xie Yongji (7):
>   vhost-user: Support transferring inflight buffer between qemu and
>     backend
>   libvhost-user: Remove unnecessary FD flag check for event file
>     descriptors
>   libvhost-user: Introduce vu_queue_map_desc()
>   libvhost-user: Support tracking inflight I/O in shared memory
>   vhost-user-blk: Add support to get/set inflight buffer
>   vhost-user-blk: Add support to reconnect backend
>   contrib/vhost-user-blk: enable inflight I/O tracking
> 
>  Makefile                                |   2 +-
>  contrib/libvhost-user/libvhost-user.c   | 400 ++++++++++++++++++++----
>  contrib/libvhost-user/libvhost-user.h   |  58 ++++
>  contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
>  docs/interop/vhost-user.txt             | 264 ++++++++++++++++
>  hw/block/vhost-user-blk.c               | 229 +++++++++++---
>  hw/virtio/vhost-user.c                  | 107 +++++++
>  hw/virtio/vhost.c                       |  96 ++++++
>  include/hw/virtio/vhost-backend.h       |  10 +
>  include/hw/virtio/vhost-user-blk.h      |   5 +
>  include/hw/virtio/vhost.h               |  18 ++
>  11 files changed, 1084 insertions(+), 108 deletions(-)
> 
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting
  2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
@ 2019-02-21  1:30   ` Yongji Xie
  0 siblings, 0 replies; 18+ messages in thread
From: Yongji Xie @ 2019-02-21  1:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Thu, 21 Feb 2019 at 04:00, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Feb 18, 2019 at 06:27:41PM +0800, elohimes@gmail.com wrote:
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > This patchset is aimed at supporting qemu to reconnect
> > vhost-user-blk backend after vhost-user-blk backend crash or
> > restart.
> >
> > The patch 1 introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support transferring shared
> > buffer between qemu and backend.
> >
> > The patch 2 deletes some redundant check in contrib/libvhost-user.c.
> >
> > The patch 3,4 are the corresponding libvhost-user patches of
> > patch 1. Make libvhost-user support VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD.
> >
> > The patch 5 allows vhost-user-blk to use the two new messages
> > to get/set inflight buffer from/to backend.
> >
> > The patch 6 supports vhost-user-blk to reconnect backend when
> > connection closed.
> >
> > The patch 7 introduces VHOST_USER_PROTOCOL_F_SLAVE_SHMFD
> > to vhost-user-blk backend which is used to tell qemu that
> > we support reconnecting now.
> >
> > To use it, we could start qemu with:
> >
> > qemu-system-x86_64 \
> >         -chardev socket,id=char0,path=/path/vhost.socket,reconnect=1, \
> >         -device vhost-user-blk-pci,chardev=char0 \
> >
> > and start vhost-user-blk backend with:
> >
> > vhost-user-blk -b /path/file -s /path/vhost.socket
> >
> > Then we can restart vhost-user-blk at any time during VM running.
>
> Sorry is elohimes@gmail.com also an address that belongs to
> Xie Yongji?
>

Yes, that's also my email address.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
@ 2019-02-21 17:27   ` Michael S. Tsirkin
  2019-02-22  2:47     ` Yongji Xie
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-02-21 17:27 UTC (permalink / raw)
  To: elohimes
  Cc: stefanha, marcandre.lureau, berrange, jasowang, maxime.coquelin,
	yury-kotov, wrfsh, qemu-devel, zhangyu31, chaiwen, nixun,
	lilin24, Xie Yongji

On Mon, Feb 18, 2019 at 06:27:42PM +0800, elohimes@gmail.com wrote:
> From: Xie Yongji <xieyongji@baidu.com>
> 
> This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
> buffer between qemu and backend.
> 
> Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
> shared buffer from backend. Then qemu should send it back
> through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.
> 
> This shared buffer is used to track inflight I/O by backend.
> Qemu should retrieve a new one when the VM is reset.
> 
> Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> Signed-off-by: Chai Wen <chaiwen@baidu.com>
> Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> ---
>  docs/interop/vhost-user.txt       | 264 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost-user.c            | 107 ++++++++++++
>  hw/virtio/vhost.c                 |  96 +++++++++++
>  include/hw/virtio/vhost-backend.h |  10 ++
>  include/hw/virtio/vhost.h         |  18 ++
>  5 files changed, 495 insertions(+)
> 
> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> index c2194711d9..61c6d0e415 100644
> --- a/docs/interop/vhost-user.txt
> +++ b/docs/interop/vhost-user.txt
> @@ -142,6 +142,17 @@ Depending on the request type, payload can be:
>     Offset: a 64-bit offset of this area from the start of the
>         supplied file descriptor
>  
> + * Inflight description
> +   -----------------------------------------------------
> +   | mmap size | mmap offset | num queues | queue size |
> +   -----------------------------------------------------
> +
> +   mmap size: a 64-bit size of area to track inflight I/O
> +   mmap offset: a 64-bit offset of this area from the start
> +                of the supplied file descriptor
> +   num queues: a 16-bit number of virtqueues
> +   queue size: a 16-bit size of virtqueues
> +
>  In QEMU the vhost-user message is implemented with the following struct:
>  
>  typedef struct VhostUserMsg {
> @@ -157,6 +168,7 @@ typedef struct VhostUserMsg {
>          struct vhost_iotlb_msg iotlb;
>          VhostUserConfig config;
>          VhostUserVringArea area;
> +        VhostUserInflight inflight;
>      };
>  } QEMU_PACKED VhostUserMsg;
>  
> @@ -175,6 +187,7 @@ the ones that do:
>   * VHOST_USER_GET_PROTOCOL_FEATURES
>   * VHOST_USER_GET_VRING_BASE
>   * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
> + * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
>  
>  [ Also see the section on REPLY_ACK protocol extension. ]
>  
> @@ -188,6 +201,7 @@ in the ancillary data:
>   * VHOST_USER_SET_VRING_CALL
>   * VHOST_USER_SET_VRING_ERR
>   * VHOST_USER_SET_SLAVE_REQ_FD
> + * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
>  
>  If Master is unable to send the full message or receives a wrong reply it will
>  close the connection. An optional reconnection mechanism can be implemented.
> @@ -382,6 +396,235 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
>  slave can send file descriptors (at most 8 descriptors in each message)
>  to master via ancillary data using this fd communication channel.
>  
> +Inflight I/O tracking
> +---------------------
> +
> +To support reconnecting after restart or crash, slave may need to resubmit
> +inflight I/Os. If virtqueue is processed in order, we can easily achieve
> +that by getting the inflight descriptors from descriptor table (split virtqueue)
> +or descriptor ring (packed virtqueue). However, it can't work when we process
> +descriptors out-of-order because some entries which store the information of
> +inflight descriptors in available ring (split virtqueue) or descriptor
> +ring (packed virtqueue) might be overridden by new entries. To solve this
> +problem, the slave needs to allocate an extra buffer to store the information of inflight
> +descriptors and share it with the master for persistence. VHOST_USER_GET_INFLIGHT_FD and
> +VHOST_USER_SET_INFLIGHT_FD are used to transfer this buffer between master
> +and slave. The format of this buffer is described below:
> +
> +-------------------------------------------------------
> +| queue0 region | queue1 region | ... | queueN region |
> +-------------------------------------------------------
> +
> +N is the number of available virtqueues. Slave could get it from num queues
> +field of VhostUserInflight.
> +
> +For split virtqueue, queue region can be implemented as:
> +
> +typedef struct DescStateSplit {
> +    /* Indicate whether this descriptor is inflight or not.
> +     * Only available for head-descriptor. */
> +    uint8_t inflight;
> +
> +    /* Padding */
> +    uint8_t padding;
> +
> +    /* Link to the last processed entry */
> +    uint16_t next;
> +} DescStateSplit;
> +
> +typedef struct QueueRegionSplit {
> +    /* The feature flags of this region. Now it's initialized to 0. */
> +    uint64_t features;
> +
> +    /* The version of this region. It's 1 currently.
> +     * Zero value indicates an uninitialized buffer */
> +    uint16_t version;
> +
> +    /* The size of DescStateSplit array. It's equal to the virtqueue
> +     * size. Slave could get it from queue size field of VhostUserInflight. */
> +    uint16_t desc_num;
> +
> +    /* The head of processed DescStateSplit entry list */
> +    uint16_t process_head;
> +
> +    /* Storing the idx value of used ring */
> +    uint16_t used_idx;
> +
> +    /* Used to track the state of each descriptor in descriptor table */
> +    DescStateSplit desc[0];
> +} QueueRegionSplit;


What is the endianness of multibyte fields?


> +
> +To track inflight I/O, the queue region should be processed as follows:
> +
> +When receiving available buffers from the driver:
> +
> +    1. Get the next available head-descriptor index from available ring, i
> +
> +    2. Set desc[i].inflight to 1
> +
> +When supplying used buffers to the driver:
> +
> +    1. Get corresponding used head-descriptor index, i
> +
> +    2. Set desc[i].next to process_head
> +
> +    3. Set process_head to i
> +
> +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> +
> +    5. Increase the idx value of used ring by the size of the batch
> +
> +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> +
> +    7. Set used_idx to the idx value of used ring
> +
> +When reconnecting:
> +
> +    1. If the value of used_idx does not match the idx value of used ring,
> +
> +        (a) Subtract the value of used_idx from the idx value of used ring to get
> +        the number of in-progress DescStateSplit entries
> +
> +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> +        start from process_head to 0
> +
> +        (c) Set used_idx to the idx value of used ring
> +
> +    2. Resubmit each inflight DescStateSplit entry

I re-read it a couple of times and I still don't understand what it says.

For simplicity consider split ring. So we want a list of heads that are
outstanding. Fair enough. Now the device finishes a head. What now? It needs
to drop the head from the list. But the list is unidirectional (just next, no
prev). So how can you drop an entry from the middle?


> +For packed virtqueue, queue region can be implemented as:
> +
> +typedef struct DescStatePacked {
> +    /* Indicate whether this descriptor is inflight or not.
> +     * Only available for head-descriptor. */
> +    uint8_t inflight;
> +
> +    /* Padding */
> +    uint8_t padding;
> +
> +    /* Link to the next free entry */
> +    uint16_t next;
> +
> +    /* Link to the last entry of descriptor list.
> +     * Only available for head-descriptor. */
> +    uint16_t last;
> +
> +    /* The length of descriptor list.
> +     * Only available for head-descriptor. */
> +    uint16_t num;
> +
> +    /* The buffer id */
> +    uint16_t id;
> +
> +    /* The descriptor flags */
> +    uint16_t flags;
> +
> +    /* The buffer length */
> +    uint32_t len;
> +
> +    /* The buffer address */
> +    uint64_t addr;

Do we want an extra u64 here to make it a power of two?


> +} DescStatePacked;
> +
> +typedef struct QueueRegionPacked {
> +    /* The feature flags of this region. Now it's initialized to 0. */
> +    uint64_t features;
> +
> +    /* The version of this region. It's 1 currently.
> +     * Zero value indicates an uninitialized buffer */
> +    uint16_t version;
> +
> +    /* The size of DescStatePacked array. It's equal to the virtqueue
> +     * size. Slave could get it from queue size field of VhostUserInflight. */
> +    uint16_t desc_num;
> +
> +    /* The head of free DescStatePacked entry list */
> +    uint16_t free_head;
> +
> +    /* The old head of free DescStatePacked entry list */
> +    uint16_t old_free_head;
> +
> +    /* The used index of descriptor ring */
> +    uint16_t used_idx;
> +
> +    /* The old used index of descriptor ring */
> +    uint16_t old_used_idx;
> +
> +    /* Device ring wrap counter */
> +    uint8_t used_wrap_counter;
> +
> +    /* The old device ring wrap counter */
> +    uint8_t old_used_wrap_counter;
> +
> +    /* Padding */
> +    uint8_t padding[7];
> +
> +    /* Used to track the state of each descriptor fetched from descriptor ring */
> +    DescStatePacked desc[0];
> +} QueueRegionPacked;
> +
> +To track inflight I/O, the queue region should be processed as follows:
> +
> +When receiving available buffers from the driver:
> +
> +    1. Get the next available descriptor entry from descriptor ring, d
> +
> +    2. If d is head descriptor,
> +
> +        (a) Set desc[old_free_head].num to 0
> +
> +        (b) Set desc[old_free_head].inflight to 1
> +
> +    3. If d is last descriptor, set desc[old_free_head].last to free_head
> +
> +    4. Increase desc[old_free_head].num by 1
> +
> +    5. Set desc[free_head].addr, desc[free_head].len, desc[free_head].flags,
> +    desc[free_head].id to d.addr, d.len, d.flags, d.id
> +
> +    6. Set free_head to desc[free_head].next
> +
> +    7. If d is last descriptor, set old_free_head to free_head
> +
> +When supplying used buffers to the driver:
> +
> +    1. Get corresponding used head-descriptor entry from descriptor ring, d
> +
> +    2. Get corresponding DescStatePacked entry, e
> +
> +    3. Set desc[e.last].next to free_head
> +
> +    4. Set free_head to the index of e
> +
> +    5. Steps 1,2,3,4 may be performed repeatedly if batching is possible
> +
> +    6. Increase used_idx by the size of the batch and update used_wrap_counter if needed
> +
> +    7. Update d.flags
> +
> +    8. Set the inflight field of each head DescStatePacked entry in the batch to 0
> +
> +    9. Set old_free_head, old_used_idx, old_used_wrap_counter to free_head, used_idx,
> +    used_wrap_counter
> +
> +When reconnecting:
> +
> +    1. If used_idx does not match old_used_idx,
> +
> +        (a) Get the next descriptor ring entry through old_used_idx, d
> +
> +        (b) Use old_used_wrap_counter to calculate the available flags
> +
> +        (c) If d.flags is not equal to the calculated flags value, set old_free_head,
> +        old_used_idx, old_used_wrap_counter to free_head, used_idx, used_wrap_counter
> +
> +    2. Set free_head, used_idx, used_wrap_counter to old_free_head, old_used_idx,
> +    old_used_wrap_counter
> +
> +    3. Set the inflight field of each free DescStatePacked entry to 0
> +
> +    4. Resubmit each inflight DescStatePacked entry
> +
>  Protocol features
>  -----------------
>  
> @@ -397,6 +640,7 @@ Protocol features
>  #define VHOST_USER_PROTOCOL_F_CONFIG         9
>  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD  10
>  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER  11
> +#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
>  
>  Master message types
>  --------------------
> @@ -761,6 +1005,26 @@ Master message types
>        was previously sent.
>        The value returned is an error indication; 0 is success.
>  
> + * VHOST_USER_GET_INFLIGHT_FD
> +      Id: 31
> +      Equivalent ioctl: N/A
> +      Master payload: inflight description
> +
> +      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
> +      successfully negotiated, this message is submitted by master to get
> +      a shared buffer from slave. The shared buffer will be used to track
> +      inflight I/O by slave. QEMU should retrieve a new one when the VM is reset.
> +
> + * VHOST_USER_SET_INFLIGHT_FD
> +      Id: 32
> +      Equivalent ioctl: N/A
> +      Master payload: inflight description
> +
> +      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
> +      successfully negotiated, this message is submitted by master to send
> +      the shared inflight buffer back to slave so that slave could get
> +      inflight I/O after a crash or restart.
> +
>  Slave message types
>  -------------------
>  
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index 564a31d12c..21a81998ba 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -52,6 +52,7 @@ enum VhostUserProtocolFeature {
>      VHOST_USER_PROTOCOL_F_CONFIG = 9,
>      VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
>      VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>  
> @@ -89,6 +90,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_POSTCOPY_ADVISE  = 28,
>      VHOST_USER_POSTCOPY_LISTEN  = 29,
>      VHOST_USER_POSTCOPY_END     = 30,
> +    VHOST_USER_GET_INFLIGHT_FD = 31,
> +    VHOST_USER_SET_INFLIGHT_FD = 32,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -147,6 +150,13 @@ typedef struct VhostUserVringArea {
>      uint64_t offset;
>  } VhostUserVringArea;
>  
> +typedef struct VhostUserInflight {
> +    uint64_t mmap_size;
> +    uint64_t mmap_offset;
> +    uint16_t num_queues;
> +    uint16_t queue_size;
> +} VhostUserInflight;
> +
>  typedef struct {
>      VhostUserRequest request;
>  
> @@ -169,6 +179,7 @@ typedef union {
>          VhostUserConfig config;
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
> +        VhostUserInflight inflight;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -1739,6 +1750,100 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
>      return result;
>  }
>  
> +static int vhost_user_get_inflight_fd(struct vhost_dev *dev,
> +                                      uint16_t queue_size,
> +                                      struct vhost_inflight *inflight)
> +{
> +    void *addr;
> +    int fd;
> +    struct vhost_user *u = dev->opaque;
> +    CharBackend *chr = u->user->chr;
> +    VhostUserMsg msg = {
> +        .hdr.request = VHOST_USER_GET_INFLIGHT_FD,
> +        .hdr.flags = VHOST_USER_VERSION,
> +        .payload.inflight.num_queues = dev->nvqs,
> +        .payload.inflight.queue_size = queue_size,
> +        .hdr.size = sizeof(msg.payload.inflight),
> +    };
> +
> +    if (!virtio_has_feature(dev->protocol_features,
> +                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +        return 0;
> +    }
> +
> +    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
> +        return -1;
> +    }
> +
> +    if (vhost_user_read(dev, &msg) < 0) {
> +        return -1;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_GET_INFLIGHT_FD) {
> +        error_report("Received unexpected msg type. "
> +                     "Expected %d received %d",
> +                     VHOST_USER_GET_INFLIGHT_FD, msg.hdr.request);
> +        return -1;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.inflight)) {
> +        error_report("Received bad msg size.");
> +        return -1;
> +    }
> +
> +    if (!msg.payload.inflight.mmap_size) {
> +        return 0;
> +    }
> +
> +    fd = qemu_chr_fe_get_msgfd(chr);
> +    if (fd < 0) {
> +        error_report("Failed to get mem fd");
> +        return -1;
> +    }
> +
> +    addr = mmap(0, msg.payload.inflight.mmap_size, PROT_READ | PROT_WRITE,
> +                MAP_SHARED, fd, msg.payload.inflight.mmap_offset);
> +
> +    if (addr == MAP_FAILED) {
> +        error_report("Failed to mmap mem fd");
> +        close(fd);
> +        return -1;
> +    }
> +
> +    inflight->addr = addr;
> +    inflight->fd = fd;
> +    inflight->size = msg.payload.inflight.mmap_size;
> +    inflight->offset = msg.payload.inflight.mmap_offset;
> +    inflight->queue_size = queue_size;
> +
> +    return 0;
> +}
> +
> +static int vhost_user_set_inflight_fd(struct vhost_dev *dev,
> +                                      struct vhost_inflight *inflight)
> +{
> +    VhostUserMsg msg = {
> +        .hdr.request = VHOST_USER_SET_INFLIGHT_FD,
> +        .hdr.flags = VHOST_USER_VERSION,
> +        .payload.inflight.mmap_size = inflight->size,
> +        .payload.inflight.mmap_offset = inflight->offset,
> +        .payload.inflight.num_queues = dev->nvqs,
> +        .payload.inflight.queue_size = inflight->queue_size,
> +        .hdr.size = sizeof(msg.payload.inflight),
> +    };
> +
> +    if (!virtio_has_feature(dev->protocol_features,
> +                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +        return 0;
> +    }
> +
> +    if (vhost_user_write(dev, &msg, &inflight->fd, 1) < 0) {
> +        return -1;
> +    }
> +
> +    return 0;
> +}
> +
>  VhostUserState *vhost_user_init(void)
>  {
>      VhostUserState *user = g_new0(struct VhostUserState, 1);
> @@ -1790,4 +1895,6 @@ const VhostOps user_ops = {
>          .vhost_crypto_create_session = vhost_user_crypto_create_session,
>          .vhost_crypto_close_session = vhost_user_crypto_close_session,
>          .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
> +        .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
> +        .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 569c4053ea..8db1a855eb 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1481,6 +1481,102 @@ void vhost_dev_set_config_notifier(struct vhost_dev *hdev,
>      hdev->config_ops = ops;
>  }
>  
> +void vhost_dev_free_inflight(struct vhost_inflight *inflight)
> +{
> +    if (inflight->addr) {
> +        qemu_memfd_free(inflight->addr, inflight->size, inflight->fd);
> +        inflight->addr = NULL;
> +        inflight->fd = -1;
> +    }
> +}
> +
> +static int vhost_dev_resize_inflight(struct vhost_inflight *inflight,
> +                                     uint64_t new_size)
> +{
> +    Error *err = NULL;
> +    int fd = -1;
> +    void *addr = qemu_memfd_alloc("vhost-inflight", new_size,
> +                                  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> +                                  &fd, &err);
> +
> +    if (err) {
> +        error_report_err(err);
> +        return -1;
> +    }
> +
> +    vhost_dev_free_inflight(inflight);
> +    inflight->offset = 0;
> +    inflight->addr = addr;
> +    inflight->fd = fd;
> +    inflight->size = new_size;
> +
> +    return 0;
> +}
> +
> +void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f)
> +{
> +    if (inflight->addr) {
> +        qemu_put_be64(f, inflight->size);
> +        qemu_put_be16(f, inflight->queue_size);
> +        qemu_put_buffer(f, inflight->addr, inflight->size);
> +    } else {
> +        qemu_put_be64(f, 0);
> +    }
> +}
> +
> +int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
> +{
> +    uint64_t size;
> +
> +    size = qemu_get_be64(f);
> +    if (!size) {
> +        return 0;
> +    }
> +
> +    if (inflight->size != size) {
> +        if (vhost_dev_resize_inflight(inflight, size)) {
> +            return -1;
> +        }
> +    }
> +    inflight->queue_size = qemu_get_be16(f);
> +
> +    qemu_get_buffer(f, inflight->addr, size);
> +
> +    return 0;
> +}
> +
> +int vhost_dev_set_inflight(struct vhost_dev *dev,
> +                           struct vhost_inflight *inflight)
> +{
> +    int r;
> +
> +    if (dev->vhost_ops->vhost_set_inflight_fd && inflight->addr) {
> +        r = dev->vhost_ops->vhost_set_inflight_fd(dev, inflight);
> +        if (r) {
> +            VHOST_OPS_DEBUG("vhost_set_inflight_fd failed");
> +            return -errno;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> +                           struct vhost_inflight *inflight)
> +{
> +    int r;
> +
> +    if (dev->vhost_ops->vhost_get_inflight_fd) {
> +        r = dev->vhost_ops->vhost_get_inflight_fd(dev, queue_size, inflight);
> +        if (r) {
> +            VHOST_OPS_DEBUG("vhost_get_inflight_fd failed");
> +            return -errno;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
>  /* Host notifiers must be enabled at this point. */
>  int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
>  {
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index 81283ec50f..d6632a18e6 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -25,6 +25,7 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>  
> +struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
>  struct vhost_memory;
> @@ -104,6 +105,13 @@ typedef int (*vhost_crypto_close_session_op)(struct vhost_dev *dev,
>  typedef bool (*vhost_backend_mem_section_filter_op)(struct vhost_dev *dev,
>                                                  MemoryRegionSection *section);
>  
> +typedef int (*vhost_get_inflight_fd_op)(struct vhost_dev *dev,
> +                                        uint16_t queue_size,
> +                                        struct vhost_inflight *inflight);
> +
> +typedef int (*vhost_set_inflight_fd_op)(struct vhost_dev *dev,
> +                                        struct vhost_inflight *inflight);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -142,6 +150,8 @@ typedef struct VhostOps {
>      vhost_crypto_create_session_op vhost_crypto_create_session;
>      vhost_crypto_close_session_op vhost_crypto_close_session;
>      vhost_backend_mem_section_filter_op vhost_backend_mem_section_filter;
> +    vhost_get_inflight_fd_op vhost_get_inflight_fd;
> +    vhost_set_inflight_fd_op vhost_set_inflight_fd;
>  } VhostOps;
>  
>  extern const VhostOps user_ops;
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index a7f449fa87..619498c8f4 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -7,6 +7,15 @@
>  #include "exec/memory.h"
>  
>  /* Generic structures common for any vhost based device. */
> +
> +struct vhost_inflight {
> +    int fd;
> +    void *addr;
> +    uint64_t size;
> +    uint64_t offset;
> +    uint16_t queue_size;
> +};
> +
>  struct vhost_virtqueue {
>      int kick;
>      int call;
> @@ -120,4 +129,13 @@ int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
>   */
>  void vhost_dev_set_config_notifier(struct vhost_dev *dev,
>                                     const VhostDevConfigOps *ops);
> +
> +void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
> +void vhost_dev_free_inflight(struct vhost_inflight *inflight);
> +void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
> +int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
> +int vhost_dev_set_inflight(struct vhost_dev *dev,
> +                           struct vhost_inflight *inflight);
> +int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> +                           struct vhost_inflight *inflight);
>  #endif
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-21 17:27   ` Michael S. Tsirkin
@ 2019-02-22  2:47     ` Yongji Xie
  2019-02-22  6:21       ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Yongji Xie @ 2019-02-22  2:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 22 Feb 2019 at 01:27, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Mon, Feb 18, 2019 at 06:27:42PM +0800, elohimes@gmail.com wrote:
> > From: Xie Yongji <xieyongji@baidu.com>
> >
> > This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
> > and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
> > buffer between qemu and backend.
> >
> > Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
> > shared buffer from backend. Then qemu should send it back
> > through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.
> >
> > This shared buffer is used to track inflight I/O by backend.
> > Qemu should retrieve a new one when the VM is reset.
> >
> > Signed-off-by: Xie Yongji <xieyongji@baidu.com>
> > Signed-off-by: Chai Wen <chaiwen@baidu.com>
> > Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
> > ---
> >  docs/interop/vhost-user.txt       | 264 ++++++++++++++++++++++++++++++
> >  hw/virtio/vhost-user.c            | 107 ++++++++++++
> >  hw/virtio/vhost.c                 |  96 +++++++++++
> >  include/hw/virtio/vhost-backend.h |  10 ++
> >  include/hw/virtio/vhost.h         |  18 ++
> >  5 files changed, 495 insertions(+)
> >
> > diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
> > index c2194711d9..61c6d0e415 100644
> > --- a/docs/interop/vhost-user.txt
> > +++ b/docs/interop/vhost-user.txt
> > @@ -142,6 +142,17 @@ Depending on the request type, payload can be:
> >     Offset: a 64-bit offset of this area from the start of the
> >         supplied file descriptor
> >
> > + * Inflight description
> > +   -----------------------------------------------------
> > +   | mmap size | mmap offset | num queues | queue size |
> > +   -----------------------------------------------------
> > +
> > +   mmap size: a 64-bit size of area to track inflight I/O
> > +   mmap offset: a 64-bit offset of this area from the start
> > +                of the supplied file descriptor
> > +   num queues: a 16-bit number of virtqueues
> > +   queue size: a 16-bit size of virtqueues
> > +
> >  In QEMU the vhost-user message is implemented with the following struct:
> >
> >  typedef struct VhostUserMsg {
> > @@ -157,6 +168,7 @@ typedef struct VhostUserMsg {
> >          struct vhost_iotlb_msg iotlb;
> >          VhostUserConfig config;
> >          VhostUserVringArea area;
> > +        VhostUserInflight inflight;
> >      };
> >  } QEMU_PACKED VhostUserMsg;
> >
> > @@ -175,6 +187,7 @@ the ones that do:
> >   * VHOST_USER_GET_PROTOCOL_FEATURES
> >   * VHOST_USER_GET_VRING_BASE
> >   * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
> > + * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
> >
> >  [ Also see the section on REPLY_ACK protocol extension. ]
> >
> > @@ -188,6 +201,7 @@ in the ancillary data:
> >   * VHOST_USER_SET_VRING_CALL
> >   * VHOST_USER_SET_VRING_ERR
> >   * VHOST_USER_SET_SLAVE_REQ_FD
> > + * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
> >
> >  If Master is unable to send the full message or receives a wrong reply it will
> >  close the connection. An optional reconnection mechanism can be implemented.
> > @@ -382,6 +396,235 @@ If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
> >  slave can send file descriptors (at most 8 descriptors in each message)
> >  to master via ancillary data using this fd communication channel.
> >
> > +Inflight I/O tracking
> > +---------------------
> > +
> > +To support reconnecting after restart or crash, slave may need to resubmit
> > +inflight I/Os. If virtqueue is processed in order, we can easily achieve
> > +that by getting the inflight descriptors from descriptor table (split virtqueue)
> > +or descriptor ring (packed virtqueue). However, it can't work when we process
> > +descriptors out-of-order because some entries which store the information of
> > +inflight descriptors in available ring (split virtqueue) or descriptor
> > +ring (packed virtqueue) might be overridden by new entries. To solve this
> > +problem, the slave needs to allocate an extra buffer to store the information of inflight
> > +descriptors and share it with the master for persistence. VHOST_USER_GET_INFLIGHT_FD and
> > +VHOST_USER_SET_INFLIGHT_FD are used to transfer this buffer between master
> > +and slave. The format of this buffer is described below:
> > +
> > +-------------------------------------------------------
> > +| queue0 region | queue1 region | ... | queueN region |
> > +-------------------------------------------------------
> > +
> > +N is the number of available virtqueues. Slave could get it from num queues
> > +field of VhostUserInflight.
> > +
> > +For split virtqueue, queue region can be implemented as:
> > +
> > +typedef struct DescStateSplit {
> > +    /* Indicate whether this descriptor is inflight or not.
> > +     * Only available for head-descriptor. */
> > +    uint8_t inflight;
> > +
> > +    /* Padding */
> > +    uint8_t padding;
> > +
> > +    /* Link to the last processed entry */
> > +    uint16_t next;
> > +} DescStateSplit;
> > +
> > +typedef struct QueueRegionSplit {
> > +    /* The feature flags of this region. Now it's initialized to 0. */
> > +    uint64_t features;
> > +
> > +    /* The version of this region. It's 1 currently.
> > +     * Zero value indicates an uninitialized buffer */
> > +    uint16_t version;
> > +
> > +    /* The size of DescStateSplit array. It's equal to the virtqueue
> > +     * size. Slave could get it from queue size field of VhostUserInflight. */
> > +    uint16_t desc_num;
> > +
> > +    /* The head of processed DescStateSplit entry list */
> > +    uint16_t process_head;
> > +
> > +    /* Storing the idx value of used ring */
> > +    uint16_t used_idx;
> > +
> > +    /* Used to track the state of each descriptor in descriptor table */
> > +    DescStateSplit desc[0];
> > +} QueueRegionSplit;
>
>
> What is the endianness of multibyte fields?
>

Native endian is OK here, right?

>
> > +
> > +To track inflight I/O, the queue region should be processed as follows:
> > +
> > +When receiving available buffers from the driver:
> > +
> > +    1. Get the next available head-descriptor index from available ring, i
> > +
> > +    2. Set desc[i].inflight to 1
> > +
> > +When supplying used buffers to the driver:
> > +
> > +    1. Get corresponding used head-descriptor index, i
> > +
> > +    2. Set desc[i].next to process_head
> > +
> > +    3. Set process_head to i
> > +
> > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > +
> > +    5. Increase the idx value of used ring by the size of the batch
> > +
> > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > +
> > +    7. Set used_idx to the idx value of used ring
> > +
> > +When reconnecting:
> > +
> > +    1. If the value of used_idx does not match the idx value of used ring,
> > +
> > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > +        the number of in-progress DescStateSplit entries
> > +
> > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > +        start from process_head to 0
> > +
> > +        (c) Set used_idx to the idx value of used ring
> > +
> > +    2. Resubmit each inflight DescStateSplit entry
>
> I re-read it a couple of times and I still don't understand what it says.
>
> For simplicity consider split ring. So we want a list of heads that are
> outstanding. Fair enough. Now the device finishes a head. What now? It needs
> to drop the head from the list. But the list is unidirectional (just next, no
> prev). So how can you drop an entry from the middle?
>

The process_head is only used when the slave crashes between increasing the
idx value of the used ring and updating used_idx. We use it to find the
in-progress DescStateSplit entries before the crash and complete them
when reconnecting. This makes sure guest and slave have the same view of
inflight I/Os.

In the other case, the inflight field is enough to track inflight I/O.
When reconnecting, we go through all DescStateSplit entries and
re-submit each entry whose inflight field is equal to 1.
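
A minimal sketch of that reconnect-time scan could look like this
(hypothetical names: resubmit_desc() stands in for the backend's actual
resubmission logic, and 'vr' is the queue's vring):

static void vu_recover_inflights(QueueRegionSplit *q, struct vring *vr)
{
    uint16_t i, head, num;

    /* Case 1: the slave crashed after bumping vr->used->idx but
     * before clearing the inflight marks; walk the processed list
     * from process_head to clear them now */
    if (q->used_idx != vr->used->idx) {
        num = vr->used->idx - q->used_idx;
        head = q->process_head;
        while (num--) {
            q->desc[head].inflight = 0;
            head = q->desc[head].next;
        }
        q->used_idx = vr->used->idx;
    }

    /* Case 2: everything still marked inflight was never completed,
     * so hand it back to the backend for resubmission */
    for (i = 0; i < q->desc_num; i++) {
        if (q->desc[i].inflight) {
            resubmit_desc(q, i); /* backend-specific (hypothetical) */
        }
    }
}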

> > +For packed virtqueue, queue region can be implemented as:
> > +
> > +typedef struct DescStatePacked {
> > +    /* Indicate whether this descriptor is inflight or not.
> > +     * Only available for head-descriptor. */
> > +    uint8_t inflight;
> > +
> > +    /* Padding */
> > +    uint8_t padding;
> > +
> > +    /* Link to the next free entry */
> > +    uint16_t next;
> > +
> > +    /* Link to the last entry of descriptor list.
> > +     * Only available for head-descriptor. */
> > +    uint16_t last;
> > +
> > +    /* The length of descriptor list.
> > +     * Only available for head-descriptor. */
> > +    uint16_t num;
> > +
> > +    /* The buffer id */
> > +    uint16_t id;
> > +
> > +    /* The descriptor flags */
> > +    uint16_t flags;
> > +
> > +    /* The buffer length */
> > +    uint32_t len;
> > +
> > +    /* The buffer address */
> > +    uint64_t addr;
>
> Do we want an extra u64 here to make it a power of two?
>

Looks good to me.

Thanks,
Yongji

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-22  2:47     ` Yongji Xie
@ 2019-02-22  6:21       ` Michael S. Tsirkin
  2019-02-22  7:05         ` Yongji Xie
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22  6:21 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > +
> > > +To track inflight I/O, the queue region should be processed as follows:
> > > +
> > > +When receiving available buffers from the driver:
> > > +
> > > +    1. Get the next available head-descriptor index from available ring, i
> > > +
> > > +    2. Set desc[i].inflight to 1
> > > +
> > > +When supplying used buffers to the driver:
> > > +
> > > +    1. Get corresponding used head-descriptor index, i
> > > +
> > > +    2. Set desc[i].next to process_head
> > > +
> > > +    3. Set process_head to i
> > > +
> > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > +
> > > +    5. Increase the idx value of used ring by the size of the batch
> > > +
> > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > +
> > > +    7. Set used_idx to the idx value of used ring
> > > +
> > > +When reconnecting:
> > > +
> > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > +
> > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > +        the number of in-progress DescStateSplit entries
> > > +
> > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > +        start from process_head to 0
> > > +
> > > +        (c) Set used_idx to the idx value of used ring
> > > +
> > > +    2. Resubmit each inflight DescStateSplit entry
> >
> > I re-read it a couple of times and I still don't understand what it says.
> >
> > For simplicity consider split ring. So we want a list of heads that are
> > outstanding. Fair enough. Now the device finishes a head. What now? It needs
> > to drop the head from the list. But the list is unidirectional (just next, no
> > prev). So how can you drop an entry from the middle?
> >
> 
> The process_head is only used when the slave crashes between increasing the
> idx value of the used ring and updating used_idx. We use it to find the
> in-progress DescStateSplit entries before the crash and complete them
> when reconnecting. This makes sure guest and slave have the same view of
> inflight I/Os.
> 

But I don't understand how the described process helps do it.


> In the other case, the inflight field is enough to track inflight I/O.
> When reconnecting, we go through all DescStateSplit entries and
> re-submit each entry whose inflight field is equal to 1.

What I don't understand is how we know the order
in which they have to be resubmitted. Reordering
operations would be a big problem, wouldn't it?


Let's say I fetch descriptors A, B, C and start
processing. How does memory look?
Now I finished B and marked it used. How does
memory look?

I also wonder how you address a crash between
marking a descriptor used and clearing inflight.
Will you redo the descriptor? Is it always safe?
What if it's a write?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-22  6:21       ` Michael S. Tsirkin
@ 2019-02-22  7:05         ` Yongji Xie
  2019-02-22 14:54           ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Yongji Xie @ 2019-02-22  7:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > +
> > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > +
> > > > +When receiving available buffers from the driver:
> > > > +
> > > > +    1. Get the next available head-descriptor index from available ring, i
> > > > +
> > > > +    2. Set desc[i].inflight to 1
> > > > +
> > > > +When supplying used buffers to the driver:
> > > > +
> > > > +    1. Get corresponding used head-descriptor index, i
> > > > +
> > > > +    2. Set desc[i].next to process_head
> > > > +
> > > > +    3. Set process_head to i
> > > > +
> > > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > +
> > > > +    5. Increase the idx value of used ring by the size of the batch
> > > > +
> > > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > +
> > > > +    7. Set used_idx to the idx value of used ring
> > > > +
> > > > +When reconnecting:
> > > > +
> > > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > > +
> > > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > +        the number of in-progress DescStateSplit entries
> > > > +
> > > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > +        start from process_head to 0
> > > > +
> > > > +        (c) Set used_idx to the idx value of used ring
> > > > +
> > > > +    2. Resubmit each inflight DescStateSplit entry
> > >
> > > I re-read it a couple of times and I still don't understand what it says.
> > >
> > > For simplicity, consider the split ring. So we want a list of heads that
> > > are outstanding. Fair enough. Now the device finishes a head. What now?
> > > It needs to drop the head from the list. But the list is unidirectional
> > > (just next, no prev). So how can you drop an entry from the middle?
> > >
> >
> > The process_head is only used when the slave crashes between increasing
> > the idx value of the used ring and updating used_idx. We use it to find
> > the in-progress DescStateSplit entries before the crash and complete them
> > when reconnecting. This makes sure the guest and slave have the same view
> > of inflight I/Os.
> >
>
> But I don't understand how the described process helps do it.
>

For example, we need to submit descriptors A, B, C to the driver in a batch.

First, we link those descriptors like:

process_head->A->B->C    (A)

Then, we update the idx value of the used vring to mark those
descriptors as used:

_vring.used->idx += 3    (B)

Finally, we clear the inflight field of those descriptors and update
the used_idx field:

A.inflight = 0; B.inflight = 0; C.inflight = 0;    (C)

used_idx = _vring.used->idx;    (D)

After (B), the guest can consume descriptors A, B, C. So we must make
sure the inflight fields of A, B, C are cleared when reconnecting, to
avoid re-submitting a used descriptor. If the slave crashes during (C),
the inflight fields of A, B, C may be incorrect. To detect that case,
we check whether used_idx matches _vring.used->idx. And through
process_head, we can get the in-progress descriptors A, B, C and clear
their inflight fields again when reconnecting.
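
To make the window between (B) and (D) concrete, here is a rough C
sketch of this batch path. The type and field names are simplified
stand-ins for the layout in the patch, and memory barriers and the
real vring accessors are omitted:

#include <stdint.h>

#define QUEUE_SIZE 256                  /* hypothetical queue size */

typedef struct DescStateSplit {
    uint8_t  inflight;                  /* 1 while the request is in flight */
    uint16_t next;                      /* links entries of the current batch */
} DescStateSplit;

typedef struct Inflight {
    uint16_t process_head;              /* head of the batch list */
    uint16_t used_idx;                  /* shadow copy of vring.used->idx */
    DescStateSplit desc[QUEUE_SIZE];
} Inflight;

/* (A): link a used head-descriptor index into the process list */
static void batch_link(Inflight *inf, uint16_t i)
{
    inf->desc[i].next = inf->process_head;
    inf->process_head = i;
}

/* (B)-(D): publish the batch to the guest, then retire its state */
static void batch_commit(Inflight *inf, volatile uint16_t *used_ring_idx,
                         uint16_t batch)
{
    uint16_t i = inf->process_head;
    uint16_t n;

    *used_ring_idx += batch;            /* (B): guest may consume now */
    for (n = 0; n < batch; n++) {       /* (C) */
        inf->desc[i].inflight = 0;
        i = inf->desc[i].next;
    }
    inf->used_idx = *used_ring_idx;     /* (D) */
}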

>
> > In the other cases, the inflight field is enough to track inflight I/O.
> > When reconnecting, we go through all DescStateSplit entries and
> > re-submit each entry whose inflight field is equal to 1.
>
> What I don't understand is how we know the order
> in which they have to be resubmitted. Reordering
> operations would be a big problem, wouldn't it?
>

In a previous version of this patch, I recorded avail_idx for each
DescStateSplit entry to preserve the order. Would it be useful for
fixing this?

>
> Let's say I fetch descriptors A, B, C and start
> processing. How does memory look?

A.inflight = 1, C.inflight = 1, B.inflight = 1

> Now I finished B and marked it used. How does
> memory look?
>

A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B

> I also wonder how you address a crash between
> marking a descriptor used and clearing inflight.
> Will you redo the descriptor? Is it always safe?
> What if it's a write?
>

It's safe. We can get the in-progress descriptors through process_head
and clear their inflight fields when reconnecting.
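
For illustration, a rough sketch of that recovery path, reusing the
simplified types from the sketch above (the resubmit helper is
hypothetical):

/* If used_idx lags the ring's idx, the slave died between (B) and (D):
 * the entries reachable from process_head were already seen by the
 * guest, so clear their inflight flags instead of resubmitting them. */
static void recover_on_reconnect(Inflight *inf,
                                 volatile uint16_t *used_ring_idx)
{
    uint16_t lag = *used_ring_idx - inf->used_idx;  /* in-progress count */
    uint16_t i = inf->process_head;
    uint16_t j;

    while (lag--) {
        inf->desc[i].inflight = 0;
        i = inf->desc[i].next;
    }
    inf->used_idx = *used_ring_idx;

    for (j = 0; j < QUEUE_SIZE; j++) {
        if (inf->desc[j].inflight) {
            /* resubmit_request(j); -- hypothetical backend helper */
        }
    }
}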

Thanks,
Yongji


* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-22  7:05         ` Yongji Xie
@ 2019-02-22 14:54           ` Michael S. Tsirkin
  2019-02-23 13:10             ` Yongji Xie
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-02-22 14:54 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > +
> > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > +
> > > > > +When receiving available buffers from the driver:
> > > > > +
> > > > > +    1. Get the next available head-descriptor index from available ring, i
> > > > > +
> > > > > +    2. Set desc[i].inflight to 1
> > > > > +
> > > > > +When supplying used buffers to the driver:
> > > > > +
> > > > > +    1. Get corresponding used head-descriptor index, i
> > > > > +
> > > > > +    2. Set desc[i].next to process_head
> > > > > +
> > > > > +    3. Set process_head to i
> > > > > +
> > > > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > +
> > > > > +    5. Increase the idx value of used ring by the size of the batch
> > > > > +
> > > > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > +
> > > > > +    7. Set used_idx to the idx value of used ring
> > > > > +
> > > > > +When reconnecting:
> > > > > +
> > > > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > > > +
> > > > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > +        the number of in-progress DescStateSplit entries
> > > > > +
> > > > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > +        start from process_head to 0
> > > > > +
> > > > > +        (c) Set used_idx to the idx value of used ring
> > > > > +
> > > > > +    2. Resubmit each inflight DescStateSplit entry
> > > >
> > > > I re-read it a couple of times and I still don't understand what it says.
> > > >
> > > > For simplicity, consider the split ring. So we want a list of heads that
> > > > are outstanding. Fair enough. Now the device finishes a head. What now?
> > > > It needs to drop the head from the list. But the list is unidirectional
> > > > (just next, no prev). So how can you drop an entry from the middle?
> > > >
> > >
> > > The process_head is only used when the slave crashes between increasing
> > > the idx value of the used ring and updating used_idx. We use it to find
> > > the in-progress DescStateSplit entries before the crash and complete them
> > > when reconnecting. This makes sure the guest and slave have the same view
> > > of inflight I/Os.
> > >
> >
> > But I don't understand how the described process helps do it.
> >
> 
> For example, we need to submit descriptors A, B, C to the driver in a batch.
>
> First, we link those descriptors like:
>
> process_head->A->B->C    (A)
>
> Then, we update the idx value of the used vring to mark those
> descriptors as used:
>
> _vring.used->idx += 3    (B)
>
> Finally, we clear the inflight field of those descriptors and update
> the used_idx field:
>
> A.inflight = 0; B.inflight = 0; C.inflight = 0;    (C)
>
> used_idx = _vring.used->idx;    (D)
>
> After (B), the guest can consume descriptors A, B, C. So we must make
> sure the inflight fields of A, B, C are cleared when reconnecting, to
> avoid re-submitting a used descriptor. If the slave crashes during (C),
> the inflight fields of A, B, C may be incorrect. To detect that case,
> we check whether used_idx matches _vring.used->idx. And through
> process_head, we can get the in-progress descriptors A, B, C and clear
> their inflight fields again when reconnecting.
> 
> >
> > In the other cases, the inflight field is enough to track inflight I/O.
> > When reconnecting, we go through all DescStateSplit entries and
> > re-submit each entry whose inflight field is equal to 1.
> >
> > What I don't understand is how we know the order
> > in which they have to be resubmitted. Reordering
> > operations would be a big problem, wouldn't it?
> >
> 
> In a previous version of this patch, I recorded avail_idx for each
> DescStateSplit entry to preserve the order. Would it be useful for
> fixing this?
> 
> >
> > Let's say I fetch descriptors A, B, C and start
> > processing. How does memory look?
> 
> A.inflight = 1, C.inflight = 1, B.inflight = 1
> 
> > Now I finished B and marked it used. How does
> > memory look?
> >
> 
> A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B

OK. And we have

process_head->B->process_head

?

Now if there is a reconnect, I want to submit A and then C,
correct? How do I know that from this picture? How do I
know to start with A? It's not on the list anymore...



> > I also wonder how you address a crash between
> > marking a descriptor used and clearing inflight.
> > Will you redo the descriptor? Is it always safe?
> > What if it's a write?
> >
> 
> It's safe. We can get the in-progress descriptors through process_head
> and clear their inflight fields when reconnecting.
> 
> Thanks,
> Yongji


* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-22 14:54           ` Michael S. Tsirkin
@ 2019-02-23 13:10             ` Yongji Xie
  2019-02-24  0:14               ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Yongji Xie @ 2019-02-23 13:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > > +
> > > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > > +
> > > > > > +When receiving available buffers from the driver:
> > > > > > +
> > > > > > +    1. Get the next available head-descriptor index from available ring, i
> > > > > > +
> > > > > > +    2. Set desc[i].inflight to 1
> > > > > > +
> > > > > > +When supplying used buffers to the driver:
> > > > > > +
> > > > > > +    1. Get corresponding used head-descriptor index, i
> > > > > > +
> > > > > > +    2. Set desc[i].next to process_head
> > > > > > +
> > > > > > +    3. Set process_head to i
> > > > > > +
> > > > > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > > +
> > > > > > +    5. Increase the idx value of used ring by the size of the batch
> > > > > > +
> > > > > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > > +
> > > > > > +    7. Set used_idx to the idx value of used ring
> > > > > > +
> > > > > > +When reconnecting:
> > > > > > +
> > > > > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > > > > +
> > > > > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > > +        the number of in-progress DescStateSplit entries
> > > > > > +
> > > > > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > > +        start from process_head to 0
> > > > > > +
> > > > > > +        (c) Set used_idx to the idx value of used ring
> > > > > > +
> > > > > > +    2. Resubmit each inflight DescStateSplit entry
> > > > >
> > > > > I re-read it a couple of times and I still don't understand what it says.
> > > > >
> > > > > For simplicity, consider the split ring. So we want a list of heads that
> > > > > are outstanding. Fair enough. Now the device finishes a head. What now?
> > > > > It needs to drop the head from the list. But the list is unidirectional
> > > > > (just next, no prev). So how can you drop an entry from the middle?
> > > > >
> > > >
> > > > The process_head is only used when the slave crashes between increasing
> > > > the idx value of the used ring and updating used_idx. We use it to find
> > > > the in-progress DescStateSplit entries before the crash and complete them
> > > > when reconnecting. This makes sure the guest and slave have the same view
> > > > of inflight I/Os.
> > > >
> > >
> > > But I don't understand how the described process helps do it.
> > >
> >
> > For example, we need to submit descriptors A, B, C to the driver in a batch.
> >
> > First, we link those descriptors like:
> >
> > process_head->A->B->C    (A)
> >
> > Then, we update the idx value of the used vring to mark those
> > descriptors as used:
> >
> > _vring.used->idx += 3    (B)
> >
> > Finally, we clear the inflight field of those descriptors and update
> > the used_idx field:
> >
> > A.inflight = 0; B.inflight = 0; C.inflight = 0;    (C)
> >
> > used_idx = _vring.used->idx;    (D)
> >
> > After (B), the guest can consume descriptors A, B, C. So we must make
> > sure the inflight fields of A, B, C are cleared when reconnecting, to
> > avoid re-submitting a used descriptor. If the slave crashes during (C),
> > the inflight fields of A, B, C may be incorrect. To detect that case,
> > we check whether used_idx matches _vring.used->idx. And through
> > process_head, we can get the in-progress descriptors A, B, C and clear
> > their inflight fields again when reconnecting.
> >
> > >
> > > > In the other cases, the inflight field is enough to track inflight I/O.
> > > > When reconnecting, we go through all DescStateSplit entries and
> > > > re-submit each entry whose inflight field is equal to 1.
> > >
> > > What I don't understand is how we know the order
> > > in which they have to be resubmitted. Reordering
> > > operations would be a big problem, wouldn't it?
> > >
> >
> > In a previous version of this patch, I recorded avail_idx for each
> > DescStateSplit entry to preserve the order. Would it be useful for
> > fixing this?
> >
> > >
> > > Let's say I fetch descriptors A, B, C and start
> > > processing. How does memory look?
> >
> > A.inflight = 1, C.inflight = 1, B.inflight = 1
> >
> > > Now I finished B and marked it used. How does
> > > memory look?
> > >
> >
> > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B
>
> OK. And we have
>
> process_head->B->process_head
>
> ?
>
> Now if there is a reconnect, I want to submit A and then C,
> correct? How do I know that from this picture? How do I
> know to start with A? It's not on the list anymore...
>

We can go through all DescStateSplit entries (they track all descriptors
in the Descriptor Table), and then find that A and C are inflight entries
by their inflight field. And if we want to resubmit them in order (submit
A and then C), we need to introduce a timestamp for each DescStateSplit
entry to preserve the order in which we fetched them from the driver.
Something like:

When receiving available buffers from the driver:

1. Get the next available head-descriptor index from available ring, i

2. desc[i].timestamp = avail_idx++;

3. Set desc[i].inflight to 1
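
As a rough, self-contained sketch of how a backend could use such a
timestamp on reconnect (the field, the types and the resubmit helper
here are illustrative, not part of the current patch):

#include <stdint.h>
#include <stdlib.h>

typedef struct DescStateSplit {
    uint64_t timestamp;     /* proposed: order in which heads were fetched */
    uint8_t  inflight;
} DescStateSplit;

typedef struct Pending {
    uint16_t index;
    uint64_t timestamp;
} Pending;

static int cmp_pending(const void *a, const void *b)
{
    const Pending *x = a, *y = b;
    return (x->timestamp > y->timestamp) - (x->timestamp < y->timestamp);
}

/* gather inflight entries, sort by fetch order, resubmit oldest first */
static void resubmit_in_order(const DescStateSplit *desc, uint16_t queue_size)
{
    Pending pending[queue_size];        /* C99 VLA, fine for a sketch */
    uint16_t n = 0;
    uint16_t i;

    for (i = 0; i < queue_size; i++) {
        if (desc[i].inflight) {
            pending[n].index = i;
            pending[n].timestamp = desc[i].timestamp;
            n++;
        }
    }
    qsort(pending, n, sizeof(pending[0]), cmp_pending);

    for (i = 0; i < n; i++) {
        /* resubmit_request(pending[i].index); -- hypothetical helper */
    }
}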

Thanks,
Yongji


* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-23 13:10             ` Yongji Xie
@ 2019-02-24  0:14               ` Michael S. Tsirkin
  2019-02-24  8:02                 ` Yongji Xie
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2019-02-24  0:14 UTC (permalink / raw)
  To: Yongji Xie
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Sat, Feb 23, 2019 at 09:10:01PM +0800, Yongji Xie wrote:
> On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> > > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > > > +
> > > > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > > > +
> > > > > > > +When receiving available buffers from the driver:
> > > > > > > +
> > > > > > > +    1. Get the next available head-descriptor index from available ring, i
> > > > > > > +
> > > > > > > +    2. Set desc[i].inflight to 1
> > > > > > > +
> > > > > > > +When supplying used buffers to the driver:
> > > > > > > +
> > > > > > > +    1. Get corresponding used head-descriptor index, i
> > > > > > > +
> > > > > > > +    2. Set desc[i].next to process_head
> > > > > > > +
> > > > > > > +    3. Set process_head to i
> > > > > > > +
> > > > > > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > > > +
> > > > > > > +    5. Increase the idx value of used ring by the size of the batch
> > > > > > > +
> > > > > > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > > > +
> > > > > > > +    7. Set used_idx to the idx value of used ring
> > > > > > > +
> > > > > > > +When reconnecting:
> > > > > > > +
> > > > > > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > > > > > +
> > > > > > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > > > +        the number of in-progress DescStateSplit entries
> > > > > > > +
> > > > > > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > > > +        start from process_head to 0
> > > > > > > +
> > > > > > > +        (c) Set used_idx to the idx value of used ring
> > > > > > > +
> > > > > > > +    2. Resubmit each inflight DescStateSplit entry
> > > > > >
> > > > > > I re-read it a couple of times and I still don't understand what it says.
> > > > > >
> > > > > > For simplicity, consider the split ring. So we want a list of heads that
> > > > > > are outstanding. Fair enough. Now the device finishes a head. What now?
> > > > > > It needs to drop the head from the list. But the list is unidirectional
> > > > > > (just next, no prev). So how can you drop an entry from the middle?
> > > > > >
> > > > >
> > > > > The process_head is only used when the slave crashes between increasing
> > > > > the idx value of the used ring and updating used_idx. We use it to find
> > > > > the in-progress DescStateSplit entries before the crash and complete them
> > > > > when reconnecting. This makes sure the guest and slave have the same view
> > > > > of inflight I/Os.
> > > > >
> > > >
> > > > But I don't understand how the described process helps do it.
> > > >
> > >
> > > For example, we need to submit descriptors A, B, C to the driver in a batch.
> > >
> > > First, we link those descriptors like:
> > >
> > > process_head->A->B->C    (A)
> > >
> > > Then, we update the idx value of the used vring to mark those
> > > descriptors as used:
> > >
> > > _vring.used->idx += 3    (B)
> > >
> > > Finally, we clear the inflight field of those descriptors and update
> > > the used_idx field:
> > >
> > > A.inflight = 0; B.inflight = 0; C.inflight = 0;    (C)
> > >
> > > used_idx = _vring.used->idx;    (D)
> > >
> > > After (B), the guest can consume descriptors A, B, C. So we must make
> > > sure the inflight fields of A, B, C are cleared when reconnecting, to
> > > avoid re-submitting a used descriptor. If the slave crashes during (C),
> > > the inflight fields of A, B, C may be incorrect. To detect that case,
> > > we check whether used_idx matches _vring.used->idx. And through
> > > process_head, we can get the in-progress descriptors A, B, C and clear
> > > their inflight fields again when reconnecting.
> > >
> > > >
> > > > > In the other cases, the inflight field is enough to track inflight I/O.
> > > > > When reconnecting, we go through all DescStateSplit entries and
> > > > > re-submit each entry whose inflight field is equal to 1.
> > > >
> > > > What I don't understand is how we know the order
> > > > in which they have to be resubmitted. Reordering
> > > > operations would be a big problem, wouldn't it?
> > > >
> > >
> > > In a previous version of this patch, I recorded avail_idx for each
> > > DescStateSplit entry to preserve the order. Would it be useful for
> > > fixing this?
> > >
> > > >
> > > > Let's say I fetch descriptors A, B, C and start
> > > > processing. How does memory look?
> > >
> > > A.inflight = 1, C.inflight = 1, B.inflight = 1
> > >
> > > > Now I finished B and marked it used. How does
> > > > memory look?
> > > >
> > >
> > > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B
> >
> > OK. And we have
> >
> > process_head->B->process_head
> >
> > ?
> >
> > Now if there is a reconnect, I want to submit A and then C,
> > correct? How do I know that from this picture? How do I
> > know to start with A? It's not on the list anymore...
> >
> 
> We can go through all DescStateSplit entries (they track all descriptors
> in the Descriptor Table), and then find that A and C are inflight entries
> by their inflight field. And if we want to resubmit them in order (submit
> A and then C), we need to introduce a timestamp for each DescStateSplit
> entry to preserve the order in which we fetched them from the driver.
> Something like:
> 
> When receiving available buffers from the driver:
> 
> 1. Get the next available head-descriptor index from available ring, i
> 
> 2. desc[i].timestamp = avail_idx++;
> 
> 3. Set desc[i].inflight to 1
> 
> Thanks,
> Yongji

OK, I guess a 64-bit counter would be fine for that.

In-order submission seems critical for storage, right?
Reordering writes would seem to lead to data corruption.

But now I don't understand what the next field does.
Is it so you can maintain a freelist within a
statically allocated array?

-- 
MST


* Re: [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend
  2019-02-24  0:14               ` Michael S. Tsirkin
@ 2019-02-24  8:02                 ` Yongji Xie
  0 siblings, 0 replies; 18+ messages in thread
From: Yongji Xie @ 2019-02-24  8:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stefan Hajnoczi, Marc-André Lureau, Daniel P. Berrangé,
	Jason Wang, Coquelin, Maxime, Yury Kotov,
	Евгений
	Яковлев,
	qemu-devel, zhangyu31, chaiwen, nixun, lilin24, Xie Yongji

On Sun, 24 Feb 2019 at 08:14, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Sat, Feb 23, 2019 at 09:10:01PM +0800, Yongji Xie wrote:
> > On Fri, 22 Feb 2019 at 22:54, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Feb 22, 2019 at 03:05:23PM +0800, Yongji Xie wrote:
> > > > On Fri, 22 Feb 2019 at 14:21, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Fri, Feb 22, 2019 at 10:47:03AM +0800, Yongji Xie wrote:
> > > > > > > > +
> > > > > > > > +To track inflight I/O, the queue region should be processed as follows:
> > > > > > > > +
> > > > > > > > +When receiving available buffers from the driver:
> > > > > > > > +
> > > > > > > > +    1. Get the next available head-descriptor index from available ring, i
> > > > > > > > +
> > > > > > > > +    2. Set desc[i].inflight to 1
> > > > > > > > +
> > > > > > > > +When supplying used buffers to the driver:
> > > > > > > > +
> > > > > > > > +    1. Get corresponding used head-descriptor index, i
> > > > > > > > +
> > > > > > > > +    2. Set desc[i].next to process_head
> > > > > > > > +
> > > > > > > > +    3. Set process_head to i
> > > > > > > > +
> > > > > > > > +    4. Steps 1,2,3 may be performed repeatedly if batching is possible
> > > > > > > > +
> > > > > > > > +    5. Increase the idx value of used ring by the size of the batch
> > > > > > > > +
> > > > > > > > +    6. Set the inflight field of each DescStateSplit entry in the batch to 0
> > > > > > > > +
> > > > > > > > +    7. Set used_idx to the idx value of used ring
> > > > > > > > +
> > > > > > > > +When reconnecting:
> > > > > > > > +
> > > > > > > > +    1. If the value of used_idx does not match the idx value of used ring,
> > > > > > > > +
> > > > > > > > +        (a) Subtract the value of used_idx from the idx value of used ring to get
> > > > > > > > +        the number of in-progress DescStateSplit entries
> > > > > > > > +
> > > > > > > > +        (b) Set the inflight field of the in-progress DescStateSplit entries which
> > > > > > > > +        start from process_head to 0
> > > > > > > > +
> > > > > > > > +        (c) Set used_idx to the idx value of used ring
> > > > > > > > +
> > > > > > > > +    2. Resubmit each inflight DescStateSplit entry
> > > > > > >
> > > > > > > I re-read it a couple of times and I still don't understand what it says.
> > > > > > >
> > > > > > > For simplicity, consider the split ring. So we want a list of heads that
> > > > > > > are outstanding. Fair enough. Now the device finishes a head. What now?
> > > > > > > It needs to drop the head from the list. But the list is unidirectional
> > > > > > > (just next, no prev). So how can you drop an entry from the middle?
> > > > > > >
> > > > > >
> > > > > > The process_head is only used when the slave crashes between increasing
> > > > > > the idx value of the used ring and updating used_idx. We use it to find
> > > > > > the in-progress DescStateSplit entries before the crash and complete them
> > > > > > when reconnecting. This makes sure the guest and slave have the same view
> > > > > > of inflight I/Os.
> > > > > >
> > > > >
> > > > > But I don't understand how the described process helps do it.
> > > > >
> > > >
> > > > For example, we need to submit descriptors A, B, C to the driver in a batch.
> > > >
> > > > First, we link those descriptors like:
> > > >
> > > > process_head->A->B->C    (A)
> > > >
> > > > Then, we update the idx value of the used vring to mark those
> > > > descriptors as used:
> > > >
> > > > _vring.used->idx += 3    (B)
> > > >
> > > > Finally, we clear the inflight field of those descriptors and update
> > > > the used_idx field:
> > > >
> > > > A.inflight = 0; B.inflight = 0; C.inflight = 0;    (C)
> > > >
> > > > used_idx = _vring.used->idx;    (D)
> > > >
> > > > After (B), the guest can consume descriptors A, B, C. So we must make
> > > > sure the inflight fields of A, B, C are cleared when reconnecting, to
> > > > avoid re-submitting a used descriptor. If the slave crashes during (C),
> > > > the inflight fields of A, B, C may be incorrect. To detect that case,
> > > > we check whether used_idx matches _vring.used->idx. And through
> > > > process_head, we can get the in-progress descriptors A, B, C and clear
> > > > their inflight fields again when reconnecting.
> > > >
> > > > >
> > > > > > In the other cases, the inflight field is enough to track inflight I/O.
> > > > > > When reconnecting, we go through all DescStateSplit entries and
> > > > > > re-submit each entry whose inflight field is equal to 1.
> > > > >
> > > > > What I don't understand is how we know the order
> > > > > in which they have to be resubmitted. Reordering
> > > > > operations would be a big problem, wouldn't it?
> > > > >
> > > >
> > > > In a previous version of this patch, I recorded avail_idx for each
> > > > DescStateSplit entry to preserve the order. Would it be useful for
> > > > fixing this?
> > > >
> > > > >
> > > > > Let's say I fetch descriptors A, B, C and start
> > > > > processing. How does memory look?
> > > >
> > > > A.inflight = 1, C.inflight = 1, B.inflight = 1
> > > >
> > > > > Now I finished B and marked it used. How does
> > > > > memory look?
> > > > >
> > > >
> > > > A.inflight = 1, C.inflight = 1, B.inflight = 0, process_head = B
> > >
> > > OK. And we have
> > >
> > > process_head->B->process_head
> > >
> > > ?
> > >
> > > Now if there is a reconnect, I want to submit A and then C,
> > > correct? How do I know that from this picture? How do I
> > > know to start with A? It's not on the list anymore...
> > >
> >
> > We can go through all DescStateSplit entries (they track all descriptors
> > in the Descriptor Table), and then find that A and C are inflight entries
> > by their inflight field. And if we want to resubmit them in order (submit
> > A and then C), we need to introduce a timestamp for each DescStateSplit
> > entry to preserve the order in which we fetched them from the driver.
> > Something like:
> >
> > When receiving available buffers from the driver:
> >
> > 1. Get the next available head-descriptor index from available ring, i
> >
> > 2. desc[i].timestamp = avail_idx++;
> >
> > 3. Set desc[i].inflight to 1
> >
> > Thanks,
> > Yongji
>
> OK, I guess a 64-bit counter would be fine for that.
>
> In-order submission seems critical for storage, right?
> Reordering writes would seem to lead to data corruption.
>

Actually, I'm not sure. If we care about the order, shouldn't we avoid
accessing one block until another access to the same block has
completed?

> But now I don't understand what the next field does.
> Is it so you can maintain a freelist within a
> statically allocated array?
>

Yes, we can use it to maintain a list. The head of the list is
process_head. This list is only used when we want to submit descriptors
in a batch: all descriptors in the batch are linked into the list. Then,
if we crash between marking those descriptors used and clearing their
inflight field, we need to find those in-progress descriptors, and the
list helps us achieve that. If there is no batching of submissions, the
next field can be removed.
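
For illustration, a rough sketch of the non-batched path, reusing the
simplified Inflight types from the earlier sketch: with at most one
in-progress entry, process_head alone identifies it during recovery,
so no links between entries are needed.

/* Non-batched completion: the window between bumping the ring idx and
 * clearing inflight covers one entry, recorded in process_head. */
static void complete_one(Inflight *inf, volatile uint16_t *used_ring_idx,
                         uint16_t i)
{
    inf->process_head = i;              /* at most one in-progress entry */
    *used_ring_idx += 1;                /* guest may consume now */
    inf->desc[i].inflight = 0;
    inf->used_idx = *used_ring_idx;
}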

Thanks,
Yongji


end of thread (newest: ~2019-02-24  8:03 UTC)

Thread overview: 18+ messages
2019-02-18 10:27 [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 1/7] vhost-user: Support transferring inflight buffer between qemu and backend elohimes
2019-02-21 17:27   ` Michael S. Tsirkin
2019-02-22  2:47     ` Yongji Xie
2019-02-22  6:21       ` Michael S. Tsirkin
2019-02-22  7:05         ` Yongji Xie
2019-02-22 14:54           ` Michael S. Tsirkin
2019-02-23 13:10             ` Yongji Xie
2019-02-24  0:14               ` Michael S. Tsirkin
2019-02-24  8:02                 ` Yongji Xie
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 2/7] libvhost-user: Remove unnecessary FD flag check for event file descriptors elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 3/7] libvhost-user: Introduce vu_queue_map_desc() elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 4/7] libvhost-user: Support tracking inflight I/O in shared memory elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 5/7] vhost-user-blk: Add support to get/set inflight buffer elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 6/7] vhost-user-blk: Add support to reconnect backend elohimes
2019-02-18 10:27 ` [Qemu-devel] [PATCH v6 7/7] contrib/vhost-user-blk: enable inflight I/O tracking elohimes
2019-02-20 19:59 ` [Qemu-devel] [PATCH v6 0/7] vhost-user-blk: Add support for backend reconnecting Michael S. Tsirkin
2019-02-21  1:30   ` Yongji Xie
