* [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
@ 2016-01-31 10:28 Paolo Bonzini
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

This series includes two optimizations of virtio:

- "slimming down" VirtQueueElements by not including room for
  1024 buffers.  This makes malloc much faster.

- optimizations to limit the number of address_space_translate
  calls in virtio.c, from Vincenzo and myself.
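
As a rough before/after sketch of the first point (simplified, hypothetical
types; the real VirtQueueElement also carries in_addr/out_addr arrays):

    #include <sys/uio.h>   /* struct iovec */

    /* Before: every element reserves room for 1024 descriptors up front,
     * i.e. tens of kilobytes per allocation. */
    struct FatElement {
        unsigned int index, out_num, in_num;
        struct iovec in_sg[1024];
        struct iovec out_sg[1024];
    };

    /* After: the element only holds pointers; the arrays are sized to the
     * actual s/g list and live in the same, much smaller allocation. */
    struct SlimElement {
        unsigned int index, out_num, in_num;
        struct iovec *in_sg;
        struct iovec *out_sg;
    };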

Thanks,

Paolo

v1->v2: improved commit messages [Conny]
        add assertions on sz [Conny]
        change bools from 1 and 0 to "true" and "false" [Conny]
        update shadow avail_idx in virtio_queue_set_last_avail_idx [Michael]
        collect Reviewed-by
        
Paolo Bonzini (7):
  virtio: move VirtQueueElement at the beginning of the structs
  virtio: move allocation to virtqueue_pop/vring_pop
  virtio: introduce qemu_get/put_virtqueue_element
  virtio: introduce virtqueue_alloc_element
  virtio: slim down allocation of VirtQueueElements
  vring: slim down allocation of VirtQueueElements
  virtio: combine the read of a descriptor

Vincenzo Maffione (3):
  virtio: cache used_idx in a VirtQueue field
  virtio: read avail_idx from VQ only when necessary
  virtio: combine write of an entry into used ring

 hw/9pfs/9p.c                        |   2 +-
 hw/9pfs/virtio-9p-device.c          |  17 +-
 hw/9pfs/virtio-9p.h                 |   2 +-
 hw/block/dataplane/virtio-blk.c     |  11 +-
 hw/block/virtio-blk.c               |  23 +--
 hw/char/virtio-serial-bus.c         |  78 +++++----
 hw/display/virtio-gpu.c             |  25 ++-
 hw/input/virtio-input.c             |  24 ++-
 hw/net/virtio-net.c                 |  69 +++++---
 hw/scsi/virtio-scsi-dataplane.c     |  15 +-
 hw/scsi/virtio-scsi.c               |  26 ++-
 hw/virtio/dataplane/vring.c         |  62 ++++---
 hw/virtio/virtio-balloon.c          |  22 ++-
 hw/virtio/virtio-rng.c              |  10 +-
 hw/virtio/virtio.c                  | 340 +++++++++++++++++++++++++-----------
 include/hw/virtio/dataplane/vring.h |   2 +-
 include/hw/virtio/virtio-balloon.h  |   2 +-
 include/hw/virtio/virtio-blk.h      |   5 +-
 include/hw/virtio/virtio-net.h      |   2 +-
 include/hw/virtio/virtio-scsi.h     |  15 +-
 include/hw/virtio/virtio-serial.h   |   2 +-
 include/hw/virtio/virtio.h          |  13 +-
 22 files changed, 486 insertions(+), 281 deletions(-)

-- 
2.5.0


* [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
@ 2016-01-31 10:28 ` Paolo Bonzini
  2016-02-01 11:17   ` Cornelia Huck
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

The next patch will make virtqueue_pop/vring_pop allocate memory for
the VirtQueueElement. In some cases (blk, scsi, gpu) the device wants
to extend VirtQueueElement with device-specific fields and, until now,
the place of the VirtQueueElement within the containing struct didn't
matter. When allocating the entire block in virtqueue_pop/vring_pop,
however, the containing struct must basically be a "subclass" of
VirtQueueElement, with the VirtQueueElement as the first field. Make
that the case for blk and scsi; gpu is already doing it.
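
Schematically, this is the usual C "subclass" idiom (trimmed-down,
illustrative types only; the real structs are in the diff below):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct VirtQueueElement { unsigned int index; } VirtQueueElement;

    typedef struct VirtIOBlockReq {
        VirtQueueElement elem;      /* must be the first field */
        int64_t sector_num;
        /* ... more device-specific fields ... */
    } VirtIOBlockReq;

    int main(void)
    {
        /* A pop function can allocate sizeof(VirtIOBlockReq) bytes and hand
         * back a VirtQueueElement *; because elem is at offset 0, the device
         * can cast the result back to the containing struct. */
        VirtQueueElement *elem = calloc(1, sizeof(VirtIOBlockReq));
        VirtIOBlockReq *req = (VirtIOBlockReq *)elem;

        free(req);
        return 0;
    }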

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/virtio-scsi.c           |  3 +--
 include/hw/virtio/virtio-blk.h  |  2 +-
 include/hw/virtio/virtio-scsi.h | 13 ++++++-------
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 607593c..df8e379 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -44,8 +44,7 @@ VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
     VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
-                             + sizeof(VirtQueueElement);
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
     req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index ae11a63..403ab86 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -60,9 +60,9 @@ typedef struct VirtIOBlock {
 } VirtIOBlock;
 
 typedef struct VirtIOBlockReq {
+    VirtQueueElement elem;
     int64_t sector_num;
     VirtIOBlock *dev;
-    VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr out;
     QEMUIOVector qiov;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 088fe9f..63f5b51 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -102,18 +102,17 @@ typedef struct VirtIOSCSI {
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
+    /* Note:
+     * - fields up to resp_iov are initialized by virtio_scsi_init_req;
+     * - fields starting at vring are zeroed by virtio_scsi_init_req.
+     * */
+    VirtQueueElement elem;
+
     VirtIOSCSI *dev;
     VirtQueue *vq;
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Note:
-     * - fields before elem are initialized by virtio_scsi_init_req;
-     * - elem is uninitialized at the time of allocation.
-     * - fields after elem are zeroed by virtio_scsi_init_req.
-     * */
-
-    VirtQueueElement elem;
     /* Set by dataplane code. */
     VirtIOSCSIVring *vring;
 
-- 
2.5.0


* [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
@ 2016-01-31 10:28 ` Paolo Bonzini
  2016-02-01 11:20   ` Cornelia Huck
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 03/10] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

The return value of virtqueue_pop/vring_pop is only ever checked for errors
or for 0 (an empty queue).  We can thus easily move the allocation inside
the functions and just return a pointer to the VirtQueueElement.

The advantage is that we will be able to allocate only the space that
is needed for the actual size of the s/g list instead of the full
VIRTQUEUE_MAX_SIZE items.  Currently VirtQueueElement takes about 48K
of memory, and this kind of allocation puts a lot of stress on malloc.
By cutting the size by two or three orders of magnitude, malloc can
use much more efficient algorithms.

The patch is pretty large, but changes to each device are testable
more or less independently.  Splitting it would mostly add churn.
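
All converted devices end up with the same loop shape; schematically (error
handling and the device-specific processing trimmed, "len" standing in for
whatever length the device wants to report):

    VirtQueueElement *elem;

    for (;;) {
        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
        if (!elem) {
            break;                     /* queue is empty */
        }
        /* ... use elem->in_sg / elem->out_sg ... */
        virtqueue_push(vq, elem, len);
        g_free(elem);                  /* the caller now owns the allocation */
    }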

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: add assertions on sz [Conny]

 hw/9pfs/9p.c                        |  2 +-
 hw/9pfs/virtio-9p-device.c          | 17 ++++----
 hw/9pfs/virtio-9p.h                 |  2 +-
 hw/block/dataplane/virtio-blk.c     | 11 +++--
 hw/block/virtio-blk.c               | 15 +++----
 hw/char/virtio-serial-bus.c         | 80 +++++++++++++++++++++++--------------
 hw/display/virtio-gpu.c             | 25 +++++++-----
 hw/input/virtio-input.c             | 24 +++++++----
 hw/net/virtio-net.c                 | 69 ++++++++++++++++++++------------
 hw/scsi/virtio-scsi-dataplane.c     | 15 +++----
 hw/scsi/virtio-scsi.c               | 18 ++++-----
 hw/virtio/dataplane/vring.c         | 18 +++++----
 hw/virtio/virtio-balloon.c          | 22 ++++++----
 hw/virtio/virtio-rng.c              | 10 +++--
 hw/virtio/virtio.c                  | 12 ++++--
 include/hw/virtio/dataplane/vring.h |  2 +-
 include/hw/virtio/virtio-balloon.h  |  2 +-
 include/hw/virtio/virtio-blk.h      |  3 +-
 include/hw/virtio/virtio-net.h      |  2 +-
 include/hw/virtio/virtio-scsi.h     |  2 +-
 include/hw/virtio/virtio-serial.h   |  2 +-
 include/hw/virtio/virtio.h          |  2 +-
 22 files changed, 212 insertions(+), 143 deletions(-)

diff --git a/hw/9pfs/9p.c b/hw/9pfs/9p.c
index 3ff3106..ad1ae96 100644
--- a/hw/9pfs/9p.c
+++ b/hw/9pfs/9p.c
@@ -1586,7 +1586,7 @@ static int v9fs_xattr_read(V9fsState *s, V9fsPDU *pdu, V9fsFidState *fidp,
     int read_count;
     int64_t xattr_len;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
-    VirtQueueElement *elem = &v->elems[pdu->idx];
+    VirtQueueElement *elem = v->elems[pdu->idx];
 
     xattr_len = fidp->fs.xattr.len;
     read_count = xattr_len - off;
diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index 5643fd5..a0fe179 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -25,10 +25,12 @@ void virtio_9p_push_and_notify(V9fsPDU *pdu)
 {
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
-    VirtQueueElement *elem = &v->elems[pdu->idx];
+    VirtQueueElement *elem = v->elems[pdu->idx];
 
     /* push onto queue and notify */
     virtqueue_push(v->vq, elem, pdu->size);
+    g_free(elem);
+    v->elems[pdu->idx] = NULL;
 
     /* FIXME: we should batch these completions */
     virtio_notify(VIRTIO_DEVICE(v), v->vq);
@@ -47,10 +49,10 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
             uint8_t id;
             uint16_t tag_le;
         } QEMU_PACKED out;
-        VirtQueueElement *elem = &v->elems[pdu->idx];
+        VirtQueueElement *elem;
 
-        len = virtqueue_pop(vq, elem);
-        if (!len) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
             pdu_free(pdu);
             break;
         }
@@ -58,6 +60,7 @@ static void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
         BUG_ON(elem->out_num == 0 || elem->in_num == 0);
         QEMU_BUILD_BUG_ON(sizeof out != 7);
 
+        v->elems[pdu->idx] = elem;
         len = iov_to_buf(elem->out_sg, elem->out_num, 0,
                          &out, sizeof out);
         BUG_ON(len != sizeof out);
@@ -140,7 +143,7 @@ ssize_t virtio_pdu_vmarshal(V9fsPDU *pdu, size_t offset,
 {
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
-    VirtQueueElement *elem = &v->elems[pdu->idx];
+    VirtQueueElement *elem = v->elems[pdu->idx];
 
     return v9fs_iov_vmarshal(elem->in_sg, elem->in_num, offset, 1, fmt, ap);
 }
@@ -150,7 +153,7 @@ ssize_t virtio_pdu_vunmarshal(V9fsPDU *pdu, size_t offset,
 {
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
-    VirtQueueElement *elem = &v->elems[pdu->idx];
+    VirtQueueElement *elem = v->elems[pdu->idx];
 
     return v9fs_iov_vunmarshal(elem->out_sg, elem->out_num, offset, 1, fmt, ap);
 }
@@ -160,7 +163,7 @@ void virtio_init_iov_from_pdu(V9fsPDU *pdu, struct iovec **piov,
 {
     V9fsState *s = pdu->s;
     V9fsVirtioState *v = container_of(s, V9fsVirtioState, state);
-    VirtQueueElement *elem = &v->elems[pdu->idx];
+    VirtQueueElement *elem = v->elems[pdu->idx];
 
     if (is_write) {
         *piov = elem->out_sg;
diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index 1cdf0a2..7f6d885 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -11,7 +11,7 @@ typedef struct V9fsVirtioState
     VirtQueue *vq;
     size_t config_size;
     V9fsPDU pdus[MAX_REQ];
-    VirtQueueElement elems[MAX_REQ];
+    VirtQueueElement *elems[MAX_REQ];
     V9fsState state;
 } V9fsVirtioState;
 
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index b8ce6cd..5ea28ae 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -97,20 +97,19 @@ static void handle_notify(EventNotifier *e)
     blk_io_plug(s->conf->conf.blk);
     for (;;) {
         MultiReqBuffer mrb = {};
-        int ret;
 
         /* Disable guest->host notifies to avoid unnecessary vmexits */
         vring_disable_notification(s->vdev, &s->vring);
 
         for (;;) {
-            VirtIOBlockReq *req = virtio_blk_alloc_request(vblk);
+            VirtIOBlockReq *req = vring_pop(s->vdev, &s->vring,
+                                            sizeof(VirtIOBlockReq));
 
-            ret = vring_pop(s->vdev, &s->vring, &req->elem);
-            if (ret < 0) {
-                virtio_blk_free_request(req);
+            if (req == NULL) {
                 break; /* no more requests */
             }
 
+            virtio_blk_init_request(vblk, req);
             trace_virtio_blk_data_plane_process_request(s, req->elem.out_num,
                                                         req->elem.in_num,
                                                         req->elem.index);
@@ -122,7 +121,7 @@ static void handle_notify(EventNotifier *e)
             virtio_blk_submit_multireq(s->conf->conf.blk, &mrb);
         }
 
-        if (likely(ret == -EAGAIN)) { /* vring emptied */
+        if (likely(!vring_more_avail(s->vdev, &s->vring))) { /* vring emptied */
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 51f867b..a874cb7 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -28,15 +28,13 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req)
 {
-    VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
     req->dev = s;
     req->qiov.size = 0;
     req->in_len = 0;
     req->next = NULL;
     req->mr_next = NULL;
-    return req;
 }
 
 void virtio_blk_free_request(VirtIOBlockReq *req)
@@ -192,13 +190,11 @@ out:
 
 static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+    VirtIOBlockReq *req = virtqueue_pop(s->vq, sizeof(VirtIOBlockReq));
 
-    if (!virtqueue_pop(s->vq, &req->elem)) {
-        virtio_blk_free_request(req);
-        return NULL;
+    if (req) {
+        virtio_blk_init_request(s, req);
     }
-
     return req;
 }
 
@@ -835,7 +831,8 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        virtio_blk_init_request(s, req);
         qemu_get_buffer(f, (unsigned char *)&req->elem,
                         sizeof(VirtQueueElement));
         req->next = s->rq;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 2d2a659..f0c1c45 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -82,7 +82,7 @@ static bool use_multiport(VirtIOSerial *vser)
 static size_t write_to_port(VirtIOSerialPort *port,
                             const uint8_t *buf, size_t size)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
     size_t offset;
 
@@ -95,15 +95,17 @@ static size_t write_to_port(VirtIOSerialPort *port,
     while (offset < size) {
         size_t len;
 
-        if (!virtqueue_pop(vq, &elem)) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
 
-        len = iov_from_buf(elem.in_sg, elem.in_num, 0,
+        len = iov_from_buf(elem->in_sg, elem->in_num, 0,
                            buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vq, &elem, len);
+        virtqueue_push(vq, elem, len);
+        g_free(elem);
     }
 
     virtio_notify(VIRTIO_DEVICE(port->vser), vq);
@@ -112,13 +114,18 @@ static size_t write_to_port(VirtIOSerialPort *port,
 
 static void discard_vq_data(VirtQueue *vq, VirtIODevice *vdev)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        virtqueue_push(vq, &elem, 0);
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        virtqueue_push(vq, elem, 0);
+        g_free(elem);
     }
     virtio_notify(vdev, vq);
 }
@@ -137,21 +144,22 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         unsigned int i;
 
         /* Pop an elem only if we haven't left off a previous one mid-way */
-        if (!port->elem.out_num) {
-            if (!virtqueue_pop(vq, &port->elem)) {
+        if (!port->elem) {
+            port->elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+            if (!port->elem) {
                 break;
             }
             port->iov_idx = 0;
             port->iov_offset = 0;
         }
 
-        for (i = port->iov_idx; i < port->elem.out_num; i++) {
+        for (i = port->iov_idx; i < port->elem->out_num; i++) {
             size_t buf_size;
             ssize_t ret;
 
-            buf_size = port->elem.out_sg[i].iov_len - port->iov_offset;
+            buf_size = port->elem->out_sg[i].iov_len - port->iov_offset;
             ret = vsc->have_data(port,
-                                  port->elem.out_sg[i].iov_base
+                                  port->elem->out_sg[i].iov_base
                                   + port->iov_offset,
                                   buf_size);
             if (port->throttled) {
@@ -166,8 +174,9 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         if (port->throttled) {
             break;
         }
-        virtqueue_push(vq, &port->elem, 0);
-        port->elem.out_num = 0;
+        virtqueue_push(vq, port->elem, 0);
+        g_free(port->elem);
+        port->elem = NULL;
     }
     virtio_notify(vdev, vq);
 }
@@ -184,22 +193,26 @@ static void flush_queued_data(VirtIOSerialPort *port)
 
 static size_t send_control_msg(VirtIOSerial *vser, void *buf, size_t len)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
 
     vq = vser->c_ivq;
     if (!virtio_queue_ready(vq)) {
         return 0;
     }
-    if (!virtqueue_pop(vq, &elem)) {
+
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         return 0;
     }
 
     /* TODO: detect a buffer that's too short, set NEEDS_RESET */
-    iov_from_buf(elem.in_sg, elem.in_num, 0, buf, len);
+    iov_from_buf(elem->in_sg, elem->in_num, 0, buf, len);
 
-    virtqueue_push(vq, &elem, len);
+    virtqueue_push(vq, elem, len);
     virtio_notify(VIRTIO_DEVICE(vser), vq);
+    g_free(elem);
+
     return len;
 }
 
@@ -413,7 +426,7 @@ static void control_in(VirtIODevice *vdev, VirtQueue *vq)
 
 static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtIOSerial *vser;
     uint8_t *buf;
     size_t len;
@@ -422,10 +435,15 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 
     len = 0;
     buf = NULL;
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t cur_len;
 
-        cur_len = iov_size(elem.out_sg, elem.out_num);
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        cur_len = iov_size(elem->out_sg, elem->out_num);
         /*
          * Allocate a new buf only if we didn't have one previously or
          * if the size of the buf differs
@@ -436,10 +454,11 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
             buf = g_malloc(cur_len);
             len = cur_len;
         }
-        iov_to_buf(elem.out_sg, elem.out_num, 0, buf, cur_len);
+        iov_to_buf(elem->out_sg, elem->out_num, 0, buf, cur_len);
 
         handle_control_message(vser, buf, cur_len);
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
+        g_free(elem);
     }
     g_free(buf);
     virtio_notify(vdev, vq);
@@ -619,7 +638,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         qemu_put_byte(f, port->host_connected);
 
 	elem_popped = 0;
-        if (port->elem.out_num) {
+        if (port->elem) {
             elem_popped = 1;
         }
         qemu_put_be32s(f, &elem_popped);
@@ -627,8 +646,8 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
 
-            qemu_put_buffer(f, (unsigned char *)&port->elem,
-                            sizeof(port->elem));
+            qemu_put_buffer(f, (unsigned char *)port->elem,
+                            sizeof(VirtQueueElement));
         }
     }
 }
@@ -703,9 +722,10 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                qemu_get_buffer(f, (unsigned char *)&port->elem,
-                                sizeof(port->elem));
-                virtqueue_map(&port->elem);
+                port->elem = g_new(VirtQueueElement, 1);
+                qemu_get_buffer(f, (unsigned char *)port->elem,
+                                sizeof(VirtQueueElement));
+                virtqueue_map(port->elem);
 
                 /*
                  *  Port was throttled on source machine.  Let's
@@ -927,7 +947,7 @@ static void virtser_port_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    port->elem.out_num = 0;
+    port->elem = NULL;
 }
 
 static void virtser_port_device_plug(HotplugHandler *hotplug_dev,
diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
index 7e79a9c..ea10603 100644
--- a/hw/display/virtio-gpu.c
+++ b/hw/display/virtio-gpu.c
@@ -770,8 +770,8 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     }
 #endif
 
-    cmd = g_new(struct virtio_gpu_ctrl_command, 1);
-    while (virtqueue_pop(vq, &cmd->elem)) {
+    cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
+    while (cmd) {
         cmd->vq = vq;
         cmd->error = 0;
         cmd->finished = false;
@@ -781,7 +781,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 
         VIRGL(g, virtio_gpu_virgl_process_cmd, virtio_gpu_simple_process_cmd,
               g, cmd);
-        if (!cmd->finished) {
+        if (cmd->finished) {
+            g_free(cmd);
+        } else {
             QTAILQ_INSERT_TAIL(&g->fenceq, cmd, next);
             g->inflight++;
             if (virtio_gpu_stats_enabled(g->conf)) {
@@ -790,10 +792,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
                 }
                 fprintf(stderr, "inflight: %3d (+)\r", g->inflight);
             }
-            cmd = g_new(struct virtio_gpu_ctrl_command, 1);
         }
+        cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
     }
-    g_free(cmd);
 
 #ifdef CONFIG_VIRGL
     if (g->use_virgl_renderer) {
@@ -811,15 +812,20 @@ static void virtio_gpu_ctrl_bh(void *opaque)
 static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOGPU *g = VIRTIO_GPU(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct virtio_gpu_update_cursor cursor_info;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        s = iov_to_buf(elem.out_sg, elem.out_num, 0,
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        s = iov_to_buf(elem->out_sg, elem->out_num, 0,
                        &cursor_info, sizeof(cursor_info));
         if (s != sizeof(cursor_info)) {
             qemu_log_mask(LOG_GUEST_ERROR,
@@ -828,8 +834,9 @@ static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
         } else {
             update_cursor(g, &cursor_info);
         }
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 1f5a40d..bbae884 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -16,7 +16,7 @@
 
 void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     unsigned have, need;
     int i, len;
 
@@ -49,14 +49,16 @@ void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 
     /* ... and finally pass them to the guest */
     for (i = 0; i < vinput->qindex; i++) {
-        if (!virtqueue_pop(vinput->evt, &elem)) {
+        elem = virtqueue_pop(vinput->evt, sizeof(VirtQueueElement));
+        if (!elem) {
             /* should not happen, we've checked for space beforehand */
             fprintf(stderr, "%s: Huh?  No vq elem available ...\n", __func__);
             return;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, vinput->queue+i, sizeof(virtio_input_event));
-        virtqueue_push(vinput->evt, &elem, len);
+        virtqueue_push(vinput->evt, elem, len);
+        g_free(elem);
     }
     virtio_notify(VIRTIO_DEVICE(vinput), vinput->evt);
     vinput->qindex = 0;
@@ -72,17 +74,23 @@ static void virtio_input_handle_sts(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOInputClass *vic = VIRTIO_INPUT_GET_CLASS(vdev);
     VirtIOInput *vinput = VIRTIO_INPUT(vdev);
     virtio_input_event event;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int len;
 
-    while (virtqueue_pop(vinput->sts, &elem)) {
+    for (;;) {
+        elem = virtqueue_pop(vinput->sts, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
         memset(&event, 0, sizeof(event));
-        len = iov_to_buf(elem.out_sg, elem.out_num,
+        len = iov_to_buf(elem->out_sg, elem->out_num,
                          0, &event, sizeof(event));
         if (vic->handle_status) {
             vic->handle_status(vinput, &event);
         }
-        virtqueue_push(vinput->sts, &elem, len);
+        virtqueue_push(vinput->sts, elem, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vinput->sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index a877614..c03b568 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -818,20 +818,24 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = VIRTIO_NET(vdev);
     struct virtio_net_ctrl_hdr ctrl;
     virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct iovec *iov, *iov2;
     unsigned int iov_cnt;
 
-    while (virtqueue_pop(vq, &elem)) {
-        if (iov_size(elem.in_sg, elem.in_num) < sizeof(status) ||
-            iov_size(elem.out_sg, elem.out_num) < sizeof(ctrl)) {
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        if (iov_size(elem->in_sg, elem->in_num) < sizeof(status) ||
+            iov_size(elem->out_sg, elem->out_num) < sizeof(ctrl)) {
             error_report("virtio-net ctrl missing headers");
             exit(1);
         }
 
-        iov_cnt = elem.out_num;
-        iov2 = iov = g_memdup(elem.out_sg, sizeof(struct iovec) * elem.out_num);
+        iov_cnt = elem->out_num;
+        iov2 = iov = g_memdup(elem->out_sg, sizeof(struct iovec) * elem->out_num);
         s = iov_to_buf(iov, iov_cnt, 0, &ctrl, sizeof(ctrl));
         iov_discard_front(&iov, &iov_cnt, sizeof(ctrl));
         if (s != sizeof(ctrl)) {
@@ -850,12 +854,13 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
             status = virtio_net_handle_offloads(n, ctrl.cmd, iov, iov_cnt);
         }
 
-        s = iov_from_buf(elem.in_sg, elem.in_num, 0, &status, sizeof(status));
+        s = iov_from_buf(elem->in_sg, elem->in_num, 0, &status, sizeof(status));
         assert(s == sizeof(status));
 
-        virtqueue_push(vq, &elem, sizeof(status));
+        virtqueue_push(vq, elem, sizeof(status));
         virtio_notify(vdev, vq);
         g_free(iov2);
+        g_free(elem);
     }
 }
 
@@ -1044,13 +1049,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
     offset = i = 0;
 
     while (offset < size) {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
         int len, total;
-        const struct iovec *sg = elem.in_sg;
+        const struct iovec *sg;
 
         total = 0;
 
-        if (virtqueue_pop(q->rx_vq, &elem) == 0) {
+        elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -1063,21 +1069,22 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
             exit(1);
         }
 
-        if (elem.in_num < 1) {
+        if (elem->in_num < 1) {
             error_report("virtio-net receive queue contains no in buffers");
             exit(1);
         }
 
+        sg = elem->in_sg;
         if (i == 0) {
             assert(offset == 0);
             if (n->mergeable_rx_bufs) {
                 mhdr_cnt = iov_copy(mhdr_sg, ARRAY_SIZE(mhdr_sg),
-                                    sg, elem.in_num,
+                                    sg, elem->in_num,
                                     offsetof(typeof(mhdr), num_buffers),
                                     sizeof(mhdr.num_buffers));
             }
 
-            receive_header(n, sg, elem.in_num, buf, size);
+            receive_header(n, sg, elem->in_num, buf, size);
             offset = n->host_hdr_len;
             total += n->guest_hdr_len;
             guest_offset = n->guest_hdr_len;
@@ -1086,7 +1093,7 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
         }
 
         /* copy in packet.  ugh */
-        len = iov_from_buf(sg, elem.in_num, guest_offset,
+        len = iov_from_buf(sg, elem->in_num, guest_offset,
                            buf + offset, size - offset);
         total += len;
         offset += len;
@@ -1094,12 +1101,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
          * must have consumed the complete packet.
          * Otherwise, drop it. */
         if (!n->mergeable_rx_bufs && offset < size) {
-            virtqueue_discard(q->rx_vq, &elem, total);
+            virtqueue_discard(q->rx_vq, elem, total);
+            g_free(elem);
             return size;
         }
 
         /* signal other side */
-        virtqueue_fill(q->rx_vq, &elem, total, i++);
+        virtqueue_fill(q->rx_vq, elem, total, i++);
+        g_free(elem);
     }
 
     if (mhdr_cnt) {
@@ -1123,10 +1132,11 @@ static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
 
-    virtqueue_push(q->tx_vq, &q->async_tx.elem, 0);
+    virtqueue_push(q->tx_vq, q->async_tx.elem, 0);
     virtio_notify(vdev, q->tx_vq);
 
-    q->async_tx.elem.out_num = 0;
+    g_free(q->async_tx.elem);
+    q->async_tx.elem = NULL;
 
     virtio_queue_set_notification(q->tx_vq, 1);
     virtio_net_flush_tx(q);
@@ -1137,25 +1147,31 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
 {
     VirtIONet *n = q->n;
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int32_t num_packets = 0;
     int queue_index = vq2q(virtio_get_queue_index(q->tx_vq));
     if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
-    if (q->async_tx.elem.out_num) {
+    if (q->async_tx.elem) {
         virtio_queue_set_notification(q->tx_vq, 0);
         return num_packets;
     }
 
-    while (virtqueue_pop(q->tx_vq, &elem)) {
+    for (;;) {
         ssize_t ret;
-        unsigned int out_num = elem.out_num;
-        struct iovec *out_sg = &elem.out_sg[0];
-        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1];
+        unsigned int out_num;
+        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
         struct virtio_net_hdr_mrg_rxbuf mhdr;
 
+        elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        out_num = elem->out_num;
+        out_sg = elem->out_sg;
         if (out_num < 1) {
             error_report("virtio-net header not in first element");
             exit(1);
@@ -1207,8 +1223,9 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         }
 
 drop:
-        virtqueue_push(q->tx_vq, &elem, 0);
+        virtqueue_push(q->tx_vq, elem, 0);
         virtio_notify(vdev, q->tx_vq);
+        g_free(elem);
 
         if (++num_packets >= n->tx_burst) {
             break;
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 0d8d71e..45ae2d2 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -80,15 +80,16 @@ fail_vring:
 VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
                                          VirtIOSCSIVring *vring)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, NULL);
-    int r;
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
 
-    req->vring = vring;
-    r = vring_pop((VirtIODevice *)s, &vring->vring, &req->elem);
-    if (r < 0) {
-        virtio_scsi_free_req(req);
-        req = NULL;
+    req = vring_pop((VirtIODevice *)s, &vring->vring,
+                    sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
+        return NULL;
     }
+    virtio_scsi_init_req(s, NULL, req);
+    req->vring = vring;
     return req;
 }
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index df8e379..ca20a1d 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -40,19 +40,15 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
     return scsi_device_find(&s->bus, 0, lun[1], virtio_scsi_get_lun(lun));
 }
 
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
 {
-    VirtIOSCSIReq *req;
-    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
     const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
-    req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
     req->dev = s;
     qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
     qemu_iovec_init(&req->resp_iov, 1);
     memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
-    return req;
 }
 
 void virtio_scsi_free_req(VirtIOSCSIReq *req)
@@ -173,11 +169,14 @@ static int virtio_scsi_parse_req(VirtIOSCSIReq *req,
 
 static VirtIOSCSIReq *virtio_scsi_pop_req(VirtIOSCSI *s, VirtQueue *vq)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, vq);
-    if (!virtqueue_pop(vq, &req->elem)) {
-        virtio_scsi_free_req(req);
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
+
+    req = virtqueue_pop(vq, sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
         return NULL;
     }
+    virtio_scsi_init_req(s, vq, req);
     return req;
 }
 
@@ -202,8 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = virtio_scsi_init_req(s, vs->cmd_vqs[n]);
+    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
     qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
     virtqueue_map(&req->elem);
 
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 23f667e..11e7f9f 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -388,23 +388,26 @@ static void vring_unmap_element(VirtQueueElement *elem)
  *
  * Stolen from linux/drivers/vhost/vhost.c.
  */
-int vring_pop(VirtIODevice *vdev, Vring *vring,
-              VirtQueueElement *elem)
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
 {
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueElement *elem = NULL;
     int ret;
 
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
-
     /* If there was a fatal error then refuse operation */
     if (vring->broken) {
         ret = -EFAULT;
         goto out;
     }
 
+    assert(sz >= sizeof(VirtQueueElement));
+    elem = g_malloc(sz);
+
+    /* Initialize elem so it can be safely unmapped */
+    elem->in_num = elem->out_num = 0;
+
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
     avail_idx = vring_get_avail_idx(vdev, vring);
@@ -480,7 +483,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
-    return head;
+    return elem;
 
 out:
     assert(ret < 0);
@@ -488,7 +491,8 @@ out:
         vring->broken = true;
     }
     vring_unmap_element(elem);
-    return ret;
+    g_free(elem);
+    return NULL;
 }
 
 /* After we've used one of their buffers, we tell them about it.
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 9671635..2704680 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -106,8 +106,10 @@ static void balloon_stats_poll_cb(void *opaque)
         return;
     }
 
-    virtqueue_push(s->svq, &s->stats_vq_elem, s->stats_vq_offset);
+    virtqueue_push(s->svq, s->stats_vq_elem, s->stats_vq_offset);
     virtio_notify(vdev, s->svq);
+    g_free(s->stats_vq_elem);
+    s->stats_vq_elem = NULL;
 }
 
 static void balloon_stats_get_all(Object *obj, struct Visitor *v,
@@ -205,14 +207,18 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     MemoryRegionSection section;
 
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t offset = 0;
         uint32_t pfn;
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
 
-        while (iov_to_buf(elem.out_sg, elem.out_num, offset, &pfn, 4) == 4) {
+        while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
             ram_addr_t pa;
             ram_addr_t addr;
             int p = virtio_ldl_p(vdev, &pfn);
@@ -235,20 +241,22 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             memory_region_unref(section.mr);
         }
 
-        virtqueue_push(vq, &elem, offset);
+        virtqueue_push(vq, elem, offset);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement *elem = &s->stats_vq_elem;
+    VirtQueueElement *elem;
     VirtIOBalloonStat stat;
     size_t offset = 0;
     qemu_timeval tv;
 
-    if (!virtqueue_pop(vq, elem)) {
+    s->stats_vq_elem = elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         goto out;
     }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index 97d1541..5f92047 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -43,7 +43,7 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 {
     VirtIORNG *vrng = opaque;
     VirtIODevice *vdev = VIRTIO_DEVICE(vrng);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t len;
     int offset;
 
@@ -55,15 +55,17 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 
     offset = 0;
     while (offset < size) {
-        if (!virtqueue_pop(vrng->vq, &elem)) {
+        elem = virtqueue_pop(vrng->vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vrng->vq, &elem, len);
+        virtqueue_push(vrng->vq, elem, len);
         trace_virtio_rng_pushed(vrng, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vrng->vq);
 }
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index bd6b4df..9b2c0bf 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -501,16 +501,20 @@ void virtqueue_map(VirtQueueElement *elem)
                         0);
 }
 
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
+void *virtqueue_pop(VirtQueue *vq, size_t sz)
 {
     unsigned int i, head, max;
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
+    VirtQueueElement *elem;
 
-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
-        return 0;
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
+        return NULL;
+    }
 
     /* When we start there are none of either input nor output. */
+    assert(sz >= sizeof(VirtQueueElement));
+    elem = g_malloc(sz);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -569,7 +573,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     vq->inuse++;
 
     trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
-    return elem->in_num + elem->out_num;
+    return elem;
 }
 
 /* virtio device */
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index a596e4c..e80985e 100644
--- a/include/hw/virtio/dataplane/vring.h
+++ b/include/hw/virtio/dataplane/vring.h
@@ -44,7 +44,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_enable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
-int vring_pop(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem);
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
 void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
                 int len);
 
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 09c2ce4..35f62ac 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -37,7 +37,7 @@ typedef struct VirtIOBalloon {
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
-    VirtQueueElement stats_vq_elem;
+    VirtQueueElement *stats_vq_elem;
     size_t stats_vq_offset;
     QEMUTimer *stats_timer;
     int64_t stats_last_update;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 403ab86..199bb0e 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -80,8 +80,7 @@ typedef struct MultiReqBuffer {
     bool is_write;
 } MultiReqBuffer;
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
-
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req);
 void virtio_blk_free_request(VirtIOBlockReq *req);
 
 void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb);
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index f3cc25f..2ce3b03 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -47,7 +47,7 @@ typedef struct VirtIONetQueue {
     QEMUBH *tx_bh;
     int tx_waiting;
     struct {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
     } async_tx;
     struct VirtIONet *n;
 } VirtIONetQueue;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 63f5b51..d9ae76c 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -150,7 +150,7 @@ void virtio_scsi_common_unrealize(DeviceState *dev, Error **errp);
 void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, VirtIOSCSIReq *req);
 bool virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq *req);
 void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, VirtIOSCSIReq *req);
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq);
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req);
 void virtio_scsi_free_req(VirtIOSCSIReq *req);
 void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
                             uint32_t event, uint32_t reason);
diff --git a/include/hw/virtio/virtio-serial.h b/include/hw/virtio/virtio-serial.h
index 527d0bf..12a55a1 100644
--- a/include/hw/virtio/virtio-serial.h
+++ b/include/hw/virtio/virtio-serial.h
@@ -122,7 +122,7 @@ struct VirtIOSerialPort {
      * element popped and continue consuming it once the backend
      * becomes writable again.
      */
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     /*
      * The index and the offset into the iov buffer that was popped in
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 205fadf..21fda17 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -152,7 +152,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx);
 
 void virtqueue_map(VirtQueueElement *elem);
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem);
+void *virtqueue_pop(VirtQueue *vq, size_t sz);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
2.5.0


* [Qemu-devel] [PATCH 03/10] virtio: introduce qemu_get/put_virtqueue_element
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2016-01-31 10:28 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 04/10] virtio: introduce virtqueue_alloc_element Paolo Bonzini
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

Move allocation into the virtio functions also when loading/saving a
VirtQueueElement.  This will let the load/save functions preserve
backwards compatibility when the VirtQueueElement layout changes.
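
For a device whose request struct embeds the element as its first field (see
patch 1), usage boils down to the following, taking virtio-blk's conversion
below as the example:

    /* save side */
    qemu_put_virtqueue_element(f, &req->elem);

    /* load side: allocates the whole containing struct and remaps the
     * element's iovecs via virtqueue_map() */
    req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));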

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/virtio-blk.c       | 10 +++-------
 hw/char/virtio-serial-bus.c | 10 +++-------
 hw/scsi/virtio-scsi.c       |  7 ++-----
 hw/virtio/virtio.c          | 13 +++++++++++++
 include/hw/virtio/virtio.h  |  2 ++
 5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index a874cb7..75b2bfc 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -807,8 +807,7 @@ static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 
     while (req) {
         qemu_put_sbyte(f, 1);
-        qemu_put_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
+        qemu_put_virtqueue_element(f, &req->elem);
         req = req->next;
     }
     qemu_put_sbyte(f, 0);
@@ -831,14 +830,11 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        VirtIOBlockReq *req;
+        req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));
         virtio_blk_init_request(s, req);
-        qemu_get_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
         req->next = s->rq;
         s->rq = req;
-
-        virtqueue_map(&req->elem);
     }
 
     return 0;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index f0c1c45..12ce64a 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -645,9 +645,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         if (elem_popped) {
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
-
-            qemu_put_buffer(f, (unsigned char *)port->elem,
-                            sizeof(VirtQueueElement));
+            qemu_put_virtqueue_element(f, port->elem);
         }
     }
 }
@@ -722,10 +720,8 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                port->elem = g_new(VirtQueueElement, 1);
-                qemu_get_buffer(f, (unsigned char *)port->elem,
-                                sizeof(VirtQueueElement));
-                virtqueue_map(port->elem);
+                port->elem =
+                    qemu_get_virtqueue_element(f, sizeof(VirtQueueElement));
 
                 /*
                  *  Port was throttled on source machine.  Let's
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index ca20a1d..789cf38 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -188,7 +188,7 @@ static void virtio_scsi_save_request(QEMUFile *f, SCSIRequest *sreq)
 
     assert(n < vs->conf.num_queues);
     qemu_put_be32s(f, &n);
-    qemu_put_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    qemu_put_virtqueue_element(f, &req->elem);
 }
 
 static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
@@ -201,12 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
-    qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    req = qemu_get_virtqueue_element(f, sizeof(VirtIOSCSIReq) + vs->cdb_size);
     virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
-    virtqueue_map(&req->elem);
-
     if (virtio_scsi_parse_req(req, sizeof(VirtIOSCSICmdReq) + vs->cdb_size,
                               sizeof(VirtIOSCSICmdResp) + vs->sense_size) < 0) {
         error_report("invalid SCSI request migration data");
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 9b2c0bf..388e91c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -576,6 +576,19 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     return elem;
 }
 
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
+{
+    VirtQueueElement *elem = g_malloc(sz);
+    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    virtqueue_map(elem);
+    return elem;
+}
+
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
+{
+    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+}
+
 /* virtio device */
 static void virtio_notify_vector(VirtIODevice *vdev, uint16_t vector)
 {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 21fda17..44da9a8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -153,6 +153,8 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
 
 void virtqueue_map(VirtQueueElement *elem);
 void *virtqueue_pop(VirtQueue *vq, size_t sz);
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz);
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
2.5.0


* [Qemu-devel] [PATCH 04/10] virtio: introduce virtqueue_alloc_element
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (2 preceding siblings ...)
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 03/10] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 05/10] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

Allocate the arrays for in_addr/out_addr/in_sg/out_sg outside the
VirtQueueElement.  For now, virtqueue_pop and vring_pop keep
allocating a very large VirtQueueElement.
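
The resulting single allocation is laid out roughly as follows (offsets as
computed by virtqueue_alloc_element in the diff below):

    +----------------------------+ <- elem (sz bytes, sz >= sizeof(VirtQueueElement))
    | VirtQueueElement header    |
    | (+ device-specific fields) |
    +----------------------------+ <- in_addr_ofs  (aligned for hwaddr)
    | in_addr[in_num]            |
    +----------------------------+ <- out_addr_ofs
    | out_addr[out_num]          |
    +----------------------------+ <- in_sg_ofs    (aligned for struct iovec)
    | in_sg[in_num]              |
    +----------------------------+ <- out_sg_ofs
    | out_sg[out_num]            |
    +----------------------------+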

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: add assertions on sz [Conny]

 hw/virtio/dataplane/vring.c |   3 +-
 hw/virtio/virtio.c          | 110 +++++++++++++++++++++++++++++++++++++++-----
 include/hw/virtio/virtio.h  |   9 ++--
 3 files changed, 105 insertions(+), 17 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 11e7f9f..c950caa 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -402,8 +402,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    assert(sz >= sizeof(VirtQueueElement));
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
 
     /* Initialize elem so it can be safely unmapped */
     elem->in_num = elem->out_num = 0;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 388e91c..f49c5ae 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -494,11 +494,30 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
 void virtqueue_map(VirtQueueElement *elem)
 {
     virtqueue_map_iovec(elem->in_sg, elem->in_addr, &elem->in_num,
-                        MIN(ARRAY_SIZE(elem->in_sg), ARRAY_SIZE(elem->in_addr)),
-                        1);
+                        VIRTQUEUE_MAX_SIZE, 1);
     virtqueue_map_iovec(elem->out_sg, elem->out_addr, &elem->out_num,
-                        MIN(ARRAY_SIZE(elem->out_sg), ARRAY_SIZE(elem->out_addr)),
-                        0);
+                        VIRTQUEUE_MAX_SIZE, 0);
+}
+
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
+{
+    VirtQueueElement *elem;
+    size_t in_addr_ofs = QEMU_ALIGN_UP(sz, __alignof__(elem->in_addr[0]));
+    size_t out_addr_ofs = in_addr_ofs + in_num * sizeof(elem->in_addr[0]);
+    size_t out_addr_end = out_addr_ofs + out_num * sizeof(elem->out_addr[0]);
+    size_t in_sg_ofs = QEMU_ALIGN_UP(out_addr_end, __alignof__(elem->in_sg[0]));
+    size_t out_sg_ofs = in_sg_ofs + in_num * sizeof(elem->in_sg[0]);
+    size_t out_sg_end = out_sg_ofs + out_num * sizeof(elem->out_sg[0]);
+
+    assert(sz >= sizeof(VirtQueueElement));
+    elem = g_malloc(out_sg_end);
+    elem->out_num = out_num;
+    elem->in_num = in_num;
+    elem->in_addr = (void *)elem + in_addr_ofs;
+    elem->out_addr = (void *)elem + out_addr_ofs;
+    elem->in_sg = (void *)elem + in_sg_ofs;
+    elem->out_sg = (void *)elem + out_sg_ofs;
+    return elem;
 }
 
 void *virtqueue_pop(VirtQueue *vq, size_t sz)
@@ -513,8 +532,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     }
 
     /* When we start there are none of either input nor output. */
-    assert(sz >= sizeof(VirtQueueElement));
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -541,14 +559,14 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         struct iovec *sg;
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= ARRAY_SIZE(elem->in_sg)) {
+            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many write descriptors in indirect table");
                 exit(1);
             }
             elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
             sg = &elem->in_sg[elem->in_num++];
         } else {
-            if (elem->out_num >= ARRAY_SIZE(elem->out_sg)) {
+            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many read descriptors in indirect table");
                 exit(1);
             }
@@ -576,17 +594,87 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     return elem;
 }
 
+/* Reading and writing a structure directly to QEMUFile is *awful*, but
+ * it is what QEMU has always done by mistake.  We can change it sooner
+ * or later by bumping the version number of the affected vm states.
+ * In the meanwhile, since the in-memory layout of VirtQueueElement
+ * has changed, we need to marshal to and from the layout that was
+ * used before the change.
+ */
+typedef struct VirtQueueElementOld {
+    unsigned int index;
+    unsigned int out_num;
+    unsigned int in_num;
+    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
+    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
+    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+} VirtQueueElementOld;
+
 void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
 {
-    VirtQueueElement *elem = g_malloc(sz);
-    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    VirtQueueElement *elem;
+    VirtQueueElementOld data;
+    int i;
+
+    qemu_get_buffer(f, (uint8_t *)&data, sizeof(VirtQueueElementOld));
+
+    elem = virtqueue_alloc_element(sz, data.out_num, data.in_num);
+    elem->index = data.index;
+
+    for (i = 0; i < elem->in_num; i++) {
+        elem->in_addr[i] = data.in_addr[i];
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        elem->out_addr[i] = data.out_addr[i];
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        /* Base is overwritten by virtqueue_map.  */
+        elem->in_sg[i].iov_base = 0;
+        elem->in_sg[i].iov_len = data.in_sg[i].iov_len;
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        /* Base is overwritten by virtqueue_map.  */
+        elem->out_sg[i].iov_base = 0;
+        elem->out_sg[i].iov_len = data.out_sg[i].iov_len;
+    }
+
     virtqueue_map(elem);
     return elem;
 }
 
 void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
 {
-    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    VirtQueueElementOld data;
+    int i;
+
+    memset(&data, 0, sizeof(data));
+    data.index = elem->index;
+    data.in_num = elem->in_num;
+    data.out_num = elem->out_num;
+
+    for (i = 0; i < elem->in_num; i++) {
+        data.in_addr[i] = elem->in_addr[i];
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        data.out_addr[i] = elem->out_addr[i];
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        /* Base is overwritten by virtqueue_map when loading.  Do not
+         * save it, as it would leak the QEMU address space layout.  */
+        data.in_sg[i].iov_len = elem->in_sg[i].iov_len;
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        /* Do not save iov_base as above.  */
+        data.out_sg[i].iov_len = elem->out_sg[i].iov_len;
+    }
+    qemu_put_buffer(f, (uint8_t *)&data, sizeof(VirtQueueElementOld));
 }
 
 /* virtio device */
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 44da9a8..108cdb0 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -46,10 +46,10 @@ typedef struct VirtQueueElement
     unsigned int index;
     unsigned int out_num;
     unsigned int in_num;
-    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
-    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
-    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+    hwaddr *in_addr;
+    hwaddr *out_addr;
+    struct iovec *in_sg;
+    struct iovec *out_sg;
 } VirtQueueElement;
 
 #define VIRTIO_QUEUE_MAX 1024
@@ -143,6 +143,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 
 void virtio_del_queue(VirtIODevice *vdev, int n);
 
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num);
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 05/10] virtio: slim down allocation of VirtQueueElements
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (3 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 04/10] virtio: introduce virtqueue_alloc_element Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 06/10] vring: " Paolo Bonzini
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.

When virtqueue_map is used on the destination side of migration or on
loadvm, the iovecs have already been split at memory region boundary,
so we can just reuse the out_num/in_num we find in the file.
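
A minimal sketch of the pattern, condensed from the hunk below (the descriptor
walk itself is elided):

    hwaddr addr[VIRTQUEUE_MAX_SIZE];
    struct iovec iov[VIRTQUEUE_MAX_SIZE];
    unsigned out_num = 0, in_num = 0;
    unsigned i;

    /* ... walk the descriptor chain, mapping each descriptor into addr[]/iov[]
     * and bumping out_num/in_num as we go ... */

    /* Only now do we know the real size; allocate exactly that and copy. */
    elem = virtqueue_alloc_element(sz, out_num, in_num);
    elem->index = head;
    for (i = 0; i < out_num; i++) {
        elem->out_addr[i] = addr[i];
        elem->out_sg[i] = iov[i];
    }
    for (i = 0; i < in_num; i++) {
        elem->in_addr[i] = addr[out_num + i];
        elem->in_sg[i] = iov[out_num + i];
    }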

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: change bools from 1 and 0 to "true" and "false" [Conny]

 hw/virtio/virtio.c | 82 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index f49c5ae..79a635f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -448,6 +448,32 @@ int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
     return in_bytes <= in_total && out_bytes <= out_total;
 }
 
+static void virtqueue_map_desc(unsigned int *p_num_sg, hwaddr *addr, struct iovec *iov,
+                               unsigned int max_num_sg, bool is_write,
+                               hwaddr pa, size_t sz)
+{
+    unsigned num_sg = *p_num_sg;
+    assert(num_sg <= max_num_sg);
+
+    while (sz) {
+        hwaddr len = sz;
+
+        if (num_sg == max_num_sg) {
+            error_report("virtio: too many write descriptors in indirect table");
+            exit(1);
+        }
+
+        iov[num_sg].iov_base = cpu_physical_memory_map(pa, &len, is_write);
+        iov[num_sg].iov_len = len;
+        addr[num_sg] = pa;
+
+        sz -= len;
+        pa += len;
+        num_sg++;
+    }
+    *p_num_sg = num_sg;
+}
+
 static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
                                 unsigned int *num_sg, unsigned int max_size,
                                 int is_write)
@@ -474,20 +500,10 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
             error_report("virtio: error trying to map MMIO memory");
             exit(1);
         }
-        if (len == sg[i].iov_len) {
-            continue;
-        }
-        if (*num_sg >= max_size) {
-            error_report("virtio: memory split makes iovec too large");
+        if (len != sg[i].iov_len) {
+            error_report("virtio: unexpected memory split");
             exit(1);
         }
-        memmove(sg + i + 1, sg + i, sizeof(*sg) * (*num_sg - i));
-        memmove(addr + i + 1, addr + i, sizeof(*addr) * (*num_sg - i));
-        assert(len < sg[i + 1].iov_len);
-        sg[i].iov_len = len;
-        addr[i + 1] += len;
-        sg[i + 1].iov_len -= len;
-        ++*num_sg;
     }
 }
 
@@ -526,14 +542,16 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem;
+    unsigned out_num, in_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
     }
 
     /* When we start there are none of either input nor output. */
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-    elem->out_num = elem->in_num = 0;
+    out_num = in_num = 0;
 
     max = vq->vring.num;
 
@@ -556,37 +574,39 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 
     /* Collect all the descriptors */
     do {
-        struct iovec *sg;
+        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
+        size_t len = vring_desc_len(vdev, desc_pa, i);
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many write descriptors in indirect table");
-                exit(1);
-            }
-            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->in_sg[elem->in_num++];
+            virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
+                               VIRTQUEUE_MAX_SIZE - out_num, true, pa, len);
         } else {
-            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many read descriptors in indirect table");
+            if (in_num) {
+                error_report("Incorrect order for descriptors");
                 exit(1);
             }
-            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->out_sg[elem->out_num++];
+            virtqueue_map_desc(&out_num, addr, iov,
+                               VIRTQUEUE_MAX_SIZE, false, pa, len);
         }
 
-        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
-
         /* If we've got too many, that implies a descriptor loop. */
-        if ((elem->in_num + elem->out_num) > max) {
+        if ((in_num + out_num) > max) {
             error_report("Looped descriptor");
             exit(1);
         }
     } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
 
-    /* Now map what we have collected */
-    virtqueue_map(elem);
-
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, out_num, in_num);
     elem->index = head;
+    for (i = 0; i < out_num; i++) {
+        elem->out_addr[i] = addr[i];
+        elem->out_sg[i] = iov[i];
+    }
+    for (i = 0; i < in_num; i++) {
+        elem->in_addr[i] = addr[out_num + i];
+        elem->in_sg[i] = iov[out_num + i];
+    }
 
     vq->inuse++;
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 06/10] vring: slim down allocation of VirtQueueElements
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (4 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 05/10] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor Paolo Bonzini
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.
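
The vring flavour of the same idea differs mainly in its error handling: the
stack-resident staging struct owns the mappings until the copy succeeds, so the
failure path (sketched below from the hunk that follows) unmaps the collected
iovecs directly instead of unmapping a heap-allocated element.

    /* Condensed sketch of the unwind path: out entries occupy iov[0..out_num),
     * in entries follow at iov[out_num..out_num+in_num). */
    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
        vring_unmap(cur_elem.iov[i].iov_base, false);
    }
    g_free(elem);    /* still NULL unless the copy already happened */
    return NULL;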

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/dataplane/vring.c | 53 ++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index c950caa..d6b8ba9 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -217,8 +217,14 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
                             new, old);
 }
 
-
-static int get_desc(Vring *vring, VirtQueueElement *elem,
+typedef struct VirtQueueCurrentElement {
+    unsigned in_num;
+    unsigned out_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+} VirtQueueCurrentElement;
+
+static int get_desc(Vring *vring, VirtQueueCurrentElement *elem,
                     struct vring_desc *desc)
 {
     unsigned *num;
@@ -229,12 +235,12 @@ static int get_desc(Vring *vring, VirtQueueElement *elem,
 
     if (desc->flags & VRING_DESC_F_WRITE) {
         num = &elem->in_num;
-        iov = &elem->in_sg[*num];
-        addr = &elem->in_addr[*num];
+        iov = &elem->iov[elem->out_num + *num];
+        addr = &elem->addr[elem->out_num + *num];
     } else {
         num = &elem->out_num;
-        iov = &elem->out_sg[*num];
-        addr = &elem->out_addr[*num];
+        iov = &elem->iov[*num];
+        addr = &elem->addr[*num];
 
         /* If it's an output descriptor, they're all supposed
          * to come before any input descriptors. */
@@ -298,7 +304,8 @@ static bool read_vring_desc(VirtIODevice *vdev,
 
 /* This is stolen from linux/drivers/vhost/vhost.c. */
 static int get_indirect(VirtIODevice *vdev, Vring *vring,
-                        VirtQueueElement *elem, struct vring_desc *indirect)
+                        VirtQueueCurrentElement *cur_elem,
+                        struct vring_desc *indirect)
 {
     struct vring_desc desc;
     unsigned int i = 0, count, found = 0;
@@ -350,7 +357,7 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
             return -EFAULT;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, cur_elem, &desc);
         if (ret < 0) {
             vring->broken |= (ret == -EFAULT);
             return ret;
@@ -393,6 +400,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueCurrentElement cur_elem;
     VirtQueueElement *elem = NULL;
     int ret;
 
@@ -402,10 +410,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
+    cur_elem.in_num = cur_elem.out_num = 0;
 
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
@@ -432,8 +437,6 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
      * the index we've seen. */
     head = vring_get_avail_ring(vdev, vring, last_avail_idx % num);
 
-    elem->index = head;
-
     /* If their number is silly, that's an error. */
     if (unlikely(head >= num)) {
         error_report("Guest says index %u > %u is available", head, num);
@@ -460,14 +463,14 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         barrier();
 
         if (desc.flags & VRING_DESC_F_INDIRECT) {
-            ret = get_indirect(vdev, vring, elem, &desc);
+            ret = get_indirect(vdev, vring, &cur_elem, &desc);
             if (ret < 0) {
                 goto out;
             }
             continue;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, &cur_elem, &desc);
         if (ret < 0) {
             goto out;
         }
@@ -482,6 +485,18 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, cur_elem.out_num, cur_elem.in_num);
+    elem->index = head;
+    for (i = 0; i < cur_elem.out_num; i++) {
+        elem->out_addr[i] = cur_elem.addr[i];
+        elem->out_sg[i] = cur_elem.iov[i];
+    }
+    for (i = 0; i < cur_elem.in_num; i++) {
+        elem->in_addr[i] = cur_elem.addr[cur_elem.out_num + i];
+        elem->in_sg[i] = cur_elem.iov[cur_elem.out_num + i];
+    }
+
     return elem;
 
 out:
@@ -489,7 +504,11 @@ out:
     if (ret == -EFAULT) {
         vring->broken = true;
     }
-    vring_unmap_element(elem);
+
+    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
+        vring_unmap(cur_elem.iov[i].iov_base, false);
+    }
+
     g_free(elem);
     return NULL;
 }
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (5 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 06/10] vring: " Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-02-03 12:34   ` Gonglei (Arei)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 08/10] virtio: cache used_idx in a VirtQueue field Paolo Bonzini
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, mst

Compared to vring, virtio has a performance penalty of 10%.  Fix it
by combining all the reads for a descriptor in a single address_space_read
call.  This also simplifies the code nicely.
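
Concretely, each of the old vring_desc_addr/len/flags/next helpers did its own
guest-memory load (and hence its own address translation), so reading one
descriptor cost four translations.  A hedged sketch of the combined read, as
introduced by the hunk below:

    VRingDesc desc;

    /* One 16-byte read fetches the whole descriptor... */
    address_space_read(&address_space_memory, desc_pa + i * sizeof(VRingDesc),
                       MEMTXATTRS_UNSPECIFIED, (void *)&desc, sizeof(VRingDesc));
    /* ...then the fields are byteswapped in place for the guest's endianness. */
    virtio_tswap64s(vdev, &desc.addr);
    virtio_tswap32s(vdev, &desc.len);
    virtio_tswap16s(vdev, &desc.flags);
    virtio_tswap16s(vdev, &desc.next);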

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 51 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 79a635f..2433866 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice *vdev, int n)
                               vring->align);
 }
 
-static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
+static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
+                            hwaddr desc_pa, int i)
 {
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
-    return virtio_ldq_phys(vdev, pa);
-}
-
-static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa, int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
-    return virtio_ldl_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr desc_pa,
-                                        int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
-    return virtio_lduw_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
-    return virtio_lduw_phys(vdev, pa);
+    address_space_read(&address_space_memory, desc_pa + i * sizeof(VRingDesc),
+                       MEMTXATTRS_UNSPECIFIED, (void *)desc, sizeof(VRingDesc));
+    virtio_tswap64s(vdev, &desc->addr);
+    virtio_tswap32s(vdev, &desc->len);
+    virtio_tswap16s(vdev, &desc->flags);
+    virtio_tswap16s(vdev, &desc->next);
 }
 
 static inline uint16_t vring_avail_flags(VirtQueue *vq)
@@ -345,18 +325,18 @@ static unsigned int virtqueue_get_head(VirtQueue *vq, unsigned int idx)
     return head;
 }
 
-static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
-                                    unsigned int i, unsigned int max)
+static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc *desc,
+                                         hwaddr desc_pa, unsigned int max)
 {
     unsigned int next;
 
     /* If this descriptor says it doesn't chain, we're done. */
-    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
+    if (!(desc->flags & VRING_DESC_F_NEXT)) {
         return max;
     }
 
     /* Check they're not leading us off end of descriptors. */
-    next = vring_desc_next(vdev, desc_pa, i);
+    next = desc->next;
     /* Make sure compiler knows to grab that: we don't want it changing! */
     smp_wmb();
 
@@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
         exit(1);
     }
 
+    vring_desc_read(vdev, desc, desc_pa, next);
     return next;
 }
 
@@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
     while (virtqueue_num_heads(vq, idx)) {
         VirtIODevice *vdev = vq->vdev;
         unsigned int max, num_bufs, indirect = 0;
+        VRingDesc desc;
         hwaddr desc_pa;
         int i;
 
@@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
         num_bufs = total_bufs;
         i = virtqueue_get_head(vq, idx++);
         desc_pa = vq->vring.desc;
+        vring_desc_read(vdev, &desc, desc_pa, i);
 
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            if (desc.len % sizeof(VRingDesc)) {
                 error_report("Invalid size for indirect buffer table");
                 exit(1);
             }
@@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
 
             /* loop over the indirect descriptor table */
             indirect = 1;
-            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-            desc_pa = vring_desc_addr(vdev, desc_pa, i);
+            max = desc.len / sizeof(VRingDesc);
+            desc_pa = desc.addr;
             num_bufs = i = 0;
+            vring_desc_read(vdev, &desc, desc_pa, i);
         }
 
         do {
@@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
                 exit(1);
             }
 
-            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-                in_total += vring_desc_len(vdev, desc_pa, i);
+            if (desc.flags & VRING_DESC_F_WRITE) {
+                in_total += desc.len;
             } else {
-                out_total += vring_desc_len(vdev, desc_pa, i);
+                out_total += desc.len;
             }
             if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
                 goto done;
             }
-        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
         if (!indirect)
             total_bufs = num_bufs;
@@ -545,6 +529,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     unsigned out_num, in_num;
     hwaddr addr[VIRTQUEUE_MAX_SIZE];
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    VRingDesc desc;
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
@@ -560,33 +545,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         vring_set_avail_event(vq, vq->last_avail_idx);
     }
 
-    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+    vring_desc_read(vdev, &desc, desc_pa, i);
+    if (desc.flags & VRING_DESC_F_INDIRECT) {
+        if (desc.len % sizeof(VRingDesc)) {
             error_report("Invalid size for indirect buffer table");
             exit(1);
         }
 
         /* loop over the indirect descriptor table */
-        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-        desc_pa = vring_desc_addr(vdev, desc_pa, i);
+        max = desc.len / sizeof(VRingDesc);
+        desc_pa = desc.addr;
         i = 0;
+        vring_desc_read(vdev, &desc, desc_pa, i);
     }
 
     /* Collect all the descriptors */
     do {
-        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
-        size_t len = vring_desc_len(vdev, desc_pa, i);
-
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
+        if (desc.flags & VRING_DESC_F_WRITE) {
             virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
-                               VIRTQUEUE_MAX_SIZE - out_num, true, pa, len);
+                               VIRTQUEUE_MAX_SIZE - out_num, true, desc.addr, desc.len);
         } else {
             if (in_num) {
                 error_report("Incorrect order for descriptors");
                 exit(1);
             }
             virtqueue_map_desc(&out_num, addr, iov,
-                               VIRTQUEUE_MAX_SIZE, false, pa, len);
+                               VIRTQUEUE_MAX_SIZE, false, desc.addr, desc.len);
         }
 
         /* If we've got too many, that implies a descriptor loop. */
@@ -594,7 +578,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
             error_report("Looped descriptor");
             exit(1);
         }
-    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
     /* Now copy what we have collected and mapped */
     elem = virtqueue_alloc_element(sz, out_num, in_num);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 08/10] virtio: cache used_idx in a VirtQueue field
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (6 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 09/10] virtio: read avail_idx from VQ only when necessary Paolo Bonzini
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, Vincenzo Maffione, mst

From: Vincenzo Maffione <v.maffione@gmail.com>

Accessing used_idx in the VQ requires an expensive access to
guest physical memory. Before this patch, 3 accesses are normally
done for each pop/push/notify call. However, since the used_idx is
only written by us, we can track it in our internal data structure.
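
A minimal sketch of the shadowing, condensed from the hunks below: the cached
copy is refreshed at the single place QEMU writes the index, and readers then
use the field instead of re-fetching it from guest memory.

    static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
    {
        hwaddr pa = vq->vring.used + offsetof(VRingUsed, idx);
        virtio_stw_phys(vq->vdev, pa, val);
        vq->used_idx = val;     /* shadow copy; QEMU is the only writer */
    }

    /* e.g. in virtqueue_fill(): idx = (idx + vq->used_idx) % vq->vring.num; */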

Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Message-Id: <3d062ec54e9a7bf9fb325c1fd693564951f2b319.1450218353.git.v.maffione@gmail.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 2433866..5116a2e 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -71,6 +71,9 @@ struct VirtQueue
 {
     VRing vring;
     uint16_t last_avail_idx;
+
+    uint16_t used_idx;
+
     /* Last used index value we have signalled on */
     uint16_t signalled_used;
 
@@ -170,6 +173,7 @@ static inline void vring_used_idx_set(VirtQueue *vq, uint16_t val)
     hwaddr pa;
     pa = vq->vring.used + offsetof(VRingUsed, idx);
     virtio_stw_phys(vq->vdev, pa, val);
+    vq->used_idx = val;
 }
 
 static inline void vring_used_flags_set_bit(VirtQueue *vq, int mask)
@@ -261,7 +265,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
 
     virtqueue_unmap_sg(vq, elem, len);
 
-    idx = (idx + vring_used_idx(vq)) % vq->vring.num;
+    idx = (idx + vq->used_idx) % vq->vring.num;
 
     /* Get a pointer to the next entry in the used ring. */
     vring_used_ring_id(vq, idx, elem->index);
@@ -274,7 +278,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     /* Make sure buffer is written before we update index. */
     smp_wmb();
     trace_virtqueue_flush(vq, count);
-    old = vring_used_idx(vq);
+    old = vq->used_idx;
     new = old + count;
     vring_used_idx_set(vq, new);
     vq->inuse -= count;
@@ -782,6 +786,7 @@ void virtio_reset(void *opaque)
         vdev->vq[i].vring.avail = 0;
         vdev->vq[i].vring.used = 0;
         vdev->vq[i].last_avail_idx = 0;
+        vdev->vq[i].used_idx = 0;
         virtio_queue_set_vector(vdev, i, VIRTIO_NO_VECTOR);
         vdev->vq[i].signalled_used = 0;
         vdev->vq[i].signalled_used_valid = false;
@@ -1161,7 +1166,7 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
     v = vq->signalled_used_valid;
     vq->signalled_used_valid = true;
     old = vq->signalled_used;
-    new = vq->signalled_used = vring_used_idx(vq);
+    new = vq->signalled_used = vq->used_idx;
     return !v || vring_need_event(vring_get_used_event(vq), new, old);
 }
 
@@ -1573,6 +1578,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
                              vdev->vq[i].last_avail_idx, nheads);
                 return -1;
             }
+            vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
         }
     }
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 09/10] virtio: read avail_idx from VQ only when necessary
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (7 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 08/10] virtio: cache used_idx in a VirtQueue field Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 10/10] virtio: combine write of an entry into used ring Paolo Bonzini
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, Vincenzo Maffione, mst

From: Vincenzo Maffione <v.maffione@gmail.com>

The virtqueue_pop() implementation needs to check if the avail ring
contains some pending buffers. To perform this check, it is not
always necessary to fetch the avail_idx in the VQ memory, which is
expensive. This patch introduces a shadow variable tracking avail_idx
and modifies virtio_queue_empty() to access avail_idx in physical
memory only when necessary.
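
A condensed sketch of the resulting check (from the hunks below): the shadow is
trusted whenever it already differs from last_avail_idx, because the guest only
ever moves avail_idx forward; guest memory is read only when the ring might be
empty.

    int virtio_queue_empty(VirtQueue *vq)
    {
        if (vq->shadow_avail_idx != vq->last_avail_idx) {
            return 0;    /* buffers already known to be pending */
        }
        /* vring_avail_idx() also refreshes vq->shadow_avail_idx. */
        return vring_avail_idx(vq) == vq->last_avail_idx;
    }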

Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Message-Id: <b617d6459902773d9f4ab843bfaca764f5af8eda.1450218353.git.v.maffione@gmail.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
        v1->v2: update shadow avail_idx in virtio_queue_set_last_avail_idx
                [Michael]

 hw/virtio/virtio.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5116a2e..6842938 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -70,8 +70,13 @@ typedef struct VRing
 struct VirtQueue
 {
     VRing vring;
+
+    /* Next head to pop */
     uint16_t last_avail_idx;
 
+    /* Last avail_idx read from VQ. */
+    uint16_t shadow_avail_idx;
+
     uint16_t used_idx;
 
     /* Last used index value we have signalled on */
@@ -132,7 +137,8 @@ static inline uint16_t vring_avail_idx(VirtQueue *vq)
 {
     hwaddr pa;
     pa = vq->vring.avail + offsetof(VRingAvail, idx);
-    return virtio_lduw_phys(vq->vdev, pa);
+    vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);
+    return vq->shadow_avail_idx;
 }
 
 static inline uint16_t vring_avail_ring(VirtQueue *vq, int i)
@@ -223,8 +229,14 @@ int virtio_queue_ready(VirtQueue *vq)
     return vq->vring.avail != 0;
 }
 
+/* Fetch avail_idx from VQ memory only when we really need to know if
+ * guest has added some buffers. */
 int virtio_queue_empty(VirtQueue *vq)
 {
+    if (vq->shadow_avail_idx != vq->last_avail_idx) {
+        return 0;
+    }
+
     return vring_avail_idx(vq) == vq->last_avail_idx;
 }
 
@@ -300,7 +312,7 @@ static int virtqueue_num_heads(VirtQueue *vq, unsigned int idx)
     /* Check it isn't doing very strange things with descriptor numbers. */
     if (num_heads > vq->vring.num) {
         error_report("Guest moved used index from %u to %u",
-                     idx, vring_avail_idx(vq));
+                     idx, vq->shadow_avail_idx);
         exit(1);
     }
     /* On success, callers read a descriptor at vq->last_avail_idx.
@@ -535,9 +547,12 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
     VRingDesc desc;
 
-    if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
+    if (virtio_queue_empty(vq)) {
         return NULL;
     }
+    /* Needed after virtio_queue_empty(), see comment in
+     * virtqueue_num_heads(). */
+    smp_rmb();
 
     /* When we start there are none of either input nor output. */
     out_num = in_num = 0;
@@ -786,6 +801,7 @@ void virtio_reset(void *opaque)
         vdev->vq[i].vring.avail = 0;
         vdev->vq[i].vring.used = 0;
         vdev->vq[i].last_avail_idx = 0;
+        vdev->vq[i].shadow_avail_idx = 0;
         vdev->vq[i].used_idx = 0;
         virtio_queue_set_vector(vdev, i, VIRTIO_NO_VECTOR);
         vdev->vq[i].signalled_used = 0;
@@ -1155,7 +1171,7 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
     smp_mb();
     /* Always notify when queue is empty (when feature acknowledge) */
     if (virtio_vdev_has_feature(vdev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
-        !vq->inuse && vring_avail_idx(vq) == vq->last_avail_idx) {
+        !vq->inuse && virtio_queue_empty(vq)) {
         return true;
     }
 
@@ -1579,6 +1595,7 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f, int version_id)
                 return -1;
             }
             vdev->vq[i].used_idx = vring_used_idx(&vdev->vq[i]);
+            vdev->vq[i].shadow_avail_idx = vring_avail_idx(&vdev->vq[i]);
         }
     }
 
@@ -1714,6 +1731,7 @@ uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n)
 void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx)
 {
     vdev->vq[n].last_avail_idx = idx;
+    vdev->vq[n].shadow_avail_idx = idx;
 }
 
 void virtio_queue_invalidate_signalled_used(VirtIODevice *vdev, int n)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 10/10] virtio: combine write of an entry into used ring
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (8 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 09/10] virtio: read avail_idx from VQ only when necessary Paolo Bonzini
@ 2016-01-31 10:29 ` Paolo Bonzini
  2016-02-03 12:08 ` [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Gonglei (Arei)
  2016-02-03 12:38 ` Gonglei (Arei)
  11 siblings, 0 replies; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-31 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: cornelia.huck, Vincenzo Maffione, mst

From: Vincenzo Maffione <v.maffione@gmail.com>

Fill in an element of the used ring with a single combined access to the
guest physical memory, rather than using two separate accesses.
This reduces the overhead due to expensive address translation.
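
A hedged sketch of the combined store, condensed from the hunk below: the
8-byte used-ring entry is assembled and byteswapped locally, then written with
one guest-memory access instead of one per field.

    VRingUsedElem uelem;

    uelem.id = elem->index;
    uelem.len = len;
    virtio_tswap32s(vq->vdev, &uelem.id);
    virtio_tswap32s(vq->vdev, &uelem.len);
    address_space_write(&address_space_memory,
                        vq->vring.used + offsetof(VRingUsed, ring[idx]),
                        MEMTXATTRS_UNSPECIFIED,
                        (void *)&uelem, sizeof(VRingUsedElem));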

Signed-off-by: Vincenzo Maffione <v.maffione@gmail.com>
Message-Id: <e4a89a767a4a92cbb6bcc551e151487eb36e1722.1450218353.git.v.maffione@gmail.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 6842938..241c7e3 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -153,18 +153,15 @@ static inline uint16_t vring_get_used_event(VirtQueue *vq)
     return vring_avail_ring(vq, vq->vring.num);
 }
 
-static inline void vring_used_ring_id(VirtQueue *vq, int i, uint32_t val)
+static inline void vring_used_write(VirtQueue *vq, VRingUsedElem *uelem,
+                                    int i)
 {
     hwaddr pa;
-    pa = vq->vring.used + offsetof(VRingUsed, ring[i].id);
-    virtio_stl_phys(vq->vdev, pa, val);
-}
-
-static inline void vring_used_ring_len(VirtQueue *vq, int i, uint32_t val)
-{
-    hwaddr pa;
-    pa = vq->vring.used + offsetof(VRingUsed, ring[i].len);
-    virtio_stl_phys(vq->vdev, pa, val);
+    virtio_tswap32s(vq->vdev, &uelem->id);
+    virtio_tswap32s(vq->vdev, &uelem->len);
+    pa = vq->vring.used + offsetof(VRingUsed, ring[i]);
+    address_space_write(&address_space_memory, pa, MEMTXATTRS_UNSPECIFIED,
+                       (void *)uelem, sizeof(VRingUsedElem));
 }
 
 static uint16_t vring_used_idx(VirtQueue *vq)
@@ -273,15 +270,17 @@ void virtqueue_discard(VirtQueue *vq, const VirtQueueElement *elem,
 void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx)
 {
+    VRingUsedElem uelem;
+
     trace_virtqueue_fill(vq, elem, len, idx);
 
     virtqueue_unmap_sg(vq, elem, len);
 
     idx = (idx + vq->used_idx) % vq->vring.num;
 
-    /* Get a pointer to the next entry in the used ring. */
-    vring_used_ring_id(vq, idx, elem->index);
-    vring_used_ring_len(vq, idx, len);
+    uelem.id = elem->index;
+    uelem.len = len;
+    vring_used_write(vq, &uelem, idx);
 }
 
 void virtqueue_flush(VirtQueue *vq, unsigned int count)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
@ 2016-02-01 11:17   ` Cornelia Huck
  0 siblings, 0 replies; 26+ messages in thread
From: Cornelia Huck @ 2016-02-01 11:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, mst

On Sun, 31 Jan 2016 11:28:57 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> The next patch will make virtqueue_pop/vring_pop allocate memory for
> the VirtQueueElement. In some cases (blk, scsi, gpu) the device wants
> to extend VirtQueueElement with device-specific fields and, until now,
> the place of the VirtQueueElement within the containing struct didn't
> matter. When allocating the entire block in virtqueue_pop/vring_pop,
> however, the containing struct must basically be a "subclass" of
> VirtQueueElement, with the VirtQueueElement as the first field. Make
> that the case for blk and scsi; gpu is already doing it.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/scsi/virtio-scsi.c           |  3 +--
>  include/hw/virtio/virtio-blk.h  |  2 +-
>  include/hw/virtio/virtio-scsi.h | 13 ++++++-------
>  3 files changed, 8 insertions(+), 10 deletions(-)

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop
  2016-01-31 10:28 ` [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2016-02-01 11:20   ` Cornelia Huck
  0 siblings, 0 replies; 26+ messages in thread
From: Cornelia Huck @ 2016-02-01 11:20 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, mst

On Sun, 31 Jan 2016 11:28:58 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> The return code of virtqueue_pop/vring_pop is unused except to check for
> errors or 0.  We can thus easily move allocation inside the functions
> and just return a pointer to the VirtQueueElement.
> 
> The advantage is that we will be able to allocate only the space that
> is needed for the actual size of the s/g list instead of the full
> VIRTQUEUE_MAX_SIZE items.  Currently VirtQueueElement takes about 48K
> of memory, and this kind of allocation puts a lot of stress on malloc.
> By cutting the size by two or three orders of magnitude, malloc can
> use much more efficient algorithms.
> 
> The patch is pretty large, but changes to each device are testable
> more or less independently.  Splitting it would mostly add churn.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>         v1->v2: add assertions on sz [Conny]
> 
>  hw/9pfs/9p.c                        |  2 +-
>  hw/9pfs/virtio-9p-device.c          | 17 ++++----
>  hw/9pfs/virtio-9p.h                 |  2 +-
>  hw/block/dataplane/virtio-blk.c     | 11 +++--
>  hw/block/virtio-blk.c               | 15 +++----
>  hw/char/virtio-serial-bus.c         | 80 +++++++++++++++++++++++--------------
>  hw/display/virtio-gpu.c             | 25 +++++++-----
>  hw/input/virtio-input.c             | 24 +++++++----
>  hw/net/virtio-net.c                 | 69 ++++++++++++++++++++------------
>  hw/scsi/virtio-scsi-dataplane.c     | 15 +++----
>  hw/scsi/virtio-scsi.c               | 18 ++++-----
>  hw/virtio/dataplane/vring.c         | 18 +++++----
>  hw/virtio/virtio-balloon.c          | 22 ++++++----
>  hw/virtio/virtio-rng.c              | 10 +++--
>  hw/virtio/virtio.c                  | 12 ++++--
>  include/hw/virtio/dataplane/vring.h |  2 +-
>  include/hw/virtio/virtio-balloon.h  |  2 +-
>  include/hw/virtio/virtio-blk.h      |  3 +-
>  include/hw/virtio/virtio-net.h      |  2 +-
>  include/hw/virtio/virtio-scsi.h     |  2 +-
>  include/hw/virtio/virtio-serial.h   |  2 +-
>  include/hw/virtio/virtio.h          |  2 +-
>  22 files changed, 212 insertions(+), 143 deletions(-)

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (9 preceding siblings ...)
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 10/10] virtio: combine write of an entry into used ring Paolo Bonzini
@ 2016-02-03 12:08 ` Gonglei (Arei)
  2016-02-04 10:19   ` Paolo Bonzini
  2016-02-03 12:38 ` Gonglei (Arei)
  11 siblings, 1 reply; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-03 12:08 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, v.maffione, mst

Hi,

> Subject: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
> 
> This includes two optimization of virtio:
> 
> - "slimming down" VirtQueueElements by not including room for
>   1024 buffers.  This makes malloc much faster.
> 
> - optimizations to limit the number of address_space_translate
>   calls in virtio.c, from Vincenzo and myself.
> 

Very nice! After applying those optimizations (patches 8, 9, 10), I got a 4 MB/sec speed bonus,
based on my virtio-crypto device benchmark with the AES-128-CBC algorithm. I haven't rebased the
other patches of this patch set yet; maybe I can get a further bonus from them?

Before applying those three patches:

Testing AES-128-CBC cipher: 
        Encrypting in chunks of 256 bytes: done. 246.84 MiB in 5.02 secs: 49.13 MiB/sec (1011061 packets)
        Encrypting in chunks of 256 bytes: done. 247.03 MiB in 5.02 secs: 49.16 MiB/sec (1011840 packets)
        Encrypting in chunks of 256 bytes: done. 246.98 MiB in 5.02 secs: 49.17 MiB/sec (1011636 packets)
        Encrypting in chunks of 256 bytes: done. 247.14 MiB in 5.02 secs: 49.19 MiB/sec (1012270 packets)
        Encrypting in chunks of 256 bytes: done. 246.96 MiB in 5.02 secs: 49.16 MiB/sec (1011565 packets)
        Encrypting in chunks of 256 bytes: done. 246.97 MiB in 5.02 secs: 49.18 MiB/sec (1011594 packets)
        Encrypting in chunks of 256 bytes: done. 246.89 MiB in 5.02 secs: 49.15 MiB/sec (1011259 packets)
        Encrypting in chunks of 256 bytes: done. 246.96 MiB in 5.02 secs: 49.15 MiB/sec (1011561 packets)

'Perf top' shows:

 23.61%  qemu-kvm                 [.] address_space_translate
 14.49%  qemu-kvm                 [.] qemu_get_ram_ptr
  4.65%  qemu-kvm                 [.] phys_page_find
  4.31%  qemu-kvm                 [.] address_space_translate_internal
  3.18%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  2.83%  qemu-kvm                 [.] qemu_ram_addr_from_host
  2.40%  qemu-kvm                 [.] address_space_map
  2.34%  libc-2.19.so             [.] _int_malloc
  2.22%  libc-2.19.so             [.] _int_free
  1.96%  libc-2.19.so             [.] malloc
  1.71%  libpthread-2.19.so       [.] pthread_mutex_lock
  1.40%  qemu-kvm                 [.] find_next_zero_bit
  1.38%  libc-2.19.so             [.] malloc_consolidate
  1.31%  qemu-kvm                 [.] lduw_le_phys
  1.27%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  1.05%  qemu-kvm                 [.] qemu_get_ram_block
  1.05%  qemu-kvm                 [.] object_unref
  1.04%  qemu-kvm                 [.] memory_region_get_ram_addr


After applying those optimizations:
        Encrypting in chunks of 256 bytes: done. 267.92 MiB in 5.03 secs: 53.31 MiB/sec (1097399 packets)
        Encrypting in chunks of 256 bytes: done. 268.05 MiB in 5.02 secs: 53.35 MiB/sec (1097935 packets)
        Encrypting in chunks of 256 bytes: done. 265.40 MiB in 5.02 secs: 52.82 MiB/sec (1087091 packets)
        Encrypting in chunks of 256 bytes: done. 263.18 MiB in 5.01 secs: 52.50 MiB/sec (1077999 packets)
        Encrypting in chunks of 256 bytes: done. 266.85 MiB in 5.01 secs: 53.29 MiB/sec (1093010 packets)
        Encrypting in chunks of 256 bytes: done. 267.64 MiB in 5.02 secs: 53.28 MiB/sec (1096251 packets)
        Encrypting in chunks of 256 bytes: done. 267.30 MiB in 5.02 secs: 53.24 MiB/sec (1094861 packets)
        Encrypting in chunks of 256 bytes: done. 267.29 MiB in 5.02 secs: 53.25 MiB/sec (1094833 packets)

'Perf top' shows:

 22.56%  qemu-kvm                 [.] address_space_translate
 13.29%  qemu-kvm                 [.] qemu_get_ram_ptr
  4.71%  qemu-kvm                 [.] phys_page_find
  4.43%  qemu-kvm                 [.] address_space_translate_internal
  3.47%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  3.08%  qemu-kvm                 [.] qemu_ram_addr_from_host
  2.62%  qemu-kvm                 [.] address_space_map
  2.61%  libc-2.19.so             [.] _int_malloc
  2.58%  libc-2.19.so             [.] _int_free
  2.38%  libc-2.19.so             [.] malloc
  2.06%  libpthread-2.19.so       [.] pthread_mutex_lock
  1.68%  libc-2.19.so             [.] malloc_consolidate
  1.35%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  1.23%  qemu-kvm                 [.] lduw_le_phys
  1.18%  qemu-kvm                 [.] find_next_zero_bit
  1.02%  qemu-kvm                 [.] object_unref



Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-01-31 10:29 ` [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor Paolo Bonzini
@ 2016-02-03 12:34   ` Gonglei (Arei)
  2016-02-03 13:40     ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-03 12:34 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, mst

Hi,

> Subject: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
> 
> Compared to vring, virtio has a performance penalty of 10%.  Fix it
> by combining all the reads for a descriptor in a single address_space_read
> call.  This also simplifies the code nicely.
> 
> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
>  1 file changed, 35 insertions(+), 51 deletions(-)
> 

Unbelievable! After applying this patch, the virtio-crypto speed can reach 74 MB/sec; host
CPU overhead is 180% (the main thread 100% and the vcpu threads 80%).

Testing AES-128-CBC cipher: 
        Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs: 74.12 MiB/sec (1523475 packets)
        Encrypting in chunks of 256 bytes: done. 369.85 MiB in 5.01 secs: 73.88 MiB/sec (1514900 packets)
        Encrypting in chunks of 256 bytes: done. 371.07 MiB in 5.02 secs: 73.97 MiB/sec (1519914 packets)
        Encrypting in chunks of 256 bytes: done. 371.66 MiB in 5.02 secs: 74.09 MiB/sec (1522309 packets)
        Encrypting in chunks of 256 bytes: done. 371.79 MiB in 5.02 secs: 74.12 MiB/sec (1522868 packets)
        Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs: 74.15 MiB/sec (1523457 packets)
        Encrypting in chunks of 256 bytes: done. 371.90 MiB in 5.02 secs: 74.14 MiB/sec (1523317 packets)
        Encrypting in chunks of 256 bytes: done. 371.71 MiB in 5.02 secs: 74.10 MiB/sec (1522522 packets)

15.95%  qemu-kvm                 [.] address_space_translate
  6.98%  qemu-kvm                 [.] qemu_get_ram_ptr
  4.87%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  4.40%  qemu-kvm                 [.] qemu_ram_addr_from_host
  3.79%  qemu-kvm                 [.] address_space_map
  3.41%  libc-2.19.so             [.] _int_malloc
  3.29%  libc-2.19.so             [.] _int_free
  3.07%  libc-2.19.so             [.] malloc
  2.95%  libpthread-2.19.so       [.] pthread_mutex_lock
  2.94%  qemu-kvm                 [.] phys_page_find
  2.73%  qemu-kvm                 [.] address_space_translate_internal
  2.65%  libc-2.19.so             [.] malloc_consolidate
  2.35%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  1.72%  qemu-kvm                 [.] find_next_zero_bit
  1.38%  qemu-kvm                 [.] address_space_rw
  1.34%  qemu-kvm                 [.] object_unref
  1.30%  qemu-kvm                 [.] object_ref
  1.28%  qemu-kvm                 [.] virtqueue_pop
  1.20%  libc-2.19.so             [.] memset
  1.11%  qemu-kvm                 [.] virtio_notify

Thank you so much!

Regards,
-Gonglei

> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 79a635f..2433866 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice *vdev,
> int n)
>                                vring->align);
>  }
> 
> -static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
> -                                       int i)
> +static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
> +                            hwaddr desc_pa, int i)
>  {
> -    hwaddr pa;
> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
> -    return virtio_ldq_phys(vdev, pa);
> -}
> -
> -static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa, int
> i)
> -{
> -    hwaddr pa;
> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
> -    return virtio_ldl_phys(vdev, pa);
> -}
> -
> -static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr desc_pa,
> -                                        int i)
> -{
> -    hwaddr pa;
> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
> -    return virtio_lduw_phys(vdev, pa);
> -}
> -
> -static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
> -                                       int i)
> -{
> -    hwaddr pa;
> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
> -    return virtio_lduw_phys(vdev, pa);
> +    address_space_read(&address_space_memory, desc_pa + i *
> sizeof(VRingDesc),
> +                       MEMTXATTRS_UNSPECIFIED, (void *)desc,
> sizeof(VRingDesc));
> +    virtio_tswap64s(vdev, &desc->addr);
> +    virtio_tswap32s(vdev, &desc->len);
> +    virtio_tswap16s(vdev, &desc->flags);
> +    virtio_tswap16s(vdev, &desc->next);
>  }
> 
>  static inline uint16_t vring_avail_flags(VirtQueue *vq)
> @@ -345,18 +325,18 @@ static unsigned int virtqueue_get_head(VirtQueue
> *vq, unsigned int idx)
>      return head;
>  }
> 
> -static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
> -                                    unsigned int i, unsigned int max)
> +static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc
> *desc,
> +                                         hwaddr desc_pa, unsigned
> int max)
>  {
>      unsigned int next;
> 
>      /* If this descriptor says it doesn't chain, we're done. */
> -    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
> +    if (!(desc->flags & VRING_DESC_F_NEXT)) {
>          return max;
>      }
> 
>      /* Check they're not leading us off end of descriptors. */
> -    next = vring_desc_next(vdev, desc_pa, i);
> +    next = desc->next;
>      /* Make sure compiler knows to grab that: we don't want it changing! */
>      smp_wmb();
> 
> @@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice
> *vdev, hwaddr desc_pa,
>          exit(1);
>      }
> 
> +    vring_desc_read(vdev, desc, desc_pa, next);
>      return next;
>  }
> 
> @@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> unsigned int *in_bytes,
>      while (virtqueue_num_heads(vq, idx)) {
>          VirtIODevice *vdev = vq->vdev;
>          unsigned int max, num_bufs, indirect = 0;
> +        VRingDesc desc;
>          hwaddr desc_pa;
>          int i;
> 
> @@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> unsigned int *in_bytes,
>          num_bufs = total_bufs;
>          i = virtqueue_get_head(vq, idx++);
>          desc_pa = vq->vring.desc;
> +        vring_desc_read(vdev, &desc, desc_pa, i);
> 
> -        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
> -            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> +            if (desc.len % sizeof(VRingDesc)) {
>                  error_report("Invalid size for indirect buffer table");
>                  exit(1);
>              }
> @@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> unsigned int *in_bytes,
> 
>              /* loop over the indirect descriptor table */
>              indirect = 1;
> -            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
> -            desc_pa = vring_desc_addr(vdev, desc_pa, i);
> +            max = desc.len / sizeof(VRingDesc);
> +            desc_pa = desc.addr;
>              num_bufs = i = 0;
> +            vring_desc_read(vdev, &desc, desc_pa, i);
>          }
> 
>          do {
> @@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> unsigned int *in_bytes,
>                  exit(1);
>              }
> 
> -            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE)
> {
> -                in_total += vring_desc_len(vdev, desc_pa, i);
> +            if (desc.flags & VRING_DESC_F_WRITE) {
> +                in_total += desc.len;
>              } else {
> -                out_total += vring_desc_len(vdev, desc_pa, i);
> +                out_total += desc.len;
>              }
>              if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
>                  goto done;
>              }
> -        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
> +        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa,
> max)) != max);
> 
>          if (!indirect)
>              total_bufs = num_bufs;
> @@ -545,6 +529,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      unsigned out_num, in_num;
>      hwaddr addr[VIRTQUEUE_MAX_SIZE];
>      struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    VRingDesc desc;
> 
>      if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
>          return NULL;
> @@ -560,33 +545,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>          vring_set_avail_event(vq, vq->last_avail_idx);
>      }
> 
> -    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
> -        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
> +    vring_desc_read(vdev, &desc, desc_pa, i);
> +    if (desc.flags & VRING_DESC_F_INDIRECT) {
> +        if (desc.len % sizeof(VRingDesc)) {
>              error_report("Invalid size for indirect buffer table");
>              exit(1);
>          }
> 
>          /* loop over the indirect descriptor table */
> -        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
> -        desc_pa = vring_desc_addr(vdev, desc_pa, i);
> +        max = desc.len / sizeof(VRingDesc);
> +        desc_pa = desc.addr;
>          i = 0;
> +        vring_desc_read(vdev, &desc, desc_pa, i);
>      }
> 
>      /* Collect all the descriptors */
>      do {
> -        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
> -        size_t len = vring_desc_len(vdev, desc_pa, i);
> -
> -        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
> +        if (desc.flags & VRING_DESC_F_WRITE) {
>              virtqueue_map_desc(&in_num, addr + out_num, iov +
> out_num,
> -                               VIRTQUEUE_MAX_SIZE - out_num, true,
> pa, len);
> +                               VIRTQUEUE_MAX_SIZE - out_num, true,
> desc.addr, desc.len);
>          } else {
>              if (in_num) {
>                  error_report("Incorrect order for descriptors");
>                  exit(1);
>              }
>              virtqueue_map_desc(&out_num, addr, iov,
> -                               VIRTQUEUE_MAX_SIZE, false, pa, len);
> +                               VIRTQUEUE_MAX_SIZE, false,
> desc.addr, desc.len);
>          }
> 
>          /* If we've got too many, that implies a descriptor loop. */
> @@ -594,7 +578,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>              error_report("Looped descriptor");
>              exit(1);
>          }
> -    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
> +    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) !=
> max);
> 
>      /* Now copy what we have collected and mapped */
>      elem = virtqueue_alloc_element(sz, out_num, in_num);
> --
> 2.5.0
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
  2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
                   ` (10 preceding siblings ...)
  2016-02-03 12:08 ` [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Gonglei (Arei)
@ 2016-02-03 12:38 ` Gonglei (Arei)
  11 siblings, 0 replies; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-03 12:38 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, mst


> -----Original Message-----
> From: qemu-devel-bounces+arei.gonglei=huawei.com@nongnu.org
> [mailto:qemu-devel-bounces+arei.gonglei=huawei.com@nongnu.org] On
> Behalf Of Paolo Bonzini
> Sent: Sunday, January 31, 2016 6:29 PM
> To: qemu-devel@nongnu.org
> Cc: cornelia.huck@de.ibm.com; mst@redhat.com
> Subject: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
> 
> This includes two optimization of virtio:
> 
> - "slimming down" VirtQueueElements by not including room for
>   1024 buffers.  This makes malloc much faster.
> 
> - optimizations to limit the number of address_space_translate
>   calls in virtio.c, from Vincenzo and myself.
> 
> Thanks,
> 
> Paolo
> 
> v1->v2: improved commit messages [Conny]
>         add assertions on sz [Conny]
>         change bools from 1 and 0 to "true" and "false" [Conny]
>         update shadow avail_idx in virtio_queue_set_last_avail_idx [Michael]
>         collect Reviewed-by
> 
> Paolo Bonzini (7):
>   virtio: move VirtQueueElement at the beginning of the structs
>   virtio: move allocation to virtqueue_pop/vring_pop
>   virtio: introduce qemu_get/put_virtqueue_element
>   virtio: introduce virtqueue_alloc_element
>   virtio: slim down allocation of VirtQueueElements
>   vring: slim down allocation of VirtQueueElements
>   virtio: combine the read of a descriptor
> 
> Vincenzo Maffione (3):
>   virtio: cache used_idx in a VirtQueue field
>   virtio: read avail_idx from VQ only when necessary
>   virtio: combine write of an entry into used ring
> 
>  hw/9pfs/9p.c                        |   2 +-
>  hw/9pfs/virtio-9p-device.c          |  17 +-
>  hw/9pfs/virtio-9p.h                 |   2 +-
>  hw/block/dataplane/virtio-blk.c     |  11 +-
>  hw/block/virtio-blk.c               |  23 +--
>  hw/char/virtio-serial-bus.c         |  78 +++++----
>  hw/display/virtio-gpu.c             |  25 ++-
>  hw/input/virtio-input.c             |  24 ++-
>  hw/net/virtio-net.c                 |  69 +++++---
>  hw/scsi/virtio-scsi-dataplane.c     |  15 +-
>  hw/scsi/virtio-scsi.c               |  26 ++-
>  hw/virtio/dataplane/vring.c         |  62 ++++---
>  hw/virtio/virtio-balloon.c          |  22 ++-
>  hw/virtio/virtio-rng.c              |  10 +-
>  hw/virtio/virtio.c                  | 340
> +++++++++++++++++++++++++-----------
>  include/hw/virtio/dataplane/vring.h |   2 +-
>  include/hw/virtio/virtio-balloon.h  |   2 +-
>  include/hw/virtio/virtio-blk.h      |   5 +-
>  include/hw/virtio/virtio-net.h      |   2 +-
>  include/hw/virtio/virtio-scsi.h     |  15 +-
>  include/hw/virtio/virtio-serial.h   |   2 +-
>  include/hw/virtio/virtio.h          |  13 +-
>  22 files changed, 486 insertions(+), 281 deletions(-)
> 
> --
> 2.5.0
> 

For patches 7, 8, 9 and 10:

 Tested-by: Gonglei <arei.gonglei@huawei.com>

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-02-03 12:34   ` Gonglei (Arei)
@ 2016-02-03 13:40     ` Paolo Bonzini
  2016-02-04  7:48       ` Gonglei (Arei)
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-02-03 13:40 UTC (permalink / raw)
  To: Gonglei (Arei), qemu-devel; +Cc: cornelia.huck, mst



On 03/02/2016 13:34, Gonglei (Arei) wrote:
> Hi,
> 
>> Subject: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
>>
>> Compared to vring, virtio has a performance penalty of 10%.  Fix it
>> by combining all the reads for a descriptor in a single address_space_read
>> call.  This also simplifies the code nicely.
>>
>> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
>>  1 file changed, 35 insertions(+), 51 deletions(-)
>>
> 
> Unbelievable! After applying this patch, the virtio-crypto speed can reach 74 MB/sec, host
> CPU overhead is 180% (the main thread 100% and vcpu threads 80%)

The three patches from Vincenzo will help too.  What was it like before?

Also, are you using ioeventfd or dataplane?  virtio-crypto sounds like
something that could be very easily run outside the "big QEMU lock".

Paolo

> Testing AES-128-CBC cipher: 
>         Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs: 74.12 MiB/sec (1523475 packets)
>         Encrypting in chunks of 256 bytes: done. 369.85 MiB in 5.01 secs: 73.88 MiB/sec (1514900 packets)
>         Encrypting in chunks of 256 bytes: done. 371.07 MiB in 5.02 secs: 73.97 MiB/sec (1519914 packets)
>         Encrypting in chunks of 256 bytes: done. 371.66 MiB in 5.02 secs: 74.09 MiB/sec (1522309 packets)
>         Encrypting in chunks of 256 bytes: done. 371.79 MiB in 5.02 secs: 74.12 MiB/sec (1522868 packets)
>         Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs: 74.15 MiB/sec (1523457 packets)
>         Encrypting in chunks of 256 bytes: done. 371.90 MiB in 5.02 secs: 74.14 MiB/sec (1523317 packets)
>         Encrypting in chunks of 256 bytes: done. 371.71 MiB in 5.02 secs: 74.10 MiB/sec (1522522 packets)
> 
> 15.95%  qemu-kvm                 [.] address_space_translate
>   6.98%  qemu-kvm                 [.] qemu_get_ram_ptr
>   4.87%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
>   4.40%  qemu-kvm                 [.] qemu_ram_addr_from_host
>   3.79%  qemu-kvm                 [.] address_space_map
>   3.41%  libc-2.19.so             [.] _int_malloc
>   3.29%  libc-2.19.so             [.] _int_free
>   3.07%  libc-2.19.so             [.] malloc
>   2.95%  libpthread-2.19.so       [.] pthread_mutex_lock
>   2.94%  qemu-kvm                 [.] phys_page_find
>   2.73%  qemu-kvm                 [.] address_space_translate_internal
>   2.65%  libc-2.19.so             [.] malloc_consolidate
>   2.35%  libc-2.19.so             [.] __memcpy_sse2_unaligned
>   1.72%  qemu-kvm                 [.] find_next_zero_bit
>   1.38%  qemu-kvm                 [.] address_space_rw
>   1.34%  qemu-kvm                 [.] object_unref
>   1.30%  qemu-kvm                 [.] object_ref
>   1.28%  qemu-kvm                 [.] virtqueue_pop
>   1.20%  libc-2.19.so             [.] memset
>   1.11%  qemu-kvm                 [.] virtio_notify
> 
> Thank you so much!
> 
> Regards,
> -Gonglei
> 
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 79a635f..2433866 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice *vdev,
>> int n)
>>                                vring->align);
>>  }
>>
>> -static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
>> -                                       int i)
>> +static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
>> +                            hwaddr desc_pa, int i)
>>  {
>> -    hwaddr pa;
>> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
>> -    return virtio_ldq_phys(vdev, pa);
>> -}
>> -
>> -static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa, int
>> i)
>> -{
>> -    hwaddr pa;
>> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
>> -    return virtio_ldl_phys(vdev, pa);
>> -}
>> -
>> -static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr desc_pa,
>> -                                        int i)
>> -{
>> -    hwaddr pa;
>> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
>> -    return virtio_lduw_phys(vdev, pa);
>> -}
>> -
>> -static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
>> -                                       int i)
>> -{
>> -    hwaddr pa;
>> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
>> -    return virtio_lduw_phys(vdev, pa);
>> +    address_space_read(&address_space_memory, desc_pa + i *
>> sizeof(VRingDesc),
>> +                       MEMTXATTRS_UNSPECIFIED, (void *)desc,
>> sizeof(VRingDesc));
>> +    virtio_tswap64s(vdev, &desc->addr);
>> +    virtio_tswap32s(vdev, &desc->len);
>> +    virtio_tswap16s(vdev, &desc->flags);
>> +    virtio_tswap16s(vdev, &desc->next);
>>  }
>>
>>  static inline uint16_t vring_avail_flags(VirtQueue *vq)
>> @@ -345,18 +325,18 @@ static unsigned int virtqueue_get_head(VirtQueue
>> *vq, unsigned int idx)
>>      return head;
>>  }
>>
>> -static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
>> -                                    unsigned int i, unsigned int max)
>> +static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc
>> *desc,
>> +                                         hwaddr desc_pa, unsigned
>> int max)
>>  {
>>      unsigned int next;
>>
>>      /* If this descriptor says it doesn't chain, we're done. */
>> -    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
>> +    if (!(desc->flags & VRING_DESC_F_NEXT)) {
>>          return max;
>>      }
>>
>>      /* Check they're not leading us off end of descriptors. */
>> -    next = vring_desc_next(vdev, desc_pa, i);
>> +    next = desc->next;
>>      /* Make sure compiler knows to grab that: we don't want it changing! */
>>      smp_wmb();
>>
>> @@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice
>> *vdev, hwaddr desc_pa,
>>          exit(1);
>>      }
>>
>> +    vring_desc_read(vdev, desc, desc_pa, next);
>>      return next;
>>  }
>>
>> @@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
>> unsigned int *in_bytes,
>>      while (virtqueue_num_heads(vq, idx)) {
>>          VirtIODevice *vdev = vq->vdev;
>>          unsigned int max, num_bufs, indirect = 0;
>> +        VRingDesc desc;
>>          hwaddr desc_pa;
>>          int i;
>>
>> @@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
>> unsigned int *in_bytes,
>>          num_bufs = total_bufs;
>>          i = virtqueue_get_head(vq, idx++);
>>          desc_pa = vq->vring.desc;
>> +        vring_desc_read(vdev, &desc, desc_pa, i);
>>
>> -        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> -            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
>> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
>> +            if (desc.len % sizeof(VRingDesc)) {
>>                  error_report("Invalid size for indirect buffer table");
>>                  exit(1);
>>              }
>> @@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
>> unsigned int *in_bytes,
>>
>>              /* loop over the indirect descriptor table */
>>              indirect = 1;
>> -            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
>> -            desc_pa = vring_desc_addr(vdev, desc_pa, i);
>> +            max = desc.len / sizeof(VRingDesc);
>> +            desc_pa = desc.addr;
>>              num_bufs = i = 0;
>> +            vring_desc_read(vdev, &desc, desc_pa, i);
>>          }
>>
>>          do {
>> @@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
>> unsigned int *in_bytes,
>>                  exit(1);
>>              }
>>
>> -            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE)
>> {
>> -                in_total += vring_desc_len(vdev, desc_pa, i);
>> +            if (desc.flags & VRING_DESC_F_WRITE) {
>> +                in_total += desc.len;
>>              } else {
>> -                out_total += vring_desc_len(vdev, desc_pa, i);
>> +                out_total += desc.len;
>>              }
>>              if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
>>                  goto done;
>>              }
>> -        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
>> +        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa,
>> max)) != max);
>>
>>          if (!indirect)
>>              total_bufs = num_bufs;
>> @@ -545,6 +529,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>>      unsigned out_num, in_num;
>>      hwaddr addr[VIRTQUEUE_MAX_SIZE];
>>      struct iovec iov[VIRTQUEUE_MAX_SIZE];
>> +    VRingDesc desc;
>>
>>      if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
>>          return NULL;
>> @@ -560,33 +545,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>>          vring_set_avail_event(vq, vq->last_avail_idx);
>>      }
>>
>> -    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> -        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
>> +    vring_desc_read(vdev, &desc, desc_pa, i);
>> +    if (desc.flags & VRING_DESC_F_INDIRECT) {
>> +        if (desc.len % sizeof(VRingDesc)) {
>>              error_report("Invalid size for indirect buffer table");
>>              exit(1);
>>          }
>>
>>          /* loop over the indirect descriptor table */
>> -        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
>> -        desc_pa = vring_desc_addr(vdev, desc_pa, i);
>> +        max = desc.len / sizeof(VRingDesc);
>> +        desc_pa = desc.addr;
>>          i = 0;
>> +        vring_desc_read(vdev, &desc, desc_pa, i);
>>      }
>>
>>      /* Collect all the descriptors */
>>      do {
>> -        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
>> -        size_t len = vring_desc_len(vdev, desc_pa, i);
>> -
>> -        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
>> +        if (desc.flags & VRING_DESC_F_WRITE) {
>>              virtqueue_map_desc(&in_num, addr + out_num, iov +
>> out_num,
>> -                               VIRTQUEUE_MAX_SIZE - out_num, true,
>> pa, len);
>> +                               VIRTQUEUE_MAX_SIZE - out_num, true,
>> desc.addr, desc.len);
>>          } else {
>>              if (in_num) {
>>                  error_report("Incorrect order for descriptors");
>>                  exit(1);
>>              }
>>              virtqueue_map_desc(&out_num, addr, iov,
>> -                               VIRTQUEUE_MAX_SIZE, false, pa, len);
>> +                               VIRTQUEUE_MAX_SIZE, false,
>> desc.addr, desc.len);
>>          }
>>
>>          /* If we've got too many, that implies a descriptor loop. */
>> @@ -594,7 +578,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>>              error_report("Looped descriptor");
>>              exit(1);
>>          }
>> -    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
>> +    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) !=
>> max);
>>
>>      /* Now copy what we have collected and mapped */
>>      elem = virtqueue_alloc_element(sz, out_num, in_num);
>> --
>> 2.5.0
>>
>>
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-02-03 13:40     ` Paolo Bonzini
@ 2016-02-04  7:48       ` Gonglei (Arei)
  2016-02-04 10:18         ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-04  7:48 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, mst


Hi Paolo,

> From: Paolo Bonzini [mailto:paolo.bonzini@gmail.com] On Behalf Of Paolo
> Bonzini
>
> On 03/02/2016 13:34, Gonglei (Arei) wrote:
> > Hi,
> >
> >> Subject: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
> >>
> >> Compared to vring, virtio has a performance penalty of 10%.  Fix it
> >> by combining all the reads for a descriptor in a single address_space_read
> >> call.  This also simplifies the code nicely.
> >>
> >> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >>  hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
> >>  1 file changed, 35 insertions(+), 51 deletions(-)
> >>
> >
> > Unbelievable! After applying this patch, the virtio-crypto speed can reach
> > 74 MB/sec, host CPU overhead is 180% (the main thread 100% and vcpu threads 80%)
> 
> The three patches from Vincenzo will help too.  What was it like before?
> 
That's true, and I replied in the cover letter.

> Also, are you using ioeventfd or dataplane?  virtio-crypto sounds like
> something that could be very easily run outside the "big QEMU lock".
> 

Yes, all testing results are based on ioeventfd being enabled.  I also
implemented the virtio-crypto-dataplane scheme and got the results below:

Testing AES-128-CBC cipher: 
        Encrypting in chunks of 256 bytes: done. 370.81 MiB in 5.02 secs: 73.92 MiB/sec (1518832 packets)
        Encrypting in chunks of 256 bytes: done. 374.09 MiB in 5.02 secs: 74.51 MiB/sec (1532272 packets)
        Encrypting in chunks of 256 bytes: done. 372.05 MiB in 5.02 secs: 74.11 MiB/sec (1523920 packets)
        Encrypting in chunks of 256 bytes: done. 365.77 MiB in 5.02 secs: 72.86 MiB/sec (1498200 packets)
        Encrypting in chunks of 256 bytes: done. 379.63 MiB in 5.02 secs: 75.62 MiB/sec (1554976 packets)
        Encrypting in chunks of 256 bytes: done. 374.61 MiB in 5.02 secs: 74.62 MiB/sec (1534384 packets)
        Encrypting in chunks of 256 bytes: done. 372.13 MiB in 5.02 secs: 74.12 MiB/sec (1524226 packets)
        Encrypting in chunks of 256 bytes: done. 374.18 MiB in 5.02 secs: 74.53 MiB/sec (1532656 packets)


11.44%  qemu-kvm                 [.] memory_region_find
  6.31%  qemu-kvm                 [.] qemu_get_ram_ptr
  4.61%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  3.54%  qemu-kvm                 [.] qemu_ram_addr_from_host
  2.80%  libpthread-2.19.so       [.] pthread_mutex_lock
  2.55%  qemu-kvm                 [.] object_unref
  2.49%  libc-2.19.so             [.] malloc
  2.47%  libc-2.19.so             [.] _int_malloc
  2.34%  libc-2.19.so             [.] _int_free
  2.18%  qemu-kvm                 [.] object_ref
  2.18%  qemu-kvm                 [.] address_space_translate
  2.03%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  1.76%  libc-2.19.so             [.] malloc_consolidate
  1.56%  qemu-kvm                 [.] addrrange_intersection
  1.52%  qemu-kvm                 [.] vring_pop
  1.36%  qemu-kvm                 [.] find_next_zero_bit
  1.30%  [kernel]                 [k] native_write_msr_safe
  1.29%  qemu-kvm                 [.] addrrange_intersects
  1.21%  qemu-kvm                 [.] vring_map
  0.93%  qemu-kvm                 [.] virtio_notify

Do you have any thoughts on how to decrease the CPU overhead and get higher throughput? Thanks!

Regards,
-Gonglei

> Paolo
> 
> > Testing AES-128-CBC cipher:
> >         Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs:
> 74.12 MiB/sec (1523475 packets)
> >         Encrypting in chunks of 256 bytes: done. 369.85 MiB in 5.01 secs:
> 73.88 MiB/sec (1514900 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.07 MiB in 5.02 secs:
> 73.97 MiB/sec (1519914 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.66 MiB in 5.02 secs:
> 74.09 MiB/sec (1522309 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.79 MiB in 5.02 secs:
> 74.12 MiB/sec (1522868 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.94 MiB in 5.02 secs:
> 74.15 MiB/sec (1523457 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.90 MiB in 5.02 secs:
> 74.14 MiB/sec (1523317 packets)
> >         Encrypting in chunks of 256 bytes: done. 371.71 MiB in 5.02 secs:
> 74.10 MiB/sec (1522522 packets)
> >
> > 15.95%  qemu-kvm                 [.] address_space_translate
> >   6.98%  qemu-kvm                 [.] qemu_get_ram_ptr
> >   4.87%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
> >   4.40%  qemu-kvm                 [.] qemu_ram_addr_from_host
> >   3.79%  qemu-kvm                 [.] address_space_map
> >   3.41%  libc-2.19.so             [.] _int_malloc
> >   3.29%  libc-2.19.so             [.] _int_free
> >   3.07%  libc-2.19.so             [.] malloc
> >   2.95%  libpthread-2.19.so       [.] pthread_mutex_lock
> >   2.94%  qemu-kvm                 [.] phys_page_find
> >   2.73%  qemu-kvm                 [.]
> address_space_translate_internal
> >   2.65%  libc-2.19.so             [.] malloc_consolidate
> >   2.35%  libc-2.19.so             [.] __memcpy_sse2_unaligned
> >   1.72%  qemu-kvm                 [.] find_next_zero_bit
> >   1.38%  qemu-kvm                 [.] address_space_rw
> >   1.34%  qemu-kvm                 [.] object_unref
> >   1.30%  qemu-kvm                 [.] object_ref
> >   1.28%  qemu-kvm                 [.] virtqueue_pop
> >   1.20%  libc-2.19.so             [.] memset
> >   1.11%  qemu-kvm                 [.] virtio_notify
> >
> > Thank you so much!
> >
> > Regards,
> > -Gonglei
> >
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 79a635f..2433866 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice
> *vdev,
> >> int n)
> >>                                vring->align);
> >>  }
> >>
> >> -static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
> >> -                                       int i)
> >> +static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
> >> +                            hwaddr desc_pa, int i)
> >>  {
> >> -    hwaddr pa;
> >> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
> >> -    return virtio_ldq_phys(vdev, pa);
> >> -}
> >> -
> >> -static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa,
> int
> >> i)
> >> -{
> >> -    hwaddr pa;
> >> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
> >> -    return virtio_ldl_phys(vdev, pa);
> >> -}
> >> -
> >> -static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr
> desc_pa,
> >> -                                        int i)
> >> -{
> >> -    hwaddr pa;
> >> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
> >> -    return virtio_lduw_phys(vdev, pa);
> >> -}
> >> -
> >> -static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
> >> -                                       int i)
> >> -{
> >> -    hwaddr pa;
> >> -    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
> >> -    return virtio_lduw_phys(vdev, pa);
> >> +    address_space_read(&address_space_memory, desc_pa + i *
> >> sizeof(VRingDesc),
> >> +                       MEMTXATTRS_UNSPECIFIED, (void *)desc,
> >> sizeof(VRingDesc));
> >> +    virtio_tswap64s(vdev, &desc->addr);
> >> +    virtio_tswap32s(vdev, &desc->len);
> >> +    virtio_tswap16s(vdev, &desc->flags);
> >> +    virtio_tswap16s(vdev, &desc->next);
> >>  }
> >>
> >>  static inline uint16_t vring_avail_flags(VirtQueue *vq)
> >> @@ -345,18 +325,18 @@ static unsigned int
> virtqueue_get_head(VirtQueue
> >> *vq, unsigned int idx)
> >>      return head;
> >>  }
> >>
> >> -static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
> >> -                                    unsigned int i, unsigned int
> max)
> >> +static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc
> >> *desc,
> >> +                                         hwaddr desc_pa,
> unsigned
> >> int max)
> >>  {
> >>      unsigned int next;
> >>
> >>      /* If this descriptor says it doesn't chain, we're done. */
> >> -    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
> >> +    if (!(desc->flags & VRING_DESC_F_NEXT)) {
> >>          return max;
> >>      }
> >>
> >>      /* Check they're not leading us off end of descriptors. */
> >> -    next = vring_desc_next(vdev, desc_pa, i);
> >> +    next = desc->next;
> >>      /* Make sure compiler knows to grab that: we don't want it changing!
> */
> >>      smp_wmb();
> >>
> >> @@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice
> >> *vdev, hwaddr desc_pa,
> >>          exit(1);
> >>      }
> >>
> >> +    vring_desc_read(vdev, desc, desc_pa, next);
> >>      return next;
> >>  }
> >>
> >> @@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> >> unsigned int *in_bytes,
> >>      while (virtqueue_num_heads(vq, idx)) {
> >>          VirtIODevice *vdev = vq->vdev;
> >>          unsigned int max, num_bufs, indirect = 0;
> >> +        VRingDesc desc;
> >>          hwaddr desc_pa;
> >>          int i;
> >>
> >> @@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> >> unsigned int *in_bytes,
> >>          num_bufs = total_bufs;
> >>          i = virtqueue_get_head(vq, idx++);
> >>          desc_pa = vq->vring.desc;
> >> +        vring_desc_read(vdev, &desc, desc_pa, i);
> >>
> >> -        if (vring_desc_flags(vdev, desc_pa, i) &
> VRING_DESC_F_INDIRECT) {
> >> -            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
> >> +        if (desc.flags & VRING_DESC_F_INDIRECT) {
> >> +            if (desc.len % sizeof(VRingDesc)) {
> >>                  error_report("Invalid size for indirect buffer table");
> >>                  exit(1);
> >>              }
> >> @@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> >> unsigned int *in_bytes,
> >>
> >>              /* loop over the indirect descriptor table */
> >>              indirect = 1;
> >> -            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
> >> -            desc_pa = vring_desc_addr(vdev, desc_pa, i);
> >> +            max = desc.len / sizeof(VRingDesc);
> >> +            desc_pa = desc.addr;
> >>              num_bufs = i = 0;
> >> +            vring_desc_read(vdev, &desc, desc_pa, i);
> >>          }
> >>
> >>          do {
> >> @@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq,
> >> unsigned int *in_bytes,
> >>                  exit(1);
> >>              }
> >>
> >> -            if (vring_desc_flags(vdev, desc_pa, i) &
> VRING_DESC_F_WRITE)
> >> {
> >> -                in_total += vring_desc_len(vdev, desc_pa, i);
> >> +            if (desc.flags & VRING_DESC_F_WRITE) {
> >> +                in_total += desc.len;
> >>              } else {
> >> -                out_total += vring_desc_len(vdev, desc_pa, i);
> >> +                out_total += desc.len;
> >>              }
> >>              if (in_total >= max_in_bytes && out_total >=
> max_out_bytes) {
> >>                  goto done;
> >>              }
> >> -        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
> >> +        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa,
> >> max)) != max);
> >>
> >>          if (!indirect)
> >>              total_bufs = num_bufs;
> >> @@ -545,6 +529,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >>      unsigned out_num, in_num;
> >>      hwaddr addr[VIRTQUEUE_MAX_SIZE];
> >>      struct iovec iov[VIRTQUEUE_MAX_SIZE];
> >> +    VRingDesc desc;
> >>
> >>      if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
> >>          return NULL;
> >> @@ -560,33 +545,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >>          vring_set_avail_event(vq, vq->last_avail_idx);
> >>      }
> >>
> >> -    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> -        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
> >> +    vring_desc_read(vdev, &desc, desc_pa, i);
> >> +    if (desc.flags & VRING_DESC_F_INDIRECT) {
> >> +        if (desc.len % sizeof(VRingDesc)) {
> >>              error_report("Invalid size for indirect buffer table");
> >>              exit(1);
> >>          }
> >>
> >>          /* loop over the indirect descriptor table */
> >> -        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
> >> -        desc_pa = vring_desc_addr(vdev, desc_pa, i);
> >> +        max = desc.len / sizeof(VRingDesc);
> >> +        desc_pa = desc.addr;
> >>          i = 0;
> >> +        vring_desc_read(vdev, &desc, desc_pa, i);
> >>      }
> >>
> >>      /* Collect all the descriptors */
> >>      do {
> >> -        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
> >> -        size_t len = vring_desc_len(vdev, desc_pa, i);
> >> -
> >> -        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
> >> +        if (desc.flags & VRING_DESC_F_WRITE) {
> >>              virtqueue_map_desc(&in_num, addr + out_num, iov +
> >> out_num,
> >> -                               VIRTQUEUE_MAX_SIZE - out_num,
> true,
> >> pa, len);
> >> +                               VIRTQUEUE_MAX_SIZE - out_num,
> true,
> >> desc.addr, desc.len);
> >>          } else {
> >>              if (in_num) {
> >>                  error_report("Incorrect order for descriptors");
> >>                  exit(1);
> >>              }
> >>              virtqueue_map_desc(&out_num, addr, iov,
> >> -                               VIRTQUEUE_MAX_SIZE, false, pa,
> len);
> >> +                               VIRTQUEUE_MAX_SIZE, false,
> >> desc.addr, desc.len);
> >>          }
> >>
> >>          /* If we've got too many, that implies a descriptor loop. */
> >> @@ -594,7 +578,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
> >>              error_report("Looped descriptor");
> >>              exit(1);
> >>          }
> >> -    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
> >> +    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) !=
> >> max);
> >>
> >>      /* Now copy what we have collected and mapped */
> >>      elem = virtqueue_alloc_element(sz, out_num, in_num);
> >> --
> >> 2.5.0
> >>
> >>
> >
> >
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-02-04  7:48       ` Gonglei (Arei)
@ 2016-02-04 10:18         ` Paolo Bonzini
  2016-02-05  6:16           ` Gonglei (Arei)
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-02-04 10:18 UTC (permalink / raw)
  To: Gonglei (Arei), qemu-devel; +Cc: cornelia.huck, mst



On 04/02/2016 08:48, Gonglei (Arei) wrote:
> 11.44%  qemu-kvm                 [.] memory_region_find
>   6.31%  qemu-kvm                 [.] qemu_get_ram_ptr
>   4.61%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
>   3.54%  qemu-kvm                 [.] qemu_ram_addr_from_host
>   2.80%  libpthread-2.19.so       [.] pthread_mutex_lock
>   2.55%  qemu-kvm                 [.] object_unref
>   2.49%  libc-2.19.so             [.] malloc
>   2.47%  libc-2.19.so             [.] _int_malloc
>   2.34%  libc-2.19.so             [.] _int_free
>   2.18%  qemu-kvm                 [.] object_ref
>   2.18%  qemu-kvm                 [.] address_space_translate
>   2.03%  libc-2.19.so             [.] __memcpy_sse2_unaligned
>   1.76%  libc-2.19.so             [.] malloc_consolidate
>   1.56%  qemu-kvm                 [.] addrrange_intersection
>   1.52%  qemu-kvm                 [.] vring_pop
>   1.36%  qemu-kvm                 [.] find_next_zero_bit
>   1.30%  [kernel]                 [k] native_write_msr_safe
>   1.29%  qemu-kvm                 [.] addrrange_intersects
>   1.21%  qemu-kvm                 [.] vring_map
>   0.93%  qemu-kvm                 [.] virtio_notify
> 
> Do you have any thoughts on how to decrease the CPU overhead and get higher throughput? Thanks!

Using chunks bigger than 256 bytes will reduce the overhead in
memory_region_find and qemu_get_ram_ptr.  You could expect a further
10-12% improvement.
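
As a rough back-of-envelope (only an estimate, using nothing but the
numbers already posted in this thread): 74 MiB/s at 256-byte requests
is about 300k requests per second, i.e. roughly 3 us of fixed
per-request cost (translate, map, malloc, notify) that does not shrink
with the payload:

    74 MiB/s / 256 B    ~= 303,000 requests/s
    1 s / 303,000       ~= 3.3 us of fixed cost per request
    => throughput       ~= chunk size / 3.3 us, until the per-byte
                           copy cost starts to dominate

So doubling the chunk size should roughly double throughput at first,
then flatten out as memcpy takes over.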

Paolo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
  2016-02-03 12:08 ` [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Gonglei (Arei)
@ 2016-02-04 10:19   ` Paolo Bonzini
  2016-02-05  7:17     ` Gonglei (Arei)
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-02-04 10:19 UTC (permalink / raw)
  To: Gonglei (Arei), qemu-devel; +Cc: cornelia.huck, v.maffione, mst



On 03/02/2016 13:08, Gonglei (Arei) wrote:
> 22.56%  qemu-kvm                 [.] address_space_translate
>  13.29%  qemu-kvm                 [.] qemu_get_ram_ptr

We could get rid of qemu_get_ram_ptr by storing the RAMBlock pointer
into the memory region, instead of the ram_addr_t value.  I'm happy to
answer any question if you want to do it.
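
Roughly, the idea is something like the sketch below (the field and
helper names are only illustrative, not a concrete patch); the point is
that the hot path becomes a constant-time offset into the cached block
instead of the list walk done by qemu_get_ram_ptr:

    /* Illustrative sketch only: MemoryRegion grows a cached pointer to
     * its RAMBlock, set when the region's RAM is allocated. */
    static inline void *memory_region_cached_ram_ptr(MemoryRegion *mr,
                                                     ram_addr_t addr)
    {
        RAMBlock *block = mr->ram_block;        /* new cached field */
        return ramblock_ptr(block, addr - block->offset);
    }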

Paolo

>   4.71%  qemu-kvm                 [.] phys_page_find
>   4.43%  qemu-kvm                 [.] address_space_translate_internal
>   3.47%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
>   3.08%  qemu-kvm                 [.] qemu_ram_addr_from_host
>   2.62%  qemu-kvm                 [.] address_space_map
>   2.61%  libc-2.19.so             [.] _int_malloc
>   2.58%  libc-2.19.so             [.] _int_free
>   2.38%  libc-2.19.so             [.] malloc
>   2.06%  libpthread-2.19.so       [.] pthread_mutex_lock
>   1.68%  libc-2.19.so             [.] malloc_consolidate
>   1.35%  libc-2.19.so             [.] __memcpy_sse2_unaligned
>   1.23%  qemu-kvm                 [.] lduw_le_phys
>   1.18%  qemu-kvm                 [.] find_next_zero_bit
>   1.02%  qemu-kvm                 [.] object_unref

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor
  2016-02-04 10:18         ` Paolo Bonzini
@ 2016-02-05  6:16           ` Gonglei (Arei)
  0 siblings, 0 replies; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-05  6:16 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, mst







> -----Original Message-----
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Thursday, February 04, 2016 6:18 PM
> To: Gonglei (Arei); qemu-devel@nongnu.org
> Cc: cornelia.huck@de.ibm.com; mst@redhat.com
> Subject: Re: [PATCH 07/10] virtio: combine the read of a descriptor
> 
> 
> 
> On 04/02/2016 08:48, Gonglei (Arei) wrote:
> > 11.44%  qemu-kvm                 [.] memory_region_find
> >   6.31%  qemu-kvm                 [.] qemu_get_ram_ptr
> >   4.61%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
> >   3.54%  qemu-kvm                 [.] qemu_ram_addr_from_host
> >   2.80%  libpthread-2.19.so       [.] pthread_mutex_lock
> >   2.55%  qemu-kvm                 [.] object_unref
> >   2.49%  libc-2.19.so             [.] malloc
> >   2.47%  libc-2.19.so             [.] _int_malloc
> >   2.34%  libc-2.19.so             [.] _int_free
> >   2.18%  qemu-kvm                 [.] object_ref
> >   2.18%  qemu-kvm                 [.] address_space_translate
> >   2.03%  libc-2.19.so             [.] __memcpy_sse2_unaligned
> >   1.76%  libc-2.19.so             [.] malloc_consolidate
> >   1.56%  qemu-kvm                 [.] addrrange_intersection
> >   1.52%  qemu-kvm                 [.] vring_pop
> >   1.36%  qemu-kvm                 [.] find_next_zero_bit
> >   1.30%  [kernel]                 [k] native_write_msr_safe
> >   1.29%  qemu-kvm                 [.] addrrange_intersects
> >   1.21%  qemu-kvm                 [.] vring_map
> >   0.93%  qemu-kvm                 [.] virtio_notify
> >
> > Do you have any thoughts on how to decrease the CPU overhead and get higher
> > throughput? Thanks!
> 
> Using bigger chunks than 256 bytes will reduce the overhead in
> memory_region_find and qemu_get_ram_ptr.  You could expect
>  a further 10-12% improvement.
> 
Yes, you're right. These are the testing results:

        Encrypting in chunks of 256 bytes: done. 386.89 MiB in 5.02 secs: 77.13 MiB/sec (1584701 packets)
        Encrypting in chunks of 512 bytes: done. 756.80 MiB in 5.02 secs: 150.86 MiB/sec (1549918 packets)
        Encrypting in chunks of 1024 bytes: done. 1.30 GiB in 5.02 secs: 0.26 GiB/sec (1358614 packets)
        Encrypting in chunks of 2048 bytes: done. 2.42 GiB in 5.02 secs: 0.48 GiB/sec (1270223 packets)
        Encrypting in chunks of 4096 bytes: done. 3.99 GiB in 5.02 secs: 0.79 GiB/sec (1046680 packets)
        Encrypting in chunks of 8192 bytes: done. 6.12 GiB in 5.02 secs: 1.22 GiB/sec (802379 packets)
        Encrypting in chunks of 16384 bytes: done. 8.48 GiB in 5.04 secs: 1.68 GiB/sec (556046 packets)
        Encrypting in chunks of 32768 bytes: done. 10.42 GiB in 5.07 secs: 2.06 GiB/sec (341524 packets)

But 256 bytes is the dominant packet chunk size in CT scenarios.

Regards,
-Gonglei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches
  2016-02-04 10:19   ` Paolo Bonzini
@ 2016-02-05  7:17     ` Gonglei (Arei)
  0 siblings, 0 replies; 26+ messages in thread
From: Gonglei (Arei) @ 2016-02-05  7:17 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel; +Cc: cornelia.huck, v.maffione, mst

Dear Paolo,


> 
> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Thursday, February 04, 2016 6:19 PM
> 
> On 03/02/2016 13:08, Gonglei (Arei) wrote:
> > 22.56%  qemu-kvm                 [.] address_space_translate
> >  13.29%  qemu-kvm                 [.] qemu_get_ram_ptr
> 
> We could get rid of qemu_get_ram_ptr by storing the RAMBlock pointer
> into the memory region, instead of the ram_addr_t value.  I'm happy to
> answer any question if you want to do it.
> 
Good point! I made a simple version of this change and got nearly an 8 MB/s throughput bonus.

Testing AES-128-CBC cipher: 
        Encrypting in chunks of 256 bytes: done. 412.16 MiB in 5.02 secs: 82.17 MiB/sec (1688202 packets)
        Encrypting in chunks of 256 bytes: done. 412.15 MiB in 5.02 secs: 82.16 MiB/sec (1688158 packets)
        Encrypting in chunks of 256 bytes: done. 412.32 MiB in 5.02 secs: 82.20 MiB/sec (1688876 packets)
        Encrypting in chunks of 256 bytes: done. 412.47 MiB in 5.02 secs: 82.23 MiB/sec (1689491 packets)
        Encrypting in chunks of 256 bytes: done. 412.31 MiB in 5.02 secs: 82.20 MiB/sec (1688825 packets)
        Encrypting in chunks of 256 bytes: done. 411.30 MiB in 5.01 secs: 82.15 MiB/sec (1684671 packets)
        Encrypting in chunks of 256 bytes: done. 412.08 MiB in 5.01 secs: 82.18 MiB/sec (1687864 packets)
        Encrypting in chunks of 256 bytes: done. 412.49 MiB in 5.02 secs: 82.23 MiB/sec (1689564 packets)

Now, 'perf top' shows me:

16.32%  qemu-kvm                 [.] address_space_translate
  5.39%  libpthread-2.19.so       [.] __pthread_mutex_unlock_usercnt
  4.13%  qemu-kvm                 [.] qemu_ram_addr_from_host
  4.01%  qemu-kvm                 [.] address_space_map
  3.82%  libc-2.19.so             [.] _int_malloc
  3.70%  libc-2.19.so             [.] _int_free
  3.49%  libc-2.19.so             [.] malloc
  3.18%  libpthread-2.19.so       [.] pthread_mutex_lock
  3.10%  qemu-kvm                 [.] phys_page_find
  2.93%  qemu-kvm                 [.] address_space_translate_internal
  2.74%  libc-2.19.so             [.] malloc_consolidate
  2.71%  libc-2.19.so             [.] __memcpy_sse2_unaligned
  1.92%  qemu-kvm                 [.] find_next_zero_bit
  1.65%  qemu-kvm                 [.] object_unref
  1.61%  qemu-kvm                 [.] address_space_rw
  1.35%  qemu-kvm                 [.] virtio_notify
  1.33%  qemu-kvm                 [.] object_ref
  1.22%  libc-2.19.so             [.] memset


Please review the patch below (based on qemu-2.3, which I'm using), thanks!
If it's OK, I can rebase it onto the master branch.

[PATCH] exec: store RAMBlock pointer into memory region

Signed-off-by: Gonglei <arei.gonglei@huawei.com>
---
 exec.c                  | 39 ++++++++++++++++++++++++---------------
 include/exec/memory.h   |  1 +
 include/exec/ram_addr.h |  1 +
 memory.c                |  4 +++-
 4 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/exec.c b/exec.c
index 4a16769..51d6f30 100644
--- a/exec.c
+++ b/exec.c
@@ -1544,6 +1544,7 @@ ram_addr_t qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
         error_propagate(errp, local_err);
         return -1;
     }
+    mr->ram_block = new_block;
     return addr;
 }
 
@@ -1817,6 +1818,11 @@ found:
     return mr;
 }
 
+void *qemu_get_ram_ptr_from_block(RAMBlock *block, hwaddr addr)
+{
+    return ramblock_ptr(block, addr - block->offset);
+}
+
 static void notdirty_mem_write(void *opaque, hwaddr ram_addr,
                                uint64_t val, unsigned size)
 {
@@ -2350,7 +2356,7 @@ bool address_space_rw(AddressSpace *as, hwaddr addr, uint8_t *buf,
             } else {
                 addr1 += memory_region_get_ram_addr(mr);
                 /* RAM case */
-                ptr = qemu_get_ram_ptr(addr1);
+                ptr = qemu_get_ram_ptr_from_block(mr->ram_block, addr1);
                 memcpy(ptr, buf, l);
                 invalidate_and_set_dirty(addr1, l);
             }
@@ -2384,7 +2390,7 @@ bool address_space_rw(AddressSpace *as, hwaddr addr, uint8_t *buf,
                 }
             } else {
                 /* RAM case */
-                ptr = qemu_get_ram_ptr(mr->ram_addr + addr1);
+                ptr = qemu_get_ram_ptr_from_block(mr->ram_block, mr->ram_addr + addr1);
                 memcpy(buf, ptr, l);
             }
         }
@@ -2437,7 +2443,7 @@ static inline void cpu_physical_memory_write_rom_internal(AddressSpace *as,
         } else {
             addr1 += memory_region_get_ram_addr(mr);
             /* ROM/RAM case */
-            ptr = qemu_get_ram_ptr(addr1);
+            ptr = qemu_get_ram_ptr_from_block(mr->ram_block, addr1);
             switch (type) {
             case WRITE_DATA:
                 memcpy(ptr, buf, l);
@@ -2681,9 +2687,10 @@ static inline uint32_t ldl_phys_internal(AddressSpace *as, hwaddr addr,
 #endif
     } else {
         /* RAM case */
-        ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
-                                & TARGET_PAGE_MASK)
-                               + addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block,
+                                (memory_region_get_ram_addr(mr)
+                                 & TARGET_PAGE_MASK)
+                                 + addr1);
         switch (endian) {
         case DEVICE_LITTLE_ENDIAN:
             val = ldl_le_p(ptr);
@@ -2740,9 +2747,10 @@ static inline uint64_t ldq_phys_internal(AddressSpace *as, hwaddr addr,
 #endif
     } else {
         /* RAM case */
-        ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
-                                & TARGET_PAGE_MASK)
-                               + addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block,
+                                (memory_region_get_ram_addr(mr)
+                                 & TARGET_PAGE_MASK)
+                                 + addr1);
         switch (endian) {
         case DEVICE_LITTLE_ENDIAN:
             val = ldq_le_p(ptr);
@@ -2807,9 +2815,10 @@ static inline uint32_t lduw_phys_internal(AddressSpace *as, hwaddr addr,
 #endif
     } else {
         /* RAM case */
-        ptr = qemu_get_ram_ptr((memory_region_get_ram_addr(mr)
-                                & TARGET_PAGE_MASK)
-                               + addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block,
+                                (memory_region_get_ram_addr(mr)
+                                 & TARGET_PAGE_MASK)
+                                 + addr1);
         switch (endian) {
         case DEVICE_LITTLE_ENDIAN:
             val = lduw_le_p(ptr);
@@ -2856,7 +2865,7 @@ void stl_phys_notdirty(AddressSpace *as, hwaddr addr, uint32_t val)
         io_mem_write(mr, addr1, val, 4);
     } else {
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
-        ptr = qemu_get_ram_ptr(addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block, addr1);
         stl_p(ptr, val);
 
         if (unlikely(in_migration)) {
@@ -2896,7 +2905,7 @@ static inline void stl_phys_internal(AddressSpace *as,
     } else {
         /* RAM case */
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
-        ptr = qemu_get_ram_ptr(addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block, addr1);
         switch (endian) {
         case DEVICE_LITTLE_ENDIAN:
             stl_le_p(ptr, val);
@@ -2959,7 +2968,7 @@ static inline void stw_phys_internal(AddressSpace *as,
     } else {
         /* RAM case */
         addr1 += memory_region_get_ram_addr(mr) & TARGET_PAGE_MASK;
-        ptr = qemu_get_ram_ptr(addr1);
+        ptr = qemu_get_ram_ptr_from_block(mr->ram_block, addr1);
         switch (endian) {
         case DEVICE_LITTLE_ENDIAN:
             stw_le_p(ptr, val);
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 06ffa1d..bd9ddea 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -146,6 +146,7 @@ struct MemoryRegion {
     Int128 size;
     hwaddr addr;
     void (*destructor)(MemoryRegion *mr);
+    void *ram_block;   /* RAMBlock pointer */
     ram_addr_t ram_addr;
     uint64_t align;
     bool subpage;
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ff558a4..cc8d769 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -38,6 +38,7 @@ void *qemu_get_ram_block_host_ptr(ram_addr_t addr);
 void *qemu_get_ram_ptr(ram_addr_t addr);
 void qemu_ram_free(ram_addr_t addr);
 void qemu_ram_free_from_ptr(ram_addr_t addr);
+void *qemu_get_ram_ptr_from_block(RAMBlock *block, hwaddr addr);
 
 int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, Error **errp);
 
diff --git a/memory.c b/memory.c
index ee3f2a8..31bd84a 100644
--- a/memory.c
+++ b/memory.c
@@ -877,6 +877,7 @@ void memory_region_init(MemoryRegion *mr,
         mr->size = int128_2_64();
     }
     mr->name = g_strdup(name);
+    mr->ram_block = NULL;
 
     if (name) {
         char *escaped_name = memory_region_escape_name(name);
@@ -1449,7 +1450,8 @@ void *memory_region_get_ram_ptr(MemoryRegion *mr)
 
     assert(mr->terminates);
 
-    return qemu_get_ram_ptr(mr->ram_addr & TARGET_PAGE_MASK);
+    return qemu_get_ram_ptr_from_block(mr->ram_block,
+                                       mr->ram_addr & TARGET_PAGE_MASK);
 }
 
 static void memory_region_update_coalesced_range_as(MemoryRegion *mr, AddressSpace *as)
-- 
1.8.5.2

Regards,
-Gonglei

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-19 13:22     ` Paolo Bonzini
@ 2016-01-19 14:01       ` Cornelia Huck
  0 siblings, 0 replies; 26+ messages in thread
From: Cornelia Huck @ 2016-01-19 14:01 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, mst

On Tue, 19 Jan 2016 14:22:37 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> The next patch will make virtqueue_pop/vring_pop allocate memory for the
> VirtQueueElement.  In some cases (blk, scsi, gpu) the device wants to
> extend VirtQueueElement with device-specific fields and, until now, the
> place of the VirtQueueElement within the containing struct didn't
> matter.  When allocating the entire block in virtqueue_pop/vring_pop,
> however, the containing struct must basically be a "subclass" of
> VirtQueueElement, with the VirtQueueElement as the first field.  Make
> that the case for blk and scsi; gpu is already doing it.

Sounds good!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-19 12:09   ` Cornelia Huck
@ 2016-01-19 13:22     ` Paolo Bonzini
  2016-01-19 14:01       ` Cornelia Huck
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-19 13:22 UTC (permalink / raw)
  To: Cornelia Huck; +Cc: qemu-devel, mst



On 19/01/2016 13:09, Cornelia Huck wrote:
>> > The next patch will make virtqueue_pop/vring_pop allocate memory for a
> s/will make/will make it possible for/
> 
> ?

The next patch will actually do that; this patch makes it possible.

> I had to spend some time grepping through the code to find that blk and
> scsi (and gpu, which already had elem at the beginning of its
> structure) are the only ones that work like this and that other devices
> do not need any change.
> 
>> > "subclass" of VirtQueueElement.  For this to work, VirtQueueElement
>> > must be the first field in the containing struct.

So...

The next patch will make virtqueue_pop/vring_pop allocate memory for the
VirtQueueElement.  In some cases (blk, scsi, gpu) the device wants to
extend VirtQueueElement with device-specific fields and, until now, the
place of the VirtQueueElement within the containing struct didn't
matter.  When allocating the entire block in virtqueue_pop/vring_pop,
however, the containing struct must basically be a "subclass" of
VirtQueueElement, with the VirtQueueElement as the first field.  Make
that the case for blk and scsi; gpu is already doing it.
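
As a minimal sketch of that layout (the request type and its extra
field below are made up; virtqueue_pop(vq, sz) is the allocating
variant this series introduces):

    typedef struct MyDeviceReq {
        VirtQueueElement elem;      /* must stay the first field */
        uint32_t device_state;      /* device-specific fields follow */
    } MyDeviceReq;

    MyDeviceReq *req = virtqueue_pop(vq, sizeof(MyDeviceReq));
    if (req) {
        /* One allocation covers both views: generic virtio code keeps
         * working on &req->elem, which is the same pointer as req. */
    }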

Paolo

>> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> > ---
>> >  hw/scsi/virtio-scsi.c           |  3 +--
>> >  include/hw/virtio/virtio-blk.h  |  2 +-
>> >  include/hw/virtio/virtio-scsi.h | 13 ++++++-------
>> >  3 files changed, 8 insertions(+), 10 deletions(-)
> Otherwise,
> 
> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-15 12:41 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
@ 2016-01-19 12:09   ` Cornelia Huck
  2016-01-19 13:22     ` Paolo Bonzini
  0 siblings, 1 reply; 26+ messages in thread
From: Cornelia Huck @ 2016-01-19 12:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, mst

On Fri, 15 Jan 2016 13:41:49 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> The next patch will make virtqueue_pop/vring_pop allocate memory for a

s/will make/will make it possible for/

?

I had to spend some time grepping through the code to find that blk and
scsi (and gpu, which already had elem at the beginning of its
structure) are the only ones that work like this and that other devices
do not need any change.

> "subclass" of VirtQueueElement.  For this to work, VirtQueueElement
> must be the first field in the containing struct.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/scsi/virtio-scsi.c           |  3 +--
>  include/hw/virtio/virtio-blk.h  |  2 +-
>  include/hw/virtio/virtio-scsi.h | 13 ++++++-------
>  3 files changed, 8 insertions(+), 10 deletions(-)

Otherwise,

Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs
  2016-01-15 12:41 [Qemu-devel] [PATCH " Paolo Bonzini
@ 2016-01-15 12:41 ` Paolo Bonzini
  2016-01-19 12:09   ` Cornelia Huck
  0 siblings, 1 reply; 26+ messages in thread
From: Paolo Bonzini @ 2016-01-15 12:41 UTC (permalink / raw)
  To: qemu-devel; +Cc: mst

The next patch will make virtqueue_pop/vring_pop allocate memory for a
"subclass" of VirtQueueElement.  For this to work, VirtQueueElement
must be the first field in the containing struct.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/virtio-scsi.c           |  3 +--
 include/hw/virtio/virtio-blk.h  |  2 +-
 include/hw/virtio/virtio-scsi.h | 13 ++++++-------
 3 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3a4f520..1567da8 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -44,8 +44,7 @@ VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
     VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
-                             + sizeof(VirtQueueElement);
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
     req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index ae11a63..403ab86 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -60,9 +60,9 @@ typedef struct VirtIOBlock {
 } VirtIOBlock;
 
 typedef struct VirtIOBlockReq {
+    VirtQueueElement elem;
     int64_t sector_num;
     VirtIOBlock *dev;
-    VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr out;
     QEMUIOVector qiov;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 088fe9f..63f5b51 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -102,18 +102,17 @@ typedef struct VirtIOSCSI {
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
+    /* Note:
+     * - fields up to resp_iov are initialized by virtio_scsi_init_req;
+     * - fields starting at vring are zeroed by virtio_scsi_init_req.
+     * */
+    VirtQueueElement elem;
+
     VirtIOSCSI *dev;
     VirtQueue *vq;
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Note:
-     * - fields before elem are initialized by virtio_scsi_init_req;
-     * - elem is uninitialized at the time of allocation.
-     * - fields after elem are zeroed by virtio_scsi_init_req.
-     * */
-
-    VirtQueueElement elem;
     /* Set by dataplane code. */
     VirtIOSCSIVring *vring;
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2016-02-05  7:18 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-31 10:28 [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Paolo Bonzini
2016-01-31 10:28 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
2016-02-01 11:17   ` Cornelia Huck
2016-01-31 10:28 ` [Qemu-devel] [PATCH 02/10] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
2016-02-01 11:20   ` Cornelia Huck
2016-01-31 10:28 ` [Qemu-devel] [PATCH 03/10] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 04/10] virtio: introduce virtqueue_alloc_element Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 05/10] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 06/10] vring: " Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 07/10] virtio: combine the read of a descriptor Paolo Bonzini
2016-02-03 12:34   ` Gonglei (Arei)
2016-02-03 13:40     ` Paolo Bonzini
2016-02-04  7:48       ` Gonglei (Arei)
2016-02-04 10:18         ` Paolo Bonzini
2016-02-05  6:16           ` Gonglei (Arei)
2016-01-31 10:29 ` [Qemu-devel] [PATCH 08/10] virtio: cache used_idx in a VirtQueue field Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 09/10] virtio: read avail_idx from VQ only when necessary Paolo Bonzini
2016-01-31 10:29 ` [Qemu-devel] [PATCH 10/10] virtio: combine write of an entry into used ring Paolo Bonzini
2016-02-03 12:08 ` [Qemu-devel] [PATCH v2 00/10] virtio/vring: optimization patches Gonglei (Arei)
2016-02-04 10:19   ` Paolo Bonzini
2016-02-05  7:17     ` Gonglei (Arei)
2016-02-03 12:38 ` Gonglei (Arei)
  -- strict thread matches above, loose matches on Subject: below --
2016-01-15 12:41 [Qemu-devel] [PATCH " Paolo Bonzini
2016-01-15 12:41 ` [Qemu-devel] [PATCH 01/10] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
2016-01-19 12:09   ` Cornelia Huck
2016-01-19 13:22     ` Paolo Bonzini
2016-01-19 14:01       ` Cornelia Huck
