* [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
@ 2015-11-24 18:00 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
                   ` (40 more replies)
  0 siblings, 41 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

This large series is basically all that I would like to get into 2.6.
It is a combination of several pieces of work on dataplane and
multithreaded block layer.

It's also a large part of why I would like someone else to look at
miscellaneous patches for a while (in case you've missed that).  I
can foresee that following the reviews is going to be a huge time drain.

With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
and 5 VCPUs).  The bulk of the improvement actually comes from the first
8 patches, but the rest of the series prepares for what comes next in
QEMU 2.7 and later, such as a multiqueue block layer.

It's tedious to review, with some pretty large patches (3, 32, 33, 35).
That's how you attract reviewers, isn't it?  I would like to get the
first virtio and the first block layer part in very soon after 2.6
development starts.

I've split it into four parts: the first two mostly touch virtio,
while the last two are for the block layer.

Because it's large, I've CCed people only on the cover letter.

This work is available at github.com/bonzini/qemu.git, branch dataplane.

A. "LEAN" VIRTQUEUEELEMENT
--------------------------

Patches 1 to 8 modify VirtQueueElement so that the space for
scatter/gather lists is allocated dynamically rather than being
fixed to 4K.  VirtQueueElement becomes a sort of "superclass", and
the scatter/gather elements are placed in the same malloc block,
which is laid out like

	VirtQueueElement
	other fields ("subclass" fields)
	in_addr[]
	out_addr[]
	in_sg[]
	out_sg[]
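
As a rough sketch (the helper name and the pointer bookkeeping here are
mine, not necessarily what the patches end up with), the combined
allocation could look like:

    static void *alloc_element(size_t sz, unsigned out_num, unsigned in_num)
    {
        size_t addr_len = (in_num + out_num) * sizeof(hwaddr);
        size_t sg_len = (in_num + out_num) * sizeof(struct iovec);
        VirtQueueElement *elem = g_malloc(sz + addr_len + sg_len);

        elem->out_num = out_num;
        elem->in_num = in_num;
        /* assumes in_addr/out_addr/in_sg/out_sg become pointers into
         * the trailing space of the same block (alignment elided) */
        elem->in_addr = (hwaddr *)((char *)elem + sz);
        elem->out_addr = elem->in_addr + in_num;
        elem->in_sg = (struct iovec *)(elem->out_addr + out_num);
        elem->out_sg = elem->in_sg + in_num;
        return elem;    /* the caller casts this to its "subclass" */
    }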

This can provide a large speedup (from 1.3x to 2.3x) with many disks,
due to the 48K-sized VirtQueueElement.  All virtio devices have to
be changed (patch 3).  I chose to do it all in a single patch because
the changes are in any case well isolated within each device.

The main issue here is that VirtQueueElement was haphazardly shoveled
straight into the migration stream (in host endianness). :(  Patch 5
straightens this out, but at the cost of breaking backwards migration
because it now writes the VirtQueueElement in big endian, consistent
with other migration streams.
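
For illustration only (patch 5 defines the actual format; the field
list below is an assumption), saving an element would become roughly:

    static void put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
    {
        unsigned int i;

        qemu_put_be32(f, elem->index);
        qemu_put_be32(f, elem->out_num);
        qemu_put_be32(f, elem->in_num);
        for (i = 0; i < elem->in_num; i++) {
            qemu_put_be64(f, elem->in_addr[i]);
        }
        for (i = 0; i < elem->out_num; i++) {
            qemu_put_be64(f, elem->out_addr[i]);
        }
        /* in_sg/out_sg are rebuilt with virtqueue_map on the destination */
    }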

This is the least tested part of the series.  I nevertheless put it
first because it's the most complicated part to rebase, and I want
to get rid of it as fast as possible.  Reviewing the general approach
is welcome anyway.

	Status: virtio-input, virtio-gpu and migration not tested at all

B. REMOVING VRING.C
-------------------

This is patches 9 to 16.  It removes the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane.  While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series.

Patch 10 alone brings performance roughly on par between the two.
The remaining 7-8% can be recovered mostly by getting rid of tiny
address_space_* operations, keeping the rings always mapped.  Note that
the rest of this big series does bring a little performance improvement,
and already makes up for the lost performance.

This part has a dependency on patches that are not part of this series
(and do not exist yet), which make it possible to write the dirty
bitmap outside the BQL.  The dirty bitmap is not yet thread-safe because,
while it is read and written with atomic operations, it may be resized
when there is a memory hotplug operation.  There are plans to fix this
using RCU.

Nevertheless, this doesn't block part C.

	Status: ready, but depends on the missing dirty bitmap support


C. FINE-GRAINED AIO_POLL CRITICAL SECTIONS
------------------------------------------

This is patches 17 to 28.  It starts pushing aio_context_acquire down
into aio_poll.  This part is more or less independent from A and B,
and it ends with aio_poll calling aio_context_acquire/release around
every callback.

To do this, this part introduces a thread-safe variant of the common
"walking_xxx++/walking_xxx--" idiom already found in several places
in aio*.c and async.c.
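
Roughly, and using names the series only introduces later (so treat
this as a sketch of the idea, not the final interface), the dispatch
loop goes from a bare counter to something like:

    qemu_lockcnt_inc(&ctx->list_lock);     /* keeps nodes from being freed */
    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
        if (!node->deleted && node->io_read) {
            node->io_read(node->opaque);
        }
    }
    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* we were the last walker: deleted nodes can be freed now */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }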

	Status: ready, except that I haven't tested quorum enough

D. FINE-GRAINED BLOCK LAYER CRITICAL SECTIONS
---------------------------------------------

This is patches 29 to 40.  It explicitly acquires the AioContext in all
callbacks that need it (file descriptors, bottom halves, timers, AIO)
rather than in aio_poll.  This is the first step towards breaking
the AioContext lock into many smaller locks, and hence the last
prerequisite for a real multiqueue QEMU block layer.
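
The shape of the change, on a made-up bottom half (the real patches
apply this to many existing callbacks):

    static void my_completion_bh(void *opaque)
    {
        BlockDriverState *bs = opaque;
        AioContext *ctx = bdrv_get_aio_context(bs);

        aio_context_acquire(ctx);
        /* ... touch block-layer state ... */
        aio_context_release(ctx);
    }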

This has the biggest patches and, unlike patch 3, they are very hard
to split further.

At the end, starting with patch 37, a few patches do some small
optimization on aio_poll that is now possible, and the last one makes
virtio-scsi dataplane _almost_ thread-safe.

	Status: ready

If you've read so far and didn't get bored, you're more than qualified
as a reviewer. :)

Paolo

Paolo Bonzini (40):
  9pfs: allocate pdus with g_malloc/g_free
  virtio: move VirtQueueElement at the beginning of the structs
  virtio: move allocation to virtqueue_pop/vring_pop
  virtio: introduce qemu_get/put_virtqueue_element
  virtio: read/write the VirtQueueElement a field at a time
  virtio: introduce virtqueue_alloc_element
  virtio: slim down allocation of VirtQueueElements
  vring: slim down allocation of VirtQueueElements
  vring: make vring_enable_notification return void
  virtio: combine the read of a descriptor
  virtio: add AioContext-specific function for host notifiers
  virtio: export vring_notify as virtio_should_notify
  virtio-blk: fix "disabled data plane" mode
  virtio-blk: do not use vring in dataplane
  virtio-scsi: do not use vring in dataplane
  vring: remove
  iothread: release AioContext around aio_poll
  qemu-thread: introduce QemuRecMutex
  aio: convert from RFifoLock to QemuRecMutex
  aio: rename bh_lock to list_lock
  qemu-thread: introduce QemuLockCnt
  aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh
  qemu-thread: optimize QemuLockCnt with futexes on Linux
  aio: tweak walking in dispatch phase
  aio-posix: remove walking_handlers, protecting AioHandler list with list_lock
  aio-win32: remove walking_handlers, protecting AioHandler list with list_lock
  aio: document locking
  aio: push aio_context_acquire/release down to dispatching
  quorum: use atomics for rewrite_count
  quorum: split quorum_fifo_aio_cb from quorum_aio_cb
  qed: introduce qed_aio_start_io and qed_aio_next_io_cb
  block: explicitly acquire aiocontext in callbacks that need it
  block: explicitly acquire aiocontext in bottom halves that need it
  block: explicitly acquire aiocontext in timers that need it
  block: explicitly acquire aiocontext in aio callbacks that need it
  aio: update locking documentation
  async: optimize aio_bh_poll
  aio-posix: partially inline aio_dispatch into aio_poll
  async: remove unnecessary inc/dec pairs
  dma-helpers: avoid lock inversion with AioContext

 aio-posix.c                                   | 108 +++---
 aio-win32.c                                   | 111 +++---
 async.c                                       |  76 ++--
 block/blkverify.c                             |   6 +-
 block/curl.c                                  |  43 ++-
 block/gluster.c                               |   2 +
 block/io.c                                    |   7 +
 block/iscsi.c                                 |  10 +
 block/linux-aio.c                             |  14 +-
 block/mirror.c                                |  12 +-
 block/nbd-client.c                            |  14 +-
 block/nfs.c                                   |  10 +
 block/qed-cluster.c                           |   2 +
 block/qed-table.c                             |  12 +-
 block/qed.c                                   | 112 ++++--
 block/qed.h                                   |   3 +
 block/quorum.c                                |  60 +--
 block/sheepdog.c                              |  29 +-
 block/ssh.c                                   |  47 ++-
 block/throttle-groups.c                       |   2 +
 block/win32-aio.c                             |   8 +-
 dma-helpers.c                                 |  27 +-
 docs/lockcnt.txt                              | 342 +++++++++++++++++
 docs/multiple-iothreads.txt                   |  95 ++++-
 hw/9pfs/virtio-9p-device.c                    |   7 +-
 hw/9pfs/virtio-9p.c                           |  25 +-
 hw/9pfs/virtio-9p.h                           |   4 +-
 hw/block/dataplane/virtio-blk.c               | 131 +------
 hw/block/dataplane/virtio-blk.h               |   1 +
 hw/block/virtio-blk.c                         |  92 ++---
 hw/char/virtio-serial-bus.c                   |  78 ++--
 hw/display/virtio-gpu.c                       |  25 +-
 hw/input/virtio-input.c                       |  24 +-
 hw/net/virtio-net.c                           |  69 ++--
 hw/scsi/scsi-bus.c                            |   2 +
 hw/scsi/scsi-disk.c                           |  18 +
 hw/scsi/scsi-generic.c                        |  20 +-
 hw/scsi/virtio-scsi-dataplane.c               | 197 ++--------
 hw/scsi/virtio-scsi.c                         |  82 ++--
 hw/virtio/Makefile.objs                       |   1 -
 hw/virtio/dataplane/Makefile.objs             |   1 -
 hw/virtio/dataplane/vring.c                   | 526 --------------------------
 hw/virtio/virtio-balloon.c                    |  22 +-
 hw/virtio/virtio-rng.c                        |  10 +-
 hw/virtio/virtio.c                            | 323 +++++++++++-----
 include/block/aio.h                           |  38 +-
 include/hw/virtio/dataplane/vring-accessors.h |  75 ----
 include/hw/virtio/dataplane/vring.h           |  51 ---
 include/hw/virtio/virtio-balloon.h            |   2 +-
 include/hw/virtio/virtio-blk.h                |   9 +-
 include/hw/virtio/virtio-net.h                |   2 +-
 include/hw/virtio/virtio-scsi.h               |  36 +-
 include/hw/virtio/virtio-serial.h             |   2 +-
 include/hw/virtio/virtio.h                    |  16 +-
 include/qemu/futex.h                          |  36 ++
 include/qemu/rfifolock.h                      |  54 ---
 include/qemu/thread-posix.h                   |   6 +
 include/qemu/thread-win32.h                   |  10 +
 include/qemu/thread.h                         |  23 ++
 iothread.c                                    |  11 +-
 nbd.c                                         |   4 +
 tests/.gitignore                              |   1 -
 tests/Makefile                                |   2 -
 tests/test-aio.c                              |  19 +-
 tests/test-rfifolock.c                        |  91 -----
 thread-pool.c                                 |  14 +-
 trace-events                                  |  13 +-
 util/Makefile.objs                            |   2 +-
 util/lockcnt.c                                | 404 ++++++++++++++++++++
 util/qemu-coroutine-sleep.c                   |   5 +
 util/qemu-thread-posix.c                      |  38 +-
 util/qemu-thread-win32.c                      |  25 ++
 util/rfifolock.c                              |  78 ----
 73 files changed, 2000 insertions(+), 1877 deletions(-)
 create mode 100644 docs/lockcnt.txt
 delete mode 100644 hw/virtio/dataplane/Makefile.objs
 delete mode 100644 hw/virtio/dataplane/vring.c
 delete mode 100644 include/hw/virtio/dataplane/vring-accessors.h
 delete mode 100644 include/hw/virtio/dataplane/vring.h
 create mode 100644 include/qemu/futex.h
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 create mode 100644 util/lockcnt.c
 delete mode 100644 util/rfifolock.c

-- 
1.8.3.1

* [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  2:27   ` Fam Zheng
  2015-11-30 16:35   ` Greg Kurz
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
                   ` (39 subsequent siblings)
  40 siblings, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

Prepare for moving the allocation to virtqueue_pop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p-device.c |  7 +------
 hw/9pfs/virtio-9p.c        | 10 +++-------
 hw/9pfs/virtio-9p.h        |  2 --
 3 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index e3abcfa..72a93c2 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     V9fsState *s = VIRTIO_9P(dev);
-    int i, len;
+    int len;
     struct stat stat;
     FsDriverEntry *fse;
     V9fsPath path;
@@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
                 sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
 
-    /* initialize pdu allocator */
-    QLIST_INIT(&s->free_list);
     QLIST_INIT(&s->active_list);
-    for (i = 0; i < (MAX_REQ - 1); i++) {
-        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
-    }
 
     s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index f972731..747306b 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
 
 static V9fsPDU *alloc_pdu(V9fsState *s)
 {
-    V9fsPDU *pdu = NULL;
+    V9fsPDU *pdu = g_new(V9fsPDU, 1);
 
-    if (!QLIST_EMPTY(&s->free_list)) {
-        pdu = QLIST_FIRST(&s->free_list);
-        QLIST_REMOVE(pdu, next);
-        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
-    }
+    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
     return pdu;
 }
 
@@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
          */
         if (!pdu->cancelled) {
             QLIST_REMOVE(pdu, next);
-            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
         }
+        g_free(pdu);
     }
 }
 
diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index d7a4dc1..1fb4ff9 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -201,8 +201,6 @@ typedef struct V9fsState
 {
     VirtIODevice parent_obj;
     VirtQueue *vq;
-    V9fsPDU pdus[MAX_REQ];
-    QLIST_HEAD(, V9fsPDU) free_list;
     QLIST_HEAD(, V9fsPDU) active_list;
     V9fsFidState *fid_list;
     FileOperations *ops;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

The next patch will make virtqueue_pop/vring_pop allocate memory for a
"subclass" of VirtQueueElement.  For this to work, VirtQueueElement
must be the first field in the containing struct.
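
For example (MyReq is a made-up subclass, and the virtqueue_pop
signature shown is the one introduced by the next patch):

    typedef struct MyReq {
        VirtQueueElement elem;    /* must be first */
        int my_state;
    } MyReq;

    MyReq *req = virtqueue_pop(vq, sizeof(MyReq));
    /* valid because &req->elem == (VirtQueueElement *)req */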

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p.h             |  2 +-
 hw/scsi/virtio-scsi.c           |  3 +--
 include/hw/virtio/virtio-blk.h  |  2 +-
 include/hw/virtio/virtio-scsi.h | 13 ++++++-------
 4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index 1fb4ff9..2e6d535 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -127,12 +127,12 @@ struct V9fsState;
 
 struct V9fsPDU
 {
+    VirtQueueElement elem;
     uint32_t size;
     uint16_t tag;
     uint8_t id;
     uint8_t cancelled;
     CoQueue complete;
-    VirtQueueElement elem;
     struct V9fsState *s;
     QLIST_ENTRY(V9fsPDU) next;
 };
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 7655401..179c341 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -44,8 +44,7 @@ VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
     VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
-                             + sizeof(VirtQueueElement);
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
     req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 6bf5905..ef5f90f 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -61,9 +61,9 @@ typedef struct VirtIOBlock {
 } VirtIOBlock;
 
 typedef struct VirtIOBlockReq {
+    VirtQueueElement elem;
     int64_t sector_num;
     VirtIOBlock *dev;
-    VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr out;
     QEMUIOVector qiov;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 088fe9f..63f5b51 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -102,18 +102,17 @@ typedef struct VirtIOSCSI {
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
+    /* Note:
+     * - fields up to resp_iov are initialized by virtio_scsi_init_req;
+     * - fields starting at vring are zeroed by virtio_scsi_init_req.
+     * */
+    VirtQueueElement elem;
+
     VirtIOSCSI *dev;
     VirtQueue *vq;
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Note:
-     * - fields before elem are initialized by virtio_scsi_init_req;
-     * - elem is uninitialized at the time of allocation.
-     * - fields after elem are zeroed by virtio_scsi_init_req.
-     * */
-
-    VirtQueueElement elem;
     /* Set by dataplane code. */
     VirtIOSCSIVring *vring;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  3:00   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
                   ` (37 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

The return code of virtqueue_pop/vring_pop is unused except to check for
errors or 0.  We can thus easily move the allocation inside the functions
and just return a pointer to the VirtQueueElement.

The advantage is that we will be able to allocate only the space that
is needed for the actual size of the s/g list instead of the full
VIRTQUEUE_MAX_SIZE items.  Currently VirtQueueElement takes about 48K
of memory, and this kind of allocation puts a lot of stress on malloc.
By cutting the size by two or three orders of magnitude, malloc can
use much more efficient algorithms.
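
For reference, the arithmetic behind the 48K figure, assuming a 64-bit
host (8-byte hwaddr, 16-byte struct iovec) and VIRTQUEUE_MAX_SIZE of
1024:

    in_addr[] + out_addr[]: 2 * 1024 * 8  bytes = 16K
    in_sg[]   + out_sg[]  : 2 * 1024 * 16 bytes = 32K

A typical request only uses a handful of descriptors, so sizing the
arrays to the actual s/g count shrinks each allocation to a few
hundred bytes.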

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p.c                 | 19 ++++-----
 hw/block/dataplane/virtio-blk.c     | 11 +++--
 hw/block/virtio-blk.c               | 15 +++----
 hw/char/virtio-serial-bus.c         | 80 +++++++++++++++++++++++--------------
 hw/display/virtio-gpu.c             | 25 +++++++-----
 hw/input/virtio-input.c             | 24 +++++++----
 hw/net/virtio-net.c                 | 69 ++++++++++++++++++++------------
 hw/scsi/virtio-scsi-dataplane.c     | 15 +++----
 hw/scsi/virtio-scsi.c               | 18 ++++-----
 hw/virtio/dataplane/vring.c         | 17 ++++----
 hw/virtio/virtio-balloon.c          | 22 ++++++----
 hw/virtio/virtio-rng.c              | 10 +++--
 hw/virtio/virtio.c                  | 11 +++--
 include/hw/virtio/dataplane/vring.h |  2 +-
 include/hw/virtio/virtio-balloon.h  |  2 +-
 include/hw/virtio/virtio-blk.h      |  3 +-
 include/hw/virtio/virtio-net.h      |  2 +-
 include/hw/virtio/virtio-scsi.h     |  2 +-
 include/hw/virtio/virtio-serial.h   |  2 +-
 include/hw/virtio/virtio.h          |  2 +-
 20 files changed, 205 insertions(+), 146 deletions(-)

diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 747306b..560daad 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -563,14 +563,6 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
     return 0;
 }
 
-static V9fsPDU *alloc_pdu(V9fsState *s)
-{
-    V9fsPDU *pdu = g_new(V9fsPDU, 1);
-
-    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
-    return pdu;
-}
-
 static void free_pdu(V9fsState *s, V9fsPDU *pdu)
 {
     if (pdu) {
@@ -3254,10 +3246,8 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     V9fsState *s = (V9fsState *)vdev;
     V9fsPDU *pdu;
-    ssize_t len;
 
-    while ((pdu = alloc_pdu(s)) &&
-            (len = virtqueue_pop(vq, &pdu->elem)) != 0) {
+    for (;;) {
         struct {
             uint32_t size_le;
             uint8_t id;
@@ -3265,6 +3255,12 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
         } QEMU_PACKED out;
         int len;
 
+        pdu = virtqueue_pop(vq, sizeof(V9fsPDU));
+        if (!pdu) {
+            break;
+        }
+
+        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
         pdu->s = s;
         BUG_ON(pdu->elem.out_num == 0 || pdu->elem.in_num == 0);
         QEMU_BUILD_BUG_ON(sizeof out != 7);
@@ -3281,7 +3277,6 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
         qemu_co_queue_init(&pdu->complete);
         submit_pdu(s, pdu);
     }
-    free_pdu(s, pdu);
 }
 
 static void __attribute__((__constructor__)) virtio_9p_set_fd_limit(void)
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index c42ddeb..0cadd3a 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -98,20 +98,19 @@ static void handle_notify(EventNotifier *e)
     blk_io_plug(s->conf->conf.blk);
     for (;;) {
         MultiReqBuffer mrb = {};
-        int ret;
 
         /* Disable guest->host notifies to avoid unnecessary vmexits */
         vring_disable_notification(s->vdev, &s->vring);
 
         for (;;) {
-            VirtIOBlockReq *req = virtio_blk_alloc_request(vblk);
+            VirtIOBlockReq *req = vring_pop(s->vdev, &s->vring,
+                                            sizeof(VirtIOBlockReq));
 
-            ret = vring_pop(s->vdev, &s->vring, &req->elem);
-            if (ret < 0) {
-                virtio_blk_free_request(req);
+            if (req == NULL) {
                 break; /* no more requests */
             }
 
+            virtio_blk_init_request(vblk, req);
             trace_virtio_blk_data_plane_process_request(s, req->elem.out_num,
                                                         req->elem.in_num,
                                                         req->elem.index);
@@ -123,7 +122,7 @@ static void handle_notify(EventNotifier *e)
             virtio_blk_submit_multireq(s->conf->conf.blk, &mrb);
         }
 
-        if (likely(ret == -EAGAIN)) { /* vring emptied */
+        if (likely(!vring_more_avail(s->vdev, &s->vring))) { /* vring emptied */
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 756ae5c..d697de4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -28,15 +28,13 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req)
 {
-    VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
     req->dev = s;
     req->qiov.size = 0;
     req->in_len = 0;
     req->next = NULL;
     req->mr_next = NULL;
-    return req;
 }
 
 void virtio_blk_free_request(VirtIOBlockReq *req)
@@ -192,13 +190,11 @@ out:
 
 static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+    VirtIOBlockReq *req = virtqueue_pop(s->vq, sizeof(VirtIOBlockReq));
 
-    if (!virtqueue_pop(s->vq, &req->elem)) {
-        virtio_blk_free_request(req);
-        return NULL;
+    if (req) {
+        virtio_blk_init_request(s, req);
     }
-
     return req;
 }
 
@@ -843,7 +839,8 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        virtio_blk_init_request(s, req);
         qemu_get_buffer(f, (unsigned char *)&req->elem,
                         sizeof(VirtQueueElement));
         req->next = s->rq;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 497b0af..1af2314 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -82,7 +82,7 @@ static bool use_multiport(VirtIOSerial *vser)
 static size_t write_to_port(VirtIOSerialPort *port,
                             const uint8_t *buf, size_t size)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
     size_t offset;
 
@@ -95,15 +95,17 @@ static size_t write_to_port(VirtIOSerialPort *port,
     while (offset < size) {
         size_t len;
 
-        if (!virtqueue_pop(vq, &elem)) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
 
-        len = iov_from_buf(elem.in_sg, elem.in_num, 0,
+        len = iov_from_buf(elem->in_sg, elem->in_num, 0,
                            buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vq, &elem, len);
+        virtqueue_push(vq, elem, len);
+        g_free(elem);
     }
 
     virtio_notify(VIRTIO_DEVICE(port->vser), vq);
@@ -112,13 +114,18 @@ static size_t write_to_port(VirtIOSerialPort *port,
 
 static void discard_vq_data(VirtQueue *vq, VirtIODevice *vdev)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        virtqueue_push(vq, &elem, 0);
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        virtqueue_push(vq, elem, 0);
+        g_free(elem);
     }
     virtio_notify(vdev, vq);
 }
@@ -137,21 +144,22 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         unsigned int i;
 
         /* Pop an elem only if we haven't left off a previous one mid-way */
-        if (!port->elem.out_num) {
-            if (!virtqueue_pop(vq, &port->elem)) {
+        if (!port->elem) {
+            port->elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+            if (!port->elem) {
                 break;
             }
             port->iov_idx = 0;
             port->iov_offset = 0;
         }
 
-        for (i = port->iov_idx; i < port->elem.out_num; i++) {
+        for (i = port->iov_idx; i < port->elem->out_num; i++) {
             size_t buf_size;
             ssize_t ret;
 
-            buf_size = port->elem.out_sg[i].iov_len - port->iov_offset;
+            buf_size = port->elem->out_sg[i].iov_len - port->iov_offset;
             ret = vsc->have_data(port,
-                                  port->elem.out_sg[i].iov_base
+                                  port->elem->out_sg[i].iov_base
                                   + port->iov_offset,
                                   buf_size);
             if (port->throttled) {
@@ -166,8 +174,9 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         if (port->throttled) {
             break;
         }
-        virtqueue_push(vq, &port->elem, 0);
-        port->elem.out_num = 0;
+        virtqueue_push(vq, port->elem, 0);
+        g_free(port->elem);
+        port->elem = NULL;
     }
     virtio_notify(vdev, vq);
 }
@@ -184,22 +193,26 @@ static void flush_queued_data(VirtIOSerialPort *port)
 
 static size_t send_control_msg(VirtIOSerial *vser, void *buf, size_t len)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
 
     vq = vser->c_ivq;
     if (!virtio_queue_ready(vq)) {
         return 0;
     }
-    if (!virtqueue_pop(vq, &elem)) {
+
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         return 0;
     }
 
     /* TODO: detect a buffer that's too short, set NEEDS_RESET */
-    iov_from_buf(elem.in_sg, elem.in_num, 0, buf, len);
+    iov_from_buf(elem->in_sg, elem->in_num, 0, buf, len);
 
-    virtqueue_push(vq, &elem, len);
+    virtqueue_push(vq, elem, len);
     virtio_notify(VIRTIO_DEVICE(vser), vq);
+    g_free(elem);
+
     return len;
 }
 
@@ -413,7 +426,7 @@ static void control_in(VirtIODevice *vdev, VirtQueue *vq)
 
 static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtIOSerial *vser;
     uint8_t *buf;
     size_t len;
@@ -422,10 +435,15 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 
     len = 0;
     buf = NULL;
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t cur_len;
 
-        cur_len = iov_size(elem.out_sg, elem.out_num);
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        cur_len = iov_size(elem->out_sg, elem->out_num);
         /*
          * Allocate a new buf only if we didn't have one previously or
          * if the size of the buf differs
@@ -436,10 +454,11 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
             buf = g_malloc(cur_len);
             len = cur_len;
         }
-        iov_to_buf(elem.out_sg, elem.out_num, 0, buf, cur_len);
+        iov_to_buf(elem->out_sg, elem->out_num, 0, buf, cur_len);
 
         handle_control_message(vser, buf, cur_len);
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
+	g_free(elem);
     }
     g_free(buf);
     virtio_notify(vdev, vq);
@@ -619,7 +638,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         qemu_put_byte(f, port->host_connected);
 
 	elem_popped = 0;
-        if (port->elem.out_num) {
+        if (port->elem) {
             elem_popped = 1;
         }
         qemu_put_be32s(f, &elem_popped);
@@ -627,8 +646,8 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
 
-            qemu_put_buffer(f, (unsigned char *)&port->elem,
-                            sizeof(port->elem));
+            qemu_put_buffer(f, (unsigned char *)port->elem,
+                            sizeof(VirtQueueElement));
         }
     }
 }
@@ -703,9 +722,10 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                qemu_get_buffer(f, (unsigned char *)&port->elem,
-                                sizeof(port->elem));
-                virtqueue_map(&port->elem);
+                port->elem = g_new(VirtQueueElement, 1);
+                qemu_get_buffer(f, (unsigned char *)port->elem,
+                                sizeof(VirtQueueElement));
+                virtqueue_map(port->elem);
 
                 /*
                  *  Port was throttled on source machine.  Let's
@@ -927,7 +947,7 @@ static void virtser_port_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    port->elem.out_num = 0;
+    port->elem = NULL;
 }
 
 static void virtser_port_device_plug(HotplugHandler *hotplug_dev,
diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
index a836ce3..93d696b 100644
--- a/hw/display/virtio-gpu.c
+++ b/hw/display/virtio-gpu.c
@@ -770,8 +770,8 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     }
 #endif
 
-    cmd = g_new(struct virtio_gpu_ctrl_command, 1);
-    while (virtqueue_pop(vq, &cmd->elem)) {
+    cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
+    while (cmd) {
         cmd->vq = vq;
         cmd->error = 0;
         cmd->finished = false;
@@ -781,7 +781,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 
         VIRGL(g, virtio_gpu_virgl_process_cmd, virtio_gpu_simple_process_cmd,
               g, cmd);
-        if (!cmd->finished) {
+        if (cmd->finished) {
+            g_free(cmd);
+        } else {
             QTAILQ_INSERT_TAIL(&g->fenceq, cmd, next);
             g->inflight++;
             if (virtio_gpu_stats_enabled(g->conf)) {
@@ -790,10 +792,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
                 }
                 fprintf(stderr, "inflight: %3d (+)\r", g->inflight);
             }
-            cmd = g_new(struct virtio_gpu_ctrl_command, 1);
         }
+        cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
     }
-    g_free(cmd);
 
 #ifdef CONFIG_VIRGL
     if (g->use_virgl_renderer) {
@@ -811,15 +812,20 @@ static void virtio_gpu_ctrl_bh(void *opaque)
 static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOGPU *g = VIRTIO_GPU(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct virtio_gpu_update_cursor cursor_info;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        s = iov_to_buf(elem.out_sg, elem.out_num, 0,
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        s = iov_to_buf(elem->out_sg, elem->out_num, 0,
                        &cursor_info, sizeof(cursor_info));
         if (s != sizeof(cursor_info)) {
             qemu_log_mask(LOG_GUEST_ERROR,
@@ -828,8 +834,9 @@ static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
         } else {
             update_cursor(g, &cursor_info);
         }
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 1f5a40d..bbae884 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -16,7 +16,7 @@
 
 void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     unsigned have, need;
     int i, len;
 
@@ -49,14 +49,16 @@ void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 
     /* ... and finally pass them to the guest */
     for (i = 0; i < vinput->qindex; i++) {
-        if (!virtqueue_pop(vinput->evt, &elem)) {
+        elem = virtqueue_pop(vinput->evt, sizeof(VirtQueueElement));
+        if (!elem) {
             /* should not happen, we've checked for space beforehand */
             fprintf(stderr, "%s: Huh?  No vq elem available ...\n", __func__);
             return;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, vinput->queue+i, sizeof(virtio_input_event));
-        virtqueue_push(vinput->evt, &elem, len);
+        virtqueue_push(vinput->evt, elem, len);
+        g_free(elem);
     }
     virtio_notify(VIRTIO_DEVICE(vinput), vinput->evt);
     vinput->qindex = 0;
@@ -72,17 +74,23 @@ static void virtio_input_handle_sts(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOInputClass *vic = VIRTIO_INPUT_GET_CLASS(vdev);
     VirtIOInput *vinput = VIRTIO_INPUT(vdev);
     virtio_input_event event;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int len;
 
-    while (virtqueue_pop(vinput->sts, &elem)) {
+    for (;;) {
+        elem = virtqueue_pop(vinput->sts, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
         memset(&event, 0, sizeof(event));
-        len = iov_to_buf(elem.out_sg, elem.out_num,
+        len = iov_to_buf(elem->out_sg, elem->out_num,
                          0, &event, sizeof(event));
         if (vic->handle_status) {
             vic->handle_status(vinput, &event);
         }
-        virtqueue_push(vinput->sts, &elem, len);
+        virtqueue_push(vinput->sts, elem, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vinput->sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index a877614..c03b568 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -818,20 +818,24 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = VIRTIO_NET(vdev);
     struct virtio_net_ctrl_hdr ctrl;
     virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct iovec *iov, *iov2;
     unsigned int iov_cnt;
 
-    while (virtqueue_pop(vq, &elem)) {
-        if (iov_size(elem.in_sg, elem.in_num) < sizeof(status) ||
-            iov_size(elem.out_sg, elem.out_num) < sizeof(ctrl)) {
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        if (iov_size(elem->in_sg, elem->in_num) < sizeof(status) ||
+            iov_size(elem->out_sg, elem->out_num) < sizeof(ctrl)) {
             error_report("virtio-net ctrl missing headers");
             exit(1);
         }
 
-        iov_cnt = elem.out_num;
-        iov2 = iov = g_memdup(elem.out_sg, sizeof(struct iovec) * elem.out_num);
+        iov_cnt = elem->out_num;
+        iov2 = iov = g_memdup(elem->out_sg, sizeof(struct iovec) * elem->out_num);
         s = iov_to_buf(iov, iov_cnt, 0, &ctrl, sizeof(ctrl));
         iov_discard_front(&iov, &iov_cnt, sizeof(ctrl));
         if (s != sizeof(ctrl)) {
@@ -850,12 +854,13 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
             status = virtio_net_handle_offloads(n, ctrl.cmd, iov, iov_cnt);
         }
 
-        s = iov_from_buf(elem.in_sg, elem.in_num, 0, &status, sizeof(status));
+        s = iov_from_buf(elem->in_sg, elem->in_num, 0, &status, sizeof(status));
         assert(s == sizeof(status));
 
-        virtqueue_push(vq, &elem, sizeof(status));
+        virtqueue_push(vq, elem, sizeof(status));
         virtio_notify(vdev, vq);
         g_free(iov2);
+        g_free(elem);
     }
 }
 
@@ -1044,13 +1049,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
     offset = i = 0;
 
     while (offset < size) {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
         int len, total;
-        const struct iovec *sg = elem.in_sg;
+        const struct iovec *sg;
 
         total = 0;
 
-        if (virtqueue_pop(q->rx_vq, &elem) == 0) {
+        elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -1063,21 +1069,22 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
             exit(1);
         }
 
-        if (elem.in_num < 1) {
+        if (elem->in_num < 1) {
             error_report("virtio-net receive queue contains no in buffers");
             exit(1);
         }
 
+        sg = elem->in_sg;
         if (i == 0) {
             assert(offset == 0);
             if (n->mergeable_rx_bufs) {
                 mhdr_cnt = iov_copy(mhdr_sg, ARRAY_SIZE(mhdr_sg),
-                                    sg, elem.in_num,
+                                    sg, elem->in_num,
                                     offsetof(typeof(mhdr), num_buffers),
                                     sizeof(mhdr.num_buffers));
             }
 
-            receive_header(n, sg, elem.in_num, buf, size);
+            receive_header(n, sg, elem->in_num, buf, size);
             offset = n->host_hdr_len;
             total += n->guest_hdr_len;
             guest_offset = n->guest_hdr_len;
@@ -1086,7 +1093,7 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
         }
 
         /* copy in packet.  ugh */
-        len = iov_from_buf(sg, elem.in_num, guest_offset,
+        len = iov_from_buf(sg, elem->in_num, guest_offset,
                            buf + offset, size - offset);
         total += len;
         offset += len;
@@ -1094,12 +1101,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
          * must have consumed the complete packet.
          * Otherwise, drop it. */
         if (!n->mergeable_rx_bufs && offset < size) {
-            virtqueue_discard(q->rx_vq, &elem, total);
+            virtqueue_discard(q->rx_vq, elem, total);
+            g_free(elem);
             return size;
         }
 
         /* signal other side */
-        virtqueue_fill(q->rx_vq, &elem, total, i++);
+        virtqueue_fill(q->rx_vq, elem, total, i++);
+        g_free(elem);
     }
 
     if (mhdr_cnt) {
@@ -1123,10 +1132,11 @@ static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
 
-    virtqueue_push(q->tx_vq, &q->async_tx.elem, 0);
+    virtqueue_push(q->tx_vq, q->async_tx.elem, 0);
     virtio_notify(vdev, q->tx_vq);
 
-    q->async_tx.elem.out_num = 0;
+    g_free(q->async_tx.elem);
+    q->async_tx.elem = NULL;
 
     virtio_queue_set_notification(q->tx_vq, 1);
     virtio_net_flush_tx(q);
@@ -1137,25 +1147,31 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
 {
     VirtIONet *n = q->n;
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int32_t num_packets = 0;
     int queue_index = vq2q(virtio_get_queue_index(q->tx_vq));
     if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
-    if (q->async_tx.elem.out_num) {
+    if (q->async_tx.elem) {
         virtio_queue_set_notification(q->tx_vq, 0);
         return num_packets;
     }
 
-    while (virtqueue_pop(q->tx_vq, &elem)) {
+    for (;;) {
         ssize_t ret;
-        unsigned int out_num = elem.out_num;
-        struct iovec *out_sg = &elem.out_sg[0];
-        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1];
+        unsigned int out_num;
+        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
         struct virtio_net_hdr_mrg_rxbuf mhdr;
 
+        elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        out_num = elem->out_num;
+        out_sg = elem->out_sg;
         if (out_num < 1) {
             error_report("virtio-net header not in first element");
             exit(1);
@@ -1207,8 +1223,9 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         }
 
 drop:
-        virtqueue_push(q->tx_vq, &elem, 0);
+        virtqueue_push(q->tx_vq, elem, 0);
         virtio_notify(vdev, q->tx_vq);
+        g_free(elem);
 
         if (++num_packets >= n->tx_burst) {
             break;
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 0d8d71e..45ae2d2 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -80,15 +80,16 @@ fail_vring:
 VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
                                          VirtIOSCSIVring *vring)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, NULL);
-    int r;
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
 
-    req->vring = vring;
-    r = vring_pop((VirtIODevice *)s, &vring->vring, &req->elem);
-    if (r < 0) {
-        virtio_scsi_free_req(req);
-        req = NULL;
+    req = vring_pop((VirtIODevice *)s, &vring->vring,
+                    sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
+        return NULL;
     }
+    virtio_scsi_init_req(s, NULL, req);
+    req->vring = vring;
     return req;
 }
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 179c341..f5985dd 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -40,19 +40,15 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
     return scsi_device_find(&s->bus, 0, lun[1], virtio_scsi_get_lun(lun));
 }
 
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
 {
-    VirtIOSCSIReq *req;
-    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
     const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
-    req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
     req->dev = s;
     qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
     qemu_iovec_init(&req->resp_iov, 1);
     memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
-    return req;
 }
 
 void virtio_scsi_free_req(VirtIOSCSIReq *req)
@@ -173,11 +169,14 @@ static int virtio_scsi_parse_req(VirtIOSCSIReq *req,
 
 static VirtIOSCSIReq *virtio_scsi_pop_req(VirtIOSCSI *s, VirtQueue *vq)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, vq);
-    if (!virtqueue_pop(vq, &req->elem)) {
-        virtio_scsi_free_req(req);
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
+
+    req = virtqueue_pop(vq, sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
         return NULL;
     }
+    virtio_scsi_init_req(s, vq, req);
     return req;
 }
 
@@ -202,8 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = virtio_scsi_init_req(s, vs->cmd_vqs[n]);
+    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
     qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
     virtqueue_map(&req->elem);
 
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 23f667e..1b900fc 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -388,23 +388,25 @@ static void vring_unmap_element(VirtQueueElement *elem)
  *
  * Stolen from linux/drivers/vhost/vhost.c.
  */
-int vring_pop(VirtIODevice *vdev, Vring *vring,
-              VirtQueueElement *elem)
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
 {
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueElement *elem = NULL;
     int ret;
 
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
-
     /* If there was a fatal error then refuse operation */
     if (vring->broken) {
         ret = -EFAULT;
         goto out;
     }
 
+    elem = g_malloc(sz);
+
+    /* Initialize elem so it can be safely unmapped */
+    elem->in_num = elem->out_num = 0;
+
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
     avail_idx = vring_get_avail_idx(vdev, vring);
@@ -480,7 +482,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
-    return head;
+    return elem;
 
 out:
     assert(ret < 0);
@@ -488,7 +490,8 @@ out:
         vring->broken = true;
     }
     vring_unmap_element(elem);
-    return ret;
+    g_free(elem);
+    return NULL;
 }
 
 /* After we've used one of their buffers, we tell them about it.
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 9671635..2704680 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -106,8 +106,10 @@ static void balloon_stats_poll_cb(void *opaque)
         return;
     }
 
-    virtqueue_push(s->svq, &s->stats_vq_elem, s->stats_vq_offset);
+    virtqueue_push(s->svq, s->stats_vq_elem, s->stats_vq_offset);
     virtio_notify(vdev, s->svq);
+    g_free(s->stats_vq_elem);
+    s->stats_vq_elem = NULL;
 }
 
 static void balloon_stats_get_all(Object *obj, struct Visitor *v,
@@ -205,14 +207,18 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     MemoryRegionSection section;
 
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t offset = 0;
         uint32_t pfn;
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
 
-        while (iov_to_buf(elem.out_sg, elem.out_num, offset, &pfn, 4) == 4) {
+        while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
             ram_addr_t pa;
             ram_addr_t addr;
             int p = virtio_ldl_p(vdev, &pfn);
@@ -235,20 +241,22 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             memory_region_unref(section.mr);
         }
 
-        virtqueue_push(vq, &elem, offset);
+        virtqueue_push(vq, elem, offset);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement *elem = &s->stats_vq_elem;
+    VirtQueueElement *elem;
     VirtIOBalloonStat stat;
     size_t offset = 0;
     qemu_timeval tv;
 
-    if (!virtqueue_pop(vq, elem)) {
+    s->stats_vq_elem = elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         goto out;
     }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index 97d1541..5f92047 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -43,7 +43,7 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 {
     VirtIORNG *vrng = opaque;
     VirtIODevice *vdev = VIRTIO_DEVICE(vrng);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t len;
     int offset;
 
@@ -55,15 +55,17 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 
     offset = 0;
     while (offset < size) {
-        if (!virtqueue_pop(vrng->vq, &elem)) {
+        elem = virtqueue_pop(vrng->vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vrng->vq, &elem, len);
+        virtqueue_push(vrng->vq, elem, len);
         trace_virtio_rng_pushed(vrng, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vrng->vq);
 }
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 1edef59..dc76a99 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -501,16 +501,19 @@ void virtqueue_map(VirtQueueElement *elem)
                         0);
 }
 
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
+void *virtqueue_pop(VirtQueue *vq, size_t sz)
 {
     unsigned int i, head, max;
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
+    VirtQueueElement *elem;
 
-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
-        return 0;
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
+        return NULL;
+    }
 
     /* When we start there are none of either input nor output. */
+    elem = g_malloc(sz);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -569,7 +572,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     vq->inuse++;
 
     trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
-    return elem->in_num + elem->out_num;
+    return elem;
 }
 
 /* virtio device */
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index a596e4c..e80985e 100644
--- a/include/hw/virtio/dataplane/vring.h
+++ b/include/hw/virtio/dataplane/vring.h
@@ -44,7 +44,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_enable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
-int vring_pop(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem);
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
 void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
                 int len);
 
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 09c2ce4..35f62ac 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -37,7 +37,7 @@ typedef struct VirtIOBalloon {
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
-    VirtQueueElement stats_vq_elem;
+    VirtQueueElement *stats_vq_elem;
     size_t stats_vq_offset;
     QEMUTimer *stats_timer;
     int64_t stats_last_update;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index ef5f90f..9a2048d 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -81,8 +81,7 @@ typedef struct MultiReqBuffer {
     bool is_write;
 } MultiReqBuffer;
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
-
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req);
 void virtio_blk_free_request(VirtIOBlockReq *req);
 
 void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb);
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index f3cc25f..2ce3b03 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -47,7 +47,7 @@ typedef struct VirtIONetQueue {
     QEMUBH *tx_bh;
     int tx_waiting;
     struct {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
     } async_tx;
     struct VirtIONet *n;
 } VirtIONetQueue;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 63f5b51..d9ae76c 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -150,7 +150,7 @@ void virtio_scsi_common_unrealize(DeviceState *dev, Error **errp);
 void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, VirtIOSCSIReq *req);
 bool virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq *req);
 void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, VirtIOSCSIReq *req);
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq);
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req);
 void virtio_scsi_free_req(VirtIOSCSIReq *req);
 void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
                             uint32_t event, uint32_t reason);
diff --git a/include/hw/virtio/virtio-serial.h b/include/hw/virtio/virtio-serial.h
index 527d0bf..12a55a1 100644
--- a/include/hw/virtio/virtio-serial.h
+++ b/include/hw/virtio/virtio-serial.h
@@ -122,7 +122,7 @@ struct VirtIOSerialPort {
      * element popped and continue consuming it once the backend
      * becomes writable again.
      */
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     /*
      * The index and the offset into the iov buffer that was popped in
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 205fadf..21fda17 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -152,7 +152,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx);
 
 void virtqueue_map(VirtQueueElement *elem);
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem);
+void *virtqueue_pop(VirtQueue *vq, size_t sz);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (2 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Move allocation into the virtio functions when loading/saving a
VirtQueueElement, too.  This also lets the load/save functions keep
backwards compatibility when the VirtQueueElement layout is changed.
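
For illustration, a device's save/load handlers end up following this
pattern (a condensed sketch of the virtio-blk hunks below, not a new
API):

    /* save: serialize the element that was popped from the virtqueue */
    qemu_put_virtqueue_element(f, &req->elem);

    /* load: allocate element and device request as one block; the
     * getter also remaps the guest buffers via virtqueue_map() */
    req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));
    virtio_blk_init_request(s, req);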

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/virtio-blk.c       | 10 +++-------
 hw/char/virtio-serial-bus.c | 10 +++-------
 hw/scsi/virtio-scsi.c       |  7 ++-----
 hw/virtio/virtio.c          | 13 +++++++++++++
 include/hw/virtio/virtio.h  |  2 ++
 5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d697de4..773b61c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -815,8 +815,7 @@ static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 
     while (req) {
         qemu_put_sbyte(f, 1);
-        qemu_put_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
+        qemu_put_virtqueue_element(f, &req->elem);
         req = req->next;
     }
     qemu_put_sbyte(f, 0);
@@ -839,14 +838,11 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        VirtIOBlockReq *req;
+        req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));
         virtio_blk_init_request(s, req);
-        qemu_get_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
         req->next = s->rq;
         s->rq = req;
-
-        virtqueue_map(&req->elem);
     }
 
     return 0;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 1af2314..93d8309 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -645,9 +645,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         if (elem_popped) {
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
-
-            qemu_put_buffer(f, (unsigned char *)port->elem,
-                            sizeof(VirtQueueElement));
+            qemu_put_virtqueue_element(f, port->elem);
         }
     }
 }
@@ -722,10 +720,8 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                port->elem = g_new(VirtQueueElement, 1);
-                qemu_get_buffer(f, (unsigned char *)port->elem,
-                                sizeof(VirtQueueElement));
-                virtqueue_map(port->elem);
+                port->elem =
+                    qemu_get_virtqueue_element(f, sizeof(VirtQueueElement));
 
                 /*
                  *  Port was throttled on source machine.  Let's
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index f5985dd..3ddfcf1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -188,7 +188,7 @@ static void virtio_scsi_save_request(QEMUFile *f, SCSIRequest *sreq)
 
     assert(n < vs->conf.num_queues);
     qemu_put_be32s(f, &n);
-    qemu_put_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    qemu_put_virtqueue_element(f, &req->elem);
 }
 
 static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
@@ -201,12 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
-    qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    req = qemu_get_virtqueue_element(f, sizeof(VirtIOSCSIReq) + vs->cdb_size);
     virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
-    virtqueue_map(&req->elem);
-
     if (virtio_scsi_parse_req(req, sizeof(VirtIOSCSICmdReq) + vs->cdb_size,
                               sizeof(VirtIOSCSICmdResp) + vs->sense_size) < 0) {
         error_report("invalid SCSI request migration data");
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index dc76a99..fd63206 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -575,6 +575,19 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     return elem;
 }
 
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
+{
+    VirtQueueElement *elem = g_malloc(sz);
+    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    virtqueue_map(elem);
+    return elem;
+}
+
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
+{
+    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+}
+
 /* virtio device */
 static void virtio_notify_vector(VirtIODevice *vdev, uint16_t vector)
 {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 21fda17..44da9a8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -153,6 +153,8 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
 
 void virtqueue_map(VirtQueueElement *elem);
 void *virtqueue_pop(VirtQueue *vq, size_t sz);
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz);
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (3 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  9:47   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
                   ` (35 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index fd63206..f5f8108 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -578,14 +578,105 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
 {
     VirtQueueElement *elem = g_malloc(sz);
-    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    bool swap;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    uint64_t scratch;
+    int i;
+
+    qemu_get_be32s(f, &elem->index);
+    qemu_get_be32s(f, &elem->out_num);
+    qemu_get_be32s(f, &elem->in_num);
+
+    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
+    if (swap) {
+        bswap32s(&elem->index);
+        bswap32s(&elem->out_num);
+        bswap32s(&elem->in_num);
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_get_be64s(f, &elem->in_addr[i]);
+        if (swap) {
+            bswap64s(&elem->in_addr[i]);
+        }
+    }
+    if (i < ARRAY_SIZE(addr)) {
+        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_get_be64s(f, &elem->out_addr[i]);
+        if (swap) {
+            bswap64s(&elem->out_addr[i]);
+        }
+    }
+    if (i < ARRAY_SIZE(addr)) {
+        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        (void) qemu_get_be64(f); /* base */
+        qemu_get_be64s(f, &scratch); /* length */
+        if (swap) {
+            bswap64s(&scratch);
+        }
+        elem->in_sg[i].iov_len = scratch;
+    }
+    if (i < ARRAY_SIZE(iov)) {
+        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        (void) qemu_get_be64(f); /* base */
+        qemu_get_be64s(f, &scratch); /* length */
+        if (swap) {
+            bswap64s(&scratch);
+        }
+        elem->out_sg[i].iov_len = scratch;
+    }
+    if (i < ARRAY_SIZE(iov)) {
+        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+    }
+
     virtqueue_map(elem);
     return elem;
 }
 
 void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
 {
-    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    int i;
+
+    memset(addr, 0, sizeof(addr));
+    memset(iov, 0, sizeof(iov));
+
+    qemu_put_be32s(f, &elem->index);
+    qemu_put_be32s(f, &elem->out_num);
+    qemu_put_be32s(f, &elem->in_num);
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_put_be64s(f, &elem->in_addr[i]);
+    }
+    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_put_be64s(f, &elem->out_addr[i]);
+    }
+    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_put_be64(f, 0);
+        qemu_put_be64(f, elem->in_sg[i].iov_len);
+    }
+    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_put_be64(f, 0);
+        qemu_put_be64(f, elem->out_sg[i].iov_len);
+    }
+    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
 }
 
 /* virtio device */
-- 
1.8.3.1

* [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (4 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Allocate the arrays for in_addr/out_addr/in_sg/out_sg outside the
VirtQueueElement.  For now, virtqueue_pop and vring_pop keep
allocating a very large VirtQueueElement.
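
As a sketch of the intended end state, using virtio-blk's request as
the example, a "subclass" embeds the element as its first field and a
single allocation covers everything:

    typedef struct VirtIOBlockReq {
        VirtQueueElement elem;      /* must come first */
        /* ... device-specific fields ... */
    } VirtIOBlockReq;

    VirtIOBlockReq *req = virtqueue_pop(vq, sizeof(VirtIOBlockReq));
    /* ... use req->elem.in_sg / req->elem.out_sg ... */
    g_free(req);                    /* one free releases it all */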

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/dataplane/vring.c |  2 +-
 hw/virtio/virtio.c          | 60 +++++++++++++++++++++++++++++++--------------
 include/hw/virtio/virtio.h  |  9 ++++---
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 1b900fc..c950caa 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -402,7 +402,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
 
     /* Initialize elem so it can be safely unmapped */
     elem->in_num = elem->out_num = 0;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index f5f8108..32c89eb 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -494,11 +494,29 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
 void virtqueue_map(VirtQueueElement *elem)
 {
     virtqueue_map_iovec(elem->in_sg, elem->in_addr, &elem->in_num,
-                        MIN(ARRAY_SIZE(elem->in_sg), ARRAY_SIZE(elem->in_addr)),
-                        1);
+                        VIRTQUEUE_MAX_SIZE, 1);
     virtqueue_map_iovec(elem->out_sg, elem->out_addr, &elem->out_num,
-                        MIN(ARRAY_SIZE(elem->out_sg), ARRAY_SIZE(elem->out_addr)),
-                        0);
+                        VIRTQUEUE_MAX_SIZE, 0);
+}
+
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
+{
+    VirtQueueElement *elem;
+    size_t in_addr_ofs = QEMU_ALIGN_UP(sz, __alignof__(elem->in_addr[0]));
+    size_t out_addr_ofs = in_addr_ofs + in_num * sizeof(elem->in_addr[0]);
+    size_t out_addr_end = out_addr_ofs + out_num * sizeof(elem->out_addr[0]);
+    size_t in_sg_ofs = QEMU_ALIGN_UP(out_addr_end, __alignof__(elem->in_sg[0]));
+    size_t out_sg_ofs = in_sg_ofs + in_num * sizeof(elem->in_sg[0]);
+    size_t out_sg_end = out_sg_ofs + out_num * sizeof(elem->out_sg[0]);
+
+    elem = g_malloc(out_sg_end);
+    elem->out_num = out_num;
+    elem->in_num = in_num;
+    elem->in_addr = (void *)elem + in_addr_ofs;
+    elem->out_addr = (void *)elem + out_addr_ofs;
+    elem->in_sg = (void *)elem + in_sg_ofs;
+    elem->out_sg = (void *)elem + out_sg_ofs;
+    return elem;
 }
 
 void *virtqueue_pop(VirtQueue *vq, size_t sz)
@@ -513,7 +531,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     }
 
     /* When we start there are none of either input nor output. */
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -540,14 +558,14 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         struct iovec *sg;
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= ARRAY_SIZE(elem->in_sg)) {
+            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many write descriptors in indirect table");
                 exit(1);
             }
             elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
             sg = &elem->in_sg[elem->in_num++];
         } else {
-            if (elem->out_num >= ARRAY_SIZE(elem->out_sg)) {
+            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many read descriptors in indirect table");
                 exit(1);
             }
@@ -577,31 +595,35 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 
 void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
 {
-    VirtQueueElement *elem = g_malloc(sz);
+    VirtQueueElement *elem;
     bool swap;
     hwaddr addr[VIRTQUEUE_MAX_SIZE];
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    uint32_t index, out_num, in_num;
     uint64_t scratch;
     int i;
 
-    qemu_get_be32s(f, &elem->index);
-    qemu_get_be32s(f, &elem->out_num);
-    qemu_get_be32s(f, &elem->in_num);
+    qemu_get_be32s(f, &index);
+    qemu_get_be32s(f, &out_num);
+    qemu_get_be32s(f, &in_num);
 
-    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
+    swap = (out_num & 0xFFFF0000) || (in_num & 0xFFFF0000);
     if (swap) {
-        bswap32s(&elem->index);
-        bswap32s(&elem->out_num);
-        bswap32s(&elem->in_num);
+        bswap32s(&index);
+        bswap32s(&out_num);
+        bswap32s(&in_num);
     }
 
+    elem = virtqueue_alloc_element(sz, out_num, in_num);
+    elem->index = index;
+
     for (i = 0; i < elem->in_num; i++) {
         qemu_get_be64s(f, &elem->in_addr[i]);
         if (swap) {
             bswap64s(&elem->in_addr[i]);
         }
     }
-    if (i < ARRAY_SIZE(addr)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
     }
 
@@ -611,7 +633,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
             bswap64s(&elem->out_addr[i]);
         }
     }
-    if (i < ARRAY_SIZE(addr)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
     }
 
@@ -623,7 +645,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
         }
         elem->in_sg[i].iov_len = scratch;
     }
-    if (i < ARRAY_SIZE(iov)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
     }
 
@@ -635,7 +657,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
         }
         elem->out_sg[i].iov_len = scratch;
     }
-    if (i < ARRAY_SIZE(iov)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
     }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 44da9a8..108cdb0 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -46,10 +46,10 @@ typedef struct VirtQueueElement
     unsigned int index;
     unsigned int out_num;
     unsigned int in_num;
-    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
-    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
-    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+    hwaddr *in_addr;
+    hwaddr *out_addr;
+    struct iovec *in_sg;
+    struct iovec *out_sg;
 } VirtQueueElement;
 
 #define VIRTIO_QUEUE_MAX 1024
@@ -143,6 +143,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 
 void virtio_del_queue(VirtIODevice *vdev, int n);
 
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num);
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (5 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  3:24   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.

When virtqueue_map is used on the destination side of migration or on
loadvm, the iovecs have already been split at memory region boundaries,
so we can just reuse the out_num/in_num we find in the file.
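
In outline, virtqueue_pop now has this shape (condensed from the hunks
below, error handling omitted):

    hwaddr addr[VIRTQUEUE_MAX_SIZE];
    struct iovec iov[VIRTQUEUE_MAX_SIZE];
    unsigned out_num = 0, in_num = 0;

    /* walk the descriptor chain, mapping each descriptor into the
     * stack arrays; out descriptors are placed before in descriptors */

    elem = virtqueue_alloc_element(sz, out_num, in_num);
    /* copy addr[]/iov[] into elem->out_* and elem->in_* */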

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 82 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 32c89eb..0163d0f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -448,6 +448,32 @@ int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
     return in_bytes <= in_total && out_bytes <= out_total;
 }
 
+static void virtqueue_map_desc(unsigned int *p_num_sg, hwaddr *addr, struct iovec *iov,
+                               unsigned int max_num_sg, bool is_write,
+                               hwaddr pa, size_t sz)
+{
+    unsigned num_sg = *p_num_sg;
+    assert(num_sg <= max_num_sg);
+
+    while (sz) {
+        hwaddr len = sz;
+
+        if (num_sg == max_num_sg) {
+            error_report("virtio: too many write descriptors in indirect table");
+            exit(1);
+        }
+
+        iov[num_sg].iov_base = cpu_physical_memory_map(pa, &len, is_write);
+        iov[num_sg].iov_len = len;
+        addr[num_sg] = pa;
+
+        sz -= len;
+        pa += len;
+        num_sg++;
+    }
+    *p_num_sg = num_sg;
+}
+
 static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
                                 unsigned int *num_sg, unsigned int max_size,
                                 int is_write)
@@ -474,20 +500,10 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
             error_report("virtio: error trying to map MMIO memory");
             exit(1);
         }
-        if (len == sg[i].iov_len) {
-            continue;
-        }
-        if (*num_sg >= max_size) {
-            error_report("virtio: memory split makes iovec too large");
+        if (len != sg[i].iov_len) {
+            error_report("virtio: unexpected memory split");
             exit(1);
         }
-        memmove(sg + i + 1, sg + i, sizeof(*sg) * (*num_sg - i));
-        memmove(addr + i + 1, addr + i, sizeof(*addr) * (*num_sg - i));
-        assert(len < sg[i + 1].iov_len);
-        sg[i].iov_len = len;
-        addr[i + 1] += len;
-        sg[i + 1].iov_len -= len;
-        ++*num_sg;
     }
 }
 
@@ -525,14 +541,16 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem;
+    unsigned out_num, in_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
     }
 
     /* When we start there are none of either input nor output. */
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-    elem->out_num = elem->in_num = 0;
+    out_num = in_num = 0;
 
     max = vq->vring.num;
 
@@ -555,37 +573,39 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 
     /* Collect all the descriptors */
     do {
-        struct iovec *sg;
+        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
+        size_t len = vring_desc_len(vdev, desc_pa, i);
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many write descriptors in indirect table");
-                exit(1);
-            }
-            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->in_sg[elem->in_num++];
+            virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
+                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
         } else {
-            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many read descriptors in indirect table");
+            if (in_num) {
+                error_report("Incorrect order for descriptors");
                 exit(1);
             }
-            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->out_sg[elem->out_num++];
+            virtqueue_map_desc(&out_num, addr, iov,
+                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
         }
 
-        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
-
         /* If we've got too many, that implies a descriptor loop. */
-        if ((elem->in_num + elem->out_num) > max) {
+        if ((in_num + out_num) > max) {
             error_report("Looped descriptor");
             exit(1);
         }
     } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
 
-    /* Now map what we have collected */
-    virtqueue_map(elem);
-
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, out_num, in_num);
     elem->index = head;
+    for (i = 0; i < out_num; i++) {
+        elem->out_addr[i] = addr[i];
+        elem->out_sg[i] = iov[i];
+    }
+    for (i = 0; i < in_num; i++) {
+        elem->in_addr[i] = addr[out_num + i];
+        elem->in_sg[i] = iov[out_num + i];
+    }
 
     vq->inuse++;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 08/40] vring: slim down allocation of VirtQueueElements
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (6 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/dataplane/vring.c | 53 ++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index c950caa..d6b8ba9 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -217,8 +217,14 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
                             new, old);
 }
 
-
-static int get_desc(Vring *vring, VirtQueueElement *elem,
+typedef struct VirtQueueCurrentElement {
+    unsigned in_num;
+    unsigned out_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+} VirtQueueCurrentElement;
+
+static int get_desc(Vring *vring, VirtQueueCurrentElement *elem,
                     struct vring_desc *desc)
 {
     unsigned *num;
@@ -229,12 +235,12 @@ static int get_desc(Vring *vring, VirtQueueElement *elem,
 
     if (desc->flags & VRING_DESC_F_WRITE) {
         num = &elem->in_num;
-        iov = &elem->in_sg[*num];
-        addr = &elem->in_addr[*num];
+        iov = &elem->iov[elem->out_num + *num];
+        addr = &elem->addr[elem->out_num + *num];
     } else {
         num = &elem->out_num;
-        iov = &elem->out_sg[*num];
-        addr = &elem->out_addr[*num];
+        iov = &elem->iov[*num];
+        addr = &elem->addr[*num];
 
         /* If it's an output descriptor, they're all supposed
          * to come before any input descriptors. */
@@ -298,7 +304,8 @@ static bool read_vring_desc(VirtIODevice *vdev,
 
 /* This is stolen from linux/drivers/vhost/vhost.c. */
 static int get_indirect(VirtIODevice *vdev, Vring *vring,
-                        VirtQueueElement *elem, struct vring_desc *indirect)
+                        VirtQueueCurrentElement *cur_elem,
+                        struct vring_desc *indirect)
 {
     struct vring_desc desc;
     unsigned int i = 0, count, found = 0;
@@ -350,7 +357,7 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
             return -EFAULT;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, cur_elem, &desc);
         if (ret < 0) {
             vring->broken |= (ret == -EFAULT);
             return ret;
@@ -393,6 +400,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueCurrentElement cur_elem;
     VirtQueueElement *elem = NULL;
     int ret;
 
@@ -402,10 +410,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
+    cur_elem.in_num = cur_elem.out_num = 0;
 
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
@@ -432,8 +437,6 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
      * the index we've seen. */
     head = vring_get_avail_ring(vdev, vring, last_avail_idx % num);
 
-    elem->index = head;
-
     /* If their number is silly, that's an error. */
     if (unlikely(head >= num)) {
         error_report("Guest says index %u > %u is available", head, num);
@@ -460,14 +463,14 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         barrier();
 
         if (desc.flags & VRING_DESC_F_INDIRECT) {
-            ret = get_indirect(vdev, vring, elem, &desc);
+            ret = get_indirect(vdev, vring, &cur_elem, &desc);
             if (ret < 0) {
                 goto out;
             }
             continue;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, &cur_elem, &desc);
         if (ret < 0) {
             goto out;
         }
@@ -482,6 +485,18 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, cur_elem.out_num, cur_elem.in_num);
+    elem->index = head;
+    for (i = 0; i < cur_elem.out_num; i++) {
+        elem->out_addr[i] = cur_elem.addr[i];
+        elem->out_sg[i] = cur_elem.iov[i];
+    }
+    for (i = 0; i < cur_elem.in_num; i++) {
+        elem->in_addr[i] = cur_elem.addr[cur_elem.out_num + i];
+        elem->in_sg[i] = cur_elem.iov[cur_elem.out_num + i];
+    }
+
     return elem;
 
 out:
@@ -489,7 +504,11 @@ out:
     if (ret == -EFAULT) {
         vring->broken = true;
     }
-    vring_unmap_element(elem);
+
+    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
+        vring_unmap(cur_elem.iov[i].iov_base, false);
+    }
+
     g_free(elem);
     return NULL;
 }
-- 
1.8.3.1

* [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (7 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Make the API more similar to the regular virtqueue API.
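
Callers that relied on the old return value now pair the call with
vring_more_avail, as in the handle_notify hunk below:

    vring_enable_notification(s->vdev, &s->vring);
    if (!vring_more_avail(s->vdev, &s->vring)) {
        break;    /* vring emptied, wait for the next guest kick */
    }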

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c     | 3 ++-
 hw/virtio/dataplane/vring.c         | 3 +--
 include/hw/virtio/dataplane/vring.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 0cadd3a..ace36f5 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -126,7 +126,8 @@ static void handle_notify(EventNotifier *e)
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
-            if (vring_enable_notification(s->vdev, &s->vring)) {
+            vring_enable_notification(s->vdev, &s->vring);
+            if (!vring_more_avail(s->vdev, &s->vring)) {
                 break;
             }
         } else { /* fatal error */
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index d6b8ba9..4e1a299 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -174,7 +174,7 @@ void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
  *
  * Return true if the vring is empty, false if there are more requests.
  */
-bool vring_enable_notification(VirtIODevice *vdev, Vring *vring)
+void vring_enable_notification(VirtIODevice *vdev, Vring *vring)
 {
     if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
         vring_avail_event(&vring->vr) = vring->vr.avail->idx;
@@ -182,7 +182,6 @@ bool vring_enable_notification(VirtIODevice *vdev, Vring *vring)
         vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
     }
     smp_mb(); /* ensure update is seen before reading avail_idx */
-    return !vring_more_avail(vdev, vring);
 }
 
 /* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index e80985e..e1c2a65 100644
--- a/include/hw/virtio/dataplane/vring.h
+++ b/include/hw/virtio/dataplane/vring.h
@@ -42,7 +42,7 @@ static inline void vring_set_broken(Vring *vring)
 bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
 void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
-bool vring_enable_notification(VirtIODevice *vdev, Vring *vring);
+void vring_enable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
 void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
 void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (8 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Compared to vring, the regular virtio code has a performance penalty
of 10%.  Fix it
by combining all the reads for a descriptor in a single address_space_read
call.  This also simplifies the code nicely.
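
The resulting accessor (shown in full below) boils down to one bulk
read plus per-field byte swaps:

    VRingDesc desc;
    vring_desc_read(vdev, &desc, desc_pa, i);  /* one address_space_read */
    /* desc.addr, desc.len, desc.flags, desc.next are now host-endian */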

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 51 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 0163d0f..0e9fff6 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice *vdev, int n)
                               vring->align);
 }
 
-static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
+static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
+                            hwaddr desc_pa, int i)
 {
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
-    return virtio_ldq_phys(vdev, pa);
-}
-
-static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa, int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
-    return virtio_ldl_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr desc_pa,
-                                        int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
-    return virtio_lduw_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
-    return virtio_lduw_phys(vdev, pa);
+    address_space_read(&address_space_memory, desc_pa + i * sizeof(VRingDesc),
+                       MEMTXATTRS_UNSPECIFIED, (void *)desc, sizeof(VRingDesc));
+    virtio_tswap64s(vdev, &desc->addr);
+    virtio_tswap32s(vdev, &desc->len);
+    virtio_tswap16s(vdev, &desc->flags);
+    virtio_tswap16s(vdev, &desc->next);
 }
 
 static inline uint16_t vring_avail_flags(VirtQueue *vq)
@@ -345,18 +325,18 @@ static unsigned int virtqueue_get_head(VirtQueue *vq, unsigned int idx)
     return head;
 }
 
-static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
-                                    unsigned int i, unsigned int max)
+static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc *desc,
+                                         hwaddr desc_pa, unsigned int max)
 {
     unsigned int next;
 
     /* If this descriptor says it doesn't chain, we're done. */
-    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
+    if (!(desc->flags & VRING_DESC_F_NEXT)) {
         return max;
     }
 
     /* Check they're not leading us off end of descriptors. */
-    next = vring_desc_next(vdev, desc_pa, i);
+    next = desc->next;
     /* Make sure compiler knows to grab that: we don't want it changing! */
     smp_wmb();
 
@@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
         exit(1);
     }
 
+    vring_desc_read(vdev, desc, desc_pa, next);
     return next;
 }
 
@@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
     while (virtqueue_num_heads(vq, idx)) {
         VirtIODevice *vdev = vq->vdev;
         unsigned int max, num_bufs, indirect = 0;
+        VRingDesc desc;
         hwaddr desc_pa;
         int i;
 
@@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
         num_bufs = total_bufs;
         i = virtqueue_get_head(vq, idx++);
         desc_pa = vq->vring.desc;
+        vring_desc_read(vdev, &desc, desc_pa, i);
 
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            if (desc.len % sizeof(VRingDesc)) {
                 error_report("Invalid size for indirect buffer table");
                 exit(1);
             }
@@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
 
             /* loop over the indirect descriptor table */
             indirect = 1;
-            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-            desc_pa = vring_desc_addr(vdev, desc_pa, i);
+            max = desc.len / sizeof(VRingDesc);
+            desc_pa = desc.addr;
             num_bufs = i = 0;
+            vring_desc_read(vdev, &desc, desc_pa, i);
         }
 
         do {
@@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
                 exit(1);
             }
 
-            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-                in_total += vring_desc_len(vdev, desc_pa, i);
+            if (desc.flags & VRING_DESC_F_WRITE) {
+                in_total += desc.len;
             } else {
-                out_total += vring_desc_len(vdev, desc_pa, i);
+                out_total += desc.len;
             }
             if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
                 goto done;
             }
-        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
         if (!indirect)
             total_bufs = num_bufs;
@@ -544,6 +528,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     unsigned out_num, in_num;
     hwaddr addr[VIRTQUEUE_MAX_SIZE];
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    VRingDesc desc;
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
@@ -559,33 +544,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         vring_set_avail_event(vq, vq->last_avail_idx);
     }
 
-    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+    vring_desc_read(vdev, &desc, desc_pa, i);
+    if (desc.flags & VRING_DESC_F_INDIRECT) {
+        if (desc.len % sizeof(VRingDesc)) {
             error_report("Invalid size for indirect buffer table");
             exit(1);
         }
 
         /* loop over the indirect descriptor table */
-        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-        desc_pa = vring_desc_addr(vdev, desc_pa, i);
+        max = desc.len / sizeof(VRingDesc);
+        desc_pa = desc.addr;
         i = 0;
+        vring_desc_read(vdev, &desc, desc_pa, i);
     }
 
     /* Collect all the descriptors */
     do {
-        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
-        size_t len = vring_desc_len(vdev, desc_pa, i);
-
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
+        if (desc.flags & VRING_DESC_F_WRITE) {
             virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
-                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
+                               VIRTQUEUE_MAX_SIZE - out_num, 1, desc.addr, desc.len);
         } else {
             if (in_num) {
                 error_report("Incorrect order for descriptors");
                 exit(1);
             }
             virtqueue_map_desc(&out_num, addr, iov,
-                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
+                               VIRTQUEUE_MAX_SIZE, 0, desc.addr, desc.len);
         }
 
         /* If we've got too many, that implies a descriptor loop. */
@@ -593,7 +577,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
             error_report("Looped descriptor");
             exit(1);
         }
-    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
     /* Now copy what we have collected and mapped */
     elem = virtqueue_alloc_element(sz, out_num, in_num);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (9 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c         | 16 ++++++++++++++++
 include/hw/virtio/virtio.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 0e9fff6..c835451 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1835,6 +1835,22 @@ static void virtio_queue_host_notifier_read(EventNotifier *n)
     }
 }
 
+void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
+                                                bool assign, bool set_handler)
+{
+    if (assign && set_handler) {
+        aio_set_event_notifier(ctx, &vq->host_notifier, true,
+                               virtio_queue_host_notifier_read);
+    } else {
+        aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL);
+    }
+    if (!assign) {
+        /* Test and clear the notifier after disabling the event,
+         * in case the poll callback didn't have time to run. */
+        virtio_queue_host_notifier_read(&vq->host_notifier);
+    }
+}
+
 void virtio_queue_set_host_notifier_fd_handler(VirtQueue *vq, bool assign,
                                                bool set_handler)
 {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 108cdb0..4ce01a1 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -248,6 +248,8 @@ void virtio_queue_set_guest_notifier_fd_handler(VirtQueue *vq, bool assign,
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_fd_handler(VirtQueue *vq, bool assign,
                                                bool set_handler);
+void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
+                                                bool assign, bool set_handler);
 void virtio_queue_notify_vq(VirtQueue *vq);
 void virtio_irq(VirtQueue *vq);
 VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (10 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Virtio dataplane needs to trigger the irq manually through the
guest notifier.  Export virtio_should_notify so that it can be
used around event_notifier_set.
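
The intended call pattern, which a later patch in this series uses
verbatim for virtio-blk dataplane:

    if (!virtio_should_notify(s->vdev, s->vq)) {
        return;
    }
    event_notifier_set(s->guest_notifier);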

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c         | 4 ++--
 include/hw/virtio/virtio.h | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index c835451..d49306b 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1166,7 +1166,7 @@ void virtio_irq(VirtQueue *vq)
     virtio_notify_vector(vq->vdev, vq->vector);
 }
 
-static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
+bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
     bool v;
@@ -1191,7 +1191,7 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
-    if (!vring_notify(vdev, vq)) {
+    if (!virtio_should_notify(vdev, vq)) {
         return;
     }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 4ce01a1..5884228 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -162,6 +162,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
                                unsigned int *out_bytes,
                                unsigned max_in_bytes, unsigned max_out_bytes);
 
+bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq);
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);
 
 void virtio_save(VirtIODevice *vdev, QEMUFile *f);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (11 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

In disabled mode, virtio-blk dataplane seems to be enabled, but flow
actually goes through the normal virtio path.  This patch simplifies
the handling of disabled mode a bit.  In disabled mode,
virtio_blk_handle_output
might be called even if s->dataplane is not NULL.

This is a bit tricky, because the current check for s->dataplane will
always trigger, causing a continuous stream of calls to
virtio_blk_data_plane_start.  Unfortunately, these calls will not
do anything.  To fix this, set the "started" flag even in disabled
mode, and skip virtio_blk_data_plane_start if the started flag is true.
The resulting changes also prepare the code for the next patch, where
virtio-blk dataplane will reuse the same virtio_blk_handle_output function
as "regular" virtio-blk.

Because struct VirtIOBlockDataPlane is opaque in virtio-blk.c, we have
to move s->dataplane->started inside struct VirtIOBlock.
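
Condensed from the hunks below, the start/stop entry points now reduce
to checks on the shared flag:

    /* start: also reached, and skipped, in disabled mode */
    if (vblk->dataplane_started || s->starting) {
        return;
    }

    /* stop: disabled mode never really started, so only clear flags */
    if (!vblk->dataplane_started || s->stopping) {
        return;
    }
    if (s->disabled) {
        s->disabled = false;
        vblk->dataplane_started = false;
        return;
    }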

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 21 +++++++++------------
 hw/block/virtio-blk.c           |  2 +-
 include/hw/virtio/virtio-blk.h  |  1 +
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index ace36f5..8d417e2 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -27,7 +27,6 @@
 #include "qom/object_interfaces.h"
 
 struct VirtIOBlockDataPlane {
-    bool started;
     bool starting;
     bool stopping;
     bool disabled;
@@ -237,11 +236,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     VirtQueue *vq;
     int r;
 
-    if (s->started || s->disabled) {
-        return;
-    }
-
-    if (s->starting) {
+    if (vblk->dataplane_started || s->starting) {
         return;
     }
 
@@ -273,7 +268,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     vblk->complete_request = complete_request_vring;
 
     s->starting = false;
-    s->started = true;
+    vblk->dataplane_started = true;
     trace_virtio_blk_data_plane_start(s);
 
     blk_set_aio_context(s->conf->conf.blk, s->ctx);
@@ -292,9 +287,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     k->set_guest_notifiers(qbus->parent, 1, false);
   fail_guest_notifiers:
     vring_teardown(&s->vring, s->vdev, 0);
-    s->disabled = true;
   fail_vring:
+    s->disabled = true;
     s->starting = false;
+    vblk->dataplane_started = true;
 }
 
 /* Context: QEMU global mutex held */
@@ -304,13 +300,14 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
 
+    if (!vblk->dataplane_started || s->stopping) {
+        return;
+    }
 
     /* Better luck next time. */
     if (s->disabled) {
         s->disabled = false;
-        return;
-    }
-    if (!s->started || s->stopping) {
+        vblk->dataplane_started = false;
         return;
     }
     s->stopping = true;
@@ -337,6 +334,6 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     /* Clean up guest notifier (irq) */
     k->set_guest_notifiers(qbus->parent, 1, false);
 
-    s->started = false;
+    vblk->dataplane_started = false;
     s->stopping = false;
 }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 773b61c..062d57e 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -596,7 +596,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
      * dataplane here instead of waiting for .set_status().
      */
-    if (s->dataplane) {
+    if (s->dataplane && !s->dataplane_started) {
         virtio_blk_data_plane_start(s->dataplane);
         return;
     }
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 9a2048d..e720934 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -57,6 +57,7 @@ typedef struct VirtIOBlock {
     /* Function to push to vq and notify guest */
     void (*complete_request)(struct VirtIOBlockReq *req, unsigned char status);
     Notifier migration_state_notifier;
+    bool dataplane_started;
     struct VirtIOBlockDataPlane *dataplane;
 } VirtIOBlock;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (12 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 112 +++++-----------------------------------
 hw/block/dataplane/virtio-blk.h |   1 +
 hw/block/virtio-blk.c           |  48 +++--------------
 include/hw/virtio/virtio-blk.h  |   3 --
 4 files changed, 19 insertions(+), 145 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 8d417e2..f2482eb 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -17,8 +17,6 @@
 #include "qemu/thread.h"
 #include "qemu/error-report.h"
 #include "hw/virtio/virtio-access.h"
-#include "hw/virtio/dataplane/vring.h"
-#include "hw/virtio/dataplane/vring-accessors.h"
 #include "sysemu/block-backend.h"
 #include "hw/virtio/virtio-blk.h"
 #include "virtio-blk.h"
@@ -34,7 +32,7 @@ struct VirtIOBlockDataPlane {
     VirtIOBlkConf *conf;
 
     VirtIODevice *vdev;
-    Vring vring;                    /* virtqueue vring */
+    VirtQueue *vq;                  /* the virtqueue */
     EventNotifier *guest_notifier;  /* irq */
     QEMUBH *bh;                     /* bh for guest notification */
 
@@ -46,94 +44,26 @@ struct VirtIOBlockDataPlane {
     IOThread *iothread;
     IOThread internal_iothread_obj;
     AioContext *ctx;
-    EventNotifier host_notifier;    /* doorbell */
 
     /* Operation blocker on BDS */
     Error *blocker;
-    void (*saved_complete_request)(struct VirtIOBlockReq *req,
-                                   unsigned char status);
 };
 
 /* Raise an interrupt to signal guest, if necessary */
-static void notify_guest(VirtIOBlockDataPlane *s)
+void virtio_blk_data_plane_notify(VirtIOBlockDataPlane *s)
 {
-    if (!vring_should_notify(s->vdev, &s->vring)) {
-        return;
-    }
-
-    event_notifier_set(s->guest_notifier);
+    qemu_bh_schedule(s->bh);
 }
 
 static void notify_guest_bh(void *opaque)
 {
     VirtIOBlockDataPlane *s = opaque;
 
-    notify_guest(s);
-}
-
-static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
-{
-    VirtIOBlockDataPlane *s = req->dev->dataplane;
-    stb_p(&req->in->status, status);
-
-    vring_push(s->vdev, &req->dev->dataplane->vring, &req->elem, req->in_len);
-
-    /* Suppress notification to guest by BH and its scheduled
-     * flag because requests are completed as a batch after io
-     * plug & unplug is introduced, and the BH can still be
-     * executed in dataplane aio context even after it is
-     * stopped, so needn't worry about notification loss with BH.
-     */
-    qemu_bh_schedule(s->bh);
-}
-
-static void handle_notify(EventNotifier *e)
-{
-    VirtIOBlockDataPlane *s = container_of(e, VirtIOBlockDataPlane,
-                                           host_notifier);
-    VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-
-    event_notifier_test_and_clear(&s->host_notifier);
-    blk_io_plug(s->conf->conf.blk);
-    for (;;) {
-        MultiReqBuffer mrb = {};
-
-        /* Disable guest->host notifies to avoid unnecessary vmexits */
-        vring_disable_notification(s->vdev, &s->vring);
-
-        for (;;) {
-            VirtIOBlockReq *req = vring_pop(s->vdev, &s->vring,
-                                            sizeof(VirtIOBlockReq));
-
-            if (req == NULL) {
-                break; /* no more requests */
-            }
-
-            virtio_blk_init_request(vblk, req);
-            trace_virtio_blk_data_plane_process_request(s, req->elem.out_num,
-                                                        req->elem.in_num,
-                                                        req->elem.index);
-
-            virtio_blk_handle_request(req, &mrb);
-        }
-
-        if (mrb.num_reqs) {
-            virtio_blk_submit_multireq(s->conf->conf.blk, &mrb);
-        }
-
-        if (likely(!vring_more_avail(s->vdev, &s->vring))) { /* vring emptied */
-            /* Re-enable guest->host notifies and stop processing the vring.
-             * But if the guest has snuck in more descriptors, keep processing.
-             */
-            vring_enable_notification(s->vdev, &s->vring);
-            if (!vring_more_avail(s->vdev, &s->vring)) {
-                break;
-            }
-        } else { /* fatal error */
-            break;
-        }
+    if (!virtio_should_notify(s->vdev, s->vq)) {
+        return;
     }
-    blk_io_unplug(s->conf->conf.blk);
+
+    event_notifier_set(s->guest_notifier);
 }
 
 /* Context: QEMU global mutex held */
@@ -233,7 +163,6 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-    VirtQueue *vq;
     int r;
 
     if (vblk->dataplane_started || s->starting) {
@@ -241,11 +170,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     }
 
     s->starting = true;
-
-    vq = virtio_get_queue(s->vdev, 0);
-    if (!vring_setup(&s->vring, s->vdev, 0)) {
-        goto fail_vring;
-    }
+    s->vq = virtio_get_queue(s->vdev, 0);
 
     /* Set up guest notifier (irq) */
     r = k->set_guest_notifiers(qbus->parent, 1, true);
@@ -254,7 +179,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
                 "ensure -enable-kvm is set\n", r);
         goto fail_guest_notifiers;
     }
-    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
+    s->guest_notifier = virtio_queue_get_guest_notifier(s->vq);
 
     /* Set up virtqueue notify */
     r = k->set_host_notifier(qbus->parent, 0, true);
@@ -262,10 +187,6 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
         fprintf(stderr, "virtio-blk failed to set host notifier (%d)\n", r);
         goto fail_host_notifier;
     }
-    s->host_notifier = *virtio_queue_get_host_notifier(vq);
-
-    s->saved_complete_request = vblk->complete_request;
-    vblk->complete_request = complete_request_vring;
 
     s->starting = false;
     vblk->dataplane_started = true;
@@ -274,20 +195,17 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     blk_set_aio_context(s->conf->conf.blk, s->ctx);
 
     /* Kick right away to begin processing requests already in vring */
-    event_notifier_set(virtio_queue_get_host_notifier(vq));
+    event_notifier_set(virtio_queue_get_host_notifier(s->vq));
 
     /* Get this show started by hooking up our callbacks */
     aio_context_acquire(s->ctx);
-    aio_set_event_notifier(s->ctx, &s->host_notifier, true,
-                           handle_notify);
+    virtio_queue_aio_set_host_notifier_handler(s->vq, s->ctx, true, true);
     aio_context_release(s->ctx);
     return;
 
   fail_host_notifier:
     k->set_guest_notifiers(qbus->parent, 1, false);
   fail_guest_notifiers:
-    vring_teardown(&s->vring, s->vdev, 0);
-  fail_vring:
     s->disabled = true;
     s->starting = false;
     vblk->dataplane_started = true;
@@ -311,24 +229,18 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
         return;
     }
     s->stopping = true;
-    vblk->complete_request = s->saved_complete_request;
     trace_virtio_blk_data_plane_stop(s);
 
     aio_context_acquire(s->ctx);
 
     /* Stop notifications for new requests from guest */
-    aio_set_event_notifier(s->ctx, &s->host_notifier, true, NULL);
+    virtio_queue_aio_set_host_notifier_handler(s->vq, s->ctx, false, false);
 
     /* Drain and switch bs back to the QEMU main loop */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
 
     aio_context_release(s->ctx);
 
-    /* Sync vring state back to virtqueue so that non-dataplane request
-     * processing can continue when we disable the host notifier below.
-     */
-    vring_teardown(&s->vring, s->vdev, 0);
-
     k->set_host_notifier(qbus->parent, 0, false);
 
     /* Clean up guest notifier (irq) */
diff --git a/hw/block/dataplane/virtio-blk.h b/hw/block/dataplane/virtio-blk.h
index c88d40e..0714c11 100644
--- a/hw/block/dataplane/virtio-blk.h
+++ b/hw/block/dataplane/virtio-blk.h
@@ -26,5 +26,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_notify(VirtIOBlockDataPlane *s);
 
 #endif /* HW_DATAPLANE_VIRTIO_BLK_H */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 062d57e..e83d823 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -44,8 +44,7 @@ void virtio_blk_free_request(VirtIOBlockReq *req)
     }
 }
 
-static void virtio_blk_complete_request(VirtIOBlockReq *req,
-                                        unsigned char status)
+static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
 {
     VirtIOBlock *s = req->dev;
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
@@ -54,12 +53,11 @@ static void virtio_blk_complete_request(VirtIOBlockReq *req,
 
     stb_p(&req->in->status, status);
     virtqueue_push(s->vq, &req->elem, req->in_len);
-    virtio_notify(vdev, s->vq);
-}
-
-static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
-{
-    req->dev->complete_request(req, status);
+    if (s->dataplane) {
+        virtio_blk_data_plane_notify(s->dataplane);
+    } else {
+        virtio_notify(vdev, s->vq);
+    }
 }
 
 static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
@@ -859,36 +857,6 @@ static const BlockDevOps virtio_block_ops = {
     .resize_cb = virtio_blk_resize,
 };
 
-/* Disable dataplane thread during live migration since it does not
- * update the dirty memory bitmap yet.
- */
-static void virtio_blk_migration_state_changed(Notifier *notifier, void *data)
-{
-    VirtIOBlock *s = container_of(notifier, VirtIOBlock,
-                                  migration_state_notifier);
-    MigrationState *mig = data;
-    Error *err = NULL;
-
-    if (migration_in_setup(mig)) {
-        if (!s->dataplane) {
-            return;
-        }
-        virtio_blk_data_plane_destroy(s->dataplane);
-        s->dataplane = NULL;
-    } else if (migration_has_finished(mig) ||
-               migration_has_failed(mig)) {
-        if (s->dataplane) {
-            return;
-        }
-        blk_drain_all(); /* complete in-flight non-dataplane requests */
-        virtio_blk_data_plane_create(VIRTIO_DEVICE(s), &s->conf,
-                                     &s->dataplane, &err);
-        if (err != NULL) {
-            error_report_err(err);
-        }
-    }
-}
-
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -923,15 +891,12 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     s->sector_mask = (s->conf.conf.logical_block_size / BDRV_SECTOR_SIZE) - 1;
 
     s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
-    s->complete_request = virtio_blk_complete_request;
     virtio_blk_data_plane_create(vdev, conf, &s->dataplane, &err);
     if (err != NULL) {
         error_propagate(errp, err);
         virtio_cleanup(vdev);
         return;
     }
-    s->migration_state_notifier.notify = virtio_blk_migration_state_changed;
-    add_migration_state_change_notifier(&s->migration_state_notifier);
 
     s->change = qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
@@ -947,7 +912,6 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBlock *s = VIRTIO_BLK(dev);
 
-    remove_migration_state_change_notifier(&s->migration_state_notifier);
     virtio_blk_data_plane_destroy(s->dataplane);
     s->dataplane = NULL;
     qemu_del_vm_change_state_handler(s->change);
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index e720934..4c72021 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -54,9 +54,6 @@ typedef struct VirtIOBlock {
     unsigned short sector_mask;
     bool original_wce;
     VMChangeStateEntry *change;
-    /* Function to push to vq and notify guest */
-    void (*complete_request)(struct VirtIOBlockReq *req, unsigned char status);
-    Notifier migration_state_notifier;
     bool dataplane_started;
     struct VirtIOBlockDataPlane *dataplane;
 } VirtIOBlock;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 15/40] virtio-scsi: do not use vring in dataplane
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (13 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/virtio-scsi-dataplane.c | 196 +++++-----------------------------------
 hw/scsi/virtio-scsi.c           |  51 ++---------
 include/hw/virtio/virtio-scsi.h |  21 +----
 3 files changed, 35 insertions(+), 233 deletions(-)
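
The per-queue setup collapses into one helper that is shared by the
ctrl, event and cmd queues; roughly (simplified from the first hunk
below, with the fprintf error path shortened):

static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
{
    BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s)));
    VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
    int rc;

    /* ioeventfd for guest->host kicks */
    rc = k->set_host_notifier(qbus->parent, n, true);
    if (rc != 0) {
        s->dataplane_fenced = true;
        return rc;
    }

    /* have the kicks processed in the dataplane AioContext */
    virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, true, true);
    return 0;
}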

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 45ae2d2..b1745b2 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -38,14 +38,10 @@ void virtio_scsi_set_iothread(VirtIOSCSI *s, IOThread *iothread)
     }
 }
 
-static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
-                                               VirtQueue *vq,
-                                               EventNotifierHandler *handler,
-                                               int n)
+static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
 {
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
-    VirtIOSCSIVring *r;
     int rc;
 
     /* Set up virtqueue notify */
@@ -54,105 +50,17 @@ static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
         fprintf(stderr, "virtio-scsi: Failed to set host notifier (%d)\n",
                 rc);
         s->dataplane_fenced = true;
-        return NULL;
+        return rc;
     }
 
-    r = g_new(VirtIOSCSIVring, 1);
-    r->host_notifier = *virtio_queue_get_host_notifier(vq);
-    r->guest_notifier = *virtio_queue_get_guest_notifier(vq);
-    aio_set_event_notifier(s->ctx, &r->host_notifier, true, handler);
-
-    r->parent = s;
-
-    if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
-        fprintf(stderr, "virtio-scsi: VRing setup failed\n");
-        goto fail_vring;
-    }
-    return r;
-
-fail_vring:
-    aio_set_event_notifier(s->ctx, &r->host_notifier, true, NULL);
-    k->set_host_notifier(qbus->parent, n, false);
-    g_free(r);
-    return NULL;
-}
-
-VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
-                                         VirtIOSCSIVring *vring)
-{
-    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    VirtIOSCSIReq *req;
-
-    req = vring_pop((VirtIODevice *)s, &vring->vring,
-                    sizeof(VirtIOSCSIReq) + vs->cdb_size);
-    if (!req) {
-        return NULL;
-    }
-    virtio_scsi_init_req(s, NULL, req);
-    req->vring = vring;
-    return req;
-}
-
-void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
-{
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->vring->parent);
-
-    vring_push(vdev, &req->vring->vring, &req->elem,
-               req->qsgl.size + req->resp_iov.size);
-
-    if (vring_should_notify(vdev, &req->vring->vring)) {
-        event_notifier_set(&req->vring->guest_notifier);
-    }
+    virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, true, true);
+    return 0;
 }
 
-static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
+void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req)
 {
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = VIRTIO_SCSI(vring->parent);
-    VirtIOSCSIReq *req;
-
-    event_notifier_test_and_clear(notifier);
-    while ((req = virtio_scsi_pop_req_vring(s, vring))) {
-        virtio_scsi_handle_ctrl_req(s, req);
-    }
-}
-
-static void virtio_scsi_iothread_handle_event(EventNotifier *notifier)
-{
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = vring->parent;
-    VirtIODevice *vdev = VIRTIO_DEVICE(s);
-
-    event_notifier_test_and_clear(notifier);
-
-    if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
-        return;
-    }
-
-    if (s->events_dropped) {
-        virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
-    }
-}
-
-static void virtio_scsi_iothread_handle_cmd(EventNotifier *notifier)
-{
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = (VirtIOSCSI *)vring->parent;
-    VirtIOSCSIReq *req, *next;
-    QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
-
-    event_notifier_test_and_clear(notifier);
-    while ((req = virtio_scsi_pop_req_vring(s, vring))) {
-        if (virtio_scsi_handle_cmd_req_prepare(s, req)) {
-            QTAILQ_INSERT_TAIL(&reqs, req, next);
-        }
-    }
-
-    QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
-        virtio_scsi_handle_cmd_req_submit(s, req);
+    if (virtio_should_notify(vdev, req->vq)) {
+        event_notifier_set(virtio_queue_get_guest_notifier(req->vq));
     }
 }
 
@@ -162,46 +70,10 @@ static void virtio_scsi_clear_aio(VirtIOSCSI *s)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     int i;
 
-    if (s->ctrl_vring) {
-        aio_set_event_notifier(s->ctx, &s->ctrl_vring->host_notifier,
-                               true, NULL);
-    }
-    if (s->event_vring) {
-        aio_set_event_notifier(s->ctx, &s->event_vring->host_notifier,
-                               true, NULL);
-    }
-    if (s->cmd_vrings) {
-        for (i = 0; i < vs->conf.num_queues && s->cmd_vrings[i]; i++) {
-            aio_set_event_notifier(s->ctx, &s->cmd_vrings[i]->host_notifier,
-                                   true, NULL);
-        }
-    }
-}
-
-static void virtio_scsi_vring_teardown(VirtIOSCSI *s)
-{
-    VirtIODevice *vdev = VIRTIO_DEVICE(s);
-    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
-    int i;
-
-    if (s->ctrl_vring) {
-        vring_teardown(&s->ctrl_vring->vring, vdev, 0);
-        g_free(s->ctrl_vring);
-        s->ctrl_vring = NULL;
-    }
-    if (s->event_vring) {
-        vring_teardown(&s->event_vring->vring, vdev, 1);
-        g_free(s->event_vring);
-        s->event_vring = NULL;
-    }
-    if (s->cmd_vrings) {
-        for (i = 0; i < vs->conf.num_queues && s->cmd_vrings[i]; i++) {
-            vring_teardown(&s->cmd_vrings[i]->vring, vdev, 2 + i);
-            g_free(s->cmd_vrings[i]);
-            s->cmd_vrings[i] = NULL;
-        }
-        free(s->cmd_vrings);
-        s->cmd_vrings = NULL;
+    virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, false, false);
+    virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, false, false);
+    for (i = 0; i < vs->conf.num_queues; i++) {
+        virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, false, false);
     }
 }
 
@@ -228,30 +100,21 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
     if (rc != 0) {
         fprintf(stderr, "virtio-scsi: Failed to set guest notifiers (%d), "
                 "ensure -enable-kvm is set\n", rc);
-        s->dataplane_fenced = true;
         goto fail_guest_notifiers;
     }
 
     aio_context_acquire(s->ctx);
-    s->ctrl_vring = virtio_scsi_vring_init(s, vs->ctrl_vq,
-                                           virtio_scsi_iothread_handle_ctrl,
-                                           0);
-    if (!s->ctrl_vring) {
+    rc = virtio_scsi_vring_init(s, vs->ctrl_vq, 0);
+    if (rc) {
         goto fail_vrings;
     }
-    s->event_vring = virtio_scsi_vring_init(s, vs->event_vq,
-                                            virtio_scsi_iothread_handle_event,
-                                            1);
-    if (!s->event_vring) {
+    rc = virtio_scsi_vring_init(s, vs->event_vq, 1);
+    if (rc) {
         goto fail_vrings;
     }
-    s->cmd_vrings = g_new(VirtIOSCSIVring *, vs->conf.num_queues);
     for (i = 0; i < vs->conf.num_queues; i++) {
-        s->cmd_vrings[i] =
-            virtio_scsi_vring_init(s, vs->cmd_vqs[i],
-                                   virtio_scsi_iothread_handle_cmd,
-                                   i + 2);
-        if (!s->cmd_vrings[i]) {
+        rc = virtio_scsi_vring_init(s, vs->cmd_vqs[i], i + 2);
+        if (rc) {
             goto fail_vrings;
         }
     }
@@ -264,13 +127,14 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
 fail_vrings:
     virtio_scsi_clear_aio(s);
     aio_context_release(s->ctx);
-    virtio_scsi_vring_teardown(s);
     for (i = 0; i < vs->conf.num_queues + 2; i++) {
         k->set_host_notifier(qbus->parent, i, false);
     }
     k->set_guest_notifiers(qbus->parent, vs->conf.num_queues + 2, false);
 fail_guest_notifiers:
+    s->dataplane_fenced = true;
     s->dataplane_starting = false;
+    s->dataplane_started = true;
 }
 
 /* Context: QEMU global mutex held */
@@ -281,12 +145,14 @@ void virtio_scsi_dataplane_stop(VirtIOSCSI *s)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     int i;
 
+    if (!s->dataplane_started || s->dataplane_stopping) {
+        return;
+    }
+
     /* Better luck next time. */
     if (s->dataplane_fenced) {
         s->dataplane_fenced = false;
-        return;
-    }
-    if (!s->dataplane_started || s->dataplane_stopping) {
+        s->dataplane_started = false;
         return;
     }
     s->dataplane_stopping = true;
@@ -294,24 +160,12 @@ void virtio_scsi_dataplane_stop(VirtIOSCSI *s)
 
     aio_context_acquire(s->ctx);
 
-    aio_set_event_notifier(s->ctx, &s->ctrl_vring->host_notifier,
-                           true, NULL);
-    aio_set_event_notifier(s->ctx, &s->event_vring->host_notifier,
-                           true, NULL);
-    for (i = 0; i < vs->conf.num_queues; i++) {
-        aio_set_event_notifier(s->ctx, &s->cmd_vrings[i]->host_notifier,
-                               true, NULL);
-    }
+    virtio_scsi_clear_aio(s);
 
     blk_drain_all(); /* ensure there are no in-flight requests */
 
     aio_context_release(s->ctx);
 
-    /* Sync vring state back to virtqueue so that non-dataplane request
-     * processing can continue when we disable the host notifier below.
-     */
-    virtio_scsi_vring_teardown(s);
-
     for (i = 0; i < vs->conf.num_queues + 2; i++) {
         k->set_host_notifier(qbus->parent, i, false);
     }
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3ddfcf1..4054ce5 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -42,7 +42,8 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
 
 void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
 {
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
+    const size_t zero_skip =
+        offsetof(VirtIOSCSIReq, resp_iov) + sizeof(req->resp_iov);
 
     req->vq = vq;
     req->dev = s;
@@ -65,11 +66,10 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
     qemu_iovec_from_buf(&req->resp_iov, 0, &req->resp, req->resp_size);
-    if (req->vring) {
-        assert(req->vq == NULL);
-        virtio_scsi_vring_push_notify(req);
+    virtqueue_push(vq, &req->elem, req->qsgl.size + req->resp_iov.size);
+    if (s->dataplane_started) {
+        virtio_scsi_dataplane_notify(vdev, req);
     } else {
-        virtqueue_push(vq, &req->elem, req->qsgl.size + req->resp_iov.size);
         virtio_notify(vdev, vq);
     }
 
@@ -416,7 +416,7 @@ static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOSCSI *s = (VirtIOSCSI *)vdev;
     VirtIOSCSIReq *req;
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -566,7 +566,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOSCSIReq *req, *next;
     QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -686,11 +686,7 @@ void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
         aio_context_acquire(s->ctx);
     }
 
-    if (s->dataplane_started) {
-        req = virtio_scsi_pop_req_vring(s, s->event_vring);
-    } else {
-        req = virtio_scsi_pop_req(s, vs->event_vq);
-    }
+    req = virtio_scsi_pop_req(s, vs->event_vq);
     if (!req) {
         s->events_dropped = true;
         goto out;
@@ -732,7 +728,7 @@ static void virtio_scsi_handle_event(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -848,31 +844,6 @@ void virtio_scsi_common_realize(DeviceState *dev, Error **errp,
     }
 }
 
-/* Disable dataplane thread during live migration since it does not
- * update the dirty memory bitmap yet.
- */
-static void virtio_scsi_migration_state_changed(Notifier *notifier, void *data)
-{
-    VirtIOSCSI *s = container_of(notifier, VirtIOSCSI,
-                                 migration_state_notifier);
-    MigrationState *mig = data;
-
-    if (migration_in_setup(mig)) {
-        if (!s->dataplane_started) {
-            return;
-        }
-        virtio_scsi_dataplane_stop(s);
-        s->dataplane_disabled = true;
-    } else if (migration_has_finished(mig) ||
-               migration_has_failed(mig)) {
-        if (s->dataplane_started) {
-            return;
-        }
-        blk_drain_all(); /* complete in-flight non-dataplane requests */
-        s->dataplane_disabled = false;
-    }
-}
-
 static void virtio_scsi_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -903,8 +874,6 @@ static void virtio_scsi_device_realize(DeviceState *dev, Error **errp)
 
     register_savevm(dev, "virtio-scsi", virtio_scsi_id++, 1,
                     virtio_scsi_save, virtio_scsi_load, s);
-    s->migration_state_notifier.notify = virtio_scsi_migration_state_changed;
-    add_migration_state_change_notifier(&s->migration_state_notifier);
 
     error_setg(&s->blocker, "block device is in use by data plane");
 }
@@ -935,8 +904,6 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
     error_free(s->blocker);
 
     unregister_savevm(dev, "virtio-scsi", s);
-    remove_migration_state_change_notifier(&s->migration_state_notifier);
-
     virtio_scsi_common_unrealize(dev, errp);
 }
 
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index d9ae76c..b2b98b7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -22,7 +22,6 @@
 #include "hw/pci/pci.h"
 #include "hw/scsi/scsi.h"
 #include "sysemu/iothread.h"
-#include "hw/virtio/dataplane/vring.h"
 
 #define TYPE_VIRTIO_SCSI_COMMON "virtio-scsi-common"
 #define VIRTIO_SCSI_COMMON(obj) \
@@ -58,13 +57,6 @@ struct VirtIOSCSIConf {
 
 struct VirtIOSCSI;
 
-typedef struct {
-    struct VirtIOSCSI *parent;
-    Vring vring;
-    EventNotifier host_notifier;
-    EventNotifier guest_notifier;
-} VirtIOSCSIVring;
-
 typedef struct VirtIOSCSICommon {
     VirtIODevice parent_obj;
     VirtIOSCSIConf conf;
@@ -86,18 +78,12 @@ typedef struct VirtIOSCSI {
     /* Fields for dataplane below */
     AioContext *ctx; /* one iothread per virtio-scsi-pci for now */
 
-    /* Vring is used instead of vq in dataplane code, because of the underlying
-     * memory layer thread safety */
-    VirtIOSCSIVring *ctrl_vring;
-    VirtIOSCSIVring *event_vring;
-    VirtIOSCSIVring **cmd_vrings;
     bool dataplane_started;
     bool dataplane_starting;
     bool dataplane_stopping;
     bool dataplane_disabled;
     bool dataplane_fenced;
     Error *blocker;
-    Notifier migration_state_notifier;
     uint32_t host_features;
 } VirtIOSCSI;
 
@@ -113,9 +99,6 @@ typedef struct VirtIOSCSIReq {
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Set by dataplane code. */
-    VirtIOSCSIVring *vring;
-
     union {
         /* Used for two-stage request submission */
         QTAILQ_ENTRY(VirtIOSCSIReq) next;
@@ -158,8 +141,6 @@ void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
 void virtio_scsi_set_iothread(VirtIOSCSI *s, IOThread *iothread);
 void virtio_scsi_dataplane_start(VirtIOSCSI *s);
 void virtio_scsi_dataplane_stop(VirtIOSCSI *s);
-void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req);
-VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
-                                         VirtIOSCSIVring *vring);
+void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req);
 
 #endif /* _QEMU_VIRTIO_SCSI_H */
-- 
1.8.3.1

* [Qemu-devel] [PATCH 16/40] vring: remove
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (14 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/Makefile.objs                       |   1 -
 hw/virtio/dataplane/Makefile.objs             |   1 -
 hw/virtio/dataplane/vring.c                   | 547 --------------------------
 include/hw/virtio/dataplane/vring-accessors.h |  75 ----
 include/hw/virtio/dataplane/vring.h           |  51 ---
 trace-events                                  |   3 -
 6 files changed, 678 deletions(-)
 delete mode 100644 hw/virtio/dataplane/Makefile.objs
 delete mode 100644 hw/virtio/dataplane/vring.c
 delete mode 100644 include/hw/virtio/dataplane/vring-accessors.h
 delete mode 100644 include/hw/virtio/dataplane/vring.h

diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 19b224a..3e2b175 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -2,7 +2,6 @@ common-obj-y += virtio-rng.o
 common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
 common-obj-y += virtio-bus.o
 common-obj-y += virtio-mmio.o
-obj-$(CONFIG_VIRTIO) += dataplane/
 
 obj-y += virtio.o virtio-balloon.o 
 obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
diff --git a/hw/virtio/dataplane/Makefile.objs b/hw/virtio/dataplane/Makefile.objs
deleted file mode 100644
index 753a9ca..0000000
--- a/hw/virtio/dataplane/Makefile.objs
+++ /dev/null
@@ -1 +0,0 @@
-obj-y += vring.o
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
deleted file mode 100644
index 4e1a299..0000000
--- a/hw/virtio/dataplane/vring.c
+++ /dev/null
@@ -1,547 +0,0 @@
-/* Copyright 2012 Red Hat, Inc.
- * Copyright IBM, Corp. 2012
- *
- * Based on Linux 2.6.39 vhost code:
- * Copyright (C) 2009 Red Hat, Inc.
- * Copyright (C) 2006 Rusty Russell IBM Corporation
- *
- * Author: Michael S. Tsirkin <mst@redhat.com>
- *         Stefan Hajnoczi <stefanha@redhat.com>
- *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
- *
- * This work is licensed under the terms of the GNU GPL, version 2.
- */
-
-#include "trace.h"
-#include "hw/hw.h"
-#include "exec/memory.h"
-#include "exec/address-spaces.h"
-#include "hw/virtio/virtio-access.h"
-#include "hw/virtio/dataplane/vring.h"
-#include "hw/virtio/dataplane/vring-accessors.h"
-#include "qemu/error-report.h"
-
-/* vring_map can be coupled with vring_unmap or (if you still have the
- * value returned in *mr) memory_region_unref.
- * Returns NULL on failure.
- * Callers that can handle a partial mapping must supply mapped_len pointer to
- * get the actual length mapped.
- * Passing mapped_len == NULL requires either a full mapping or a failure.
- */
-static void *vring_map(MemoryRegion **mr, hwaddr phys,
-                       hwaddr len, hwaddr *mapped_len,
-                       bool is_write)
-{
-    MemoryRegionSection section = memory_region_find(get_system_memory(), phys, len);
-    uint64_t size;
-
-    if (!section.mr) {
-        goto out;
-    }
-
-    size = int128_get64(section.size);
-    assert(size);
-
-    /* Passing mapped_len == NULL requires either a full mapping or a failure. */
-    if (!mapped_len && size < len) {
-        goto out;
-    }
-
-    if (is_write && section.readonly) {
-        goto out;
-    }
-    if (!memory_region_is_ram(section.mr)) {
-        goto out;
-    }
-
-    /* Ignore regions with dirty logging, we cannot mark them dirty */
-    if (memory_region_get_dirty_log_mask(section.mr)) {
-        goto out;
-    }
-
-    if (mapped_len) {
-        *mapped_len = MIN(size, len);
-    }
-
-    *mr = section.mr;
-    return memory_region_get_ram_ptr(section.mr) + section.offset_within_region;
-
-out:
-    memory_region_unref(section.mr);
-    *mr = NULL;
-    return NULL;
-}
-
-static void vring_unmap(void *buffer, bool is_write)
-{
-    ram_addr_t addr;
-    MemoryRegion *mr;
-
-    mr = qemu_ram_addr_from_host(buffer, &addr);
-    memory_region_unref(mr);
-}
-
-/* Map the guest's vring to host memory */
-bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
-{
-    struct vring *vr = &vring->vr;
-    hwaddr addr;
-    hwaddr size;
-    void *ptr;
-
-    vring->broken = false;
-    vr->num = virtio_queue_get_num(vdev, n);
-
-    addr = virtio_queue_get_desc_addr(vdev, n);
-    size = virtio_queue_get_desc_size(vdev, n);
-    /* Map the descriptor area as read only */
-    ptr = vring_map(&vring->mr_desc, addr, size, NULL, false);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring desc "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_desc;
-    }
-    vr->desc = ptr;
-
-    addr = virtio_queue_get_avail_addr(vdev, n);
-    size = virtio_queue_get_avail_size(vdev, n);
-    /* Add the size of the used_event_idx */
-    size += sizeof(uint16_t);
-    /* Map the driver area as read only */
-    ptr = vring_map(&vring->mr_avail, addr, size, NULL, false);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring avail "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_avail;
-    }
-    vr->avail = ptr;
-
-    addr = virtio_queue_get_used_addr(vdev, n);
-    size = virtio_queue_get_used_size(vdev, n);
-    /* Add the size of the avail_event_idx */
-    size += sizeof(uint16_t);
-    /* Map the device area as read-write */
-    ptr = vring_map(&vring->mr_used, addr, size, NULL, true);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring used "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_used;
-    }
-    vr->used = ptr;
-
-    vring->last_avail_idx = virtio_queue_get_last_avail_idx(vdev, n);
-    vring->last_used_idx = vring_get_used_idx(vdev, vring);
-    vring->signalled_used = 0;
-    vring->signalled_used_valid = false;
-
-    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
-                      vring->vr.desc, vring->vr.avail, vring->vr.used);
-    return true;
-
-out_err_used:
-    memory_region_unref(vring->mr_avail);
-out_err_avail:
-    memory_region_unref(vring->mr_desc);
-out_err_desc:
-    vring->broken = true;
-    return false;
-}
-
-void vring_teardown(Vring *vring, VirtIODevice *vdev, int n)
-{
-    virtio_queue_set_last_avail_idx(vdev, n, vring->last_avail_idx);
-    virtio_queue_invalidate_signalled_used(vdev, n);
-
-    memory_region_unref(vring->mr_desc);
-    memory_region_unref(vring->mr_avail);
-    memory_region_unref(vring->mr_used);
-}
-
-/* Disable guest->host notifies */
-void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
-{
-    if (!virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_set_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
-    }
-}
-
-/* Enable guest->host notifies
- *
- * Return true if the vring is empty, false if there are more requests.
- */
-void vring_enable_notification(VirtIODevice *vdev, Vring *vring)
-{
-    if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
-    } else {
-        vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
-    }
-    smp_mb(); /* ensure update is seen before reading avail_idx */
-}
-
-/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
-bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
-{
-    uint16_t old, new;
-    bool v;
-    /* Flush out used index updates. This is paired
-     * with the barrier that the Guest executes when enabling
-     * interrupts. */
-    smp_mb();
-
-    if (virtio_vdev_has_feature(vdev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
-        unlikely(!vring_more_avail(vdev, vring))) {
-        return true;
-    }
-
-    if (!virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        return !(vring_get_avail_flags(vdev, vring) &
-                 VRING_AVAIL_F_NO_INTERRUPT);
-    }
-    old = vring->signalled_used;
-    v = vring->signalled_used_valid;
-    new = vring->signalled_used = vring->last_used_idx;
-    vring->signalled_used_valid = true;
-
-    if (unlikely(!v)) {
-        return true;
-    }
-
-    return vring_need_event(virtio_tswap16(vdev, vring_used_event(&vring->vr)),
-                            new, old);
-}
-
-typedef struct VirtQueueCurrentElement {
-    unsigned in_num;
-    unsigned out_num;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
-} VirtQueueCurrentElement;
-
-static int get_desc(Vring *vring, VirtQueueCurrentElement *elem,
-                    struct vring_desc *desc)
-{
-    unsigned *num;
-    struct iovec *iov;
-    hwaddr *addr;
-    MemoryRegion *mr;
-    hwaddr len;
-
-    if (desc->flags & VRING_DESC_F_WRITE) {
-        num = &elem->in_num;
-        iov = &elem->iov[elem->out_num + *num];
-        addr = &elem->addr[elem->out_num + *num];
-    } else {
-        num = &elem->out_num;
-        iov = &elem->iov[*num];
-        addr = &elem->addr[*num];
-
-        /* If it's an output descriptor, they're all supposed
-         * to come before any input descriptors. */
-        if (unlikely(elem->in_num)) {
-            error_report("Descriptor has out after in");
-            return -EFAULT;
-        }
-    }
-
-    while (desc->len) {
-        /* Stop for now if there are not enough iovecs available. */
-        if (*num >= VIRTQUEUE_MAX_SIZE) {
-            error_report("Invalid SG num: %u", *num);
-            return -EFAULT;
-        }
-
-        iov->iov_base = vring_map(&mr, desc->addr, desc->len, &len,
-                                  desc->flags & VRING_DESC_F_WRITE);
-        if (!iov->iov_base) {
-            error_report("Failed to map descriptor addr %#" PRIx64 " len %u",
-                         (uint64_t)desc->addr, desc->len);
-            return -EFAULT;
-        }
-
-        /* The MemoryRegion is looked up again and unref'ed later, leave the
-         * ref in place.  */
-        (iov++)->iov_len = len;
-        *addr++ = desc->addr;
-        desc->len -= len;
-        desc->addr += len;
-        *num += 1;
-    }
-
-    return 0;
-}
-
-static void copy_in_vring_desc(VirtIODevice *vdev,
-                               const struct vring_desc *guest,
-                               struct vring_desc *host)
-{
-    host->addr = virtio_ldq_p(vdev, &guest->addr);
-    host->len = virtio_ldl_p(vdev, &guest->len);
-    host->flags = virtio_lduw_p(vdev, &guest->flags);
-    host->next = virtio_lduw_p(vdev, &guest->next);
-}
-
-static bool read_vring_desc(VirtIODevice *vdev,
-                            hwaddr guest,
-                            struct vring_desc *host)
-{
-    if (address_space_read(&address_space_memory, guest, MEMTXATTRS_UNSPECIFIED,
-                           (uint8_t *)host, sizeof *host)) {
-        return false;
-    }
-    host->addr = virtio_tswap64(vdev, host->addr);
-    host->len = virtio_tswap32(vdev, host->len);
-    host->flags = virtio_tswap16(vdev, host->flags);
-    host->next = virtio_tswap16(vdev, host->next);
-    return true;
-}
-
-/* This is stolen from linux/drivers/vhost/vhost.c. */
-static int get_indirect(VirtIODevice *vdev, Vring *vring,
-                        VirtQueueCurrentElement *cur_elem,
-                        struct vring_desc *indirect)
-{
-    struct vring_desc desc;
-    unsigned int i = 0, count, found = 0;
-    int ret;
-
-    /* Sanity check */
-    if (unlikely(indirect->len % sizeof(desc))) {
-        error_report("Invalid length in indirect descriptor: "
-                     "len %#x not multiple of %#zx",
-                     indirect->len, sizeof(desc));
-        vring->broken = true;
-        return -EFAULT;
-    }
-
-    count = indirect->len / sizeof(desc);
-    /* Buffers are chained via a 16 bit next field, so
-     * we can have at most 2^16 of these. */
-    if (unlikely(count > USHRT_MAX + 1)) {
-        error_report("Indirect buffer length too big: %d", indirect->len);
-        vring->broken = true;
-        return -EFAULT;
-    }
-
-    do {
-        /* Translate indirect descriptor */
-        if (!read_vring_desc(vdev, indirect->addr + found * sizeof(desc),
-                             &desc)) {
-            error_report("Failed to read indirect descriptor "
-                         "addr %#" PRIx64 " len %zu",
-                         (uint64_t)indirect->addr + found * sizeof(desc),
-                         sizeof(desc));
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        /* Ensure descriptor has been loaded before accessing fields */
-        barrier(); /* read_barrier_depends(); */
-
-        if (unlikely(++found > count)) {
-            error_report("Loop detected: last one at %u "
-                         "indirect size %u", i, count);
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
-            error_report("Nested indirect descriptor");
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        ret = get_desc(vring, cur_elem, &desc);
-        if (ret < 0) {
-            vring->broken |= (ret == -EFAULT);
-            return ret;
-        }
-        i = desc.next;
-    } while (desc.flags & VRING_DESC_F_NEXT);
-    return 0;
-}
-
-static void vring_unmap_element(VirtQueueElement *elem)
-{
-    int i;
-
-    /* This assumes that the iovecs, if changed, are never moved past
-     * the end of the valid area.  This is true if iovec manipulations
-     * are done with iov_discard_front and iov_discard_back.
-     */
-    for (i = 0; i < elem->out_num; i++) {
-        vring_unmap(elem->out_sg[i].iov_base, false);
-    }
-
-    for (i = 0; i < elem->in_num; i++) {
-        vring_unmap(elem->in_sg[i].iov_base, true);
-    }
-}
-
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found.  A negative code is
- * returned on error.
- *
- * Stolen from linux/drivers/vhost/vhost.c.
- */
-void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
-{
-    struct vring_desc desc;
-    unsigned int i, head, found = 0, num = vring->vr.num;
-    uint16_t avail_idx, last_avail_idx;
-    VirtQueueCurrentElement cur_elem;
-    VirtQueueElement *elem = NULL;
-    int ret;
-
-    /* If there was a fatal error then refuse operation */
-    if (vring->broken) {
-        ret = -EFAULT;
-        goto out;
-    }
-
-    cur_elem.in_num = cur_elem.out_num = 0;
-
-    /* Check it isn't doing very strange things with descriptor numbers. */
-    last_avail_idx = vring->last_avail_idx;
-    avail_idx = vring_get_avail_idx(vdev, vring);
-    barrier(); /* load indices now and not again later */
-
-    if (unlikely((uint16_t)(avail_idx - last_avail_idx) > num)) {
-        error_report("Guest moved used index from %u to %u",
-                     last_avail_idx, avail_idx);
-        ret = -EFAULT;
-        goto out;
-    }
-
-    /* If there's nothing new since last we looked. */
-    if (avail_idx == last_avail_idx) {
-        ret = -EAGAIN;
-        goto out;
-    }
-
-    /* Only get avail ring entries after they have been exposed by guest. */
-    smp_rmb();
-
-    /* Grab the next descriptor number they're advertising, and increment
-     * the index we've seen. */
-    head = vring_get_avail_ring(vdev, vring, last_avail_idx % num);
-
-    /* If their number is silly, that's an error. */
-    if (unlikely(head >= num)) {
-        error_report("Guest says index %u > %u is available", head, num);
-        ret = -EFAULT;
-        goto out;
-    }
-
-    i = head;
-    do {
-        if (unlikely(i >= num)) {
-            error_report("Desc index is %u > %u, head = %u", i, num, head);
-            ret = -EFAULT;
-            goto out;
-        }
-        if (unlikely(++found > num)) {
-            error_report("Loop detected: last one at %u vq size %u head %u",
-                         i, num, head);
-            ret = -EFAULT;
-            goto out;
-        }
-        copy_in_vring_desc(vdev, &vring->vr.desc[i], &desc);
-
-        /* Ensure descriptor is loaded before accessing fields */
-        barrier();
-
-        if (desc.flags & VRING_DESC_F_INDIRECT) {
-            ret = get_indirect(vdev, vring, &cur_elem, &desc);
-            if (ret < 0) {
-                goto out;
-            }
-            continue;
-        }
-
-        ret = get_desc(vring, &cur_elem, &desc);
-        if (ret < 0) {
-            goto out;
-        }
-
-        i = desc.next;
-    } while (desc.flags & VRING_DESC_F_NEXT);
-
-    /* On success, increment avail index. */
-    vring->last_avail_idx++;
-    if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_avail_event(&vring->vr) =
-            virtio_tswap16(vdev, vring->last_avail_idx);
-    }
-
-    /* Now copy what we have collected and mapped */
-    elem = virtqueue_alloc_element(sz, cur_elem.out_num, cur_elem.in_num);
-    elem->index = head;
-    for (i = 0; i < cur_elem.out_num; i++) {
-        elem->out_addr[i] = cur_elem.addr[i];
-        elem->out_sg[i] = cur_elem.iov[i];
-    }
-    for (i = 0; i < cur_elem.in_num; i++) {
-        elem->in_addr[i] = cur_elem.addr[cur_elem.out_num + i];
-        elem->in_sg[i] = cur_elem.iov[cur_elem.out_num + i];
-    }
-
-    return elem;
-
-out:
-    assert(ret < 0);
-    if (ret == -EFAULT) {
-        vring->broken = true;
-    }
-
-    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
-        vring_unmap(cur_elem.iov[i].iov_base, false);
-    }
-
-    g_free(elem);
-    return NULL;
-}
-
-/* After we've used one of their buffers, we tell them about it.
- *
- * Stolen from linux/drivers/vhost/vhost.c.
- */
-void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-                int len)
-{
-    unsigned int head = elem->index;
-    uint16_t new;
-
-    vring_unmap_element(elem);
-
-    /* Don't touch vring if a fatal error occurred */
-    if (vring->broken) {
-        return;
-    }
-
-    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
-     * next entry in that used ring. */
-    vring_set_used_ring_id(vdev, vring, vring->last_used_idx % vring->vr.num,
-                           head);
-    vring_set_used_ring_len(vdev, vring, vring->last_used_idx % vring->vr.num,
-                            len);
-
-    /* Make sure buffer is written before we update index. */
-    smp_wmb();
-
-    new = ++vring->last_used_idx;
-    vring_set_used_idx(vdev, vring, new);
-    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
-        vring->signalled_used_valid = false;
-    }
-}
diff --git a/include/hw/virtio/dataplane/vring-accessors.h b/include/hw/virtio/dataplane/vring-accessors.h
deleted file mode 100644
index 815c19b..0000000
--- a/include/hw/virtio/dataplane/vring-accessors.h
+++ /dev/null
@@ -1,75 +0,0 @@
-#ifndef VRING_ACCESSORS_H
-#define VRING_ACCESSORS_H
-
-#include "standard-headers/linux/virtio_ring.h"
-#include "hw/virtio/virtio.h"
-#include "hw/virtio/virtio-access.h"
-
-static inline uint16_t vring_get_used_idx(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.used->idx);
-}
-
-static inline void vring_set_used_idx(VirtIODevice *vdev, Vring *vring,
-                                      uint16_t idx)
-{
-    vring->vr.used->idx = virtio_tswap16(vdev, idx);
-}
-
-static inline uint16_t vring_get_avail_idx(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->idx);
-}
-
-static inline uint16_t vring_get_avail_ring(VirtIODevice *vdev, Vring *vring,
-                                            int i)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->ring[i]);
-}
-
-static inline void vring_set_used_ring_id(VirtIODevice *vdev, Vring *vring,
-                                          int i, uint32_t id)
-{
-    vring->vr.used->ring[i].id = virtio_tswap32(vdev, id);
-}
-
-static inline void vring_set_used_ring_len(VirtIODevice *vdev, Vring *vring,
-                                          int i, uint32_t len)
-{
-    vring->vr.used->ring[i].len = virtio_tswap32(vdev, len);
-}
-
-static inline uint16_t vring_get_used_flags(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.used->flags);
-}
-
-static inline uint16_t vring_get_avail_flags(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->flags);
-}
-
-static inline void vring_set_used_flags(VirtIODevice *vdev, Vring *vring,
-                                        uint16_t flags)
-{
-    vring->vr.used->flags |= virtio_tswap16(vdev, flags);
-}
-
-static inline void vring_clear_used_flags(VirtIODevice *vdev, Vring *vring,
-                                          uint16_t flags)
-{
-    vring->vr.used->flags &= virtio_tswap16(vdev, ~flags);
-}
-
-static inline unsigned int vring_get_num(Vring *vring)
-{
-    return vring->vr.num;
-}
-
-/* Are there more descriptors available? */
-static inline bool vring_more_avail(VirtIODevice *vdev, Vring *vring)
-{
-    return vring_get_avail_idx(vdev, vring) != vring->last_avail_idx;
-}
-
-#endif
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
deleted file mode 100644
index e1c2a65..0000000
--- a/include/hw/virtio/dataplane/vring.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* Copyright 2012 Red Hat, Inc. and/or its affiliates
- * Copyright IBM, Corp. 2012
- *
- * Based on Linux 2.6.39 vhost code:
- * Copyright (C) 2009 Red Hat, Inc.
- * Copyright (C) 2006 Rusty Russell IBM Corporation
- *
- * Author: Michael S. Tsirkin <mst@redhat.com>
- *         Stefan Hajnoczi <stefanha@redhat.com>
- *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
- *
- * This work is licensed under the terms of the GNU GPL, version 2.
- */
-
-#ifndef VRING_H
-#define VRING_H
-
-#include "qemu-common.h"
-#include "standard-headers/linux/virtio_ring.h"
-#include "hw/virtio/virtio.h"
-
-typedef struct {
-    MemoryRegion *mr_desc;          /* memory region for the vring desc */
-    MemoryRegion *mr_avail;         /* memory region for the vring avail */
-    MemoryRegion *mr_used;          /* memory region for the vring used */
-    struct vring vr;                /* virtqueue vring mapped to host memory */
-    uint16_t last_avail_idx;        /* last processed avail ring index */
-    uint16_t last_used_idx;         /* last processed used ring index */
-    uint16_t signalled_used;        /* EVENT_IDX state */
-    bool signalled_used_valid;
-    bool broken;                    /* was there a fatal error? */
-} Vring;
-
-/* Fail future vring_pop() and vring_push() calls until reset */
-static inline void vring_set_broken(Vring *vring)
-{
-    vring->broken = true;
-}
-
-bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
-void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
-void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
-void vring_enable_notification(VirtIODevice *vdev, Vring *vring);
-bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
-void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
-void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-                int len);
-
-#endif /* VRING_H */
diff --git a/trace-events b/trace-events
index 0b0ff02..c230d02 100644
--- a/trace-events
+++ b/trace-events
@@ -125,9 +125,6 @@ virtio_blk_data_plane_start(void *s) "dataplane %p"
 virtio_blk_data_plane_stop(void *s) "dataplane %p"
 virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
 
-# hw/virtio/dataplane/vring.c
-vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
-
 # thread-pool.c
 thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
 thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-- 
1.8.3.1

* [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (15 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is the first step towards fine-grained critical sections in
dataplane threads.  It resolves the lock ordering problem between
address_space_* functions (which need the BQL when doing MMIO, even
after RCU-based dispatch is complete) and the AioContext.

Because AioContext does not use contention callbacks anymore, the
unit test has to be changed.

Previously applied as a0710f7995f914e3044e5899bd8ff6c43c62f916 and
then reverted.
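
One practical consequence: a thread that wants to manipulate objects
bound to an IOThread's AioContext can now simply take the lock, because
the IOThread no longer holds it while blocked in aio_poll().  A minimal
sketch of the pattern (iothread_get_aio_context() is assumed as the
accessor; this is not literal code from the patch):

AioContext *ctx = iothread_get_aio_context(iothread);

/* The IOThread may be sleeping inside aio_poll(ctx, true); it polls
 * without holding the lock, so this acquire no longer needs the old
 * rfifolock contention callback to kick the event loop. */
aio_context_acquire(ctx);
/* ... touch timers, BHs or block devices bound to ctx ... */
aio_context_release(ctx);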

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                     | 22 +++-------------------
 docs/multiple-iothreads.txt | 25 +++++++++++++++----------
 include/block/aio.h         |  3 ---
 iothread.c                  | 11 ++---------
 tests/test-aio.c            | 19 +++++++++++--------
 5 files changed, 31 insertions(+), 49 deletions(-)

diff --git a/async.c b/async.c
index e106072..f70d23a 100644
--- a/async.c
+++ b/async.c
@@ -84,8 +84,8 @@ int aio_bh_poll(AioContext *ctx)
          * aio_notify again if necessary.
          */
         if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
-            /* Idle BHs and the notify BH don't count as progress */
-            if (!bh->idle && bh != ctx->notify_dummy_bh) {
+            /* Idle BHs don't count as progress */
+            if (!bh->idle) {
                 ret = 1;
             }
             bh->idle = 0;
@@ -237,7 +237,6 @@ aio_ctx_finalize(GSource     *source)
 {
     AioContext *ctx = (AioContext *) source;
 
-    qemu_bh_delete(ctx->notify_dummy_bh);
     thread_pool_free(ctx->thread_pool);
 
     qemu_mutex_lock(&ctx->bh_lock);
@@ -304,19 +303,6 @@ static void aio_timerlist_notify(void *opaque)
     aio_notify(opaque);
 }
 
-static void aio_rfifolock_cb(void *opaque)
-{
-    AioContext *ctx = opaque;
-
-    /* Kick owner thread in case they are blocked in aio_poll() */
-    qemu_bh_schedule(ctx->notify_dummy_bh);
-}
-
-static void notify_dummy_bh(void *opaque)
-{
-    /* Do nothing, we were invoked just to force the event loop to iterate */
-}
-
 static void event_notifier_dummy_cb(EventNotifier *e)
 {
 }
@@ -345,11 +331,9 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
+    rfifolock_init(&ctx->lock, NULL, NULL);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
-    ctx->notify_dummy_bh = aio_bh_new(ctx, notify_dummy_bh, NULL);
-
     return ctx;
 fail:
     g_source_destroy(&ctx->source);
diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 40b8419..723cc7e 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -105,13 +105,10 @@ a BH in the target AioContext beforehand and then call qemu_bh_schedule().  No
 acquire/release or locking is needed for the qemu_bh_schedule() call.  But be
 sure to acquire the AioContext for aio_bh_new() if necessary.
 
-The relationship between AioContext and the block layer
--------------------------------------------------------
-The AioContext originates from the QEMU block layer because it provides a
-scoped way of running event loop iterations until all work is done.  This
-feature is used to complete all in-flight block I/O requests (see
-bdrv_drain_all()).  Nowadays AioContext is a generic event loop that can be
-used by any QEMU subsystem.
+AioContext and the block layer
+------------------------------
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
 
 The block layer has support for AioContext integrated.  Each BlockDriverState
 is associated with an AioContext using bdrv_set_aio_context() and
@@ -122,9 +119,17 @@ Block layer code must therefore expect to run in an IOThread and avoid using
 old APIs that implicitly use the main loop.  See the "How to program for
 IOThreads" above for information on how to do that.
 
-If main loop code such as a QMP function wishes to access a BlockDriverState it
-must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
-IOThread does not run in parallel.
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically uses bdrv_drain() to ensure that
+past requests from the guest are completed.  When a block device is
+running in an IOThread, the IOThread can also process requests from
+the guest (via ioeventfd).  These requests have to be blocked with
+aio_disable_clients() before calling bdrv_drain().  You can then reenable
+guest requests with aio_enable_clients() before finally releasing the
+AioContext and completing the monitor command.
 
 Long-running jobs (usually in the form of coroutines) are best scheduled in the
 BlockDriverState's AioContext to avoid the need to acquire/release around each
diff --git a/include/block/aio.h b/include/block/aio.h
index e086e3b..b41e291 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -114,9 +114,6 @@ struct AioContext {
     bool notified;
     EventNotifier notifier;
 
-    /* Scheduling this BH forces the event loop it iterate */
-    QEMUBH *notify_dummy_bh;
-
     /* Thread pool for performing work and receiving completion callbacks */
     struct ThreadPool *thread_pool;
 
diff --git a/iothread.c b/iothread.c
index 1b8c2bb..df30f5e 100644
--- a/iothread.c
+++ b/iothread.c
@@ -30,7 +30,6 @@ typedef ObjectClass IOThreadClass;
 static void *iothread_run(void *opaque)
 {
     IOThread *iothread = opaque;
-    bool blocking;
 
     rcu_register_thread();
 
@@ -39,14 +38,8 @@ static void *iothread_run(void *opaque)
     qemu_cond_signal(&iothread->init_done_cond);
     qemu_mutex_unlock(&iothread->init_done_lock);
 
-    while (!iothread->stopping) {
-        aio_context_acquire(iothread->ctx);
-        blocking = true;
-        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
-            /* Progress was made, keep going */
-            blocking = false;
-        }
-        aio_context_release(iothread->ctx);
+    while (!atomic_read(&iothread->stopping)) {
+        aio_poll(iothread->ctx, true);
     }
 
     rcu_unregister_thread();
diff --git a/tests/test-aio.c b/tests/test-aio.c
index 1623803..cb64e3b 100644
--- a/tests/test-aio.c
+++ b/tests/test-aio.c
@@ -99,6 +99,7 @@ static void event_ready_cb(EventNotifier *e)
 
 typedef struct {
     QemuMutex start_lock;
+    EventNotifier notifier;
     bool thread_acquired;
 } AcquireTestData;
 
@@ -110,6 +111,8 @@ static void *test_acquire_thread(void *opaque)
     qemu_mutex_lock(&data->start_lock);
     qemu_mutex_unlock(&data->start_lock);
 
+    g_usleep(500000);
+    event_notifier_set(&data->notifier);
     aio_context_acquire(ctx);
     aio_context_release(ctx);
 
@@ -124,20 +127,19 @@ static void set_event_notifier(AioContext *ctx, EventNotifier *notifier,
     aio_set_event_notifier(ctx, notifier, false, handler);
 }
 
-static void dummy_notifier_read(EventNotifier *unused)
+static void dummy_notifier_read(EventNotifier *n)
 {
-    g_assert(false); /* should never be invoked */
+    event_notifier_test_and_clear(n);
 }
 
 static void test_acquire(void)
 {
     QemuThread thread;
-    EventNotifier notifier;
     AcquireTestData data;
 
     /* Dummy event notifier ensures aio_poll() will block */
-    event_notifier_init(&notifier, false);
-    set_event_notifier(ctx, &notifier, dummy_notifier_read);
+    event_notifier_init(&data.notifier, false);
+    set_event_notifier(ctx, &data.notifier, dummy_notifier_read);
     g_assert(!aio_poll(ctx, false)); /* consume aio_notify() */
 
     qemu_mutex_init(&data.start_lock);
@@ -151,12 +153,13 @@ static void test_acquire(void)
     /* Block in aio_poll(), let other thread kick us and acquire context */
     aio_context_acquire(ctx);
     qemu_mutex_unlock(&data.start_lock); /* let the thread run */
-    g_assert(!aio_poll(ctx, true));
+    g_assert(aio_poll(ctx, true));
+    g_assert(!data.thread_acquired);
     aio_context_release(ctx);
 
     qemu_thread_join(&thread);
-    set_event_notifier(ctx, &notifier, NULL);
-    event_notifier_cleanup(&notifier);
+    set_event_notifier(ctx, &data.notifier, NULL);
+    event_notifier_cleanup(&data.notifier);
 
     g_assert(data.thread_acquired);
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (16 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

GRecMutex is new in glib 2.32, so we cannot use it.  Introduce
a recursive mutex in qemu-thread instead, which will replace
RFifoLock.
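
A usage sketch (illustration only, not part of the patch): unlike
QemuMutex, the same thread can take a QemuRecMutex again while it
already holds it.

    static QemuRecMutex lock;    /* qemu_rec_mutex_init(&lock) at setup */

    static void inner(void)
    {
        qemu_rec_mutex_lock(&lock);      /* nests: we already own it */
        /* ... */
        qemu_rec_mutex_unlock(&lock);
    }

    static void outer(void)
    {
        qemu_rec_mutex_lock(&lock);
        inner();                         /* would deadlock with QemuMutex */
        qemu_rec_mutex_unlock(&lock);
    }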

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/thread-posix.h |  6 ++++++
 include/qemu/thread-win32.h | 10 ++++++++++
 include/qemu/thread.h       |  3 +++
 util/qemu-thread-posix.c    | 13 +++++++++++++
 util/qemu-thread-win32.c    | 25 +++++++++++++++++++++++++
 5 files changed, 57 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index eb5c7a1..2fb6b90 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -3,6 +3,12 @@
 #include "pthread.h"
 #include <semaphore.h>
 
+typedef QemuMutex QemuRecMutex;
+#define qemu_rec_mutex_destroy qemu_mutex_destroy
+#define qemu_rec_mutex_lock qemu_mutex_lock
+#define qemu_rec_mutex_trylock qemu_mutex_trylock
+#define qemu_rec_mutex_unlock qemu_mutex_unlock
+
 struct QemuMutex {
     pthread_mutex_t lock;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index 385ff5f..50999c5 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -7,6 +7,16 @@ struct QemuMutex {
     LONG owner;
 };
 
+typedef struct QemuRecMutex QemuRecMutex;
+struct QemuRecMutex {
+    CRITICAL_SECTION lock;
+};
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex);
+void qemu_rec_mutex_lock(QemuRecMutex *mutex);
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex);
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex);
+
 struct QemuCond {
     LONG waiters, target;
     HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 5114ec8..981f3dc 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -25,6 +25,9 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+/* Prototypes for other functions are in thread-posix.h/thread-win32.h.  */
+void qemu_rec_mutex_init(QemuRecMutex *mutex);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index dbd8094..4cc9f13 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -89,6 +89,19 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    int err;
+    pthread_mutexattr_t attr;
+
+    pthread_mutexattr_init(&attr);
+    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
+    err = pthread_mutex_init(&mutex->lock, &attr);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 6cdd553..018b73e 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -80,6 +80,31 @@ void qemu_mutex_unlock(QemuMutex *mutex)
     LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    InitializeCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex)
+{
+    DeleteCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_lock(QemuRecMutex *mutex)
+{
+    EnterCriticalSection(&mutex->lock);
+}
+
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex)
+{
+    return !TryEnterCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex)
+{
+    LeaveCriticalSection(&mutex->lock);
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     memset(cond, 0, sizeof(*cond));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (17 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

It is simpler and a bit faster, and QEMU does not need the contention
callbacks (and thus the fairness) anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                  |  8 ++---
 include/block/aio.h      |  3 +-
 include/qemu/rfifolock.h | 54 ----------------------------
 tests/.gitignore         |  1 -
 tests/Makefile           |  2 --
 tests/test-rfifolock.c   | 91 ------------------------------------------------
 util/Makefile.objs       |  1 -
 util/rfifolock.c         | 78 -----------------------------------------
 8 files changed, 5 insertions(+), 233 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

diff --git a/async.c b/async.c
index f70d23a..567c3b9 100644
--- a/async.c
+++ b/async.c
@@ -253,7 +253,7 @@ aio_ctx_finalize(GSource     *source)
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
-    rfifolock_destroy(&ctx->lock);
+    qemu_rec_mutex_destroy(&ctx->lock);
     qemu_mutex_destroy(&ctx->bh_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
@@ -331,7 +331,7 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, NULL, NULL);
+    qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
     return ctx;
@@ -352,10 +352,10 @@ void aio_context_unref(AioContext *ctx)
 
 void aio_context_acquire(AioContext *ctx)
 {
-    rfifolock_lock(&ctx->lock);
+    qemu_rec_mutex_lock(&ctx->lock);
 }
 
 void aio_context_release(AioContext *ctx)
 {
-    rfifolock_unlock(&ctx->lock);
+    qemu_rec_mutex_unlock(&ctx->lock);
 }
diff --git a/include/block/aio.h b/include/block/aio.h
index b41e291..b416621 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -19,7 +19,6 @@
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
-#include "qemu/rfifolock.h"
 #include "qemu/timer.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
@@ -52,7 +51,7 @@ struct AioContext {
     GSource source;
 
     /* Protects all fields from multi-threaded access */
-    RFifoLock lock;
+    QemuRecMutex lock;
 
     /* The list of registered AIO handlers */
     QLIST_HEAD(, AioHandler) aio_handlers;
diff --git a/include/qemu/rfifolock.h b/include/qemu/rfifolock.h
deleted file mode 100644
index b23ab53..0000000
--- a/include/qemu/rfifolock.h
+++ /dev/null
@@ -1,54 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef QEMU_RFIFOLOCK_H
-#define QEMU_RFIFOLOCK_H
-
-#include "qemu/thread.h"
-
-/* Recursive FIFO lock
- *
- * This lock provides more features than a plain mutex:
- *
- * 1. Fairness - enforces FIFO order.
- * 2. Nesting - can be taken recursively.
- * 3. Contention callback - optional, called when thread must wait.
- *
- * The recursive FIFO lock is heavyweight so prefer other synchronization
- * primitives if you do not need its features.
- */
-typedef struct {
-    QemuMutex lock;             /* protects all fields */
-
-    /* FIFO order */
-    unsigned int head;          /* active ticket number */
-    unsigned int tail;          /* waiting ticket number */
-    QemuCond cond;              /* used to wait for our ticket number */
-
-    /* Nesting */
-    QemuThread owner_thread;    /* thread that currently has ownership */
-    unsigned int nesting;       /* amount of nesting levels */
-
-    /* Contention callback */
-    void (*cb)(void *);         /* called when thread must wait, with ->lock
-                                 * held so it may not recursively lock/unlock
-                                 */
-    void *cb_opaque;
-} RFifoLock;
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque);
-void rfifolock_destroy(RFifoLock *r);
-void rfifolock_lock(RFifoLock *r);
-void rfifolock_unlock(RFifoLock *r);
-
-#endif /* QEMU_RFIFOLOCK_H */
diff --git a/tests/.gitignore b/tests/.gitignore
index 1e55722..850d64e 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -41,7 +41,6 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
-test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
 test-thread-pool
diff --git a/tests/Makefile b/tests/Makefile
index b937984..135bf5f 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -39,7 +39,6 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
 check-unit-y += tests/test-aio$(EXESUF)
-check-unit-$(CONFIG_POSIX) += tests/test-rfifolock$(EXESUF)
 check-unit-y += tests/test-throttle$(EXESUF)
 gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
 gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
@@ -392,7 +391,6 @@ tests/check-qom-interface$(EXESUF): tests/check-qom-interface.o $(test-qom-obj-y
 tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
 tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
 tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
-tests/test-rfifolock$(EXESUF): tests/test-rfifolock.o $(test-util-obj-y)
 tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
diff --git a/tests/test-rfifolock.c b/tests/test-rfifolock.c
deleted file mode 100644
index 0572ebb..0000000
--- a/tests/test-rfifolock.c
+++ /dev/null
@@ -1,91 +0,0 @@
-/*
- * RFifoLock tests
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi    <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- */
-
-#include <glib.h>
-#include "qemu-common.h"
-#include "qemu/rfifolock.h"
-
-static void test_nesting(void)
-{
-    RFifoLock lock;
-
-    /* Trivial test, ensure the lock is recursive */
-    rfifolock_init(&lock, NULL, NULL);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_destroy(&lock);
-}
-
-typedef struct {
-    RFifoLock lock;
-    int fd[2];
-} CallbackTestData;
-
-static void rfifolock_cb(void *opaque)
-{
-    CallbackTestData *data = opaque;
-    int ret;
-    char c = 0;
-
-    ret = write(data->fd[1], &c, sizeof(c));
-    g_assert(ret == 1);
-}
-
-static void *callback_thread(void *opaque)
-{
-    CallbackTestData *data = opaque;
-
-    /* The other thread holds the lock so the contention callback will be
-     * invoked...
-     */
-    rfifolock_lock(&data->lock);
-    rfifolock_unlock(&data->lock);
-    return NULL;
-}
-
-static void test_callback(void)
-{
-    CallbackTestData data;
-    QemuThread thread;
-    int ret;
-    char c;
-
-    rfifolock_init(&data.lock, rfifolock_cb, &data);
-    ret = qemu_pipe(data.fd);
-    g_assert(ret == 0);
-
-    /* Hold lock but allow the callback to kick us by writing to the pipe */
-    rfifolock_lock(&data.lock);
-    qemu_thread_create(&thread, "callback_thread",
-                       callback_thread, &data, QEMU_THREAD_JOINABLE);
-    ret = read(data.fd[0], &c, sizeof(c));
-    g_assert(ret == 1);
-    rfifolock_unlock(&data.lock);
-    /* If we got here then the callback was invoked, as expected */
-
-    qemu_thread_join(&thread);
-    close(data.fd[0]);
-    close(data.fd[1]);
-    rfifolock_destroy(&data.lock);
-}
-
-int main(int argc, char **argv)
-{
-    g_test_init(&argc, &argv, NULL);
-    g_test_add_func("/nesting", test_nesting);
-    g_test_add_func("/callback", test_callback);
-    return g_test_run();
-}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 89dd80e..f536cc8 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -23,7 +23,6 @@ util-obj-y += crc32c.o
 util-obj-y += throttle.o
 util-obj-y += getauxval.o
 util-obj-y += readline.o
-util-obj-y += rfifolock.o
 util-obj-y += rcu.o
 util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
 util-obj-y += qemu-coroutine-sleep.o
diff --git a/util/rfifolock.c b/util/rfifolock.c
deleted file mode 100644
index afbf748..0000000
--- a/util/rfifolock.c
+++ /dev/null
@@ -1,78 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- *
- */
-
-#include <assert.h>
-#include "qemu/rfifolock.h"
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque)
-{
-    qemu_mutex_init(&r->lock);
-    r->head = 0;
-    r->tail = 0;
-    qemu_cond_init(&r->cond);
-    r->nesting = 0;
-    r->cb = cb;
-    r->cb_opaque = opaque;
-}
-
-void rfifolock_destroy(RFifoLock *r)
-{
-    qemu_cond_destroy(&r->cond);
-    qemu_mutex_destroy(&r->lock);
-}
-
-/*
- * Theory of operation:
- *
- * In order to ensure FIFO ordering, implement a ticketlock.  Threads acquiring
- * the lock enqueue themselves by incrementing the tail index.  When the lock
- * is unlocked, the head is incremented and waiting threads are notified.
- *
- * Recursive locking does not take a ticket since the head is only incremented
- * when the outermost recursive caller unlocks.
- */
-void rfifolock_lock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-
-    /* Take a ticket */
-    unsigned int ticket = r->tail++;
-
-    if (r->nesting > 0 && qemu_thread_is_self(&r->owner_thread)) {
-        r->tail--; /* put ticket back, we're nesting */
-    } else {
-        while (ticket != r->head) {
-            /* Invoke optional contention callback */
-            if (r->cb) {
-                r->cb(r->cb_opaque);
-            }
-            qemu_cond_wait(&r->cond, &r->lock);
-        }
-    }
-
-    qemu_thread_get_self(&r->owner_thread);
-    r->nesting++;
-    qemu_mutex_unlock(&r->lock);
-}
-
-void rfifolock_unlock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-    assert(r->nesting > 0);
-    assert(qemu_thread_is_self(&r->owner_thread));
-    if (--r->nesting == 0) {
-        r->head++;
-        qemu_cond_broadcast(&r->cond);
-    }
-    qemu_mutex_unlock(&r->lock);
-}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (18 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This will be used for AioHandlers too.  There is going to be little
or no contention, so it is better to reuse the same lock.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c             | 16 ++++++++--------
 include/block/aio.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/async.c b/async.c
index 567c3b9..7015bad 100644
--- a/async.c
+++ b/async.c
@@ -50,12 +50,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
         .cb = cb,
         .opaque = opaque,
     };
-    qemu_mutex_lock(&ctx->bh_lock);
+    qemu_mutex_lock(&ctx->list_lock);
     bh->next = ctx->first_bh;
     /* Make sure that the members are ready before putting bh into list */
     smp_wmb();
     ctx->first_bh = bh;
-    qemu_mutex_unlock(&ctx->bh_lock);
+    qemu_mutex_unlock(&ctx->list_lock);
     return bh;
 }
 
@@ -97,7 +97,7 @@ int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!ctx->walking_bh) {
-        qemu_mutex_lock(&ctx->bh_lock);
+        qemu_mutex_lock(&ctx->list_lock);
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -108,7 +108,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_mutex_unlock(&ctx->bh_lock);
+        qemu_mutex_unlock(&ctx->list_lock);
     }
 
     return ret;
@@ -239,7 +239,7 @@ aio_ctx_finalize(GSource     *source)
 
     thread_pool_free(ctx->thread_pool);
 
-    qemu_mutex_lock(&ctx->bh_lock);
+    qemu_mutex_lock(&ctx->list_lock);
     while (ctx->first_bh) {
         QEMUBH *next = ctx->first_bh->next;
 
@@ -249,12 +249,12 @@ aio_ctx_finalize(GSource     *source)
         g_free(ctx->first_bh);
         ctx->first_bh = next;
     }
-    qemu_mutex_unlock(&ctx->bh_lock);
+    qemu_mutex_unlock(&ctx->list_lock);
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
     qemu_rec_mutex_destroy(&ctx->lock);
-    qemu_mutex_destroy(&ctx->bh_lock);
+    qemu_mutex_destroy(&ctx->list_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
 
@@ -330,7 +330,7 @@ AioContext *aio_context_new(Error **errp)
                            (EventNotifierHandler *)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
-    qemu_mutex_init(&ctx->bh_lock);
+    qemu_mutex_init(&ctx->list_lock);
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
diff --git a/include/block/aio.h b/include/block/aio.h
index b416621..5d60d68 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -88,7 +88,7 @@ struct AioContext {
     uint32_t notify_me;
 
     /* lock to protect between bh's adders and deleter */
-    QemuMutex bh_lock;
+    QemuMutex list_lock;
 
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (19 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

A QemuLockCnt comprises a counter and a mutex, with primitives
to increment and decrement the counter, and to take and release the
mutex.  It can be used to do lock-free visits to a data structure
whenever mutexes would be too heavyweight and the critical section
is too long for RCU.

This could be implemented simply by protecting the counter with the
mutex, but QemuLockCnt is harder to misuse and more efficient.
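
In a nutshell, the intended usage looks like this (sketch only; the
full patterns are documented in docs/lockcnt.txt below):

    /* visits, possibly concurrent or reentrant */
    qemu_lockcnt_inc(&lockcnt);
    /* ... read the protected data structure ... */
    qemu_lockcnt_dec(&lockcnt);

    /* modifications */
    qemu_lockcnt_lock(&lockcnt);
    /* ... write; free data only if qemu_lockcnt_count() is zero ... */
    qemu_lockcnt_unlock(&lockcnt);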

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/lockcnt.txt      | 343 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/qemu/thread.h |  17 +++
 util/Makefile.objs    |   1 +
 util/lockcnt.c        | 122 ++++++++++++++++++
 4 files changed, 483 insertions(+)
 create mode 100644 docs/lockcnt.txt
 create mode 100644 util/lockcnt.c

diff --git a/docs/lockcnt.txt b/docs/lockcnt.txt
new file mode 100644
index 0000000..fc5d240
--- /dev/null
+++ b/docs/lockcnt.txt
@@ -0,0 +1,343 @@
+DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt)
+===================================================
+
+QEMU often uses reference counts to track data structures that are being
+accessed and should not be freed.  For example, a loop that invokes
+callbacks like this is not safe:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->revents & G_IO_OUT) {
+            ioh->fd_write(ioh->opaque);
+        }
+    }
+
+QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
+by stashing away its "next" pointer.  However, ioh->fd_write could
+actually delete the next node from the list.  The simplest way to
+avoid this is to mark the node as deleted, and remove it from the
+list in the above loop:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            QLIST_REMOVE(ioh, next);
+            g_free(ioh);
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+
+If however this loop must also be reentrant, i.e. it is possible that
+ioh->fd_write invokes the loop again, some kind of counting is needed:
+
+    walking_handlers++;
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            if (walking_handlers == 1) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+    walking_handlers--;
+
+One may think of using the RCU primitives, rcu_read_lock() and
+rcu_read_unlock(); effectively, the RCU nesting count would take
+the place of the walking_handlers global variable.  Indeed,
+reference counting and RCU have similar purposes, but their usage in
+general is complementary:
+
+- reference counting is fine-grained and limited to a single data
+  structure; RCU delays reclamation of *all* RCU-protected data
+  structures;
+
+- reference counting works even in the presence of code that keeps
+  a reference for a long time; RCU critical sections in principle
+  should be kept short;
+
+- reference counting is often applied to code that is not thread-safe
+  but is reentrant; in fact, usage of reference counting in QEMU predates
+  the introduction of threads by many years.  RCU is generally used to
+  protect readers from other threads freeing memory after concurrent
+  modifications to a data structure.
+
+- reclaiming data can be done by a separate thread in the case of RCU;
+  this can improve performance, but also delay reclamation undesirably.
+  With reference counting, reclamation is deterministic.
+
+This file documents QemuLockCnt, an abstraction for using reference
+counting in code that has to be both thread-safe and reentrant.
+
+
+QemuLockCnt concepts
+--------------------
+
+A QemuLockCnt comprises both a counter and a mutex; it has primitives
+to increment and decrement the counter, and to take and release the
+mutex.  The counter notes how many visits to the data structures are
+taking place (the visits could be from different threads, or there could
+be multiple reentrant visits from the same thread).  The basic rules
+governing the counter/mutex pair then are the following:
+
+- Data protected by the QemuLockCnt must not be freed unless the
+  counter is zero and the mutex is taken.
+
+- A new visit cannot be started while the counter is zero and the
+  mutex is taken.
+
+Most of the time, the mutex protects all writes to the data structure,
+not just frees, though there could be cases where this is not necessary.
+
+Reads, instead, can be done without taking the mutex, as long as the
+readers and writers use the same macros that are used for RCU, for
+example atomic_rcu_read, atomic_rcu_set, QLIST_FOREACH_RCU, etc.  This is
+because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
+can happen concurrently with the read.  The RCU API ensures that the
+processor and the compiler see all required memory barriers.
+
+This could be implemented simply by protecting the counter with the
+mutex, for example:
+
+    // (1)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    walking_handlers++;
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+    ...
+
+    // (2)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    if (--walking_handlers == 0) {
+        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+            if (ioh->deleted) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        }
+    }
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+Here, no frees can happen in the code represented by the ellipsis.
+If another thread is executing critical section (2), that part of
+the code cannot be entered, because the thread will not be able
+to increment the walking_handlers variable.  And of course
+during the visit any other thread will see a nonzero value for
+walking_handlers, as in the single-threaded code.
+
+Note that it is possible for multiple concurrent accesses to delay
+the cleanup arbitrarily; in other words, for the walking_handlers
+counter to never become zero.  For this reason, this technique is
+more easily applicable if concurrent access to the structure is rare.
+
+However, critical sections are easy to forget since you have to do
+them for each modification of the counter.  QemuLockCnt ensures that
+all modifications of the counter take the lock appropriately, and it
+can also be more efficient in two ways:
+
+- it avoids taking the lock for many operations (for example
+  incrementing the counter while it is non-zero);
+
+- on some platforms, one could implement QemuLockCnt to hold the
+  counter and the mutex in a single word, making it no more expensive
+  than simply managing a counter using atomic operations (see
+  docs/atomics.txt).  This is not implemented yet, but can be
+  very helpful if concurrent access to the data structure is
+  expected to be rare.
+
+
+Using the same mutex for frees and writes can still incur some small
+inefficiencies; for example, a visit can never start if the counter is
+zero and the mutex is taken---even if the mutex is taken by a write,
+which in principle need not block a visit of the data structure.
+However, these are usually not a problem if any of the following
+assumptions are valid:
+
+- concurrent access is possible but rare
+
+- writes are rare
+
+- writes are frequent, but this kind of write (e.g. appending to a
+  list) has a very small critical section.
+
+For example, QEMU uses QemuLockCnt to manage an AioContext's list of
+bottom halves and file descriptor handlers.  Modifications to the list
+of file descriptor handlers are rare.  Creation of a new bottom half is
+frequent and can happen on a fast path; however: 1) it is almost never
+concurrent with a visit to the list of bottom halves; 2) it only has
+three instructions in the critical path, two assignments and a smp_wmb().
+
+
+QemuLockCnt API
+---------------
+
+    void qemu_lockcnt_init(QemuLockCnt *lockcnt);
+
+        Initialize lockcnt's counter to zero and prepare its mutex
+        for usage.
+
+    void qemu_lockcnt_destroy(QemuLockCnt *lockcnt);
+
+        Destroy lockcnt's mutex.
+
+    void qemu_lockcnt_inc(QemuLockCnt *lockcnt);
+
+        If the lockcnt's count is zero, wait for critical sections
+        to finish and increment lockcnt's count to 1.  If the count
+        is not zero, just increment it.
+
+        Because this function can wait on the mutex, it must not be
+        called while the lockcnt's mutex is held by the current thread.
+        For the same reason, qemu_lockcnt_inc can also contribute to
+        AB-BA deadlocks.  This is a sample deadlock scenario:
+
+              thread 1                      thread 2
+              -------------------------------------------------------
+              qemu_lockcnt_lock(&lc1);
+                                            qemu_lockcnt_lock(&lc2);
+              qemu_lockcnt_inc(&lc2);
+                                            qemu_lockcnt_inc(&lc1);
+
+    void qemu_lockcnt_dec(QemuLockCnt *lockcnt);
+
+        Decrement lockcnt's count.
+
+    bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt);
+
+        Decrement the count.  If the new count is zero, lock
+        the mutex and return true.  Otherwise, return false.
+
+    bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt);
+
+        If the count is 1, decrement the count to zero, lock
+        the mutex and return true.  Otherwise, return false.
+
+    void qemu_lockcnt_lock(QemuLockCnt *lockcnt);
+
+        Lock the lockcnt's mutex.  Remember that concurrent visits
+        are not blocked unless the count is also zero.  You can
+        use qemu_lockcnt_count to check for this inside a critical
+        section.
+
+    void qemu_lockcnt_unlock(QemuLockCnt *lockcnt);
+
+        Release the lockcnt's mutex.
+
+    void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
+
+        This is the same as
+
+            qemu_lockcnt_unlock(lockcnt);
+            qemu_lockcnt_inc(lockcnt);
+
+        but more efficient.
+
+    unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
+
+        Return the lockcnt's count.  The count can change at any
+        time; still, while the lockcnt is locked, one can usefully
+        check whether the count is non-zero.
+
+
+QemuLockCnt usage
+-----------------
+
+This section explains the typical usage patterns for QemuLockCnt functions.
+
+Setting a variable to a non-NULL value can be done between
+qemu_lockcnt_lock and qemu_lockcnt_unlock:
+
+    qemu_lockcnt_lock(&xyz_lockcnt);
+    if (!xyz) {
+        new_xyz = g_new(XYZ, 1);
+        ...
+        atomic_rcu_set(&xyz, new_xyz);
+    }
+    qemu_lockcnt_unlock(&xyz_lockcnt);
+
+Accessing the value can be done between qemu_lockcnt_inc and
+qemu_lockcnt_dec:
+
+    qemu_lockcnt_inc(&xyz_lockcnt);
+    if (xyz) {
+        XYZ *p = atomic_rcu_read(&xyz);
+        ...
+        /* Accesses can now be done through "p".  */
+    }
+    qemu_lockcnt_dec(&xyz_lockcnt);
+
+Freeing the object can similarly use qemu_lockcnt_lock and
+qemu_lockcnt_unlock, but you also need to ensure that the count
+is zero (i.e. there is no concurrent visit).  Because qemu_lockcnt_inc
+takes the QemuLockCnt's lock, the count cannot become non-zero while
+the object is being freed.  Freeing an object looks like this:
+
+    qemu_lockcnt_lock(&xyz_lockcnt);
+    if (!qemu_lockcnt_count(&xyz_lockcnt)) {
+        g_free(xyz);
+        xyz = NULL;
+    }
+    qemu_lockcnt_unlock(&xyz_lockcnt);
+
+If an object has to be freed right after a visit, you can combine
+the decrement, the locking and the check on count as follows:
+
+    qemu_lockcnt_inc(&xyz_lockcnt);
+    if (xyz) {
+        XYZ *p = atomic_rcu_read(&xyz);
+        ...
+        /* Accesses can now be done through "p".  */
+    }
+    if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) {
+        g_free(xyz);
+        xyz = NULL;
+        qemu_lockcnt_unlock(&xyz_lockcnt);
+    }
+
+QemuLockCnt can also be used to access a list as follows:
+
+    qemu_lockcnt_inc(&io_handlers_lockcnt);
+    QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) {
+        if (ioh->revents & G_IO_OUT) {
+            ioh->fd_write(ioh->opaque);
+        }
+    }
+
+    if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) {
+        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+            if (ioh->deleted) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        }
+        qemu_lockcnt_unlock(&io_handlers_lockcnt);
+    }
+
+Again, the RCU primitives are used because new items can be added to the
+list during the walk.  QLIST_FOREACH_RCU ensures that the processor and
+the compiler see the appropriate memory barriers.
+
+An alternative pattern uses qemu_lockcnt_dec_if_lock:
+
+    qemu_lockcnt_inc(&io_handlers_lockcnt);
+    QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+                qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt);
+            }
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+    qemu_lockcnt_dec(&io_handlers_lockcnt);
+
+Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock,
+because there is no special task to do if the count goes from 1 to 0.
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 981f3dc..9fadca4 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -8,6 +8,7 @@ typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
 typedef struct QemuSemaphore QemuSemaphore;
 typedef struct QemuEvent QemuEvent;
+typedef struct QemuLockCnt QemuLockCnt;
 typedef struct QemuThread QemuThread;
 
 #ifdef _WIN32
@@ -65,4 +66,20 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+struct QemuLockCnt {
+    QemuMutex mutex;
+    unsigned count;
+};
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt);
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt);
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt);
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt);
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt);
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
+
 #endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index f536cc8..5636f1c 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,5 @@
 util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y += lockcnt.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/lockcnt.c b/util/lockcnt.c
new file mode 100644
index 0000000..304f9d9
--- /dev/null
+++ b/util/lockcnt.c
@@ -0,0 +1,122 @@
+/*
+ * QemuLockCnt implementation
+ *
+ * Copyright Red Hat, Inc. 2015
+ *
+ * Author:
+ *   Paolo Bonzini <pbonzini@redhat.com>
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <errno.h>
+#include <time.h>
+#include <signal.h>
+#include <stdint.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/time.h>
+#include "qemu/thread.h"
+#include "qemu/atomic.h"
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_init(&lockcnt->mutex);
+    lockcnt->count = 0;
+}
+
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_destroy(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt)
+{
+    int old;
+    for (;;) {
+        old = atomic_mb_read(&lockcnt->count);
+        if (old == 0) {
+            qemu_lockcnt_lock(lockcnt);
+            qemu_lockcnt_inc_and_unlock(lockcnt);
+            return;
+        } else {
+            if (atomic_cmpxchg(&lockcnt->count, old, old + 1) == old) {
+                return;
+            }
+        }
+    }
+}
+
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt)
+{
+    atomic_dec(&lockcnt->count);
+}
+
+/* Decrement a counter, and return locked if it is decremented to zero.
+ * It is impossible for the counter to become nonzero while the mutex
+ * is taken.
+ */
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    while (val > 1) {
+        int old = atomic_cmpxchg(&lockcnt->count, val, val - 1);
+        if (old != val) {
+            val = old;
+            continue;
+        }
+
+        return false;
+    }
+
+    qemu_lockcnt_lock(lockcnt);
+    if (atomic_fetch_dec(&lockcnt->count) == 1) {
+        return true;
+    }
+
+    qemu_lockcnt_unlock(lockcnt);
+    return false;
+}
+
+/* Decrement a counter and return locked if it is decremented to zero.
+ * Otherwise do nothing.
+ *
+ * It is impossible for the counter to become nonzero while the mutex
+ * is taken.
+ */
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_mb_read(&lockcnt->count);
+    if (val > 1) {
+        return false;
+    }
+
+    qemu_lockcnt_lock(lockcnt);
+    if (atomic_fetch_dec(&lockcnt->count) == 1) {
+        return true;
+    }
+
+    qemu_lockcnt_inc_and_unlock(lockcnt);
+    return false;
+}
+
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_lock(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt)
+{
+    atomic_inc(&lockcnt->count);
+    qemu_mutex_unlock(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_unlock(&lockcnt->mutex);
+}
+
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
+{
+    return lockcnt->count;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (20 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This will make it possible to walk the list of bottom halves without
holding the AioContext lock---and in turn to call bottom half
handlers without holding the lock.
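
In sketch form, aio_bh_poll() becomes:

    qemu_lockcnt_inc(&ctx->list_lock);           /* start a visit */
    /* ... walk ctx->first_bh using atomic_rcu_read() ... */
    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* count dropped to zero: safe to reap deleted bottom halves */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }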

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c             | 31 ++++++++++++++-----------------
 include/block/aio.h | 12 +++++-------
 2 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/async.c b/async.c
index 7015bad..4c1f658 100644
--- a/async.c
+++ b/async.c
@@ -50,12 +50,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
         .cb = cb,
         .opaque = opaque,
     };
-    qemu_mutex_lock(&ctx->list_lock);
+    qemu_lockcnt_lock(&ctx->list_lock);
     bh->next = ctx->first_bh;
     /* Make sure that the members are ready before putting bh into list */
     smp_wmb();
     ctx->first_bh = bh;
-    qemu_mutex_unlock(&ctx->list_lock);
+    qemu_lockcnt_unlock(&ctx->list_lock);
     return bh;
 }
 
@@ -70,13 +70,11 @@ int aio_bh_poll(AioContext *ctx)
     QEMUBH *bh, **bhp, *next;
     int ret;
 
-    ctx->walking_bh++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
     ret = 0;
-    for (bh = ctx->first_bh; bh; bh = next) {
-        /* Make sure that fetching bh happens before accessing its members */
-        smp_read_barrier_depends();
-        next = bh->next;
+    for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
+        next = atomic_rcu_read(&bh->next);
         /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
          * implicit memory barrier ensures that the callback sees all writes
          * done by the scheduling thread.  It also ensures that the scheduling
@@ -93,11 +91,8 @@ int aio_bh_poll(AioContext *ctx)
         }
     }
 
-    ctx->walking_bh--;
-
     /* remove deleted bhs */
-    if (!ctx->walking_bh) {
-        qemu_mutex_lock(&ctx->list_lock);
+    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -108,7 +103,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_mutex_unlock(&ctx->list_lock);
+        qemu_lockcnt_unlock(&ctx->list_lock);
     }
 
     return ret;
@@ -164,7 +159,8 @@ aio_compute_timeout(AioContext *ctx)
     int timeout = -1;
     QEMUBH *bh;
 
-    for (bh = ctx->first_bh; bh; bh = bh->next) {
+    for (bh = atomic_rcu_read(&ctx->first_bh); bh;
+         bh = atomic_rcu_read(&bh->next)) {
         if (!bh->deleted && bh->scheduled) {
             if (bh->idle) {
                 /* idle bottom halves will be polled at least
@@ -239,7 +235,8 @@ aio_ctx_finalize(GSource     *source)
 
     thread_pool_free(ctx->thread_pool);
 
-    qemu_mutex_lock(&ctx->list_lock);
+    qemu_lockcnt_lock(&ctx->list_lock);
+    assert(!qemu_lockcnt_count(&ctx->list_lock));
     while (ctx->first_bh) {
         QEMUBH *next = ctx->first_bh->next;
 
@@ -249,12 +246,12 @@ aio_ctx_finalize(GSource     *source)
         g_free(ctx->first_bh);
         ctx->first_bh = next;
     }
-    qemu_mutex_unlock(&ctx->list_lock);
+    qemu_lockcnt_unlock(&ctx->list_lock);
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
     qemu_rec_mutex_destroy(&ctx->lock);
-    qemu_mutex_destroy(&ctx->list_lock);
+    qemu_lockcnt_destroy(&ctx->list_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
 
@@ -330,7 +327,7 @@ AioContext *aio_context_new(Error **errp)
                            (EventNotifierHandler *)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
-    qemu_mutex_init(&ctx->list_lock);
+    qemu_lockcnt_init(&ctx->list_lock);
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
diff --git a/include/block/aio.h b/include/block/aio.h
index 5d60d68..26aa658 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -87,17 +87,15 @@ struct AioContext {
      */
     uint32_t notify_me;
 
-    /* lock to protect between bh's adders and deleter */
-    QemuMutex list_lock;
+    /* A lock to protect between bh's adders and deleter, and to ensure
+     * that no callbacks are removed while we're walking and dispatching
+     * them.
+     */
+    QemuLockCnt list_lock;
 
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
 
-    /* A simple lock used to protect the first_bh list, and ensure that
-     * no callbacks are removed while we're walking and dispatching callbacks.
-     */
-    int walking_bh;
-
     /* Used by aio_notify.
      *
      * "notified" is used to avoid expensive event_notifier_test_and_clear
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (21 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is complex, but I think it is reasonably documented in the source.
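
The key point is that the lock state and the counter now share a
single word, so the uncontended paths need only one cmpxchg.  A value
of lockcnt->count decodes as follows (illustration only):

    int val = atomic_read(&lockcnt->count);
    int state = val & QEMU_LOCKCNT_STATE_MASK;    /* FREE/LOCKED/WAITING */
    int count = val >> QEMU_LOCKCNT_COUNT_SHIFT;  /* number of visitors */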

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/lockcnt.txt         |   9 +-
 include/qemu/futex.h     |  36 ++++++
 include/qemu/thread.h    |   3 +
 trace-events             |  10 ++
 util/lockcnt.c           | 282 +++++++++++++++++++++++++++++++++++++++++++++++
 util/qemu-thread-posix.c |  25 +----
 6 files changed, 336 insertions(+), 29 deletions(-)
 create mode 100644 include/qemu/futex.h

diff --git a/docs/lockcnt.txt b/docs/lockcnt.txt
index fc5d240..594764b 100644
--- a/docs/lockcnt.txt
+++ b/docs/lockcnt.txt
@@ -142,12 +142,11 @@ can also be more efficient in two ways:
 - it avoids taking the lock for many operations (for example
   incrementing the counter while it is non-zero);
 
-- on some platforms, one could implement QemuLockCnt to hold the
-  counter and the mutex in a single word, making it no more expensive
+- on some platforms, one can implement QemuLockCnt to hold the counter
+  and the mutex in a single word, making the fast path no more expensive
   than simply managing a counter using atomic operations (see
-  docs/atomics.txt).  This is not implemented yet, but can be
-  very helpful if concurrent access to the data structure is
-  expected to be rare.
+  docs/atomics.txt).  This can be very helpful if concurrent access to
+  the data structure is expected to be rare.
 
 
 Using the same mutex for frees and writes can still incur some small
diff --git a/include/qemu/futex.h b/include/qemu/futex.h
new file mode 100644
index 0000000..c3d1089
--- /dev/null
+++ b/include/qemu/futex.h
@@ -0,0 +1,36 @@
+/*
+ * Wrappers around Linux futex syscall
+ *
+ * Copyright Red Hat, Inc. 2015
+ *
+ * Author:
+ *  Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <sys/syscall.h>
+#include <linux/futex.h>
+
+#define futex(...)              syscall(__NR_futex, __VA_ARGS__)
+
+static inline void futex_wake(void *f, int n)
+{
+    futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+}
+
+static inline void futex_wait(void *f, unsigned val)
+{
+    while (futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
+        switch (errno) {
+        case EWOULDBLOCK:
+            return;
+        case EINTR:
+            break; /* get out of switch and retry */
+        default:
+            abort();
+        }
+    }
+}
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 9fadca4..22d92d2 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,7 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include "config-host.h"
 #include <inttypes.h>
 #include <stdbool.h>
 
@@ -67,7 +68,9 @@ void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
 struct QemuLockCnt {
+#ifndef CONFIG_LINUX
     QemuMutex mutex;
+#endif
     unsigned count;
 };
 
diff --git a/trace-events b/trace-events
index c230d02..d667e4c 100644
--- a/trace-events
+++ b/trace-events
@@ -1427,6 +1427,16 @@ hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long c
 hbitmap_reset(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
 hbitmap_set(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
 
+# util/lockcnt.c
+lockcnt_fast_path_attempt(const void *lockcnt, int expected, int new) "lockcnt %p fast path %d->%d"
+lockcnt_fast_path_success(const void *lockcnt, int expected, int new) "lockcnt %p fast path %d->%d succeeded"
+lockcnt_unlock_attempt(const void *lockcnt, int expected, int new) "lockcnt %p unlock %d->%d"
+lockcnt_unlock_success(const void *lockcnt, int expected, int new) "lockcnt %p unlock %d->%d succeeded"
+lockcnt_futex_wait_prepare(const void *lockcnt, int expected, int new) "lockcnt %p preparing slow path %d->%d"
+lockcnt_futex_wait(const void *lockcnt, int val) "lockcnt %p waiting on %d"
+lockcnt_futex_wait_resume(const void *lockcnt, int new) "lockcnt %p after wait: %d"
+lockcnt_futex_wake(const void *lockcnt) "lockcnt %p waking up one waiter"
+
 # target-s390x/mmu_helper.c
 get_skeys_nonzero(int rc) "SKEY: Call to get_skeys unexpectedly returned %d"
 set_skeys_nonzero(int rc) "SKEY: Call to set_skeys unexpectedly returned %d"
diff --git a/util/lockcnt.c b/util/lockcnt.c
index 304f9d9..56eb29e 100644
--- a/util/lockcnt.c
+++ b/util/lockcnt.c
@@ -18,7 +18,288 @@
 #include <sys/time.h>
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "trace.h"
 
+#ifdef CONFIG_LINUX
+#include "qemu/futex.h"
+
+/* On Linux, bits 0-1 are a futex-based lock, bits 2-31 are the counter.
+ * For the mutex algorithm see Ulrich Drepper's "Futexes Are Tricky" (ok,
+ * this is not the most relaxing citation I could make...).  It is similar
+ * to mutex2 in the paper.
+ */
+
+#define QEMU_LOCKCNT_STATE_MASK    3
+#define QEMU_LOCKCNT_STATE_FREE    0
+#define QEMU_LOCKCNT_STATE_LOCKED  1
+#define QEMU_LOCKCNT_STATE_WAITING 2
+
+#define QEMU_LOCKCNT_COUNT_STEP    4
+#define QEMU_LOCKCNT_COUNT_SHIFT   2
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt)
+{
+    lockcnt->count = 0;
+}
+
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt)
+{
+}
+
+/* *val is the current value of lockcnt->count.
+ *
+ * If the lock is free, try a cmpxchg from *val to new_if_free; return
+ * true and set *val to the old value found by the cmpxchg in
+ * lockcnt->count.
+ *
+ * If the lock is taken, wait for it to be released and return false
+ * *without trying again to take the lock*.  Again, set *val to the
+ * new value of lockcnt->count.
+ *
+ * new_if_free's bottom two bits must not be QEMU_LOCKCNT_STATE_LOCKED
+ * if calling this function a second time after it has returned
+ * false.
+ */
+static bool qemu_lockcnt_cmpxchg_or_wait(QemuLockCnt *lockcnt, int *val,
+                                         int new_if_free, bool *waited)
+{
+    /* Fast path for when the lock is free.  */
+    if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_FREE) {
+        int expected = *val;
+
+        trace_lockcnt_fast_path_attempt(lockcnt, expected, new_if_free);
+        *val = atomic_cmpxchg(&lockcnt->count, expected, new_if_free);
+        if (*val == expected) {
+            trace_lockcnt_fast_path_success(lockcnt, expected, new_if_free);
+            *val = new_if_free;
+            return true;
+        }
+    }
+
+    /* The slow path moves from locked to waiting if necessary, then
+     * does a futex wait.  Both steps can be repeated ad nauseam,
+     * only getting out of the loop if we can have another shot at the
+     * fast path.  Once we can, get out to compute the new destination
+     * value for the fast path.
+     */
+    while ((*val & QEMU_LOCKCNT_STATE_MASK) != QEMU_LOCKCNT_STATE_FREE) {
+        if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_LOCKED) {
+            int expected = *val;
+            int new = expected - QEMU_LOCKCNT_STATE_LOCKED + QEMU_LOCKCNT_STATE_WAITING;
+
+            trace_lockcnt_futex_wait_prepare(lockcnt, expected, new);
+            *val = atomic_cmpxchg(&lockcnt->count, expected, new);
+            if (*val == expected) {
+                *val = new;
+            }
+            continue;
+        }
+
+        if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_WAITING) {
+            *waited = true;
+            trace_lockcnt_futex_wait(lockcnt, *val);
+            futex_wait(&lockcnt->count, *val);
+            *val = atomic_read(&lockcnt->count);
+            trace_lockcnt_futex_wait_resume(lockcnt, *val);
+            continue;
+        }
+
+        abort();
+    }
+    return false;
+}
+
+static void lockcnt_wake(QemuLockCnt *lockcnt)
+{
+    trace_lockcnt_futex_wake(lockcnt);
+    futex_wake(&lockcnt->count, 1);
+}
+
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    bool waited = false;
+
+    for (;;) {
+        if (val >= QEMU_LOCKCNT_COUNT_STEP) {
+            int expected = val;
+            val = atomic_cmpxchg(&lockcnt->count, val, val + QEMU_LOCKCNT_COUNT_STEP);
+            if (val == expected) {
+                break;
+            }
+        } else {
+            /* The fast path is (0, unlocked)->(1, unlocked).  */
+            if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, QEMU_LOCKCNT_COUNT_STEP,
+                                             &waited)) {
+                break;
+            }
+        }
+    }
+
+    /* If we were woken by another thread, we should also wake one because
+     * we are effectively releasing the lock that was given to us.  This is
+     * the case where qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING
+     * in the low bits, and qemu_lockcnt_inc_and_unlock would find it and
+     * wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt)
+{
+    atomic_sub(&lockcnt->count, QEMU_LOCKCNT_COUNT_STEP);
+}
+
+/* Decrement a counter, and return locked if it is decremented to zero.
+ * If the function returns true, it is impossible for the counter to
+ * become nonzero until the next qemu_lockcnt_unlock.
+ */
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int locked_state = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    for (;;) {
+        if (val >= 2 * QEMU_LOCKCNT_COUNT_STEP) {
+            int expected = val;
+            int new = val - QEMU_LOCKCNT_COUNT_STEP;
+            val = atomic_cmpxchg(&lockcnt->count, val, new);
+            if (val == expected) {
+                break;
+            }
+            /* cmpxchg failed: retry from the top.  Falling through with
+             * count still >= 2 would let the fast path of
+             * qemu_lockcnt_cmpxchg_or_wait overwrite the counter.
+             */
+            continue;
+        }
+
+        /* If count is going 1->0, take the lock. The fast path is
+         * (1, unlocked)->(0, locked) or (1, unlocked)->(0, waiting).
+         */
+        if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, locked_state, &waited)) {
+            return true;
+        }
+
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            locked_state = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+
+    /* If we were woken by another thread, but we're returning in unlocked
+     * state, we should also wake a thread because we are effectively
+     * releasing the lock that was given to us.  This is the case where
+     * qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING in the low
+     * bits, and qemu_lockcnt_unlock would find it and wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+    return false;
+}
+
+/* If the counter is one, decrement it and return locked.  Otherwise do
+ * nothing.
+ *
+ * If the function returns true, it is impossible for the counter to
+ * become nonzero until the next qemu_lockcnt_unlock.
+ */
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int locked_state = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    while (val < 2 * QEMU_LOCKCNT_COUNT_STEP) {
+        /* If count is going 1->0, take the lock. The fast path is
+         * (1, unlocked)->(0, locked) or (1, unlocked)->(0, waiting).
+         */
+        if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, locked_state, &waited)) {
+            return true;
+        }
+
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            locked_state = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+
+    /* If we were woken by another thread, but we're returning in unlocked
+     * state, we should also wake a thread because we are effectively
+     * releasing the lock that was given to us.  This is the case where
+     * qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING in the low
+     * bits, and qemu_lockcnt_inc_and_unlock would find it and wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+    return false;
+}
+
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int step = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    /* The third argument is only used if the low bits of val are 0
+     * (QEMU_LOCKCNT_STATE_FREE), so just blindly mix in the desired
+     * state.
+     */
+    while (!qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, val + step, &waited)) {
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            step = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+}
+
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt)
+{
+    int expected, new, val;
+
+    val = atomic_read(&lockcnt->count);
+    do {
+        expected = val;
+        new = (val + QEMU_LOCKCNT_COUNT_STEP) & ~QEMU_LOCKCNT_STATE_MASK;
+        trace_lockcnt_unlock_attempt(lockcnt, val, new);
+        val = atomic_cmpxchg(&lockcnt->count, val, new);
+    } while (val != expected);
+
+    trace_lockcnt_unlock_success(lockcnt, val, new);
+    if (val & QEMU_LOCKCNT_STATE_WAITING) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt)
+{
+    int expected, new, val;
+
+    val = atomic_read(&lockcnt->count);
+    do {
+        expected = val;
+        new = val & ~QEMU_LOCKCNT_STATE_MASK;
+        trace_lockcnt_unlock_attempt(lockcnt, val, new);
+        val = atomic_cmpxchg(&lockcnt->count, val, new);
+    } while (val != expected);
+
+    trace_lockcnt_unlock_success(lockcnt, val, new);
+    if (val & QEMU_LOCKCNT_STATE_WAITING) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
+{
+    return lockcnt->count >> QEMU_LOCKCNT_COUNT_SHIFT;
+}
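+
+/* Typical usage (a sketch, not an API contract; see the aio-posix
+ * conversion later in this series for the real thing).  Visitors keep
+ * the count elevated while they walk the protected structure; whoever
+ * finds a deleted node frees it only if it can take the lock while
+ * dropping the last reference:
+ *
+ *     qemu_lockcnt_inc(&lockcnt);
+ *     ...walk the list; on finding a deleted node...
+ *     if (qemu_lockcnt_dec_if_lock(&lockcnt)) {
+ *         QLIST_REMOVE(node, node);
+ *         g_free(node);
+ *         qemu_lockcnt_inc_and_unlock(&lockcnt);
+ *     }
+ *     ...done walking...
+ *     qemu_lockcnt_dec(&lockcnt);
+ */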
+#else
 void qemu_lockcnt_init(QemuLockCnt *lockcnt)
 {
     qemu_mutex_init(&lockcnt->mutex);
@@ -120,3 +401,4 @@ unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
 {
     return lockcnt->count;
 }
+#endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 4cc9f13..9e44e23 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -20,10 +20,6 @@
 #include <limits.h>
 #include <unistd.h>
 #include <sys/time.h>
-#ifdef __linux__
-#include <sys/syscall.h>
-#include <linux/futex.h>
-#endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
 #include "qemu/notify.h"
@@ -302,26 +298,7 @@ void qemu_sem_wait(QemuSemaphore *sem)
 }
 
 #ifdef __linux__
-#define futex(...)              syscall(__NR_futex, __VA_ARGS__)
-
-static inline void futex_wake(QemuEvent *ev, int n)
-{
-    futex(ev, FUTEX_WAKE, n, NULL, NULL, 0);
-}
-
-static inline void futex_wait(QemuEvent *ev, unsigned val)
-{
-    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
-        switch (errno) {
-        case EWOULDBLOCK:
-            return;
-        case EINTR:
-            break; /* get out of switch and retry */
-        default:
-            abort();
-        }
-    }
-}
+#include "qemu/futex.h"
 #else
 static inline void futex_wake(QemuEvent *ev, int n)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (22 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 26 ++++++++++++--------------
 aio-win32.c | 26 ++++++++++++--------------
 2 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 482b316..370fefe 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -294,7 +294,7 @@ bool aio_pending(AioContext *ctx)
 
 bool aio_dispatch(AioContext *ctx)
 {
-    AioHandler *node;
+    AioHandler *node, *tmp;
     bool progress = false;
 
     /*
@@ -310,12 +310,10 @@ bool aio_dispatch(AioContext *ctx)
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    node = QLIST_FIRST(&ctx->aio_handlers);
-    while (node) {
-        AioHandler *tmp;
-        int revents;
+    ctx->walking_handlers++;
 
-        ctx->walking_handlers++;
+    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
+        int revents;
 
         revents = node->pfd.revents & node->pfd.events;
         node->pfd.revents = 0;
@@ -337,17 +335,17 @@ bool aio_dispatch(AioContext *ctx)
             progress = true;
         }
 
-        tmp = node;
-        node = QLIST_NEXT(node, node);
-
-        ctx->walking_handlers--;
-
-        if (!ctx->walking_handlers && tmp->deleted) {
-            QLIST_REMOVE(tmp, node);
-            g_free(tmp);
+        if (node->deleted) {
+            ctx->walking_handlers--;
+            if (!ctx->walking_handlers) {
+                QLIST_REMOVE(node, node);
+                g_free(node);
+            }
+            ctx->walking_handlers++;
         }
     }
 
+    ctx->walking_handlers--;
+
     /* Run our timers */
     progress |= timerlistgroup_run_timers(&ctx->tlg);
 
diff --git a/aio-win32.c b/aio-win32.c
index cdc4456..a05cffe 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -208,20 +208,18 @@ bool aio_pending(AioContext *ctx)
 
 static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 {
-    AioHandler *node;
+    AioHandler *node, *tmp;
     bool progress = false;
 
+    ctx->walking_handlers++;
+
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    node = QLIST_FIRST(&ctx->aio_handlers);
-    while (node) {
-        AioHandler *tmp;
+    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
         int revents = node->pfd.revents;
 
-        ctx->walking_handlers++;
-
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
@@ -256,17 +254,17 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             }
         }
 
-        tmp = node;
-        node = QLIST_NEXT(node, node);
-
-        ctx->walking_handlers--;
-
-        if (!ctx->walking_handlers && tmp->deleted) {
-            QLIST_REMOVE(tmp, node);
-            g_free(tmp);
+        if (node->deleted) {
+            ctx->walking_handlers--;
+            if (!ctx->walking_handlers) {
+                QLIST_REMOVE(node, node);
+                g_free(node);
+            }
+            ctx->walking_handlers++;
         }
     }
 
+    ctx->walking_handlers--;
     return progress;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (23 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 55 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 370fefe..212c203 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -15,7 +15,7 @@
 
 #include "qemu-common.h"
 #include "block/block.h"
-#include "qemu/queue.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/sockets.h"
 #ifdef CONFIG_EPOLL
 #include <sys/epoll.h>
@@ -212,6 +212,8 @@ void aio_set_fd_handler(AioContext *ctx,
     bool is_new = false;
     bool deleted = false;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
+
     node = find_aio_handler(ctx, fd);
 
     /* Are we deleting the fd handler? */
@@ -219,14 +221,14 @@ void aio_set_fd_handler(AioContext *ctx,
         if (node) {
             g_source_remove_poll(&ctx->source, &node->pfd);
 
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 deleted = true;
@@ -237,7 +239,7 @@ void aio_set_fd_handler(AioContext *ctx,
             /* Alloc and insert if it's not already there */
             node = g_new0(AioHandler, 1);
             node->pfd.fd = fd;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
 
             g_source_add_poll(&ctx->source, &node->pfd);
             is_new = true;
@@ -253,6 +255,7 @@ void aio_set_fd_handler(AioContext *ctx,
     }
 
     aio_epoll_update(ctx, node, is_new);
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
     if (deleted) {
         g_free(node);
@@ -276,20 +279,30 @@ bool aio_prepare(AioContext *ctx)
 bool aio_pending(AioContext *ctx)
 {
     AioHandler *node;
+    bool result = false;
 
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         int revents;
 
         revents = node->pfd.revents & node->pfd.events;
         if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
-            return true;
+            result = true;
+            break;
         }
         if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
-            return true;
+            result = true;
+            break;
         }
     }
+    qemu_lockcnt_dec(&ctx->list_lock);
 
-    return false;
+    return result;
 }
 
 bool aio_dispatch(AioContext *ctx)
@@ -310,13 +323,12 @@ bool aio_dispatch(AioContext *ctx)
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    ctx->walking_handlers++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
-    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
+    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
         int revents;
 
-        revents = node->pfd.revents & node->pfd.events;
-        node->pfd.revents = 0;
+        revents = atomic_xchg(&node->pfd.revents, 0) & node->pfd.events;
 
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
@@ -336,15 +348,15 @@ bool aio_dispatch(AioContext *ctx)
         }
 
         if (node->deleted) {
-            ctx->walking_handlers--;
-            if (!ctx->walking_handlers) {
+            if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
                 QLIST_REMOVE(node, node);
                 g_free(node);
+                qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
             }
-            ctx->walking_handlers++;
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
     progress |= timerlistgroup_run_timers(&ctx->tlg);
@@ -419,12 +431,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
         atomic_add(&ctx->notify_me, 2);
     }
 
-    ctx->walking_handlers++;
-
+    qemu_lockcnt_inc(&ctx->list_lock);
     assert(npfd == 0);
 
     /* fill pollfds */
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!node->deleted && node->pfd.events
             && !aio_epoll_enabled(ctx)
             && aio_node_check(ctx, node->is_external)) {
@@ -461,12 +472,12 @@ bool aio_poll(AioContext *ctx, bool blocking)
     /* if we have any readable fds, dispatch event */
     if (ret > 0) {
         for (i = 0; i < npfd; i++) {
-            nodes[i]->pfd.revents = pollfds[i].revents;
+            atomic_or(&nodes[i]->pfd.revents, pollfds[i].revents);
         }
     }
 
     npfd = 0;
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run dispatch even if there were no readable fds to run timers */
     if (aio_dispatch(ctx)) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 26/40] aio-win32: remove walking_handlers, protecting AioHandler list with list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (24 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
                   ` (14 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-win32.c | 81 +++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 32 deletions(-)

diff --git a/aio-win32.c b/aio-win32.c
index a05cffe..fe89b20 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -42,6 +42,7 @@ void aio_set_fd_handler(AioContext *ctx,
     /* fd is a SOCKET in our case */
     AioHandler *node;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
     QLIST_FOREACH(node, &ctx->aio_handlers, node) {
         if (node->pfd.fd == fd && !node->deleted) {
             break;
@@ -51,14 +52,14 @@ void aio_set_fd_handler(AioContext *ctx,
     /* Are we deleting the fd handler? */
     if (!io_read && !io_write) {
         if (node) {
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 g_free(node);
@@ -71,7 +72,7 @@ void aio_set_fd_handler(AioContext *ctx,
             /* Alloc and insert if it's not already there */
             node = g_new0(AioHandler, 1);
             node->pfd.fd = fd;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
         }
 
         node->pfd.events = 0;
@@ -96,6 +97,7 @@ void aio_set_fd_handler(AioContext *ctx,
                        FD_CONNECT | FD_WRITE | FD_OOB);
     }
 
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 }
 
@@ -106,6 +108,7 @@ void aio_set_event_notifier(AioContext *ctx,
 {
     AioHandler *node;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
     QLIST_FOREACH(node, &ctx->aio_handlers, node) {
         if (node->e == e && !node->deleted) {
             break;
@@ -117,14 +120,14 @@ void aio_set_event_notifier(AioContext *ctx,
         if (node) {
             g_source_remove_poll(&ctx->source, &node->pfd);
 
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 g_free(node);
@@ -138,7 +141,7 @@ void aio_set_event_notifier(AioContext *ctx,
             node->pfd.fd = (uintptr_t)event_notifier_get_handle(e);
             node->pfd.events = G_IO_IN;
             node->is_external = is_external;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
 
             g_source_add_poll(&ctx->source, &node->pfd);
         }
@@ -146,6 +149,7 @@ void aio_set_event_notifier(AioContext *ctx,
         node->io_notify = io_notify;
     }
 
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 }
 
@@ -156,10 +160,16 @@ bool aio_prepare(AioContext *ctx)
     bool have_select_revents = false;
     fd_set rfds, wfds;
 
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
     /* fill fd sets */
     FD_ZERO(&rfds);
     FD_ZERO(&wfds);
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (node->io_read) {
             FD_SET ((SOCKET)node->pfd.fd, &rfds);
         }
@@ -169,61 +179,71 @@ bool aio_prepare(AioContext *ctx)
     }
 
     if (select(0, &rfds, &wfds, NULL, &tv0) > 0) {
-        QLIST_FOREACH(node, &ctx->aio_handlers, node) {
-            node->pfd.revents = 0;
+        QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
             if (FD_ISSET(node->pfd.fd, &rfds)) {
-                node->pfd.revents |= G_IO_IN;
+                atomic_or(&node->pfd.revents, G_IO_IN);
                 have_select_revents = true;
             }
 
             if (FD_ISSET(node->pfd.fd, &wfds)) {
-                node->pfd.revents |= G_IO_OUT;
+                atomic_or(&node->pfd.revents, G_IO_OUT);
                 have_select_revents = true;
             }
         }
     }
 
+    qemu_lockcnt_dec(&ctx->list_lock);
     return have_select_revents;
 }
 
 bool aio_pending(AioContext *ctx)
 {
     AioHandler *node;
+    bool result = false;
 
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (node->pfd.revents && node->io_notify) {
-            return true;
+            result = true;
+            break;
         }
 
         if ((node->pfd.revents & G_IO_IN) && node->io_read) {
-            return true;
+            result = true;
+            break;
         }
         if ((node->pfd.revents & G_IO_OUT) && node->io_write) {
-            return true;
+            result = true;
+            break;
         }
     }
 
-    return false;
+    qemu_lockcnt_dec(&ctx->list_lock);
+    return result;
 }
 
 static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 {
     AioHandler *node, *tmp;
     bool progress = false;
 
-    ctx->walking_handlers++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
-        int revents = node->pfd.revents;
+    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
+        int revents = atomic_xchg(&node->pfd.revents, 0);
 
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
-            node->pfd.revents = 0;
             node->io_notify(node->e);
 
             /* aio_notify() does not count as progress */
@@ -234,7 +254,6 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
-            node->pfd.revents = 0;
             if ((revents & G_IO_IN) && node->io_read) {
                 node->io_read(node->opaque);
                 progress = true;
@@ -255,16 +274,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         }
 
         if (node->deleted) {
-            ctx->walking_handlers--;
-            if (!ctx->walking_handlers) {
+            if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
                 QLIST_REMOVE(node, node);
                 g_free(node);
+                qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
             }
-            ctx->walking_handlers++;
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
     return progress;
 }
 
@@ -300,20 +318,19 @@ bool aio_poll(AioContext *ctx, bool blocking)
         atomic_add(&ctx->notify_me, 2);
     }
 
+    qemu_lockcnt_inc(&ctx->list_lock);
     have_select_revents = aio_prepare(ctx);
 
-    ctx->walking_handlers++;
-
     /* fill fd sets */
     count = 0;
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!node->deleted && node->io_notify
             && aio_node_check(ctx, node->is_external)) {
             events[count++] = event_notifier_get_handle(node->e);
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
     first = true;
 
     /* ctx->notifier is always registered.  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 27/40] aio: document locking
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (25 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/multiple-iothreads.txt |  5 ++---
 include/block/aio.h         | 24 +++++++++++-------------
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 723cc7e..4197f62 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -84,9 +84,8 @@ How to synchronize with an IOThread
 AioContext is not thread-safe so some rules must be followed when using file
 descriptors, event notifiers, timers, or BHs across threads:
 
-1. AioContext functions can be called safely from file descriptor, event
-notifier, timer, or BH callbacks invoked by the AioContext.  No locking is
-necessary.
+1. AioContext functions can always be called safely.  They handle their
+own locking internally.
 
 2. Other threads wishing to access the AioContext must use
 aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
diff --git a/include/block/aio.h b/include/block/aio.h
index 26aa658..21044af 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -50,18 +50,12 @@ typedef void IOHandler(void *opaque);
 struct AioContext {
     GSource source;
 
-    /* Protects all fields from multi-threaded access */
+    /* Used by AioContext users to protect from multi-threaded access.  */
     QemuRecMutex lock;
 
-    /* The list of registered AIO handlers */
+    /* The list of registered AIO handlers.  Protected by ctx->list_lock. */
     QLIST_HEAD(, AioHandler) aio_handlers;
 
-    /* This is a simple lock used to protect the aio_handlers list.
-     * Specifically, it's used to ensure that no callbacks are removed while
-     * we're walking and dispatching callbacks.
-     */
-    int walking_handlers;
-
     /* Used to avoid unnecessary event_notifier_set calls in aio_notify;
      * accessed with atomic primitives.  If this field is 0, everything
      * (file descriptors, bottom halves, timers) will be re-evaluated
@@ -87,9 +81,9 @@ struct AioContext {
      */
     uint32_t notify_me;
 
-    /* A lock to protect between bh's adders and deleter, and to ensure
-     * that no callbacks are removed while we're walking and dispatching
-     * them.
+    /* A lock to protect between QEMUBH and AioHandler adders and deleters,
+     * and to ensure that no callbacks are removed while we're walking and
+     * dispatching them.
      */
     QemuLockCnt list_lock;
 
@@ -111,10 +105,14 @@ struct AioContext {
     bool notified;
     EventNotifier notifier;
 
-    /* Thread pool for performing work and receiving completion callbacks */
+    /* Thread pool for performing work and receiving completion callbacks.
+     * Has its own locking.
+     */
     struct ThreadPool *thread_pool;
 
-    /* TimerLists for calling timers - one per clock type */
+    /* TimerLists for calling timers - one per clock type.  Has its own
+     * locking.
+     */
     QEMUTimerListGroup tlg;
 
     int external_disable_cnt;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (26 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

The AioContext data structures are now protected by list_lock and/or
they are walked with FOREACH_RCU primitives.  There is no longer any
need to acquire the AioContext for the entire duration of aio_dispatch.
Instead, just acquire it around the invocation of each callback.  The
next step will be to push it further down.
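
Concretely, the dispatch loop now brackets just the callback invocation,
along these lines (a sketch of the aio-posix.c hunk below, not new code):

    if (!node->deleted &&
        (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
        node->io_read) {
        aio_context_acquire(ctx);
        node->io_read(node->opaque);
        aio_context_release(ctx);
    }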

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 15 ++++++---------
 aio-win32.c | 16 ++++++++--------
 async.c     |  2 ++
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 212c203..2b41a02 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -333,7 +333,9 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             node->io_read) {
+            aio_context_acquire(ctx);
             node->io_read(node->opaque);
+            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -343,7 +345,9 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             node->io_write) {
+            aio_context_acquire(ctx);
             node->io_write(node->opaque);
+            aio_context_release(ctx);
             progress = true;
         }
 
@@ -359,7 +363,9 @@ bool aio_dispatch(AioContext *ctx)
     qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
+    aio_context_release(ctx);
 
     return progress;
 }
@@ -417,7 +423,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     bool progress;
     int64_t timeout;
 
-    aio_context_acquire(ctx);
     progress = false;
 
     /* aio_notify can avoid the expensive event_notifier_set if
@@ -446,9 +451,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     timeout = blocking ? aio_compute_timeout(ctx) : 0;
 
     /* wait until next event */
-    if (timeout) {
-        aio_context_release(ctx);
-    }
     if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
         AioHandler epoll_handler;
 
@@ -463,9 +465,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
     }
-    if (timeout) {
-        aio_context_acquire(ctx);
-    }
 
     aio_notify_accept(ctx);
 
@@ -484,8 +483,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress = true;
     }
 
-    aio_context_release(ctx);
-
     return progress;
 }
 
diff --git a/aio-win32.c b/aio-win32.c
index fe89b20..b025b3d 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -244,7 +244,9 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
+            aio_context_acquire(ctx);
             node->io_notify(node->e);
+            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -255,11 +257,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
             if ((revents & G_IO_IN) && node->io_read) {
+                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
+                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }
 
@@ -304,7 +310,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     int count;
     int timeout;
 
-    aio_context_acquire(ctx);
     progress = false;
 
     /* aio_notify can avoid the expensive event_notifier_set if
@@ -346,17 +351,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
         timeout = blocking && !have_select_revents
             ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
-        if (timeout) {
-            aio_context_release(ctx);
-        }
         ret = WaitForMultipleObjects(count, events, FALSE, timeout);
         if (blocking) {
             assert(first);
             atomic_sub(&ctx->notify_me, 2);
         }
-        if (timeout) {
-            aio_context_acquire(ctx);
-        }
 
         if (first) {
             aio_notify_accept(ctx);
@@ -379,9 +378,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-
     aio_context_release(ctx);
+
     return progress;
 }
 
diff --git a/async.c b/async.c
index 4c1f658..03fd05a 100644
--- a/async.c
+++ b/async.c
@@ -87,7 +87,9 @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
+            aio_context_acquire(ctx);
             aio_bh_call(bh);
+            aio_context_release(ctx);
         }
     }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (27 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is simpler than taking and releasing the AioContext lock.
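
With the lock no longer held for the whole of aio_dispatch, the
alternative would have been something like this in the callback
(hypothetical; bdrv_get_aio_context is the existing helper):

    aio_context_acquire(bdrv_get_aio_context(acb->common.bs));
    if (--acb->rewrite_count == 0) {
        quorum_aio_finalize(acb);
    }
    aio_context_release(bdrv_get_aio_context(acb->common.bs));

whereas the atomic decrement needs no lock at all:

    if (atomic_fetch_dec(&acb->rewrite_count) == 1) {
        quorum_aio_finalize(acb);
    }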

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/quorum.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index b9ba028..3b19c9e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -250,15 +250,10 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 {
     QuorumAIOCB *acb = opaque;
 
-    /* one less rewrite to do */
-    acb->rewrite_count--;
-
     /* wait until all rewrite callbacks have completed */
-    if (acb->rewrite_count) {
-        return;
+    if (atomic_fetch_dec(&acb->rewrite_count) == 1) {
+        quorum_aio_finalize(acb);
     }
-
-    quorum_aio_finalize(acb);
 }
 
 static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb);
@@ -361,7 +356,7 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
     }
 
     /* quorum_rewrite_aio_cb will count down this to zero */
-    acb->rewrite_count = count;
+    atomic_mb_set(&acb->rewrite_count, count);
 
     /* now fire the correcting rewrites */
     QLIST_FOREACH(version, &acb->votes.vote_list, next) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (28 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

The two cases of quorum_aio_cb are called from different paths, so
clarify that with an assertion.  Also note that the FIFO read pattern
does not do rewrites.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/quorum.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 3b19c9e..d7a0f11 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -271,28 +271,37 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
     }
 }
 
-static void quorum_aio_cb(void *opaque, int ret)
+static void quorum_fifo_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
-    bool rewrite = false;
 
-    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
-        /* We try to read next child in FIFO order if we fail to read */
-        if (ret < 0 && ++acb->child_iter < s->num_children) {
-            read_fifo_child(acb);
-            return;
-        }
+    assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
 
-        if (ret == 0) {
-            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
-        }
-        acb->vote_ret = ret;
-        quorum_aio_finalize(acb);
+    /* We try to read next child in FIFO order if we fail to read */
+    if (ret < 0 && ++acb->child_iter < s->num_children) {
+        read_fifo_child(acb);
         return;
     }
 
+    if (ret == 0) {
+        quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+    }
+    acb->vote_ret = ret;
+
+    /* FIXME: rewrite failed children if acb->child_iter > 0? */
+
+    quorum_aio_finalize(acb);
+}
+
+static void quorum_aio_cb(void *opaque, int ret)
+{
+    QuorumChildRequest *sacb = opaque;
+    QuorumAIOCB *acb = sacb->parent;
+    BDRVQuorumState *s = acb->common.bs->opaque;
+    bool rewrite = false;
+
     sacb->ret = ret;
     acb->count++;
     if (ret == 0) {
@@ -659,7 +668,7 @@ static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb)
                      acb->qcrs[acb->child_iter].buf);
     bdrv_aio_readv(s->children[acb->child_iter]->bs, acb->sector_num,
                    &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+                   quorum_fifo_aio_cb, &acb->qcrs[acb->child_iter]);
 
     return &acb->common;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (29 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

qed_aio_start_io and qed_aio_next_io will not have to acquire/release
the AioContext, while qed_aio_next_io_cb will.  Split the functionality
and gain a little type-safety in the process.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qed.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 9b88895..3d6aa07 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -270,7 +270,19 @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
     return l2_table;
 }
 
-static void qed_aio_next_io(void *opaque, int ret);
+static void qed_aio_next_io(QEDAIOCB *acb, int ret);
+
+static void qed_aio_start_io(QEDAIOCB *acb)
+{
+    qed_aio_next_io(acb, 0);
+}
+
+static void qed_aio_next_io_cb(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+
+    qed_aio_next_io(acb, ret);
+}
 
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
@@ -289,7 +301,7 @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 
     acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
     if (acb) {
-        qed_aio_next_io(acb, 0);
+        qed_aio_start_io(acb);
     }
 }
 
@@ -956,7 +968,7 @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
         QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
         acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
         if (acb) {
-            qed_aio_next_io(acb, 0);
+            qed_aio_start_io(acb);
         } else if (s->header.features & QED_F_NEED_CHECK) {
             qed_start_need_check_timer(s);
         }
@@ -981,7 +993,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
     assert(acb->request.l2_table != NULL);
 
-    qed_aio_next_io(opaque, ret);
+    qed_aio_next_io(acb, ret);
 }
 
 /**
@@ -1029,11 +1041,11 @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
     if (need_alloc) {
         /* Write out the whole new L2 table */
         qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
-                            qed_aio_write_l1_update, acb);
+                           qed_aio_write_l1_update, acb);
     } else {
         /* Write out only the updated part of the L2 table */
         qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
-                            qed_aio_next_io, acb);
+                           qed_aio_next_io_cb, acb);
     }
     return;
 
@@ -1085,7 +1097,7 @@ static void qed_aio_write_main(void *opaque, int ret)
     }
 
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
-        next_fn = qed_aio_next_io;
+        next_fn = qed_aio_next_io_cb;
     } else {
         if (s->bs->backing) {
             next_fn = qed_aio_write_flush_before_l2_update;
@@ -1198,7 +1210,7 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     if (acb->flags & QED_AIOCB_ZERO) {
         /* Skip ahead if the clusters are already zero */
         if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
-            qed_aio_next_io(acb, 0);
+            qed_aio_start_io(acb);
             return;
         }
 
@@ -1318,18 +1330,18 @@ static void qed_aio_read_data(void *opaque, int ret,
     /* Handle zero cluster and backing file reads */
     if (ret == QED_CLUSTER_ZERO) {
         qemu_iovec_memset(&acb->cur_qiov, 0, 0, acb->cur_qiov.size);
-        qed_aio_next_io(acb, 0);
+        qed_aio_start_io(acb);
         return;
     } else if (ret != QED_CLUSTER_FOUND) {
         qed_read_backing_file(s, acb->cur_pos, &acb->cur_qiov,
-                              &acb->backing_qiov, qed_aio_next_io, acb);
+                              &acb->backing_qiov, qed_aio_next_io_cb, acb);
         return;
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
     bdrv_aio_readv(bs->file->bs, offset / BDRV_SECTOR_SIZE,
                    &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
-                   qed_aio_next_io, acb);
+                   qed_aio_next_io_cb, acb);
     return;
 
 err:
@@ -1339,9 +1351,8 @@ err:
 /**
  * Begin next I/O or complete the request
  */
-static void qed_aio_next_io(void *opaque, int ret)
+static void qed_aio_next_io(QEDAIOCB *acb, int ret)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
                                 qed_aio_write_data : qed_aio_read_data;
@@ -1397,7 +1408,7 @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
     qemu_iovec_init(&acb->cur_qiov, qiov->niov);
 
     /* Start request */
-    qed_aio_next_io(acb, 0);
+    qed_aio_start_io(acb);
     return &acb->common;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (30 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c                     |  4 ----
 aio-win32.c                     |  6 ------
 block/curl.c                    | 16 +++++++++++---
 block/iscsi.c                   |  4 ++++
 block/nbd-client.c              | 14 ++++++++++--
 block/nfs.c                     |  6 ++++++
 block/sheepdog.c                | 29 +++++++++++++++----------
 block/ssh.c                     | 47 ++++++++++++++++++++---------------------
 block/win32-aio.c               | 10 +++++----
 hw/block/virtio-blk.c           |  5 ++++-
 hw/scsi/virtio-scsi-dataplane.c |  2 ++
 hw/scsi/virtio-scsi.c           |  7 ++++++
 nbd.c                           |  4 ++++
 13 files changed, 99 insertions(+), 55 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 2b41a02..972f3ff 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -333,9 +333,7 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             node->io_read) {
-            aio_context_acquire(ctx);
             node->io_read(node->opaque);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -345,9 +343,7 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             node->io_write) {
-            aio_context_acquire(ctx);
             node->io_write(node->opaque);
-            aio_context_release(ctx);
             progress = true;
         }
 
diff --git a/aio-win32.c b/aio-win32.c
index b025b3d..1b50019 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -244,9 +244,7 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
-            aio_context_acquire(ctx);
             node->io_notify(node->e);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -257,15 +255,11 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
             if ((revents & G_IO_IN) && node->io_read) {
-                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
-                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
 
diff --git a/block/curl.c b/block/curl.c
index 8994182..3d7e1cb 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -332,9 +332,8 @@ static void curl_multi_check_completion(BDRVCURLState *s)
     }
 }
 
-static void curl_multi_do(void *arg)
+static void curl_multi_do_locked(CURLState *s)
 {
-    CURLState *s = (CURLState *)arg;
     int running;
     int r;
 
@@ -348,12 +347,23 @@ static void curl_multi_do(void *arg)
 
 }
 
+static void curl_multi_do(void *arg)
+{
+    CURLState *s = (CURLState *)arg;
+
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
+    aio_context_release(s->s->aio_context);
+}
+
 static void curl_multi_read(void *arg)
 {
     CURLState *s = (CURLState *)arg;
 
-    curl_multi_do(arg);
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
     curl_multi_check_completion(s->s);
+    aio_context_release(s->s->aio_context);
 }
 
 static void curl_multi_timeout_do(void *arg)
diff --git a/block/iscsi.c b/block/iscsi.c
index bd1f1bf..16c3b44 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -380,8 +380,10 @@ iscsi_process_read(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLIN);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void
@@ -390,8 +392,10 @@ iscsi_process_write(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLOUT);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static int64_t sector_lun2qemu(int64_t sector, IscsiLun *iscsilun)
diff --git a/block/nbd-client.c b/block/nbd-client.c
index b7fd17a..b0a888d 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -56,9 +56,8 @@ static void nbd_teardown_connection(BlockDriverState *bs)
     client->sock = -1;
 }
 
-static void nbd_reply_ready(void *opaque)
+static void nbd_reply_ready_locked(BlockDriverState *bs)
 {
-    BlockDriverState *bs = opaque;
     NbdClientSession *s = nbd_get_client_session(bs);
     uint64_t i;
     int ret;
@@ -95,11 +94,22 @@ fail:
     nbd_teardown_connection(bs);
 }
 
+static void nbd_reply_ready(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+
+    aio_context_acquire(bdrv_get_aio_context(bs));
+    nbd_reply_ready_locked(bs);
+    aio_context_release(bdrv_get_aio_context(bs));
+}
+
 static void nbd_restart_write(void *opaque)
 {
     BlockDriverState *bs = opaque;
 
+    aio_context_acquire(bdrv_get_aio_context(bs));
     qemu_coroutine_enter(nbd_get_client_session(bs)->send_coroutine, NULL);
+    aio_context_release(bdrv_get_aio_context(bs));
 }
 
 static int nbd_co_send_request(BlockDriverState *bs,
diff --git a/block/nfs.c b/block/nfs.c
index fd79f89..910a51e 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -75,15 +75,21 @@ static void nfs_set_events(NFSClient *client)
 static void nfs_process_read(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLIN);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_process_write(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLOUT);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
diff --git a/block/sheepdog.c b/block/sheepdog.c
index d80e4ed..1113043 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -621,13 +621,6 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     return ret;
 }
 
-static void restart_co_req(void *opaque)
-{
-    Coroutine *co = opaque;
-
-    qemu_coroutine_enter(co, NULL);
-}
-
 typedef struct SheepdogReqCo {
     int sockfd;
     AioContext *aio_context;
@@ -637,12 +630,21 @@ typedef struct SheepdogReqCo {
     unsigned int *rlen;
     int ret;
     bool finished;
+    Coroutine *co;
 } SheepdogReqCo;
 
+static void restart_co_req(void *opaque)
+{
+    SheepdogReqCo *srco = opaque;
+
+    aio_context_acquire(srco->aio_context);
+    qemu_coroutine_enter(srco->co, NULL);
+    aio_context_release(srco->aio_context);
+}
+
 static coroutine_fn void do_co_req(void *opaque)
 {
     int ret;
-    Coroutine *co;
     SheepdogReqCo *srco = opaque;
     int sockfd = srco->sockfd;
     SheepdogReq *hdr = srco->hdr;
@@ -650,9 +652,9 @@ static coroutine_fn void do_co_req(void *opaque)
     unsigned int *wlen = srco->wlen;
     unsigned int *rlen = srco->rlen;
 
-    co = qemu_coroutine_self();
+    srco->co = qemu_coroutine_self();
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       NULL, restart_co_req, co);
+                       NULL, restart_co_req, srco);
 
     ret = send_co_req(sockfd, hdr, data, wlen);
     if (ret < 0) {
@@ -660,7 +662,7 @@ static coroutine_fn void do_co_req(void *opaque)
     }
 
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       restart_co_req, NULL, co);
+                       restart_co_req, NULL, srco);
 
     ret = qemu_co_recv(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
@@ -688,6 +690,7 @@ out:
     aio_set_fd_handler(srco->aio_context, sockfd, false,
                        NULL, NULL, NULL);
 
+    srco->co = NULL;
     srco->ret = ret;
     srco->finished = true;
 }
@@ -917,14 +920,18 @@ static void co_read_response(void *opaque)
         s->co_recv = qemu_coroutine_create(aio_read_response);
     }
 
+    aio_context_acquire(s->aio_context);
     qemu_coroutine_enter(s->co_recv, opaque);
+    aio_context_release(s->aio_context);
 }
 
 static void co_write_request(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
 
+    aio_context_acquire(s->aio_context);
     qemu_coroutine_enter(s->co_send, NULL);
+    aio_context_release(s->aio_context);
 }
 
 /*
diff --git a/block/ssh.c b/block/ssh.c
index af025c0..00cda3f 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -772,20 +772,34 @@ static int ssh_has_zero_init(BlockDriverState *bs)
     return has_zero_init;
 }
 
+typedef struct BDRVSSHRestart {
+    Coroutine *co;
+    AioContext *ctx;
+} BDRVSSHRestart;
+
 static void restart_coroutine(void *opaque)
 {
-    Coroutine *co = opaque;
+    BDRVSSHRestart *restart = opaque;
 
-    DPRINTF("co=%p", co);
+    DPRINTF("ctx=%p co=%p", restart->ctx, restart->co);
 
-    qemu_coroutine_enter(co, NULL);
+    aio_context_acquire(restart->ctx);
+    qemu_coroutine_enter(restart->co, NULL);
+    aio_context_release(restart->ctx);
 }
 
-static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
+/* A non-blocking call returned EAGAIN, so yield, ensuring the
+ * handlers are set up so that we'll be rescheduled when there is an
+ * interesting event on the socket.
+ */
+static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 {
     int r;
     IOHandler *rd_handler = NULL, *wr_handler = NULL;
-    Coroutine *co = qemu_coroutine_self();
+    BDRVSSHRestart restart = {
+        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self()
+    };
 
     r = libssh2_session_block_directions(s->session);
 
@@ -800,26 +814,11 @@ static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
             rd_handler, wr_handler);
 
     aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, rd_handler, wr_handler, co);
-}
-
-static coroutine_fn void clear_fd_handler(BDRVSSHState *s,
-                                          BlockDriverState *bs)
-{
-    DPRINTF("s->sock=%d", s->sock);
-    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, NULL, NULL, NULL);
-}
-
-/* A non-blocking call returned EAGAIN, so yield, ensuring the
- * handlers are set up so that we'll be rescheduled when there is an
- * interesting event on the socket.
- */
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
-{
-    set_fd_handler(s, bs);
+                       false, rd_handler, wr_handler, &restart);
     qemu_coroutine_yield();
-    clear_fd_handler(s, bs);
+    DPRINTF("s->sock=%d - back", s->sock);
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock, false,
+                       NULL, NULL, NULL);
 }
 
 /* SFTP has a function `libssh2_sftp_seek64' which seeks to a position
diff --git a/block/win32-aio.c b/block/win32-aio.c
index bbf2f01..85aac85 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -40,7 +40,7 @@ struct QEMUWin32AIOState {
     HANDLE hIOCP;
     EventNotifier e;
     int count;
-    bool is_aio_context_attached;
+    AioContext *aio_ctx;
 };
 
 typedef struct QEMUWin32AIOCB {
@@ -87,7 +87,9 @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }
 
 
+    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
+    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }
 
@@ -175,13 +177,13 @@ void win32_aio_detach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *old_context)
 {
     aio_set_event_notifier(old_context, &aio->e, false, NULL);
-    aio->is_aio_context_attached = false;
+    aio->aio_ctx = NULL;
 }
 
 void win32_aio_attach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *new_context)
 {
-    aio->is_aio_context_attached = true;
+    aio->aio_ctx = new_context;
     aio_set_event_notifier(new_context, &aio->e, false,
                            win32_aio_completion_cb);
 }
@@ -211,7 +213,7 @@ out_free_state:
 
 void win32_aio_cleanup(QEMUWin32AIOState *aio)
 {
-    assert(!aio->is_aio_context_attached);
+    assert(!aio->aio_ctx);
     CloseHandle(aio->hIOCP);
     event_notifier_cleanup(&aio->e);
     g_free(aio);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e83d823..d72942e 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -147,7 +147,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
 {
     VirtIOBlockIoctlReq *ioctl_req = opaque;
     VirtIOBlockReq *req = ioctl_req->req;
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
     struct virtio_scsi_inhdr *scsi;
     struct sg_io_hdr *hdr;
 
@@ -599,6 +600,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         return;
     }
 
+    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_io_plug(s->blk);
 
     while ((req = virtio_blk_get_request(s))) {
@@ -610,6 +612,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     }
 
     blk_io_unplug(s->blk);
+    aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static void virtio_blk_dma_restart_bh(void *opaque)
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index b1745b2..194ce40 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -59,9 +59,11 @@ static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
 
 void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req)
 {
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
     if (virtio_should_notify(vdev, req->vq)) {
         event_notifier_set(virtio_queue_get_guest_notifier(req->vq));
     }
+    aio_context_release(s->ctx);
 }
 
 /* assumes s->ctx held */
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 4054ce5..8afa489 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -420,9 +420,11 @@ static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+    aio_context_acquire(s->ctx);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         virtio_scsi_handle_ctrl_req(s, req);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_complete_cmd_req(VirtIOSCSIReq *req)
@@ -570,6 +572,8 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+
+    aio_context_acquire(s->ctx);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         if (virtio_scsi_handle_cmd_req_prepare(s, req)) {
             QTAILQ_INSERT_TAIL(&reqs, req, next);
@@ -579,6 +583,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
     QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
         virtio_scsi_handle_cmd_req_submit(s, req);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_get_config(VirtIODevice *vdev,
@@ -732,9 +737,11 @@ static void virtio_scsi_handle_event(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+    aio_context_acquire(s->ctx);
     if (s->events_dropped) {
         virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_change(SCSIBus *bus, SCSIDevice *dev, SCSISense sense)
diff --git a/nbd.c b/nbd.c
index b3d9654..2867f34 100644
--- a/nbd.c
+++ b/nbd.c
@@ -1445,6 +1445,10 @@ static void nbd_restart_write(void *opaque)
 static void nbd_set_handlers(NBDClient *client)
 {
     if (client->exp && client->exp->ctx) {
+        /* Note that the handlers do not expect any concurrency; qemu-nbd
+         * does not instantiate multiple AioContexts yet, nor does it call
+         * aio_poll/aio_dispatch from multiple threads.
+         */
         aio_set_fd_handler(client->exp->ctx, client->sock,
                            true,
                            client->can_read ? nbd_read : NULL,
-- 
1.8.3.1

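The sheepdog and ssh hunks above share one pattern: the opaque pointer passed to aio_set_fd_handler() now carries both the coroutine to re-enter and the AioContext it belongs to, so the restart handler can take the context lock around qemu_coroutine_enter().  A minimal sketch of that pattern; Restart, restart_cb and wait_for_read are illustrative names, not identifiers from the patch:

    typedef struct Restart {
        Coroutine *co;
        AioContext *ctx;
    } Restart;

    static void restart_cb(void *opaque)
    {
        Restart *r = opaque;

        /* Resume the coroutine under the lock of its own AioContext. */
        aio_context_acquire(r->ctx);
        qemu_coroutine_enter(r->co, NULL);
        aio_context_release(r->ctx);
    }

    /* Yield until fd is readable, then tear the handler back down. */
    static coroutine_fn void wait_for_read(BlockDriverState *bs, int fd)
    {
        Restart r = {
            .co  = qemu_coroutine_self(),
            .ctx = bdrv_get_aio_context(bs),
        };

        aio_set_fd_handler(r.ctx, fd, false, restart_cb, NULL, &r);
        qemu_coroutine_yield();
        aio_set_fd_handler(r.ctx, fd, false, NULL, NULL, NULL);
    }

ssh keeps the struct on the coroutine's stack as above; sheepdog instead stores the coroutine pointer in the long-lived SheepdogReqCo, but the locking discipline is the same.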

* [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (31 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
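With the async.c hunk below, aio_bh_poll() stops wrapping every bottom half in aio_context_acquire()/release(); each bottom half that completes block-layer work has to take the lock itself.  A minimal sketch of the shape the drivers end up with; MyAIOCB and my_complete_bh are illustrative names modeled on blkdebug's error_callback_bh:

    typedef struct MyAIOCB {
        BlockAIOCB common;
        QEMUBH *bh;
        int ret;
    } MyAIOCB;

    static void my_complete_bh(void *opaque)
    {
        MyAIOCB *acb = opaque;
        AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

        qemu_bh_delete(acb->bh);
        aio_context_acquire(ctx);     /* previously done by aio_bh_poll() */
        acb->common.cb(acb->common.opaque, acb->ret);
        aio_context_release(ctx);
        qemu_aio_unref(acb);
    }
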
 async.c               |  2 --
 block/archipelago.c   |  3 +++
 block/blkdebug.c      |  4 ++++
 block/blkverify.c     |  3 +++
 block/block-backend.c |  4 ++++
 block/curl.c          | 25 +++++++++++++++++--------
 block/gluster.c       |  2 ++
 block/io.c            |  6 ++++++
 block/iscsi.c         |  6 ++++++
 block/linux-aio.c     |  7 +++++++
 block/nfs.c           |  4 ++++
 block/null.c          |  4 ++++
 block/qed.c           | 13 +++++++++++++
 block/qed.h           |  3 +++
 block/rbd.c           |  4 ++++
 dma-helpers.c         |  7 +++++--
 hw/block/virtio-blk.c |  2 ++
 hw/scsi/scsi-bus.c    |  2 ++
 thread-pool.c         |  2 ++
 19 files changed, 91 insertions(+), 12 deletions(-)

diff --git a/async.c b/async.c
index 03fd05a..4c1f658 100644
--- a/async.c
+++ b/async.c
@@ -87,9 +87,7 @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
-            aio_context_acquire(ctx);
             aio_bh_call(bh);
-            aio_context_release(ctx);
         }
     }
 
diff --git a/block/archipelago.c b/block/archipelago.c
index 855655c..7f69a3f 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -312,9 +312,12 @@ static void qemu_archipelago_complete_aio(void *opaque)
 {
     AIORequestData *reqdata = (AIORequestData *) opaque;
     ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
+    AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
 
     qemu_bh_delete(aio_cb->bh);
+    aio_context_acquire(ctx);
     aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
+    aio_context_release(ctx);
     aio_cb->status = 0;
 
     qemu_aio_unref(aio_cb);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index 6860a2b..ba35185 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -458,8 +458,12 @@ out:
 static void error_callback_bh(void *opaque)
 {
     struct BlkdebugAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index c5f8e8d..3ff681a 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -188,13 +188,16 @@ static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write,
 static void blkverify_aio_bh(void *opaque)
 {
     BlkverifyAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     qemu_bh_delete(acb->bh);
     if (acb->buf) {
         qemu_iovec_destroy(&acb->raw_qiov);
         qemu_vfree(acb->buf);
     }
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/block-backend.c b/block/block-backend.c
index 36ccc9e..8549289 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -637,8 +637,12 @@ int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 static void error_callback_bh(void *opaque)
 {
     struct BlockBackendAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/curl.c b/block/curl.c
index 3d7e1cb..17add0a 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -653,10 +653,14 @@ static void curl_readv_bh_cb(void *p)
 {
     CURLState *state;
     int running;
+    int ret = -EINPROGRESS;
 
     CURLAIOCB *acb = p;
-    BDRVCURLState *s = acb->common.bs->opaque;
+    BlockDriverState *bs = acb->common.bs;
+    BDRVCURLState *s = bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(bs);
 
+    aio_context_acquire(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
 
@@ -670,7 +674,7 @@ static void curl_readv_bh_cb(void *p)
             qemu_aio_unref(acb);
             // fall through
         case FIND_RET_WAIT:
-            return;
+            goto out;
         default:
             break;
     }
@@ -678,9 +682,8 @@ static void curl_readv_bh_cb(void *p)
     // No cache found, so let's start a new request
     state = curl_init_state(acb->common.bs, s);
     if (!state) {
-        acb->common.cb(acb->common.opaque, -EIO);
-        qemu_aio_unref(acb);
-        return;
+        ret = -EIO;
+        goto out;
     }
 
     acb->start = 0;
@@ -694,9 +697,8 @@ static void curl_readv_bh_cb(void *p)
     state->orig_buf = g_try_malloc(state->buf_len);
     if (state->buf_len && state->orig_buf == NULL) {
         curl_clean_state(state);
-        acb->common.cb(acb->common.opaque, -ENOMEM);
-        qemu_aio_unref(acb);
-        return;
+        ret = -ENOMEM;
+        goto out;
     }
     state->acb[0] = acb;
 
@@ -709,6 +711,13 @@ static void curl_readv_bh_cb(void *p)
 
     /* Tell curl it needs to kick things off */
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
+
+out:
+    if (ret != -EINPROGRESS) {
+        acb->common.cb(acb->common.opaque, ret);
+        qemu_aio_unref(acb);
+    }
+    aio_context_release(ctx);
 }
 
 static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/gluster.c b/block/gluster.c
index 0857c14..5aa34ea 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -232,7 +232,9 @@ static void qemu_gluster_complete_aio(void *opaque)
 
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    aio_context_acquire(acb->aio_context);
     qemu_coroutine_enter(acb->coroutine, NULL);
+    aio_context_release(acb->aio_context);
 }
 
 /*
diff --git a/block/io.c b/block/io.c
index adc1eab..4b3e2b2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2036,12 +2036,15 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
 static void bdrv_aio_bh_cb(void *opaque)
 {
     BlockAIOCBSync *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     if (!acb->is_write && acb->ret >= 0) {
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
     qemu_aio_unref(acb);
@@ -2117,10 +2120,13 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockAIOCBCoroutine *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     assert(!acb->need_bh);
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     bdrv_co_complete(acb);
+    aio_context_release(ctx);
 }
 
 static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
diff --git a/block/iscsi.c b/block/iscsi.c
index 16c3b44..72c9171 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -129,7 +129,9 @@ iscsi_bh_cb(void *p)
     g_free(acb->buf);
     acb->buf = NULL;
 
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->status);
+    aio_context_release(ctx);
 
     if (acb->task != NULL) {
         scsi_free_scsi_task(acb->task);
@@ -152,9 +154,13 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
 static void iscsi_co_generic_bh_cb(void *opaque)
 {
     struct IscsiTask *iTask = opaque;
+    AioContext *ctx = iTask->iscsilun->aio_context;
+
     iTask->complete = 1;
     qemu_bh_delete(iTask->bh);
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(iTask->co, NULL);
+    aio_context_release(ctx);
 }
 
 static void iscsi_retry_timer_expired(void *opaque)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 88b0520..0e94a86 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -46,6 +46,8 @@ typedef struct {
 } LaioQueue;
 
 struct qemu_laio_state {
+    AioContext *aio_context;
+
     io_context_t ctx;
     EventNotifier e;
 
@@ -109,6 +111,7 @@ static void qemu_laio_completion_bh(void *opaque)
     struct qemu_laio_state *s = opaque;
 
     /* Fetch more completion events when empty */
+    aio_context_acquire(s->aio_context);
     if (s->event_idx == s->event_max) {
         do {
             struct timespec ts = { 0 };
@@ -141,6 +144,8 @@ static void qemu_laio_completion_bh(void *opaque)
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
+
+    aio_context_release(s->aio_context);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -289,12 +294,14 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
 
     aio_set_event_notifier(old_context, &s->e, false, NULL);
     qemu_bh_delete(s->completion_bh);
+    s->aio_context = NULL;
 }
 
 void laio_attach_aio_context(void *s_, AioContext *new_context)
 {
     struct qemu_laio_state *s = s_;
 
+    s->aio_context = new_context;
     s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
     aio_set_event_notifier(new_context, &s->e, false,
                            qemu_laio_completion_cb);
diff --git a/block/nfs.c b/block/nfs.c
index 910a51e..c86850b 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -103,9 +103,13 @@ static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
 static void nfs_co_generic_bh_cb(void *opaque)
 {
     NFSRPC *task = opaque;
+    AioContext *ctx = task->client->aio_context;
+
     task->complete = 1;
     qemu_bh_delete(task->bh);
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(task->co, NULL);
+    aio_context_release(ctx);
 }
 
 static void
diff --git a/block/null.c b/block/null.c
index 7d08323..dd1b170 100644
--- a/block/null.c
+++ b/block/null.c
@@ -117,7 +117,11 @@ static const AIOCBInfo null_aiocb_info = {
 static void null_bh_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index 3d6aa07..d128772 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -348,6 +348,16 @@ static void qed_need_check_timer_cb(void *opaque)
     bdrv_aio_flush(s->bs, qed_clear_need_check, s);
 }
 
+void qed_acquire(BDRVQEDState *s)
+{
+    aio_context_acquire(bdrv_get_aio_context(s->bs));
+}
+
+void qed_release(BDRVQEDState *s)
+{
+    aio_context_release(bdrv_get_aio_context(s->bs));
+}
+
 static void qed_start_need_check_timer(BDRVQEDState *s)
 {
     trace_qed_start_need_check_timer(s);
@@ -925,6 +935,7 @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 static void qed_aio_complete_bh(void *opaque)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb = acb->common.cb;
     void *user_opaque = acb->common.opaque;
     int ret = acb->bh_ret;
@@ -933,7 +944,9 @@ static void qed_aio_complete_bh(void *opaque)
     qemu_aio_unref(acb);
 
     /* Invoke callback */
+    qed_acquire(s);
     cb(user_opaque, ret);
+    qed_release(s);
 }
 
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
diff --git a/block/qed.h b/block/qed.h
index 615e676..106cefe 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -198,6 +198,9 @@ enum {
  */
 typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);
 
+void qed_acquire(BDRVQEDState *s);
+void qed_release(BDRVQEDState *s);
+
 /**
  * Generic callback for chaining async callbacks
  */
diff --git a/block/rbd.c b/block/rbd.c
index a60a19d..6206dc3 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -376,6 +376,7 @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
 static void qemu_rbd_complete_aio(RADOSCB *rcb)
 {
     RBDAIOCB *acb = rcb->acb;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     int64_t r;
 
     r = rcb->ret;
@@ -408,7 +409,10 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+    aio_context_release(ctx);
 
     qemu_aio_unref(acb);
 }
diff --git a/dma-helpers.c b/dma-helpers.c
index 4faec5d..68f6f07 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -69,6 +69,7 @@ void qemu_sglist_destroy(QEMUSGList *qsg)
 
 typedef struct {
     BlockAIOCB common;
+    AioContext *ctx;
     BlockBackend *blk;
     BlockAIOCB *acb;
     QEMUSGList *sg;
@@ -153,8 +154,7 @@ static void dma_blk_cb(void *opaque, int ret)
 
     if (dbs->iov.size == 0) {
         trace_dma_map_wait(dbs);
-        dbs->bh = aio_bh_new(blk_get_aio_context(dbs->blk),
-                             reschedule_dma, dbs);
+        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
         cpu_register_map_client(dbs->bh);
         return;
     }
@@ -163,8 +163,10 @@ static void dma_blk_cb(void *opaque, int ret)
         qemu_iovec_discard_back(&dbs->iov, dbs->iov.size & ~BDRV_SECTOR_MASK);
     }
 
+    aio_context_acquire(dbs->ctx);
     dbs->acb = dbs->io_func(dbs->blk, dbs->sector_num, &dbs->iov,
                             dbs->iov.size / 512, dma_blk_cb, dbs);
+    aio_context_release(dbs->ctx);
     assert(dbs->acb);
 }
 
@@ -201,6 +203,7 @@ BlockAIOCB *dma_blk_io(
 
     dbs->acb = NULL;
     dbs->blk = blk;
+    dbs->ctx = blk_get_aio_context(blk);
     dbs->sg = sg;
     dbs->sector_num = sector_num;
     dbs->sg_cur_index = 0;
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d72942e..5c1cb89 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -626,6 +626,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
     s->rq = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (req) {
         VirtIOBlockReq *next = req->next;
         virtio_blk_handle_request(req, &mrb);
@@ -635,6 +636,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
     if (mrb.num_reqs) {
         virtio_blk_submit_multireq(s->blk, &mrb);
     }
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_dma_restart_cb(void *opaque, int running,
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index fd1171e..0d607b8 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -102,6 +102,7 @@ static void scsi_dma_restart_bh(void *opaque)
     qemu_bh_delete(s->bh);
     s->bh = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         scsi_req_ref(req);
         if (req->retry) {
@@ -119,6 +120,7 @@ static void scsi_dma_restart_bh(void *opaque)
         }
         scsi_req_unref(req);
     }
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 void scsi_req_retry(SCSIRequest *req)
diff --git a/thread-pool.c b/thread-pool.c
index 402c778..bffd823 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -165,6 +165,7 @@ static void thread_pool_completion_bh(void *opaque)
     ThreadPool *pool = opaque;
     ThreadPoolElement *elem, *next;
 
+    aio_context_acquire(pool->ctx);
 restart:
     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
         if (elem->state != THREAD_DONE) {
@@ -191,6 +192,7 @@ restart:
             qemu_aio_unref(elem);
         }
     }
+    aio_context_release(pool->ctx);
 }
 
 static void thread_pool_cancel(BlockAIOCB *acb)
-- 
1.8.3.1

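The qed.c/qed.h hunks in the patch above also introduce qed_acquire()/qed_release() as shorthand for locking the AioContext of the BDRVQEDState's BlockDriverState; the rest of the series leans on them for every QED callback.  A hedged usage sketch; my_qed_cb is an illustrative name whose real counterpart is qed_aio_next_io_cb in patch 35:

    static void my_qed_cb(void *opaque, int ret)
    {
        QEDAIOCB *acb = opaque;
        BDRVQEDState *s = acb_to_s(acb);

        qed_acquire(s);               /* aio_context_acquire() on s->bs's context */
        qed_aio_next_io(acb, ret);    /* drive the request state machine locked */
        qed_release(s);
    }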

* [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (32 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
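The aio-posix.c and aio-win32.c hunks below stop running timers under the AioContext lock, so timer callbacks that touch block state now acquire it explicitly, just as patch 33 did for bottom halves.  A minimal sketch; MyTimedAIOCB and my_timer_cb are illustrative names modeled on null_timer_cb:

    typedef struct MyTimedAIOCB {
        BlockAIOCB common;
        QEMUTimer timer;
    } MyTimedAIOCB;

    static void my_timer_cb(void *opaque)
    {
        MyTimedAIOCB *acb = opaque;
        AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

        aio_context_acquire(ctx);     /* complete the request under the lock */
        acb->common.cb(acb->common.opaque, 0);
        aio_context_release(ctx);
        timer_deinit(&acb->timer);
        qemu_aio_unref(acb);
    }
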
 aio-posix.c                 | 2 --
 aio-win32.c                 | 2 --
 block/curl.c                | 2 ++
 block/iscsi.c               | 2 ++
 block/null.c                | 4 ++++
 block/qed.c                 | 2 ++
 block/throttle-groups.c     | 2 ++
 util/qemu-coroutine-sleep.c | 5 +++++
 8 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 972f3ff..aabc4ae 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -359,9 +359,7 @@ bool aio_dispatch(AioContext *ctx)
     qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
 
     return progress;
 }
diff --git a/aio-win32.c b/aio-win32.c
index 1b50019..4479d3f 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -372,9 +372,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
 
     return progress;
 }
diff --git a/block/curl.c b/block/curl.c
index 17add0a..c2b6726 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -376,9 +376,11 @@ static void curl_multi_timeout_do(void *arg)
         return;
     }
 
+    aio_context_acquire(s->aio_context);
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
 
     curl_multi_check_completion(s);
+    aio_context_release(s->aio_context);
 #else
     abort();
 #endif
diff --git a/block/iscsi.c b/block/iscsi.c
index 72c9171..411aef8 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1214,6 +1214,7 @@ static void iscsi_nop_timed_event(void *opaque)
 {
     IscsiLun *iscsilun = opaque;
 
+    aio_context_acquire(iscsilun->aio_context);
     if (iscsi_get_nops_in_flight(iscsilun->iscsi) >= MAX_NOP_FAILURES) {
         error_report("iSCSI: NOP timeout. Reconnecting...");
         iscsilun->request_timed_out = true;
@@ -1224,6 +1225,7 @@ static void iscsi_nop_timed_event(void *opaque)
 
     timer_mod(iscsilun->nop_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + NOP_INTERVAL);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
diff --git a/block/null.c b/block/null.c
index dd1b170..9bddc1b 100644
--- a/block/null.c
+++ b/block/null.c
@@ -129,7 +129,11 @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index d128772..17777d1 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -342,10 +342,12 @@ static void qed_need_check_timer_cb(void *opaque)
 
     trace_qed_need_check_timer_cb(s);
 
+    qed_acquire(s);
     qed_plug_allocating_write_reqs(s);
 
     /* Ensure writes are on disk before clearing flag */
     bdrv_aio_flush(s->bs, qed_clear_need_check, s);
+    qed_release(s);
 }
 
 void qed_acquire(BDRVQEDState *s)
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 13b5baa..cdb819a 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -370,7 +370,9 @@ static void timer_cb(BlockDriverState *bs, bool is_write)
     qemu_mutex_unlock(&tg->lock);
 
     /* Run the request that was waiting for this timer */
+    aio_context_acquire(bdrv_get_aio_context(bs));
     empty_queue = !qemu_co_enter_next(&bs->throttled_reqs[is_write]);
+    aio_context_release(bdrv_get_aio_context(bs));
 
     /* If the request queue was empty then we have to take care of
      * scheduling the next one */
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index b35db56..6e07343 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -18,13 +18,17 @@
 typedef struct CoSleepCB {
     QEMUTimer *ts;
     Coroutine *co;
+    AioContext *ctx;
 } CoSleepCB;
 
 static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;
+    AioContext *ctx = sleep_cb->ctx;
 
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(sleep_cb->co, NULL);
+    aio_context_release(ctx);
 }
 
 void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
@@ -32,6 +36,7 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
 {
     CoSleepCB sleep_cb = {
         .co = qemu_coroutine_self(),
+        .ctx = ctx,
     };
     sleep_cb.ts = aio_timer_new(ctx, type, SCALE_NS, co_sleep_cb, &sleep_cb);
     timer_mod(sleep_cb.ts, qemu_clock_get_ns(type) + ns);
-- 
1.8.3.1


* [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (33 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
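This moves the lock out to the outermost AIO completion callback, which is why many of patch 33's bottom-half hunks are reverted below.  Callbacks that used to return early grow an out:/done: label so the release is never skipped.  A minimal sketch of the resulting shape; MyReq, complete_request and my_aio_cb are illustrative names modeled on scsi_aio_complete:

    typedef struct MyReq {
        BlockBackend *blk;
        bool canceled;
    } MyReq;

    static void complete_request(MyReq *r, int ret);   /* illustrative helper */

    static void my_aio_cb(void *opaque, int ret)
    {
        MyReq *r = opaque;
        AioContext *ctx = blk_get_aio_context(r->blk);

        aio_context_acquire(ctx);
        if (r->canceled) {
            goto done;                /* the release below still runs */
        }
        complete_request(r, ret);
    done:
        aio_context_release(ctx);
    }
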
 block/archipelago.c    |  3 ---
 block/blkdebug.c       |  4 ---
 block/blkverify.c      |  9 +++----
 block/block-backend.c  |  4 ---
 block/curl.c           |  2 +-
 block/io.c             | 13 +++++-----
 block/iscsi.c          |  2 --
 block/linux-aio.c      |  7 +++---
 block/mirror.c         | 12 ++++++---
 block/null.c           |  8 ------
 block/qed-cluster.c    |  2 ++
 block/qed-table.c      | 12 +++++++--
 block/qed.c            | 66 ++++++++++++++++++++++++++++++++++++++------------
 block/quorum.c         | 12 +++++++++
 block/rbd.c            |  4 ---
 block/win32-aio.c      |  2 --
 hw/block/virtio-blk.c  | 12 ++++++++-
 hw/scsi/scsi-disk.c    | 18 ++++++++++++++
 hw/scsi/scsi-generic.c | 20 ++++++++++++---
 thread-pool.c          | 12 ++++++++-
 20 files changed, 157 insertions(+), 67 deletions(-)

diff --git a/block/archipelago.c b/block/archipelago.c
index 7f69a3f..855655c 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -312,12 +312,9 @@ static void qemu_archipelago_complete_aio(void *opaque)
 {
     AIORequestData *reqdata = (AIORequestData *) opaque;
     ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
-    AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
 
     qemu_bh_delete(aio_cb->bh);
-    aio_context_acquire(ctx);
     aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
-    aio_context_release(ctx);
     aio_cb->status = 0;
 
     qemu_aio_unref(aio_cb);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index ba35185..6860a2b 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -458,12 +458,8 @@ out:
 static void error_callback_bh(void *opaque)
 {
     struct BlkdebugAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index 3ff681a..74188f5 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -188,23 +188,22 @@ static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write,
 static void blkverify_aio_bh(void *opaque)
 {
     BlkverifyAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     qemu_bh_delete(acb->bh);
     if (acb->buf) {
         qemu_iovec_destroy(&acb->raw_qiov);
         qemu_vfree(acb->buf);
     }
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
 static void blkverify_aio_cb(void *opaque, int ret)
 {
     BlkverifyAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
+    aio_context_acquire(ctx);
     switch (++acb->done) {
     case 1:
         acb->ret = ret;
@@ -219,11 +218,11 @@ static void blkverify_aio_cb(void *opaque, int ret)
             acb->verify(acb);
         }
 
-        acb->bh = aio_bh_new(bdrv_get_aio_context(acb->common.bs),
-                             blkverify_aio_bh, acb);
+        acb->bh = aio_bh_new(ctx, blkverify_aio_bh, acb);
         qemu_bh_schedule(acb->bh);
         break;
     }
+    aio_context_release(ctx);
 }
 
 static void blkverify_verify_readv(BlkverifyAIOCB *acb)
diff --git a/block/block-backend.c b/block/block-backend.c
index 8549289..36ccc9e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -637,12 +637,8 @@ int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 static void error_callback_bh(void *opaque)
 {
     struct BlockBackendAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/curl.c b/block/curl.c
index c2b6726..7b5b17f 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -715,11 +715,11 @@ static void curl_readv_bh_cb(void *p)
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
 
 out:
+    aio_context_release(ctx);
     if (ret != -EINPROGRESS) {
         acb->common.cb(acb->common.opaque, ret);
         qemu_aio_unref(acb);
     }
-    aio_context_release(ctx);
 }
 
 static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/io.c b/block/io.c
index 4b3e2b2..9b30f96 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2036,15 +2036,12 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
 static void bdrv_aio_bh_cb(void *opaque)
 {
     BlockAIOCBSync *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     if (!acb->is_write && acb->ret >= 0) {
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
     qemu_aio_unref(acb);
@@ -2120,13 +2117,10 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockAIOCBCoroutine *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     assert(!acb->need_bh);
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     bdrv_co_complete(acb);
-    aio_context_release(ctx);
 }
 
 static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
@@ -2277,15 +2271,19 @@ void qemu_aio_unref(void *p)
 
 typedef struct CoroutineIOCompletion {
     Coroutine *coroutine;
+    AioContext *ctx;
     int ret;
 } CoroutineIOCompletion;
 
 static void bdrv_co_io_em_complete(void *opaque, int ret)
 {
     CoroutineIOCompletion *co = opaque;
+    AioContext *ctx = co->ctx;
 
     co->ret = ret;
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(co->coroutine, NULL);
+    aio_context_release(ctx);
 }
 
 static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
@@ -2294,6 +2292,7 @@ static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
 {
     CoroutineIOCompletion co = {
         .coroutine = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
     };
     BlockAIOCB *acb;
 
@@ -2367,6 +2366,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
         BlockAIOCB *acb;
         CoroutineIOCompletion co = {
             .coroutine = qemu_coroutine_self(),
+            .ctx = bdrv_get_aio_context(bs),
         };
 
         acb = bs->drv->bdrv_aio_flush(bs, bdrv_co_io_em_complete, &co);
@@ -2497,6 +2497,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
             BlockAIOCB *acb;
             CoroutineIOCompletion co = {
                 .coroutine = qemu_coroutine_self(),
+                .ctx = bdrv_get_aio_context(bs),
             };
 
             acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
diff --git a/block/iscsi.c b/block/iscsi.c
index 411aef8..a3dc06b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -129,9 +129,7 @@ iscsi_bh_cb(void *p)
     g_free(acb->buf);
     acb->buf = NULL;
 
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->status);
-    aio_context_release(ctx);
 
     if (acb->task != NULL) {
         scsi_free_scsi_task(acb->task);
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 0e94a86..d061b8b 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -71,8 +71,7 @@ static inline ssize_t io_event_ret(struct io_event *ev)
 /*
  * Completes an AIO request (calls the callback and frees the ACB).
  */
-static void qemu_laio_process_completion(struct qemu_laio_state *s,
-    struct qemu_laiocb *laiocb)
+static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
 {
     int ret;
 
@@ -138,7 +137,9 @@ static void qemu_laio_completion_bh(void *opaque)
         laiocb->ret = io_event_ret(&s->events[s->event_idx]);
         s->event_idx++;
 
-        qemu_laio_process_completion(s, laiocb);
+        aio_context_release(s->aio_context);
+        qemu_laio_process_completion(laiocb);
+        aio_context_acquire(s->aio_context);
     }
 
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
diff --git a/block/mirror.c b/block/mirror.c
index 52c9abf..a8249d1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -125,6 +125,8 @@ static void mirror_write_complete(void *opaque, int ret)
 {
     MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
+
+    aio_context_acquire(bdrv_get_aio_context(s->common.bs));
     if (ret < 0) {
         BlockErrorAction action;
 
@@ -135,12 +137,15 @@ static void mirror_write_complete(void *opaque, int ret)
         }
     }
     mirror_iteration_done(op, ret);
+    aio_context_release(bdrv_get_aio_context(s->common.bs));
 }
 
 static void mirror_read_complete(void *opaque, int ret)
 {
     MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
+
+    aio_context_acquire(bdrv_get_aio_context(s->common.bs));
     if (ret < 0) {
         BlockErrorAction action;
 
@@ -151,10 +156,11 @@ static void mirror_read_complete(void *opaque, int ret)
         }
 
         mirror_iteration_done(op, ret);
-        return;
+    } else {
+        bdrv_aio_writev(s->target, op->sector_num, &op->qiov, op->nb_sectors,
+                        mirror_write_complete, op);
     }
-    bdrv_aio_writev(s->target, op->sector_num, &op->qiov, op->nb_sectors,
-                    mirror_write_complete, op);
+    aio_context_release(bdrv_get_aio_context(s->common.bs));
 }
 
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
diff --git a/block/null.c b/block/null.c
index 9bddc1b..7d08323 100644
--- a/block/null.c
+++ b/block/null.c
@@ -117,11 +117,7 @@ static const AIOCBInfo null_aiocb_info = {
 static void null_bh_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
-    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     qemu_aio_unref(acb);
 }
@@ -129,11 +125,7 @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
-    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index f64b2af..64ea4f2 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -82,6 +82,7 @@ static void qed_find_cluster_cb(void *opaque, int ret)
     unsigned int index;
     unsigned int n;
 
+    qed_acquire(s);
     if (ret) {
         goto out;
     }
@@ -108,6 +109,7 @@ static void qed_find_cluster_cb(void *opaque, int ret)
 
 out:
     find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+    qed_release(s);
     g_free(find_cluster_cb);
 }
 
diff --git a/block/qed-table.c b/block/qed-table.c
index f4219b8..83bd5c2 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -29,6 +29,7 @@ static void qed_read_table_cb(void *opaque, int ret)
 {
     QEDReadTableCB *read_table_cb = opaque;
     QEDTable *table = read_table_cb->table;
+    BDRVQEDState *s = read_table_cb->s;
     int noffsets = read_table_cb->qiov.size / sizeof(uint64_t);
     int i;
 
@@ -38,13 +39,15 @@ static void qed_read_table_cb(void *opaque, int ret)
     }
 
     /* Byteswap offsets */
+    qed_acquire(s);
     for (i = 0; i < noffsets; i++) {
         table->offsets[i] = le64_to_cpu(table->offsets[i]);
     }
+    qed_release(s);
 
 out:
     /* Completion */
-    trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
+    trace_qed_read_table_cb(s, read_table_cb->table, ret);
     gencb_complete(&read_table_cb->gencb, ret);
 }
 
@@ -82,8 +85,9 @@ typedef struct {
 static void qed_write_table_cb(void *opaque, int ret)
 {
     QEDWriteTableCB *write_table_cb = opaque;
+    BDRVQEDState *s = write_table_cb->s;
 
-    trace_qed_write_table_cb(write_table_cb->s,
+    trace_qed_write_table_cb(s,
                              write_table_cb->orig_table,
                              write_table_cb->flush,
                              ret);
@@ -95,8 +99,10 @@ static void qed_write_table_cb(void *opaque, int ret)
     if (write_table_cb->flush) {
         /* We still need to flush first */
         write_table_cb->flush = false;
+        qed_acquire(s);
         bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
                        write_table_cb);
+        qed_release(s);
         return;
     }
 
@@ -215,6 +221,7 @@ static void qed_read_l2_table_cb(void *opaque, int ret)
     CachedL2Table *l2_table = request->l2_table;
     uint64_t l2_offset = read_l2_table_cb->l2_offset;
 
+    qed_acquire(s);
     if (ret) {
         /* can't trust loaded L2 table anymore */
         qed_unref_l2_cache_entry(l2_table);
@@ -230,6 +237,7 @@ static void qed_read_l2_table_cb(void *opaque, int ret)
         request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
         assert(request->l2_table != NULL);
     }
+    qed_release(s);
 
     gencb_complete(&read_l2_table_cb->gencb, ret);
 }
diff --git a/block/qed.c b/block/qed.c
index 17777d1..f4823e8 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -117,11 +117,13 @@ static void qed_write_header_read_cb(void *opaque, int ret)
     }
 
     /* Update header */
+    qed_acquire(s);
     qed_header_cpu_to_le(&s->header, (QEDHeader *)write_header_cb->buf);
 
     bdrv_aio_writev(s->bs->file->bs, 0, &write_header_cb->qiov,
                     write_header_cb->nsectors, qed_write_header_cb,
                     write_header_cb);
+    qed_release(s);
 }
 
 /**
@@ -277,11 +279,19 @@ static void qed_aio_start_io(QEDAIOCB *acb)
     qed_aio_next_io(acb, 0);
 }
 
+static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+{
+    return acb->common.bs->opaque;
+}
+
 static void qed_aio_next_io_cb(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
 
-    qed_aio_next_io(acb, ret);
+    qed_acquire(s);
+    qed_aio_next_io(opaque, ret);
+    qed_release(s);
 }
 
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
@@ -314,23 +324,29 @@ static void qed_flush_after_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
 
+    qed_acquire(s);
     bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
 
     /* No need to wait until flush completes */
     qed_unplug_allocating_write_reqs(s);
+    qed_release(s);
 }
 
 static void qed_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
 
+    qed_acquire(s);
     if (ret) {
         qed_unplug_allocating_write_reqs(s);
-        return;
+        goto out;
     }
 
     s->header.features &= ~QED_F_NEED_CHECK;
     qed_write_header(s, qed_flush_after_clear_need_check, s);
+
+out:
+    qed_release(s);
 }
 
 static void qed_need_check_timer_cb(void *opaque)
@@ -773,11 +789,6 @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
     return cb.status;
 }
 
-static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
-{
-    return acb->common.bs->opaque;
-}
-
 /**
  * Read from the backing file or zero-fill if no backing file
  *
@@ -868,10 +879,12 @@ static void qed_copy_from_backing_file_write(void *opaque, int ret)
         return;
     }
 
+    qed_acquire(s);
     BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
     bdrv_aio_writev(s->bs->file->bs, copy_cb->offset / BDRV_SECTOR_SIZE,
                     &copy_cb->qiov, copy_cb->qiov.size / BDRV_SECTOR_SIZE,
                     qed_copy_from_backing_file_cb, copy_cb);
+    qed_release(s);
 }
 
 /**
@@ -937,7 +950,6 @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 static void qed_aio_complete_bh(void *opaque)
 {
     QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb = acb->common.cb;
     void *user_opaque = acb->common.opaque;
     int ret = acb->bh_ret;
@@ -946,9 +958,7 @@ static void qed_aio_complete_bh(void *opaque)
     qemu_aio_unref(acb);
 
     /* Invoke callback */
-    qed_acquire(s);
     cb(user_opaque, ret);
-    qed_release(s);
 }
 
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
@@ -1000,6 +1010,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
 
+    qed_acquire(s);
     qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
 
     /* This is guaranteed to succeed because we just committed the entry to the
@@ -1009,6 +1020,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     assert(acb->request.l2_table != NULL);
 
     qed_aio_next_io(acb, ret);
+    qed_release(s);
 }
 
 /**
@@ -1020,15 +1032,18 @@ static void qed_aio_write_l1_update(void *opaque, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     int index;
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     index = qed_l1_index(s, acb->cur_pos);
     s->l1_table->offsets[index] = acb->request.l2_table->offset;
 
     qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1071,7 +1086,11 @@ err:
 static void qed_aio_write_l2_update_cb(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+
+    qed_acquire(s);
     qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+    qed_release(s);
 }
 
 /**
@@ -1088,9 +1107,11 @@ static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
 
+    qed_acquire(s);
     if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
         qed_aio_complete(acb, -EIO);
     }
+    qed_release(s);
 }
 
 /**
@@ -1106,9 +1127,10 @@ static void qed_aio_write_main(void *opaque, int ret)
 
     trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
@@ -1125,6 +1147,8 @@ static void qed_aio_write_main(void *opaque, int ret)
     bdrv_aio_writev(s->bs->file->bs, offset / BDRV_SECTOR_SIZE,
                     &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
                     next_fn, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1141,14 +1165,17 @@ static void qed_aio_write_postfill(void *opaque, int ret)
                       qed_offset_into_cluster(s, acb->cur_pos) +
                       acb->cur_qiov.size;
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     trace_qed_aio_write_postfill(s, acb, start, len, offset);
     qed_copy_from_backing_file(s, start, len, offset,
                                 qed_aio_write_main, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1161,9 +1188,11 @@ static void qed_aio_write_prefill(void *opaque, int ret)
     uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
     uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
 
+    qed_acquire(s);
     trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
     qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
                                 qed_aio_write_postfill, acb);
+    qed_release(s);
 }
 
 /**
@@ -1182,13 +1211,17 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
 static void qed_aio_write_zero_cluster(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     qed_aio_write_l2_update(acb, 0, 1);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1447,6 +1480,7 @@ static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
 }
 
 typedef struct {
+    BDRVQEDState *s;
     Coroutine *co;
     int ret;
     bool done;
@@ -1459,7 +1493,9 @@ static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
     cb->done = true;
     cb->ret = ret;
     if (cb->co) {
+        qed_acquire(cb->s);
         qemu_coroutine_enter(cb->co, NULL);
+        qed_release(cb->s);
     }
 }
 
@@ -1470,7 +1506,7 @@ static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
 {
     BlockAIOCB *blockacb;
     BDRVQEDState *s = bs->opaque;
-    QEDWriteZeroesCB cb = { .done = false };
+    QEDWriteZeroesCB cb = { .s = s, .done = false };
     QEMUIOVector qiov;
     struct iovec iov;
 
diff --git a/block/quorum.c b/block/quorum.c
index d7a0f11..7b451f1 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -149,6 +149,7 @@ static AIOCBInfo quorum_aiocb_info = {
     .cancel_async       = quorum_aio_cancel,
 };
 
+/* Called _without_ acquiring AioContext.  */
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
     int i, ret = 0;
@@ -276,12 +277,15 @@ static void quorum_fifo_aio_cb(void *opaque, int ret)
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
+    aio_context_acquire(ctx);
     assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
 
     /* We try to read next child in FIFO order if we fail to read */
     if (ret < 0 && ++acb->child_iter < s->num_children) {
         read_fifo_child(acb);
+        aio_context_release(ctx);
         return;
     }
 
@@ -292,6 +296,7 @@ static void quorum_fifo_aio_cb(void *opaque, int ret)
 
     /* FIXME: rewrite failed children if acb->child_iter > 0? */
 
+    aio_context_release(ctx);
     quorum_aio_finalize(acb);
 }
 
@@ -300,8 +305,11 @@ static void quorum_aio_cb(void *opaque, int ret)
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     bool rewrite = false;
 
+    aio_context_acquire(ctx);
+
     sacb->ret = ret;
     acb->count++;
     if (ret == 0) {
@@ -311,7 +319,9 @@ static void quorum_aio_cb(void *opaque, int ret)
     }
     assert(acb->count <= s->num_children);
     assert(acb->success_count <= s->num_children);
+
     if (acb->count < s->num_children) {
+        aio_context_release(ctx);
         return;
     }
 
@@ -322,6 +332,8 @@ static void quorum_aio_cb(void *opaque, int ret)
         quorum_has_too_much_io_failed(acb);
     }
 
+    aio_context_release(ctx);
+
     /* if no rewrite is done the code will finish right away */
     if (!rewrite) {
         quorum_aio_finalize(acb);
diff --git a/block/rbd.c b/block/rbd.c
index 6206dc3..a60a19d 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -376,7 +376,6 @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
 static void qemu_rbd_complete_aio(RADOSCB *rcb)
 {
     RBDAIOCB *acb = rcb->acb;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     int64_t r;
 
     r = rcb->ret;
@@ -409,10 +408,7 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-    aio_context_release(ctx);
 
     qemu_aio_unref(acb);
 }
diff --git a/block/win32-aio.c b/block/win32-aio.c
index 85aac85..16270ca 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -87,9 +87,7 @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }
 
 
-    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
-    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 5c1cb89..f05c84a 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -86,7 +86,9 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
 static void virtio_blk_rw_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *next = opaque;
+    VirtIOBlock *s = next->dev;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (next) {
         VirtIOBlockReq *req = next;
         next = req->mr_next;
@@ -119,21 +121,27 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
         block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
         virtio_blk_free_request(req);
     }
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_flush_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *req = opaque;
+    VirtIOBlock *s = req->dev;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, 0)) {
-            return;
+            goto out;
         }
     }
 
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
     block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
     virtio_blk_free_request(req);
+
+out:
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 #ifdef __linux__
@@ -180,8 +188,10 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
     virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);
 
 out:
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     virtio_blk_req_complete(req, status);
     virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
     g_free(ioctl_req);
 }
 
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4797d83..57ce1a0 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -169,6 +169,8 @@ static void scsi_aio_complete(void *opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -184,6 +186,7 @@ static void scsi_aio_complete(void *opaque, int ret)
     scsi_req_complete(&r->req, GOOD);
 
 done:
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
     scsi_req_unref(&r->req);
 }
 
@@ -273,12 +276,14 @@ static void scsi_dma_complete(void *opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_dma_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_read_complete(void * opaque, int ret)
@@ -289,6 +294,8 @@ static void scsi_read_complete(void * opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -310,6 +317,7 @@ static void scsi_read_complete(void * opaque, int ret)
 
 done:
     scsi_req_unref(&r->req);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Actually issue a read to the block device.  */
@@ -359,12 +367,14 @@ static void scsi_do_read_cb(void *opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_do_read(opaque, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -492,12 +502,14 @@ static void scsi_write_complete(void * opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_write_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_write_data(SCSIRequest *req)
@@ -1640,11 +1652,14 @@ static void scsi_unmap_complete(void *opaque, int ret)
 {
     UnmapCBData *data = opaque;
     SCSIDiskReq *r = data->r;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     scsi_unmap_complete_noio(data, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
@@ -1711,6 +1726,8 @@ static void scsi_write_same_complete(void *opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -1746,6 +1763,7 @@ done:
     scsi_req_unref(&r->req);
     qemu_vfree(data->iov.iov_base);
     g_free(data);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index a4626f7..1b0c7e9 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -145,10 +145,14 @@ done:
 static void scsi_command_complete(void *opaque, int ret)
 {
     SCSIGenericReq *r = (SCSIGenericReq *)opaque;
+    SCSIDevice *s = r->req.dev;
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     scsi_command_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 static int execute_command(BlockBackend *blk,
@@ -184,9 +188,11 @@ static void scsi_read_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
     }
 
     len = r->io_header.dxfer_len - r->io_header.resid;
@@ -195,7 +201,7 @@ static void scsi_read_complete(void * opaque, int ret)
     r->len = -1;
     if (len == 0) {
         scsi_command_complete_noio(r, 0);
-        return;
+        goto done;
     }
 
     /* Snoop READ CAPACITY output to set the blocksize.  */
@@ -226,6 +232,9 @@ static void scsi_read_complete(void * opaque, int ret)
     }
     scsi_req_data(&r->req, len);
     scsi_req_unref(&r->req);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -261,9 +270,11 @@ static void scsi_write_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
     }
 
     if (r->req.cmd.buf[0] == MODE_SELECT && r->req.cmd.buf[4] == 12 &&
@@ -273,6 +284,9 @@ static void scsi_write_complete(void * opaque, int ret)
     }
 
     scsi_command_complete_noio(r, ret);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Write data to a scsi device.  Returns nonzero on failure.
diff --git a/thread-pool.c b/thread-pool.c
index bffd823..e923544 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -185,7 +185,9 @@ restart:
              */
             qemu_bh_schedule(pool->completion_bh);
 
+            aio_context_release(pool->ctx);
             elem->common.cb(elem->common.opaque, elem->ret);
+            aio_context_acquire(pool->ctx);
             qemu_aio_unref(elem);
             goto restart;
         } else {
@@ -261,21 +263,29 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
 
 typedef struct ThreadPoolCo {
     Coroutine *co;
+    AioContext *ctx;
     int ret;
 } ThreadPoolCo;
 
 static void thread_pool_co_cb(void *opaque, int ret)
 {
     ThreadPoolCo *co = opaque;
+    AioContext *ctx = co->ctx;
 
     co->ret = ret;
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(co->co, NULL);
+    aio_context_release(ctx);
 }
 
 int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
                                        void *arg)
 {
-    ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };
+    ThreadPoolCo tpc = {
+        .co = qemu_coroutine_self(),
+        .ctx = pool->ctx,
+        .ret = -EINPROGRESS
+    };
     assert(qemu_in_coroutine());
     thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
     qemu_coroutine_yield();
-- 
1.8.3.1

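The pattern these hunks establish, as a minimal sketch: an AIO callback now
runs in an IOThread with no lock held, so it brackets its completion work
with the AioContext lock of the BlockBackend it completes against.  MyReq,
req->blk and my_request_complete are hypothetical names, not part of the
patch; the QEMU block-backend and accounting APIs are assumed:

    /* Sketch only: take the AioContext lock, do accounting and
     * completion work, then drop the lock. */
    static void my_aio_complete(void *opaque, int ret)
    {
        MyReq *req = opaque;
        AioContext *ctx = blk_get_aio_context(req->blk);

        aio_context_acquire(ctx);
        if (ret < 0) {
            block_acct_failed(blk_get_stats(req->blk), &req->acct);
        } else {
            block_acct_done(blk_get_stats(req->blk), &req->acct);
        }
        my_request_complete(req);   /* hypothetical: complete and free req */
        aio_context_release(ctx);
    }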

* [Qemu-devel] [PATCH 36/40] aio: update locking documentation
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (34 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/multiple-iothreads.txt | 65 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 4197f62..729c64b 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -47,8 +47,11 @@ How to program for IOThreads
 The main difference between legacy code and new code that can run in an
 IOThread is dealing explicitly with the event loop object, AioContext
 (see include/block/aio.h).  Code that only works in the main loop
-implicitly uses the main loop's AioContext.  Code that supports running
-in IOThreads must be aware of its AioContext.
+implicitly uses the main loop's AioContext, and does not need to handle
+locking because it implicitly uses the QEMU global mutex.
+
+Code that supports running in IOThreads, instead, must be aware of its
+AioContext and of synchronization.
 
 AioContext supports the following services:
  * File descriptor monitoring (read/write/error on POSIX hosts)
@@ -79,6 +82,11 @@ iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
 Code that takes an AioContext argument works both in IOThreads or the main
 loop, depending on which AioContext instance the caller passes in.
 
+When running in an IOThread, handlers for bottom halves, file descriptors,
+and timers have to handle their own synchronization.  Block layer AIO
+callbacks also have to handle synchronization.  How to do this is the
+topic of the next sections.
+
 How to synchronize with an IOThread
 -----------------------------------
 AioContext is not thread-safe so some rules must be followed when using file
@@ -87,13 +95,15 @@ descriptors, event notifiers, timers, or BHs across threads:
 1. AioContext functions can always be called safely.  They handle their
 own locking internally.
 
-2. Other threads wishing to access the AioContext must use
-aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
-context is acquired no other thread can access it or run event loop iterations
-in this AioContext.
+2. Other threads wishing to access block devices on an IOThread's AioContext
+must use a mutex; the block layer does not attempt to provide mutual
+exclusion of any kind.  aio_context_acquire() and aio_context_release()
+are the typical choice; aio_context_acquire()/aio_context_release() calls
+may be nested.
 
-aio_context_acquire()/aio_context_release() calls may be nested.  This
-means you can call them if you're not sure whether #1 applies.
+#2 typically includes bottom half handlers, file descriptor handlers,
+timer handlers and AIO callbacks.  All of these typically acquire and
+release the AioContext.
 
 There is currently no lock ordering rule if a thread needs to acquire multiple
 AioContexts simultaneously.  Therefore, it is only safe for code holding the
@@ -130,9 +140,48 @@ aio_disable_clients() before calling bdrv_drain().  You can then reenable
 guest requests with aio_enable_clients() before finally releasing the
 AioContext and completing the monitor command.
 
+AioContext and coroutines
+-------------------------
 Long-running jobs (usually in the form of coroutines) are best scheduled in the
 BlockDriverState's AioContext to avoid the need to acquire/release around each
 bdrv_*() call.  Be aware that there is currently no mechanism to get notified
 when bdrv_set_aio_context() moves this BlockDriverState to a different
 AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
 may need to add this if you want to support long-running jobs.
+
+Usually, the coroutine is created and first entered in the main loop.  After
+some time it will yield and, when entered again, it will run in the
+IOThread.  In general, however, the coroutine never has to worry about
+releasing and acquiring the AioContext.  The release will happen in the
+function that entered the coroutine; the acquire will happen (e.g. in
+a bottom half or AIO callback) before the coroutine is entered.  For
+example:
+
+1) the coroutine yields because it waits on I/O (e.g. during a call to
+bdrv_co_readv) or sleeps for a determinate period of time (e.g. with
+co_aio_sleep_ns).  In this case, functions such as bdrv_co_io_em_complete
+or co_sleep_cb take the AioContext lock before re-entering.
+
+2) the coroutine explicitly yields.  In this case, the code implementing
+the long-running job will have a bottom half, AIO callback or similar that
+should acquire the AioContext before re-entering the coroutine.
+
+3) the coroutine yields through a CoQueue (directly or indirectly, e.g.
+through a CoMutex or CoRwLock).  This also "just works", but the mechanism
+is more subtle.  Suppose coroutine C1 has just released a mutex and C2
+was waiting on it.  C2 is woken up at C1's next yield point, or just
+before C1 terminates; C2 then runs *before qemu_coroutine_enter(C1)
+returns to its caller*.  Assuming that C1 was entered with a sequence like
+
+    aio_context_acquire(ctx);
+    qemu_coroutine_enter(co);     // enters C1
+    aio_context_release(ctx);
+
+... then C2 will also be protected by the AioContext lock.
+
+
+Bullet 3 provides the following rule for using coroutine mutual exclusion
+in multiple threads:
+
+    *** If two coroutines can exclude each other through a CoMutex or
+    *** CoRwLock or CoQueue, they should all run under the same AioContext.
-- 
1.8.3.1

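A minimal sketch of case 2 above, assuming the usual QEMU coroutine and
AioContext APIs; MyJob and its fields are hypothetical names:

    /* Sketch only: a bottom half that re-enters a long-running job's
     * coroutine takes the job's AioContext lock first, so the coroutine
     * itself never has to acquire or release it. */
    static void my_job_bh(void *opaque)
    {
        MyJob *job = opaque;

        aio_context_acquire(job->ctx);
        qemu_coroutine_enter(job->co, NULL);  /* two-argument form of this era */
        aio_context_release(job->ctx);
    }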

* [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (35 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Avoid entering the slow path of qemu_lockcnt_dec_and_lock if
no bottom half has to be deleted.  If a bottom half deletes itself,
it will be picked up on the next visit of the list, or when the
AioContext itself is finalized.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/async.c b/async.c
index 4c1f658..529934c 100644
--- a/async.c
+++ b/async.c
@@ -69,19 +69,24 @@ int aio_bh_poll(AioContext *ctx)
 {
     QEMUBH *bh, **bhp, *next;
     int ret;
+    bool deleted = false;
 
     qemu_lockcnt_inc(&ctx->list_lock);
 
     ret = 0;
     for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
         next = atomic_rcu_read(&bh->next);
+        if (bh->deleted) {
+            deleted = true;
+            continue;
+        }
         /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
          * implicit memory barrier ensures that the callback sees all writes
          * done by the scheduling thread.  It also ensures that the scheduling
          * thread sees the zero before bh->cb has run, and thus will call
          * aio_notify again if necessary.
          */
-        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
+        if (atomic_xchg(&bh->scheduled, 0)) {
             /* Idle BHs don't count as progress */
             if (!bh->idle) {
                 ret = 1;
@@ -92,6 +97,11 @@ int aio_bh_poll(AioContext *ctx)
     }
 
     /* remove deleted bhs */
+    if (!deleted) {
+        qemu_lockcnt_dec(&ctx->list_lock);
+        return ret;
+    }
+
     if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
@@ -105,7 +115,6 @@ int aio_bh_poll(AioContext *ctx)
         }
         qemu_lockcnt_unlock(&ctx->list_lock);
     }
-
     return ret;
 }
 
-- 
1.8.3.1

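The fast path this patch adds, reduced to a sketch of the qemu_lockcnt
idiom (the list walk is elided; this mirrors the aio_bh_poll code above):

    /* Sketch only: readers hold just the count; the mutex is taken, via
     * the slow path, only when a deleted node actually has to be
     * unlinked from the list. */
    qemu_lockcnt_inc(&ctx->list_lock);
    /* ... walk the list, setting 'deleted' if any bh->deleted is seen ... */
    if (!deleted) {
        qemu_lockcnt_dec(&ctx->list_lock);            /* fast path, no mutex */
    } else if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* count reached zero and the mutex is held: safe to unlink */
        /* ... remove the deleted bottom halves ... */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }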

* [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (36 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This patch prepares for the removal of unnecessary lockcnt inc/dec pairs.
Extract the dispatching loop for file descriptor handlers into a new
function aio_dispatch_handlers, and then inline aio_dispatch into
aio_poll.

aio_dispatch can now become void.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c         | 41 +++++++++++++++++------------------------
 aio-win32.c         | 11 ++++-------
 include/block/aio.h |  2 +-
 3 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index aabc4ae..b372816 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -305,26 +305,11 @@ bool aio_pending(AioContext *ctx)
     return result;
 }
 
-bool aio_dispatch(AioContext *ctx)
+static bool aio_dispatch_handlers(AioContext *ctx)
 {
     AioHandler *node, *tmp;
     bool progress = false;
 
-    /*
-     * If there are callbacks left that have been queued, we need to call them.
-     * Do not call select in this case, because it is possible that the caller
-     * does not need a complete flush (as is the case for aio_poll loops).
-     */
-    if (aio_bh_poll(ctx)) {
-        progress = true;
-    }
-
-    /*
-     * We have to walk very carefully in case aio_set_fd_handler is
-     * called while we're walking.
-     */
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
         int revents;
 
@@ -356,12 +341,18 @@ bool aio_dispatch(AioContext *ctx)
         }
     }
 
-    qemu_lockcnt_dec(&ctx->list_lock);
+    return progress;
+}
 
-    /* Run our timers */
-    progress |= timerlistgroup_run_timers(&ctx->tlg);
+void aio_dispatch(AioContext *ctx)
+{
+    aio_bh_poll(ctx);
 
-    return progress;
+    qemu_lockcnt_inc(&ctx->list_lock);
+    aio_dispatch_handlers(ctx);
+    qemu_lockcnt_dec(&ctx->list_lock);
+
+    timerlistgroup_run_timers(&ctx->tlg);
 }
 
 /* These thread-local variables are used only in a small part of aio_poll
@@ -472,11 +463,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
     npfd = 0;
     qemu_lockcnt_dec(&ctx->list_lock);
 
-    /* Run dispatch even if there were no readable fds to run timers */
-    if (aio_dispatch(ctx)) {
-        progress = true;
-    }
+    progress |= aio_bh_poll(ctx);
+
+    qemu_lockcnt_inc(&ctx->list_lock);
+    progress |= aio_dispatch_handlers(ctx);
+    qemu_lockcnt_dec(&ctx->list_lock);
 
+    progress |= timerlistgroup_run_timers(&ctx->tlg);
     return progress;
 }
 
diff --git a/aio-win32.c b/aio-win32.c
index 4479d3f..feffdc4 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -286,14 +286,11 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
     return progress;
 }
 
-bool aio_dispatch(AioContext *ctx)
+void aio_dispatch(AioContext *ctx)
 {
-    bool progress;
-
-    progress = aio_bh_poll(ctx);
-    progress |= aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
-    progress |= timerlistgroup_run_timers(&ctx->tlg);
-    return progress;
+    aio_bh_poll(ctx);
+    aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
+    timerlistgroup_run_timers(&ctx->tlg);
 }
 
 bool aio_poll(AioContext *ctx, bool blocking)
diff --git a/include/block/aio.h b/include/block/aio.h
index 21044af..05a41b2 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -277,7 +277,7 @@ bool aio_pending(AioContext *ctx);
  *
  * This is used internally in the implementation of the GSource.
  */
-bool aio_dispatch(AioContext *ctx);
+void aio_dispatch(AioContext *ctx);
 
 /* Progress in completing AIO work to occur.  This can issue new pending
  * aio as a result of executing I/O completion or bh callbacks.
-- 
1.8.3.1


* [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (37 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Pull the increment/decrement pair out of aio_bh_poll and into the
callers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c |  7 ++-----
 aio-win32.c |  9 ++++++---
 async.c     | 12 ++++++------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index b372816..23dcbaa 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -346,9 +346,8 @@ static bool aio_dispatch_handlers(AioContext *ctx)
 
 void aio_dispatch(AioContext *ctx)
 {
-    aio_bh_poll(ctx);
-
     qemu_lockcnt_inc(&ctx->list_lock);
+    aio_bh_poll(ctx);
     aio_dispatch_handlers(ctx);
     qemu_lockcnt_dec(&ctx->list_lock);
 
@@ -461,12 +460,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }
 
     npfd = 0;
-    qemu_lockcnt_dec(&ctx->list_lock);
 
     progress |= aio_bh_poll(ctx);
-
-    qemu_lockcnt_inc(&ctx->list_lock);
     progress |= aio_dispatch_handlers(ctx);
+
     qemu_lockcnt_dec(&ctx->list_lock);
 
     progress |= timerlistgroup_run_timers(&ctx->tlg);
diff --git a/aio-win32.c b/aio-win32.c
index feffdc4..36d39f7 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -231,8 +231,6 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
     AioHandler *node;
     bool progress = false;
 
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
@@ -282,14 +280,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         }
     }
 
-    qemu_lockcnt_dec(&ctx->list_lock);
     return progress;
 }
 
 void aio_dispatch(AioContext *ctx)
 {
+    qemu_lockcnt_inc(&ctx->list_lock);
     aio_bh_poll(ctx);
     aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
+    qemu_lockcnt_dec(&ctx->list_lock);
     timerlistgroup_run_timers(&ctx->tlg);
 }
 
@@ -350,6 +349,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
         if (first) {
             aio_notify_accept(ctx);
+            qemu_lockcnt_inc(&ctx->list_lock);
             progress |= aio_bh_poll(ctx);
             first = false;
         }
@@ -369,6 +369,9 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
+    assert(!first); /* and lockcnt incremented */
+    qemu_lockcnt_dec(&ctx->list_lock);
+
     progress |= timerlistgroup_run_timers(&ctx->tlg);
 
     return progress;
diff --git a/async.c b/async.c
index 529934c..db05243 100644
--- a/async.c
+++ b/async.c
@@ -64,15 +64,16 @@ void aio_bh_call(QEMUBH *bh)
     bh->cb(bh->opaque);
 }
 
-/* Multiple occurrences of aio_bh_poll cannot be called concurrently */
+/* Multiple occurrences of aio_bh_poll cannot be called concurrently.
+ * The count in ctx->list_lock is incremented before the call, and is
+ * not affected by the call.
+ */
 int aio_bh_poll(AioContext *ctx)
 {
     QEMUBH *bh, **bhp, *next;
     int ret;
     bool deleted = false;
 
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     ret = 0;
     for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
         next = atomic_rcu_read(&bh->next);
@@ -98,11 +99,10 @@ int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!deleted) {
-        qemu_lockcnt_dec(&ctx->list_lock);
         return ret;
     }
 
-    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
+    if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -113,7 +113,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_lockcnt_unlock(&ctx->list_lock);
+        qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
     }
     return ret;
 }
-- 
1.8.3.1


* [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (38 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 dma-helpers.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/dma-helpers.c b/dma-helpers.c
index 68f6f07..b3e19ba 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -210,9 +210,25 @@ BlockAIOCB *dma_blk_io(
     dbs->sg_cur_byte = 0;
     dbs->dir = dir;
     dbs->io_func = io_func;
-    dbs->bh = NULL;
     qemu_iovec_init(&dbs->iov, sg->nsg);
-    dma_blk_cb(dbs, 0);
+
+    /* FIXME: dma_blk_io usually is called while the thread has acquired the
+     * AioContext, for consistency with blk_aio_readv.  However, dma_memory_map
+     * may acquire the iothread lock and thus requires being called with _no_ lock.
+     *
+     * The solution would be to call dma_blk_io without the AioContext lock,
+     * but this requires changes in the core block and SCSI layers.  In the
+     * interest of doing things a step at a time, defer to a bottom half
+     * if the iothread lock isn't taken.  The bottom half is called without
+     * the AioContext lock taken.
+     */
+    if (qemu_mutex_iothread_locked()) {
+        dbs->bh = NULL;
+        dma_blk_cb(dbs, 0);
+    } else {
+        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
+        qemu_bh_schedule(dbs->bh);
+    }
     return &dbs->common;
 }
 
-- 
1.8.3.1

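The dispatch logic in dma_blk_io, generalized as a sketch; run_or_defer is
a made-up helper, not part of the patch:

    /* Sketch only: if the BQL is already held, any BQL-taking path
     * inside dma_memory_map is safe and cb can run directly.  Otherwise
     * (IOThread) defer to a bottom half, which runs with no lock held,
     * so dma_memory_map can take the BQL without holding the AioContext
     * lock and the inversion cannot occur.  The real patch stashes the
     * BH in dbs->bh so the callback can qemu_bh_delete() it. */
    static void run_or_defer(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
    {
        if (qemu_mutex_iothread_locked()) {
            cb(opaque);
        } else {
            QEMUBH *bh = aio_bh_new(ctx, cb, opaque);
            qemu_bh_schedule(bh);
        }
    }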

* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (39 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
@ 2015-11-26  9:36 ` Christian Borntraeger
  2015-11-26  9:41   ` Christian Borntraeger
  2015-11-26 10:39   ` Paolo Bonzini
  40 siblings, 2 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-11-26  9:36 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
> This large series is basically all that I would like to get into 2.6.
> It is a combination of several pieces of work on dataplane and
> multithreaded block layer.
> 
> It's also a large part of why I would like someone else to look at
> miscellaneous patches for a while (in case you've missed that).  I
> can foresee that following the reviews is going to be a huge time drain.
> 
> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
> and 5 VCPUs).  The bulk of the improvement actually comes from the first
> 8 patches, but the rest of the series is what prepares for what's next
> to come in QEMU 2.7 and later, such as a multiqueue block layer.
> 
> It's tedious to review, with some pretty large patches (3, 32, 33, 35).

For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
old z196 (I have not yet tested the z13).

your branch is certainly better regarding malloc, but worse in other areas.

your branch:

# Overhead Command Shared Object Symbol
# ........ ............... ....................... .......................................
#
3.99% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.66% qemu-system-s39 qemu-system-s390x [.] address_space_lduw_le
2.51% qemu-system-s39 qemu-system-s390x [.] address_space_map
2.51% qemu-system-s39 qemu-system-s390x [.] phys_page_find
2.24% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
2.18% qemu-system-s39 libc-2.18.so [.] _int_malloc
2.18% qemu-system-s39 qemu-system-s390x [.] address_space_translate_internal
2.03% qemu-system-s39 libc-2.18.so [.] malloc
1.91% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
1.66% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.63% qemu-system-s39 qemu-system-s390x [.] address_space_stw_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_stl_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_translate
1.45% qemu-system-s39 qemu-system-s390x [.] virtqueue_pop
1.33% qemu-system-s39 libc-2.18.so [.] _int_free
1.27% qemu-system-s39 [kernel.vmlinux] [k] system_call
1.00% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_enter
0.94% qemu-system-s39 libc-2.18.so [.] __sigsetjmp
0.94% qemu-system-s39 libc-2.18.so [.] free
0.91% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.88% qemu-system-s39 qemu-system-s390x [.] virtio_blk_handle_request
0.82% qemu-system-s39 qemu-system-s390x [.] object_unref
0.79% qemu-system-s39 qemu-system-s390x [.] vring_desc_read
0.76% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_block
0.73% qemu-system-s39 libglib-2.0.so.0.3800.2 [.] g_free

2.5.0-rc1:


# Overhead Command Shared Object Symbol
# ........ ............... ....................... .......................................
#
5.10% qemu-system-s39 libc-2.18.so [.] _int_malloc
3.30% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.83% qemu-system-s39 qemu-system-s390x [.] memory_region_find_rcu
2.72% qemu-system-s39 qemu-system-s390x [.] vring_pop
2.27% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
2.18% qemu-system-s39 libc-2.18.so [.] _int_free
1.99% qemu-system-s39 libc-2.18.so [.] malloc
1.37% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.37% qemu-system-s39 qemu-system-s390x [.] aio_bh_poll
1.37% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
1.34% qemu-system-s39 libc-2.18.so [.] free
1.29% qemu-system-s39 libc-2.18.so [.] malloc_consolidate
1.18% qemu-system-s39 qemu-system-s390x [.] memory_region_find
1.09% qemu-system-s39 libglib-2.0.so.0.3800.2 [.] g_malloc
1.06% qemu-system-s39 [kernel.vmlinux] [k] system_call
0.95% qemu-system-s39 [kernel.vmlinux] [k] __schedule
0.95% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_enter
0.92% qemu-system-s39 qemu-system-s390x [.] get_desc.isra.11
0.92% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.87% qemu-system-s39 [kernel.vmlinux] [k] enqueue_entity
0.84% qemu-system-s39 qemu-system-s390x [.] vring_push
0.78% qemu-system-s39 [kernel.vmlinux] [k] set_next_entity
0.73% qemu-system-s39 [kernel.vmlinux] [k] kvm_arch_vcpu_ioctl_run
0.73% qemu-system-s39 [vdso] [.] __kernel_clock_gettime
0.73% qemu-system-s39 qemu-system-s390x [.] virtio_blk_handle_request
0.73% qemu-system-s39 qemu-system-s390x [.] vring_map
0.67% qemu-system-s39 [kernel.vmlinux] [k] account_system_time
0.67% qemu-system-s39 [kernel.vmlinux] [k] set_adapter_int
0.67% qemu-system-s39 qemu-system-s390x [.] addrrange_intersection


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
@ 2015-11-26  9:41   ` Christian Borntraeger
  2015-11-26 10:39   ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-11-26  9:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 11/26/2015 10:36 AM, Christian Borntraeger wrote:
> On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
>> This large series is basically all that I would like to get into 2.6.
>> It is a combination of several pieces of work on dataplane and
>> multithreaded block layer.
>>
>> It's also a large part of why I would like someone else to look at
>> miscellaneous patches for a while (in case you've missed that).  I
>> can foresee that following the reviews is going to be a huge time drain.
>>
>> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
>> and 5 VCPUs).  The bulk of the improvement actually comes from the first
>> 8 patches, but the rest of the series is what prepares for what's next
>> to come in QEMU 2.7 and later, such as a multiqueue block layer.
>>
>> It's tedious to review, with some pretty large patches (3, 32, 33, 35).
> On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
>> This large series is basically all that I would like to get into 2.6.
>> It is a combination of several pieces of work on dataplane and
>> multithreaded block layer.
>>
>> It's also a large part of why I would like someone else to look at
>> miscellaneous patches for a while (in case you've missed that).  I
>> can foresee that following the reviews is going to be a huge time drain.
>>
>> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
>> and 5 VCPUs).  The bulk of the improvement actually comes from the first
>> 8 patches, but the rest of the series is what prepares for what's next
>> to come in QEMU 2.7 and later, such as a multiqueue block layer.
> 
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196 (I have not yet tested the z13).
> 
> your branch is certainly better regarding malloc, but worse in other areas.

Using the first 8 patches or so (commit be2f6b163e2b2a604f52a258fd932142c5974ffe
    vring: slim down allocation of VirtQueueElements)
is slightly faster than 2.5.0-rc1, so the regression seems to come from some
of the later patches.


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  2015-11-26  9:41   ` Christian Borntraeger
@ 2015-11-26 10:39   ` Paolo Bonzini
  2015-12-09 20:35     ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-26 10:39 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 26/11/2015 10:36, Christian Borntraeger wrote:
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196 (I have not yet tested the z13).
> 
> your branch is certainly better regarding malloc, but worse in other areas.

Thanks for taking the time to test this!

This is correct, see the cover letter:

"[Patches 14 to 16 remove] the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane. While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series.

Patch 10 alone mostly brings performance in par between the two.
The remaining 7-8% can be recovered by mostly getting rid of tiny
address_space_* operations, keeping the rings always mapped. Note that
the rest of this big series does bring a little performance improvement,
and already makes up for the lost performance."

The profile shows that the culprit is the repeated access
to the virtio ring:

3.99% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.66% qemu-system-s39 qemu-system-s390x [.] address_space_lduw_le
2.51% qemu-system-s39 qemu-system-s390x [.] address_space_map
2.51% qemu-system-s39 qemu-system-s390x [.] phys_page_find
2.24% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
2.18% qemu-system-s39 qemu-system-s390x [.] address_space_translate_internal
1.91% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
1.66% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.63% qemu-system-s39 qemu-system-s390x [.] address_space_stw_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_stl_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_translate
1.45% qemu-system-s39 qemu-system-s390x [.] virtqueue_pop
0.91% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.79% qemu-system-s39 qemu-system-s390x [.] vring_desc_read
0.76% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_block
-----------
28.33%

3.30% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.83% qemu-system-s39 qemu-system-s390x [.] memory_region_find_rcu
2.72% qemu-system-s39 qemu-system-s390x [.] vring_pop
1.37% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.37% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
1.18% qemu-system-s39 qemu-system-s390x [.] memory_region_find
0.92% qemu-system-s39 qemu-system-s390x [.] get_desc.isra.11
0.92% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.84% qemu-system-s39 qemu-system-s390x [.] vring_push
-----------
15.45%

I would really prefer to get rid of vring.c as soon as the infrastructure
makes it possible---even if it's faster. We know what makes virtio.c
slower, and it's simpler to fix virtio.c than to convert all the other
models to vring.c _plus_ make vring.c safe for migration.

Paolo

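What "getting rid of tiny address_space_* operations, keeping the rings
always mapped" could look like, as a rough sketch; the offset and the
variable names are illustrative, not actual code from the branch:

    /* Sketch only: map the avail ring once at setup, then read the
     * index with a direct load instead of an address_space_lduw_le
     * call per access.  Unmapping and error handling are elided. */
    hwaddr len = avail_ring_size;
    void *avail = address_space_map(as, avail_pa, &len, false);

    /* hot path, per request: */
    uint16_t avail_idx = lduw_le_p((uint8_t *)avail + 2);  /* skip flags */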

* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
@ 2015-11-30  2:27   ` Fam Zheng
  2015-11-30  2:33     ` Fam Zheng
  2015-11-30 16:35   ` Greg Kurz
  1 sibling, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  2:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Prepare for moving the allocation to virtqueue_pop.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/9pfs/virtio-9p-device.c |  7 +------
>  hw/9pfs/virtio-9p.c        | 10 +++-------
>  hw/9pfs/virtio-9p.h        |  2 --
>  3 files changed, 4 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index e3abcfa..72a93c2 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      V9fsState *s = VIRTIO_9P(dev);
> -    int i, len;
> +    int len;
>      struct stat stat;
>      FsDriverEntry *fse;
>      V9fsPath path;
> @@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
>                  sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
>  
> -    /* initialize pdu allocator */
> -    QLIST_INIT(&s->free_list);
>      QLIST_INIT(&s->active_list);
> -    for (i = 0; i < (MAX_REQ - 1); i++) {
> -        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
> -    }
>  
>      s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  
> diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
> index f972731..747306b 100644
> --- a/hw/9pfs/virtio-9p.c
> +++ b/hw/9pfs/virtio-9p.c
> @@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
>  
>  static V9fsPDU *alloc_pdu(V9fsState *s)
>  {
> -    V9fsPDU *pdu = NULL;
> +    V9fsPDU *pdu = g_new(V9fsPDU, 1);
>  
> -    if (!QLIST_EMPTY(&s->free_list)) {
> -        pdu = QLIST_FIRST(&s->free_list);
> -        QLIST_REMOVE(pdu, next);
> -        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
> -    }
> +    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
>      return pdu;
>  }

OK, and I think handle_9p_output no longer needs to check the returned pointer.

Fam

>  
> @@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
>           */
>          if (!pdu->cancelled) {
>              QLIST_REMOVE(pdu, next);
> -            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
>          }
> +        g_free(pdu);
>      }
>  }
>  
> diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
> index d7a4dc1..1fb4ff9 100644
> --- a/hw/9pfs/virtio-9p.h
> +++ b/hw/9pfs/virtio-9p.h
> @@ -201,8 +201,6 @@ typedef struct V9fsState
>  {
>      VirtIODevice parent_obj;
>      VirtQueue *vq;
> -    V9fsPDU pdus[MAX_REQ];
> -    QLIST_HEAD(, V9fsPDU) free_list;
>      QLIST_HEAD(, V9fsPDU) active_list;
>      V9fsFidState *fid_list;
>      FileOperations *ops;
> -- 
> 1.8.3.1
> 
> 
> 


* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-30  2:27   ` Fam Zheng
@ 2015-11-30  2:33     ` Fam Zheng
  0 siblings, 0 replies; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  2:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Mon, 11/30 10:27, Fam Zheng wrote:
> On Tue, 11/24 19:00, Paolo Bonzini wrote:
> OK, and I think handle_9p_output no longer needs to check the returned pointer.

Yes it's removed in patch 3, good.

Fam


* Re: [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2015-11-30  3:00   ` Fam Zheng
  0 siblings, 0 replies; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  3:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> @@ -436,10 +454,11 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
>              buf = g_malloc(cur_len);
>              len = cur_len;
>          }
> -        iov_to_buf(elem.out_sg, elem.out_num, 0, buf, cur_len);
> +        iov_to_buf(elem->out_sg, elem->out_num, 0, buf, cur_len);
>  
>          handle_control_message(vser, buf, cur_len);
> -        virtqueue_push(vq, &elem, 0);
> +        virtqueue_push(vq, elem, 0);
> +	g_free(elem);

s/^\t/        / ?


* Re: [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
@ 2015-11-30  3:24   ` Fam Zheng
  2015-11-30  8:36     ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  3:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Build the addresses and s/g lists on the stack, and then copy them
> to a VirtQueueElement that is just as big as required to contain this
> particular s/g list.  The cost of the copy is minimal compared to that
> of a large malloc.
> 
> When virtqueue_map is used on the destination side of migration or on
> loadvm, the iovecs have already been split at memory region boundary,
> so we can just reuse the out_num/in_num we find in the file.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/virtio/virtio.c | 82 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 51 insertions(+), 31 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 32c89eb..0163d0f 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -448,6 +448,32 @@ int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
>      return in_bytes <= in_total && out_bytes <= out_total;
>  }
>  
> +static void virtqueue_map_desc(unsigned int *p_num_sg, hwaddr *addr, struct iovec *iov,
> +                               unsigned int max_num_sg, bool is_write,
> +                               hwaddr pa, size_t sz)
> +{
> +    unsigned num_sg = *p_num_sg;
> +    assert(num_sg <= max_num_sg);
> +
> +    while (sz) {
> +        hwaddr len = sz;
> +
> +        if (num_sg == max_num_sg) {
> +            error_report("virtio: too many write descriptors in indirect table");
> +            exit(1);
> +        }
> +
> +        iov[num_sg].iov_base = cpu_physical_memory_map(pa, &len, is_write);
> +        iov[num_sg].iov_len = len;
> +        addr[num_sg] = pa;
> +
> +        sz -= len;
> +        pa += len;
> +        num_sg++;
> +    }
> +    *p_num_sg = num_sg;
> +}
> +
>  static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
>                                  unsigned int *num_sg, unsigned int max_size,
>                                  int is_write)
> @@ -474,20 +500,10 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
>              error_report("virtio: error trying to map MMIO memory");
>              exit(1);
>          }
> -        if (len == sg[i].iov_len) {
> -            continue;
> -        }
> -        if (*num_sg >= max_size) {
> -            error_report("virtio: memory split makes iovec too large");
> +        if (len != sg[i].iov_len) {
> +            error_report("virtio: unexpected memory split");
>              exit(1);
>          }
> -        memmove(sg + i + 1, sg + i, sizeof(*sg) * (*num_sg - i));
> -        memmove(addr + i + 1, addr + i, sizeof(*addr) * (*num_sg - i));
> -        assert(len < sg[i + 1].iov_len);
> -        sg[i].iov_len = len;
> -        addr[i + 1] += len;
> -        sg[i + 1].iov_len -= len;
> -        ++*num_sg;
>      }
>  }
>  
> @@ -525,14 +541,16 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      hwaddr desc_pa = vq->vring.desc;
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem;
> +    unsigned out_num, in_num;
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
>  
>      if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
>          return NULL;
>      }
>  
>      /* When we start there are none of either input nor output. */
> -    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
> -    elem->out_num = elem->in_num = 0;
> +    out_num = in_num = 0;
>  
>      max = vq->vring.num;
>  
> @@ -555,37 +573,39 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>  
>      /* Collect all the descriptors */
>      do {
> -        struct iovec *sg;
> +        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
> +        size_t len = vring_desc_len(vdev, desc_pa, i);
>  
>          if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
> -            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
> -                error_report("Too many write descriptors in indirect table");
> -                exit(1);
> -            }
> -            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
> -            sg = &elem->in_sg[elem->in_num++];
> +            virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
> +                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
>          } else {
> -            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
> -                error_report("Too many read descriptors in indirect table");
> +            if (in_num) {
> +                error_report("Incorrect order for descriptors");
>                  exit(1);
>              }
> -            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
> -            sg = &elem->out_sg[elem->out_num++];
> +            virtqueue_map_desc(&out_num, addr, iov,
> +                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
>          }
>  
> -        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
> -
>          /* If we've got too many, that implies a descriptor loop. */
> -        if ((elem->in_num + elem->out_num) > max) {
> +        if ((in_num + out_num) > max) {
>              error_report("Looped descriptor");
>              exit(1);
>          }
>      } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
>  
> -    /* Now map what we have collected */
> -    virtqueue_map(elem);
> -
> +    /* Now copy what we have collected and mapped */
> +    elem = virtqueue_alloc_element(sz, out_num, in_num);
>      elem->index = head;
> +    for (i = 0; i < out_num; i++) {
> +        elem->out_addr[i] = addr[i];
> +        elem->out_sg[i] = iov[i];
> +    }

Isn't memcpy more efficient here? Otherwise looks good.

Fam

> +    for (i = 0; i < in_num; i++) {
> +        elem->in_addr[i] = addr[out_num + i];
> +        elem->in_sg[i] = iov[out_num + i];
> +    }
>  


* Re: [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-30  3:24   ` Fam Zheng
@ 2015-11-30  8:36     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-30  8:36 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, qemu-block



On 30/11/2015 04:24, Fam Zheng wrote:
> > +    for (i = 0; i < out_num; i++) {
> > +        elem->out_addr[i] = addr[i];
> > +        elem->out_sg[i] = iov[i];
> > +    }
>
> Isn't memcpy more efficient here? Otherwise looks good.

Probably not, out_num/in_num is usually very small, in fact one of them
is often 0 or 1.

For example the memcpy in address_space_rw is awfully inefficient.  This
is roughly the same.

Paolo

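A sketch of why the open-coded loop is reasonable here: with a count that
is almost always 0, 1 or 2, the compiler can turn the loop into a couple
of moves, while memcpy with a runtime length generally stays an opaque
library call.  This is the loop from the patch, annotated:

    /* Sketch only; equivalent to the copy in virtqueue_pop above. */
    for (i = 0; i < out_num; i++) {      /* out_num is typically 0..2 */
        elem->out_addr[i] = addr[i];
        elem->out_sg[i] = iov[i];
    }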

* Re: [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
@ 2015-11-30  9:47   ` Fam Zheng
  2015-11-30 10:37     ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  9:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/virtio/virtio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fd63206..f5f8108 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -578,14 +578,105 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>  void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
>  {
>      VirtQueueElement *elem = g_malloc(sz);
> -    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
> +    bool swap;
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    uint64_t scratch;
> +    int i;
> +
> +    qemu_get_be32s(f, &elem->index);
> +    qemu_get_be32s(f, &elem->out_num);
> +    qemu_get_be32s(f, &elem->in_num);
> +
> +    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);

This is interesting: out_num and in_num are 32-bit numbers, but their max values
are both VIRTQUEUE_MAX_SIZE (thanks for explaining this on IRC), so a nonzero
high half is a clue that the source wrote the stream with a different endianness.

Probably worth a few comments here?

It's a great patch!  Thanks!

Fam

> +    if (swap) {
> +        bswap32s(&elem->index);
> +        bswap32s(&elem->out_num);
> +        bswap32s(&elem->in_num);
> +    }
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_get_be64s(f, &elem->in_addr[i]);
> +        if (swap) {
> +            bswap64s(&elem->in_addr[i]);
> +        }
> +    }
> +    if (i < ARRAY_SIZE(addr)) {
> +        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +    }
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_get_be64s(f, &elem->out_addr[i]);
> +        if (swap) {
> +            bswap64s(&elem->out_addr[i]);
> +        }
> +    }
> +    if (i < ARRAY_SIZE(addr)) {
> +        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +    }
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        (void) qemu_get_be64(f); /* base */
> +	qemu_get_be64s(f, &scratch); /* length */
> +        if (swap) {
> +            bswap64s(&scratch);
> +        }
> +	elem->in_sg[i].iov_len = scratch;
> +    }
> +    if (i < ARRAY_SIZE(iov)) {
> +        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +    }
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        (void) qemu_get_be64(f); /* base */
> +        qemu_get_be64s(f, &scratch); /* length */
> +        if (swap) {
> +            bswap64s(&scratch);
> +        }
> +	elem->out_sg[i].iov_len = scratch;
> +    }
> +    if (i < ARRAY_SIZE(iov)) {
> +        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +    }
> +
>      virtqueue_map(elem);
>      return elem;
>  }
>  
>  void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
>  {
> -    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    int i;
> +
> +    memset(addr, 0, sizeof(addr));
> +    memset(iov, 0, sizeof(iov));
> +
> +    qemu_put_be32s(f, &elem->index);
> +    qemu_put_be32s(f, &elem->out_num);
> +    qemu_put_be32s(f, &elem->in_num);
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_put_be64s(f, &elem->in_addr[i]);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_put_be64s(f, &elem->out_addr[i]);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_put_be64(f, 0);
> +        qemu_put_be64(f, elem->in_sg[i].iov_len);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_put_be64(f, 0);
> +        qemu_put_be64(f, elem->out_sg[i].iov_len);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
>  }
>  
>  /* virtio device */
> -- 
> 1.8.3.1
> 
> 
> 

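The heuristic Fam is referring to, as a worked example (a sketch, not code
from the patch):

    /* out_num and in_num never exceed VIRTQUEUE_MAX_SIZE (1024), so a
     * stream written with the expected endianness always has the top 16
     * bits clear.  A byte-swapped small value does not: 0x00000002 reads
     * back as 0x02000000, which trips the check and triggers bswap. */
    uint32_t good = 2;              /* 0x00000002 */
    uint32_t bad  = bswap32(good);  /* 0x02000000 */
    assert(!(good & 0xFFFF0000));
    assert(  bad  & 0xFFFF0000);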

* Re: [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-30  9:47   ` Fam Zheng
@ 2015-11-30 10:37     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-30 10:37 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, qemu-block



On 30/11/2015 10:47, Fam Zheng wrote:
>> > +    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
> This is interesting, out_num and in_num are 32 bit numbers but there max values
> are both VIRTQUEUE_MAX_SIZE (thanks for explaining this on IRC), so it can be a
> clue for the source using a different endianness.

Yes, good idea.

Paolo


* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
  2015-11-30  2:27   ` Fam Zheng
@ 2015-11-30 16:35   ` Greg Kurz
  1 sibling, 0 replies; 56+ messages in thread
From: Greg Kurz @ 2015-11-30 16:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 24 Nov 2015 19:00:52 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Prepare for moving the allocation to virtqueue_pop.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

And aside from this series' goal, it makes the code nicer :)

>  hw/9pfs/virtio-9p-device.c |  7 +------
>  hw/9pfs/virtio-9p.c        | 10 +++-------
>  hw/9pfs/virtio-9p.h        |  2 --
>  3 files changed, 4 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index e3abcfa..72a93c2 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      V9fsState *s = VIRTIO_9P(dev);
> -    int i, len;
> +    int len;
>      struct stat stat;
>      FsDriverEntry *fse;
>      V9fsPath path;
> @@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
>                  sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
> 
> -    /* initialize pdu allocator */
> -    QLIST_INIT(&s->free_list);
>      QLIST_INIT(&s->active_list);
> -    for (i = 0; i < (MAX_REQ - 1); i++) {
> -        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
> -    }
> 
>      s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> 
> diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
> index f972731..747306b 100644
> --- a/hw/9pfs/virtio-9p.c
> +++ b/hw/9pfs/virtio-9p.c
> @@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
> 
>  static V9fsPDU *alloc_pdu(V9fsState *s)
>  {
> -    V9fsPDU *pdu = NULL;
> +    V9fsPDU *pdu = g_new(V9fsPDU, 1);
> 
> -    if (!QLIST_EMPTY(&s->free_list)) {
> -        pdu = QLIST_FIRST(&s->free_list);
> -        QLIST_REMOVE(pdu, next);
> -        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
> -    }
> +    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
>      return pdu;
>  }
> 
> @@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
>           */
>          if (!pdu->cancelled) {
>              QLIST_REMOVE(pdu, next);
> -            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
>          }
> +        g_free(pdu);
>      }
>  }
> 
> diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
> index d7a4dc1..1fb4ff9 100644
> --- a/hw/9pfs/virtio-9p.h
> +++ b/hw/9pfs/virtio-9p.h
> @@ -201,8 +201,6 @@ typedef struct V9fsState
>  {
>      VirtIODevice parent_obj;
>      VirtQueue *vq;
> -    V9fsPDU pdus[MAX_REQ];
> -    QLIST_HEAD(, V9fsPDU) free_list;
>      QLIST_HEAD(, V9fsPDU) active_list;
>      V9fsFidState *fid_list;
>      FileOperations *ops;


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26 10:39   ` Paolo Bonzini
@ 2015-12-09 20:35     ` Paolo Bonzini
  2015-12-16 12:54       ` Christian Borntraeger
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-12-09 20:35 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 26/11/2015 11:39, Paolo Bonzini wrote:
> I would really prefer to get rid of vring.c as soon as the infrastructure
> makes it possible---even if it's faster. We know what makes virtio.c
> slower, and it's simpler to fix virtio.c than to convert all the other
> models to vring.c _plus_ make vring.c safe for migration.

I've now pushed some optimizations of exec.c to the same place (branch
dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
of an address_space_read is constant, and the target ends up being RAM,
the compiler can inline address_space_read in the caller and in
particular eliminate the memcpy.
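
As a toy model of the fast path (all names made up, not the actual
code; 'guest_ram' stands in for the RAMBlock backing the address
space):

	#include <stdint.h>
	#include <string.h>

	extern uint8_t guest_ram[];
	int slow_read(uint64_t addr, void *buf, size_t len);  /* MMIO etc. */

	static inline int fast_read(uint64_t addr, void *buf, size_t len)
	{
	    /* With a compile-time constant 'len' the compiler folds this
	     * branch away and turns the memcpy into a single load in the
	     * caller, which is the effect described above. */
	    if (__builtin_constant_p(len) && len <= 8) {
	        memcpy(buf, guest_ram + addr, len);
	        return 0;
	    }
	    return slow_read(addr, buf, len);
	}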

Paolo


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-09 20:35     ` Paolo Bonzini
@ 2015-12-16 12:54       ` Christian Borntraeger
  2015-12-16 14:40         ` Christian Borntraeger
  2015-12-16 17:42         ` Paolo Bonzini
  0 siblings, 2 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-12-16 12:54 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 12/09/2015 09:35 PM, Paolo Bonzini wrote:
> 
> 
> On 26/11/2015 11:39, Paolo Bonzini wrote:
>> I would really prefer to get rid of vring.c as soon as the infrastructure
>> makes it possible---even if it's faster. We know what makes virtio.c
>> slower, and it's simpler to fix virtio.c than to convert all the other
>> models to vring.c _plus_ make vring.c safe for migration.
> 
> I've now pushed some optimizations of exec.c to the same place (branch
> dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
> of an address_space_read is constant, and the target ends up being RAM,
> the compiler can inline address_space_read in the caller and in
> particular eliminate the memcpy.

Just some quick remarks before I leave for vacation:

Performance seems to be better than with the initial version.  I see
occasional hangs when shutting down (also with your previous
version):

(gdb) thread apply all bt

Thread 10 (Thread 0x3ff95b7f910 (LWP 9700)):
#0  0x000003ff9707cf56 in syscall () from /lib64/libc.so.6
#1  0x0000000010201586 in futex_wait (val=<optimized out>, f=<optimized out>) at /home/cborntra/REPOS/qemu/include/qemu/futex.h:26
#2  qemu_event_wait (ev=ev@entry=0x107f2cb4 <rcu_call_ready_event>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:398
#3  0x0000000010214fe2 in call_rcu_thread (opaque=<optimized out>) at /home/cborntra/REPOS/qemu/util/rcu.c:254
#4  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 9 (Thread 0x3ff9537f910 (LWP 9701)):
#0  0x000003ff9718f5fa in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9718a582 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010201062 in qemu_mutex_lock (mutex=<optimized out>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:69
#3  0x0000000010184ffa in aio_context_acquire (ctx=<optimized out>) at /home/cborntra/REPOS/qemu/async.c:361
#4  0x0000000010185322 in thread_pool_completion_bh (opaque=0x3ff90000ca0) at /home/cborntra/REPOS/qemu/thread-pool.c:168
#5  0x0000000010184b0c in aio_bh_call (bh=<optimized out>) at /home/cborntra/REPOS/qemu/async.c:64
#6  aio_bh_poll (ctx=ctx@entry=0x480e00d0) at /home/cborntra/REPOS/qemu/async.c:96
#7  0x0000000010191ec2 in aio_poll (ctx=0x480e00d0, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:464
#8  0x00000000100c4bfe in iothread_run (opaque=0x480dfc00) at /home/cborntra/REPOS/qemu/iothread.c:42
#9  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#10 0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 8 (Thread 0x3ff94b7f910 (LWP 9702)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e0890, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e05c0) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 7 (Thread 0x3ff8ffff910 (LWP 9703)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e1800, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e0d60) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 6 (Thread 0x3ff8f7ff910 (LWP 9704)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e1f60, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e1cb0) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 5 (Thread 0x3ff8e2f9910 (LWP 9707)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484ae660) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 4 (Thread 0x3ff8daf9910 (LWP 9708)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484c0900) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 3 (Thread 0x3ff8d2f9910 (LWP 9709)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484d2ba0) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x3ff8ca7f910 (LWP 9710)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484e4e40) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x3ff98bdcbb0 (LWP 9690)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=ctx@entry=0x480e00d0, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000101c8ff4 in bdrv_flush (bs=bs@entry=0x480fa160) at /home/cborntra/REPOS/qemu/block/io.c:2426
#5  0x000000001018bba6 in bdrv_close (bs=bs@entry=0x480fa160) at /home/cborntra/REPOS/qemu/block.c:1914
#6  0x000000001018c0f6 in bdrv_close_all () at /home/cborntra/REPOS/qemu/block.c:1974
#7  0x0000000010014102 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4687
(gdb) 


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-16 12:54       ` Christian Borntraeger
@ 2015-12-16 14:40         ` Christian Borntraeger
  2015-12-16 17:42         ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-12-16 14:40 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 12/16/2015 01:54 PM, Christian Borntraeger wrote:
> On 12/09/2015 09:35 PM, Paolo Bonzini wrote:
>>
>>
>> On 26/11/2015 11:39, Paolo Bonzini wrote:
>>> I would really prefer to get rid of vring.c as soon as the infrastructure
>>> makes it possible---even if it's faster. We know what makes virtio.c
>>> slower, and it's simpler to fix virtio.c than to convert all the other
>>> models to vring.c _plus_ make vring.c safe for migration.
>>
>> I've now pushed some optimizations of exec.c to the same place (branch
>> dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
>> of an address_space_read is constant, and the target ends up being RAM,
>> the compiler can inline address_space_read in the caller and in
>> particular eliminate the memcpy.
> 
> Just some quick remarks before I leave for vacation:
> 
> Performance seems to be better than with the initial version.

In fact, it seems to be faster than 2.5-rc4, at least with the single
test case on null block devices that I ran.


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-16 12:54       ` Christian Borntraeger
  2015-12-16 14:40         ` Christian Borntraeger
@ 2015-12-16 17:42         ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-12-16 17:42 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 16/12/2015 13:54, Christian Borntraeger wrote:
> Just some quick remarks before I leave for vacation:
> 
> Performance seems to be better than with the initial version.  I see
> occasional hangs when shutting down (also with your previous
> version):

Yes, I've seen them too.  To reproduce, just start some "dd" and
repeatedly stop/cont.  It typically hangs after 3-4 tries.

It's like this:

	I/O thread			main thread
	-----------------------------------------------------
					aio_context_acquire
					bdrv_drain
	bh->scheduled = 0
	call bottom half
	   aio_context_acquire
	   <hang>


The bottom half is hung in the I/O thread, and will never be processed
by the main thread because bh->scheduled = 0.
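
(Simplified sketch of the dispatch logic, not the exact aio_bh_poll
code: whichever thread clears 'scheduled' first runs the callback, so
the main thread finds nothing left to dispatch.)

	if (bh->scheduled) {
	    bh->scheduled = 0;      /* the I/O thread got here first... */
	    bh->cb(bh->opaque);     /* ...and now blocks on the AioContext */
	}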

My current idea is that *all* callers of aio_poll have to call it
without the AioContext acquired, including bdrv_drain.  It's not a
problem now that Fam has fixed almost all bdrv_drain users to use
bdrv_drained_begin/end instead.

The nastiness here is two-fold:

1) doing

	aio_context_release
	aio_poll
	aio_context_acquire

in bdrv_drain is not very safe because the mutex is recursive.  (Have I
ever mentioned I don't like recursive mutexes?...)

2) there is still a race in bdrv_drain

                    bdrv_flush_io_queue(bs);
                    if (bdrv_requests_pending(bs)) {
                        busy = true;
                        aio_context_release(aio_context);
                        aio_poll(aio_context, busy);
                        aio_context_acquire(aio_context);

if we release before aio_poll, the last request can complete before
aio_poll is invoked, and aio_poll will then block forever.  In Linux
terms, bdrv_requests_pending would be prepare_to_wait() and aio_poll
would be schedule().  But we don't have prepare_to_wait().  I have a
hackish solution here, but I don't really like it.
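
With a prepare_to_wait()-style primitive it would look something like
this (entirely hypothetical API, neither helper exists today):

	for (;;) {
	    aio_prepare_to_wait(aio_context);      /* hypothetical: register
	                                            * interest *before* the
	                                            * re-check, so a completion
	                                            * that slips in between
	                                            * still wakes the poller */
	    if (!bdrv_requests_pending(bs)) {
	        aio_finish_wait(aio_context);      /* hypothetical */
	        break;
	    }
	    aio_context_release(aio_context);
	    aio_poll(aio_context, true);           /* the "schedule()" */
	    aio_context_acquire(aio_context);
	}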

Paolo


end of thread

Thread overview: 56+ messages
2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
2015-11-30  2:27   ` Fam Zheng
2015-11-30  2:33     ` Fam Zheng
2015-11-30 16:35   ` Greg Kurz
2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
2015-11-30  3:00   ` Fam Zheng
2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
2015-11-30  9:47   ` Fam Zheng
2015-11-30 10:37     ` Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
2015-11-30  3:24   ` Fam Zheng
2015-11-30  8:36     ` Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
2015-11-26  9:41   ` Christian Borntraeger
2015-11-26 10:39   ` Paolo Bonzini
2015-12-09 20:35     ` Paolo Bonzini
2015-12-16 12:54       ` Christian Borntraeger
2015-12-16 14:40         ` Christian Borntraeger
2015-12-16 17:42         ` Paolo Bonzini
