* [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
@ 2015-11-24 18:00 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
                   ` (40 more replies)
  0 siblings, 41 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

This large series is basically all that I would like to get into 2.6.
It is a combination of several pieces of work on dataplane and
multithreaded block layer.

It's also a large part of why I would like someone else to look at
miscellaneous patches for a while (in case you've missed that).  I
can foresee that following the reviews is going to be a huge time drain.

With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
and 5 VCPUs).  The bulk of the improvement actually comes from the first
8 patches, but the rest of the series prepares for what comes next in
QEMU 2.7 and later, such as a multiqueue block layer.

It's tedious to review, with some pretty large patches (3, 32, 33, 35).
That's how you attract reviewers, isn't it?  I would like to get the
first virtio and the first block layer part in very soon after 2.6
development starts.

I've split it into four parts: the first two mostly touch virtio,
while the last two are for the block layer.

Because it's large, I've CCed people only on the cover letter.

This work is available at github.com/bonzini/qemu.git, branch dataplane.

A. "LEAN" VIRTQUEUEELEMENT
--------------------------

Patches 1 to 8 modify VirtQueueElement so that the space for
scatter/gather lists is allocated dynamically rather than being
fixed to 4K.  VirtQueueElement becomes a sort of "superclass", and
the scatter/gather elements are placed in the same malloc block,
which is laid out like

	VirtQueueElement
	other fields ("subclass" fields)
	in_addr[]
	out_addr[]
	in_sg[]
	out_sg[]
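
As a rough sketch (the helper name and the pointer bookkeeping here are
mine, not necessarily what the patches end up with), the combined
allocation could look like:

    static void *alloc_element(size_t sz, unsigned out_num, unsigned in_num)
    {
        size_t addr_len = (in_num + out_num) * sizeof(hwaddr);
        size_t sg_len = (in_num + out_num) * sizeof(struct iovec);
        VirtQueueElement *elem = g_malloc(sz + addr_len + sg_len);

        elem->out_num = out_num;
        elem->in_num = in_num;
        /* assumes in_addr/out_addr/in_sg/out_sg become pointers into
         * the trailing space of the same block (alignment elided) */
        elem->in_addr = (hwaddr *)((char *)elem + sz);
        elem->out_addr = elem->in_addr + in_num;
        elem->in_sg = (struct iovec *)(elem->out_addr + out_num);
        elem->out_sg = elem->in_sg + in_num;
        return elem;    /* the caller casts this to its "subclass" */
    }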

This can provide a large speedup (from 1.3x to 2.3x) with many disks,
due to the 48K-sized VirtQueueElement.  All virtio devices have to
be changed (patch 3).  I chose to do it all in a single patch because
the changes are in any case well isolated within each device.

The main issue here is that VirtQueueElement was haphazardly shoveled
straight into the migration stream (in host endianness). :(  Patch 5
straightens this out, but at the cost of breaking backwards migration
because it now writes the VirtQueueElement in big endian, consistent
with other migration streams.
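
For illustration only (patch 5 defines the actual format; the field
list below is an assumption), saving an element would become roughly:

    static void put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
    {
        unsigned int i;

        qemu_put_be32(f, elem->index);
        qemu_put_be32(f, elem->out_num);
        qemu_put_be32(f, elem->in_num);
        for (i = 0; i < elem->in_num; i++) {
            qemu_put_be64(f, elem->in_addr[i]);
        }
        for (i = 0; i < elem->out_num; i++) {
            qemu_put_be64(f, elem->out_addr[i]);
        }
        /* in_sg/out_sg are rebuilt with virtqueue_map on the destination */
    }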

This is the least tested part of the series.  I nevertheless put it
first because it's the most complicated part to rebase, and I want
to get rid of it as fast as possible.  Reviewing the general approach
is welcome anyway.

	Status: virtio-input, virtio-gpu and migration not tested at all

B. REMOVING VRING.C
-------------------

This is patches 9 to 16.  It removes the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane.  While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series.

Patch 10 alone brings performance roughly on par between the two.
The remaining 7-8% can be recovered mostly by getting rid of tiny
address_space_* operations, keeping the rings always mapped.  Note that
the rest of this big series does bring a little performance improvement,
and already makes up for the lost performance.

This part has a dependency on patches that are not part of this series
(and do not exist yet), which make it possible to write the dirty
bitmap outside the BQL.  The dirty bitmap is not yet thread-safe because,
while it is read and written with atomic operations, it may be resized
when there is a memory hotplug operation.  There are plans to fix this
using RCU.

Nevertheless, this doesn't block part C.

	Status: ready, but depends on the missing dirty bitmap support


C. FINE-GRAINED AIO_POLL CRITICAL SECTIONS
------------------------------------------

This is patches 17 to 28.  It starts pushing aio_context_acquire down
into aio_poll.  This part is more or less independent from A and B,
and it ends with aio_poll calling aio_context_acquire/release around
every callback.

To do this, this part introduces a thread-safe variant of the common
"walking_xxx++/walking_xxx--" idiom already found in several places
in aio*.c and async.c.
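
Roughly, and using names the series only introduces later (so treat
this as a sketch of the idea, not the final interface), the dispatch
loop goes from a bare counter to something like:

    qemu_lockcnt_inc(&ctx->list_lock);     /* keeps nodes from being freed */
    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
        if (!node->deleted && node->io_read) {
            node->io_read(node->opaque);
        }
    }
    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* we were the last walker: deleted nodes can be freed now */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }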

	Status: ready, except that I haven't tested quorum enough

D. FINE-GRAINED BLOCK LAYER CRITICAL SECTIONS
---------------------------------------------

This is patches 29 to 40.  It explicitly acquires the AioContext in all
callbacks that need it (file descriptors, bottom halves, timers, AIO)
rather than in aio_poll.  This is the first step towards breaking
the AioContext lock into many smaller locks, and hence the last
prerequisite for a real multiqueue QEMU block layer.
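
The shape of the change, on a made-up bottom half (the real patches
apply this to many existing callbacks):

    static void my_completion_bh(void *opaque)
    {
        BlockDriverState *bs = opaque;
        AioContext *ctx = bdrv_get_aio_context(bs);

        aio_context_acquire(ctx);
        /* ... touch block-layer state ... */
        aio_context_release(ctx);
    }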

This has the biggest patches and, unlike patch 3, they are very hard
to split further.

At the end, starting with patch 37, a few patches do some small
optimization on aio_poll that is now possible, and the last one makes
virtio-scsi dataplane _almost_ thread-safe.

	Status: ready

If you've read so far and didn't get bored, you're more than qualified
as a reviewer. :)

Paolo

Paolo Bonzini (40):
  9pfs: allocate pdus with g_malloc/g_free
  virtio: move VirtQueueElement at the beginning of the structs
  virtio: move allocation to virtqueue_pop/vring_pop
  virtio: introduce qemu_get/put_virtqueue_element
  virtio: read/write the VirtQueueElement a field at a time
  virtio: introduce virtqueue_alloc_element
  virtio: slim down allocation of VirtQueueElements
  vring: slim down allocation of VirtQueueElements
  vring: make vring_enable_notification return void
  virtio: combine the read of a descriptor
  virtio: add AioContext-specific function for host notifiers
  virtio: export vring_notify as virtio_should_notify
  virtio-blk: fix "disabled data plane" mode
  virtio-blk: do not use vring in dataplane
  virtio-scsi: do not use vring in dataplane
  vring: remove
  iothread: release AioContext around aio_poll
  qemu-thread: introduce QemuRecMutex
  aio: convert from RFifoLock to QemuRecMutex
  aio: rename bh_lock to list_lock
  qemu-thread: introduce QemuLockCnt
  aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh
  qemu-thread: optimize QemuLockCnt with futexes on Linux
  aio: tweak walking in dispatch phase
  aio-posix: remove walking_handlers, protecting AioHandler list with list_lock
  aio-win32: remove walking_handlers, protecting AioHandler list with list_lock
  aio: document locking
  aio: push aio_context_acquire/release down to dispatching
  quorum: use atomics for rewrite_count
  quorum: split quorum_fifo_aio_cb from quorum_aio_cb
  qed: introduce qed_aio_start_io and qed_aio_next_io_cb
  block: explicitly acquire aiocontext in callbacks that need it
  block: explicitly acquire aiocontext in bottom halves that need it
  block: explicitly acquire aiocontext in timers that need it
  block: explicitly acquire aiocontext in aio callbacks that need it
  aio: update locking documentation
  async: optimize aio_bh_poll
  aio-posix: partially inline aio_dispatch into aio_poll
  async: remove unnecessary inc/dec pairs
  dma-helpers: avoid lock inversion with AioContext

 aio-posix.c                                   | 108 +++---
 aio-win32.c                                   | 111 +++---
 async.c                                       |  76 ++--
 block/blkverify.c                             |   6 +-
 block/curl.c                                  |  43 ++-
 block/gluster.c                               |   2 +
 block/io.c                                    |   7 +
 block/iscsi.c                                 |  10 +
 block/linux-aio.c                             |  14 +-
 block/mirror.c                                |  12 +-
 block/nbd-client.c                            |  14 +-
 block/nfs.c                                   |  10 +
 block/qed-cluster.c                           |   2 +
 block/qed-table.c                             |  12 +-
 block/qed.c                                   | 112 ++++--
 block/qed.h                                   |   3 +
 block/quorum.c                                |  60 +--
 block/sheepdog.c                              |  29 +-
 block/ssh.c                                   |  47 ++-
 block/throttle-groups.c                       |   2 +
 block/win32-aio.c                             |   8 +-
 dma-helpers.c                                 |  27 +-
 docs/lockcnt.txt                              | 342 +++++++++++++++++
 docs/multiple-iothreads.txt                   |  95 ++++-
 hw/9pfs/virtio-9p-device.c                    |   7 +-
 hw/9pfs/virtio-9p.c                           |  25 +-
 hw/9pfs/virtio-9p.h                           |   4 +-
 hw/block/dataplane/virtio-blk.c               | 131 +------
 hw/block/dataplane/virtio-blk.h               |   1 +
 hw/block/virtio-blk.c                         |  92 ++---
 hw/char/virtio-serial-bus.c                   |  78 ++--
 hw/display/virtio-gpu.c                       |  25 +-
 hw/input/virtio-input.c                       |  24 +-
 hw/net/virtio-net.c                           |  69 ++--
 hw/scsi/scsi-bus.c                            |   2 +
 hw/scsi/scsi-disk.c                           |  18 +
 hw/scsi/scsi-generic.c                        |  20 +-
 hw/scsi/virtio-scsi-dataplane.c               | 197 ++--------
 hw/scsi/virtio-scsi.c                         |  82 ++--
 hw/virtio/Makefile.objs                       |   1 -
 hw/virtio/dataplane/Makefile.objs             |   1 -
 hw/virtio/dataplane/vring.c                   | 526 --------------------------
 hw/virtio/virtio-balloon.c                    |  22 +-
 hw/virtio/virtio-rng.c                        |  10 +-
 hw/virtio/virtio.c                            | 323 +++++++++++-----
 include/block/aio.h                           |  38 +-
 include/hw/virtio/dataplane/vring-accessors.h |  75 ----
 include/hw/virtio/dataplane/vring.h           |  51 ---
 include/hw/virtio/virtio-balloon.h            |   2 +-
 include/hw/virtio/virtio-blk.h                |   9 +-
 include/hw/virtio/virtio-net.h                |   2 +-
 include/hw/virtio/virtio-scsi.h               |  36 +-
 include/hw/virtio/virtio-serial.h             |   2 +-
 include/hw/virtio/virtio.h                    |  16 +-
 include/qemu/futex.h                          |  36 ++
 include/qemu/rfifolock.h                      |  54 ---
 include/qemu/thread-posix.h                   |   6 +
 include/qemu/thread-win32.h                   |  10 +
 include/qemu/thread.h                         |  23 ++
 iothread.c                                    |  11 +-
 nbd.c                                         |   4 +
 tests/.gitignore                              |   1 -
 tests/Makefile                                |   2 -
 tests/test-aio.c                              |  19 +-
 tests/test-rfifolock.c                        |  91 -----
 thread-pool.c                                 |  14 +-
 trace-events                                  |  13 +-
 util/Makefile.objs                            |   2 +-
 util/lockcnt.c                                | 404 ++++++++++++++++++++
 util/qemu-coroutine-sleep.c                   |   5 +
 util/qemu-thread-posix.c                      |  38 +-
 util/qemu-thread-win32.c                      |  25 ++
 util/rfifolock.c                              |  78 ----
 73 files changed, 2000 insertions(+), 1877 deletions(-)
 create mode 100644 docs/lockcnt.txt
 delete mode 100644 hw/virtio/dataplane/Makefile.objs
 delete mode 100644 hw/virtio/dataplane/vring.c
 delete mode 100644 include/hw/virtio/dataplane/vring-accessors.h
 delete mode 100644 include/hw/virtio/dataplane/vring.h
 create mode 100644 include/qemu/futex.h
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 create mode 100644 util/lockcnt.c
 delete mode 100644 util/rfifolock.c

-- 
1.8.3.1

* [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  2:27   ` Fam Zheng
  2015-11-30 16:35   ` Greg Kurz
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
                   ` (39 subsequent siblings)
  40 siblings, 2 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

Prepare for moving the allocation to virtqueue_pop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p-device.c |  7 +------
 hw/9pfs/virtio-9p.c        | 10 +++-------
 hw/9pfs/virtio-9p.h        |  2 --
 3 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
index e3abcfa..72a93c2 100644
--- a/hw/9pfs/virtio-9p-device.c
+++ b/hw/9pfs/virtio-9p-device.c
@@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     V9fsState *s = VIRTIO_9P(dev);
-    int i, len;
+    int len;
     struct stat stat;
     FsDriverEntry *fse;
     V9fsPath path;
@@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
     virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
                 sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
 
-    /* initialize pdu allocator */
-    QLIST_INIT(&s->free_list);
     QLIST_INIT(&s->active_list);
-    for (i = 0; i < (MAX_REQ - 1); i++) {
-        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
-    }
 
     s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
 
diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index f972731..747306b 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
 
 static V9fsPDU *alloc_pdu(V9fsState *s)
 {
-    V9fsPDU *pdu = NULL;
+    V9fsPDU *pdu = g_new(V9fsPDU, 1);
 
-    if (!QLIST_EMPTY(&s->free_list)) {
-        pdu = QLIST_FIRST(&s->free_list);
-        QLIST_REMOVE(pdu, next);
-        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
-    }
+    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
     return pdu;
 }
 
@@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
          */
         if (!pdu->cancelled) {
             QLIST_REMOVE(pdu, next);
-            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
         }
+        g_free(pdu);
     }
 }
 
diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index d7a4dc1..1fb4ff9 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -201,8 +201,6 @@ typedef struct V9fsState
 {
     VirtIODevice parent_obj;
     VirtQueue *vq;
-    V9fsPDU pdus[MAX_REQ];
-    QLIST_HEAD(, V9fsPDU) free_list;
     QLIST_HEAD(, V9fsPDU) active_list;
     V9fsFidState *fid_list;
     FileOperations *ops;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
                   ` (38 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

The next patch will make virtqueue_pop/vring_pop allocate memory for a
"subclass" of VirtQueueElement.  For this to work, VirtQueueElement
must be the first field in the containing struct.
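
For example (MyReq is a made-up subclass, and the virtqueue_pop
signature shown is the one introduced by the next patch):

    typedef struct MyReq {
        VirtQueueElement elem;    /* must be first */
        int my_state;
    } MyReq;

    MyReq *req = virtqueue_pop(vq, sizeof(MyReq));
    /* valid because &req->elem == (VirtQueueElement *)req */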

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p.h             |  2 +-
 hw/scsi/virtio-scsi.c           |  3 +--
 include/hw/virtio/virtio-blk.h  |  2 +-
 include/hw/virtio/virtio-scsi.h | 13 ++++++-------
 4 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
index 1fb4ff9..2e6d535 100644
--- a/hw/9pfs/virtio-9p.h
+++ b/hw/9pfs/virtio-9p.h
@@ -127,12 +127,12 @@ struct V9fsState;
 
 struct V9fsPDU
 {
+    VirtQueueElement elem;
     uint32_t size;
     uint16_t tag;
     uint8_t id;
     uint8_t cancelled;
     CoQueue complete;
-    VirtQueueElement elem;
     struct V9fsState *s;
     QLIST_ENTRY(V9fsPDU) next;
 };
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 7655401..179c341 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -44,8 +44,7 @@ VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
 {
     VirtIOSCSIReq *req;
     VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, elem)
-                             + sizeof(VirtQueueElement);
+    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
     req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 6bf5905..ef5f90f 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -61,9 +61,9 @@ typedef struct VirtIOBlock {
 } VirtIOBlock;
 
 typedef struct VirtIOBlockReq {
+    VirtQueueElement elem;
     int64_t sector_num;
     VirtIOBlock *dev;
-    VirtQueueElement elem;
     struct virtio_blk_inhdr *in;
     struct virtio_blk_outhdr out;
     QEMUIOVector qiov;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 088fe9f..63f5b51 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -102,18 +102,17 @@ typedef struct VirtIOSCSI {
 } VirtIOSCSI;
 
 typedef struct VirtIOSCSIReq {
+    /* Note:
+     * - fields up to resp_iov are initialized by virtio_scsi_init_req;
+     * - fields starting at vring are zeroed by virtio_scsi_init_req.
+     * */
+    VirtQueueElement elem;
+
     VirtIOSCSI *dev;
     VirtQueue *vq;
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Note:
-     * - fields before elem are initialized by virtio_scsi_init_req;
-     * - elem is uninitialized at the time of allocation.
-     * - fields after elem are zeroed by virtio_scsi_init_req.
-     * */
-
-    VirtQueueElement elem;
     /* Set by dataplane code. */
     VirtIOSCSIVring *vring;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  3:00   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
                   ` (37 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC
  To: qemu-devel, qemu-block

The return code of virtqueue_pop/vring_pop is unused except to check for
errors or 0.  We can thus easily move the allocation inside the functions
and just return a pointer to the VirtQueueElement.

The advantage is that we will be able to allocate only the space that
is needed for the actual size of the s/g list instead of the full
VIRTQUEUE_MAX_SIZE items.  Currently VirtQueueElement takes about 48K
of memory, and this kind of allocation puts a lot of stress on malloc.
By cutting the size by two or three orders of magnitude, malloc can
use much more efficient algorithms.
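
For reference, the arithmetic behind the 48K figure, assuming a 64-bit
host (8-byte hwaddr, 16-byte struct iovec) and VIRTQUEUE_MAX_SIZE of
1024:

    in_addr[] + out_addr[]: 2 * 1024 * 8  bytes = 16K
    in_sg[]   + out_sg[]  : 2 * 1024 * 16 bytes = 32K

A typical request only uses a handful of descriptors, so sizing the
arrays to the actual s/g count shrinks each allocation to a few
hundred bytes.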

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/9pfs/virtio-9p.c                 | 19 ++++-----
 hw/block/dataplane/virtio-blk.c     | 11 +++--
 hw/block/virtio-blk.c               | 15 +++----
 hw/char/virtio-serial-bus.c         | 80 +++++++++++++++++++++++--------------
 hw/display/virtio-gpu.c             | 25 +++++++-----
 hw/input/virtio-input.c             | 24 +++++++----
 hw/net/virtio-net.c                 | 69 ++++++++++++++++++++------------
 hw/scsi/virtio-scsi-dataplane.c     | 15 +++----
 hw/scsi/virtio-scsi.c               | 18 ++++-----
 hw/virtio/dataplane/vring.c         | 17 ++++----
 hw/virtio/virtio-balloon.c          | 22 ++++++----
 hw/virtio/virtio-rng.c              | 10 +++--
 hw/virtio/virtio.c                  | 11 +++--
 include/hw/virtio/dataplane/vring.h |  2 +-
 include/hw/virtio/virtio-balloon.h  |  2 +-
 include/hw/virtio/virtio-blk.h      |  3 +-
 include/hw/virtio/virtio-net.h      |  2 +-
 include/hw/virtio/virtio-scsi.h     |  2 +-
 include/hw/virtio/virtio-serial.h   |  2 +-
 include/hw/virtio/virtio.h          |  2 +-
 20 files changed, 205 insertions(+), 146 deletions(-)

diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
index 747306b..560daad 100644
--- a/hw/9pfs/virtio-9p.c
+++ b/hw/9pfs/virtio-9p.c
@@ -563,14 +563,6 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
     return 0;
 }
 
-static V9fsPDU *alloc_pdu(V9fsState *s)
-{
-    V9fsPDU *pdu = g_new(V9fsPDU, 1);
-
-    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
-    return pdu;
-}
-
 static void free_pdu(V9fsState *s, V9fsPDU *pdu)
 {
     if (pdu) {
@@ -3254,10 +3246,8 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     V9fsState *s = (V9fsState *)vdev;
     V9fsPDU *pdu;
-    ssize_t len;
 
-    while ((pdu = alloc_pdu(s)) &&
-            (len = virtqueue_pop(vq, &pdu->elem)) != 0) {
+    for (;;) {
         struct {
             uint32_t size_le;
             uint8_t id;
@@ -3265,6 +3255,12 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
         } QEMU_PACKED out;
         int len;
 
+        pdu = virtqueue_pop(vq, sizeof(V9fsPDU));
+        if (!pdu) {
+            break;
+        }
+
+        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
         pdu->s = s;
         BUG_ON(pdu->elem.out_num == 0 || pdu->elem.in_num == 0);
         QEMU_BUILD_BUG_ON(sizeof out != 7);
@@ -3281,7 +3277,6 @@ void handle_9p_output(VirtIODevice *vdev, VirtQueue *vq)
         qemu_co_queue_init(&pdu->complete);
         submit_pdu(s, pdu);
     }
-    free_pdu(s, pdu);
 }
 
 static void __attribute__((__constructor__)) virtio_9p_set_fd_limit(void)
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index c42ddeb..0cadd3a 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -98,20 +98,19 @@ static void handle_notify(EventNotifier *e)
     blk_io_plug(s->conf->conf.blk);
     for (;;) {
         MultiReqBuffer mrb = {};
-        int ret;
 
         /* Disable guest->host notifies to avoid unnecessary vmexits */
         vring_disable_notification(s->vdev, &s->vring);
 
         for (;;) {
-            VirtIOBlockReq *req = virtio_blk_alloc_request(vblk);
+            VirtIOBlockReq *req = vring_pop(s->vdev, &s->vring,
+                                            sizeof(VirtIOBlockReq));
 
-            ret = vring_pop(s->vdev, &s->vring, &req->elem);
-            if (ret < 0) {
-                virtio_blk_free_request(req);
+            if (req == NULL) {
                 break; /* no more requests */
             }
 
+            virtio_blk_init_request(vblk, req);
             trace_virtio_blk_data_plane_process_request(s, req->elem.out_num,
                                                         req->elem.in_num,
                                                         req->elem.index);
@@ -123,7 +122,7 @@ static void handle_notify(EventNotifier *e)
             virtio_blk_submit_multireq(s->conf->conf.blk, &mrb);
         }
 
-        if (likely(ret == -EAGAIN)) { /* vring emptied */
+        if (likely(!vring_more_avail(s->vdev, &s->vring))) { /* vring emptied */
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 756ae5c..d697de4 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -28,15 +28,13 @@
 #include "hw/virtio/virtio-bus.h"
 #include "hw/virtio/virtio-access.h"
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req)
 {
-    VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
     req->dev = s;
     req->qiov.size = 0;
     req->in_len = 0;
     req->next = NULL;
     req->mr_next = NULL;
-    return req;
 }
 
 void virtio_blk_free_request(VirtIOBlockReq *req)
@@ -192,13 +190,11 @@ out:
 
 static VirtIOBlockReq *virtio_blk_get_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+    VirtIOBlockReq *req = virtqueue_pop(s->vq, sizeof(VirtIOBlockReq));
 
-    if (!virtqueue_pop(s->vq, &req->elem)) {
-        virtio_blk_free_request(req);
-        return NULL;
+    if (req) {
+        virtio_blk_init_request(s, req);
     }
-
     return req;
 }
 
@@ -843,7 +839,8 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = virtio_blk_alloc_request(s);
+        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        virtio_blk_init_request(s, req);
         qemu_get_buffer(f, (unsigned char *)&req->elem,
                         sizeof(VirtQueueElement));
         req->next = s->rq;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 497b0af..1af2314 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -82,7 +82,7 @@ static bool use_multiport(VirtIOSerial *vser)
 static size_t write_to_port(VirtIOSerialPort *port,
                             const uint8_t *buf, size_t size)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
     size_t offset;
 
@@ -95,15 +95,17 @@ static size_t write_to_port(VirtIOSerialPort *port,
     while (offset < size) {
         size_t len;
 
-        if (!virtqueue_pop(vq, &elem)) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
 
-        len = iov_from_buf(elem.in_sg, elem.in_num, 0,
+        len = iov_from_buf(elem->in_sg, elem->in_num, 0,
                            buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vq, &elem, len);
+        virtqueue_push(vq, elem, len);
+        g_free(elem);
     }
 
     virtio_notify(VIRTIO_DEVICE(port->vser), vq);
@@ -112,13 +114,18 @@ static size_t write_to_port(VirtIOSerialPort *port,
 
 static void discard_vq_data(VirtQueue *vq, VirtIODevice *vdev)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        virtqueue_push(vq, &elem, 0);
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        virtqueue_push(vq, elem, 0);
+        g_free(elem);
     }
     virtio_notify(vdev, vq);
 }
@@ -137,21 +144,22 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         unsigned int i;
 
         /* Pop an elem only if we haven't left off a previous one mid-way */
-        if (!port->elem.out_num) {
-            if (!virtqueue_pop(vq, &port->elem)) {
+        if (!port->elem) {
+            port->elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+            if (!port->elem) {
                 break;
             }
             port->iov_idx = 0;
             port->iov_offset = 0;
         }
 
-        for (i = port->iov_idx; i < port->elem.out_num; i++) {
+        for (i = port->iov_idx; i < port->elem->out_num; i++) {
             size_t buf_size;
             ssize_t ret;
 
-            buf_size = port->elem.out_sg[i].iov_len - port->iov_offset;
+            buf_size = port->elem->out_sg[i].iov_len - port->iov_offset;
             ret = vsc->have_data(port,
-                                  port->elem.out_sg[i].iov_base
+                                  port->elem->out_sg[i].iov_base
                                   + port->iov_offset,
                                   buf_size);
             if (port->throttled) {
@@ -166,8 +174,9 @@ static void do_flush_queued_data(VirtIOSerialPort *port, VirtQueue *vq,
         if (port->throttled) {
             break;
         }
-        virtqueue_push(vq, &port->elem, 0);
-        port->elem.out_num = 0;
+        virtqueue_push(vq, port->elem, 0);
+        g_free(port->elem);
+        port->elem = NULL;
     }
     virtio_notify(vdev, vq);
 }
@@ -184,22 +193,26 @@ static void flush_queued_data(VirtIOSerialPort *port)
 
 static size_t send_control_msg(VirtIOSerial *vser, void *buf, size_t len)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtQueue *vq;
 
     vq = vser->c_ivq;
     if (!virtio_queue_ready(vq)) {
         return 0;
     }
-    if (!virtqueue_pop(vq, &elem)) {
+
+    elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         return 0;
     }
 
     /* TODO: detect a buffer that's too short, set NEEDS_RESET */
-    iov_from_buf(elem.in_sg, elem.in_num, 0, buf, len);
+    iov_from_buf(elem->in_sg, elem->in_num, 0, buf, len);
 
-    virtqueue_push(vq, &elem, len);
+    virtqueue_push(vq, elem, len);
     virtio_notify(VIRTIO_DEVICE(vser), vq);
+    g_free(elem);
+
     return len;
 }
 
@@ -413,7 +426,7 @@ static void control_in(VirtIODevice *vdev, VirtQueue *vq)
 
 static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     VirtIOSerial *vser;
     uint8_t *buf;
     size_t len;
@@ -422,10 +435,15 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
 
     len = 0;
     buf = NULL;
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t cur_len;
 
-        cur_len = iov_size(elem.out_sg, elem.out_num);
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        cur_len = iov_size(elem->out_sg, elem->out_num);
         /*
          * Allocate a new buf only if we didn't have one previously or
          * if the size of the buf differs
@@ -436,10 +454,11 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
             buf = g_malloc(cur_len);
             len = cur_len;
         }
-        iov_to_buf(elem.out_sg, elem.out_num, 0, buf, cur_len);
+        iov_to_buf(elem->out_sg, elem->out_num, 0, buf, cur_len);
 
         handle_control_message(vser, buf, cur_len);
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
+	g_free(elem);
     }
     g_free(buf);
     virtio_notify(vdev, vq);
@@ -619,7 +638,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         qemu_put_byte(f, port->host_connected);
 
 	elem_popped = 0;
-        if (port->elem.out_num) {
+        if (port->elem) {
             elem_popped = 1;
         }
         qemu_put_be32s(f, &elem_popped);
@@ -627,8 +646,8 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
 
-            qemu_put_buffer(f, (unsigned char *)&port->elem,
-                            sizeof(port->elem));
+            qemu_put_buffer(f, (unsigned char *)port->elem,
+                            sizeof(VirtQueueElement));
         }
     }
 }
@@ -703,9 +722,10 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                qemu_get_buffer(f, (unsigned char *)&port->elem,
-                                sizeof(port->elem));
-                virtqueue_map(&port->elem);
+                port->elem = g_new(VirtQueueElement, 1);
+                qemu_get_buffer(f, (unsigned char *)port->elem,
+                                sizeof(VirtQueueElement));
+                virtqueue_map(port->elem);
 
                 /*
                  *  Port was throttled on source machine.  Let's
@@ -927,7 +947,7 @@ static void virtser_port_device_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    port->elem.out_num = 0;
+    port->elem = NULL;
 }
 
 static void virtser_port_device_plug(HotplugHandler *hotplug_dev,
diff --git a/hw/display/virtio-gpu.c b/hw/display/virtio-gpu.c
index a836ce3..93d696b 100644
--- a/hw/display/virtio-gpu.c
+++ b/hw/display/virtio-gpu.c
@@ -770,8 +770,8 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     }
 #endif
 
-    cmd = g_new(struct virtio_gpu_ctrl_command, 1);
-    while (virtqueue_pop(vq, &cmd->elem)) {
+    cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
+    while (cmd) {
         cmd->vq = vq;
         cmd->error = 0;
         cmd->finished = false;
@@ -781,7 +781,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
 
         VIRGL(g, virtio_gpu_virgl_process_cmd, virtio_gpu_simple_process_cmd,
               g, cmd);
-        if (!cmd->finished) {
+        if (cmd->finished) {
+            g_free(cmd);
+        } else {
             QTAILQ_INSERT_TAIL(&g->fenceq, cmd, next);
             g->inflight++;
             if (virtio_gpu_stats_enabled(g->conf)) {
@@ -790,10 +792,9 @@ static void virtio_gpu_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
                 }
                 fprintf(stderr, "inflight: %3d (+)\r", g->inflight);
             }
-            cmd = g_new(struct virtio_gpu_ctrl_command, 1);
         }
+        cmd = virtqueue_pop(vq, sizeof(struct virtio_gpu_ctrl_command));
     }
-    g_free(cmd);
 
 #ifdef CONFIG_VIRGL
     if (g->use_virgl_renderer) {
@@ -811,15 +812,20 @@ static void virtio_gpu_ctrl_bh(void *opaque)
 static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOGPU *g = VIRTIO_GPU(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct virtio_gpu_update_cursor cursor_info;
 
     if (!virtio_queue_ready(vq)) {
         return;
     }
-    while (virtqueue_pop(vq, &elem)) {
-        s = iov_to_buf(elem.out_sg, elem.out_num, 0,
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        s = iov_to_buf(elem->out_sg, elem->out_num, 0,
                        &cursor_info, sizeof(cursor_info));
         if (s != sizeof(cursor_info)) {
             qemu_log_mask(LOG_GUEST_ERROR,
@@ -828,8 +834,9 @@ static void virtio_gpu_handle_cursor(VirtIODevice *vdev, VirtQueue *vq)
         } else {
             update_cursor(g, &cursor_info);
         }
-        virtqueue_push(vq, &elem, 0);
+        virtqueue_push(vq, elem, 0);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
diff --git a/hw/input/virtio-input.c b/hw/input/virtio-input.c
index 1f5a40d..bbae884 100644
--- a/hw/input/virtio-input.c
+++ b/hw/input/virtio-input.c
@@ -16,7 +16,7 @@
 
 void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 {
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     unsigned have, need;
     int i, len;
 
@@ -49,14 +49,16 @@ void virtio_input_send(VirtIOInput *vinput, virtio_input_event *event)
 
     /* ... and finally pass them to the guest */
     for (i = 0; i < vinput->qindex; i++) {
-        if (!virtqueue_pop(vinput->evt, &elem)) {
+        elem = virtqueue_pop(vinput->evt, sizeof(VirtQueueElement));
+        if (!elem) {
             /* should not happen, we've checked for space beforehand */
             fprintf(stderr, "%s: Huh?  No vq elem available ...\n", __func__);
             return;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, vinput->queue+i, sizeof(virtio_input_event));
-        virtqueue_push(vinput->evt, &elem, len);
+        virtqueue_push(vinput->evt, elem, len);
+        g_free(elem);
     }
     virtio_notify(VIRTIO_DEVICE(vinput), vinput->evt);
     vinput->qindex = 0;
@@ -72,17 +74,23 @@ static void virtio_input_handle_sts(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOInputClass *vic = VIRTIO_INPUT_GET_CLASS(vdev);
     VirtIOInput *vinput = VIRTIO_INPUT(vdev);
     virtio_input_event event;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int len;
 
-    while (virtqueue_pop(vinput->sts, &elem)) {
+    for (;;) {
+        elem = virtqueue_pop(vinput->sts, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
         memset(&event, 0, sizeof(event));
-        len = iov_to_buf(elem.out_sg, elem.out_num,
+        len = iov_to_buf(elem->out_sg, elem->out_num,
                          0, &event, sizeof(event));
         if (vic->handle_status) {
             vic->handle_status(vinput, &event);
         }
-        virtqueue_push(vinput->sts, &elem, len);
+        virtqueue_push(vinput->sts, elem, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vinput->sts);
 }
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index a877614..c03b568 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -818,20 +818,24 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     VirtIONet *n = VIRTIO_NET(vdev);
     struct virtio_net_ctrl_hdr ctrl;
     virtio_net_ctrl_ack status = VIRTIO_NET_ERR;
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t s;
     struct iovec *iov, *iov2;
     unsigned int iov_cnt;
 
-    while (virtqueue_pop(vq, &elem)) {
-        if (iov_size(elem.in_sg, elem.in_num) < sizeof(status) ||
-            iov_size(elem.out_sg, elem.out_num) < sizeof(ctrl)) {
+    for (;;) {
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+        if (iov_size(elem->in_sg, elem->in_num) < sizeof(status) ||
+            iov_size(elem->out_sg, elem->out_num) < sizeof(ctrl)) {
             error_report("virtio-net ctrl missing headers");
             exit(1);
         }
 
-        iov_cnt = elem.out_num;
-        iov2 = iov = g_memdup(elem.out_sg, sizeof(struct iovec) * elem.out_num);
+        iov_cnt = elem->out_num;
+        iov2 = iov = g_memdup(elem->out_sg, sizeof(struct iovec) * elem->out_num);
         s = iov_to_buf(iov, iov_cnt, 0, &ctrl, sizeof(ctrl));
         iov_discard_front(&iov, &iov_cnt, sizeof(ctrl));
         if (s != sizeof(ctrl)) {
@@ -850,12 +854,13 @@ static void virtio_net_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
             status = virtio_net_handle_offloads(n, ctrl.cmd, iov, iov_cnt);
         }
 
-        s = iov_from_buf(elem.in_sg, elem.in_num, 0, &status, sizeof(status));
+        s = iov_from_buf(elem->in_sg, elem->in_num, 0, &status, sizeof(status));
         assert(s == sizeof(status));
 
-        virtqueue_push(vq, &elem, sizeof(status));
+        virtqueue_push(vq, elem, sizeof(status));
         virtio_notify(vdev, vq);
         g_free(iov2);
+        g_free(elem);
     }
 }
 
@@ -1044,13 +1049,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
     offset = i = 0;
 
     while (offset < size) {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
         int len, total;
-        const struct iovec *sg = elem.in_sg;
+        const struct iovec *sg;
 
         total = 0;
 
-        if (virtqueue_pop(q->rx_vq, &elem) == 0) {
+        elem = virtqueue_pop(q->rx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
             if (i == 0)
                 return -1;
             error_report("virtio-net unexpected empty queue: "
@@ -1063,21 +1069,22 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
             exit(1);
         }
 
-        if (elem.in_num < 1) {
+        if (elem->in_num < 1) {
             error_report("virtio-net receive queue contains no in buffers");
             exit(1);
         }
 
+        sg = elem->in_sg;
         if (i == 0) {
             assert(offset == 0);
             if (n->mergeable_rx_bufs) {
                 mhdr_cnt = iov_copy(mhdr_sg, ARRAY_SIZE(mhdr_sg),
-                                    sg, elem.in_num,
+                                    sg, elem->in_num,
                                     offsetof(typeof(mhdr), num_buffers),
                                     sizeof(mhdr.num_buffers));
             }
 
-            receive_header(n, sg, elem.in_num, buf, size);
+            receive_header(n, sg, elem->in_num, buf, size);
             offset = n->host_hdr_len;
             total += n->guest_hdr_len;
             guest_offset = n->guest_hdr_len;
@@ -1086,7 +1093,7 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
         }
 
         /* copy in packet.  ugh */
-        len = iov_from_buf(sg, elem.in_num, guest_offset,
+        len = iov_from_buf(sg, elem->in_num, guest_offset,
                            buf + offset, size - offset);
         total += len;
         offset += len;
@@ -1094,12 +1101,14 @@ static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *buf, size_t
          * must have consumed the complete packet.
          * Otherwise, drop it. */
         if (!n->mergeable_rx_bufs && offset < size) {
-            virtqueue_discard(q->rx_vq, &elem, total);
+            virtqueue_discard(q->rx_vq, elem, total);
+            g_free(elem);
             return size;
         }
 
         /* signal other side */
-        virtqueue_fill(q->rx_vq, &elem, total, i++);
+        virtqueue_fill(q->rx_vq, elem, total, i++);
+        g_free(elem);
     }
 
     if (mhdr_cnt) {
@@ -1123,10 +1132,11 @@ static void virtio_net_tx_complete(NetClientState *nc, ssize_t len)
     VirtIONetQueue *q = virtio_net_get_subqueue(nc);
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
 
-    virtqueue_push(q->tx_vq, &q->async_tx.elem, 0);
+    virtqueue_push(q->tx_vq, q->async_tx.elem, 0);
     virtio_notify(vdev, q->tx_vq);
 
-    q->async_tx.elem.out_num = 0;
+    g_free(q->async_tx.elem);
+    q->async_tx.elem = NULL;
 
     virtio_queue_set_notification(q->tx_vq, 1);
     virtio_net_flush_tx(q);
@@ -1137,25 +1147,31 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
 {
     VirtIONet *n = q->n;
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     int32_t num_packets = 0;
     int queue_index = vq2q(virtio_get_queue_index(q->tx_vq));
     if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
         return num_packets;
     }
 
-    if (q->async_tx.elem.out_num) {
+    if (q->async_tx.elem) {
         virtio_queue_set_notification(q->tx_vq, 0);
         return num_packets;
     }
 
-    while (virtqueue_pop(q->tx_vq, &elem)) {
+    for (;;) {
         ssize_t ret;
-        unsigned int out_num = elem.out_num;
-        struct iovec *out_sg = &elem.out_sg[0];
-        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1];
+        unsigned int out_num;
+        struct iovec sg[VIRTQUEUE_MAX_SIZE], sg2[VIRTQUEUE_MAX_SIZE + 1], *out_sg;
         struct virtio_net_hdr_mrg_rxbuf mhdr;
 
+        elem = virtqueue_pop(q->tx_vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            break;
+        }
+
+        out_num = elem->out_num;
+        out_sg = elem->out_sg;
         if (out_num < 1) {
             error_report("virtio-net header not in first element");
             exit(1);
@@ -1207,8 +1223,9 @@ static int32_t virtio_net_flush_tx(VirtIONetQueue *q)
         }
 
 drop:
-        virtqueue_push(q->tx_vq, &elem, 0);
+        virtqueue_push(q->tx_vq, elem, 0);
         virtio_notify(vdev, q->tx_vq);
+        g_free(elem);
 
         if (++num_packets >= n->tx_burst) {
             break;
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 0d8d71e..45ae2d2 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -80,15 +80,16 @@ fail_vring:
 VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
                                          VirtIOSCSIVring *vring)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, NULL);
-    int r;
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
 
-    req->vring = vring;
-    r = vring_pop((VirtIODevice *)s, &vring->vring, &req->elem);
-    if (r < 0) {
-        virtio_scsi_free_req(req);
-        req = NULL;
+    req = vring_pop((VirtIODevice *)s, &vring->vring,
+                    sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
+        return NULL;
     }
+    virtio_scsi_init_req(s, NULL, req);
+    req->vring = vring;
     return req;
 }
 
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 179c341..f5985dd 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -40,19 +40,15 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
     return scsi_device_find(&s->bus, 0, lun[1], virtio_scsi_get_lun(lun));
 }
 
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq)
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
 {
-    VirtIOSCSIReq *req;
-    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
     const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
 
-    req = g_malloc(sizeof(*req) + vs->cdb_size);
     req->vq = vq;
     req->dev = s;
     qemu_sglist_init(&req->qsgl, DEVICE(s), 8, &address_space_memory);
     qemu_iovec_init(&req->resp_iov, 1);
     memset((uint8_t *)req + zero_skip, 0, sizeof(*req) - zero_skip);
-    return req;
 }
 
 void virtio_scsi_free_req(VirtIOSCSIReq *req)
@@ -173,11 +169,14 @@ static int virtio_scsi_parse_req(VirtIOSCSIReq *req,
 
 static VirtIOSCSIReq *virtio_scsi_pop_req(VirtIOSCSI *s, VirtQueue *vq)
 {
-    VirtIOSCSIReq *req = virtio_scsi_init_req(s, vq);
-    if (!virtqueue_pop(vq, &req->elem)) {
-        virtio_scsi_free_req(req);
+    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
+    VirtIOSCSIReq *req;
+
+    req = virtqueue_pop(vq, sizeof(VirtIOSCSIReq) + vs->cdb_size);
+    if (!req) {
         return NULL;
     }
+    virtio_scsi_init_req(s, vq, req);
     return req;
 }
 
@@ -202,8 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = virtio_scsi_init_req(s, vs->cmd_vqs[n]);
+    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
     qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
     virtqueue_map(&req->elem);
 
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 23f667e..1b900fc 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -388,23 +388,25 @@ static void vring_unmap_element(VirtQueueElement *elem)
  *
  * Stolen from linux/drivers/vhost/vhost.c.
  */
-int vring_pop(VirtIODevice *vdev, Vring *vring,
-              VirtQueueElement *elem)
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
 {
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueElement *elem = NULL;
     int ret;
 
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
-
     /* If there was a fatal error then refuse operation */
     if (vring->broken) {
         ret = -EFAULT;
         goto out;
     }
 
+    elem = g_malloc(sz);
+
+    /* Initialize elem so it can be safely unmapped */
+    elem->in_num = elem->out_num = 0;
+
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
     avail_idx = vring_get_avail_idx(vdev, vring);
@@ -480,7 +482,7 @@ int vring_pop(VirtIODevice *vdev, Vring *vring,
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
-    return head;
+    return elem;
 
 out:
     assert(ret < 0);
@@ -488,7 +490,8 @@ out:
         vring->broken = true;
     }
     vring_unmap_element(elem);
-    return ret;
+    g_free(elem);
+    return NULL;
 }
 
 /* After we've used one of their buffers, we tell them about it.
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index 9671635..2704680 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -106,8 +106,10 @@ static void balloon_stats_poll_cb(void *opaque)
         return;
     }
 
-    virtqueue_push(s->svq, &s->stats_vq_elem, s->stats_vq_offset);
+    virtqueue_push(s->svq, s->stats_vq_elem, s->stats_vq_offset);
     virtio_notify(vdev, s->svq);
+    g_free(s->stats_vq_elem);
+    s->stats_vq_elem = NULL;
 }
 
 static void balloon_stats_get_all(Object *obj, struct Visitor *v,
@@ -205,14 +207,18 @@ static void balloon_stats_set_poll_interval(Object *obj, struct Visitor *v,
 static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     MemoryRegionSection section;
 
-    while (virtqueue_pop(vq, &elem)) {
+    for (;;) {
         size_t offset = 0;
         uint32_t pfn;
+        elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+        if (!elem) {
+            return;
+        }
 
-        while (iov_to_buf(elem.out_sg, elem.out_num, offset, &pfn, 4) == 4) {
+        while (iov_to_buf(elem->out_sg, elem->out_num, offset, &pfn, 4) == 4) {
             ram_addr_t pa;
             ram_addr_t addr;
             int p = virtio_ldl_p(vdev, &pfn);
@@ -235,20 +241,22 @@ static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             memory_region_unref(section.mr);
         }
 
-        virtqueue_push(vq, &elem, offset);
+        virtqueue_push(vq, elem, offset);
         virtio_notify(vdev, vq);
+        g_free(elem);
     }
 }
 
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBalloon *s = VIRTIO_BALLOON(vdev);
-    VirtQueueElement *elem = &s->stats_vq_elem;
+    VirtQueueElement *elem;
     VirtIOBalloonStat stat;
     size_t offset = 0;
     qemu_timeval tv;
 
-    if (!virtqueue_pop(vq, elem)) {
+    s->stats_vq_elem = elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
+    if (!elem) {
         goto out;
     }
 
diff --git a/hw/virtio/virtio-rng.c b/hw/virtio/virtio-rng.c
index 97d1541..5f92047 100644
--- a/hw/virtio/virtio-rng.c
+++ b/hw/virtio/virtio-rng.c
@@ -43,7 +43,7 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 {
     VirtIORNG *vrng = opaque;
     VirtIODevice *vdev = VIRTIO_DEVICE(vrng);
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
     size_t len;
     int offset;
 
@@ -55,15 +55,17 @@ static void chr_read(void *opaque, const void *buf, size_t size)
 
     offset = 0;
     while (offset < size) {
-        if (!virtqueue_pop(vrng->vq, &elem)) {
+        elem = virtqueue_pop(vrng->vq, sizeof(VirtQueueElement));
+        if (!elem) {
             break;
         }
-        len = iov_from_buf(elem.in_sg, elem.in_num,
+        len = iov_from_buf(elem->in_sg, elem->in_num,
                            0, buf + offset, size - offset);
         offset += len;
 
-        virtqueue_push(vrng->vq, &elem, len);
+        virtqueue_push(vrng->vq, elem, len);
         trace_virtio_rng_pushed(vrng, len);
+        g_free(elem);
     }
     virtio_notify(vdev, vrng->vq);
 }
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 1edef59..dc76a99 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -501,16 +501,19 @@ void virtqueue_map(VirtQueueElement *elem)
                         0);
 }
 
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
+void *virtqueue_pop(VirtQueue *vq, size_t sz)
 {
     unsigned int i, head, max;
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
+    VirtQueueElement *elem;
 
-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
-        return 0;
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
+        return NULL;
+    }
 
     /* When we start there are none of either input nor output. */
+    elem = g_malloc(sz);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -569,7 +572,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     vq->inuse++;
 
     trace_virtqueue_pop(vq, elem, elem->in_num, elem->out_num);
-    return elem->in_num + elem->out_num;
+    return elem;
 }
 
 /* virtio device */
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index a596e4c..e80985e 100644
--- a/include/hw/virtio/dataplane/vring.h
+++ b/include/hw/virtio/dataplane/vring.h
@@ -44,7 +44,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_enable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
-int vring_pop(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem);
+void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
 void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
                 int len);
 
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 09c2ce4..35f62ac 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -37,7 +37,7 @@ typedef struct VirtIOBalloon {
     uint32_t num_pages;
     uint32_t actual;
     uint64_t stats[VIRTIO_BALLOON_S_NR];
-    VirtQueueElement stats_vq_elem;
+    VirtQueueElement *stats_vq_elem;
     size_t stats_vq_offset;
     QEMUTimer *stats_timer;
     int64_t stats_last_update;
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index ef5f90f..9a2048d 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -81,8 +81,7 @@ typedef struct MultiReqBuffer {
     bool is_write;
 } MultiReqBuffer;
 
-VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
-
+void virtio_blk_init_request(VirtIOBlock *s, VirtIOBlockReq *req);
 void virtio_blk_free_request(VirtIOBlockReq *req);
 
 void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb);
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index f3cc25f..2ce3b03 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -47,7 +47,7 @@ typedef struct VirtIONetQueue {
     QEMUBH *tx_bh;
     int tx_waiting;
     struct {
-        VirtQueueElement elem;
+        VirtQueueElement *elem;
     } async_tx;
     struct VirtIONet *n;
 } VirtIONetQueue;
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index 63f5b51..d9ae76c 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -150,7 +150,7 @@ void virtio_scsi_common_unrealize(DeviceState *dev, Error **errp);
 void virtio_scsi_handle_ctrl_req(VirtIOSCSI *s, VirtIOSCSIReq *req);
 bool virtio_scsi_handle_cmd_req_prepare(VirtIOSCSI *s, VirtIOSCSIReq *req);
 void virtio_scsi_handle_cmd_req_submit(VirtIOSCSI *s, VirtIOSCSIReq *req);
-VirtIOSCSIReq *virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq);
+void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req);
 void virtio_scsi_free_req(VirtIOSCSIReq *req);
 void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
                             uint32_t event, uint32_t reason);
diff --git a/include/hw/virtio/virtio-serial.h b/include/hw/virtio/virtio-serial.h
index 527d0bf..12a55a1 100644
--- a/include/hw/virtio/virtio-serial.h
+++ b/include/hw/virtio/virtio-serial.h
@@ -122,7 +122,7 @@ struct VirtIOSerialPort {
      * element popped and continue consuming it once the backend
      * becomes writable again.
      */
-    VirtQueueElement elem;
+    VirtQueueElement *elem;
 
     /*
      * The index and the offset into the iov buffer that was popped in
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 205fadf..21fda17 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -152,7 +152,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len, unsigned int idx);
 
 void virtqueue_map(VirtQueueElement *elem);
-int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem);
+void *virtqueue_pop(VirtQueue *vq, size_t sz);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (2 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
                   ` (36 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Move allocation into the virtio functions when loading/saving a
VirtQueueElement, too.  This also lets the load/save functions keep
backwards compatibility when the VirtQueueElement layout is changed.
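
For illustration, a device's save/load handlers end up following this
pattern (a condensed sketch of the virtio-blk hunks below, not a new
API):

    /* save: serialize the element that was popped from the virtqueue */
    qemu_put_virtqueue_element(f, &req->elem);

    /* load: allocate element and device request as one block; the
     * getter also remaps the guest buffers via virtqueue_map() */
    req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));
    virtio_blk_init_request(s, req);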

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/virtio-blk.c       | 10 +++-------
 hw/char/virtio-serial-bus.c | 10 +++-------
 hw/scsi/virtio-scsi.c       |  7 ++-----
 hw/virtio/virtio.c          | 13 +++++++++++++
 include/hw/virtio/virtio.h  |  2 ++
 5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d697de4..773b61c 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -815,8 +815,7 @@ static void virtio_blk_save_device(VirtIODevice *vdev, QEMUFile *f)
 
     while (req) {
         qemu_put_sbyte(f, 1);
-        qemu_put_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
+        qemu_put_virtqueue_element(f, &req->elem);
         req = req->next;
     }
     qemu_put_sbyte(f, 0);
@@ -839,14 +838,11 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f,
     VirtIOBlock *s = VIRTIO_BLK(vdev);
 
     while (qemu_get_sbyte(f)) {
-        VirtIOBlockReq *req = g_new(VirtIOBlockReq, 1);
+        VirtIOBlockReq *req;
+        req = qemu_get_virtqueue_element(f, sizeof(VirtIOBlockReq));
         virtio_blk_init_request(s, req);
-        qemu_get_buffer(f, (unsigned char *)&req->elem,
-                        sizeof(VirtQueueElement));
         req->next = s->rq;
         s->rq = req;
-
-        virtqueue_map(&req->elem);
     }
 
     return 0;
diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c
index 1af2314..93d8309 100644
--- a/hw/char/virtio-serial-bus.c
+++ b/hw/char/virtio-serial-bus.c
@@ -645,9 +645,7 @@ static void virtio_serial_save_device(VirtIODevice *vdev, QEMUFile *f)
         if (elem_popped) {
             qemu_put_be32s(f, &port->iov_idx);
             qemu_put_be64s(f, &port->iov_offset);
-
-            qemu_put_buffer(f, (unsigned char *)port->elem,
-                            sizeof(VirtQueueElement));
+            qemu_put_virtqueue_element(f, port->elem);
         }
     }
 }
@@ -722,10 +720,8 @@ static int fetch_active_ports_list(QEMUFile *f, int version_id,
                 qemu_get_be32s(f, &port->iov_idx);
                 qemu_get_be64s(f, &port->iov_offset);
 
-                port->elem = g_new(VirtQueueElement, 1);
-                qemu_get_buffer(f, (unsigned char *)port->elem,
-                                sizeof(VirtQueueElement));
-                virtqueue_map(port->elem);
+                port->elem =
+                    qemu_get_virtqueue_element(f, sizeof(VirtQueueElement));
 
                 /*
                  *  Port was throttled on source machine.  Let's
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index f5985dd..3ddfcf1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -188,7 +188,7 @@ static void virtio_scsi_save_request(QEMUFile *f, SCSIRequest *sreq)
 
     assert(n < vs->conf.num_queues);
     qemu_put_be32s(f, &n);
-    qemu_put_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    qemu_put_virtqueue_element(f, &req->elem);
 }
 
 static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
@@ -201,12 +201,9 @@ static void *virtio_scsi_load_request(QEMUFile *f, SCSIRequest *sreq)
 
     qemu_get_be32s(f, &n);
     assert(n < vs->conf.num_queues);
-    req = g_malloc(sizeof(VirtIOSCSIReq) + vs->cdb_size);
-    qemu_get_buffer(f, (unsigned char *)&req->elem, sizeof(req->elem));
+    req = qemu_get_virtqueue_element(f, sizeof(VirtIOSCSIReq) + vs->cdb_size);
     virtio_scsi_init_req(s, vs->cmd_vqs[n], req);
 
-    virtqueue_map(&req->elem);
-
     if (virtio_scsi_parse_req(req, sizeof(VirtIOSCSICmdReq) + vs->cdb_size,
                               sizeof(VirtIOSCSICmdResp) + vs->sense_size) < 0) {
         error_report("invalid SCSI request migration data");
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index dc76a99..fd63206 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -575,6 +575,19 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     return elem;
 }
 
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
+{
+    VirtQueueElement *elem = g_malloc(sz);
+    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    virtqueue_map(elem);
+    return elem;
+}
+
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
+{
+    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+}
+
 /* virtio device */
 static void virtio_notify_vector(VirtIODevice *vdev, uint16_t vector)
 {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 21fda17..44da9a8 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -153,6 +153,8 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
 
 void virtqueue_map(VirtQueueElement *elem);
 void *virtqueue_pop(VirtQueue *vq, size_t sz);
+void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz);
+void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem);
 int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
                           unsigned int out_bytes);
 void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (3 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  9:47   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
                   ` (35 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index fd63206..f5f8108 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -578,14 +578,105 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
 {
     VirtQueueElement *elem = g_malloc(sz);
-    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    bool swap;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    uint64_t scratch;
+    int i;
+
+    qemu_get_be32s(f, &elem->index);
+    qemu_get_be32s(f, &elem->out_num);
+    qemu_get_be32s(f, &elem->in_num);
+
+    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
+    if (swap) {
+        bswap32s(&elem->index);
+        bswap32s(&elem->out_num);
+        bswap32s(&elem->in_num);
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_get_be64s(f, &elem->in_addr[i]);
+        if (swap) {
+            bswap64s(&elem->in_addr[i]);
+        }
+    }
+    if (i < ARRAY_SIZE(addr)) {
+        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_get_be64s(f, &elem->out_addr[i]);
+        if (swap) {
+            bswap64s(&elem->out_addr[i]);
+        }
+    }
+    if (i < ARRAY_SIZE(addr)) {
+        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+    }
+
+    for (i = 0; i < elem->in_num; i++) {
+        (void) qemu_get_be64(f); /* base */
+        qemu_get_be64s(f, &scratch); /* length */
+        if (swap) {
+            bswap64s(&scratch);
+        }
+        elem->in_sg[i].iov_len = scratch;
+    }
+    if (i < ARRAY_SIZE(iov)) {
+        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+    }
+
+    for (i = 0; i < elem->out_num; i++) {
+        (void) qemu_get_be64(f); /* base */
+        qemu_get_be64s(f, &scratch); /* length */
+        if (swap) {
+            bswap64s(&scratch);
+        }
+        elem->out_sg[i].iov_len = scratch;
+    }
+    if (i < ARRAY_SIZE(iov)) {
+        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+    }
+
     virtqueue_map(elem);
     return elem;
 }
 
 void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
 {
-    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    int i;
+
+    memset(addr, 0, sizeof(addr));
+    memset(iov, 0, sizeof(iov));
+
+    qemu_put_be32s(f, &elem->index);
+    qemu_put_be32s(f, &elem->out_num);
+    qemu_put_be32s(f, &elem->in_num);
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_put_be64s(f, &elem->in_addr[i]);
+    }
+    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_put_be64s(f, &elem->out_addr[i]);
+    }
+    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
+
+    for (i = 0; i < elem->in_num; i++) {
+        qemu_put_be64(f, 0);
+        qemu_put_be64(f, elem->in_sg[i].iov_len);
+    }
+    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
+
+    for (i = 0; i < elem->out_num; i++) {
+        qemu_put_be64(f, 0);
+        qemu_put_be64(f, elem->out_sg[i].iov_len);
+    }
+    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
 }
 
 /* virtio device */
-- 
1.8.3.1

* [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (4 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
                   ` (34 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Allocate the arrays for in_addr/out_addr/in_sg/out_sg outside the
VirtQueueElement.  For now, virtqueue_pop and vring_pop keep
allocating a very large VirtQueueElement.
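
As a sketch of the intended end state, using virtio-blk's request as
the example, a "subclass" embeds the element as its first field and a
single allocation covers everything:

    typedef struct VirtIOBlockReq {
        VirtQueueElement elem;      /* must come first */
        /* ... device-specific fields ... */
    } VirtIOBlockReq;

    VirtIOBlockReq *req = virtqueue_pop(vq, sizeof(VirtIOBlockReq));
    /* ... use req->elem.in_sg / req->elem.out_sg ... */
    g_free(req);                    /* one free releases it all */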

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/dataplane/vring.c |  2 +-
 hw/virtio/virtio.c          | 60 +++++++++++++++++++++++++++++++--------------
 include/hw/virtio/virtio.h  |  9 ++++---
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index 1b900fc..c950caa 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -402,7 +402,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
 
     /* Initialize elem so it can be safely unmapped */
     elem->in_num = elem->out_num = 0;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index f5f8108..32c89eb 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -494,11 +494,29 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
 void virtqueue_map(VirtQueueElement *elem)
 {
     virtqueue_map_iovec(elem->in_sg, elem->in_addr, &elem->in_num,
-                        MIN(ARRAY_SIZE(elem->in_sg), ARRAY_SIZE(elem->in_addr)),
-                        1);
+                        VIRTQUEUE_MAX_SIZE, 1);
     virtqueue_map_iovec(elem->out_sg, elem->out_addr, &elem->out_num,
-                        MIN(ARRAY_SIZE(elem->out_sg), ARRAY_SIZE(elem->out_addr)),
-                        0);
+                        VIRTQUEUE_MAX_SIZE, 0);
+}
+
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num)
+{
+    VirtQueueElement *elem;
+    size_t in_addr_ofs = QEMU_ALIGN_UP(sz, __alignof__(elem->in_addr[0]));
+    size_t out_addr_ofs = in_addr_ofs + in_num * sizeof(elem->in_addr[0]);
+    size_t out_addr_end = out_addr_ofs + out_num * sizeof(elem->out_addr[0]);
+    size_t in_sg_ofs = QEMU_ALIGN_UP(out_addr_end, __alignof__(elem->in_sg[0]));
+    size_t out_sg_ofs = in_sg_ofs + in_num * sizeof(elem->in_sg[0]);
+    size_t out_sg_end = out_sg_ofs + out_num * sizeof(elem->out_sg[0]);
+
+    elem = g_malloc(out_sg_end);
+    elem->out_num = out_num;
+    elem->in_num = in_num;
+    elem->in_addr = (void *)elem + in_addr_ofs;
+    elem->out_addr = (void *)elem + out_addr_ofs;
+    elem->in_sg = (void *)elem + in_sg_ofs;
+    elem->out_sg = (void *)elem + out_sg_ofs;
+    return elem;
 }
 
 void *virtqueue_pop(VirtQueue *vq, size_t sz)
@@ -513,7 +531,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     }
 
     /* When we start there are none of either input nor output. */
-    elem = g_malloc(sz);
+    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
     elem->out_num = elem->in_num = 0;
 
     max = vq->vring.num;
@@ -540,14 +558,14 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         struct iovec *sg;
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= ARRAY_SIZE(elem->in_sg)) {
+            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many write descriptors in indirect table");
                 exit(1);
             }
             elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
             sg = &elem->in_sg[elem->in_num++];
         } else {
-            if (elem->out_num >= ARRAY_SIZE(elem->out_sg)) {
+            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
                 error_report("Too many read descriptors in indirect table");
                 exit(1);
             }
@@ -577,31 +595,35 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 
 void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
 {
-    VirtQueueElement *elem = g_malloc(sz);
+    VirtQueueElement *elem;
     bool swap;
     hwaddr addr[VIRTQUEUE_MAX_SIZE];
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    uint32_t index, out_num, in_num;
     uint64_t scratch;
     int i;
 
-    qemu_get_be32s(f, &elem->index);
-    qemu_get_be32s(f, &elem->out_num);
-    qemu_get_be32s(f, &elem->in_num);
+    qemu_get_be32s(f, &index);
+    qemu_get_be32s(f, &out_num);
+    qemu_get_be32s(f, &in_num);
 
-    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
+    swap = (out_num & 0xFFFF0000) || (in_num & 0xFFFF0000);
     if (swap) {
-        bswap32s(&elem->index);
-        bswap32s(&elem->out_num);
-        bswap32s(&elem->in_num);
+        bswap32s(&index);
+        bswap32s(&out_num);
+        bswap32s(&in_num);
     }
 
+    elem = virtqueue_alloc_element(sz, out_num, in_num);
+    elem->index = index;
+
     for (i = 0; i < elem->in_num; i++) {
         qemu_get_be64s(f, &elem->in_addr[i]);
         if (swap) {
             bswap64s(&elem->in_addr[i]);
         }
     }
-    if (i < ARRAY_SIZE(addr)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
     }
 
@@ -611,7 +633,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
             bswap64s(&elem->out_addr[i]);
         }
     }
-    if (i < ARRAY_SIZE(addr)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
     }
 
@@ -623,7 +645,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
         }
         elem->in_sg[i].iov_len = scratch;
     }
-    if (i < ARRAY_SIZE(iov)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
     }
 
@@ -635,7 +657,7 @@ void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
         }
         elem->out_sg[i].iov_len = scratch;
     }
-    if (i < ARRAY_SIZE(iov)) {
+    if (i < VIRTQUEUE_MAX_SIZE) {
         qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
     }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 44da9a8..108cdb0 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -46,10 +46,10 @@ typedef struct VirtQueueElement
     unsigned int index;
     unsigned int out_num;
     unsigned int in_num;
-    hwaddr in_addr[VIRTQUEUE_MAX_SIZE];
-    hwaddr out_addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec in_sg[VIRTQUEUE_MAX_SIZE];
-    struct iovec out_sg[VIRTQUEUE_MAX_SIZE];
+    hwaddr *in_addr;
+    hwaddr *out_addr;
+    struct iovec *in_sg;
+    struct iovec *out_sg;
 } VirtQueueElement;
 
 #define VIRTIO_QUEUE_MAX 1024
@@ -143,6 +143,7 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 
 void virtio_del_queue(VirtIODevice *vdev, int n);
 
+void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num);
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
                     unsigned int len);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (5 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-30  3:24   ` Fam Zheng
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
                   ` (33 subsequent siblings)
  40 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.

When virtqueue_map is used on the destination side of migration or on
loadvm, the iovecs have already been split at memory region boundaries,
so we can just reuse the out_num/in_num we find in the file.
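
In outline, virtqueue_pop now has this shape (condensed from the hunks
below, error handling omitted):

    hwaddr addr[VIRTQUEUE_MAX_SIZE];
    struct iovec iov[VIRTQUEUE_MAX_SIZE];
    unsigned out_num = 0, in_num = 0;

    /* walk the descriptor chain, mapping each descriptor into the
     * stack arrays; out descriptors are placed before in descriptors */

    elem = virtqueue_alloc_element(sz, out_num, in_num);
    /* copy addr[]/iov[] into elem->out_* and elem->in_* */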

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 82 +++++++++++++++++++++++++++++++++---------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 32c89eb..0163d0f 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -448,6 +448,32 @@ int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
     return in_bytes <= in_total && out_bytes <= out_total;
 }
 
+static void virtqueue_map_desc(unsigned int *p_num_sg, hwaddr *addr, struct iovec *iov,
+                               unsigned int max_num_sg, bool is_write,
+                               hwaddr pa, size_t sz)
+{
+    unsigned num_sg = *p_num_sg;
+    assert(num_sg <= max_num_sg);
+
+    while (sz) {
+        hwaddr len = sz;
+
+        if (num_sg == max_num_sg) {
+            error_report("virtio: too many write descriptors in indirect table");
+            exit(1);
+        }
+
+        iov[num_sg].iov_base = cpu_physical_memory_map(pa, &len, is_write);
+        iov[num_sg].iov_len = len;
+        addr[num_sg] = pa;
+
+        sz -= len;
+        pa += len;
+        num_sg++;
+    }
+    *p_num_sg = num_sg;
+}
+
 static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
                                 unsigned int *num_sg, unsigned int max_size,
                                 int is_write)
@@ -474,20 +500,10 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
             error_report("virtio: error trying to map MMIO memory");
             exit(1);
         }
-        if (len == sg[i].iov_len) {
-            continue;
-        }
-        if (*num_sg >= max_size) {
-            error_report("virtio: memory split makes iovec too large");
+        if (len != sg[i].iov_len) {
+            error_report("virtio: unexpected memory split");
             exit(1);
         }
-        memmove(sg + i + 1, sg + i, sizeof(*sg) * (*num_sg - i));
-        memmove(addr + i + 1, addr + i, sizeof(*addr) * (*num_sg - i));
-        assert(len < sg[i + 1].iov_len);
-        sg[i].iov_len = len;
-        addr[i + 1] += len;
-        sg[i + 1].iov_len -= len;
-        ++*num_sg;
     }
 }
 
@@ -525,14 +541,16 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     hwaddr desc_pa = vq->vring.desc;
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem;
+    unsigned out_num, in_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
     }
 
     /* When we start there are none of either input nor output. */
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-    elem->out_num = elem->in_num = 0;
+    out_num = in_num = 0;
 
     max = vq->vring.num;
 
@@ -555,37 +573,39 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
 
     /* Collect all the descriptors */
     do {
-        struct iovec *sg;
+        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
+        size_t len = vring_desc_len(vdev, desc_pa, i);
 
         if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many write descriptors in indirect table");
-                exit(1);
-            }
-            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->in_sg[elem->in_num++];
+            virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
+                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
         } else {
-            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
-                error_report("Too many read descriptors in indirect table");
+            if (in_num) {
+                error_report("Incorrect order for descriptors");
                 exit(1);
             }
-            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
-            sg = &elem->out_sg[elem->out_num++];
+            virtqueue_map_desc(&out_num, addr, iov,
+                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
         }
 
-        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
-
         /* If we've got too many, that implies a descriptor loop. */
-        if ((elem->in_num + elem->out_num) > max) {
+        if ((in_num + out_num) > max) {
             error_report("Looped descriptor");
             exit(1);
         }
     } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
 
-    /* Now map what we have collected */
-    virtqueue_map(elem);
-
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, out_num, in_num);
     elem->index = head;
+    for (i = 0; i < out_num; i++) {
+        elem->out_addr[i] = addr[i];
+        elem->out_sg[i] = iov[i];
+    }
+    for (i = 0; i < in_num; i++) {
+        elem->in_addr[i] = addr[out_num + i];
+        elem->in_sg[i] = iov[out_num + i];
+    }
 
     vq->inuse++;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 08/40] vring: slim down allocation of VirtQueueElements
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (6 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
@ 2015-11-24 18:00 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
                   ` (32 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:00 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Build the addresses and s/g lists on the stack, and then copy them
to a VirtQueueElement that is just as big as required to contain this
particular s/g list.  The cost of the copy is minimal compared to that
of a large malloc.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/dataplane/vring.c | 53 ++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index c950caa..d6b8ba9 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -217,8 +217,14 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
                             new, old);
 }
 
-
-static int get_desc(Vring *vring, VirtQueueElement *elem,
+typedef struct VirtQueueCurrentElement {
+    unsigned in_num;
+    unsigned out_num;
+    hwaddr addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+} VirtQueueCurrentElement;
+
+static int get_desc(Vring *vring, VirtQueueCurrentElement *elem,
                     struct vring_desc *desc)
 {
     unsigned *num;
@@ -229,12 +235,12 @@ static int get_desc(Vring *vring, VirtQueueElement *elem,
 
     if (desc->flags & VRING_DESC_F_WRITE) {
         num = &elem->in_num;
-        iov = &elem->in_sg[*num];
-        addr = &elem->in_addr[*num];
+        iov = &elem->iov[elem->out_num + *num];
+        addr = &elem->addr[elem->out_num + *num];
     } else {
         num = &elem->out_num;
-        iov = &elem->out_sg[*num];
-        addr = &elem->out_addr[*num];
+        iov = &elem->iov[*num];
+        addr = &elem->addr[*num];
 
         /* If it's an output descriptor, they're all supposed
          * to come before any input descriptors. */
@@ -298,7 +304,8 @@ static bool read_vring_desc(VirtIODevice *vdev,
 
 /* This is stolen from linux/drivers/vhost/vhost.c. */
 static int get_indirect(VirtIODevice *vdev, Vring *vring,
-                        VirtQueueElement *elem, struct vring_desc *indirect)
+                        VirtQueueCurrentElement *cur_elem,
+                        struct vring_desc *indirect)
 {
     struct vring_desc desc;
     unsigned int i = 0, count, found = 0;
@@ -350,7 +357,7 @@ static int get_indirect(VirtIODevice *vdev, Vring *vring,
             return -EFAULT;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, cur_elem, &desc);
         if (ret < 0) {
             vring->broken |= (ret == -EFAULT);
             return ret;
@@ -393,6 +400,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
     struct vring_desc desc;
     unsigned int i, head, found = 0, num = vring->vr.num;
     uint16_t avail_idx, last_avail_idx;
+    VirtQueueCurrentElement cur_elem;
     VirtQueueElement *elem = NULL;
     int ret;
 
@@ -402,10 +410,7 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         goto out;
     }
 
-    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
-
-    /* Initialize elem so it can be safely unmapped */
-    elem->in_num = elem->out_num = 0;
+    cur_elem.in_num = cur_elem.out_num = 0;
 
     /* Check it isn't doing very strange things with descriptor numbers. */
     last_avail_idx = vring->last_avail_idx;
@@ -432,8 +437,6 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
      * the index we've seen. */
     head = vring_get_avail_ring(vdev, vring, last_avail_idx % num);
 
-    elem->index = head;
-
     /* If their number is silly, that's an error. */
     if (unlikely(head >= num)) {
         error_report("Guest says index %u > %u is available", head, num);
@@ -460,14 +463,14 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
         barrier();
 
         if (desc.flags & VRING_DESC_F_INDIRECT) {
-            ret = get_indirect(vdev, vring, elem, &desc);
+            ret = get_indirect(vdev, vring, &cur_elem, &desc);
             if (ret < 0) {
                 goto out;
             }
             continue;
         }
 
-        ret = get_desc(vring, elem, &desc);
+        ret = get_desc(vring, &cur_elem, &desc);
         if (ret < 0) {
             goto out;
         }
@@ -482,6 +485,18 @@ void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
             virtio_tswap16(vdev, vring->last_avail_idx);
     }
 
+    /* Now copy what we have collected and mapped */
+    elem = virtqueue_alloc_element(sz, cur_elem.out_num, cur_elem.in_num);
+    elem->index = head;
+    for (i = 0; i < cur_elem.out_num; i++) {
+        elem->out_addr[i] = cur_elem.addr[i];
+        elem->out_sg[i] = cur_elem.iov[i];
+    }
+    for (i = 0; i < cur_elem.in_num; i++) {
+        elem->in_addr[i] = cur_elem.addr[cur_elem.out_num + i];
+        elem->in_sg[i] = cur_elem.iov[cur_elem.out_num + i];
+    }
+
     return elem;
 
 out:
@@ -489,7 +504,11 @@ out:
     if (ret == -EFAULT) {
         vring->broken = true;
     }
-    vring_unmap_element(elem);
+
+    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
+        vring_unmap(cur_elem.iov[i].iov_base, false);
+    }
+
     g_free(elem);
     return NULL;
 }
-- 
1.8.3.1

* [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (7 preceding siblings ...)
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
                   ` (31 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Make the API more similar to the regular virtqueue API.
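
Callers that relied on the old return value now pair the call with
vring_more_avail, as in the handle_notify hunk below:

    vring_enable_notification(s->vdev, &s->vring);
    if (!vring_more_avail(s->vdev, &s->vring)) {
        break;    /* vring emptied, wait for the next guest kick */
    }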

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c     | 3 ++-
 hw/virtio/dataplane/vring.c         | 3 +--
 include/hw/virtio/dataplane/vring.h | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 0cadd3a..ace36f5 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -126,7 +126,8 @@ static void handle_notify(EventNotifier *e)
             /* Re-enable guest->host notifies and stop processing the vring.
              * But if the guest has snuck in more descriptors, keep processing.
              */
-            if (vring_enable_notification(s->vdev, &s->vring)) {
+            vring_enable_notification(s->vdev, &s->vring);
+            if (!vring_more_avail(s->vdev, &s->vring)) {
                 break;
             }
         } else { /* fatal error */
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
index d6b8ba9..4e1a299 100644
--- a/hw/virtio/dataplane/vring.c
+++ b/hw/virtio/dataplane/vring.c
@@ -174,7 +174,7 @@ void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
  *
  * Return true if the vring is empty, false if there are more requests.
  */
-bool vring_enable_notification(VirtIODevice *vdev, Vring *vring)
+void vring_enable_notification(VirtIODevice *vdev, Vring *vring)
 {
     if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
         vring_avail_event(&vring->vr) = vring->vr.avail->idx;
@@ -182,7 +182,6 @@ bool vring_enable_notification(VirtIODevice *vdev, Vring *vring)
         vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
     }
     smp_mb(); /* ensure update is seen before reading avail_idx */
-    return !vring_more_avail(vdev, vring);
 }
 
 /* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index e80985e..e1c2a65 100644
--- a/include/hw/virtio/dataplane/vring.h
+++ b/include/hw/virtio/dataplane/vring.h
@@ -42,7 +42,7 @@ static inline void vring_set_broken(Vring *vring)
 bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
 void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
 void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
-bool vring_enable_notification(VirtIODevice *vdev, Vring *vring);
+void vring_enable_notification(VirtIODevice *vdev, Vring *vring);
 bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
 void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
 void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (8 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
                   ` (30 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Compared to vring, the regular virtio code has a performance penalty
of 10%.  Fix it
by combining all the reads for a descriptor in a single address_space_read
call.  This also simplifies the code nicely.
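
The resulting accessor (shown in full below) boils down to one bulk
read plus per-field byte swaps:

    VRingDesc desc;
    vring_desc_read(vdev, &desc, desc_pa, i);  /* one address_space_read */
    /* desc.addr, desc.len, desc.flags, desc.next are now host-endian */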

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c | 86 ++++++++++++++++++++++--------------------------------
 1 file changed, 35 insertions(+), 51 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 0163d0f..0e9fff6 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -107,35 +107,15 @@ void virtio_queue_update_rings(VirtIODevice *vdev, int n)
                               vring->align);
 }
 
-static inline uint64_t vring_desc_addr(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
+static void vring_desc_read(VirtIODevice *vdev, VRingDesc *desc,
+                            hwaddr desc_pa, int i)
 {
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, addr);
-    return virtio_ldq_phys(vdev, pa);
-}
-
-static inline uint32_t vring_desc_len(VirtIODevice *vdev, hwaddr desc_pa, int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, len);
-    return virtio_ldl_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_flags(VirtIODevice *vdev, hwaddr desc_pa,
-                                        int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, flags);
-    return virtio_lduw_phys(vdev, pa);
-}
-
-static inline uint16_t vring_desc_next(VirtIODevice *vdev, hwaddr desc_pa,
-                                       int i)
-{
-    hwaddr pa;
-    pa = desc_pa + sizeof(VRingDesc) * i + offsetof(VRingDesc, next);
-    return virtio_lduw_phys(vdev, pa);
+    address_space_read(&address_space_memory, desc_pa + i * sizeof(VRingDesc),
+                       MEMTXATTRS_UNSPECIFIED, (void *)desc, sizeof(VRingDesc));
+    virtio_tswap64s(vdev, &desc->addr);
+    virtio_tswap32s(vdev, &desc->len);
+    virtio_tswap16s(vdev, &desc->flags);
+    virtio_tswap16s(vdev, &desc->next);
 }
 
 static inline uint16_t vring_avail_flags(VirtQueue *vq)
@@ -345,18 +325,18 @@ static unsigned int virtqueue_get_head(VirtQueue *vq, unsigned int idx)
     return head;
 }
 
-static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
-                                    unsigned int i, unsigned int max)
+static unsigned virtqueue_read_next_desc(VirtIODevice *vdev, VRingDesc *desc,
+                                         hwaddr desc_pa, unsigned int max)
 {
     unsigned int next;
 
     /* If this descriptor says it doesn't chain, we're done. */
-    if (!(vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_NEXT)) {
+    if (!(desc->flags & VRING_DESC_F_NEXT)) {
         return max;
     }
 
     /* Check they're not leading us off end of descriptors. */
-    next = vring_desc_next(vdev, desc_pa, i);
+    next = desc->next;
     /* Make sure compiler knows to grab that: we don't want it changing! */
     smp_wmb();
 
@@ -365,6 +345,7 @@ static unsigned virtqueue_next_desc(VirtIODevice *vdev, hwaddr desc_pa,
         exit(1);
     }
 
+    vring_desc_read(vdev, desc, desc_pa, next);
     return next;
 }
 
@@ -381,6 +362,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
     while (virtqueue_num_heads(vq, idx)) {
         VirtIODevice *vdev = vq->vdev;
         unsigned int max, num_bufs, indirect = 0;
+        VRingDesc desc;
         hwaddr desc_pa;
         int i;
 
@@ -388,9 +370,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
         num_bufs = total_bufs;
         i = virtqueue_get_head(vq, idx++);
         desc_pa = vq->vring.desc;
+        vring_desc_read(vdev, &desc, desc_pa, i);
 
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-            if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+        if (desc.flags & VRING_DESC_F_INDIRECT) {
+            if (desc.len % sizeof(VRingDesc)) {
                 error_report("Invalid size for indirect buffer table");
                 exit(1);
             }
@@ -403,9 +386,10 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
 
             /* loop over the indirect descriptor table */
             indirect = 1;
-            max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-            desc_pa = vring_desc_addr(vdev, desc_pa, i);
+            max = desc.len / sizeof(VRingDesc);
+            desc_pa = desc.addr;
             num_bufs = i = 0;
+            vring_desc_read(vdev, &desc, desc_pa, i);
         }
 
         do {
@@ -415,15 +399,15 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
                 exit(1);
             }
 
-            if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
-                in_total += vring_desc_len(vdev, desc_pa, i);
+            if (desc.flags & VRING_DESC_F_WRITE) {
+                in_total += desc.len;
             } else {
-                out_total += vring_desc_len(vdev, desc_pa, i);
+                out_total += desc.len;
             }
             if (in_total >= max_in_bytes && out_total >= max_out_bytes) {
                 goto done;
             }
-        } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+        } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
         if (!indirect)
             total_bufs = num_bufs;
@@ -544,6 +528,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
     unsigned out_num, in_num;
     hwaddr addr[VIRTQUEUE_MAX_SIZE];
     struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    VRingDesc desc;
 
     if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
         return NULL;
@@ -559,33 +544,32 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
         vring_set_avail_event(vq, vq->last_avail_idx);
     }
 
-    if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_INDIRECT) {
-        if (vring_desc_len(vdev, desc_pa, i) % sizeof(VRingDesc)) {
+    vring_desc_read(vdev, &desc, desc_pa, i);
+    if (desc.flags & VRING_DESC_F_INDIRECT) {
+        if (desc.len % sizeof(VRingDesc)) {
             error_report("Invalid size for indirect buffer table");
             exit(1);
         }
 
         /* loop over the indirect descriptor table */
-        max = vring_desc_len(vdev, desc_pa, i) / sizeof(VRingDesc);
-        desc_pa = vring_desc_addr(vdev, desc_pa, i);
+        max = desc.len / sizeof(VRingDesc);
+        desc_pa = desc.addr;
         i = 0;
+        vring_desc_read(vdev, &desc, desc_pa, i);
     }
 
     /* Collect all the descriptors */
     do {
-        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
-        size_t len = vring_desc_len(vdev, desc_pa, i);
-
-        if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
+        if (desc.flags & VRING_DESC_F_WRITE) {
             virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
-                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
+                               VIRTQUEUE_MAX_SIZE - out_num, 1, desc.addr, desc.len);
         } else {
             if (in_num) {
                 error_report("Incorrect order for descriptors");
                 exit(1);
             }
             virtqueue_map_desc(&out_num, addr, iov,
-                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
+                               VIRTQUEUE_MAX_SIZE, 0, desc.addr, desc.len);
         }
 
         /* If we've got too many, that implies a descriptor loop. */
@@ -593,7 +577,7 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
             error_report("Looped descriptor");
             exit(1);
         }
-    } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
+    } while ((i = virtqueue_read_next_desc(vdev, &desc, desc_pa, max)) != max);
 
     /* Now copy what we have collected and mapped */
     elem = virtqueue_alloc_element(sz, out_num, in_num);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (9 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
                   ` (29 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c         | 16 ++++++++++++++++
 include/hw/virtio/virtio.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 0e9fff6..c835451 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1835,6 +1835,22 @@ static void virtio_queue_host_notifier_read(EventNotifier *n)
     }
 }
 
+void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
+                                                bool assign, bool set_handler)
+{
+    if (assign && set_handler) {
+        aio_set_event_notifier(ctx, &vq->host_notifier, true,
+                               virtio_queue_host_notifier_read);
+    } else {
+        aio_set_event_notifier(ctx, &vq->host_notifier, true, NULL);
+    }
+    if (!assign) {
+        /* Test and clear the notifier after disabling the event,
+         * in case the poll callback didn't have time to run. */
+        virtio_queue_host_notifier_read(&vq->host_notifier);
+    }
+}
+
 void virtio_queue_set_host_notifier_fd_handler(VirtQueue *vq, bool assign,
                                                bool set_handler)
 {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 108cdb0..4ce01a1 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -248,6 +248,8 @@ void virtio_queue_set_guest_notifier_fd_handler(VirtQueue *vq, bool assign,
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_fd_handler(VirtQueue *vq, bool assign,
                                                bool set_handler);
+void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
+                                                bool assign, bool set_handler);
 void virtio_queue_notify_vq(VirtQueue *vq);
 void virtio_irq(VirtQueue *vq);
 VirtQueue *virtio_vector_first_queue(VirtIODevice *vdev, uint16_t vector);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (10 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
                   ` (28 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Virtio dataplane needs to trigger the irq manually through the
guest notifier.  Export virtio_should_notify so that it can be
used around event_notifier_set.
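
The intended call pattern, which a later patch in this series uses
verbatim for virtio-blk dataplane:

    if (!virtio_should_notify(s->vdev, s->vq)) {
        return;
    }
    event_notifier_set(s->guest_notifier);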

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/virtio.c         | 4 ++--
 include/hw/virtio/virtio.h | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index c835451..d49306b 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1166,7 +1166,7 @@ void virtio_irq(VirtQueue *vq)
     virtio_notify_vector(vq->vdev, vq->vector);
 }
 
-static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
+bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
     uint16_t old, new;
     bool v;
@@ -1191,7 +1191,7 @@ static bool vring_notify(VirtIODevice *vdev, VirtQueue *vq)
 
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq)
 {
-    if (!vring_notify(vdev, vq)) {
+    if (!virtio_should_notify(vdev, vq)) {
         return;
     }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 4ce01a1..5884228 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -162,6 +162,7 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes,
                                unsigned int *out_bytes,
                                unsigned max_in_bytes, unsigned max_out_bytes);
 
+bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq);
 void virtio_notify(VirtIODevice *vdev, VirtQueue *vq);
 
 void virtio_save(VirtIODevice *vdev, QEMUFile *f);
-- 
1.8.3.1

* [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (11 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
                   ` (27 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

In disabled mode, virtio-blk dataplane seems to be enabled, but flow
actually goes through the normal virtio path.  This patch simplifies
the handling of disabled mode a bit.  In disabled mode,
virtio_blk_handle_output
might be called even if s->dataplane is not NULL.

This is a bit tricky, because the current check for s->dataplane will
always trigger, causing a continuous stream of calls to
virtio_blk_data_plane_start.  Unfortunately, these calls will not
do anything.  To fix this, set the "started" flag even in disabled
mode, and skip virtio_blk_data_plane_start if the started flag is true.
The resulting changes also prepare the code for the next patch, where
virtio-blk dataplane will reuse the same virtio_blk_handle_output function
as "regular" virtio-blk.

Because struct VirtIOBlockDataPlane is opaque in virtio-blk.c, we have
to move s->dataplane->started inside struct VirtIOBlock.
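
Condensed from the hunks below, the start/stop entry points now reduce
to checks on the shared flag:

    /* start: also reached, and skipped, in disabled mode */
    if (vblk->dataplane_started || s->starting) {
        return;
    }

    /* stop: disabled mode never really started, so only clear flags */
    if (!vblk->dataplane_started || s->stopping) {
        return;
    }
    if (s->disabled) {
        s->disabled = false;
        vblk->dataplane_started = false;
        return;
    }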

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 21 +++++++++------------
 hw/block/virtio-blk.c           |  2 +-
 include/hw/virtio/virtio-blk.h  |  1 +
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index ace36f5..8d417e2 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -27,7 +27,6 @@
 #include "qom/object_interfaces.h"
 
 struct VirtIOBlockDataPlane {
-    bool started;
     bool starting;
     bool stopping;
     bool disabled;
@@ -237,11 +236,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     VirtQueue *vq;
     int r;
 
-    if (s->started || s->disabled) {
-        return;
-    }
-
-    if (s->starting) {
+    if (vblk->dataplane_started || s->starting) {
         return;
     }
 
@@ -273,7 +268,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     vblk->complete_request = complete_request_vring;
 
     s->starting = false;
-    s->started = true;
+    vblk->dataplane_started = true;
     trace_virtio_blk_data_plane_start(s);
 
     blk_set_aio_context(s->conf->conf.blk, s->ctx);
@@ -292,9 +287,10 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     k->set_guest_notifiers(qbus->parent, 1, false);
   fail_guest_notifiers:
     vring_teardown(&s->vring, s->vdev, 0);
-    s->disabled = true;
   fail_vring:
+    s->disabled = true;
     s->starting = false;
+    vblk->dataplane_started = true;
 }
 
 /* Context: QEMU global mutex held */
@@ -304,13 +300,14 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
 
+    if (!vblk->dataplane_started || s->stopping) {
+        return;
+    }
 
     /* Better luck next time. */
     if (s->disabled) {
         s->disabled = false;
-        return;
-    }
-    if (!s->started || s->stopping) {
+        vblk->dataplane_started = false;
         return;
     }
     s->stopping = true;
@@ -337,6 +334,6 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
     /* Clean up guest notifier (irq) */
     k->set_guest_notifiers(qbus->parent, 1, false);
 
-    s->started = false;
+    vblk->dataplane_started = false;
     s->stopping = false;
 }
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 773b61c..062d57e 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -596,7 +596,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
      * dataplane here instead of waiting for .set_status().
      */
-    if (s->dataplane) {
+    if (s->dataplane && !s->dataplane_started) {
         virtio_blk_data_plane_start(s->dataplane);
         return;
     }
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 9a2048d..e720934 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -57,6 +57,7 @@ typedef struct VirtIOBlock {
     /* Function to push to vq and notify guest */
     void (*complete_request)(struct VirtIOBlockReq *req, unsigned char status);
     Notifier migration_state_notifier;
+    bool dataplane_started;
     struct VirtIOBlockDataPlane *dataplane;
 } VirtIOBlock;
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (12 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
                   ` (26 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/block/dataplane/virtio-blk.c | 112 +++++-----------------------------------
 hw/block/dataplane/virtio-blk.h |   1 +
 hw/block/virtio-blk.c           |  48 +++--------------
 include/hw/virtio/virtio-blk.h  |   3 --
 4 files changed, 19 insertions(+), 145 deletions(-)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 8d417e2..f2482eb 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -17,8 +17,6 @@
 #include "qemu/thread.h"
 #include "qemu/error-report.h"
 #include "hw/virtio/virtio-access.h"
-#include "hw/virtio/dataplane/vring.h"
-#include "hw/virtio/dataplane/vring-accessors.h"
 #include "sysemu/block-backend.h"
 #include "hw/virtio/virtio-blk.h"
 #include "virtio-blk.h"
@@ -34,7 +32,7 @@ struct VirtIOBlockDataPlane {
     VirtIOBlkConf *conf;
 
     VirtIODevice *vdev;
-    Vring vring;                    /* virtqueue vring */
+    VirtQueue *vq;                  /* the virtqueue */
     EventNotifier *guest_notifier;  /* irq */
     QEMUBH *bh;                     /* bh for guest notification */
 
@@ -46,94 +44,26 @@ struct VirtIOBlockDataPlane {
     IOThread *iothread;
     IOThread internal_iothread_obj;
     AioContext *ctx;
-    EventNotifier host_notifier;    /* doorbell */
 
     /* Operation blocker on BDS */
     Error *blocker;
-    void (*saved_complete_request)(struct VirtIOBlockReq *req,
-                                   unsigned char status);
 };
 
 /* Raise an interrupt to signal guest, if necessary */
-static void notify_guest(VirtIOBlockDataPlane *s)
+void virtio_blk_data_plane_notify(VirtIOBlockDataPlane *s)
 {
-    if (!vring_should_notify(s->vdev, &s->vring)) {
-        return;
-    }
-
-    event_notifier_set(s->guest_notifier);
+    qemu_bh_schedule(s->bh);
 }
 
 static void notify_guest_bh(void *opaque)
 {
     VirtIOBlockDataPlane *s = opaque;
 
-    notify_guest(s);
-}
-
-static void complete_request_vring(VirtIOBlockReq *req, unsigned char status)
-{
-    VirtIOBlockDataPlane *s = req->dev->dataplane;
-    stb_p(&req->in->status, status);
-
-    vring_push(s->vdev, &req->dev->dataplane->vring, &req->elem, req->in_len);
-
-    /* Suppress notification to guest by BH and its scheduled
-     * flag because requests are completed as a batch after io
-     * plug & unplug is introduced, and the BH can still be
-     * executed in dataplane aio context even after it is
-     * stopped, so needn't worry about notification loss with BH.
-     */
-    qemu_bh_schedule(s->bh);
-}
-
-static void handle_notify(EventNotifier *e)
-{
-    VirtIOBlockDataPlane *s = container_of(e, VirtIOBlockDataPlane,
-                                           host_notifier);
-    VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-
-    event_notifier_test_and_clear(&s->host_notifier);
-    blk_io_plug(s->conf->conf.blk);
-    for (;;) {
-        MultiReqBuffer mrb = {};
-
-        /* Disable guest->host notifies to avoid unnecessary vmexits */
-        vring_disable_notification(s->vdev, &s->vring);
-
-        for (;;) {
-            VirtIOBlockReq *req = vring_pop(s->vdev, &s->vring,
-                                            sizeof(VirtIOBlockReq));
-
-            if (req == NULL) {
-                break; /* no more requests */
-            }
-
-            virtio_blk_init_request(vblk, req);
-            trace_virtio_blk_data_plane_process_request(s, req->elem.out_num,
-                                                        req->elem.in_num,
-                                                        req->elem.index);
-
-            virtio_blk_handle_request(req, &mrb);
-        }
-
-        if (mrb.num_reqs) {
-            virtio_blk_submit_multireq(s->conf->conf.blk, &mrb);
-        }
-
-        if (likely(!vring_more_avail(s->vdev, &s->vring))) { /* vring emptied */
-            /* Re-enable guest->host notifies and stop processing the vring.
-             * But if the guest has snuck in more descriptors, keep processing.
-             */
-            vring_enable_notification(s->vdev, &s->vring);
-            if (!vring_more_avail(s->vdev, &s->vring)) {
-                break;
-            }
-        } else { /* fatal error */
-            break;
-        }
+    if (!virtio_should_notify(s->vdev, s->vq)) {
+        return;
     }
-    blk_io_unplug(s->conf->conf.blk);
+
+    event_notifier_set(s->guest_notifier);
 }
 
 /* Context: QEMU global mutex held */
@@ -233,7 +163,6 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s->vdev)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
     VirtIOBlock *vblk = VIRTIO_BLK(s->vdev);
-    VirtQueue *vq;
     int r;
 
     if (vblk->dataplane_started || s->starting) {
@@ -241,11 +170,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     }
 
     s->starting = true;
-
-    vq = virtio_get_queue(s->vdev, 0);
-    if (!vring_setup(&s->vring, s->vdev, 0)) {
-        goto fail_vring;
-    }
+    s->vq = virtio_get_queue(s->vdev, 0);
 
     /* Set up guest notifier (irq) */
     r = k->set_guest_notifiers(qbus->parent, 1, true);
@@ -254,7 +179,7 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
                 "ensure -enable-kvm is set\n", r);
         goto fail_guest_notifiers;
     }
-    s->guest_notifier = virtio_queue_get_guest_notifier(vq);
+    s->guest_notifier = virtio_queue_get_guest_notifier(s->vq);
 
     /* Set up virtqueue notify */
     r = k->set_host_notifier(qbus->parent, 0, true);
@@ -262,10 +187,6 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
         fprintf(stderr, "virtio-blk failed to set host notifier (%d)\n", r);
         goto fail_host_notifier;
     }
-    s->host_notifier = *virtio_queue_get_host_notifier(vq);
-
-    s->saved_complete_request = vblk->complete_request;
-    vblk->complete_request = complete_request_vring;
 
     s->starting = false;
     vblk->dataplane_started = true;
@@ -274,20 +195,17 @@ void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s)
     blk_set_aio_context(s->conf->conf.blk, s->ctx);
 
     /* Kick right away to begin processing requests already in vring */
-    event_notifier_set(virtio_queue_get_host_notifier(vq));
+    event_notifier_set(virtio_queue_get_host_notifier(s->vq));
 
     /* Get this show started by hooking up our callbacks */
     aio_context_acquire(s->ctx);
-    aio_set_event_notifier(s->ctx, &s->host_notifier, true,
-                           handle_notify);
+    virtio_queue_aio_set_host_notifier_handler(s->vq, s->ctx, true, true);
     aio_context_release(s->ctx);
     return;
 
   fail_host_notifier:
     k->set_guest_notifiers(qbus->parent, 1, false);
   fail_guest_notifiers:
-    vring_teardown(&s->vring, s->vdev, 0);
-  fail_vring:
     s->disabled = true;
     s->starting = false;
     vblk->dataplane_started = true;
@@ -311,24 +229,18 @@ void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s)
         return;
     }
     s->stopping = true;
-    vblk->complete_request = s->saved_complete_request;
     trace_virtio_blk_data_plane_stop(s);
 
     aio_context_acquire(s->ctx);
 
     /* Stop notifications for new requests from guest */
-    aio_set_event_notifier(s->ctx, &s->host_notifier, true, NULL);
+    virtio_queue_aio_set_host_notifier_handler(s->vq, s->ctx, false, false);
 
     /* Drain and switch bs back to the QEMU main loop */
     blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
 
     aio_context_release(s->ctx);
 
-    /* Sync vring state back to virtqueue so that non-dataplane request
-     * processing can continue when we disable the host notifier below.
-     */
-    vring_teardown(&s->vring, s->vdev, 0);
-
     k->set_host_notifier(qbus->parent, 0, false);
 
     /* Clean up guest notifier (irq) */
diff --git a/hw/block/dataplane/virtio-blk.h b/hw/block/dataplane/virtio-blk.h
index c88d40e..0714c11 100644
--- a/hw/block/dataplane/virtio-blk.h
+++ b/hw/block/dataplane/virtio-blk.h
@@ -26,5 +26,6 @@ void virtio_blk_data_plane_destroy(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_start(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_stop(VirtIOBlockDataPlane *s);
 void virtio_blk_data_plane_drain(VirtIOBlockDataPlane *s);
+void virtio_blk_data_plane_notify(VirtIOBlockDataPlane *s);
 
 #endif /* HW_DATAPLANE_VIRTIO_BLK_H */
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 062d57e..e83d823 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -44,8 +44,7 @@ void virtio_blk_free_request(VirtIOBlockReq *req)
     }
 }
 
-static void virtio_blk_complete_request(VirtIOBlockReq *req,
-                                        unsigned char status)
+static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
 {
     VirtIOBlock *s = req->dev;
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
@@ -54,12 +53,11 @@ static void virtio_blk_complete_request(VirtIOBlockReq *req,
 
     stb_p(&req->in->status, status);
     virtqueue_push(s->vq, &req->elem, req->in_len);
-    virtio_notify(vdev, s->vq);
-}
-
-static void virtio_blk_req_complete(VirtIOBlockReq *req, unsigned char status)
-{
-    req->dev->complete_request(req, status);
+    if (s->dataplane) {
+        virtio_blk_data_plane_notify(s->dataplane);
+    } else {
+        virtio_notify(vdev, s->vq);
+    }
 }
 
 static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
@@ -859,36 +857,6 @@ static const BlockDevOps virtio_block_ops = {
     .resize_cb = virtio_blk_resize,
 };
 
-/* Disable dataplane thread during live migration since it does not
- * update the dirty memory bitmap yet.
- */
-static void virtio_blk_migration_state_changed(Notifier *notifier, void *data)
-{
-    VirtIOBlock *s = container_of(notifier, VirtIOBlock,
-                                  migration_state_notifier);
-    MigrationState *mig = data;
-    Error *err = NULL;
-
-    if (migration_in_setup(mig)) {
-        if (!s->dataplane) {
-            return;
-        }
-        virtio_blk_data_plane_destroy(s->dataplane);
-        s->dataplane = NULL;
-    } else if (migration_has_finished(mig) ||
-               migration_has_failed(mig)) {
-        if (s->dataplane) {
-            return;
-        }
-        blk_drain_all(); /* complete in-flight non-dataplane requests */
-        virtio_blk_data_plane_create(VIRTIO_DEVICE(s), &s->conf,
-                                     &s->dataplane, &err);
-        if (err != NULL) {
-            error_report_err(err);
-        }
-    }
-}
-
 static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -923,15 +891,12 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     s->sector_mask = (s->conf.conf.logical_block_size / BDRV_SECTOR_SIZE) - 1;
 
     s->vq = virtio_add_queue(vdev, 128, virtio_blk_handle_output);
-    s->complete_request = virtio_blk_complete_request;
     virtio_blk_data_plane_create(vdev, conf, &s->dataplane, &err);
     if (err != NULL) {
         error_propagate(errp, err);
         virtio_cleanup(vdev);
         return;
     }
-    s->migration_state_notifier.notify = virtio_blk_migration_state_changed;
-    add_migration_state_change_notifier(&s->migration_state_notifier);
 
     s->change = qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
     register_savevm(dev, "virtio-blk", virtio_blk_id++, 2,
@@ -947,7 +912,6 @@ static void virtio_blk_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBlock *s = VIRTIO_BLK(dev);
 
-    remove_migration_state_change_notifier(&s->migration_state_notifier);
     virtio_blk_data_plane_destroy(s->dataplane);
     s->dataplane = NULL;
     qemu_del_vm_change_state_handler(s->change);
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index e720934..4c72021 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -54,9 +54,6 @@ typedef struct VirtIOBlock {
     unsigned short sector_mask;
     bool original_wce;
     VMChangeStateEntry *change;
-    /* Function to push to vq and notify guest */
-    void (*complete_request)(struct VirtIOBlockReq *req, unsigned char status);
-    Notifier migration_state_notifier;
     bool dataplane_started;
     struct VirtIOBlockDataPlane *dataplane;
 } VirtIOBlock;
-- 
1.8.3.1

* [Qemu-devel] [PATCH 15/40] virtio-scsi: do not use vring in dataplane
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (13 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
                   ` (25 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/scsi/virtio-scsi-dataplane.c | 196 +++++-----------------------------------
 hw/scsi/virtio-scsi.c           |  51 ++---------
 include/hw/virtio/virtio-scsi.h |  21 +----
 3 files changed, 35 insertions(+), 233 deletions(-)
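
The per-queue setup collapses into one helper that is shared by the
ctrl, event and cmd queues; roughly (simplified from the first hunk
below, with the fprintf error path shortened):

static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
{
    BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s)));
    VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
    int rc;

    /* ioeventfd for guest->host kicks */
    rc = k->set_host_notifier(qbus->parent, n, true);
    if (rc != 0) {
        s->dataplane_fenced = true;
        return rc;
    }

    /* have the kicks processed in the dataplane AioContext */
    virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, true, true);
    return 0;
}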

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 45ae2d2..b1745b2 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -38,14 +38,10 @@ void virtio_scsi_set_iothread(VirtIOSCSI *s, IOThread *iothread)
     }
 }
 
-static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
-                                               VirtQueue *vq,
-                                               EventNotifierHandler *handler,
-                                               int n)
+static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
 {
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(s)));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
-    VirtIOSCSIVring *r;
     int rc;
 
     /* Set up virtqueue notify */
@@ -54,105 +50,17 @@ static VirtIOSCSIVring *virtio_scsi_vring_init(VirtIOSCSI *s,
         fprintf(stderr, "virtio-scsi: Failed to set host notifier (%d)\n",
                 rc);
         s->dataplane_fenced = true;
-        return NULL;
+        return rc;
     }
 
-    r = g_new(VirtIOSCSIVring, 1);
-    r->host_notifier = *virtio_queue_get_host_notifier(vq);
-    r->guest_notifier = *virtio_queue_get_guest_notifier(vq);
-    aio_set_event_notifier(s->ctx, &r->host_notifier, true, handler);
-
-    r->parent = s;
-
-    if (!vring_setup(&r->vring, VIRTIO_DEVICE(s), n)) {
-        fprintf(stderr, "virtio-scsi: VRing setup failed\n");
-        goto fail_vring;
-    }
-    return r;
-
-fail_vring:
-    aio_set_event_notifier(s->ctx, &r->host_notifier, true, NULL);
-    k->set_host_notifier(qbus->parent, n, false);
-    g_free(r);
-    return NULL;
-}
-
-VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
-                                         VirtIOSCSIVring *vring)
-{
-    VirtIOSCSICommon *vs = (VirtIOSCSICommon *)s;
-    VirtIOSCSIReq *req;
-
-    req = vring_pop((VirtIODevice *)s, &vring->vring,
-                    sizeof(VirtIOSCSIReq) + vs->cdb_size);
-    if (!req) {
-        return NULL;
-    }
-    virtio_scsi_init_req(s, NULL, req);
-    req->vring = vring;
-    return req;
-}
-
-void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req)
-{
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->vring->parent);
-
-    vring_push(vdev, &req->vring->vring, &req->elem,
-               req->qsgl.size + req->resp_iov.size);
-
-    if (vring_should_notify(vdev, &req->vring->vring)) {
-        event_notifier_set(&req->vring->guest_notifier);
-    }
+    virtio_queue_aio_set_host_notifier_handler(vq, s->ctx, true, true);
+    return 0;
 }
 
-static void virtio_scsi_iothread_handle_ctrl(EventNotifier *notifier)
+void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req)
 {
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = VIRTIO_SCSI(vring->parent);
-    VirtIOSCSIReq *req;
-
-    event_notifier_test_and_clear(notifier);
-    while ((req = virtio_scsi_pop_req_vring(s, vring))) {
-        virtio_scsi_handle_ctrl_req(s, req);
-    }
-}
-
-static void virtio_scsi_iothread_handle_event(EventNotifier *notifier)
-{
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = vring->parent;
-    VirtIODevice *vdev = VIRTIO_DEVICE(s);
-
-    event_notifier_test_and_clear(notifier);
-
-    if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
-        return;
-    }
-
-    if (s->events_dropped) {
-        virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
-    }
-}
-
-static void virtio_scsi_iothread_handle_cmd(EventNotifier *notifier)
-{
-    VirtIOSCSIVring *vring = container_of(notifier,
-                                          VirtIOSCSIVring, host_notifier);
-    VirtIOSCSI *s = (VirtIOSCSI *)vring->parent;
-    VirtIOSCSIReq *req, *next;
-    QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
-
-    event_notifier_test_and_clear(notifier);
-    while ((req = virtio_scsi_pop_req_vring(s, vring))) {
-        if (virtio_scsi_handle_cmd_req_prepare(s, req)) {
-            QTAILQ_INSERT_TAIL(&reqs, req, next);
-        }
-    }
-
-    QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
-        virtio_scsi_handle_cmd_req_submit(s, req);
+    if (virtio_should_notify(vdev, req->vq)) {
+        event_notifier_set(virtio_queue_get_guest_notifier(req->vq));
     }
 }
 
@@ -162,46 +70,10 @@ static void virtio_scsi_clear_aio(VirtIOSCSI *s)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     int i;
 
-    if (s->ctrl_vring) {
-        aio_set_event_notifier(s->ctx, &s->ctrl_vring->host_notifier,
-                               true, NULL);
-    }
-    if (s->event_vring) {
-        aio_set_event_notifier(s->ctx, &s->event_vring->host_notifier,
-                               true, NULL);
-    }
-    if (s->cmd_vrings) {
-        for (i = 0; i < vs->conf.num_queues && s->cmd_vrings[i]; i++) {
-            aio_set_event_notifier(s->ctx, &s->cmd_vrings[i]->host_notifier,
-                                   true, NULL);
-        }
-    }
-}
-
-static void virtio_scsi_vring_teardown(VirtIOSCSI *s)
-{
-    VirtIODevice *vdev = VIRTIO_DEVICE(s);
-    VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
-    int i;
-
-    if (s->ctrl_vring) {
-        vring_teardown(&s->ctrl_vring->vring, vdev, 0);
-        g_free(s->ctrl_vring);
-        s->ctrl_vring = NULL;
-    }
-    if (s->event_vring) {
-        vring_teardown(&s->event_vring->vring, vdev, 1);
-        g_free(s->event_vring);
-        s->event_vring = NULL;
-    }
-    if (s->cmd_vrings) {
-        for (i = 0; i < vs->conf.num_queues && s->cmd_vrings[i]; i++) {
-            vring_teardown(&s->cmd_vrings[i]->vring, vdev, 2 + i);
-            g_free(s->cmd_vrings[i]);
-            s->cmd_vrings[i] = NULL;
-        }
-        free(s->cmd_vrings);
-        s->cmd_vrings = NULL;
+    virtio_queue_aio_set_host_notifier_handler(vs->ctrl_vq, s->ctx, false, false);
+    virtio_queue_aio_set_host_notifier_handler(vs->event_vq, s->ctx, false, false);
+    for (i = 0; i < vs->conf.num_queues; i++) {
+        virtio_queue_aio_set_host_notifier_handler(vs->cmd_vqs[i], s->ctx, false, false);
     }
 }
 
@@ -228,30 +100,21 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
     if (rc != 0) {
         fprintf(stderr, "virtio-scsi: Failed to set guest notifiers (%d), "
                 "ensure -enable-kvm is set\n", rc);
-        s->dataplane_fenced = true;
         goto fail_guest_notifiers;
     }
 
     aio_context_acquire(s->ctx);
-    s->ctrl_vring = virtio_scsi_vring_init(s, vs->ctrl_vq,
-                                           virtio_scsi_iothread_handle_ctrl,
-                                           0);
-    if (!s->ctrl_vring) {
+    rc = virtio_scsi_vring_init(s, vs->ctrl_vq, 0);
+    if (rc) {
         goto fail_vrings;
     }
-    s->event_vring = virtio_scsi_vring_init(s, vs->event_vq,
-                                            virtio_scsi_iothread_handle_event,
-                                            1);
-    if (!s->event_vring) {
+    rc = virtio_scsi_vring_init(s, vs->event_vq, 1);
+    if (rc) {
         goto fail_vrings;
     }
-    s->cmd_vrings = g_new(VirtIOSCSIVring *, vs->conf.num_queues);
     for (i = 0; i < vs->conf.num_queues; i++) {
-        s->cmd_vrings[i] =
-            virtio_scsi_vring_init(s, vs->cmd_vqs[i],
-                                   virtio_scsi_iothread_handle_cmd,
-                                   i + 2);
-        if (!s->cmd_vrings[i]) {
+        rc = virtio_scsi_vring_init(s, vs->cmd_vqs[i], i + 2);
+        if (rc) {
             goto fail_vrings;
         }
     }
@@ -264,13 +127,14 @@ void virtio_scsi_dataplane_start(VirtIOSCSI *s)
 fail_vrings:
     virtio_scsi_clear_aio(s);
     aio_context_release(s->ctx);
-    virtio_scsi_vring_teardown(s);
     for (i = 0; i < vs->conf.num_queues + 2; i++) {
         k->set_host_notifier(qbus->parent, i, false);
     }
     k->set_guest_notifiers(qbus->parent, vs->conf.num_queues + 2, false);
 fail_guest_notifiers:
+    s->dataplane_fenced = true;
     s->dataplane_starting = false;
+    s->dataplane_started = true;
 }
 
 /* Context: QEMU global mutex held */
@@ -281,12 +145,14 @@ void virtio_scsi_dataplane_stop(VirtIOSCSI *s)
     VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(s);
     int i;
 
+    if (!s->dataplane_started || s->dataplane_stopping) {
+        return;
+    }
+
     /* Better luck next time. */
     if (s->dataplane_fenced) {
         s->dataplane_fenced = false;
-        return;
-    }
-    if (!s->dataplane_started || s->dataplane_stopping) {
+        s->dataplane_started = false;
         return;
     }
     s->dataplane_stopping = true;
@@ -294,24 +160,12 @@ void virtio_scsi_dataplane_stop(VirtIOSCSI *s)
 
     aio_context_acquire(s->ctx);
 
-    aio_set_event_notifier(s->ctx, &s->ctrl_vring->host_notifier,
-                           true, NULL);
-    aio_set_event_notifier(s->ctx, &s->event_vring->host_notifier,
-                           true, NULL);
-    for (i = 0; i < vs->conf.num_queues; i++) {
-        aio_set_event_notifier(s->ctx, &s->cmd_vrings[i]->host_notifier,
-                               true, NULL);
-    }
+    virtio_scsi_clear_aio(s);
 
     blk_drain_all(); /* ensure there are no in-flight requests */
 
     aio_context_release(s->ctx);
 
-    /* Sync vring state back to virtqueue so that non-dataplane request
-     * processing can continue when we disable the host notifier below.
-     */
-    virtio_scsi_vring_teardown(s);
-
     for (i = 0; i < vs->conf.num_queues + 2; i++) {
         k->set_host_notifier(qbus->parent, i, false);
     }
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 3ddfcf1..4054ce5 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -42,7 +42,8 @@ static inline SCSIDevice *virtio_scsi_device_find(VirtIOSCSI *s, uint8_t *lun)
 
 void virtio_scsi_init_req(VirtIOSCSI *s, VirtQueue *vq, VirtIOSCSIReq *req)
 {
-    const size_t zero_skip = offsetof(VirtIOSCSIReq, vring);
+    const size_t zero_skip =
+        offsetof(VirtIOSCSIReq, resp_iov) + sizeof(req->resp_iov);
 
     req->vq = vq;
     req->dev = s;
@@ -65,11 +66,10 @@ static void virtio_scsi_complete_req(VirtIOSCSIReq *req)
     VirtIODevice *vdev = VIRTIO_DEVICE(s);
 
     qemu_iovec_from_buf(&req->resp_iov, 0, &req->resp, req->resp_size);
-    if (req->vring) {
-        assert(req->vq == NULL);
-        virtio_scsi_vring_push_notify(req);
+    virtqueue_push(vq, &req->elem, req->qsgl.size + req->resp_iov.size);
+    if (s->dataplane_started) {
+        virtio_scsi_dataplane_notify(vdev, req);
     } else {
-        virtqueue_push(vq, &req->elem, req->qsgl.size + req->resp_iov.size);
         virtio_notify(vdev, vq);
     }
 
@@ -416,7 +416,7 @@ static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOSCSI *s = (VirtIOSCSI *)vdev;
     VirtIOSCSIReq *req;
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -566,7 +566,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
     VirtIOSCSIReq *req, *next;
     QTAILQ_HEAD(, VirtIOSCSIReq) reqs = QTAILQ_HEAD_INITIALIZER(reqs);
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -686,11 +686,7 @@ void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
         aio_context_acquire(s->ctx);
     }
 
-    if (s->dataplane_started) {
-        req = virtio_scsi_pop_req_vring(s, s->event_vring);
-    } else {
-        req = virtio_scsi_pop_req(s, vs->event_vq);
-    }
+    req = virtio_scsi_pop_req(s, vs->event_vq);
     if (!req) {
         s->events_dropped = true;
         goto out;
@@ -732,7 +728,7 @@ static void virtio_scsi_handle_event(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
-    if (s->ctx && !s->dataplane_disabled) {
+    if (s->ctx && !s->dataplane_started) {
         virtio_scsi_dataplane_start(s);
         return;
     }
@@ -848,31 +844,6 @@ void virtio_scsi_common_realize(DeviceState *dev, Error **errp,
     }
 }
 
-/* Disable dataplane thread during live migration since it does not
- * update the dirty memory bitmap yet.
- */
-static void virtio_scsi_migration_state_changed(Notifier *notifier, void *data)
-{
-    VirtIOSCSI *s = container_of(notifier, VirtIOSCSI,
-                                 migration_state_notifier);
-    MigrationState *mig = data;
-
-    if (migration_in_setup(mig)) {
-        if (!s->dataplane_started) {
-            return;
-        }
-        virtio_scsi_dataplane_stop(s);
-        s->dataplane_disabled = true;
-    } else if (migration_has_finished(mig) ||
-               migration_has_failed(mig)) {
-        if (s->dataplane_started) {
-            return;
-        }
-        blk_drain_all(); /* complete in-flight non-dataplane requests */
-        s->dataplane_disabled = false;
-    }
-}
-
 static void virtio_scsi_device_realize(DeviceState *dev, Error **errp)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
@@ -903,8 +874,6 @@ static void virtio_scsi_device_realize(DeviceState *dev, Error **errp)
 
     register_savevm(dev, "virtio-scsi", virtio_scsi_id++, 1,
                     virtio_scsi_save, virtio_scsi_load, s);
-    s->migration_state_notifier.notify = virtio_scsi_migration_state_changed;
-    add_migration_state_change_notifier(&s->migration_state_notifier);
 
     error_setg(&s->blocker, "block device is in use by data plane");
 }
@@ -935,8 +904,6 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, Error **errp)
     error_free(s->blocker);
 
     unregister_savevm(dev, "virtio-scsi", s);
-    remove_migration_state_change_notifier(&s->migration_state_notifier);
-
     virtio_scsi_common_unrealize(dev, errp);
 }
 
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index d9ae76c..b2b98b7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -22,7 +22,6 @@
 #include "hw/pci/pci.h"
 #include "hw/scsi/scsi.h"
 #include "sysemu/iothread.h"
-#include "hw/virtio/dataplane/vring.h"
 
 #define TYPE_VIRTIO_SCSI_COMMON "virtio-scsi-common"
 #define VIRTIO_SCSI_COMMON(obj) \
@@ -58,13 +57,6 @@ struct VirtIOSCSIConf {
 
 struct VirtIOSCSI;
 
-typedef struct {
-    struct VirtIOSCSI *parent;
-    Vring vring;
-    EventNotifier host_notifier;
-    EventNotifier guest_notifier;
-} VirtIOSCSIVring;
-
 typedef struct VirtIOSCSICommon {
     VirtIODevice parent_obj;
     VirtIOSCSIConf conf;
@@ -86,18 +78,12 @@ typedef struct VirtIOSCSI {
     /* Fields for dataplane below */
     AioContext *ctx; /* one iothread per virtio-scsi-pci for now */
 
-    /* Vring is used instead of vq in dataplane code, because of the underlying
-     * memory layer thread safety */
-    VirtIOSCSIVring *ctrl_vring;
-    VirtIOSCSIVring *event_vring;
-    VirtIOSCSIVring **cmd_vrings;
     bool dataplane_started;
     bool dataplane_starting;
     bool dataplane_stopping;
     bool dataplane_disabled;
     bool dataplane_fenced;
     Error *blocker;
-    Notifier migration_state_notifier;
     uint32_t host_features;
 } VirtIOSCSI;
 
@@ -113,9 +99,6 @@ typedef struct VirtIOSCSIReq {
     QEMUSGList qsgl;
     QEMUIOVector resp_iov;
 
-    /* Set by dataplane code. */
-    VirtIOSCSIVring *vring;
-
     union {
         /* Used for two-stage request submission */
         QTAILQ_ENTRY(VirtIOSCSIReq) next;
@@ -158,8 +141,6 @@ void virtio_scsi_push_event(VirtIOSCSI *s, SCSIDevice *dev,
 void virtio_scsi_set_iothread(VirtIOSCSI *s, IOThread *iothread);
 void virtio_scsi_dataplane_start(VirtIOSCSI *s);
 void virtio_scsi_dataplane_stop(VirtIOSCSI *s);
-void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req);
-VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s,
-                                         VirtIOSCSIVring *vring);
+void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req);
 
 #endif /* _QEMU_VIRTIO_SCSI_H */
-- 
1.8.3.1

* [Qemu-devel] [PATCH 16/40] vring: remove
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (14 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
                   ` (24 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/virtio/Makefile.objs                       |   1 -
 hw/virtio/dataplane/Makefile.objs             |   1 -
 hw/virtio/dataplane/vring.c                   | 547 --------------------------
 include/hw/virtio/dataplane/vring-accessors.h |  75 ----
 include/hw/virtio/dataplane/vring.h           |  51 ---
 trace-events                                  |   3 -
 6 files changed, 678 deletions(-)
 delete mode 100644 hw/virtio/dataplane/Makefile.objs
 delete mode 100644 hw/virtio/dataplane/vring.c
 delete mode 100644 include/hw/virtio/dataplane/vring-accessors.h
 delete mode 100644 include/hw/virtio/dataplane/vring.h

diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs
index 19b224a..3e2b175 100644
--- a/hw/virtio/Makefile.objs
+++ b/hw/virtio/Makefile.objs
@@ -2,7 +2,6 @@ common-obj-y += virtio-rng.o
 common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
 common-obj-y += virtio-bus.o
 common-obj-y += virtio-mmio.o
-obj-$(CONFIG_VIRTIO) += dataplane/
 
 obj-y += virtio.o virtio-balloon.o 
 obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o
diff --git a/hw/virtio/dataplane/Makefile.objs b/hw/virtio/dataplane/Makefile.objs
deleted file mode 100644
index 753a9ca..0000000
--- a/hw/virtio/dataplane/Makefile.objs
+++ /dev/null
@@ -1 +0,0 @@
-obj-y += vring.o
diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c
deleted file mode 100644
index 4e1a299..0000000
--- a/hw/virtio/dataplane/vring.c
+++ /dev/null
@@ -1,547 +0,0 @@
-/* Copyright 2012 Red Hat, Inc.
- * Copyright IBM, Corp. 2012
- *
- * Based on Linux 2.6.39 vhost code:
- * Copyright (C) 2009 Red Hat, Inc.
- * Copyright (C) 2006 Rusty Russell IBM Corporation
- *
- * Author: Michael S. Tsirkin <mst@redhat.com>
- *         Stefan Hajnoczi <stefanha@redhat.com>
- *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
- *
- * This work is licensed under the terms of the GNU GPL, version 2.
- */
-
-#include "trace.h"
-#include "hw/hw.h"
-#include "exec/memory.h"
-#include "exec/address-spaces.h"
-#include "hw/virtio/virtio-access.h"
-#include "hw/virtio/dataplane/vring.h"
-#include "hw/virtio/dataplane/vring-accessors.h"
-#include "qemu/error-report.h"
-
-/* vring_map can be coupled with vring_unmap or (if you still have the
- * value returned in *mr) memory_region_unref.
- * Returns NULL on failure.
- * Callers that can handle a partial mapping must supply mapped_len pointer to
- * get the actual length mapped.
- * Passing mapped_len == NULL requires either a full mapping or a failure.
- */
-static void *vring_map(MemoryRegion **mr, hwaddr phys,
-                       hwaddr len, hwaddr *mapped_len,
-                       bool is_write)
-{
-    MemoryRegionSection section = memory_region_find(get_system_memory(), phys, len);
-    uint64_t size;
-
-    if (!section.mr) {
-        goto out;
-    }
-
-    size = int128_get64(section.size);
-    assert(size);
-
-    /* Passing mapped_len == NULL requires either a full mapping or a failure. */
-    if (!mapped_len && size < len) {
-        goto out;
-    }
-
-    if (is_write && section.readonly) {
-        goto out;
-    }
-    if (!memory_region_is_ram(section.mr)) {
-        goto out;
-    }
-
-    /* Ignore regions with dirty logging, we cannot mark them dirty */
-    if (memory_region_get_dirty_log_mask(section.mr)) {
-        goto out;
-    }
-
-    if (mapped_len) {
-        *mapped_len = MIN(size, len);
-    }
-
-    *mr = section.mr;
-    return memory_region_get_ram_ptr(section.mr) + section.offset_within_region;
-
-out:
-    memory_region_unref(section.mr);
-    *mr = NULL;
-    return NULL;
-}
-
-static void vring_unmap(void *buffer, bool is_write)
-{
-    ram_addr_t addr;
-    MemoryRegion *mr;
-
-    mr = qemu_ram_addr_from_host(buffer, &addr);
-    memory_region_unref(mr);
-}
-
-/* Map the guest's vring to host memory */
-bool vring_setup(Vring *vring, VirtIODevice *vdev, int n)
-{
-    struct vring *vr = &vring->vr;
-    hwaddr addr;
-    hwaddr size;
-    void *ptr;
-
-    vring->broken = false;
-    vr->num = virtio_queue_get_num(vdev, n);
-
-    addr = virtio_queue_get_desc_addr(vdev, n);
-    size = virtio_queue_get_desc_size(vdev, n);
-    /* Map the descriptor area as read only */
-    ptr = vring_map(&vring->mr_desc, addr, size, NULL, false);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring desc "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_desc;
-    }
-    vr->desc = ptr;
-
-    addr = virtio_queue_get_avail_addr(vdev, n);
-    size = virtio_queue_get_avail_size(vdev, n);
-    /* Add the size of the used_event_idx */
-    size += sizeof(uint16_t);
-    /* Map the driver area as read only */
-    ptr = vring_map(&vring->mr_avail, addr, size, NULL, false);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring avail "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_avail;
-    }
-    vr->avail = ptr;
-
-    addr = virtio_queue_get_used_addr(vdev, n);
-    size = virtio_queue_get_used_size(vdev, n);
-    /* Add the size of the avail_event_idx */
-    size += sizeof(uint16_t);
-    /* Map the device area as read-write */
-    ptr = vring_map(&vring->mr_used, addr, size, NULL, true);
-    if (!ptr) {
-        error_report("Failed to map 0x%" HWADDR_PRIx " byte for vring used "
-                     "at 0x%" HWADDR_PRIx,
-                      size, addr);
-        goto out_err_used;
-    }
-    vr->used = ptr;
-
-    vring->last_avail_idx = virtio_queue_get_last_avail_idx(vdev, n);
-    vring->last_used_idx = vring_get_used_idx(vdev, vring);
-    vring->signalled_used = 0;
-    vring->signalled_used_valid = false;
-
-    trace_vring_setup(virtio_queue_get_ring_addr(vdev, n),
-                      vring->vr.desc, vring->vr.avail, vring->vr.used);
-    return true;
-
-out_err_used:
-    memory_region_unref(vring->mr_avail);
-out_err_avail:
-    memory_region_unref(vring->mr_desc);
-out_err_desc:
-    vring->broken = true;
-    return false;
-}
-
-void vring_teardown(Vring *vring, VirtIODevice *vdev, int n)
-{
-    virtio_queue_set_last_avail_idx(vdev, n, vring->last_avail_idx);
-    virtio_queue_invalidate_signalled_used(vdev, n);
-
-    memory_region_unref(vring->mr_desc);
-    memory_region_unref(vring->mr_avail);
-    memory_region_unref(vring->mr_used);
-}
-
-/* Disable guest->host notifies */
-void vring_disable_notification(VirtIODevice *vdev, Vring *vring)
-{
-    if (!virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_set_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
-    }
-}
-
-/* Enable guest->host notifies
- *
- * Return true if the vring is empty, false if there are more requests.
- */
-void vring_enable_notification(VirtIODevice *vdev, Vring *vring)
-{
-    if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_avail_event(&vring->vr) = vring->vr.avail->idx;
-    } else {
-        vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY);
-    }
-    smp_mb(); /* ensure update is seen before reading avail_idx */
-}
-
-/* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */
-bool vring_should_notify(VirtIODevice *vdev, Vring *vring)
-{
-    uint16_t old, new;
-    bool v;
-    /* Flush out used index updates. This is paired
-     * with the barrier that the Guest executes when enabling
-     * interrupts. */
-    smp_mb();
-
-    if (virtio_vdev_has_feature(vdev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
-        unlikely(!vring_more_avail(vdev, vring))) {
-        return true;
-    }
-
-    if (!virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        return !(vring_get_avail_flags(vdev, vring) &
-                 VRING_AVAIL_F_NO_INTERRUPT);
-    }
-    old = vring->signalled_used;
-    v = vring->signalled_used_valid;
-    new = vring->signalled_used = vring->last_used_idx;
-    vring->signalled_used_valid = true;
-
-    if (unlikely(!v)) {
-        return true;
-    }
-
-    return vring_need_event(virtio_tswap16(vdev, vring_used_event(&vring->vr)),
-                            new, old);
-}
-
-typedef struct VirtQueueCurrentElement {
-    unsigned in_num;
-    unsigned out_num;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
-} VirtQueueCurrentElement;
-
-static int get_desc(Vring *vring, VirtQueueCurrentElement *elem,
-                    struct vring_desc *desc)
-{
-    unsigned *num;
-    struct iovec *iov;
-    hwaddr *addr;
-    MemoryRegion *mr;
-    hwaddr len;
-
-    if (desc->flags & VRING_DESC_F_WRITE) {
-        num = &elem->in_num;
-        iov = &elem->iov[elem->out_num + *num];
-        addr = &elem->addr[elem->out_num + *num];
-    } else {
-        num = &elem->out_num;
-        iov = &elem->iov[*num];
-        addr = &elem->addr[*num];
-
-        /* If it's an output descriptor, they're all supposed
-         * to come before any input descriptors. */
-        if (unlikely(elem->in_num)) {
-            error_report("Descriptor has out after in");
-            return -EFAULT;
-        }
-    }
-
-    while (desc->len) {
-        /* Stop for now if there are not enough iovecs available. */
-        if (*num >= VIRTQUEUE_MAX_SIZE) {
-            error_report("Invalid SG num: %u", *num);
-            return -EFAULT;
-        }
-
-        iov->iov_base = vring_map(&mr, desc->addr, desc->len, &len,
-                                  desc->flags & VRING_DESC_F_WRITE);
-        if (!iov->iov_base) {
-            error_report("Failed to map descriptor addr %#" PRIx64 " len %u",
-                         (uint64_t)desc->addr, desc->len);
-            return -EFAULT;
-        }
-
-        /* The MemoryRegion is looked up again and unref'ed later, leave the
-         * ref in place.  */
-        (iov++)->iov_len = len;
-        *addr++ = desc->addr;
-        desc->len -= len;
-        desc->addr += len;
-        *num += 1;
-    }
-
-    return 0;
-}
-
-static void copy_in_vring_desc(VirtIODevice *vdev,
-                               const struct vring_desc *guest,
-                               struct vring_desc *host)
-{
-    host->addr = virtio_ldq_p(vdev, &guest->addr);
-    host->len = virtio_ldl_p(vdev, &guest->len);
-    host->flags = virtio_lduw_p(vdev, &guest->flags);
-    host->next = virtio_lduw_p(vdev, &guest->next);
-}
-
-static bool read_vring_desc(VirtIODevice *vdev,
-                            hwaddr guest,
-                            struct vring_desc *host)
-{
-    if (address_space_read(&address_space_memory, guest, MEMTXATTRS_UNSPECIFIED,
-                           (uint8_t *)host, sizeof *host)) {
-        return false;
-    }
-    host->addr = virtio_tswap64(vdev, host->addr);
-    host->len = virtio_tswap32(vdev, host->len);
-    host->flags = virtio_tswap16(vdev, host->flags);
-    host->next = virtio_tswap16(vdev, host->next);
-    return true;
-}
-
-/* This is stolen from linux/drivers/vhost/vhost.c. */
-static int get_indirect(VirtIODevice *vdev, Vring *vring,
-                        VirtQueueCurrentElement *cur_elem,
-                        struct vring_desc *indirect)
-{
-    struct vring_desc desc;
-    unsigned int i = 0, count, found = 0;
-    int ret;
-
-    /* Sanity check */
-    if (unlikely(indirect->len % sizeof(desc))) {
-        error_report("Invalid length in indirect descriptor: "
-                     "len %#x not multiple of %#zx",
-                     indirect->len, sizeof(desc));
-        vring->broken = true;
-        return -EFAULT;
-    }
-
-    count = indirect->len / sizeof(desc);
-    /* Buffers are chained via a 16 bit next field, so
-     * we can have at most 2^16 of these. */
-    if (unlikely(count > USHRT_MAX + 1)) {
-        error_report("Indirect buffer length too big: %d", indirect->len);
-        vring->broken = true;
-        return -EFAULT;
-    }
-
-    do {
-        /* Translate indirect descriptor */
-        if (!read_vring_desc(vdev, indirect->addr + found * sizeof(desc),
-                             &desc)) {
-            error_report("Failed to read indirect descriptor "
-                         "addr %#" PRIx64 " len %zu",
-                         (uint64_t)indirect->addr + found * sizeof(desc),
-                         sizeof(desc));
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        /* Ensure descriptor has been loaded before accessing fields */
-        barrier(); /* read_barrier_depends(); */
-
-        if (unlikely(++found > count)) {
-            error_report("Loop detected: last one at %u "
-                         "indirect size %u", i, count);
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
-            error_report("Nested indirect descriptor");
-            vring->broken = true;
-            return -EFAULT;
-        }
-
-        ret = get_desc(vring, cur_elem, &desc);
-        if (ret < 0) {
-            vring->broken |= (ret == -EFAULT);
-            return ret;
-        }
-        i = desc.next;
-    } while (desc.flags & VRING_DESC_F_NEXT);
-    return 0;
-}
-
-static void vring_unmap_element(VirtQueueElement *elem)
-{
-    int i;
-
-    /* This assumes that the iovecs, if changed, are never moved past
-     * the end of the valid area.  This is true if iovec manipulations
-     * are done with iov_discard_front and iov_discard_back.
-     */
-    for (i = 0; i < elem->out_num; i++) {
-        vring_unmap(elem->out_sg[i].iov_base, false);
-    }
-
-    for (i = 0; i < elem->in_num; i++) {
-        vring_unmap(elem->in_sg[i].iov_base, true);
-    }
-}
-
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access.  Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found.  A negative code is
- * returned on error.
- *
- * Stolen from linux/drivers/vhost/vhost.c.
- */
-void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz)
-{
-    struct vring_desc desc;
-    unsigned int i, head, found = 0, num = vring->vr.num;
-    uint16_t avail_idx, last_avail_idx;
-    VirtQueueCurrentElement cur_elem;
-    VirtQueueElement *elem = NULL;
-    int ret;
-
-    /* If there was a fatal error then refuse operation */
-    if (vring->broken) {
-        ret = -EFAULT;
-        goto out;
-    }
-
-    cur_elem.in_num = cur_elem.out_num = 0;
-
-    /* Check it isn't doing very strange things with descriptor numbers. */
-    last_avail_idx = vring->last_avail_idx;
-    avail_idx = vring_get_avail_idx(vdev, vring);
-    barrier(); /* load indices now and not again later */
-
-    if (unlikely((uint16_t)(avail_idx - last_avail_idx) > num)) {
-        error_report("Guest moved used index from %u to %u",
-                     last_avail_idx, avail_idx);
-        ret = -EFAULT;
-        goto out;
-    }
-
-    /* If there's nothing new since last we looked. */
-    if (avail_idx == last_avail_idx) {
-        ret = -EAGAIN;
-        goto out;
-    }
-
-    /* Only get avail ring entries after they have been exposed by guest. */
-    smp_rmb();
-
-    /* Grab the next descriptor number they're advertising, and increment
-     * the index we've seen. */
-    head = vring_get_avail_ring(vdev, vring, last_avail_idx % num);
-
-    /* If their number is silly, that's an error. */
-    if (unlikely(head >= num)) {
-        error_report("Guest says index %u > %u is available", head, num);
-        ret = -EFAULT;
-        goto out;
-    }
-
-    i = head;
-    do {
-        if (unlikely(i >= num)) {
-            error_report("Desc index is %u > %u, head = %u", i, num, head);
-            ret = -EFAULT;
-            goto out;
-        }
-        if (unlikely(++found > num)) {
-            error_report("Loop detected: last one at %u vq size %u head %u",
-                         i, num, head);
-            ret = -EFAULT;
-            goto out;
-        }
-        copy_in_vring_desc(vdev, &vring->vr.desc[i], &desc);
-
-        /* Ensure descriptor is loaded before accessing fields */
-        barrier();
-
-        if (desc.flags & VRING_DESC_F_INDIRECT) {
-            ret = get_indirect(vdev, vring, &cur_elem, &desc);
-            if (ret < 0) {
-                goto out;
-            }
-            continue;
-        }
-
-        ret = get_desc(vring, &cur_elem, &desc);
-        if (ret < 0) {
-            goto out;
-        }
-
-        i = desc.next;
-    } while (desc.flags & VRING_DESC_F_NEXT);
-
-    /* On success, increment avail index. */
-    vring->last_avail_idx++;
-    if (virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
-        vring_avail_event(&vring->vr) =
-            virtio_tswap16(vdev, vring->last_avail_idx);
-    }
-
-    /* Now copy what we have collected and mapped */
-    elem = virtqueue_alloc_element(sz, cur_elem.out_num, cur_elem.in_num);
-    elem->index = head;
-    for (i = 0; i < cur_elem.out_num; i++) {
-        elem->out_addr[i] = cur_elem.addr[i];
-        elem->out_sg[i] = cur_elem.iov[i];
-    }
-    for (i = 0; i < cur_elem.in_num; i++) {
-        elem->in_addr[i] = cur_elem.addr[cur_elem.out_num + i];
-        elem->in_sg[i] = cur_elem.iov[cur_elem.out_num + i];
-    }
-
-    return elem;
-
-out:
-    assert(ret < 0);
-    if (ret == -EFAULT) {
-        vring->broken = true;
-    }
-
-    for (i = 0; i < cur_elem.out_num + cur_elem.in_num; i++) {
-        vring_unmap(cur_elem.iov[i].iov_base, false);
-    }
-
-    g_free(elem);
-    return NULL;
-}
-
-/* After we've used one of their buffers, we tell them about it.
- *
- * Stolen from linux/drivers/vhost/vhost.c.
- */
-void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-                int len)
-{
-    unsigned int head = elem->index;
-    uint16_t new;
-
-    vring_unmap_element(elem);
-
-    /* Don't touch vring if a fatal error occurred */
-    if (vring->broken) {
-        return;
-    }
-
-    /* The virtqueue contains a ring of used buffers.  Get a pointer to the
-     * next entry in that used ring. */
-    vring_set_used_ring_id(vdev, vring, vring->last_used_idx % vring->vr.num,
-                           head);
-    vring_set_used_ring_len(vdev, vring, vring->last_used_idx % vring->vr.num,
-                            len);
-
-    /* Make sure buffer is written before we update index. */
-    smp_wmb();
-
-    new = ++vring->last_used_idx;
-    vring_set_used_idx(vdev, vring, new);
-    if (unlikely((int16_t)(new - vring->signalled_used) < (uint16_t)1)) {
-        vring->signalled_used_valid = false;
-    }
-}
diff --git a/include/hw/virtio/dataplane/vring-accessors.h b/include/hw/virtio/dataplane/vring-accessors.h
deleted file mode 100644
index 815c19b..0000000
--- a/include/hw/virtio/dataplane/vring-accessors.h
+++ /dev/null
@@ -1,75 +0,0 @@
-#ifndef VRING_ACCESSORS_H
-#define VRING_ACCESSORS_H
-
-#include "standard-headers/linux/virtio_ring.h"
-#include "hw/virtio/virtio.h"
-#include "hw/virtio/virtio-access.h"
-
-static inline uint16_t vring_get_used_idx(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.used->idx);
-}
-
-static inline void vring_set_used_idx(VirtIODevice *vdev, Vring *vring,
-                                      uint16_t idx)
-{
-    vring->vr.used->idx = virtio_tswap16(vdev, idx);
-}
-
-static inline uint16_t vring_get_avail_idx(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->idx);
-}
-
-static inline uint16_t vring_get_avail_ring(VirtIODevice *vdev, Vring *vring,
-                                            int i)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->ring[i]);
-}
-
-static inline void vring_set_used_ring_id(VirtIODevice *vdev, Vring *vring,
-                                          int i, uint32_t id)
-{
-    vring->vr.used->ring[i].id = virtio_tswap32(vdev, id);
-}
-
-static inline void vring_set_used_ring_len(VirtIODevice *vdev, Vring *vring,
-                                          int i, uint32_t len)
-{
-    vring->vr.used->ring[i].len = virtio_tswap32(vdev, len);
-}
-
-static inline uint16_t vring_get_used_flags(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.used->flags);
-}
-
-static inline uint16_t vring_get_avail_flags(VirtIODevice *vdev, Vring *vring)
-{
-    return virtio_tswap16(vdev, vring->vr.avail->flags);
-}
-
-static inline void vring_set_used_flags(VirtIODevice *vdev, Vring *vring,
-                                        uint16_t flags)
-{
-    vring->vr.used->flags |= virtio_tswap16(vdev, flags);
-}
-
-static inline void vring_clear_used_flags(VirtIODevice *vdev, Vring *vring,
-                                          uint16_t flags)
-{
-    vring->vr.used->flags &= virtio_tswap16(vdev, ~flags);
-}
-
-static inline unsigned int vring_get_num(Vring *vring)
-{
-    return vring->vr.num;
-}
-
-/* Are there more descriptors available? */
-static inline bool vring_more_avail(VirtIODevice *vdev, Vring *vring)
-{
-    return vring_get_avail_idx(vdev, vring) != vring->last_avail_idx;
-}
-
-#endif
diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
deleted file mode 100644
index e1c2a65..0000000
--- a/include/hw/virtio/dataplane/vring.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* Copyright 2012 Red Hat, Inc. and/or its affiliates
- * Copyright IBM, Corp. 2012
- *
- * Based on Linux 2.6.39 vhost code:
- * Copyright (C) 2009 Red Hat, Inc.
- * Copyright (C) 2006 Rusty Russell IBM Corporation
- *
- * Author: Michael S. Tsirkin <mst@redhat.com>
- *         Stefan Hajnoczi <stefanha@redhat.com>
- *
- * Inspiration, some code, and most witty comments come from
- * Documentation/virtual/lguest/lguest.c, by Rusty Russell
- *
- * This work is licensed under the terms of the GNU GPL, version 2.
- */
-
-#ifndef VRING_H
-#define VRING_H
-
-#include "qemu-common.h"
-#include "standard-headers/linux/virtio_ring.h"
-#include "hw/virtio/virtio.h"
-
-typedef struct {
-    MemoryRegion *mr_desc;          /* memory region for the vring desc */
-    MemoryRegion *mr_avail;         /* memory region for the vring avail */
-    MemoryRegion *mr_used;          /* memory region for the vring used */
-    struct vring vr;                /* virtqueue vring mapped to host memory */
-    uint16_t last_avail_idx;        /* last processed avail ring index */
-    uint16_t last_used_idx;         /* last processed used ring index */
-    uint16_t signalled_used;        /* EVENT_IDX state */
-    bool signalled_used_valid;
-    bool broken;                    /* was there a fatal error? */
-} Vring;
-
-/* Fail future vring_pop() and vring_push() calls until reset */
-static inline void vring_set_broken(Vring *vring)
-{
-    vring->broken = true;
-}
-
-bool vring_setup(Vring *vring, VirtIODevice *vdev, int n);
-void vring_teardown(Vring *vring, VirtIODevice *vdev, int n);
-void vring_disable_notification(VirtIODevice *vdev, Vring *vring);
-void vring_enable_notification(VirtIODevice *vdev, Vring *vring);
-bool vring_should_notify(VirtIODevice *vdev, Vring *vring);
-void *vring_pop(VirtIODevice *vdev, Vring *vring, size_t sz);
-void vring_push(VirtIODevice *vdev, Vring *vring, VirtQueueElement *elem,
-                int len);
-
-#endif /* VRING_H */
diff --git a/trace-events b/trace-events
index 0b0ff02..c230d02 100644
--- a/trace-events
+++ b/trace-events
@@ -125,9 +125,6 @@ virtio_blk_data_plane_start(void *s) "dataplane %p"
 virtio_blk_data_plane_stop(void *s) "dataplane %p"
 virtio_blk_data_plane_process_request(void *s, unsigned int out_num, unsigned int in_num, unsigned int head) "dataplane %p out_num %u in_num %u head %u"
 
-# hw/virtio/dataplane/vring.c
-vring_setup(uint64_t physical, void *desc, void *avail, void *used) "vring physical %#"PRIx64" desc %p avail %p used %p"
-
 # thread-pool.c
 thread_pool_submit(void *pool, void *req, void *opaque) "pool %p req %p opaque %p"
 thread_pool_complete(void *pool, void *req, void *opaque, int ret) "pool %p req %p opaque %p ret %d"
-- 
1.8.3.1

* [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (15 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
                   ` (23 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is the first step towards fine-grained critical sections in
dataplane threads.  It resolves the lock ordering problem between
address_space_* functions (which need the BQL when doing MMIO, even
after RCU-based dispatch is complete) and the AioContext.

Because AioContext does not use contention callbacks anymore, the
unit test has to be changed.

Previously applied as a0710f7995f914e3044e5899bd8ff6c43c62f916 and
then reverted.
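
One practical consequence: a thread that wants to manipulate objects
bound to an IOThread's AioContext can now simply take the lock, because
the IOThread no longer holds it while blocked in aio_poll().  A minimal
sketch of the pattern (iothread_get_aio_context() is assumed as the
accessor; this is not literal code from the patch):

AioContext *ctx = iothread_get_aio_context(iothread);

/* The IOThread may be sleeping inside aio_poll(ctx, true); it polls
 * without holding the lock, so this acquire no longer needs the old
 * rfifolock contention callback to kick the event loop. */
aio_context_acquire(ctx);
/* ... touch timers, BHs or block devices bound to ctx ... */
aio_context_release(ctx);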

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                     | 22 +++-------------------
 docs/multiple-iothreads.txt | 25 +++++++++++++++----------
 include/block/aio.h         |  3 ---
 iothread.c                  | 11 ++---------
 tests/test-aio.c            | 19 +++++++++++--------
 5 files changed, 31 insertions(+), 49 deletions(-)

diff --git a/async.c b/async.c
index e106072..f70d23a 100644
--- a/async.c
+++ b/async.c
@@ -84,8 +84,8 @@ int aio_bh_poll(AioContext *ctx)
          * aio_notify again if necessary.
          */
         if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
-            /* Idle BHs and the notify BH don't count as progress */
-            if (!bh->idle && bh != ctx->notify_dummy_bh) {
+            /* Idle BHs don't count as progress */
+            if (!bh->idle) {
                 ret = 1;
             }
             bh->idle = 0;
@@ -237,7 +237,6 @@ aio_ctx_finalize(GSource     *source)
 {
     AioContext *ctx = (AioContext *) source;
 
-    qemu_bh_delete(ctx->notify_dummy_bh);
     thread_pool_free(ctx->thread_pool);
 
     qemu_mutex_lock(&ctx->bh_lock);
@@ -304,19 +303,6 @@ static void aio_timerlist_notify(void *opaque)
     aio_notify(opaque);
 }
 
-static void aio_rfifolock_cb(void *opaque)
-{
-    AioContext *ctx = opaque;
-
-    /* Kick owner thread in case they are blocked in aio_poll() */
-    qemu_bh_schedule(ctx->notify_dummy_bh);
-}
-
-static void notify_dummy_bh(void *opaque)
-{
-    /* Do nothing, we were invoked just to force the event loop to iterate */
-}
-
 static void event_notifier_dummy_cb(EventNotifier *e)
 {
 }
@@ -345,11 +331,9 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
+    rfifolock_init(&ctx->lock, NULL, NULL);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
-    ctx->notify_dummy_bh = aio_bh_new(ctx, notify_dummy_bh, NULL);
-
     return ctx;
 fail:
     g_source_destroy(&ctx->source);
diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 40b8419..723cc7e 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -105,13 +105,10 @@ a BH in the target AioContext beforehand and then call qemu_bh_schedule().  No
 acquire/release or locking is needed for the qemu_bh_schedule() call.  But be
 sure to acquire the AioContext for aio_bh_new() if necessary.
 
-The relationship between AioContext and the block layer
--------------------------------------------------------
-The AioContext originates from the QEMU block layer because it provides a
-scoped way of running event loop iterations until all work is done.  This
-feature is used to complete all in-flight block I/O requests (see
-bdrv_drain_all()).  Nowadays AioContext is a generic event loop that can be
-used by any QEMU subsystem.
+AioContext and the block layer
+------------------------------
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
 
 The block layer has support for AioContext integrated.  Each BlockDriverState
 is associated with an AioContext using bdrv_set_aio_context() and
@@ -122,9 +119,17 @@ Block layer code must therefore expect to run in an IOThread and avoid using
 old APIs that implicitly use the main loop.  See the "How to program for
 IOThreads" above for information on how to do that.
 
-If main loop code such as a QMP function wishes to access a BlockDriverState it
-must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
-IOThread does not run in parallel.
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically uses bdrv_drain() to ensure that
+past requests from the guest are completed.  When a block device is
+running in an IOThread, the IOThread can also process requests from
+the guest (via ioeventfd).  These requests have to be blocked with
+aio_disable_clients() before calling bdrv_drain().  You can then reenable
+guest requests with aio_enable_clients() before finally releasing the
+AioContext and completing the monitor command.
 
 Long-running jobs (usually in the form of coroutines) are best scheduled in the
 BlockDriverState's AioContext to avoid the need to acquire/release around each
diff --git a/include/block/aio.h b/include/block/aio.h
index e086e3b..b41e291 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -114,9 +114,6 @@ struct AioContext {
     bool notified;
     EventNotifier notifier;
 
-    /* Scheduling this BH forces the event loop it iterate */
-    QEMUBH *notify_dummy_bh;
-
     /* Thread pool for performing work and receiving completion callbacks */
     struct ThreadPool *thread_pool;
 
diff --git a/iothread.c b/iothread.c
index 1b8c2bb..df30f5e 100644
--- a/iothread.c
+++ b/iothread.c
@@ -30,7 +30,6 @@ typedef ObjectClass IOThreadClass;
 static void *iothread_run(void *opaque)
 {
     IOThread *iothread = opaque;
-    bool blocking;
 
     rcu_register_thread();
 
@@ -39,14 +38,8 @@ static void *iothread_run(void *opaque)
     qemu_cond_signal(&iothread->init_done_cond);
     qemu_mutex_unlock(&iothread->init_done_lock);
 
-    while (!iothread->stopping) {
-        aio_context_acquire(iothread->ctx);
-        blocking = true;
-        while (!iothread->stopping && aio_poll(iothread->ctx, blocking)) {
-            /* Progress was made, keep going */
-            blocking = false;
-        }
-        aio_context_release(iothread->ctx);
+    while (!atomic_read(&iothread->stopping)) {
+        aio_poll(iothread->ctx, true);
     }
 
     rcu_unregister_thread();
diff --git a/tests/test-aio.c b/tests/test-aio.c
index 1623803..cb64e3b 100644
--- a/tests/test-aio.c
+++ b/tests/test-aio.c
@@ -99,6 +99,7 @@ static void event_ready_cb(EventNotifier *e)
 
 typedef struct {
     QemuMutex start_lock;
+    EventNotifier notifier;
     bool thread_acquired;
 } AcquireTestData;
 
@@ -110,6 +111,8 @@ static void *test_acquire_thread(void *opaque)
     qemu_mutex_lock(&data->start_lock);
     qemu_mutex_unlock(&data->start_lock);
 
+    g_usleep(500000);
+    event_notifier_set(&data->notifier);
     aio_context_acquire(ctx);
     aio_context_release(ctx);
 
@@ -124,20 +127,19 @@ static void set_event_notifier(AioContext *ctx, EventNotifier *notifier,
     aio_set_event_notifier(ctx, notifier, false, handler);
 }
 
-static void dummy_notifier_read(EventNotifier *unused)
+static void dummy_notifier_read(EventNotifier *n)
 {
-    g_assert(false); /* should never be invoked */
+    event_notifier_test_and_clear(n);
 }
 
 static void test_acquire(void)
 {
     QemuThread thread;
-    EventNotifier notifier;
     AcquireTestData data;
 
     /* Dummy event notifier ensures aio_poll() will block */
-    event_notifier_init(&notifier, false);
-    set_event_notifier(ctx, &notifier, dummy_notifier_read);
+    event_notifier_init(&data.notifier, false);
+    set_event_notifier(ctx, &data.notifier, dummy_notifier_read);
     g_assert(!aio_poll(ctx, false)); /* consume aio_notify() */
 
     qemu_mutex_init(&data.start_lock);
@@ -151,12 +153,13 @@ static void test_acquire(void)
     /* Block in aio_poll(), let other thread kick us and acquire context */
     aio_context_acquire(ctx);
     qemu_mutex_unlock(&data.start_lock); /* let the thread run */
-    g_assert(!aio_poll(ctx, true));
+    g_assert(aio_poll(ctx, true));
+    g_assert(!data.thread_acquired);
     aio_context_release(ctx);
 
     qemu_thread_join(&thread);
-    set_event_notifier(ctx, &notifier, NULL);
-    event_notifier_cleanup(&notifier);
+    set_event_notifier(ctx, &data.notifier, NULL);
+    event_notifier_cleanup(&data.notifier);
 
     g_assert(data.thread_acquired);
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (16 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
                   ` (22 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

GRecMutex is new in glib 2.32, so we cannot use it.  Introduce
a recursive mutex in qemu-thread instead, which will replace
RFifoLock.
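
A usage sketch (illustration only, not part of the patch): unlike
QemuMutex, the same thread can take a QemuRecMutex again while it
already holds it.

    static QemuRecMutex lock;    /* qemu_rec_mutex_init(&lock) at setup */

    static void inner(void)
    {
        qemu_rec_mutex_lock(&lock);      /* nests: we already own it */
        /* ... */
        qemu_rec_mutex_unlock(&lock);
    }

    static void outer(void)
    {
        qemu_rec_mutex_lock(&lock);
        inner();                         /* would deadlock with QemuMutex */
        qemu_rec_mutex_unlock(&lock);
    }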

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/qemu/thread-posix.h |  6 ++++++
 include/qemu/thread-win32.h | 10 ++++++++++
 include/qemu/thread.h       |  3 +++
 util/qemu-thread-posix.c    | 13 +++++++++++++
 util/qemu-thread-win32.c    | 25 +++++++++++++++++++++++++
 5 files changed, 57 insertions(+)

diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index eb5c7a1..2fb6b90 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -3,6 +3,12 @@
 #include "pthread.h"
 #include <semaphore.h>
 
+typedef QemuMutex QemuRecMutex;
+#define qemu_rec_mutex_destroy qemu_mutex_destroy
+#define qemu_rec_mutex_lock qemu_mutex_lock
+#define qemu_rec_mutex_trylock qemu_mutex_trylock
+#define qemu_rec_mutex_unlock qemu_mutex_unlock
+
 struct QemuMutex {
     pthread_mutex_t lock;
 };
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index 385ff5f..50999c5 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -7,6 +7,16 @@ struct QemuMutex {
     LONG owner;
 };
 
+typedef struct QemuRecMutex QemuRecMutex;
+struct QemuRecMutex {
+    CRITICAL_SECTION lock;
+};
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex);
+void qemu_rec_mutex_lock(QemuRecMutex *mutex);
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex);
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex);
+
 struct QemuCond {
     LONG waiters, target;
     HANDLE sema;
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 5114ec8..981f3dc 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -25,6 +25,9 @@ void qemu_mutex_lock(QemuMutex *mutex);
 int qemu_mutex_trylock(QemuMutex *mutex);
 void qemu_mutex_unlock(QemuMutex *mutex);
 
+/* Prototypes for other functions are in thread-posix.h/thread-win32.h.  */
+void qemu_rec_mutex_init(QemuRecMutex *mutex);
+
 void qemu_cond_init(QemuCond *cond);
 void qemu_cond_destroy(QemuCond *cond);
 
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index dbd8094..4cc9f13 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -89,6 +89,19 @@ void qemu_mutex_unlock(QemuMutex *mutex)
         error_exit(err, __func__);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    int err;
+    pthread_mutexattr_t attr;
+
+    pthread_mutexattr_init(&attr);
+    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
+    err = pthread_mutex_init(&mutex->lock, &attr);
+    if (err) {
+        error_exit(err, __func__);
+    }
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     int err;
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index 6cdd553..018b73e 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -80,6 +80,31 @@ void qemu_mutex_unlock(QemuMutex *mutex)
     LeaveCriticalSection(&mutex->lock);
 }
 
+void qemu_rec_mutex_init(QemuRecMutex *mutex)
+{
+    InitializeCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_destroy(QemuRecMutex *mutex)
+{
+    DeleteCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_lock(QemuRecMutex *mutex)
+{
+    EnterCriticalSection(&mutex->lock);
+}
+
+int qemu_rec_mutex_trylock(QemuRecMutex *mutex)
+{
+    return !TryEnterCriticalSection(&mutex->lock);
+}
+
+void qemu_rec_mutex_unlock(QemuRecMutex *mutex)
+{
+    LeaveCriticalSection(&mutex->lock);
+}
+
 void qemu_cond_init(QemuCond *cond)
 {
     memset(cond, 0, sizeof(*cond));
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (17 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
                   ` (21 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

It is simpler and a bit faster, and QEMU does not need the contention
callbacks (and thus the fairness) anymore.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c                  |  8 ++---
 include/block/aio.h      |  3 +-
 include/qemu/rfifolock.h | 54 ----------------------------
 tests/.gitignore         |  1 -
 tests/Makefile           |  2 --
 tests/test-rfifolock.c   | 91 ------------------------------------------------
 util/Makefile.objs       |  1 -
 util/rfifolock.c         | 78 -----------------------------------------
 8 files changed, 5 insertions(+), 233 deletions(-)
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 delete mode 100644 util/rfifolock.c

diff --git a/async.c b/async.c
index f70d23a..567c3b9 100644
--- a/async.c
+++ b/async.c
@@ -253,7 +253,7 @@ aio_ctx_finalize(GSource     *source)
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
-    rfifolock_destroy(&ctx->lock);
+    qemu_rec_mutex_destroy(&ctx->lock);
     qemu_mutex_destroy(&ctx->bh_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
@@ -331,7 +331,7 @@ AioContext *aio_context_new(Error **errp)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
     qemu_mutex_init(&ctx->bh_lock);
-    rfifolock_init(&ctx->lock, NULL, NULL);
+    qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
     return ctx;
@@ -352,10 +352,10 @@ void aio_context_unref(AioContext *ctx)
 
 void aio_context_acquire(AioContext *ctx)
 {
-    rfifolock_lock(&ctx->lock);
+    qemu_rec_mutex_lock(&ctx->lock);
 }
 
 void aio_context_release(AioContext *ctx)
 {
-    rfifolock_unlock(&ctx->lock);
+    qemu_rec_mutex_unlock(&ctx->lock);
 }
diff --git a/include/block/aio.h b/include/block/aio.h
index b41e291..b416621 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -19,7 +19,6 @@
 #include "qemu/queue.h"
 #include "qemu/event_notifier.h"
 #include "qemu/thread.h"
-#include "qemu/rfifolock.h"
 #include "qemu/timer.h"
 
 typedef struct BlockAIOCB BlockAIOCB;
@@ -52,7 +51,7 @@ struct AioContext {
     GSource source;
 
     /* Protects all fields from multi-threaded access */
-    RFifoLock lock;
+    QemuRecMutex lock;
 
     /* The list of registered AIO handlers */
     QLIST_HEAD(, AioHandler) aio_handlers;
diff --git a/include/qemu/rfifolock.h b/include/qemu/rfifolock.h
deleted file mode 100644
index b23ab53..0000000
--- a/include/qemu/rfifolock.h
+++ /dev/null
@@ -1,54 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- *
- */
-
-#ifndef QEMU_RFIFOLOCK_H
-#define QEMU_RFIFOLOCK_H
-
-#include "qemu/thread.h"
-
-/* Recursive FIFO lock
- *
- * This lock provides more features than a plain mutex:
- *
- * 1. Fairness - enforces FIFO order.
- * 2. Nesting - can be taken recursively.
- * 3. Contention callback - optional, called when thread must wait.
- *
- * The recursive FIFO lock is heavyweight so prefer other synchronization
- * primitives if you do not need its features.
- */
-typedef struct {
-    QemuMutex lock;             /* protects all fields */
-
-    /* FIFO order */
-    unsigned int head;          /* active ticket number */
-    unsigned int tail;          /* waiting ticket number */
-    QemuCond cond;              /* used to wait for our ticket number */
-
-    /* Nesting */
-    QemuThread owner_thread;    /* thread that currently has ownership */
-    unsigned int nesting;       /* amount of nesting levels */
-
-    /* Contention callback */
-    void (*cb)(void *);         /* called when thread must wait, with ->lock
-                                 * held so it may not recursively lock/unlock
-                                 */
-    void *cb_opaque;
-} RFifoLock;
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque);
-void rfifolock_destroy(RFifoLock *r);
-void rfifolock_lock(RFifoLock *r);
-void rfifolock_unlock(RFifoLock *r);
-
-#endif /* QEMU_RFIFOLOCK_H */
diff --git a/tests/.gitignore b/tests/.gitignore
index 1e55722..850d64e 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -41,7 +41,6 @@ test-qmp-introspect.[ch]
 test-qmp-marshal.c
 test-qmp-output-visitor
 test-rcu-list
-test-rfifolock
 test-string-input-visitor
 test-string-output-visitor
 test-thread-pool
diff --git a/tests/Makefile b/tests/Makefile
index b937984..135bf5f 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -39,7 +39,6 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
 check-unit-y += tests/test-aio$(EXESUF)
-check-unit-$(CONFIG_POSIX) += tests/test-rfifolock$(EXESUF)
 check-unit-y += tests/test-throttle$(EXESUF)
 gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
 gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
@@ -392,7 +391,6 @@ tests/check-qom-interface$(EXESUF): tests/check-qom-interface.o $(test-qom-obj-y
 tests/check-qom-proplist$(EXESUF): tests/check-qom-proplist.o $(test-qom-obj-y)
 tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(test-block-obj-y)
 tests/test-aio$(EXESUF): tests/test-aio.o $(test-block-obj-y)
-tests/test-rfifolock$(EXESUF): tests/test-rfifolock.o $(test-util-obj-y)
 tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
diff --git a/tests/test-rfifolock.c b/tests/test-rfifolock.c
deleted file mode 100644
index 0572ebb..0000000
--- a/tests/test-rfifolock.c
+++ /dev/null
@@ -1,91 +0,0 @@
-/*
- * RFifoLock tests
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi    <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- */
-
-#include <glib.h>
-#include "qemu-common.h"
-#include "qemu/rfifolock.h"
-
-static void test_nesting(void)
-{
-    RFifoLock lock;
-
-    /* Trivial test, ensure the lock is recursive */
-    rfifolock_init(&lock, NULL, NULL);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_lock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_unlock(&lock);
-    rfifolock_destroy(&lock);
-}
-
-typedef struct {
-    RFifoLock lock;
-    int fd[2];
-} CallbackTestData;
-
-static void rfifolock_cb(void *opaque)
-{
-    CallbackTestData *data = opaque;
-    int ret;
-    char c = 0;
-
-    ret = write(data->fd[1], &c, sizeof(c));
-    g_assert(ret == 1);
-}
-
-static void *callback_thread(void *opaque)
-{
-    CallbackTestData *data = opaque;
-
-    /* The other thread holds the lock so the contention callback will be
-     * invoked...
-     */
-    rfifolock_lock(&data->lock);
-    rfifolock_unlock(&data->lock);
-    return NULL;
-}
-
-static void test_callback(void)
-{
-    CallbackTestData data;
-    QemuThread thread;
-    int ret;
-    char c;
-
-    rfifolock_init(&data.lock, rfifolock_cb, &data);
-    ret = qemu_pipe(data.fd);
-    g_assert(ret == 0);
-
-    /* Hold lock but allow the callback to kick us by writing to the pipe */
-    rfifolock_lock(&data.lock);
-    qemu_thread_create(&thread, "callback_thread",
-                       callback_thread, &data, QEMU_THREAD_JOINABLE);
-    ret = read(data.fd[0], &c, sizeof(c));
-    g_assert(ret == 1);
-    rfifolock_unlock(&data.lock);
-    /* If we got here then the callback was invoked, as expected */
-
-    qemu_thread_join(&thread);
-    close(data.fd[0]);
-    close(data.fd[1]);
-    rfifolock_destroy(&data.lock);
-}
-
-int main(int argc, char **argv)
-{
-    g_test_init(&argc, &argv, NULL);
-    g_test_add_func("/nesting", test_nesting);
-    g_test_add_func("/callback", test_callback);
-    return g_test_run();
-}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 89dd80e..f536cc8 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -23,7 +23,6 @@ util-obj-y += crc32c.o
 util-obj-y += throttle.o
 util-obj-y += getauxval.o
 util-obj-y += readline.o
-util-obj-y += rfifolock.o
 util-obj-y += rcu.o
 util-obj-y += qemu-coroutine.o qemu-coroutine-lock.o qemu-coroutine-io.o
 util-obj-y += qemu-coroutine-sleep.o
diff --git a/util/rfifolock.c b/util/rfifolock.c
deleted file mode 100644
index afbf748..0000000
--- a/util/rfifolock.c
+++ /dev/null
@@ -1,78 +0,0 @@
-/*
- * Recursive FIFO lock
- *
- * Copyright Red Hat, Inc. 2013
- *
- * Authors:
- *  Stefan Hajnoczi   <stefanha@redhat.com>
- *
- * This work is licensed under the terms of the GNU LGPL, version 2 or later.
- * See the COPYING.LIB file in the top-level directory.
- *
- */
-
-#include <assert.h>
-#include "qemu/rfifolock.h"
-
-void rfifolock_init(RFifoLock *r, void (*cb)(void *), void *opaque)
-{
-    qemu_mutex_init(&r->lock);
-    r->head = 0;
-    r->tail = 0;
-    qemu_cond_init(&r->cond);
-    r->nesting = 0;
-    r->cb = cb;
-    r->cb_opaque = opaque;
-}
-
-void rfifolock_destroy(RFifoLock *r)
-{
-    qemu_cond_destroy(&r->cond);
-    qemu_mutex_destroy(&r->lock);
-}
-
-/*
- * Theory of operation:
- *
- * In order to ensure FIFO ordering, implement a ticketlock.  Threads acquiring
- * the lock enqueue themselves by incrementing the tail index.  When the lock
- * is unlocked, the head is incremented and waiting threads are notified.
- *
- * Recursive locking does not take a ticket since the head is only incremented
- * when the outermost recursive caller unlocks.
- */
-void rfifolock_lock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-
-    /* Take a ticket */
-    unsigned int ticket = r->tail++;
-
-    if (r->nesting > 0 && qemu_thread_is_self(&r->owner_thread)) {
-        r->tail--; /* put ticket back, we're nesting */
-    } else {
-        while (ticket != r->head) {
-            /* Invoke optional contention callback */
-            if (r->cb) {
-                r->cb(r->cb_opaque);
-            }
-            qemu_cond_wait(&r->cond, &r->lock);
-        }
-    }
-
-    qemu_thread_get_self(&r->owner_thread);
-    r->nesting++;
-    qemu_mutex_unlock(&r->lock);
-}
-
-void rfifolock_unlock(RFifoLock *r)
-{
-    qemu_mutex_lock(&r->lock);
-    assert(r->nesting > 0);
-    assert(qemu_thread_is_self(&r->owner_thread));
-    if (--r->nesting == 0) {
-        r->head++;
-        qemu_cond_broadcast(&r->cond);
-    }
-    qemu_mutex_unlock(&r->lock);
-}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (18 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
                   ` (20 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This will be used for AioHandlers too.  There is going to be little
or no contention, so it is better to reuse the same lock.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c             | 16 ++++++++--------
 include/block/aio.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/async.c b/async.c
index 567c3b9..7015bad 100644
--- a/async.c
+++ b/async.c
@@ -50,12 +50,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
         .cb = cb,
         .opaque = opaque,
     };
-    qemu_mutex_lock(&ctx->bh_lock);
+    qemu_mutex_lock(&ctx->list_lock);
     bh->next = ctx->first_bh;
     /* Make sure that the members are ready before putting bh into list */
     smp_wmb();
     ctx->first_bh = bh;
-    qemu_mutex_unlock(&ctx->bh_lock);
+    qemu_mutex_unlock(&ctx->list_lock);
     return bh;
 }
 
@@ -97,7 +97,7 @@ int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!ctx->walking_bh) {
-        qemu_mutex_lock(&ctx->bh_lock);
+        qemu_mutex_lock(&ctx->list_lock);
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -108,7 +108,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_mutex_unlock(&ctx->bh_lock);
+        qemu_mutex_unlock(&ctx->list_lock);
     }
 
     return ret;
@@ -239,7 +239,7 @@ aio_ctx_finalize(GSource     *source)
 
     thread_pool_free(ctx->thread_pool);
 
-    qemu_mutex_lock(&ctx->bh_lock);
+    qemu_mutex_lock(&ctx->list_lock);
     while (ctx->first_bh) {
         QEMUBH *next = ctx->first_bh->next;
 
@@ -249,12 +249,12 @@ aio_ctx_finalize(GSource     *source)
         g_free(ctx->first_bh);
         ctx->first_bh = next;
     }
-    qemu_mutex_unlock(&ctx->bh_lock);
+    qemu_mutex_unlock(&ctx->list_lock);
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
     qemu_rec_mutex_destroy(&ctx->lock);
-    qemu_mutex_destroy(&ctx->bh_lock);
+    qemu_mutex_destroy(&ctx->list_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
 
@@ -330,7 +330,7 @@ AioContext *aio_context_new(Error **errp)
                            (EventNotifierHandler *)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
-    qemu_mutex_init(&ctx->bh_lock);
+    qemu_mutex_init(&ctx->list_lock);
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
diff --git a/include/block/aio.h b/include/block/aio.h
index b416621..5d60d68 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -88,7 +88,7 @@ struct AioContext {
     uint32_t notify_me;
 
     /* lock to protect between bh's adders and deleter */
-    QemuMutex bh_lock;
+    QemuMutex list_lock;
 
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (19 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
                   ` (19 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

A QemuLockCnt comprises a counter and a mutex, with primitives
to increment and decrement the counter, and to take and release the
mutex.  It can be used to do lock-free visits to a data structure
whenever mutexes would be too heavyweight and the critical section
is too long for RCU.

This could be implemented simply by protecting the counter with the
mutex, but QemuLockCnt is harder to misuse and more efficient.
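
In a nutshell, the intended usage looks like this (sketch only; the
full patterns are documented in docs/lockcnt.txt below):

    /* visits, possibly concurrent or reentrant */
    qemu_lockcnt_inc(&lockcnt);
    /* ... read the protected data structure ... */
    qemu_lockcnt_dec(&lockcnt);

    /* modifications */
    qemu_lockcnt_lock(&lockcnt);
    /* ... write; free data only if qemu_lockcnt_count() is zero ... */
    qemu_lockcnt_unlock(&lockcnt);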

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/lockcnt.txt      | 343 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/qemu/thread.h |  17 +++
 util/Makefile.objs    |   1 +
 util/lockcnt.c        | 122 ++++++++++++++++++
 4 files changed, 483 insertions(+)
 create mode 100644 docs/lockcnt.txt
 create mode 100644 util/lockcnt.c

diff --git a/docs/lockcnt.txt b/docs/lockcnt.txt
new file mode 100644
index 0000000..fc5d240
--- /dev/null
+++ b/docs/lockcnt.txt
@@ -0,0 +1,343 @@
+DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt)
+===================================================
+
+QEMU often uses reference counts to track data structures that are being
+accessed and should not be freed.  For example, a loop that invokes
+callbacks like this is not safe:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->revents & G_IO_OUT) {
+            ioh->fd_write(ioh->opaque);
+        }
+    }
+
+QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
+by stashing away its "next" pointer.  However, ioh->fd_write could
+actually delete the next node from the list.  The simplest way to
+avoid this is to mark the node as deleted, and remove it from the
+list in the above loop:
+
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            QLIST_REMOVE(ioh, next);
+            g_free(ioh);
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+
+If however this loop must also be reentrant, i.e. it is possible that
+ioh->fd_write invokes the loop again, some kind of counting is needed:
+
+    walking_handlers++;
+    QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            if (walking_handlers == 1) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+    walking_handlers--;
+
+One may think of using the RCU primitives, rcu_read_lock() and
+rcu_read_unlock(); effectively, the RCU nesting count would take
+the place of the walking_handlers global variable.  Indeed,
+reference counting and RCU have similar purposes, but their usage in
+general is complementary:
+
+- reference counting is fine-grained and limited to a single data
+  structure; RCU delays reclamation of *all* RCU-protected data
+  structures;
+
+- reference counting works even in the presence of code that keeps
+  a reference for a long time; RCU critical sections in principle
+  should be kept short;
+
+- reference counting is often applied to code that is not thread-safe
+  but is reentrant; in fact, usage of reference counting in QEMU predates
+  the introduction of threads by many years.  RCU is generally used to
+  protect readers from other threads freeing memory after concurrent
+  modifications to a data structure.
+
+- reclaiming data can be done by a separate thread in the case of RCU;
+  this can improve performance, but also delay reclamation undesirably.
+  With reference counting, reclamation is deterministic.
+
+This file documents QemuLockCnt, an abstraction for using reference
+counting in code that has to be both thread-safe and reentrant.
+
+
+QemuLockCnt concepts
+--------------------
+
+A QemuLockCnt comprises both a counter and a mutex; it has primitives
+to increment and decrement the counter, and to take and release the
+mutex.  The counter notes how many visits to the data structures are
+taking place (the visits could be from different threads, or there could
+be multiple reentrant visits from the same thread).  The basic rules
+governing the counter/mutex pair then are the following:
+
+- Data protected by the QemuLockCnt must not be freed unless the
+  counter is zero and the mutex is taken.
+
+- A new visit cannot be started while the counter is zero and the
+  mutex is taken.
+
+Most of the time, the mutex protects all writes to the data structure,
+not just frees, though there could be cases where this is not necessary.
+
+Reads, instead, can be done without taking the mutex, as long as the
+readers and writers use the same macros that are used for RCU, for
+example atomic_rcu_read, atomic_rcu_set, QLIST_FOREACH_RCU, etc.  This is
+because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
+can happen concurrently with the read.  The RCU API ensures that the
+processor and the compiler see all required memory barriers.
+
+This could be implemented simply by protecting the counter with the
+mutex, for example:
+
+    // (1)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    walking_handlers++;
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+    ...
+
+    // (2)
+    qemu_mutex_lock(&walking_handlers_mutex);
+    if (--walking_handlers == 0) {
+        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+            if (ioh->deleted) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        }
+    }
+    qemu_mutex_unlock(&walking_handlers_mutex);
+
+Here, no frees can happen in the code represented by the ellipsis.
+If another thread is executing critical section (2), that part of
+the code cannot be entered, because the thread will not be able
+to increment the walking_handlers variable.  And of course
+during the visit any other thread will see a nonzero value for
+walking_handlers, as in the single-threaded code.
+
+Note that it is possible for multiple concurrent accesses to delay
+the cleanup arbitrarily; in other words, for the walking_handlers
+counter to never become zero.  For this reason, this technique is
+more easily applicable if concurrent access to the structure is rare.
+
+However, critical sections are easy to forget since you have to do
+them for each modification of the counter.  QemuLockCnt ensures that
+all modifications of the counter take the lock appropriately, and it
+can also be more efficient in two ways:
+
+- it avoids taking the lock for many operations (for example
+  incrementing the counter while it is non-zero);
+
+- on some platforms, one could implement QemuLockCnt to hold the
+  counter and the mutex in a single word, making it no more expensive
+  than simply managing a counter using atomic operations (see
+  docs/atomics.txt).  This is not implemented yet, but can be
+  very helpful if concurrent access to the data structure is
+  expected to be rare.
+
+
+Using the same mutex for frees and writes can still incur some small
+inefficiencies; for example, a visit can never start if the counter is
+zero and the mutex is taken---even if the mutex is taken by a write,
+which in principle need not block a visit of the data structure.
+However, these are usually not a problem if any of the following
+assumptions are valid:
+
+- concurrent access is possible but rare
+
+- writes are rare
+
+- writes are frequent, but this kind of write (e.g. appending to a
+  list) has a very small critical section.
+
+For example, QEMU uses QemuLockCnt to manage an AioContext's list of
+bottom halves and file descriptor handlers.  Modifications to the list
+of file descriptor handlers are rare.  Creation of a new bottom half is
+frequent and can happen on a fast path; however: 1) it is almost never
+concurrent with a visit to the list of bottom halves; 2) it only has
+three instructions in the critical path, two assignments and a smp_wmb().
+
+
+QemuLockCnt API
+---------------
+
+    void qemu_lockcnt_init(QemuLockCnt *lockcnt);
+
+        Initialize lockcnt's counter to zero and prepare its mutex
+        for usage.
+
+    void qemu_lockcnt_destroy(QemuLockCnt *lockcnt);
+
+        Destroy lockcnt's mutex.
+
+    void qemu_lockcnt_inc(QemuLockCnt *lockcnt);
+
+        If the lockcnt's count is zero, wait for critical sections
+        to finish and increment lockcnt's count to 1.  If the count
+        is not zero, just increment it.
+
+        Because this function can wait on the mutex, it must not be
+        called while the lockcnt's mutex is held by the current thread.
+        For the same reason, qemu_lockcnt_inc can also contribute to
+        AB-BA deadlocks.  This is a sample deadlock scenario:
+
+              thread 1                      thread 2
+              -------------------------------------------------------
+              qemu_lockcnt_lock(&lc1);
+                                            qemu_lockcnt_lock(&lc2);
+              qemu_lockcnt_inc(&lc2);
+                                            qemu_lockcnt_inc(&lc1);
+
+    void qemu_lockcnt_dec(QemuLockCnt *lockcnt);
+
+        Decrement lockcnt's count.
+
+    bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt);
+
+        Decrement the count.  If the new count is zero, lock
+        the mutex and return true.  Otherwise, return false.
+
+    bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt);
+
+        If the count is 1, decrement the count to zero, lock
+        the mutex and return true.  Otherwise, return false.
+
+    void qemu_lockcnt_lock(QemuLockCnt *lockcnt);
+
+        Lock the lockcnt's mutex.  Remember that concurrent visits
+        are not blocked unless the count is also zero.  You can
+        use qemu_lockcnt_count to check for this inside a critical
+        section.
+
+    void qemu_lockcnt_unlock(QemuLockCnt *lockcnt);
+
+        Release the lockcnt's mutex.
+
+    void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
+
+        This is the same as
+
+            qemu_lockcnt_unlock(lockcnt);
+            qemu_lockcnt_inc(lockcnt);
+
+        but more efficient.
+
+    unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
+
+        Return the lockcnt's count.  The count can change at any
+        time; still, while the lockcnt is locked, one can usefully
+        check whether the count is non-zero.
+
+
+QemuLockCnt usage
+-----------------
+
+This section explains the typical usage patterns for QemuLockCnt functions.
+
+Setting a variable to a non-NULL value can be done between
+qemu_lockcnt_lock and qemu_lockcnt_unlock:
+
+    qemu_lockcnt_lock(&xyz_lockcnt);
+    if (!xyz) {
+        new_xyz = g_new(XYZ, 1);
+        ...
+        atomic_rcu_set(&xyz, new_xyz);
+    }
+    qemu_lockcnt_unlock(&xyz_lockcnt);
+
+Accessing the value can be done between qemu_lockcnt_inc and
+qemu_lockcnt_dec:
+
+    qemu_lockcnt_inc(&xyz_lockcnt);
+    if (xyz) {
+        XYZ *p = atomic_rcu_read(&xyz);
+        ...
+        /* Accesses can now be done through "p".  */
+    }
+    qemu_lockcnt_dec(&xyz_lockcnt);
+
+Freeing the object can similarly use qemu_lockcnt_lock and
+qemu_lockcnt_unlock, but you also need to ensure that the count
+is zero (i.e. there is no concurrent visit).  Because qemu_lockcnt_inc
+takes the QemuLockCnt's lock, the count cannot become non-zero while
+the object is being freed.  Freeing an object looks like this:
+
+    qemu_lockcnt_lock(&xyz_lockcnt);
+    if (!qemu_lockcnt_count(&xyz_lockcnt)) {
+        g_free(xyz);
+        xyz = NULL;
+    }
+    qemu_lockcnt_unlock(&xyz_lockcnt);
+
+If an object has to be freed right after a visit, you can combine
+the decrement, the locking and the check on count as follows:
+
+    qemu_lockcnt_inc(&xyz_lockcnt);
+    if (xyz) {
+        XYZ *p = atomic_rcu_read(&xyz);
+        ...
+        /* Accesses can now be done through "p".  */
+    }
+    if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) {
+        g_free(xyz);
+        xyz = NULL;
+        qemu_lockcnt_unlock(&xyz_lockcnt);
+    }
+
+QemuLockCnt can also be used to access a list as follows:
+
+    qemu_lockcnt_inc(&io_handlers_lockcnt);
+    QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) {
+        if (ioh->revents & G_IO_OUT) {
+            ioh->fd_write(ioh->opaque);
+        }
+    }
+
+    if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) {
+        QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+            if (ioh->deleted) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+            }
+        }
+        qemu_lockcnt_unlock(&io_handlers_lockcnt);
+    }
+
+Again, the RCU primitives are used because new items can be added to the
+list during the walk.  QLIST_FOREACH_RCU ensures that the processor and
+the compiler see the appropriate memory barriers.
+
+An alternative pattern uses qemu_lockcnt_dec_if_lock:
+
+    qemu_lockcnt_inc(&io_handlers_lockcnt);
+    QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) {
+        if (ioh->deleted) {
+            if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) {
+                QLIST_REMOVE(ioh, next);
+                g_free(ioh);
+                qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt);
+            }
+        } else {
+            if (ioh->revents & G_IO_OUT) {
+                ioh->fd_write(ioh->opaque);
+            }
+        }
+    }
+    qemu_lockcnt_dec(&io_handlers_lockcnt);
+
+Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock,
+because there is no special task to do if the count goes from 1 to 0.
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 981f3dc..9fadca4 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -8,6 +8,7 @@ typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
 typedef struct QemuSemaphore QemuSemaphore;
 typedef struct QemuEvent QemuEvent;
+typedef struct QemuLockCnt QemuLockCnt;
 typedef struct QemuThread QemuThread;
 
 #ifdef _WIN32
@@ -65,4 +66,20 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+struct QemuLockCnt {
+    QemuMutex mutex;
+    unsigned count;
+};
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt);
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt);
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt);
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt);
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt);
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt);
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt);
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt);
+
 #endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index f536cc8..5636f1c 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,5 @@
 util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y += lockcnt.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/lockcnt.c b/util/lockcnt.c
new file mode 100644
index 0000000..304f9d9
--- /dev/null
+++ b/util/lockcnt.c
@@ -0,0 +1,122 @@
+/*
+ * QemuLockCnt implementation
+ *
+ * Copyright Red Hat, Inc. 2015
+ *
+ * Author:
+ *   Paolo Bonzini <pbonzini@redhat.com>
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <errno.h>
+#include <time.h>
+#include <signal.h>
+#include <stdint.h>
+#include <string.h>
+#include <limits.h>
+#include <unistd.h>
+#include <sys/time.h>
+#include "qemu/thread.h"
+#include "qemu/atomic.h"
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_init(&lockcnt->mutex);
+    lockcnt->count = 0;
+}
+
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_destroy(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt)
+{
+    int old;
+    for (;;) {
+        old = atomic_mb_read(&lockcnt->count);
+        if (old == 0) {
+            qemu_lockcnt_lock(lockcnt);
+            qemu_lockcnt_inc_and_unlock(lockcnt);
+            return;
+        } else {
+            if (atomic_cmpxchg(&lockcnt->count, old, old + 1) == old) {
+                return;
+            }
+        }
+    }
+}
+
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt)
+{
+    atomic_dec(&lockcnt->count);
+}
+
+/* Decrement a counter, and return locked if it is decremented to zero.
+ * It is impossible for the counter to become nonzero while the mutex
+ * is taken.
+ */
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    while (val > 1) {
+        int old = atomic_cmpxchg(&lockcnt->count, val, val - 1);
+        if (old != val) {
+            val = old;
+            continue;
+        }
+
+        return false;
+    }
+
+    qemu_lockcnt_lock(lockcnt);
+    if (atomic_fetch_dec(&lockcnt->count) == 1) {
+        return true;
+    }
+
+    qemu_lockcnt_unlock(lockcnt);
+    return false;
+}
+
+/* Decrement a counter and return locked if it is decremented to zero.
+ * Otherwise do nothing.
+ *
+ * It is impossible for the counter to become nonzero while the mutex
+ * is taken.
+ */
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_mb_read(&lockcnt->count);
+    if (val > 1) {
+        return false;
+    }
+
+    qemu_lockcnt_lock(lockcnt);
+    if (atomic_fetch_dec(&lockcnt->count) == 1) {
+        return true;
+    }
+
+    qemu_lockcnt_inc_and_unlock(lockcnt);
+    return false;
+}
+
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_lock(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt)
+{
+    atomic_inc(&lockcnt->count);
+    qemu_mutex_unlock(&lockcnt->mutex);
+}
+
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt)
+{
+    qemu_mutex_unlock(&lockcnt->mutex);
+}
+
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
+{
+    return lockcnt->count;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (20 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
                   ` (18 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This will make it possible to walk the list of bottom halves without
holding the AioContext lock---and in turn to call bottom half
handlers without holding the lock.
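
In sketch form, aio_bh_poll() becomes:

    qemu_lockcnt_inc(&ctx->list_lock);           /* start a visit */
    /* ... walk ctx->first_bh using atomic_rcu_read() ... */
    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* count dropped to zero: safe to reap deleted bottom halves */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }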

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c             | 31 ++++++++++++++-----------------
 include/block/aio.h | 12 +++++-------
 2 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/async.c b/async.c
index 7015bad..4c1f658 100644
--- a/async.c
+++ b/async.c
@@ -50,12 +50,12 @@ QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
         .cb = cb,
         .opaque = opaque,
     };
-    qemu_mutex_lock(&ctx->list_lock);
+    qemu_lockcnt_lock(&ctx->list_lock);
     bh->next = ctx->first_bh;
     /* Make sure that the members are ready before putting bh into list */
     smp_wmb();
     ctx->first_bh = bh;
-    qemu_mutex_unlock(&ctx->list_lock);
+    qemu_lockcnt_unlock(&ctx->list_lock);
     return bh;
 }
 
@@ -70,13 +70,11 @@ int aio_bh_poll(AioContext *ctx)
     QEMUBH *bh, **bhp, *next;
     int ret;
 
-    ctx->walking_bh++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
     ret = 0;
-    for (bh = ctx->first_bh; bh; bh = next) {
-        /* Make sure that fetching bh happens before accessing its members */
-        smp_read_barrier_depends();
-        next = bh->next;
+    for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
+        next = atomic_rcu_read(&bh->next);
         /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
          * implicit memory barrier ensures that the callback sees all writes
          * done by the scheduling thread.  It also ensures that the scheduling
@@ -93,11 +91,8 @@ int aio_bh_poll(AioContext *ctx)
         }
     }
 
-    ctx->walking_bh--;
-
     /* remove deleted bhs */
-    if (!ctx->walking_bh) {
-        qemu_mutex_lock(&ctx->list_lock);
+    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -108,7 +103,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_mutex_unlock(&ctx->list_lock);
+        qemu_lockcnt_unlock(&ctx->list_lock);
     }
 
     return ret;
@@ -164,7 +159,8 @@ aio_compute_timeout(AioContext *ctx)
     int timeout = -1;
     QEMUBH *bh;
 
-    for (bh = ctx->first_bh; bh; bh = bh->next) {
+    for (bh = atomic_rcu_read(&ctx->first_bh); bh;
+         bh = atomic_rcu_read(&bh->next)) {
         if (!bh->deleted && bh->scheduled) {
             if (bh->idle) {
                 /* idle bottom halves will be polled at least
@@ -239,7 +235,8 @@ aio_ctx_finalize(GSource     *source)
 
     thread_pool_free(ctx->thread_pool);
 
-    qemu_mutex_lock(&ctx->list_lock);
+    qemu_lockcnt_lock(&ctx->list_lock);
+    assert(!qemu_lockcnt_count(&ctx->list_lock));
     while (ctx->first_bh) {
         QEMUBH *next = ctx->first_bh->next;
 
@@ -249,12 +246,12 @@ aio_ctx_finalize(GSource     *source)
         g_free(ctx->first_bh);
         ctx->first_bh = next;
     }
-    qemu_mutex_unlock(&ctx->list_lock);
+    qemu_lockcnt_unlock(&ctx->list_lock);
 
     aio_set_event_notifier(ctx, &ctx->notifier, false, NULL);
     event_notifier_cleanup(&ctx->notifier);
     qemu_rec_mutex_destroy(&ctx->lock);
-    qemu_mutex_destroy(&ctx->list_lock);
+    qemu_lockcnt_destroy(&ctx->list_lock);
     timerlistgroup_deinit(&ctx->tlg);
 }
 
@@ -330,7 +327,7 @@ AioContext *aio_context_new(Error **errp)
                            (EventNotifierHandler *)
                            event_notifier_dummy_cb);
     ctx->thread_pool = NULL;
-    qemu_mutex_init(&ctx->list_lock);
+    qemu_lockcnt_init(&ctx->list_lock);
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
 
diff --git a/include/block/aio.h b/include/block/aio.h
index 5d60d68..26aa658 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -87,17 +87,15 @@ struct AioContext {
      */
     uint32_t notify_me;
 
-    /* lock to protect between bh's adders and deleter */
-    QemuMutex list_lock;
+    /* A lock to protect between bh's adders and deleter, and to ensure
+     * that no callbacks are removed while we're walking and dispatching
+     * them.
+     */
+    QemuLockCnt list_lock;
 
     /* Anchor of the list of Bottom Halves belonging to the context */
     struct QEMUBH *first_bh;
 
-    /* A simple lock used to protect the first_bh list, and ensure that
-     * no callbacks are removed while we're walking and dispatching callbacks.
-     */
-    int walking_bh;
-
     /* Used by aio_notify.
      *
      * "notified" is used to avoid expensive event_notifier_test_and_clear
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (21 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
                   ` (17 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is complex, but I think it is reasonably documented in the source.
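
The key point is that the lock state and the counter now share a
single word, so the uncontended paths need only one cmpxchg.  A value
of lockcnt->count decodes as follows (illustration only):

    int val = atomic_read(&lockcnt->count);
    int state = val & QEMU_LOCKCNT_STATE_MASK;    /* FREE/LOCKED/WAITING */
    int count = val >> QEMU_LOCKCNT_COUNT_SHIFT;  /* number of visitors */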

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/lockcnt.txt         |   9 +-
 include/qemu/futex.h     |  36 ++++++
 include/qemu/thread.h    |   3 +
 trace-events             |  10 ++
 util/lockcnt.c           | 282 +++++++++++++++++++++++++++++++++++++++++++++++
 util/qemu-thread-posix.c |  25 +----
 6 files changed, 336 insertions(+), 29 deletions(-)
 create mode 100644 include/qemu/futex.h

diff --git a/docs/lockcnt.txt b/docs/lockcnt.txt
index fc5d240..594764b 100644
--- a/docs/lockcnt.txt
+++ b/docs/lockcnt.txt
@@ -142,12 +142,11 @@ can also be more efficient in two ways:
 - it avoids taking the lock for many operations (for example
   incrementing the counter while it is non-zero);
 
-- on some platforms, one could implement QemuLockCnt to hold the
-  counter and the mutex in a single word, making it no more expensive
+- on some platforms, one can implement QemuLockCnt to hold the counter
+  and the mutex in a single word, making the fast path no more expensive
   than simply managing a counter using atomic operations (see
-  docs/atomics.txt).  This is not implemented yet, but can be
-  very helpful if concurrent access to the data structure is
-  expected to be rare.
+  docs/atomics.txt).  This can be very helpful if concurrent access to
+  the data structure is expected to be rare.
 
 
 Using the same mutex for frees and writes can still incur some small
diff --git a/include/qemu/futex.h b/include/qemu/futex.h
new file mode 100644
index 0000000..c3d1089
--- /dev/null
+++ b/include/qemu/futex.h
@@ -0,0 +1,36 @@
+/*
+ * Wrappers around Linux futex syscall
+ *
+ * Copyright Red Hat, Inc. 2015
+ *
+ * Author:
+ *  Paolo Bonzini <pbonzini@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include <sys/syscall.h>
+#include <linux/futex.h>
+
+#define futex(...)              syscall(__NR_futex, __VA_ARGS__)
+
+static inline void futex_wake(void *f, int n)
+{
+    futex(f, FUTEX_WAKE, n, NULL, NULL, 0);
+}
+
+static inline void futex_wait(void *f, unsigned val)
+{
+    while (futex(f, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
+        switch (errno) {
+        case EWOULDBLOCK:
+            return;
+        case EINTR:
+            break; /* get out of switch and retry */
+        default:
+            abort();
+        }
+    }
+}
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index 9fadca4..22d92d2 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,7 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include "config-host.h"
 #include <inttypes.h>
 #include <stdbool.h>
 
@@ -67,7 +68,9 @@ void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
 struct QemuLockCnt {
+#ifndef CONFIG_LINUX
     QemuMutex mutex;
+#endif
     unsigned count;
 };
 
diff --git a/trace-events b/trace-events
index c230d02..d667e4c 100644
--- a/trace-events
+++ b/trace-events
@@ -1427,6 +1427,16 @@ hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long c
 hbitmap_reset(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
 hbitmap_set(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
 
+# util/lockcnt.c
+lockcnt_fast_path_attempt(const void *lockcnt, int expected, int new) "lockcnt %p fast path %d->%d"
+lockcnt_fast_path_success(const void *lockcnt, int expected, int new) "lockcnt %p fast path %d->%d succeeded"
+lockcnt_unlock_attempt(const void *lockcnt, int expected, int new) "lockcnt %p unlock %d->%d"
+lockcnt_unlock_success(const void *lockcnt, int expected, int new) "lockcnt %p unlock %d->%d succeeded"
+lockcnt_futex_wait_prepare(const void *lockcnt, int expected, int new) "lockcnt %p preparing slow path %d->%d"
+lockcnt_futex_wait(const void *lockcnt, int val) "lockcnt %p waiting on %d"
+lockcnt_futex_wait_resume(const void *lockcnt, int new) "lockcnt %p after wait: %d"
+lockcnt_futex_wake(const void *lockcnt) "lockcnt %p waking up one waiter"
+
 # target-s390x/mmu_helper.c
 get_skeys_nonzero(int rc) "SKEY: Call to get_skeys unexpectedly returned %d"
 set_skeys_nonzero(int rc) "SKEY: Call to set_skeys unexpectedly returned %d"
diff --git a/util/lockcnt.c b/util/lockcnt.c
index 304f9d9..56eb29e 100644
--- a/util/lockcnt.c
+++ b/util/lockcnt.c
@@ -18,7 +18,288 @@
 #include <sys/time.h>
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
+#include "trace.h"
 
+#ifdef CONFIG_LINUX
+#include "qemu/futex.h"
+
+/* On Linux, bits 0-1 are a futex-based lock, bits 2-31 are the counter.
+ * For the mutex algorithm see Ulrich Drepper's "Futexes Are Tricky" (ok,
+ * this is not the most relaxing citation I could make...).  It is similar
+ * to mutex2 in the paper.
+ */
+
+#define QEMU_LOCKCNT_STATE_MASK    3
+#define QEMU_LOCKCNT_STATE_FREE    0
+#define QEMU_LOCKCNT_STATE_LOCKED  1
+#define QEMU_LOCKCNT_STATE_WAITING 2
+
+#define QEMU_LOCKCNT_COUNT_STEP    4
+#define QEMU_LOCKCNT_COUNT_SHIFT   2
+
+void qemu_lockcnt_init(QemuLockCnt *lockcnt)
+{
+    lockcnt->count = 0;
+}
+
+void qemu_lockcnt_destroy(QemuLockCnt *lockcnt)
+{
+}
+
+/* *val is the current value of lockcnt->count.
+ *
+ * If the lock is free, try a cmpxchg from *val to new_if_free; return
+ * true and set *val to the old value found by the cmpxchg in
+ * lockcnt->count.
+ *
+ * If the lock is taken, wait for it to be released and return false
+ * *without trying again to take the lock*.  Again, set *val to the
+ * new value of lockcnt->count.
+ *
+ * new_if_free's bottom two bits must not be QEMU_LOCKCNT_STATE_LOCKED
+ * if calling this function a second time after it has returned
+ * false.
+ */
+static bool qemu_lockcnt_cmpxchg_or_wait(QemuLockCnt *lockcnt, int *val,
+                                         int new_if_free, bool *waited)
+{
+    /* Fast path for when the lock is free.  */
+    if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_FREE) {
+        int expected = *val;
+
+        trace_lockcnt_fast_path_attempt(lockcnt, expected, new_if_free);
+        *val = atomic_cmpxchg(&lockcnt->count, expected, new_if_free);
+        if (*val == expected) {
+            trace_lockcnt_fast_path_success(lockcnt, expected, new_if_free);
+            *val = new_if_free;
+            return true;
+        }
+    }
+
+    /* The slow path moves from locked to waiting if necessary, then
+     * does a futex wait.  Both steps can be repeated ad nauseam,
+     * only getting out of the loop if we can have another shot at the
+     * fast path.  Once we can, get out to compute the new destination
+     * value for the fast path.
+     */
+    while ((*val & QEMU_LOCKCNT_STATE_MASK) != QEMU_LOCKCNT_STATE_FREE) {
+        if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_LOCKED) {
+            int expected = *val;
+            int new = expected - QEMU_LOCKCNT_STATE_LOCKED + QEMU_LOCKCNT_STATE_WAITING;
+
+            trace_lockcnt_futex_wait_prepare(lockcnt, expected, new);
+            *val = atomic_cmpxchg(&lockcnt->count, expected, new);
+            if (*val == expected) {
+                *val = new;
+            }
+            continue;
+        }
+
+        if ((*val & QEMU_LOCKCNT_STATE_MASK) == QEMU_LOCKCNT_STATE_WAITING) {
+            *waited = true;
+            trace_lockcnt_futex_wait(lockcnt, *val);
+            futex_wait(&lockcnt->count, *val);
+            *val = atomic_read(&lockcnt->count);
+            trace_lockcnt_futex_wait_resume(lockcnt, *val);
+            continue;
+        }
+
+        abort();
+    }
+    return false;
+}
+
+static void lockcnt_wake(QemuLockCnt *lockcnt)
+{
+    trace_lockcnt_futex_wake(lockcnt);
+    futex_wake(&lockcnt->count, 1);
+}
+
+void qemu_lockcnt_inc(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    bool waited = false;
+
+    for (;;) {
+        if (val >= QEMU_LOCKCNT_COUNT_STEP) {
+            int expected = val;
+            val = atomic_cmpxchg(&lockcnt->count, val, val + QEMU_LOCKCNT_COUNT_STEP);
+            if (val == expected) {
+                break;
+            }
+        } else {
+            /* The fast path is (0, unlocked)->(1, unlocked).  */
+            if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, QEMU_LOCKCNT_COUNT_STEP,
+                                             &waited)) {
+                break;
+            }
+        }
+    }
+
+    /* If we were woken by another thread, we should also wake one because
+     * we are effectively releasing the lock that was given to us.  This is
+     * the case where qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING
+     * in the low bits, and qemu_lockcnt_inc_and_unlock would find it and
+     * wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+void qemu_lockcnt_dec(QemuLockCnt *lockcnt)
+{
+    atomic_sub(&lockcnt->count, QEMU_LOCKCNT_COUNT_STEP);
+}
+
+/* Decrement a counter, and return locked if it is decremented to zero.
+ * If the function returns true, it is impossible for the counter to
+ * become nonzero until the next qemu_lockcnt_unlock.
+ */
+bool qemu_lockcnt_dec_and_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int locked_state = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    for (;;) {
+        if (val >= 2 * QEMU_LOCKCNT_COUNT_STEP) {
+            int expected = val;
+            int new = val - QEMU_LOCKCNT_COUNT_STEP;
+            val = atomic_cmpxchg(&lockcnt->count, val, new);
+            if (val == expected) {
+                break;
+            }
+            /* cmpxchg failed: retry from the top.  Falling through with
+             * count still >= 2 would let the fast path of
+             * qemu_lockcnt_cmpxchg_or_wait overwrite the counter.
+             */
+            continue;
+        }
+
+        /* If count is going 1->0, take the lock. The fast path is
+         * (1, unlocked)->(0, locked) or (1, unlocked)->(0, waiting).
+         */
+        if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, locked_state, &waited)) {
+            return true;
+        }
+
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            locked_state = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+
+    /* If we were woken by another thread, but we're returning in unlocked
+     * state, we should also wake a thread because we are effectively
+     * releasing the lock that was given to us.  This is the case where
+     * qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING in the low
+     * bits, and qemu_lockcnt_unlock would find it and wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+    return false;
+}
+
+/* If the counter is one, decrement it and return locked.  Otherwise do
+ * nothing.
+ *
+ * If the function returns true, it is impossible for the counter to
+ * become nonzero until the next qemu_lockcnt_unlock.
+ */
+bool qemu_lockcnt_dec_if_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int locked_state = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    while (val < 2 * QEMU_LOCKCNT_COUNT_STEP) {
+        /* If count is going 1->0, take the lock. The fast path is
+         * (1, unlocked)->(0, locked) or (1, unlocked)->(0, waiting).
+         */
+        if (qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, locked_state, &waited)) {
+            return true;
+        }
+
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            locked_state = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+
+    /* If we were woken by another thread, but we're returning in unlocked
+     * state, we should also wake a thread because we are effectively
+     * releasing the lock that was given to us.  This is the case where
+     * qemu_lockcnt_lock would leave QEMU_LOCKCNT_STATE_WAITING in the low
+     * bits, and qemu_lockcnt_inc_and_unlock would find it and wake someone.
+     */
+    if (waited) {
+        lockcnt_wake(lockcnt);
+    }
+    return false;
+}
+
+void qemu_lockcnt_lock(QemuLockCnt *lockcnt)
+{
+    int val = atomic_read(&lockcnt->count);
+    int step = QEMU_LOCKCNT_STATE_LOCKED;
+    bool waited = false;
+
+    /* The third argument is only used if the low bits of val are 0
+     * (QEMU_LOCKCNT_STATE_FREE), so just blindly mix in the desired
+     * state.
+     */
+    while (!qemu_lockcnt_cmpxchg_or_wait(lockcnt, &val, val + step, &waited)) {
+        if (waited) {
+            /* At this point we do not know if there are more waiters.  Assume
+             * there are.
+             */
+            step = QEMU_LOCKCNT_STATE_WAITING;
+        }
+    }
+}
+
+void qemu_lockcnt_inc_and_unlock(QemuLockCnt *lockcnt)
+{
+    int expected, new, val;
+
+    val = atomic_read(&lockcnt->count);
+    do {
+        expected = val;
+        new = (val + QEMU_LOCKCNT_COUNT_STEP) & ~QEMU_LOCKCNT_STATE_MASK;
+        trace_lockcnt_unlock_attempt(lockcnt, val, new);
+        val = atomic_cmpxchg(&lockcnt->count, val, new);
+    } while (val != expected);
+
+    trace_lockcnt_unlock_success(lockcnt, val, new);
+    if (val & QEMU_LOCKCNT_STATE_WAITING) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+void qemu_lockcnt_unlock(QemuLockCnt *lockcnt)
+{
+    int expected, new, val;
+
+    val = atomic_read(&lockcnt->count);
+    do {
+        expected = val;
+        new = val & ~QEMU_LOCKCNT_STATE_MASK;
+        trace_lockcnt_unlock_attempt(lockcnt, val, new);
+        val = atomic_cmpxchg(&lockcnt->count, val, new);
+    } while (val != expected);
+
+    trace_lockcnt_unlock_success(lockcnt, val, new);
+    if (val & QEMU_LOCKCNT_STATE_WAITING) {
+        lockcnt_wake(lockcnt);
+    }
+}
+
+unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
+{
+    return lockcnt->count >> QEMU_LOCKCNT_COUNT_SHIFT;
+}
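+
+/* Typical usage (a sketch, not an API contract; see the aio-posix
+ * conversion later in this series for the real thing).  Visitors keep
+ * the count elevated while they walk the protected structure; whoever
+ * finds a deleted node frees it only if it can take the lock while
+ * dropping the last reference:
+ *
+ *     qemu_lockcnt_inc(&lockcnt);
+ *     ...walk the list; on finding a deleted node...
+ *     if (qemu_lockcnt_dec_if_lock(&lockcnt)) {
+ *         QLIST_REMOVE(node, node);
+ *         g_free(node);
+ *         qemu_lockcnt_inc_and_unlock(&lockcnt);
+ *     }
+ *     ...done walking...
+ *     qemu_lockcnt_dec(&lockcnt);
+ */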
+#else
 void qemu_lockcnt_init(QemuLockCnt *lockcnt)
 {
     qemu_mutex_init(&lockcnt->mutex);
@@ -120,3 +401,4 @@ unsigned qemu_lockcnt_count(QemuLockCnt *lockcnt)
 {
     return lockcnt->count;
 }
+#endif
diff --git a/util/qemu-thread-posix.c b/util/qemu-thread-posix.c
index 4cc9f13..9e44e23 100644
--- a/util/qemu-thread-posix.c
+++ b/util/qemu-thread-posix.c
@@ -20,10 +20,6 @@
 #include <limits.h>
 #include <unistd.h>
 #include <sys/time.h>
-#ifdef __linux__
-#include <sys/syscall.h>
-#include <linux/futex.h>
-#endif
 #include "qemu/thread.h"
 #include "qemu/atomic.h"
 #include "qemu/notify.h"
@@ -302,26 +298,7 @@ void qemu_sem_wait(QemuSemaphore *sem)
 }
 
 #ifdef __linux__
-#define futex(...)              syscall(__NR_futex, __VA_ARGS__)
-
-static inline void futex_wake(QemuEvent *ev, int n)
-{
-    futex(ev, FUTEX_WAKE, n, NULL, NULL, 0);
-}
-
-static inline void futex_wait(QemuEvent *ev, unsigned val)
-{
-    while (futex(ev, FUTEX_WAIT, (int) val, NULL, NULL, 0)) {
-        switch (errno) {
-        case EWOULDBLOCK:
-            return;
-        case EINTR:
-            break; /* get out of switch and retry */
-        default:
-            abort();
-        }
-    }
-}
+#include "qemu/futex.h"
 #else
 static inline void futex_wake(QemuEvent *ev, int n)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (22 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
                   ` (16 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 26 ++++++++++++--------------
 aio-win32.c | 26 ++++++++++++--------------
 2 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 482b316..370fefe 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -294,7 +294,7 @@ bool aio_pending(AioContext *ctx)
 
 bool aio_dispatch(AioContext *ctx)
 {
-    AioHandler *node;
+    AioHandler *node, *tmp;
     bool progress = false;
 
     /*
@@ -310,12 +310,10 @@ bool aio_dispatch(AioContext *ctx)
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    node = QLIST_FIRST(&ctx->aio_handlers);
-    while (node) {
-        AioHandler *tmp;
-        int revents;
+    ctx->walking_handlers++;
 
-        ctx->walking_handlers++;
+    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
+        int revents;
 
         revents = node->pfd.revents & node->pfd.events;
         node->pfd.revents = 0;
@@ -337,17 +335,17 @@ bool aio_dispatch(AioContext *ctx)
             progress = true;
         }
 
-        tmp = node;
-        node = QLIST_NEXT(node, node);
-
-        ctx->walking_handlers--;
-
-        if (!ctx->walking_handlers && tmp->deleted) {
-            QLIST_REMOVE(tmp, node);
-            g_free(tmp);
+        if (node->deleted) {
+            ctx->walking_handlers--;
+            if (!ctx->walking_handlers) {
+                QLIST_REMOVE(node, node);
+                g_free(node);
+            }
+            ctx->walking_handlers++;
         }
     }
 
+    ctx->walking_handlers--;
+
     /* Run our timers */
     progress |= timerlistgroup_run_timers(&ctx->tlg);
 
diff --git a/aio-win32.c b/aio-win32.c
index cdc4456..a05cffe 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -208,20 +208,18 @@ bool aio_pending(AioContext *ctx)
 
 static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 {
-    AioHandler *node;
+    AioHandler *node, *tmp;
     bool progress = false;
 
+    ctx->walking_handlers++;
+
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    node = QLIST_FIRST(&ctx->aio_handlers);
-    while (node) {
-        AioHandler *tmp;
+    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
         int revents = node->pfd.revents;
 
-        ctx->walking_handlers++;
-
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
@@ -256,17 +254,17 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
             }
         }
 
-        tmp = node;
-        node = QLIST_NEXT(node, node);
-
-        ctx->walking_handlers--;
-
-        if (!ctx->walking_handlers && tmp->deleted) {
-            QLIST_REMOVE(tmp, node);
-            g_free(tmp);
+        if (node->deleted) {
+            ctx->walking_handlers--;
+            if (!ctx->walking_handlers) {
+                QLIST_REMOVE(node, node);
+                g_free(node);
+            }
+            ctx->walking_handlers++;
         }
     }
 
+    ctx->walking_handlers--;
     return progress;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (23 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
                   ` (15 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 55 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 370fefe..212c203 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -15,7 +15,7 @@
 
 #include "qemu-common.h"
 #include "block/block.h"
-#include "qemu/queue.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/sockets.h"
 #ifdef CONFIG_EPOLL
 #include <sys/epoll.h>
@@ -212,6 +212,8 @@ void aio_set_fd_handler(AioContext *ctx,
     bool is_new = false;
     bool deleted = false;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
+
     node = find_aio_handler(ctx, fd);
 
     /* Are we deleting the fd handler? */
@@ -219,14 +221,14 @@ void aio_set_fd_handler(AioContext *ctx,
         if (node) {
             g_source_remove_poll(&ctx->source, &node->pfd);
 
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 deleted = true;
@@ -237,7 +239,7 @@ void aio_set_fd_handler(AioContext *ctx,
             /* Alloc and insert if it's not already there */
             node = g_new0(AioHandler, 1);
             node->pfd.fd = fd;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
 
             g_source_add_poll(&ctx->source, &node->pfd);
             is_new = true;
@@ -253,6 +255,7 @@ void aio_set_fd_handler(AioContext *ctx,
     }
 
     aio_epoll_update(ctx, node, is_new);
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
     if (deleted) {
         g_free(node);
@@ -276,20 +279,30 @@ bool aio_prepare(AioContext *ctx)
 bool aio_pending(AioContext *ctx)
 {
     AioHandler *node;
+    bool result = false;
 
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         int revents;
 
         revents = node->pfd.revents & node->pfd.events;
         if (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR) && node->io_read) {
-            return true;
+            result = true;
+            break;
         }
         if (revents & (G_IO_OUT | G_IO_ERR) && node->io_write) {
-            return true;
+            result = true;
+            break;
         }
     }
+    qemu_lockcnt_dec(&ctx->list_lock);
 
-    return false;
+    return result;
 }
 
 bool aio_dispatch(AioContext *ctx)
@@ -310,13 +323,12 @@ bool aio_dispatch(AioContext *ctx)
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    ctx->walking_handlers++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
-    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
+    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
         int revents;
 
-        revents = node->pfd.revents & node->pfd.events;
-        node->pfd.revents = 0;
+        revents = atomic_xchg(&node->pfd.revents, 0) & node->pfd.events;
 
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
@@ -336,15 +348,15 @@ bool aio_dispatch(AioContext *ctx)
         }
 
         if (node->deleted) {
-            ctx->walking_handlers--;
-            if (!ctx->walking_handlers) {
+            if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
                 QLIST_REMOVE(node, node);
                 g_free(node);
+                qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
             }
-            ctx->walking_handlers++;
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
     progress |= timerlistgroup_run_timers(&ctx->tlg);
@@ -419,12 +431,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
         atomic_add(&ctx->notify_me, 2);
     }
 
-    ctx->walking_handlers++;
-
+    qemu_lockcnt_inc(&ctx->list_lock);
     assert(npfd == 0);
 
     /* fill pollfds */
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!node->deleted && node->pfd.events
             && !aio_epoll_enabled(ctx)
             && aio_node_check(ctx, node->is_external)) {
@@ -461,12 +472,12 @@ bool aio_poll(AioContext *ctx, bool blocking)
     /* if we have any readable fds, dispatch event */
     if (ret > 0) {
         for (i = 0; i < npfd; i++) {
-            nodes[i]->pfd.revents = pollfds[i].revents;
+            atomic_or(&nodes[i]->pfd.revents, pollfds[i].revents);
         }
     }
 
     npfd = 0;
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run dispatch even if there were no readable fds to run timers */
     if (aio_dispatch(ctx)) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 26/40] aio-win32: remove walking_handlers, protecting AioHandler list with list_lock
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (24 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
                   ` (14 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-win32.c | 81 +++++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 32 deletions(-)

diff --git a/aio-win32.c b/aio-win32.c
index a05cffe..fe89b20 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -42,6 +42,7 @@ void aio_set_fd_handler(AioContext *ctx,
     /* fd is a SOCKET in our case */
     AioHandler *node;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
     QLIST_FOREACH(node, &ctx->aio_handlers, node) {
         if (node->pfd.fd == fd && !node->deleted) {
             break;
@@ -51,14 +52,14 @@ void aio_set_fd_handler(AioContext *ctx,
     /* Are we deleting the fd handler? */
     if (!io_read && !io_write) {
         if (node) {
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 g_free(node);
@@ -71,7 +72,7 @@ void aio_set_fd_handler(AioContext *ctx,
             /* Alloc and insert if it's not already there */
             node = g_new0(AioHandler, 1);
             node->pfd.fd = fd;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
         }
 
         node->pfd.events = 0;
@@ -96,6 +97,7 @@ void aio_set_fd_handler(AioContext *ctx,
                        FD_CONNECT | FD_WRITE | FD_OOB);
     }
 
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 }
 
@@ -106,6 +108,7 @@ void aio_set_event_notifier(AioContext *ctx,
 {
     AioHandler *node;
 
+    qemu_lockcnt_lock(&ctx->list_lock);
     QLIST_FOREACH(node, &ctx->aio_handlers, node) {
         if (node->e == e && !node->deleted) {
             break;
@@ -117,14 +120,14 @@ void aio_set_event_notifier(AioContext *ctx,
         if (node) {
             g_source_remove_poll(&ctx->source, &node->pfd);
 
-            /* If the lock is held, just mark the node as deleted */
-            if (ctx->walking_handlers) {
+            /* If aio_poll is in progress, just mark the node as deleted */
+            if (qemu_lockcnt_count(&ctx->list_lock)) {
                 node->deleted = 1;
                 node->pfd.revents = 0;
             } else {
                 /* Otherwise, delete it for real.  We can't just mark it as
                  * deleted because deleted nodes are only cleaned up after
-                 * releasing the walking_handlers lock.
+                 * releasing the list_lock.
                  */
                 QLIST_REMOVE(node, node);
                 g_free(node);
@@ -138,7 +141,7 @@ void aio_set_event_notifier(AioContext *ctx,
             node->pfd.fd = (uintptr_t)event_notifier_get_handle(e);
             node->pfd.events = G_IO_IN;
             node->is_external = is_external;
-            QLIST_INSERT_HEAD(&ctx->aio_handlers, node, node);
+            QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, node, node);
 
             g_source_add_poll(&ctx->source, &node->pfd);
         }
@@ -146,6 +149,7 @@ void aio_set_event_notifier(AioContext *ctx,
         node->io_notify = io_notify;
     }
 
+    qemu_lockcnt_unlock(&ctx->list_lock);
     aio_notify(ctx);
 }
 
@@ -156,10 +160,16 @@ bool aio_prepare(AioContext *ctx)
     bool have_select_revents = false;
     fd_set rfds, wfds;
 
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+
     /* fill fd sets */
     FD_ZERO(&rfds);
     FD_ZERO(&wfds);
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (node->io_read) {
             FD_SET ((SOCKET)node->pfd.fd, &rfds);
         }
@@ -169,61 +179,71 @@ bool aio_prepare(AioContext *ctx)
     }
 
     if (select(0, &rfds, &wfds, NULL, &tv0) > 0) {
-        QLIST_FOREACH(node, &ctx->aio_handlers, node) {
-            node->pfd.revents = 0;
+        QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
             if (FD_ISSET(node->pfd.fd, &rfds)) {
-                node->pfd.revents |= G_IO_IN;
+                atomic_or(&node->pfd.revents, G_IO_IN);
                 have_select_revents = true;
             }
 
             if (FD_ISSET(node->pfd.fd, &wfds)) {
-                node->pfd.revents |= G_IO_OUT;
+                atomic_or(&node->pfd.revents, G_IO_OUT);
                 have_select_revents = true;
             }
         }
     }
 
+    qemu_lockcnt_dec(&ctx->list_lock);
     return have_select_revents;
 }
 
 bool aio_pending(AioContext *ctx)
 {
     AioHandler *node;
+    bool result = false;
 
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    /*
+     * We have to walk very carefully in case aio_set_fd_handler is
+     * called while we're walking.
+     */
+    qemu_lockcnt_inc(&ctx->list_lock);
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (node->pfd.revents && node->io_notify) {
-            return true;
+            result = true;
+            break;
         }
 
         if ((node->pfd.revents & G_IO_IN) && node->io_read) {
-            return true;
+            result = true;
+            break;
         }
         if ((node->pfd.revents & G_IO_OUT) && node->io_write) {
-            return true;
+            result = true;
+            break;
         }
     }
 
-    return false;
+    qemu_lockcnt_dec(&ctx->list_lock);
+    return result;
 }
 
 static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 {
     AioHandler *node, *tmp;
     bool progress = false;
 
-    ctx->walking_handlers++;
+    qemu_lockcnt_inc(&ctx->list_lock);
 
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
      */
-    QLIST_FOREACH_SAFE(node, &ctx->aio_handlers, node, tmp) {
-        int revents = node->pfd.revents;
+    QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
+        int revents = atomic_xchg(&node->pfd.revents, 0);
 
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
-            node->pfd.revents = 0;
             node->io_notify(node->e);
 
             /* aio_notify() does not count as progress */
@@ -234,7 +254,6 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
 
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
-            node->pfd.revents = 0;
             if ((revents & G_IO_IN) && node->io_read) {
                 node->io_read(node->opaque);
                 progress = true;
@@ -255,16 +274,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         }
 
         if (node->deleted) {
-            ctx->walking_handlers--;
-            if (!ctx->walking_handlers) {
+            if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
                 QLIST_REMOVE(node, node);
                 g_free(node);
+                qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
             }
-            ctx->walking_handlers++;
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
     return progress;
 }
 
@@ -300,20 +318,19 @@ bool aio_poll(AioContext *ctx, bool blocking)
         atomic_add(&ctx->notify_me, 2);
     }
 
+    qemu_lockcnt_inc(&ctx->list_lock);
     have_select_revents = aio_prepare(ctx);
 
-    ctx->walking_handlers++;
-
     /* fill fd sets */
     count = 0;
-    QLIST_FOREACH(node, &ctx->aio_handlers, node) {
+    QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
         if (!node->deleted && node->io_notify
             && aio_node_check(ctx, node->is_external)) {
             events[count++] = event_notifier_get_handle(node->e);
         }
     }
 
-    ctx->walking_handlers--;
+    qemu_lockcnt_dec(&ctx->list_lock);
     first = true;
 
     /* ctx->notifier is always registered.  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 27/40] aio: document locking
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (25 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
                   ` (13 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/multiple-iothreads.txt |  5 ++---
 include/block/aio.h         | 24 +++++++++++-------------
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 723cc7e..4197f62 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -84,9 +84,8 @@ How to synchronize with an IOThread
 AioContext is not thread-safe so some rules must be followed when using file
 descriptors, event notifiers, timers, or BHs across threads:
 
-1. AioContext functions can be called safely from file descriptor, event
-notifier, timer, or BH callbacks invoked by the AioContext.  No locking is
-necessary.
+1. AioContext functions can always be called safely.  They handle their
+own locking internally.
 
 2. Other threads wishing to access the AioContext must use
 aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
diff --git a/include/block/aio.h b/include/block/aio.h
index 26aa658..21044af 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -50,18 +50,12 @@ typedef void IOHandler(void *opaque);
 struct AioContext {
     GSource source;
 
-    /* Protects all fields from multi-threaded access */
+    /* Used by AioContext users to protect from multi-threaded access.  */
     QemuRecMutex lock;
 
-    /* The list of registered AIO handlers */
+    /* The list of registered AIO handlers.  Protected by ctx->list_lock. */
     QLIST_HEAD(, AioHandler) aio_handlers;
 
-    /* This is a simple lock used to protect the aio_handlers list.
-     * Specifically, it's used to ensure that no callbacks are removed while
-     * we're walking and dispatching callbacks.
-     */
-    int walking_handlers;
-
     /* Used to avoid unnecessary event_notifier_set calls in aio_notify;
      * accessed with atomic primitives.  If this field is 0, everything
      * (file descriptors, bottom halves, timers) will be re-evaluated
@@ -87,9 +81,9 @@ struct AioContext {
      */
     uint32_t notify_me;
 
-    /* A lock to protect between bh's adders and deleter, and to ensure
-     * that no callbacks are removed while we're walking and dispatching
-     * them.
+    /* A lock to protect between QEMUBH and AioHandler adders and deleters,
+     * and to ensure that no callbacks are removed while we're walking and
+     * dispatching them.
      */
     QemuLockCnt list_lock;
 
@@ -111,10 +105,14 @@ struct AioContext {
     bool notified;
     EventNotifier notifier;
 
-    /* Thread pool for performing work and receiving completion callbacks */
+    /* Thread pool for performing work and receiving completion callbacks.
+     * Has its own locking.
+     */
     struct ThreadPool *thread_pool;
 
-    /* TimerLists for calling timers - one per clock type */
+    /* TimerLists for calling timers - one per clock type.  Has its own
+     * locking.
+     */
     QEMUTimerListGroup tlg;
 
     int external_disable_cnt;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (26 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
                   ` (12 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

The AioContext data structures are now protected by list_lock and/or
they are walked with FOREACH_RCU primitives.  There is no longer any
need to acquire the AioContext for the entire duration of aio_dispatch.
Instead, just acquire it around the invocation of each callback.  The
next step will be to push it further down.
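
Concretely, the dispatch loop now brackets just the callback invocation,
along these lines (a sketch of the aio-posix.c hunk below, not new code):

    if (!node->deleted &&
        (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
        node->io_read) {
        aio_context_acquire(ctx);
        node->io_read(node->opaque);
        aio_context_release(ctx);
    }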

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c | 15 ++++++---------
 aio-win32.c | 16 ++++++++--------
 async.c     |  2 ++
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 212c203..2b41a02 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -333,7 +333,9 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             node->io_read) {
+            aio_context_acquire(ctx);
             node->io_read(node->opaque);
+            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -343,7 +345,9 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             node->io_write) {
+            aio_context_acquire(ctx);
             node->io_write(node->opaque);
+            aio_context_release(ctx);
             progress = true;
         }
 
@@ -359,7 +363,9 @@ bool aio_dispatch(AioContext *ctx)
     qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
+    aio_context_release(ctx);
 
     return progress;
 }
@@ -417,7 +423,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     bool progress;
     int64_t timeout;
 
-    aio_context_acquire(ctx);
     progress = false;
 
     /* aio_notify can avoid the expensive event_notifier_set if
@@ -446,9 +451,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     timeout = blocking ? aio_compute_timeout(ctx) : 0;
 
     /* wait until next event */
-    if (timeout) {
-        aio_context_release(ctx);
-    }
     if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
         AioHandler epoll_handler;
 
@@ -463,9 +465,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     if (blocking) {
         atomic_sub(&ctx->notify_me, 2);
     }
-    if (timeout) {
-        aio_context_acquire(ctx);
-    }
 
     aio_notify_accept(ctx);
 
@@ -484,8 +483,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress = true;
     }
 
-    aio_context_release(ctx);
-
     return progress;
 }
 
diff --git a/aio-win32.c b/aio-win32.c
index fe89b20..b025b3d 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -244,7 +244,9 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
+            aio_context_acquire(ctx);
             node->io_notify(node->e);
+            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -255,11 +257,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
             if ((revents & G_IO_IN) && node->io_read) {
+                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
+                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
+                aio_context_release(ctx);
                 progress = true;
             }
 
@@ -304,7 +310,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
     int count;
     int timeout;
 
-    aio_context_acquire(ctx);
     progress = false;
 
     /* aio_notify can avoid the expensive event_notifier_set if
@@ -346,17 +351,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
         timeout = blocking && !have_select_revents
             ? qemu_timeout_ns_to_ms(aio_compute_timeout(ctx)) : 0;
-        if (timeout) {
-            aio_context_release(ctx);
-        }
         ret = WaitForMultipleObjects(count, events, FALSE, timeout);
         if (blocking) {
             assert(first);
             atomic_sub(&ctx->notify_me, 2);
         }
-        if (timeout) {
-            aio_context_acquire(ctx);
-        }
 
         if (first) {
             aio_notify_accept(ctx);
@@ -379,9 +378,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
+    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-
     aio_context_release(ctx);
+
     return progress;
 }
 
diff --git a/async.c b/async.c
index 4c1f658..03fd05a 100644
--- a/async.c
+++ b/async.c
@@ -87,7 +87,9 @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
+            aio_context_acquire(ctx);
             aio_bh_call(bh);
+            aio_context_release(ctx);
         }
     }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (27 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
                   ` (11 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This is simpler than taking and releasing the AioContext lock.
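
With the lock no longer held for the whole of aio_dispatch, the
alternative would have been something like this in the callback
(hypothetical; bdrv_get_aio_context is the existing helper):

    aio_context_acquire(bdrv_get_aio_context(acb->common.bs));
    if (--acb->rewrite_count == 0) {
        quorum_aio_finalize(acb);
    }
    aio_context_release(bdrv_get_aio_context(acb->common.bs));

whereas the atomic decrement needs no lock at all:

    if (atomic_fetch_dec(&acb->rewrite_count) == 1) {
        quorum_aio_finalize(acb);
    }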

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/quorum.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index b9ba028..3b19c9e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -250,15 +250,10 @@ static void quorum_rewrite_aio_cb(void *opaque, int ret)
 {
     QuorumAIOCB *acb = opaque;
 
-    /* one less rewrite to do */
-    acb->rewrite_count--;
-
     /* wait until all rewrite callbacks have completed */
-    if (acb->rewrite_count) {
-        return;
+    if (atomic_fetch_dec(&acb->rewrite_count) == 1) {
+        quorum_aio_finalize(acb);
     }
-
-    quorum_aio_finalize(acb);
 }
 
 static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb);
@@ -361,7 +356,7 @@ static bool quorum_rewrite_bad_versions(BDRVQuorumState *s, QuorumAIOCB *acb,
     }
 
     /* quorum_rewrite_aio_cb will count down this to zero */
-    acb->rewrite_count = count;
+    atomic_mb_set(&acb->rewrite_count, count);
 
     /* now fire the correcting rewrites */
     QLIST_FOREACH(version, &acb->votes.vote_list, next) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (28 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
                   ` (10 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

The two cases of quorum_aio_cb are called from different paths, so
clarify that with an assertion.  Also note that the FIFO read pattern
does not do rewrites.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/quorum.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/block/quorum.c b/block/quorum.c
index 3b19c9e..d7a0f11 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -271,28 +271,37 @@ static void quorum_copy_qiov(QEMUIOVector *dest, QEMUIOVector *source)
     }
 }
 
-static void quorum_aio_cb(void *opaque, int ret)
+static void quorum_fifo_aio_cb(void *opaque, int ret)
 {
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
-    bool rewrite = false;
 
-    if (acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO) {
-        /* We try to read next child in FIFO order if we fail to read */
-        if (ret < 0 && ++acb->child_iter < s->num_children) {
-            read_fifo_child(acb);
-            return;
-        }
+    assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
 
-        if (ret == 0) {
-            quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
-        }
-        acb->vote_ret = ret;
-        quorum_aio_finalize(acb);
+    /* We try to read next child in FIFO order if we fail to read */
+    if (ret < 0 && ++acb->child_iter < s->num_children) {
+        read_fifo_child(acb);
         return;
     }
 
+    if (ret == 0) {
+        quorum_copy_qiov(acb->qiov, &acb->qcrs[acb->child_iter].qiov);
+    }
+    acb->vote_ret = ret;
+
+    /* FIXME: rewrite failed children if acb->child_iter > 0? */
+
+    quorum_aio_finalize(acb);
+}
+
+static void quorum_aio_cb(void *opaque, int ret)
+{
+    QuorumChildRequest *sacb = opaque;
+    QuorumAIOCB *acb = sacb->parent;
+    BDRVQuorumState *s = acb->common.bs->opaque;
+    bool rewrite = false;
+
     sacb->ret = ret;
     acb->count++;
     if (ret == 0) {
@@ -659,7 +668,7 @@ static BlockAIOCB *read_fifo_child(QuorumAIOCB *acb)
                      acb->qcrs[acb->child_iter].buf);
     bdrv_aio_readv(s->children[acb->child_iter]->bs, acb->sector_num,
                    &acb->qcrs[acb->child_iter].qiov, acb->nb_sectors,
-                   quorum_aio_cb, &acb->qcrs[acb->child_iter]);
+                   quorum_fifo_aio_cb, &acb->qcrs[acb->child_iter]);
 
     return &acb->common;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (29 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
                   ` (9 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

qed_aio_start_io and qed_aio_next_io will not have to acquire/release
the AioContext, while qed_aio_next_io_cb will.  Split the functionality
and gain a little type-safety in the process.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 block/qed.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 9b88895..3d6aa07 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -270,7 +270,19 @@ static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
     return l2_table;
 }
 
-static void qed_aio_next_io(void *opaque, int ret);
+static void qed_aio_next_io(QEDAIOCB *acb, int ret);
+
+static void qed_aio_start_io(QEDAIOCB *acb)
+{
+    qed_aio_next_io(acb, 0);
+}
+
+static void qed_aio_next_io_cb(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+
+    qed_aio_next_io(acb, ret);
+}
 
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
 {
@@ -289,7 +301,7 @@ static void qed_unplug_allocating_write_reqs(BDRVQEDState *s)
 
     acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
     if (acb) {
-        qed_aio_next_io(acb, 0);
+        qed_aio_start_io(acb);
     }
 }
 
@@ -956,7 +968,7 @@ static void qed_aio_complete(QEDAIOCB *acb, int ret)
         QSIMPLEQ_REMOVE_HEAD(&s->allocating_write_reqs, next);
         acb = QSIMPLEQ_FIRST(&s->allocating_write_reqs);
         if (acb) {
-            qed_aio_next_io(acb, 0);
+            qed_aio_start_io(acb);
         } else if (s->header.features & QED_F_NEED_CHECK) {
             qed_start_need_check_timer(s);
         }
@@ -981,7 +993,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
     assert(acb->request.l2_table != NULL);
 
-    qed_aio_next_io(opaque, ret);
+    qed_aio_next_io(acb, ret);
 }
 
 /**
@@ -1029,11 +1041,11 @@ static void qed_aio_write_l2_update(QEDAIOCB *acb, int ret, uint64_t offset)
     if (need_alloc) {
         /* Write out the whole new L2 table */
         qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
-                            qed_aio_write_l1_update, acb);
+                           qed_aio_write_l1_update, acb);
     } else {
         /* Write out only the updated part of the L2 table */
         qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
-                            qed_aio_next_io, acb);
+                           qed_aio_next_io_cb, acb);
     }
     return;
 
@@ -1085,7 +1097,7 @@ static void qed_aio_write_main(void *opaque, int ret)
     }
 
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
-        next_fn = qed_aio_next_io;
+        next_fn = qed_aio_next_io_cb;
     } else {
         if (s->bs->backing) {
             next_fn = qed_aio_write_flush_before_l2_update;
@@ -1198,7 +1210,7 @@ static void qed_aio_write_alloc(QEDAIOCB *acb, size_t len)
     if (acb->flags & QED_AIOCB_ZERO) {
         /* Skip ahead if the clusters are already zero */
         if (acb->find_cluster_ret == QED_CLUSTER_ZERO) {
-            qed_aio_next_io(acb, 0);
+            qed_aio_start_io(acb);
             return;
         }
 
@@ -1318,18 +1330,18 @@ static void qed_aio_read_data(void *opaque, int ret,
     /* Handle zero cluster and backing file reads */
     if (ret == QED_CLUSTER_ZERO) {
         qemu_iovec_memset(&acb->cur_qiov, 0, 0, acb->cur_qiov.size);
-        qed_aio_next_io(acb, 0);
+        qed_aio_start_io(acb);
         return;
     } else if (ret != QED_CLUSTER_FOUND) {
         qed_read_backing_file(s, acb->cur_pos, &acb->cur_qiov,
-                              &acb->backing_qiov, qed_aio_next_io, acb);
+                              &acb->backing_qiov, qed_aio_next_io_cb, acb);
         return;
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
     bdrv_aio_readv(bs->file->bs, offset / BDRV_SECTOR_SIZE,
                    &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
-                   qed_aio_next_io, acb);
+                   qed_aio_next_io_cb, acb);
     return;
 
 err:
@@ -1339,9 +1351,8 @@ err:
 /**
  * Begin next I/O or complete the request
  */
-static void qed_aio_next_io(void *opaque, int ret)
+static void qed_aio_next_io(QEDAIOCB *acb, int ret)
 {
-    QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
     QEDFindClusterFunc *io_fn = (acb->flags & QED_AIOCB_WRITE) ?
                                 qed_aio_write_data : qed_aio_read_data;
@@ -1397,7 +1408,7 @@ static BlockAIOCB *qed_aio_setup(BlockDriverState *bs,
     qemu_iovec_init(&acb->cur_qiov, qiov->niov);
 
     /* Start request */
-    qed_aio_next_io(acb, 0);
+    qed_aio_start_io(acb);
     return &acb->common;
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (30 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
                   ` (8 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c                     |  4 ----
 aio-win32.c                     |  6 ------
 block/curl.c                    | 16 +++++++++++---
 block/iscsi.c                   |  4 ++++
 block/nbd-client.c              | 14 ++++++++++--
 block/nfs.c                     |  6 ++++++
 block/sheepdog.c                | 29 +++++++++++++++----------
 block/ssh.c                     | 47 ++++++++++++++++++++---------------------
 block/win32-aio.c               | 10 +++++----
 hw/block/virtio-blk.c           |  5 ++++-
 hw/scsi/virtio-scsi-dataplane.c |  2 ++
 hw/scsi/virtio-scsi.c           |  7 ++++++
 nbd.c                           |  4 ++++
 13 files changed, 99 insertions(+), 55 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 2b41a02..972f3ff 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -333,9 +333,7 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
             node->io_read) {
-            aio_context_acquire(ctx);
             node->io_read(node->opaque);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->opaque != &ctx->notifier) {
@@ -345,9 +343,7 @@ bool aio_dispatch(AioContext *ctx)
         if (!node->deleted &&
             (revents & (G_IO_OUT | G_IO_ERR)) &&
             node->io_write) {
-            aio_context_acquire(ctx);
             node->io_write(node->opaque);
-            aio_context_release(ctx);
             progress = true;
         }
 
diff --git a/aio-win32.c b/aio-win32.c
index b025b3d..1b50019 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -244,9 +244,7 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (revents || event_notifier_get_handle(node->e) == event) &&
             node->io_notify) {
-            aio_context_acquire(ctx);
             node->io_notify(node->e);
-            aio_context_release(ctx);
 
             /* aio_notify() does not count as progress */
             if (node->e != &ctx->notifier) {
@@ -257,15 +255,11 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         if (!node->deleted &&
             (node->io_read || node->io_write)) {
             if ((revents & G_IO_IN) && node->io_read) {
-                aio_context_acquire(ctx);
                 node->io_read(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
             if ((revents & G_IO_OUT) && node->io_write) {
-                aio_context_acquire(ctx);
                 node->io_write(node->opaque);
-                aio_context_release(ctx);
                 progress = true;
             }
 
diff --git a/block/curl.c b/block/curl.c
index 8994182..3d7e1cb 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -332,9 +332,8 @@ static void curl_multi_check_completion(BDRVCURLState *s)
     }
 }
 
-static void curl_multi_do(void *arg)
+static void curl_multi_do_locked(CURLState *s)
 {
-    CURLState *s = (CURLState *)arg;
     int running;
     int r;
 
@@ -348,12 +347,23 @@ static void curl_multi_do(void *arg)
 
 }
 
+static void curl_multi_do(void *arg)
+{
+    CURLState *s = (CURLState *)arg;
+
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
+    aio_context_release(s->s->aio_context);
+}
+
 static void curl_multi_read(void *arg)
 {
     CURLState *s = (CURLState *)arg;
 
-    curl_multi_do(arg);
+    aio_context_acquire(s->s->aio_context);
+    curl_multi_do_locked(s);
     curl_multi_check_completion(s->s);
+    aio_context_release(s->s->aio_context);
 }
 
 static void curl_multi_timeout_do(void *arg)
diff --git a/block/iscsi.c b/block/iscsi.c
index bd1f1bf..16c3b44 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -380,8 +380,10 @@ iscsi_process_read(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLIN);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void
@@ -390,8 +392,10 @@ iscsi_process_write(void *arg)
     IscsiLun *iscsilun = arg;
     struct iscsi_context *iscsi = iscsilun->iscsi;
 
+    aio_context_acquire(iscsilun->aio_context);
     iscsi_service(iscsi, POLLOUT);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static int64_t sector_lun2qemu(int64_t sector, IscsiLun *iscsilun)
diff --git a/block/nbd-client.c b/block/nbd-client.c
index b7fd17a..b0a888d 100644
--- a/block/nbd-client.c
+++ b/block/nbd-client.c
@@ -56,9 +56,8 @@ static void nbd_teardown_connection(BlockDriverState *bs)
     client->sock = -1;
 }
 
-static void nbd_reply_ready(void *opaque)
+static void nbd_reply_ready_locked(BlockDriverState *bs)
 {
-    BlockDriverState *bs = opaque;
     NbdClientSession *s = nbd_get_client_session(bs);
     uint64_t i;
     int ret;
@@ -95,11 +94,22 @@ fail:
     nbd_teardown_connection(bs);
 }
 
+static void nbd_reply_ready(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+
+    aio_context_acquire(bdrv_get_aio_context(bs));
+    nbd_reply_ready_locked(bs);
+    aio_context_release(bdrv_get_aio_context(bs));
+}
+
 static void nbd_restart_write(void *opaque)
 {
     BlockDriverState *bs = opaque;
 
+    aio_context_acquire(bdrv_get_aio_context(bs));
     qemu_coroutine_enter(nbd_get_client_session(bs)->send_coroutine, NULL);
+    aio_context_release(bdrv_get_aio_context(bs));
 }
 
 static int nbd_co_send_request(BlockDriverState *bs,
diff --git a/block/nfs.c b/block/nfs.c
index fd79f89..910a51e 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -75,15 +75,21 @@ static void nfs_set_events(NFSClient *client)
 static void nfs_process_read(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLIN);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_process_write(void *arg)
 {
     NFSClient *client = arg;
+
+    aio_context_acquire(client->aio_context);
     nfs_service(client->context, POLLOUT);
     nfs_set_events(client);
+    aio_context_release(client->aio_context);
 }
 
 static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
diff --git a/block/sheepdog.c b/block/sheepdog.c
index d80e4ed..1113043 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -621,13 +621,6 @@ static coroutine_fn int send_co_req(int sockfd, SheepdogReq *hdr, void *data,
     return ret;
 }
 
-static void restart_co_req(void *opaque)
-{
-    Coroutine *co = opaque;
-
-    qemu_coroutine_enter(co, NULL);
-}
-
 typedef struct SheepdogReqCo {
     int sockfd;
     AioContext *aio_context;
@@ -637,12 +630,21 @@ typedef struct SheepdogReqCo {
     unsigned int *rlen;
     int ret;
     bool finished;
+    Coroutine *co;
 } SheepdogReqCo;
 
+static void restart_co_req(void *opaque)
+{
+    SheepdogReqCo *srco = opaque;
+
+    aio_context_acquire(srco->aio_context);
+    qemu_coroutine_enter(srco->co, NULL);
+    aio_context_release(srco->aio_context);
+}
+
 static coroutine_fn void do_co_req(void *opaque)
 {
     int ret;
-    Coroutine *co;
     SheepdogReqCo *srco = opaque;
     int sockfd = srco->sockfd;
     SheepdogReq *hdr = srco->hdr;
@@ -650,9 +652,9 @@ static coroutine_fn void do_co_req(void *opaque)
     unsigned int *wlen = srco->wlen;
     unsigned int *rlen = srco->rlen;
 
-    co = qemu_coroutine_self();
+    srco->co = qemu_coroutine_self();
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       NULL, restart_co_req, co);
+                       NULL, restart_co_req, srco);
 
     ret = send_co_req(sockfd, hdr, data, wlen);
     if (ret < 0) {
@@ -660,7 +662,7 @@ static coroutine_fn void do_co_req(void *opaque)
     }
 
     aio_set_fd_handler(srco->aio_context, sockfd, false,
-                       restart_co_req, NULL, co);
+                       restart_co_req, NULL, srco);
 
     ret = qemu_co_recv(sockfd, hdr, sizeof(*hdr));
     if (ret != sizeof(*hdr)) {
@@ -688,6 +690,7 @@ out:
     aio_set_fd_handler(srco->aio_context, sockfd, false,
                        NULL, NULL, NULL);
 
+    srco->co = NULL;
     srco->ret = ret;
     srco->finished = true;
 }
@@ -917,14 +920,18 @@ static void co_read_response(void *opaque)
         s->co_recv = qemu_coroutine_create(aio_read_response);
     }
 
+    aio_context_acquire(s->aio_context);
     qemu_coroutine_enter(s->co_recv, opaque);
+    aio_context_release(s->aio_context);
 }
 
 static void co_write_request(void *opaque)
 {
     BDRVSheepdogState *s = opaque;
 
+    aio_context_acquire(s->aio_context);
     qemu_coroutine_enter(s->co_send, NULL);
+    aio_context_release(s->aio_context);
 }
 
 /*
diff --git a/block/ssh.c b/block/ssh.c
index af025c0..00cda3f 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -772,20 +772,34 @@ static int ssh_has_zero_init(BlockDriverState *bs)
     return has_zero_init;
 }
 
+typedef struct BDRVSSHRestart {
+    Coroutine *co;
+    AioContext *ctx;
+} BDRVSSHRestart;
+
 static void restart_coroutine(void *opaque)
 {
-    Coroutine *co = opaque;
+    BDRVSSHRestart *restart = opaque;
 
-    DPRINTF("co=%p", co);
+    DPRINTF("ctx=%p co=%p", restart->ctx, restart->co);
 
-    qemu_coroutine_enter(co, NULL);
+    aio_context_acquire(restart->ctx);
+    qemu_coroutine_enter(restart->co, NULL);
+    aio_context_release(restart->ctx);
 }
 
-static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
+/* A non-blocking call returned EAGAIN, so yield, ensuring the
+ * handlers are set up so that we'll be rescheduled when there is an
+ * interesting event on the socket.
+ */
+static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
 {
     int r;
     IOHandler *rd_handler = NULL, *wr_handler = NULL;
-    Coroutine *co = qemu_coroutine_self();
+    BDRVSSHRestart restart = {
+        .ctx = bdrv_get_aio_context(bs),
+        .co = qemu_coroutine_self()
+    };
 
     r = libssh2_session_block_directions(s->session);
 
@@ -800,26 +814,11 @@ static coroutine_fn void set_fd_handler(BDRVSSHState *s, BlockDriverState *bs)
             rd_handler, wr_handler);
 
     aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, rd_handler, wr_handler, co);
-}
-
-static coroutine_fn void clear_fd_handler(BDRVSSHState *s,
-                                          BlockDriverState *bs)
-{
-    DPRINTF("s->sock=%d", s->sock);
-    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock,
-                       false, NULL, NULL, NULL);
-}
-
-/* A non-blocking call returned EAGAIN, so yield, ensuring the
- * handlers are set up so that we'll be rescheduled when there is an
- * interesting event on the socket.
- */
-static coroutine_fn void co_yield(BDRVSSHState *s, BlockDriverState *bs)
-{
-    set_fd_handler(s, bs);
+                       false, rd_handler, wr_handler, &restart);
     qemu_coroutine_yield();
-    clear_fd_handler(s, bs);
+    DPRINTF("s->sock=%d - back", s->sock);
+    aio_set_fd_handler(bdrv_get_aio_context(bs), s->sock, false,
+                       NULL, NULL, NULL);
 }
 
 /* SFTP has a function `libssh2_sftp_seek64' which seeks to a position
diff --git a/block/win32-aio.c b/block/win32-aio.c
index bbf2f01..85aac85 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -40,7 +40,7 @@ struct QEMUWin32AIOState {
     HANDLE hIOCP;
     EventNotifier e;
     int count;
-    bool is_aio_context_attached;
+    AioContext *aio_ctx;
 };
 
 typedef struct QEMUWin32AIOCB {
@@ -87,7 +87,9 @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }
 
 
+    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
+    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }
 
@@ -175,13 +177,13 @@ void win32_aio_detach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *old_context)
 {
     aio_set_event_notifier(old_context, &aio->e, false, NULL);
-    aio->is_aio_context_attached = false;
+    aio->aio_ctx = NULL;
 }
 
 void win32_aio_attach_aio_context(QEMUWin32AIOState *aio,
                                   AioContext *new_context)
 {
-    aio->is_aio_context_attached = true;
+    aio->aio_ctx = new_context;
     aio_set_event_notifier(new_context, &aio->e, false,
                            win32_aio_completion_cb);
 }
@@ -211,7 +213,7 @@ out_free_state:
 
 void win32_aio_cleanup(QEMUWin32AIOState *aio)
 {
-    assert(!aio->is_aio_context_attached);
+    assert(!aio->aio_ctx);
     CloseHandle(aio->hIOCP);
     event_notifier_cleanup(&aio->e);
     g_free(aio);
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e83d823..d72942e 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -147,7 +147,8 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
 {
     VirtIOBlockIoctlReq *ioctl_req = opaque;
     VirtIOBlockReq *req = ioctl_req->req;
-    VirtIODevice *vdev = VIRTIO_DEVICE(req->dev);
+    VirtIOBlock *s = req->dev;
+    VirtIODevice *vdev = VIRTIO_DEVICE(s);
     struct virtio_scsi_inhdr *scsi;
     struct sg_io_hdr *hdr;
 
@@ -599,6 +600,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         return;
     }
 
+    aio_context_acquire(blk_get_aio_context(s->blk));
     blk_io_plug(s->blk);
 
     while ((req = virtio_blk_get_request(s))) {
@@ -610,6 +612,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     }
 
     blk_io_unplug(s->blk);
+    aio_context_release(blk_get_aio_context(s->blk));
 }
 
 static void virtio_blk_dma_restart_bh(void *opaque)
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index b1745b2..194ce40 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -59,9 +59,11 @@ static int virtio_scsi_vring_init(VirtIOSCSI *s, VirtQueue *vq, int n)
 
 void virtio_scsi_dataplane_notify(VirtIODevice *vdev, VirtIOSCSIReq *req)
 {
+    VirtIOSCSI *s = VIRTIO_SCSI(vdev);
     if (virtio_should_notify(vdev, req->vq)) {
         event_notifier_set(virtio_queue_get_guest_notifier(req->vq));
     }
+    aio_context_release(s->ctx);
 }
 
 /* assumes s->ctx held */
diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index 4054ce5..8afa489 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -420,9 +420,11 @@ static void virtio_scsi_handle_ctrl(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+    aio_context_acquire(s->ctx);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         virtio_scsi_handle_ctrl_req(s, req);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_complete_cmd_req(VirtIOSCSIReq *req)
@@ -570,6 +572,8 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+
+    aio_context_acquire(s->ctx);
     while ((req = virtio_scsi_pop_req(s, vq))) {
         if (virtio_scsi_handle_cmd_req_prepare(s, req)) {
             QTAILQ_INSERT_TAIL(&reqs, req, next);
@@ -579,6 +583,7 @@ static void virtio_scsi_handle_cmd(VirtIODevice *vdev, VirtQueue *vq)
     QTAILQ_FOREACH_SAFE(req, &reqs, next, next) {
         virtio_scsi_handle_cmd_req_submit(s, req);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_get_config(VirtIODevice *vdev,
@@ -732,9 +737,11 @@ static void virtio_scsi_handle_event(VirtIODevice *vdev, VirtQueue *vq)
         virtio_scsi_dataplane_start(s);
         return;
     }
+    aio_context_acquire(s->ctx);
     if (s->events_dropped) {
         virtio_scsi_push_event(s, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
     }
+    aio_context_release(s->ctx);
 }
 
 static void virtio_scsi_change(SCSIBus *bus, SCSIDevice *dev, SCSISense sense)
diff --git a/nbd.c b/nbd.c
index b3d9654..2867f34 100644
--- a/nbd.c
+++ b/nbd.c
@@ -1445,6 +1445,10 @@ static void nbd_restart_write(void *opaque)
 static void nbd_set_handlers(NBDClient *client)
 {
     if (client->exp && client->exp->ctx) {
+        /* Note that the handlers do not expect any concurrency; qemu-nbd
+         * does not instantiate multiple AioContexts yet, nor does it call
+         * aio_poll/aio_dispatch from multiple threads.
+         */
         aio_set_fd_handler(client->exp->ctx, client->sock,
                            true,
                            client->can_read ? nbd_read : NULL,
-- 
1.8.3.1

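The sheepdog and ssh hunks above share one pattern: the opaque pointer passed to aio_set_fd_handler() now carries both the coroutine to re-enter and the AioContext it belongs to, so the restart handler can take the context lock around qemu_coroutine_enter().  A minimal sketch of that pattern; Restart, restart_cb and wait_for_read are illustrative names, not identifiers from the patch:

    typedef struct Restart {
        Coroutine *co;
        AioContext *ctx;
    } Restart;

    static void restart_cb(void *opaque)
    {
        Restart *r = opaque;

        /* Resume the coroutine under the lock of its own AioContext. */
        aio_context_acquire(r->ctx);
        qemu_coroutine_enter(r->co, NULL);
        aio_context_release(r->ctx);
    }

    /* Yield until fd is readable, then tear the handler back down. */
    static coroutine_fn void wait_for_read(BlockDriverState *bs, int fd)
    {
        Restart r = {
            .co  = qemu_coroutine_self(),
            .ctx = bdrv_get_aio_context(bs),
        };

        aio_set_fd_handler(r.ctx, fd, false, restart_cb, NULL, &r);
        qemu_coroutine_yield();
        aio_set_fd_handler(r.ctx, fd, false, NULL, NULL, NULL);
    }

ssh keeps the struct on the coroutine's stack as above; sheepdog instead stores the coroutine pointer in the long-lived SheepdogReqCo, but the locking discipline is the same.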

* [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (31 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
                   ` (7 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
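With the async.c hunk below, aio_bh_poll() stops wrapping every bottom half in aio_context_acquire()/release(); each bottom half that completes block-layer work has to take the lock itself.  A minimal sketch of the shape the drivers end up with; MyAIOCB and my_complete_bh are illustrative names modeled on blkdebug's error_callback_bh:

    typedef struct MyAIOCB {
        BlockAIOCB common;
        QEMUBH *bh;
        int ret;
    } MyAIOCB;

    static void my_complete_bh(void *opaque)
    {
        MyAIOCB *acb = opaque;
        AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

        qemu_bh_delete(acb->bh);
        aio_context_acquire(ctx);     /* previously done by aio_bh_poll() */
        acb->common.cb(acb->common.opaque, acb->ret);
        aio_context_release(ctx);
        qemu_aio_unref(acb);
    }
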
 async.c               |  2 --
 block/archipelago.c   |  3 +++
 block/blkdebug.c      |  4 ++++
 block/blkverify.c     |  3 +++
 block/block-backend.c |  4 ++++
 block/curl.c          | 25 +++++++++++++++++--------
 block/gluster.c       |  2 ++
 block/io.c            |  6 ++++++
 block/iscsi.c         |  6 ++++++
 block/linux-aio.c     |  7 +++++++
 block/nfs.c           |  4 ++++
 block/null.c          |  4 ++++
 block/qed.c           | 13 +++++++++++++
 block/qed.h           |  3 +++
 block/rbd.c           |  4 ++++
 dma-helpers.c         |  7 +++++--
 hw/block/virtio-blk.c |  2 ++
 hw/scsi/scsi-bus.c    |  2 ++
 thread-pool.c         |  2 ++
 19 files changed, 91 insertions(+), 12 deletions(-)

diff --git a/async.c b/async.c
index 03fd05a..4c1f658 100644
--- a/async.c
+++ b/async.c
@@ -87,9 +87,7 @@ int aio_bh_poll(AioContext *ctx)
                 ret = 1;
             }
             bh->idle = 0;
-            aio_context_acquire(ctx);
             aio_bh_call(bh);
-            aio_context_release(ctx);
         }
     }
 
diff --git a/block/archipelago.c b/block/archipelago.c
index 855655c..7f69a3f 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -312,9 +312,12 @@ static void qemu_archipelago_complete_aio(void *opaque)
 {
     AIORequestData *reqdata = (AIORequestData *) opaque;
     ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
+    AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
 
     qemu_bh_delete(aio_cb->bh);
+    aio_context_acquire(ctx);
     aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
+    aio_context_release(ctx);
     aio_cb->status = 0;
 
     qemu_aio_unref(aio_cb);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index 6860a2b..ba35185 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -458,8 +458,12 @@ out:
 static void error_callback_bh(void *opaque)
 {
     struct BlkdebugAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index c5f8e8d..3ff681a 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -188,13 +188,16 @@ static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write,
 static void blkverify_aio_bh(void *opaque)
 {
     BlkverifyAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     qemu_bh_delete(acb->bh);
     if (acb->buf) {
         qemu_iovec_destroy(&acb->raw_qiov);
         qemu_vfree(acb->buf);
     }
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/block-backend.c b/block/block-backend.c
index 36ccc9e..8549289 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -637,8 +637,12 @@ int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 static void error_callback_bh(void *opaque)
 {
     struct BlockBackendAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/curl.c b/block/curl.c
index 3d7e1cb..17add0a 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -653,10 +653,14 @@ static void curl_readv_bh_cb(void *p)
 {
     CURLState *state;
     int running;
+    int ret = -EINPROGRESS;
 
     CURLAIOCB *acb = p;
-    BDRVCURLState *s = acb->common.bs->opaque;
+    BlockDriverState *bs = acb->common.bs;
+    BDRVCURLState *s = bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(bs);
 
+    aio_context_acquire(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
 
@@ -670,7 +674,7 @@ static void curl_readv_bh_cb(void *p)
             qemu_aio_unref(acb);
             // fall through
         case FIND_RET_WAIT:
-            return;
+            goto out;
         default:
             break;
     }
@@ -678,9 +682,8 @@ static void curl_readv_bh_cb(void *p)
     // No cache found, so let's start a new request
     state = curl_init_state(acb->common.bs, s);
     if (!state) {
-        acb->common.cb(acb->common.opaque, -EIO);
-        qemu_aio_unref(acb);
-        return;
+        ret = -EIO;
+        goto out;
     }
 
     acb->start = 0;
@@ -694,9 +697,8 @@ static void curl_readv_bh_cb(void *p)
     state->orig_buf = g_try_malloc(state->buf_len);
     if (state->buf_len && state->orig_buf == NULL) {
         curl_clean_state(state);
-        acb->common.cb(acb->common.opaque, -ENOMEM);
-        qemu_aio_unref(acb);
-        return;
+        ret = -ENOMEM;
+        goto out;
     }
     state->acb[0] = acb;
 
@@ -709,6 +711,13 @@ static void curl_readv_bh_cb(void *p)
 
     /* Tell curl it needs to kick things off */
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
+
+out:
+    if (ret != -EINPROGRESS) {
+        acb->common.cb(acb->common.opaque, ret);
+        qemu_aio_unref(acb);
+    }
+    aio_context_release(ctx);
 }
 
 static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/gluster.c b/block/gluster.c
index 0857c14..5aa34ea 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -232,7 +232,9 @@ static void qemu_gluster_complete_aio(void *opaque)
 
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
+    aio_context_acquire(acb->aio_context);
     qemu_coroutine_enter(acb->coroutine, NULL);
+    aio_context_release(acb->aio_context);
 }
 
 /*
diff --git a/block/io.c b/block/io.c
index adc1eab..4b3e2b2 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2036,12 +2036,15 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
 static void bdrv_aio_bh_cb(void *opaque)
 {
     BlockAIOCBSync *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     if (!acb->is_write && acb->ret >= 0) {
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
+    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
     qemu_aio_unref(acb);
@@ -2117,10 +2120,13 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockAIOCBCoroutine *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     assert(!acb->need_bh);
     qemu_bh_delete(acb->bh);
+    aio_context_acquire(ctx);
     bdrv_co_complete(acb);
+    aio_context_release(ctx);
 }
 
 static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
diff --git a/block/iscsi.c b/block/iscsi.c
index 16c3b44..72c9171 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -129,7 +129,9 @@ iscsi_bh_cb(void *p)
     g_free(acb->buf);
     acb->buf = NULL;
 
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->status);
+    aio_context_release(ctx);
 
     if (acb->task != NULL) {
         scsi_free_scsi_task(acb->task);
@@ -152,9 +154,13 @@ iscsi_schedule_bh(IscsiAIOCB *acb)
 static void iscsi_co_generic_bh_cb(void *opaque)
 {
     struct IscsiTask *iTask = opaque;
+    AioContext *ctx = iTask->iscsilun->aio_context;
+
     iTask->complete = 1;
     qemu_bh_delete(iTask->bh);
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(iTask->co, NULL);
+    aio_context_release(ctx);
 }
 
 static void iscsi_retry_timer_expired(void *opaque)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 88b0520..0e94a86 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -46,6 +46,8 @@ typedef struct {
 } LaioQueue;
 
 struct qemu_laio_state {
+    AioContext *aio_context;
+
     io_context_t ctx;
     EventNotifier e;
 
@@ -109,6 +111,7 @@ static void qemu_laio_completion_bh(void *opaque)
     struct qemu_laio_state *s = opaque;
 
     /* Fetch more completion events when empty */
+    aio_context_acquire(s->aio_context);
     if (s->event_idx == s->event_max) {
         do {
             struct timespec ts = { 0 };
@@ -141,6 +144,8 @@ static void qemu_laio_completion_bh(void *opaque)
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
     }
+
+    aio_context_release(s->aio_context);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
@@ -289,12 +294,14 @@ void laio_detach_aio_context(void *s_, AioContext *old_context)
 
     aio_set_event_notifier(old_context, &s->e, false, NULL);
     qemu_bh_delete(s->completion_bh);
+    s->aio_context = NULL;
 }
 
 void laio_attach_aio_context(void *s_, AioContext *new_context)
 {
     struct qemu_laio_state *s = s_;
 
+    s->aio_context = new_context;
     s->completion_bh = aio_bh_new(new_context, qemu_laio_completion_bh, s);
     aio_set_event_notifier(new_context, &s->e, false,
                            qemu_laio_completion_cb);
diff --git a/block/nfs.c b/block/nfs.c
index 910a51e..c86850b 100644
--- a/block/nfs.c
+++ b/block/nfs.c
@@ -103,9 +103,13 @@ static void nfs_co_init_task(NFSClient *client, NFSRPC *task)
 static void nfs_co_generic_bh_cb(void *opaque)
 {
     NFSRPC *task = opaque;
+    AioContext *ctx = task->client->aio_context;
+
     task->complete = 1;
     qemu_bh_delete(task->bh);
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(task->co, NULL);
+    aio_context_release(ctx);
 }
 
 static void
diff --git a/block/null.c b/block/null.c
index 7d08323..dd1b170 100644
--- a/block/null.c
+++ b/block/null.c
@@ -117,7 +117,11 @@ static const AIOCBInfo null_aiocb_info = {
 static void null_bh_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index 3d6aa07..d128772 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -348,6 +348,16 @@ static void qed_need_check_timer_cb(void *opaque)
     bdrv_aio_flush(s->bs, qed_clear_need_check, s);
 }
 
+void qed_acquire(BDRVQEDState *s)
+{
+    aio_context_acquire(bdrv_get_aio_context(s->bs));
+}
+
+void qed_release(BDRVQEDState *s)
+{
+    aio_context_release(bdrv_get_aio_context(s->bs));
+}
+
 static void qed_start_need_check_timer(BDRVQEDState *s)
 {
     trace_qed_start_need_check_timer(s);
@@ -925,6 +935,7 @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 static void qed_aio_complete_bh(void *opaque)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb = acb->common.cb;
     void *user_opaque = acb->common.opaque;
     int ret = acb->bh_ret;
@@ -933,7 +944,9 @@ static void qed_aio_complete_bh(void *opaque)
     qemu_aio_unref(acb);
 
     /* Invoke callback */
+    qed_acquire(s);
     cb(user_opaque, ret);
+    qed_release(s);
 }
 
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
diff --git a/block/qed.h b/block/qed.h
index 615e676..106cefe 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -198,6 +198,9 @@ enum {
  */
 typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);
 
+void qed_acquire(BDRVQEDState *s);
+void qed_release(BDRVQEDState *s);
+
 /**
  * Generic callback for chaining async callbacks
  */
diff --git a/block/rbd.c b/block/rbd.c
index a60a19d..6206dc3 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -376,6 +376,7 @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
 static void qemu_rbd_complete_aio(RADOSCB *rcb)
 {
     RBDAIOCB *acb = rcb->acb;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     int64_t r;
 
     r = rcb->ret;
@@ -408,7 +409,10 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
+    aio_context_release(ctx);
 
     qemu_aio_unref(acb);
 }
diff --git a/dma-helpers.c b/dma-helpers.c
index 4faec5d..68f6f07 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -69,6 +69,7 @@ void qemu_sglist_destroy(QEMUSGList *qsg)
 
 typedef struct {
     BlockAIOCB common;
+    AioContext *ctx;
     BlockBackend *blk;
     BlockAIOCB *acb;
     QEMUSGList *sg;
@@ -153,8 +154,7 @@ static void dma_blk_cb(void *opaque, int ret)
 
     if (dbs->iov.size == 0) {
         trace_dma_map_wait(dbs);
-        dbs->bh = aio_bh_new(blk_get_aio_context(dbs->blk),
-                             reschedule_dma, dbs);
+        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
         cpu_register_map_client(dbs->bh);
         return;
     }
@@ -163,8 +163,10 @@ static void dma_blk_cb(void *opaque, int ret)
         qemu_iovec_discard_back(&dbs->iov, dbs->iov.size & ~BDRV_SECTOR_MASK);
     }
 
+    aio_context_acquire(dbs->ctx);
     dbs->acb = dbs->io_func(dbs->blk, dbs->sector_num, &dbs->iov,
                             dbs->iov.size / 512, dma_blk_cb, dbs);
+    aio_context_release(dbs->ctx);
     assert(dbs->acb);
 }
 
@@ -201,6 +203,7 @@ BlockAIOCB *dma_blk_io(
 
     dbs->acb = NULL;
     dbs->blk = blk;
+    dbs->ctx = blk_get_aio_context(blk);
     dbs->sg = sg;
     dbs->sector_num = sector_num;
     dbs->sg_cur_index = 0;
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index d72942e..5c1cb89 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -626,6 +626,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
 
     s->rq = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (req) {
         VirtIOBlockReq *next = req->next;
         virtio_blk_handle_request(req, &mrb);
@@ -635,6 +636,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
     if (mrb.num_reqs) {
         virtio_blk_submit_multireq(s->blk, &mrb);
     }
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_dma_restart_cb(void *opaque, int running,
diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index fd1171e..0d607b8 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -102,6 +102,7 @@ static void scsi_dma_restart_bh(void *opaque)
     qemu_bh_delete(s->bh);
     s->bh = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     QTAILQ_FOREACH_SAFE(req, &s->requests, next, next) {
         scsi_req_ref(req);
         if (req->retry) {
@@ -119,6 +120,7 @@ static void scsi_dma_restart_bh(void *opaque)
         }
         scsi_req_unref(req);
     }
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 void scsi_req_retry(SCSIRequest *req)
diff --git a/thread-pool.c b/thread-pool.c
index 402c778..bffd823 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -165,6 +165,7 @@ static void thread_pool_completion_bh(void *opaque)
     ThreadPool *pool = opaque;
     ThreadPoolElement *elem, *next;
 
+    aio_context_acquire(pool->ctx);
 restart:
     QLIST_FOREACH_SAFE(elem, &pool->head, all, next) {
         if (elem->state != THREAD_DONE) {
@@ -191,6 +192,7 @@ restart:
             qemu_aio_unref(elem);
         }
     }
+    aio_context_release(pool->ctx);
 }
 
 static void thread_pool_cancel(BlockAIOCB *acb)
-- 
1.8.3.1

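The qed.c/qed.h hunks in the patch above also introduce qed_acquire()/qed_release() as shorthand for locking the AioContext of the BDRVQEDState's BlockDriverState; the rest of the series leans on them for every QED callback.  A hedged usage sketch; my_qed_cb is an illustrative name whose real counterpart is qed_aio_next_io_cb in patch 35:

    static void my_qed_cb(void *opaque, int ret)
    {
        QEDAIOCB *acb = opaque;
        BDRVQEDState *s = acb_to_s(acb);

        qed_acquire(s);               /* aio_context_acquire() on s->bs's context */
        qed_aio_next_io(acb, ret);    /* drive the request state machine locked */
        qed_release(s);
    }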

* [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (32 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
                   ` (6 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
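The aio-posix.c and aio-win32.c hunks below stop running timers under the AioContext lock, so timer callbacks that touch block state now acquire it explicitly, just as patch 33 did for bottom halves.  A minimal sketch; MyTimedAIOCB and my_timer_cb are illustrative names modeled on null_timer_cb:

    typedef struct MyTimedAIOCB {
        BlockAIOCB common;
        QEMUTimer timer;
    } MyTimedAIOCB;

    static void my_timer_cb(void *opaque)
    {
        MyTimedAIOCB *acb = opaque;
        AioContext *ctx = bdrv_get_aio_context(acb->common.bs);

        aio_context_acquire(ctx);     /* complete the request under the lock */
        acb->common.cb(acb->common.opaque, 0);
        aio_context_release(ctx);
        timer_deinit(&acb->timer);
        qemu_aio_unref(acb);
    }
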
 aio-posix.c                 | 2 --
 aio-win32.c                 | 2 --
 block/curl.c                | 2 ++
 block/iscsi.c               | 2 ++
 block/null.c                | 4 ++++
 block/qed.c                 | 2 ++
 block/throttle-groups.c     | 2 ++
 util/qemu-coroutine-sleep.c | 5 +++++
 8 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index 972f3ff..aabc4ae 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -359,9 +359,7 @@ bool aio_dispatch(AioContext *ctx)
     qemu_lockcnt_dec(&ctx->list_lock);
 
     /* Run our timers */
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
 
     return progress;
 }
diff --git a/aio-win32.c b/aio-win32.c
index 1b50019..4479d3f 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -372,9 +372,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
-    aio_context_acquire(ctx);
     progress |= timerlistgroup_run_timers(&ctx->tlg);
-    aio_context_release(ctx);
 
     return progress;
 }
diff --git a/block/curl.c b/block/curl.c
index 17add0a..c2b6726 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -376,9 +376,11 @@ static void curl_multi_timeout_do(void *arg)
         return;
     }
 
+    aio_context_acquire(s->aio_context);
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
 
     curl_multi_check_completion(s);
+    aio_context_release(s->aio_context);
 #else
     abort();
 #endif
diff --git a/block/iscsi.c b/block/iscsi.c
index 72c9171..411aef8 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -1214,6 +1214,7 @@ static void iscsi_nop_timed_event(void *opaque)
 {
     IscsiLun *iscsilun = opaque;
 
+    aio_context_acquire(iscsilun->aio_context);
     if (iscsi_get_nops_in_flight(iscsilun->iscsi) >= MAX_NOP_FAILURES) {
         error_report("iSCSI: NOP timeout. Reconnecting...");
         iscsilun->request_timed_out = true;
@@ -1224,6 +1225,7 @@ static void iscsi_nop_timed_event(void *opaque)
 
     timer_mod(iscsilun->nop_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + NOP_INTERVAL);
     iscsi_set_events(iscsilun);
+    aio_context_release(iscsilun->aio_context);
 }
 
 static void iscsi_readcapacity_sync(IscsiLun *iscsilun, Error **errp)
diff --git a/block/null.c b/block/null.c
index dd1b170..9bddc1b 100644
--- a/block/null.c
+++ b/block/null.c
@@ -129,7 +129,11 @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
+
+    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
+    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed.c b/block/qed.c
index d128772..17777d1 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -342,10 +342,12 @@ static void qed_need_check_timer_cb(void *opaque)
 
     trace_qed_need_check_timer_cb(s);
 
+    qed_acquire(s);
     qed_plug_allocating_write_reqs(s);
 
     /* Ensure writes are on disk before clearing flag */
     bdrv_aio_flush(s->bs, qed_clear_need_check, s);
+    qed_release(s);
 }
 
 void qed_acquire(BDRVQEDState *s)
diff --git a/block/throttle-groups.c b/block/throttle-groups.c
index 13b5baa..cdb819a 100644
--- a/block/throttle-groups.c
+++ b/block/throttle-groups.c
@@ -370,7 +370,9 @@ static void timer_cb(BlockDriverState *bs, bool is_write)
     qemu_mutex_unlock(&tg->lock);
 
     /* Run the request that was waiting for this timer */
+    aio_context_acquire(bdrv_get_aio_context(bs));
     empty_queue = !qemu_co_enter_next(&bs->throttled_reqs[is_write]);
+    aio_context_release(bdrv_get_aio_context(bs));
 
     /* If the request queue was empty then we have to take care of
      * scheduling the next one */
diff --git a/util/qemu-coroutine-sleep.c b/util/qemu-coroutine-sleep.c
index b35db56..6e07343 100644
--- a/util/qemu-coroutine-sleep.c
+++ b/util/qemu-coroutine-sleep.c
@@ -18,13 +18,17 @@
 typedef struct CoSleepCB {
     QEMUTimer *ts;
     Coroutine *co;
+    AioContext *ctx;
 } CoSleepCB;
 
 static void co_sleep_cb(void *opaque)
 {
     CoSleepCB *sleep_cb = opaque;
+    AioContext *ctx = sleep_cb->ctx;
 
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(sleep_cb->co, NULL);
+    aio_context_release(ctx);
 }
 
 void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
@@ -32,6 +36,7 @@ void coroutine_fn co_aio_sleep_ns(AioContext *ctx, QEMUClockType type,
 {
     CoSleepCB sleep_cb = {
         .co = qemu_coroutine_self(),
+        .ctx = ctx,
     };
     sleep_cb.ts = aio_timer_new(ctx, type, SCALE_NS, co_sleep_cb, &sleep_cb);
     timer_mod(sleep_cb.ts, qemu_clock_get_ns(type) + ns);
-- 
1.8.3.1


* [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks that need it
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (33 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
                   ` (5 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
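This moves the lock out to the outermost AIO completion callback, which is why many of patch 33's bottom-half hunks are reverted below.  Callbacks that used to return early grow an out:/done: label so the release is never skipped.  A minimal sketch of the resulting shape; MyReq, complete_request and my_aio_cb are illustrative names modeled on scsi_aio_complete:

    typedef struct MyReq {
        BlockBackend *blk;
        bool canceled;
    } MyReq;

    static void complete_request(MyReq *r, int ret);   /* illustrative helper */

    static void my_aio_cb(void *opaque, int ret)
    {
        MyReq *r = opaque;
        AioContext *ctx = blk_get_aio_context(r->blk);

        aio_context_acquire(ctx);
        if (r->canceled) {
            goto done;                /* the release below still runs */
        }
        complete_request(r, ret);
    done:
        aio_context_release(ctx);
    }
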
 block/archipelago.c    |  3 ---
 block/blkdebug.c       |  4 ---
 block/blkverify.c      |  9 +++----
 block/block-backend.c  |  4 ---
 block/curl.c           |  2 +-
 block/io.c             | 13 +++++-----
 block/iscsi.c          |  2 --
 block/linux-aio.c      |  7 +++---
 block/mirror.c         | 12 ++++++---
 block/null.c           |  8 ------
 block/qed-cluster.c    |  2 ++
 block/qed-table.c      | 12 +++++++--
 block/qed.c            | 66 ++++++++++++++++++++++++++++++++++++++------------
 block/quorum.c         | 12 +++++++++
 block/rbd.c            |  4 ---
 block/win32-aio.c      |  2 --
 hw/block/virtio-blk.c  | 12 ++++++++-
 hw/scsi/scsi-disk.c    | 18 ++++++++++++++
 hw/scsi/scsi-generic.c | 20 ++++++++++++---
 thread-pool.c          | 12 ++++++++-
 20 files changed, 157 insertions(+), 67 deletions(-)

diff --git a/block/archipelago.c b/block/archipelago.c
index 7f69a3f..855655c 100644
--- a/block/archipelago.c
+++ b/block/archipelago.c
@@ -312,12 +312,9 @@ static void qemu_archipelago_complete_aio(void *opaque)
 {
     AIORequestData *reqdata = (AIORequestData *) opaque;
     ArchipelagoAIOCB *aio_cb = (ArchipelagoAIOCB *) reqdata->aio_cb;
-    AioContext *ctx = bdrv_get_aio_context(aio_cb->common.bs);
 
     qemu_bh_delete(aio_cb->bh);
-    aio_context_acquire(ctx);
     aio_cb->common.cb(aio_cb->common.opaque, aio_cb->ret);
-    aio_context_release(ctx);
     aio_cb->status = 0;
 
     qemu_aio_unref(aio_cb);
diff --git a/block/blkdebug.c b/block/blkdebug.c
index ba35185..6860a2b 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -458,12 +458,8 @@ out:
 static void error_callback_bh(void *opaque)
 {
     struct BlkdebugAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/blkverify.c b/block/blkverify.c
index 3ff681a..74188f5 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -188,23 +188,22 @@ static BlkverifyAIOCB *blkverify_aio_get(BlockDriverState *bs, bool is_write,
 static void blkverify_aio_bh(void *opaque)
 {
     BlkverifyAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     qemu_bh_delete(acb->bh);
     if (acb->buf) {
         qemu_iovec_destroy(&acb->raw_qiov);
         qemu_vfree(acb->buf);
     }
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
 static void blkverify_aio_cb(void *opaque, int ret)
 {
     BlkverifyAIOCB *acb = opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
+    aio_context_acquire(ctx);
     switch (++acb->done) {
     case 1:
         acb->ret = ret;
@@ -219,11 +218,11 @@ static void blkverify_aio_cb(void *opaque, int ret)
             acb->verify(acb);
         }
 
-        acb->bh = aio_bh_new(bdrv_get_aio_context(acb->common.bs),
-                             blkverify_aio_bh, acb);
+        acb->bh = aio_bh_new(ctx, blkverify_aio_bh, acb);
         qemu_bh_schedule(acb->bh);
         break;
     }
+    aio_context_release(ctx);
 }
 
 static void blkverify_verify_readv(BlkverifyAIOCB *acb)
diff --git a/block/block-backend.c b/block/block-backend.c
index 8549289..36ccc9e 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -637,12 +637,8 @@ int blk_write_zeroes(BlockBackend *blk, int64_t sector_num,
 static void error_callback_bh(void *opaque)
 {
     struct BlockBackendAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_aio_unref(acb);
 }
 
diff --git a/block/curl.c b/block/curl.c
index c2b6726..7b5b17f 100644
--- a/block/curl.c
+++ b/block/curl.c
@@ -715,11 +715,11 @@ static void curl_readv_bh_cb(void *p)
     curl_multi_socket_action(s->multi, CURL_SOCKET_TIMEOUT, 0, &running);
 
 out:
+    aio_context_release(ctx);
     if (ret != -EINPROGRESS) {
         acb->common.cb(acb->common.opaque, ret);
         qemu_aio_unref(acb);
     }
-    aio_context_release(ctx);
 }
 
 static BlockAIOCB *curl_aio_readv(BlockDriverState *bs,
diff --git a/block/io.c b/block/io.c
index 4b3e2b2..9b30f96 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2036,15 +2036,12 @@ static const AIOCBInfo bdrv_em_aiocb_info = {
 static void bdrv_aio_bh_cb(void *opaque)
 {
     BlockAIOCBSync *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     if (!acb->is_write && acb->ret >= 0) {
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->ret);
-    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
     qemu_aio_unref(acb);
@@ -2120,13 +2117,10 @@ static void bdrv_co_complete(BlockAIOCBCoroutine *acb)
 static void bdrv_co_em_bh(void *opaque)
 {
     BlockAIOCBCoroutine *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
     assert(!acb->need_bh);
     qemu_bh_delete(acb->bh);
-    aio_context_acquire(ctx);
     bdrv_co_complete(acb);
-    aio_context_release(ctx);
 }
 
 static void bdrv_co_maybe_schedule_bh(BlockAIOCBCoroutine *acb)
@@ -2277,15 +2271,19 @@ void qemu_aio_unref(void *p)
 
 typedef struct CoroutineIOCompletion {
     Coroutine *coroutine;
+    AioContext *ctx;
     int ret;
 } CoroutineIOCompletion;
 
 static void bdrv_co_io_em_complete(void *opaque, int ret)
 {
     CoroutineIOCompletion *co = opaque;
+    AioContext *ctx = co->ctx;
 
     co->ret = ret;
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(co->coroutine, NULL);
+    aio_context_release(ctx);
 }
 
 static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
@@ -2294,6 +2292,7 @@ static int coroutine_fn bdrv_co_io_em(BlockDriverState *bs, int64_t sector_num,
 {
     CoroutineIOCompletion co = {
         .coroutine = qemu_coroutine_self(),
+        .ctx = bdrv_get_aio_context(bs),
     };
     BlockAIOCB *acb;
 
@@ -2367,6 +2366,7 @@ int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
         BlockAIOCB *acb;
         CoroutineIOCompletion co = {
             .coroutine = qemu_coroutine_self(),
+            .ctx = bdrv_get_aio_context(bs),
         };
 
         acb = bs->drv->bdrv_aio_flush(bs, bdrv_co_io_em_complete, &co);
@@ -2497,6 +2497,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
             BlockAIOCB *acb;
             CoroutineIOCompletion co = {
                 .coroutine = qemu_coroutine_self(),
+                .ctx = bdrv_get_aio_context(bs),
             };
 
             acb = bs->drv->bdrv_aio_discard(bs, sector_num, nb_sectors,
diff --git a/block/iscsi.c b/block/iscsi.c
index 411aef8..a3dc06b 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -129,9 +129,7 @@ iscsi_bh_cb(void *p)
     g_free(acb->buf);
     acb->buf = NULL;
 
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, acb->status);
-    aio_context_release(ctx);
 
     if (acb->task != NULL) {
         scsi_free_scsi_task(acb->task);
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 0e94a86..d061b8b 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -71,8 +71,7 @@ static inline ssize_t io_event_ret(struct io_event *ev)
 /*
  * Completes an AIO request (calls the callback and frees the ACB).
  */
-static void qemu_laio_process_completion(struct qemu_laio_state *s,
-    struct qemu_laiocb *laiocb)
+static void qemu_laio_process_completion(struct qemu_laiocb *laiocb)
 {
     int ret;
 
@@ -138,7 +137,9 @@ static void qemu_laio_completion_bh(void *opaque)
         laiocb->ret = io_event_ret(&s->events[s->event_idx]);
         s->event_idx++;
 
-        qemu_laio_process_completion(s, laiocb);
+        aio_context_release(s->aio_context);
+        qemu_laio_process_completion(laiocb);
+        aio_context_acquire(s->aio_context);
     }
 
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
diff --git a/block/mirror.c b/block/mirror.c
index 52c9abf..a8249d1 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -125,6 +125,8 @@ static void mirror_write_complete(void *opaque, int ret)
 {
     MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
+
+    aio_context_acquire(bdrv_get_aio_context(s->common.bs));
     if (ret < 0) {
         BlockErrorAction action;
 
@@ -135,12 +137,15 @@ static void mirror_write_complete(void *opaque, int ret)
         }
     }
     mirror_iteration_done(op, ret);
+    aio_context_release(bdrv_get_aio_context(s->common.bs));
 }
 
 static void mirror_read_complete(void *opaque, int ret)
 {
     MirrorOp *op = opaque;
     MirrorBlockJob *s = op->s;
+
+    aio_context_acquire(bdrv_get_aio_context(s->common.bs));
     if (ret < 0) {
         BlockErrorAction action;
 
@@ -151,10 +156,11 @@ static void mirror_read_complete(void *opaque, int ret)
         }
 
         mirror_iteration_done(op, ret);
-        return;
+    } else {
+        bdrv_aio_writev(s->target, op->sector_num, &op->qiov, op->nb_sectors,
+                        mirror_write_complete, op);
     }
-    bdrv_aio_writev(s->target, op->sector_num, &op->qiov, op->nb_sectors,
-                    mirror_write_complete, op);
+    aio_context_release(bdrv_get_aio_context(s->common.bs));
 }
 
 static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
diff --git a/block/null.c b/block/null.c
index 9bddc1b..7d08323 100644
--- a/block/null.c
+++ b/block/null.c
@@ -117,11 +117,7 @@ static const AIOCBInfo null_aiocb_info = {
 static void null_bh_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
-    aio_context_release(ctx);
     qemu_bh_delete(acb->bh);
     qemu_aio_unref(acb);
 }
@@ -129,11 +125,7 @@ static void null_bh_cb(void *opaque)
 static void null_timer_cb(void *opaque)
 {
     NullAIOCB *acb = opaque;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, 0);
-    aio_context_release(ctx);
     timer_deinit(&acb->timer);
     qemu_aio_unref(acb);
 }
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
index f64b2af..64ea4f2 100644
--- a/block/qed-cluster.c
+++ b/block/qed-cluster.c
@@ -82,6 +82,7 @@ static void qed_find_cluster_cb(void *opaque, int ret)
     unsigned int index;
     unsigned int n;
 
+    qed_acquire(s);
     if (ret) {
         goto out;
     }
@@ -108,6 +109,7 @@ static void qed_find_cluster_cb(void *opaque, int ret)
 
 out:
     find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+    qed_release(s);
     g_free(find_cluster_cb);
 }
 
diff --git a/block/qed-table.c b/block/qed-table.c
index f4219b8..83bd5c2 100644
--- a/block/qed-table.c
+++ b/block/qed-table.c
@@ -29,6 +29,7 @@ static void qed_read_table_cb(void *opaque, int ret)
 {
     QEDReadTableCB *read_table_cb = opaque;
     QEDTable *table = read_table_cb->table;
+    BDRVQEDState *s = read_table_cb->s;
     int noffsets = read_table_cb->qiov.size / sizeof(uint64_t);
     int i;
 
@@ -38,13 +39,15 @@ static void qed_read_table_cb(void *opaque, int ret)
     }
 
     /* Byteswap offsets */
+    qed_acquire(s);
     for (i = 0; i < noffsets; i++) {
         table->offsets[i] = le64_to_cpu(table->offsets[i]);
     }
+    qed_release(s);
 
 out:
     /* Completion */
-    trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
+    trace_qed_read_table_cb(s, read_table_cb->table, ret);
     gencb_complete(&read_table_cb->gencb, ret);
 }
 
@@ -82,8 +85,9 @@ typedef struct {
 static void qed_write_table_cb(void *opaque, int ret)
 {
     QEDWriteTableCB *write_table_cb = opaque;
+    BDRVQEDState *s = write_table_cb->s;
 
-    trace_qed_write_table_cb(write_table_cb->s,
+    trace_qed_write_table_cb(s,
                              write_table_cb->orig_table,
                              write_table_cb->flush,
                              ret);
@@ -95,8 +99,10 @@ static void qed_write_table_cb(void *opaque, int ret)
     if (write_table_cb->flush) {
         /* We still need to flush first */
         write_table_cb->flush = false;
+        qed_acquire(s);
         bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
                        write_table_cb);
+        qed_release(s);
         return;
     }
 
@@ -215,6 +221,7 @@ static void qed_read_l2_table_cb(void *opaque, int ret)
     CachedL2Table *l2_table = request->l2_table;
     uint64_t l2_offset = read_l2_table_cb->l2_offset;
 
+    qed_acquire(s);
     if (ret) {
         /* can't trust loaded L2 table anymore */
         qed_unref_l2_cache_entry(l2_table);
@@ -230,6 +237,7 @@ static void qed_read_l2_table_cb(void *opaque, int ret)
         request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, l2_offset);
         assert(request->l2_table != NULL);
     }
+    qed_release(s);
 
     gencb_complete(&read_l2_table_cb->gencb, ret);
 }
diff --git a/block/qed.c b/block/qed.c
index 17777d1..f4823e8 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -117,11 +117,13 @@ static void qed_write_header_read_cb(void *opaque, int ret)
     }
 
     /* Update header */
+    qed_acquire(s);
     qed_header_cpu_to_le(&s->header, (QEDHeader *)write_header_cb->buf);
 
     bdrv_aio_writev(s->bs->file->bs, 0, &write_header_cb->qiov,
                     write_header_cb->nsectors, qed_write_header_cb,
                     write_header_cb);
+    qed_release(s);
 }
 
 /**
@@ -277,11 +279,19 @@ static void qed_aio_start_io(QEDAIOCB *acb)
     qed_aio_next_io(acb, 0);
 }
 
+static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+{
+    return acb->common.bs->opaque;
+}
+
 static void qed_aio_next_io_cb(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
 
-    qed_aio_next_io(acb, ret);
+    qed_acquire(s);
+    qed_aio_next_io(opaque, ret);
+    qed_release(s);
 }
 
 static void qed_plug_allocating_write_reqs(BDRVQEDState *s)
@@ -314,23 +324,29 @@ static void qed_flush_after_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
 
+    qed_acquire(s);
     bdrv_aio_flush(s->bs, qed_finish_clear_need_check, s);
 
     /* No need to wait until flush completes */
     qed_unplug_allocating_write_reqs(s);
+    qed_release(s);
 }
 
 static void qed_clear_need_check(void *opaque, int ret)
 {
     BDRVQEDState *s = opaque;
 
+    qed_acquire(s);
     if (ret) {
         qed_unplug_allocating_write_reqs(s);
-        return;
+        goto out;
     }
 
     s->header.features &= ~QED_F_NEED_CHECK;
     qed_write_header(s, qed_flush_after_clear_need_check, s);
+
+out:
+    qed_release(s);
 }
 
 static void qed_need_check_timer_cb(void *opaque)
@@ -773,11 +789,6 @@ static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
     return cb.status;
 }
 
-static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
-{
-    return acb->common.bs->opaque;
-}
-
 /**
  * Read from the backing file or zero-fill if no backing file
  *
@@ -868,10 +879,12 @@ static void qed_copy_from_backing_file_write(void *opaque, int ret)
         return;
     }
 
+    qed_acquire(s);
     BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
     bdrv_aio_writev(s->bs->file->bs, copy_cb->offset / BDRV_SECTOR_SIZE,
                     &copy_cb->qiov, copy_cb->qiov.size / BDRV_SECTOR_SIZE,
                     qed_copy_from_backing_file_cb, copy_cb);
+    qed_release(s);
 }
 
 /**
@@ -937,7 +950,6 @@ static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
 static void qed_aio_complete_bh(void *opaque)
 {
     QEDAIOCB *acb = opaque;
-    BDRVQEDState *s = acb_to_s(acb);
     BlockCompletionFunc *cb = acb->common.cb;
     void *user_opaque = acb->common.opaque;
     int ret = acb->bh_ret;
@@ -946,9 +958,7 @@ static void qed_aio_complete_bh(void *opaque)
     qemu_aio_unref(acb);
 
     /* Invoke callback */
-    qed_acquire(s);
     cb(user_opaque, ret);
-    qed_release(s);
 }
 
 static void qed_aio_complete(QEDAIOCB *acb, int ret)
@@ -1000,6 +1010,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     CachedL2Table *l2_table = acb->request.l2_table;
     uint64_t l2_offset = l2_table->offset;
 
+    qed_acquire(s);
     qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
 
     /* This is guaranteed to succeed because we just committed the entry to the
@@ -1009,6 +1020,7 @@ static void qed_commit_l2_update(void *opaque, int ret)
     assert(acb->request.l2_table != NULL);
 
     qed_aio_next_io(acb, ret);
+    qed_release(s);
 }
 
 /**
@@ -1020,15 +1032,18 @@ static void qed_aio_write_l1_update(void *opaque, int ret)
     BDRVQEDState *s = acb_to_s(acb);
     int index;
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     index = qed_l1_index(s, acb->cur_pos);
     s->l1_table->offsets[index] = acb->request.l2_table->offset;
 
     qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1071,7 +1086,11 @@ err:
 static void qed_aio_write_l2_update_cb(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+
+    qed_acquire(s);
     qed_aio_write_l2_update(acb, ret, acb->cur_cluster);
+    qed_release(s);
 }
 
 /**
@@ -1088,9 +1107,11 @@ static void qed_aio_write_flush_before_l2_update(void *opaque, int ret)
     QEDAIOCB *acb = opaque;
     BDRVQEDState *s = acb_to_s(acb);
 
+    qed_acquire(s);
     if (!bdrv_aio_flush(s->bs->file->bs, qed_aio_write_l2_update_cb, opaque)) {
         qed_aio_complete(acb, -EIO);
     }
+    qed_release(s);
 }
 
 /**
@@ -1106,9 +1127,10 @@ static void qed_aio_write_main(void *opaque, int ret)
 
     trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     if (acb->find_cluster_ret == QED_CLUSTER_FOUND) {
@@ -1125,6 +1147,8 @@ static void qed_aio_write_main(void *opaque, int ret)
     bdrv_aio_writev(s->bs->file->bs, offset / BDRV_SECTOR_SIZE,
                     &acb->cur_qiov, acb->cur_qiov.size / BDRV_SECTOR_SIZE,
                     next_fn, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1141,14 +1165,17 @@ static void qed_aio_write_postfill(void *opaque, int ret)
                       qed_offset_into_cluster(s, acb->cur_pos) +
                       acb->cur_qiov.size;
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     trace_qed_aio_write_postfill(s, acb, start, len, offset);
     qed_copy_from_backing_file(s, start, len, offset,
                                 qed_aio_write_main, acb);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1161,9 +1188,11 @@ static void qed_aio_write_prefill(void *opaque, int ret)
     uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
     uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
 
+    qed_acquire(s);
     trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
     qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
                                 qed_aio_write_postfill, acb);
+    qed_release(s);
 }
 
 /**
@@ -1182,13 +1211,17 @@ static bool qed_should_set_need_check(BDRVQEDState *s)
 static void qed_aio_write_zero_cluster(void *opaque, int ret)
 {
     QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
 
+    qed_acquire(s);
     if (ret) {
         qed_aio_complete(acb, ret);
-        return;
+        goto out;
     }
 
     qed_aio_write_l2_update(acb, 0, 1);
+out:
+    qed_release(s);
 }
 
 /**
@@ -1447,6 +1480,7 @@ static BlockAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
 }
 
 typedef struct {
+    BDRVQEDState *s;
     Coroutine *co;
     int ret;
     bool done;
@@ -1459,7 +1493,9 @@ static void coroutine_fn qed_co_write_zeroes_cb(void *opaque, int ret)
     cb->done = true;
     cb->ret = ret;
     if (cb->co) {
+        qed_acquire(cb->s);
         qemu_coroutine_enter(cb->co, NULL);
+        qed_release(cb->s);
     }
 }
 
@@ -1470,7 +1506,7 @@ static int coroutine_fn bdrv_qed_co_write_zeroes(BlockDriverState *bs,
 {
     BlockAIOCB *blockacb;
     BDRVQEDState *s = bs->opaque;
-    QEDWriteZeroesCB cb = { .done = false };
+    QEDWriteZeroesCB cb = { .s = s, .done = false };
     QEMUIOVector qiov;
     struct iovec iov;
 
diff --git a/block/quorum.c b/block/quorum.c
index d7a0f11..7b451f1 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -149,6 +149,7 @@ static AIOCBInfo quorum_aiocb_info = {
     .cancel_async       = quorum_aio_cancel,
 };
 
+/* Called _without_ acquiring AioContext.  */
 static void quorum_aio_finalize(QuorumAIOCB *acb)
 {
     int i, ret = 0;
@@ -276,12 +277,15 @@ static void quorum_fifo_aio_cb(void *opaque, int ret)
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
 
+    aio_context_acquire(ctx);
     assert(acb->is_read && s->read_pattern == QUORUM_READ_PATTERN_FIFO);
 
     /* We try to read next child in FIFO order if we fail to read */
     if (ret < 0 && ++acb->child_iter < s->num_children) {
         read_fifo_child(acb);
+        aio_context_release(ctx);
         return;
     }
 
@@ -292,6 +296,7 @@ static void quorum_fifo_aio_cb(void *opaque, int ret)
 
     /* FIXME: rewrite failed children if acb->child_iter > 0? */
 
+    aio_context_release(ctx);
     quorum_aio_finalize(acb);
 }
 
@@ -300,8 +305,11 @@ static void quorum_aio_cb(void *opaque, int ret)
     QuorumChildRequest *sacb = opaque;
     QuorumAIOCB *acb = sacb->parent;
     BDRVQuorumState *s = acb->common.bs->opaque;
+    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     bool rewrite = false;
 
+    aio_context_acquire(ctx);
+
     sacb->ret = ret;
     acb->count++;
     if (ret == 0) {
@@ -311,7 +319,9 @@ static void quorum_aio_cb(void *opaque, int ret)
     }
     assert(acb->count <= s->num_children);
     assert(acb->success_count <= s->num_children);
+
     if (acb->count < s->num_children) {
+        aio_context_release(ctx);
         return;
     }
 
@@ -322,6 +332,8 @@ static void quorum_aio_cb(void *opaque, int ret)
         quorum_has_too_much_io_failed(acb);
     }
 
+    aio_context_release(ctx);
+
     /* if no rewrite is done the code will finish right away */
     if (!rewrite) {
         quorum_aio_finalize(acb);
diff --git a/block/rbd.c b/block/rbd.c
index 6206dc3..a60a19d 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -376,7 +376,6 @@ static int qemu_rbd_create(const char *filename, QemuOpts *opts, Error **errp)
 static void qemu_rbd_complete_aio(RADOSCB *rcb)
 {
     RBDAIOCB *acb = rcb->acb;
-    AioContext *ctx = bdrv_get_aio_context(acb->common.bs);
     int64_t r;
 
     r = rcb->ret;
@@ -409,10 +408,7 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
         qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
     }
     qemu_vfree(acb->bounce);
-
-    aio_context_acquire(ctx);
     acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-    aio_context_release(ctx);
 
     qemu_aio_unref(acb);
 }
diff --git a/block/win32-aio.c b/block/win32-aio.c
index 85aac85..16270ca 100644
--- a/block/win32-aio.c
+++ b/block/win32-aio.c
@@ -87,9 +87,7 @@ static void win32_aio_process_completion(QEMUWin32AIOState *s,
     }
 
 
-    aio_context_acquire(s->aio_ctx);
     waiocb->common.cb(waiocb->common.opaque, ret);
-    aio_context_release(s->aio_ctx);
     qemu_aio_unref(waiocb);
 }
 
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 5c1cb89..f05c84a 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -86,7 +86,9 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
 static void virtio_blk_rw_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *next = opaque;
+    VirtIOBlock *s = next->dev;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     while (next) {
         VirtIOBlockReq *req = next;
         next = req->mr_next;
@@ -119,21 +121,27 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
         block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
         virtio_blk_free_request(req);
     }
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 static void virtio_blk_flush_complete(void *opaque, int ret)
 {
     VirtIOBlockReq *req = opaque;
+    VirtIOBlock *s = req->dev;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     if (ret) {
         if (virtio_blk_handle_rw_error(req, -ret, 0)) {
-            return;
+            goto out;
         }
     }
 
     virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
     block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
     virtio_blk_free_request(req);
+
+out:
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
 }
 
 #ifdef __linux__
@@ -180,8 +188,10 @@ static void virtio_blk_ioctl_complete(void *opaque, int status)
     virtio_stl_p(vdev, &scsi->data_len, hdr->dxfer_len);
 
 out:
+    aio_context_acquire(blk_get_aio_context(s->conf.conf.blk));
     virtio_blk_req_complete(req, status);
     virtio_blk_free_request(req);
+    aio_context_release(blk_get_aio_context(s->conf.conf.blk));
     g_free(ioctl_req);
 }
 
diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 4797d83..57ce1a0 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -169,6 +169,8 @@ static void scsi_aio_complete(void *opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -184,6 +186,7 @@ static void scsi_aio_complete(void *opaque, int ret)
     scsi_req_complete(&r->req, GOOD);
 
 done:
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
     scsi_req_unref(&r->req);
 }
 
@@ -273,12 +276,14 @@ static void scsi_dma_complete(void *opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_dma_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_read_complete(void * opaque, int ret)
@@ -289,6 +294,8 @@ static void scsi_read_complete(void * opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -310,6 +317,7 @@ static void scsi_read_complete(void * opaque, int ret)
 
 done:
     scsi_req_unref(&r->req);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Actually issue a read to the block device.  */
@@ -359,12 +367,14 @@ static void scsi_do_read_cb(void *opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_do_read(opaque, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -492,12 +502,14 @@ static void scsi_write_complete(void * opaque, int ret)
     assert (r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (ret < 0) {
         block_acct_failed(blk_get_stats(s->qdev.conf.blk), &r->acct);
     } else {
         block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
     }
     scsi_write_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_write_data(SCSIRequest *req)
@@ -1640,11 +1652,14 @@ static void scsi_unmap_complete(void *opaque, int ret)
 {
     UnmapCBData *data = opaque;
     SCSIDiskReq *r = data->r;
+    SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     scsi_unmap_complete_noio(data, ret);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
@@ -1711,6 +1726,8 @@ static void scsi_write_same_complete(void *opaque, int ret)
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
     if (r->req.io_canceled) {
         scsi_req_cancel_complete(&r->req);
         goto done;
@@ -1746,6 +1763,7 @@ done:
     scsi_req_unref(&r->req);
     qemu_vfree(data->iov.iov_base);
     g_free(data);
+    aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
 static void scsi_disk_emulate_write_same(SCSIDiskReq *r, uint8_t *inbuf)
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index a4626f7..1b0c7e9 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -145,10 +145,14 @@ done:
 static void scsi_command_complete(void *opaque, int ret)
 {
     SCSIGenericReq *r = (SCSIGenericReq *)opaque;
+    SCSIDevice *s = r->req.dev;
 
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
+
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
     scsi_command_complete_noio(r, ret);
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 static int execute_command(BlockBackend *blk,
@@ -184,9 +188,11 @@ static void scsi_read_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
     }
 
     len = r->io_header.dxfer_len - r->io_header.resid;
@@ -195,7 +201,7 @@ static void scsi_read_complete(void * opaque, int ret)
     r->len = -1;
     if (len == 0) {
         scsi_command_complete_noio(r, 0);
-        return;
+        goto done;
     }
 
     /* Snoop READ CAPACITY output to set the blocksize.  */
@@ -226,6 +232,9 @@ static void scsi_read_complete(void * opaque, int ret)
     }
     scsi_req_data(&r->req, len);
     scsi_req_unref(&r->req);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Read more data from scsi device into buffer.  */
@@ -261,9 +270,11 @@ static void scsi_write_complete(void * opaque, int ret)
     assert(r->req.aiocb != NULL);
     r->req.aiocb = NULL;
 
+    aio_context_acquire(blk_get_aio_context(s->conf.blk));
+
     if (ret || r->req.io_canceled) {
         scsi_command_complete_noio(r, ret);
-        return;
+        goto done;
     }
 
     if (r->req.cmd.buf[0] == MODE_SELECT && r->req.cmd.buf[4] == 12 &&
@@ -273,6 +284,9 @@ static void scsi_write_complete(void * opaque, int ret)
     }
 
     scsi_command_complete_noio(r, ret);
+
+done:
+    aio_context_release(blk_get_aio_context(s->conf.blk));
 }
 
 /* Write data to a scsi device.  Returns nonzero on failure.
diff --git a/thread-pool.c b/thread-pool.c
index bffd823..e923544 100644
--- a/thread-pool.c
+++ b/thread-pool.c
@@ -185,7 +185,9 @@ restart:
              */
             qemu_bh_schedule(pool->completion_bh);
 
+            aio_context_release(pool->ctx);
             elem->common.cb(elem->common.opaque, elem->ret);
+            aio_context_acquire(pool->ctx);
             qemu_aio_unref(elem);
             goto restart;
         } else {
@@ -261,21 +263,29 @@ BlockAIOCB *thread_pool_submit_aio(ThreadPool *pool,
 
 typedef struct ThreadPoolCo {
     Coroutine *co;
+    AioContext *ctx;
     int ret;
 } ThreadPoolCo;
 
 static void thread_pool_co_cb(void *opaque, int ret)
 {
     ThreadPoolCo *co = opaque;
+    AioContext *ctx = co->ctx;
 
     co->ret = ret;
+    aio_context_acquire(ctx);
     qemu_coroutine_enter(co->co, NULL);
+    aio_context_release(ctx);
 }
 
 int coroutine_fn thread_pool_submit_co(ThreadPool *pool, ThreadPoolFunc *func,
                                        void *arg)
 {
-    ThreadPoolCo tpc = { .co = qemu_coroutine_self(), .ret = -EINPROGRESS };
+    ThreadPoolCo tpc = {
+        .co = qemu_coroutine_self(),
+        .ctx = pool->ctx,
+        .ret = -EINPROGRESS
+    };
     assert(qemu_in_coroutine());
     thread_pool_submit_aio(pool, func, arg, thread_pool_co_cb, &tpc);
     qemu_coroutine_yield();
-- 
1.8.3.1

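The pattern these hunks establish, as a minimal sketch: an AIO callback now
runs in an IOThread with no lock held, so it brackets its completion work
with the AioContext lock of the BlockBackend it completes against.  MyReq,
req->blk and my_request_complete are hypothetical names, not part of the
patch; the QEMU block-backend and accounting APIs are assumed:

    /* Sketch only: take the AioContext lock, do accounting and
     * completion work, then drop the lock. */
    static void my_aio_complete(void *opaque, int ret)
    {
        MyReq *req = opaque;
        AioContext *ctx = blk_get_aio_context(req->blk);

        aio_context_acquire(ctx);
        if (ret < 0) {
            block_acct_failed(blk_get_stats(req->blk), &req->acct);
        } else {
            block_acct_done(blk_get_stats(req->blk), &req->acct);
        }
        my_request_complete(req);   /* hypothetical: complete and free req */
        aio_context_release(ctx);
    }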

* [Qemu-devel] [PATCH 36/40] aio: update locking documentation
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (34 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
                   ` (4 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 docs/multiple-iothreads.txt | 65 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 8 deletions(-)

diff --git a/docs/multiple-iothreads.txt b/docs/multiple-iothreads.txt
index 4197f62..729c64b 100644
--- a/docs/multiple-iothreads.txt
+++ b/docs/multiple-iothreads.txt
@@ -47,8 +47,11 @@ How to program for IOThreads
 The main difference between legacy code and new code that can run in an
 IOThread is dealing explicitly with the event loop object, AioContext
 (see include/block/aio.h).  Code that only works in the main loop
-implicitly uses the main loop's AioContext.  Code that supports running
-in IOThreads must be aware of its AioContext.
+implicitly uses the main loop's AioContext, and does not need to handle
+locking because it implicitly uses the QEMU global mutex.
+
+Code that supports running in IOThreads, instead, must be aware of its
+AioContext and of synchronization.
 
 AioContext supports the following services:
  * File descriptor monitoring (read/write/error on POSIX hosts)
@@ -79,6 +82,11 @@ iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
 Code that takes an AioContext argument works both in IOThreads or the main
 loop, depending on which AioContext instance the caller passes in.
 
+When running in an IOThread, handlers for bottom halves, file descriptors,
+and timers have to handle their own synchronization.  Block layer AIO
+callbacks also have to handle synchronization.  How to do this is the
+topic of the next sections.
+
 How to synchronize with an IOThread
 -----------------------------------
 AioContext is not thread-safe so some rules must be followed when using file
@@ -87,13 +95,15 @@ descriptors, event notifiers, timers, or BHs across threads:
 1. AioContext functions can always be called safely.  They handle their
 own locking internally.
 
-2. Other threads wishing to access the AioContext must use
-aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
-context is acquired no other thread can access it or run event loop iterations
-in this AioContext.
+2. Other threads wishing to access block devices on an IOThread's AioContext
+must use a mutex; the block layer does not attempt to provide mutual
+exclusion of any kind.  aio_context_acquire() and aio_context_release()
+are the typical choice; aio_context_acquire()/aio_context_release() calls
+may be nested.
 
-aio_context_acquire()/aio_context_release() calls may be nested.  This
-means you can call them if you're not sure whether #1 applies.
+#2 typically includes bottom half handlers, file descriptor handlers,
+timer handlers and AIO callbacks.  All of these typically acquire and
+release the AioContext.
 
 There is currently no lock ordering rule if a thread needs to acquire multiple
 AioContexts simultaneously.  Therefore, it is only safe for code holding the
@@ -130,9 +140,48 @@ aio_disable_clients() before calling bdrv_drain().  You can then reenable
 guest requests with aio_enable_clients() before finally releasing the
 AioContext and completing the monitor command.
 
+AioContext and coroutines
+-------------------------
 Long-running jobs (usually in the form of coroutines) are best scheduled in the
 BlockDriverState's AioContext to avoid the need to acquire/release around each
 bdrv_*() call.  Be aware that there is currently no mechanism to get notified
 when bdrv_set_aio_context() moves this BlockDriverState to a different
 AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
 may need to add this if you want to support long-running jobs.
+
+Usually, the coroutine is created and first entered in the main loop.  After
+some time it will yield and, when entered again, it will run in the
+IOThread.  In general, however, the coroutine never has to worry about
+releasing and acquiring the AioContext.  The release will happen in the
+function that entered the coroutine; the acquire will happen (e.g. in
+a bottom half or AIO callback) before the coroutine is entered.  For
+example:
+
+1) the coroutine yields because it waits on I/O (e.g. during a call to
+bdrv_co_readv) or sleeps for a determinate period of time (e.g. with
+co_aio_sleep_ns).  In this case, functions such as bdrv_co_io_em_complete
+or co_sleep_cb take the AioContext lock before re-entering.
+
+2) the coroutine explicitly yields.  In this case, the code implementing
+the long-running job will have a bottom half, AIO callback or similar that
+should acquire the AioContext before re-entering the coroutine.
+
+3) the coroutine yields through a CoQueue (directly or indirectly, e.g.
+through a CoMutex or CoRwLock).  This also "just works", but the mechanism
+is more subtle.  Suppose coroutine C1 has just released a mutex and C2
+was waiting on it.  C2 is woken up at C1's next yield point, or just
+before C1 terminates; C2 then runs *before qemu_coroutine_enter(C1)
+returns to its caller*.  Assuming that C1 was entered with a sequence like
+
+    aio_context_acquire(ctx);
+    qemu_coroutine_enter(co);     // enters C1
+    aio_context_release(ctx);
+
+... then C2 will also be protected by the AioContext lock.
+
+
+Bullet 3 provides the following rule for using coroutine mutual exclusion
+in multiple threads:
+
+    *** If two coroutines can exclude each other through a CoMutex or
+    *** CoRwLock or CoQueue, they should all run under the same AioContext.
-- 
1.8.3.1

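A minimal sketch of case 2 above, assuming the usual QEMU coroutine and
AioContext APIs; MyJob and its fields are hypothetical names:

    /* Sketch only: a bottom half that re-enters a long-running job's
     * coroutine takes the job's AioContext lock first, so the coroutine
     * itself never has to acquire or release it. */
    static void my_job_bh(void *opaque)
    {
        MyJob *job = opaque;

        aio_context_acquire(job->ctx);
        qemu_coroutine_enter(job->co, NULL);  /* two-argument form of this era */
        aio_context_release(job->ctx);
    }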

* [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (35 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
                   ` (3 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Avoid entering the slow path of qemu_lockcnt_dec_and_lock if
no bottom half has to be deleted.  If a bottom half deletes itself,
it will be picked up on the next visit of the list, or when the
AioContext itself is finalized.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 async.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/async.c b/async.c
index 4c1f658..529934c 100644
--- a/async.c
+++ b/async.c
@@ -69,19 +69,24 @@ int aio_bh_poll(AioContext *ctx)
 {
     QEMUBH *bh, **bhp, *next;
     int ret;
+    bool deleted = false;
 
     qemu_lockcnt_inc(&ctx->list_lock);
 
     ret = 0;
     for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
         next = atomic_rcu_read(&bh->next);
+        if (bh->deleted) {
+            deleted = true;
+            continue;
+        }
         /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
          * implicit memory barrier ensures that the callback sees all writes
          * done by the scheduling thread.  It also ensures that the scheduling
          * thread sees the zero before bh->cb has run, and thus will call
          * aio_notify again if necessary.
          */
-        if (!bh->deleted && atomic_xchg(&bh->scheduled, 0)) {
+        if (atomic_xchg(&bh->scheduled, 0)) {
             /* Idle BHs don't count as progress */
             if (!bh->idle) {
                 ret = 1;
@@ -92,6 +97,11 @@ int aio_bh_poll(AioContext *ctx)
     }
 
     /* remove deleted bhs */
+    if (!deleted) {
+        qemu_lockcnt_dec(&ctx->list_lock);
+        return ret;
+    }
+
     if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
@@ -105,7 +115,6 @@ int aio_bh_poll(AioContext *ctx)
         }
         qemu_lockcnt_unlock(&ctx->list_lock);
     }
-
     return ret;
 }
 
-- 
1.8.3.1

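The fast path this patch adds, reduced to a sketch of the qemu_lockcnt
idiom (the list walk is elided; this mirrors the aio_bh_poll code above):

    /* Sketch only: readers hold just the count; the mutex is taken, via
     * the slow path, only when a deleted node actually has to be
     * unlinked from the list. */
    qemu_lockcnt_inc(&ctx->list_lock);
    /* ... walk the list, setting 'deleted' if any bh->deleted is seen ... */
    if (!deleted) {
        qemu_lockcnt_dec(&ctx->list_lock);            /* fast path, no mutex */
    } else if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
        /* count reached zero and the mutex is held: safe to unlink */
        /* ... remove the deleted bottom halves ... */
        qemu_lockcnt_unlock(&ctx->list_lock);
    }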

* [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (36 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
                   ` (2 subsequent siblings)
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

This patch prepares for the removal of unnecessary lockcnt inc/dec pairs.
Extract the dispatching loop for file descriptor handlers into a new
function aio_dispatch_handlers, and then inline aio_dispatch into
aio_poll.

aio_dispatch can now become void.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c         | 41 +++++++++++++++++------------------------
 aio-win32.c         | 11 ++++-------
 include/block/aio.h |  2 +-
 3 files changed, 22 insertions(+), 32 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index aabc4ae..b372816 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -305,26 +305,11 @@ bool aio_pending(AioContext *ctx)
     return result;
 }
 
-bool aio_dispatch(AioContext *ctx)
+static bool aio_dispatch_handlers(AioContext *ctx)
 {
     AioHandler *node, *tmp;
     bool progress = false;
 
-    /*
-     * If there are callbacks left that have been queued, we need to call them.
-     * Do not call select in this case, because it is possible that the caller
-     * does not need a complete flush (as is the case for aio_poll loops).
-     */
-    if (aio_bh_poll(ctx)) {
-        progress = true;
-    }
-
-    /*
-     * We have to walk very carefully in case aio_set_fd_handler is
-     * called while we're walking.
-     */
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     QLIST_FOREACH_SAFE_RCU(node, &ctx->aio_handlers, node, tmp) {
         int revents;
 
@@ -356,12 +341,18 @@ bool aio_dispatch(AioContext *ctx)
         }
     }
 
-    qemu_lockcnt_dec(&ctx->list_lock);
+    return progress;
+}
 
-    /* Run our timers */
-    progress |= timerlistgroup_run_timers(&ctx->tlg);
+void aio_dispatch(AioContext *ctx)
+{
+    aio_bh_poll(ctx);
 
-    return progress;
+    qemu_lockcnt_inc(&ctx->list_lock);
+    aio_dispatch_handlers(ctx);
+    qemu_lockcnt_dec(&ctx->list_lock);
+
+    timerlistgroup_run_timers(&ctx->tlg);
 }
 
 /* These thread-local variables are used only in a small part of aio_poll
@@ -472,11 +463,13 @@ bool aio_poll(AioContext *ctx, bool blocking)
     npfd = 0;
     qemu_lockcnt_dec(&ctx->list_lock);
 
-    /* Run dispatch even if there were no readable fds to run timers */
-    if (aio_dispatch(ctx)) {
-        progress = true;
-    }
+    progress |= aio_bh_poll(ctx);
+
+    qemu_lockcnt_inc(&ctx->list_lock);
+    progress |= aio_dispatch_handlers(ctx);
+    qemu_lockcnt_dec(&ctx->list_lock);
 
+    progress |= timerlistgroup_run_timers(&ctx->tlg);
     return progress;
 }
 
diff --git a/aio-win32.c b/aio-win32.c
index 4479d3f..feffdc4 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -286,14 +286,11 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
     return progress;
 }
 
-bool aio_dispatch(AioContext *ctx)
+void aio_dispatch(AioContext *ctx)
 {
-    bool progress;
-
-    progress = aio_bh_poll(ctx);
-    progress |= aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
-    progress |= timerlistgroup_run_timers(&ctx->tlg);
-    return progress;
+    aio_bh_poll(ctx);
+    aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
+    timerlistgroup_run_timers(&ctx->tlg);
 }
 
 bool aio_poll(AioContext *ctx, bool blocking)
diff --git a/include/block/aio.h b/include/block/aio.h
index 21044af..05a41b2 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -277,7 +277,7 @@ bool aio_pending(AioContext *ctx);
  *
  * This is used internally in the implementation of the GSource.
  */
-bool aio_dispatch(AioContext *ctx);
+void aio_dispatch(AioContext *ctx);
 
 /* Progress in completing AIO work to occur.  This can issue new pending
  * aio as a result of executing I/O completion or bh callbacks.
-- 
1.8.3.1


* [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (37 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Pull the increment/decrement pair out of aio_bh_poll and into the
callers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 aio-posix.c |  7 ++-----
 aio-win32.c |  9 ++++++---
 async.c     | 12 ++++++------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/aio-posix.c b/aio-posix.c
index b372816..23dcbaa 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -346,9 +346,8 @@ static bool aio_dispatch_handlers(AioContext *ctx)
 
 void aio_dispatch(AioContext *ctx)
 {
-    aio_bh_poll(ctx);
-
     qemu_lockcnt_inc(&ctx->list_lock);
+    aio_bh_poll(ctx);
     aio_dispatch_handlers(ctx);
     qemu_lockcnt_dec(&ctx->list_lock);
 
@@ -461,12 +460,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
     }
 
     npfd = 0;
-    qemu_lockcnt_dec(&ctx->list_lock);
 
     progress |= aio_bh_poll(ctx);
-
-    qemu_lockcnt_inc(&ctx->list_lock);
     progress |= aio_dispatch_handlers(ctx);
+
     qemu_lockcnt_dec(&ctx->list_lock);
 
     progress |= timerlistgroup_run_timers(&ctx->tlg);
diff --git a/aio-win32.c b/aio-win32.c
index feffdc4..36d39f7 100644
--- a/aio-win32.c
+++ b/aio-win32.c
@@ -231,8 +231,6 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
     AioHandler *node;
     bool progress = false;
 
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     /*
      * We have to walk very carefully in case aio_set_fd_handler is
      * called while we're walking.
@@ -282,14 +280,15 @@ static bool aio_dispatch_handlers(AioContext *ctx, HANDLE event)
         }
     }
 
-    qemu_lockcnt_dec(&ctx->list_lock);
     return progress;
 }
 
 void aio_dispatch(AioContext *ctx)
 {
+    qemu_lockcnt_inc(&ctx->list_lock);
     aio_bh_poll(ctx);
     aio_dispatch_handlers(ctx, INVALID_HANDLE_VALUE);
+    qemu_lockcnt_dec(&ctx->list_lock);
     timerlistgroup_run_timers(&ctx->tlg);
 }
 
@@ -350,6 +349,7 @@ bool aio_poll(AioContext *ctx, bool blocking)
 
         if (first) {
             aio_notify_accept(ctx);
+            qemu_lockcnt_inc(&ctx->list_lock);
             progress |= aio_bh_poll(ctx);
             first = false;
         }
@@ -369,6 +369,9 @@ bool aio_poll(AioContext *ctx, bool blocking)
         progress |= aio_dispatch_handlers(ctx, event);
     } while (count > 0);
 
+    assert(!first); /* and lockcnt incremented */
+    qemu_lockcnt_dec(&ctx->list_lock);
+
     progress |= timerlistgroup_run_timers(&ctx->tlg);
 
     return progress;
diff --git a/async.c b/async.c
index 529934c..db05243 100644
--- a/async.c
+++ b/async.c
@@ -64,15 +64,16 @@ void aio_bh_call(QEMUBH *bh)
     bh->cb(bh->opaque);
 }
 
-/* Multiple occurrences of aio_bh_poll cannot be called concurrently */
+/* Multiple occurrences of aio_bh_poll cannot be called concurrently.
+ * The count in ctx->list_lock is incremented before the call, and is
+ * not affected by the call.
+ */
 int aio_bh_poll(AioContext *ctx)
 {
     QEMUBH *bh, **bhp, *next;
     int ret;
     bool deleted = false;
 
-    qemu_lockcnt_inc(&ctx->list_lock);
-
     ret = 0;
     for (bh = atomic_rcu_read(&ctx->first_bh); bh; bh = next) {
         next = atomic_rcu_read(&bh->next);
@@ -98,11 +99,10 @@ int aio_bh_poll(AioContext *ctx)
 
     /* remove deleted bhs */
     if (!deleted) {
-        qemu_lockcnt_dec(&ctx->list_lock);
         return ret;
     }
 
-    if (qemu_lockcnt_dec_and_lock(&ctx->list_lock)) {
+    if (qemu_lockcnt_dec_if_lock(&ctx->list_lock)) {
         bhp = &ctx->first_bh;
         while (*bhp) {
             bh = *bhp;
@@ -113,7 +113,7 @@ int aio_bh_poll(AioContext *ctx)
                 bhp = &bh->next;
             }
         }
-        qemu_lockcnt_unlock(&ctx->list_lock);
+        qemu_lockcnt_inc_and_unlock(&ctx->list_lock);
     }
     return ret;
 }
-- 
1.8.3.1


* [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (38 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
@ 2015-11-24 18:01 ` Paolo Bonzini
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  40 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-24 18:01 UTC (permalink / raw)
  To: qemu-devel, qemu-block

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 dma-helpers.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/dma-helpers.c b/dma-helpers.c
index 68f6f07..b3e19ba 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -210,9 +210,25 @@ BlockAIOCB *dma_blk_io(
     dbs->sg_cur_byte = 0;
     dbs->dir = dir;
     dbs->io_func = io_func;
-    dbs->bh = NULL;
     qemu_iovec_init(&dbs->iov, sg->nsg);
-    dma_blk_cb(dbs, 0);
+
+    /* FIXME: dma_blk_io usually is called while the thread has acquired the
+     * AioContext, for consistency with blk_aio_readv.  However, dma_memory_map
+     * may acquire the iothread lock and thus requires being called with _no_ lock.
+     *
+     * The solution would be to call dma_blk_io without the AioContext lock,
+     * but this requires changes in the core block and SCSI layers.  In the
+     * interest of doing things a step at a time, defer to a bottom half
+     * if the iothread lock isn't taken.  The bottom half is called without
+     * the AioContext lock taken.
+     */
+    if (qemu_mutex_iothread_locked()) {
+        dbs->bh = NULL;
+        dma_blk_cb(dbs, 0);
+    } else {
+        dbs->bh = aio_bh_new(dbs->ctx, reschedule_dma, dbs);
+        qemu_bh_schedule(dbs->bh);
+    }
     return &dbs->common;
 }
 
-- 
1.8.3.1

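The dispatch logic in dma_blk_io, generalized as a sketch; run_or_defer is
a made-up helper, not part of the patch:

    /* Sketch only: if the BQL is already held, any BQL-taking path
     * inside dma_memory_map is safe and cb can run directly.  Otherwise
     * (IOThread) defer to a bottom half, which runs with no lock held,
     * so dma_memory_map can take the BQL without holding the AioContext
     * lock and the inversion cannot occur.  The real patch stashes the
     * BH in dbs->bh so the callback can qemu_bh_delete() it. */
    static void run_or_defer(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
    {
        if (qemu_mutex_iothread_locked()) {
            cb(opaque);
        } else {
            QEMUBH *bh = aio_bh_new(ctx, cb, opaque);
            qemu_bh_schedule(bh);
        }
    }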

* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
                   ` (39 preceding siblings ...)
  2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
@ 2015-11-26  9:36 ` Christian Borntraeger
  2015-11-26  9:41   ` Christian Borntraeger
  2015-11-26 10:39   ` Paolo Bonzini
  40 siblings, 2 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-11-26  9:36 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
> This large series is basically all that I would like to get into 2.6.
> It is a combination of several pieces of work on dataplane and
> multithreaded block layer.
> 
> It's also a large part of why I would like someone else to look at
> miscellaneous patches for a while (in case you've missed that).  I
> can foresee that following the reviews is going to be a huge time drain.
> 
> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
> and 5 VCPUs).  The bulk of the improvement actually comes from the first
> 8 patches, but the rest of the series is what prepares for what's next
> to come in QEMU 2.7 and later, such as a multiqueue block layer.
> 
> It's tedious to review, with some pretty large patches (3, 32, 33, 35).

For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
old z196 (I have not yet tested the z13).

your branch is certainly better regarding malloc, but worse in other areas.

your branch:

# Overhead Command Shared Object Symbol
# ........ ............... ....................... .......................................
#
3.99% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.66% qemu-system-s39 qemu-system-s390x [.] address_space_lduw_le
2.51% qemu-system-s39 qemu-system-s390x [.] address_space_map
2.51% qemu-system-s39 qemu-system-s390x [.] phys_page_find
2.24% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
2.18% qemu-system-s39 libc-2.18.so [.] _int_malloc
2.18% qemu-system-s39 qemu-system-s390x [.] address_space_translate_internal
2.03% qemu-system-s39 libc-2.18.so [.] malloc
1.91% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
1.66% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.63% qemu-system-s39 qemu-system-s390x [.] address_space_stw_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_stl_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_translate
1.45% qemu-system-s39 qemu-system-s390x [.] virtqueue_pop
1.33% qemu-system-s39 libc-2.18.so [.] _int_free
1.27% qemu-system-s39 [kernel.vmlinux] [k] system_call
1.00% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_enter
0.94% qemu-system-s39 libc-2.18.so [.] __sigsetjmp
0.94% qemu-system-s39 libc-2.18.so [.] free
0.91% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.88% qemu-system-s39 qemu-system-s390x [.] virtio_blk_handle_request
0.82% qemu-system-s39 qemu-system-s390x [.] object_unref
0.79% qemu-system-s39 qemu-system-s390x [.] vring_desc_read
0.76% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_block
0.73% qemu-system-s39 libglib-2.0.so.0.3800.2 [.] g_free

2.5.0-rc1:


# Overhead Command Shared Object Symbol
# ........ ............... ....................... .......................................
#
5.10% qemu-system-s39 libc-2.18.so [.] _int_malloc
3.30% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.83% qemu-system-s39 qemu-system-s390x [.] memory_region_find_rcu
2.72% qemu-system-s39 qemu-system-s390x [.] vring_pop
2.27% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
2.18% qemu-system-s39 libc-2.18.so [.] _int_free
1.99% qemu-system-s39 libc-2.18.so [.] malloc
1.37% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.37% qemu-system-s39 qemu-system-s390x [.] aio_bh_poll
1.37% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
1.34% qemu-system-s39 libc-2.18.so [.] free
1.29% qemu-system-s39 libc-2.18.so [.] malloc_consolidate
1.18% qemu-system-s39 qemu-system-s390x [.] memory_region_find
1.09% qemu-system-s39 libglib-2.0.so.0.3800.2 [.] g_malloc
1.06% qemu-system-s39 [kernel.vmlinux] [k] system_call
0.95% qemu-system-s39 [kernel.vmlinux] [k] __schedule
0.95% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_enter
0.92% qemu-system-s39 qemu-system-s390x [.] get_desc.isra.11
0.92% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.87% qemu-system-s39 [kernel.vmlinux] [k] enqueue_entity
0.84% qemu-system-s39 qemu-system-s390x [.] vring_push
0.78% qemu-system-s39 [kernel.vmlinux] [k] set_next_entity
0.73% qemu-system-s39 [kernel.vmlinux] [k] kvm_arch_vcpu_ioctl_run
0.73% qemu-system-s39 [vdso] [.] __kernel_clock_gettime
0.73% qemu-system-s39 qemu-system-s390x [.] virtio_blk_handle_request
0.73% qemu-system-s39 qemu-system-s390x [.] vring_map
0.67% qemu-system-s39 [kernel.vmlinux] [k] account_system_time
0.67% qemu-system-s39 [kernel.vmlinux] [k] set_adapter_int
0.67% qemu-system-s39 qemu-system-s390x [.] addrrange_intersection


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
@ 2015-11-26  9:41   ` Christian Borntraeger
  2015-11-26 10:39   ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-11-26  9:41 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 11/26/2015 10:36 AM, Christian Borntraeger wrote:
> On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
>> This large series is basically all that I would like to get into 2.6.
>> It is a combination of several pieces of work on dataplane and
>> multithreaded block layer.
>>
>> It's also a large part of why I would like someone else to look at
>> miscellaneous patches for a while (in case you've missed that).  I
>> can foresee that following the reviews is going to be a huge time drain.
>>
>> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
>> and 5 VCPUs).  The bulk of the improvement actually comes from the first
>> 8 patches, but the rest of the series is what prepares for what's next
>> to come in QEMU 2.7 and later, such as a multiqueue block layer.
>>
>> It's tedious to review, with some pretty large patches (3, 32, 33, 35).
> On 11/24/2015 07:00 PM, Paolo Bonzini wrote:
>> This large series is basically all that I would like to get into 2.6.
>> It is a combination of several pieces of work on dataplane and
>> multithreaded block layer.
>>
>> It's also a large part of why I would like someone else to look at
>> miscellaneous patches for a while (in case you've missed that).  I
>> can foresee that following the reviews is going to be a huge time drain.
>>
>> With it I can get ~1300 Kiops on 8 disks (which I achieve with 2 iothreads
>> and 5 VCPUs).  The bulk of the improvement actually comes from the first
>> 8 patches, but the rest of the series is what prepares for what's next
>> to come in QEMU 2.7 and later, such as a multiqueue block layer.
> 
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196 (I have not yet tested the z13).
> 
> your branch is certainly better regarding malloc, but worse in other areas.

Using the first 8 patches or so (commit be2f6b163e2b2a604f52a258fd932142c5974ffe
    vring: slim down allocation of VirtQueueElements)
is slightly faster than 2.5.0-rc1, so the regression seems to come from some
of the later patches.


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
  2015-11-26  9:41   ` Christian Borntraeger
@ 2015-11-26 10:39   ` Paolo Bonzini
  2015-12-09 20:35     ` Paolo Bonzini
  1 sibling, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-26 10:39 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 26/11/2015 10:36, Christian Borntraeger wrote:
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196 (I have not yet tested the z13).
> 
> your branch is certainly better regarding malloc, but worse in other areas.

Thanks for taking the time to test this!

This is correct, see the cover letter:

"[Patches 14 to 16 remove] the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane. While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series.

Patch 10 alone mostly brings performance in par between the two.
The remaining 7-8% can be recovered by mostly getting rid of tiny
address_space_* operations, keeping the rings always mapped. Note that
the rest of this big series does bring a little performance improvement,
and already makes up for the lost performance."

The profile shows that the culprit is the repeated access
to the virtio ring:

3.99% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.66% qemu-system-s39 qemu-system-s390x [.] address_space_lduw_le
2.51% qemu-system-s39 qemu-system-s390x [.] address_space_map
2.51% qemu-system-s39 qemu-system-s390x [.] phys_page_find
2.24% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
2.18% qemu-system-s39 qemu-system-s390x [.] address_space_translate_internal
1.91% qemu-system-s39 qemu-system-s390x [.] qemu_coroutine_switch
1.66% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.63% qemu-system-s39 qemu-system-s390x [.] address_space_stw_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_stl_le
1.57% qemu-system-s39 qemu-system-s390x [.] address_space_translate
1.45% qemu-system-s39 qemu-system-s390x [.] virtqueue_pop
0.91% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.79% qemu-system-s39 qemu-system-s390x [.] vring_desc_read
0.76% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_block
-----------
28.33%

3.30% qemu-system-s39 libc-2.18.so [.] __memcpy_z196
2.83% qemu-system-s39 qemu-system-s390x [.] memory_region_find_rcu
2.72% qemu-system-s39 qemu-system-s390x [.] vring_pop
1.37% qemu-system-s39 qemu-system-s390x [.] address_space_rw
1.37% qemu-system-s39 qemu-system-s390x [.] qemu_get_ram_ptr
1.18% qemu-system-s39 qemu-system-s390x [.] memory_region_find
0.92% qemu-system-s39 qemu-system-s390x [.] get_desc.isra.11
0.92% qemu-system-s39 qemu-system-s390x [.] qemu_ram_block_from_host
0.84% qemu-system-s39 qemu-system-s390x [.] vring_push
-----------
15.45%

I would really prefer to get rid of vring.c as soon as the infrastructure
makes it possible---even if it's faster. We know what makes virtio.c
slower, and it's simpler to fix virtio.c than to convert all the other
models to vring.c _plus_ make vring.c safe for migration.

Paolo

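What "getting rid of tiny address_space_* operations, keeping the rings
always mapped" could look like, as a rough sketch; the offset and the
variable names are illustrative, not actual code from the branch:

    /* Sketch only: map the avail ring once at setup, then read the
     * index with a direct load instead of an address_space_lduw_le
     * call per access.  Unmapping and error handling are elided. */
    hwaddr len = avail_ring_size;
    void *avail = address_space_map(as, avail_pa, &len, false);

    /* hot path, per request: */
    uint16_t avail_idx = lduw_le_p((uint8_t *)avail + 2);  /* skip flags */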

* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
@ 2015-11-30  2:27   ` Fam Zheng
  2015-11-30  2:33     ` Fam Zheng
  2015-11-30 16:35   ` Greg Kurz
  1 sibling, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  2:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Prepare for moving the allocation to virtqueue_pop.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/9pfs/virtio-9p-device.c |  7 +------
>  hw/9pfs/virtio-9p.c        | 10 +++-------
>  hw/9pfs/virtio-9p.h        |  2 --
>  3 files changed, 4 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index e3abcfa..72a93c2 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      V9fsState *s = VIRTIO_9P(dev);
> -    int i, len;
> +    int len;
>      struct stat stat;
>      FsDriverEntry *fse;
>      V9fsPath path;
> @@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
>                  sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
>  
> -    /* initialize pdu allocator */
> -    QLIST_INIT(&s->free_list);
>      QLIST_INIT(&s->active_list);
> -    for (i = 0; i < (MAX_REQ - 1); i++) {
> -        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
> -    }
>  
>      s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
>  
> diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
> index f972731..747306b 100644
> --- a/hw/9pfs/virtio-9p.c
> +++ b/hw/9pfs/virtio-9p.c
> @@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
>  
>  static V9fsPDU *alloc_pdu(V9fsState *s)
>  {
> -    V9fsPDU *pdu = NULL;
> +    V9fsPDU *pdu = g_new(V9fsPDU, 1);
>  
> -    if (!QLIST_EMPTY(&s->free_list)) {
> -        pdu = QLIST_FIRST(&s->free_list);
> -        QLIST_REMOVE(pdu, next);
> -        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
> -    }
> +    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
>      return pdu;
>  }

OK, and I think handle_9p_output no longer needs to check the returned pointer.

Fam

>  
> @@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
>           */
>          if (!pdu->cancelled) {
>              QLIST_REMOVE(pdu, next);
> -            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
>          }
> +        g_free(pdu);
>      }
>  }
>  
> diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
> index d7a4dc1..1fb4ff9 100644
> --- a/hw/9pfs/virtio-9p.h
> +++ b/hw/9pfs/virtio-9p.h
> @@ -201,8 +201,6 @@ typedef struct V9fsState
>  {
>      VirtIODevice parent_obj;
>      VirtQueue *vq;
> -    V9fsPDU pdus[MAX_REQ];
> -    QLIST_HEAD(, V9fsPDU) free_list;
>      QLIST_HEAD(, V9fsPDU) active_list;
>      V9fsFidState *fid_list;
>      FileOperations *ops;
> -- 
> 1.8.3.1
> 
> 
> 


* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-30  2:27   ` Fam Zheng
@ 2015-11-30  2:33     ` Fam Zheng
  0 siblings, 0 replies; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  2:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Mon, 11/30 10:27, Fam Zheng wrote:
> On Tue, 11/24 19:00, Paolo Bonzini wrote:
> OK, and I think handle_9p_output no longer needs to check the returned pointer.

Yes it's removed in patch 3, good.

Fam


* Re: [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
@ 2015-11-30  3:00   ` Fam Zheng
  0 siblings, 0 replies; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  3:00 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> @@ -436,10 +454,11 @@ static void control_out(VirtIODevice *vdev, VirtQueue *vq)
>              buf = g_malloc(cur_len);
>              len = cur_len;
>          }
> -        iov_to_buf(elem.out_sg, elem.out_num, 0, buf, cur_len);
> +        iov_to_buf(elem->out_sg, elem->out_num, 0, buf, cur_len);
>  
>          handle_control_message(vser, buf, cur_len);
> -        virtqueue_push(vq, &elem, 0);
> +        virtqueue_push(vq, elem, 0);
> +	g_free(elem);

s/^\t/        / ?


* Re: [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
@ 2015-11-30  3:24   ` Fam Zheng
  2015-11-30  8:36     ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  3:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Build the addresses and s/g lists on the stack, and then copy them
> to a VirtQueueElement that is just as big as required to contain this
> particular s/g list.  The cost of the copy is minimal compared to that
> of a large malloc.
> 
> When virtqueue_map is used on the destination side of migration or on
> loadvm, the iovecs have already been split at memory region boundary,
> so we can just reuse the out_num/in_num we find in the file.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/virtio/virtio.c | 82 +++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 51 insertions(+), 31 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 32c89eb..0163d0f 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -448,6 +448,32 @@ int virtqueue_avail_bytes(VirtQueue *vq, unsigned int in_bytes,
>      return in_bytes <= in_total && out_bytes <= out_total;
>  }
>  
> +static void virtqueue_map_desc(unsigned int *p_num_sg, hwaddr *addr, struct iovec *iov,
> +                               unsigned int max_num_sg, bool is_write,
> +                               hwaddr pa, size_t sz)
> +{
> +    unsigned num_sg = *p_num_sg;
> +    assert(num_sg <= max_num_sg);
> +
> +    while (sz) {
> +        hwaddr len = sz;
> +
> +        if (num_sg == max_num_sg) {
> +            error_report("virtio: too many write descriptors in indirect table");
> +            exit(1);
> +        }
> +
> +        iov[num_sg].iov_base = cpu_physical_memory_map(pa, &len, is_write);
> +        iov[num_sg].iov_len = len;
> +        addr[num_sg] = pa;
> +
> +        sz -= len;
> +        pa += len;
> +        num_sg++;
> +    }
> +    *p_num_sg = num_sg;
> +}
> +
>  static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
>                                  unsigned int *num_sg, unsigned int max_size,
>                                  int is_write)
> @@ -474,20 +500,10 @@ static void virtqueue_map_iovec(struct iovec *sg, hwaddr *addr,
>              error_report("virtio: error trying to map MMIO memory");
>              exit(1);
>          }
> -        if (len == sg[i].iov_len) {
> -            continue;
> -        }
> -        if (*num_sg >= max_size) {
> -            error_report("virtio: memory split makes iovec too large");
> +        if (len != sg[i].iov_len) {
> +            error_report("virtio: unexpected memory split");
>              exit(1);
>          }
> -        memmove(sg + i + 1, sg + i, sizeof(*sg) * (*num_sg - i));
> -        memmove(addr + i + 1, addr + i, sizeof(*addr) * (*num_sg - i));
> -        assert(len < sg[i + 1].iov_len);
> -        sg[i].iov_len = len;
> -        addr[i + 1] += len;
> -        sg[i + 1].iov_len -= len;
> -        ++*num_sg;
>      }
>  }
>  
> @@ -525,14 +541,16 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>      hwaddr desc_pa = vq->vring.desc;
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem;
> +    unsigned out_num, in_num;
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
>  
>      if (!virtqueue_num_heads(vq, vq->last_avail_idx)) {
>          return NULL;
>      }
>  
>      /* When we start there are none of either input nor output. */
> -    elem = virtqueue_alloc_element(sz, VIRTQUEUE_MAX_SIZE, VIRTQUEUE_MAX_SIZE);
> -    elem->out_num = elem->in_num = 0;
> +    out_num = in_num = 0;
>  
>      max = vq->vring.num;
>  
> @@ -555,37 +573,39 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>  
>      /* Collect all the descriptors */
>      do {
> -        struct iovec *sg;
> +        hwaddr pa = vring_desc_addr(vdev, desc_pa, i);
> +        size_t len = vring_desc_len(vdev, desc_pa, i);
>  
>          if (vring_desc_flags(vdev, desc_pa, i) & VRING_DESC_F_WRITE) {
> -            if (elem->in_num >= VIRTQUEUE_MAX_SIZE) {
> -                error_report("Too many write descriptors in indirect table");
> -                exit(1);
> -            }
> -            elem->in_addr[elem->in_num] = vring_desc_addr(vdev, desc_pa, i);
> -            sg = &elem->in_sg[elem->in_num++];
> +            virtqueue_map_desc(&in_num, addr + out_num, iov + out_num,
> +                               VIRTQUEUE_MAX_SIZE - out_num, 1, pa, len);
>          } else {
> -            if (elem->out_num >= VIRTQUEUE_MAX_SIZE) {
> -                error_report("Too many read descriptors in indirect table");
> +            if (in_num) {
> +                error_report("Incorrect order for descriptors");
>                  exit(1);
>              }
> -            elem->out_addr[elem->out_num] = vring_desc_addr(vdev, desc_pa, i);
> -            sg = &elem->out_sg[elem->out_num++];
> +            virtqueue_map_desc(&out_num, addr, iov,
> +                               VIRTQUEUE_MAX_SIZE, 0, pa, len);
>          }
>  
> -        sg->iov_len = vring_desc_len(vdev, desc_pa, i);
> -
>          /* If we've got too many, that implies a descriptor loop. */
> -        if ((elem->in_num + elem->out_num) > max) {
> +        if ((in_num + out_num) > max) {
>              error_report("Looped descriptor");
>              exit(1);
>          }
>      } while ((i = virtqueue_next_desc(vdev, desc_pa, i, max)) != max);
>  
> -    /* Now map what we have collected */
> -    virtqueue_map(elem);
> -
> +    /* Now copy what we have collected and mapped */
> +    elem = virtqueue_alloc_element(sz, out_num, in_num);
>      elem->index = head;
> +    for (i = 0; i < out_num; i++) {
> +        elem->out_addr[i] = addr[i];
> +        elem->out_sg[i] = iov[i];
> +    }

Isn't memcpy more efficient here? Otherwise looks good.

Fam

> +    for (i = 0; i < in_num; i++) {
> +        elem->in_addr[i] = addr[out_num + i];
> +        elem->in_sg[i] = iov[out_num + i];
> +    }
>  


* Re: [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements
  2015-11-30  3:24   ` Fam Zheng
@ 2015-11-30  8:36     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-30  8:36 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, qemu-block



On 30/11/2015 04:24, Fam Zheng wrote:
> > +    for (i = 0; i < out_num; i++) {
> > +        elem->out_addr[i] = addr[i];
> > +        elem->out_sg[i] = iov[i];
> > +    }
>
> Isn't memcpy more efficient here? Otherwise looks good.

Probably not, out_num/in_num is usually very small, in fact one of them
is often 0 or 1.

For example the memcpy in address_space_rw is awfully inefficient.  This
is roughly the same.

Paolo

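A sketch of why the open-coded loop is reasonable here: with a count that
is almost always 0, 1 or 2, the compiler can turn the loop into a couple
of moves, while memcpy with a runtime length generally stays an opaque
library call.  This is the loop from the patch, annotated:

    /* Sketch only; equivalent to the copy in virtqueue_pop above. */
    for (i = 0; i < out_num; i++) {      /* out_num is typically 0..2 */
        elem->out_addr[i] = addr[i];
        elem->out_sg[i] = iov[i];
    }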

* Re: [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
@ 2015-11-30  9:47   ` Fam Zheng
  2015-11-30 10:37     ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Fam Zheng @ 2015-11-30  9:47 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 11/24 19:00, Paolo Bonzini wrote:
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  hw/virtio/virtio.c | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 93 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index fd63206..f5f8108 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -578,14 +578,105 @@ void *virtqueue_pop(VirtQueue *vq, size_t sz)
>  void *qemu_get_virtqueue_element(QEMUFile *f, size_t sz)
>  {
>      VirtQueueElement *elem = g_malloc(sz);
> -    qemu_get_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
> +    bool swap;
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    uint64_t scratch;
> +    int i;
> +
> +    qemu_get_be32s(f, &elem->index);
> +    qemu_get_be32s(f, &elem->out_num);
> +    qemu_get_be32s(f, &elem->in_num);
> +
> +    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);

This is interesting: out_num and in_num are 32-bit numbers, but their max values
are both VIRTQUEUE_MAX_SIZE (thanks for explaining this on IRC), so a nonzero
high half is a clue that the source wrote the stream with a different endianness.

Probably worth a few comments here?

It's a great patch!  Thanks!

Fam

> +    if (swap) {
> +        bswap32s(&elem->index);
> +        bswap32s(&elem->out_num);
> +        bswap32s(&elem->in_num);
> +    }
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_get_be64s(f, &elem->in_addr[i]);
> +        if (swap) {
> +            bswap64s(&elem->in_addr[i]);
> +        }
> +    }
> +    if (i < ARRAY_SIZE(addr)) {
> +        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +    }
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_get_be64s(f, &elem->out_addr[i]);
> +        if (swap) {
> +            bswap64s(&elem->out_addr[i]);
> +        }
> +    }
> +    if (i < ARRAY_SIZE(addr)) {
> +        qemu_get_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +    }
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        (void) qemu_get_be64(f); /* base */
> +	qemu_get_be64s(f, &scratch); /* length */
> +        if (swap) {
> +            bswap64s(&scratch);
> +        }
> +	elem->in_sg[i].iov_len = scratch;
> +    }
> +    if (i < ARRAY_SIZE(iov)) {
> +        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +    }
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        (void) qemu_get_be64(f); /* base */
> +        qemu_get_be64s(f, &scratch); /* length */
> +        if (swap) {
> +            bswap64s(&scratch);
> +        }
> +	elem->out_sg[i].iov_len = scratch;
> +    }
> +    if (i < ARRAY_SIZE(iov)) {
> +        qemu_get_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +    }
> +
>      virtqueue_map(elem);
>      return elem;
>  }
>  
>  void qemu_put_virtqueue_element(QEMUFile *f, VirtQueueElement *elem)
>  {
> -    qemu_put_buffer(f, (uint8_t *)elem, sizeof(VirtQueueElement));
> +    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    int i;
> +
> +    memset(addr, 0, sizeof(addr));
> +    memset(iov, 0, sizeof(iov));
> +
> +    qemu_put_be32s(f, &elem->index);
> +    qemu_put_be32s(f, &elem->out_num);
> +    qemu_put_be32s(f, &elem->in_num);
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_put_be64s(f, &elem->in_addr[i]);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_put_be64s(f, &elem->out_addr[i]);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)addr, sizeof(addr) - i * sizeof(addr[0]));
> +
> +    for (i = 0; i < elem->in_num; i++) {
> +        qemu_put_be64(f, 0);
> +        qemu_put_be64(f, elem->in_sg[i].iov_len);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
> +
> +    for (i = 0; i < elem->out_num; i++) {
> +        qemu_put_be64(f, 0);
> +        qemu_put_be64(f, elem->out_sg[i].iov_len);
> +    }
> +    qemu_put_buffer(f, (uint8_t *)iov, sizeof(iov) - i * sizeof(iov[0]));
>  }
>  
>  /* virtio device */
> -- 
> 1.8.3.1
> 
> 
> 

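The heuristic Fam is referring to, as a worked example (a sketch, not code
from the patch):

    /* out_num and in_num never exceed VIRTQUEUE_MAX_SIZE (1024), so a
     * stream written with the expected endianness always has the top 16
     * bits clear.  A byte-swapped small value does not: 0x00000002 reads
     * back as 0x02000000, which trips the check and triggers bswap. */
    uint32_t good = 2;              /* 0x00000002 */
    uint32_t bad  = bswap32(good);  /* 0x02000000 */
    assert(!(good & 0xFFFF0000));
    assert(  bad  & 0xFFFF0000);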

* Re: [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time
  2015-11-30  9:47   ` Fam Zheng
@ 2015-11-30 10:37     ` Paolo Bonzini
  0 siblings, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-11-30 10:37 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, qemu-block



On 30/11/2015 10:47, Fam Zheng wrote:
>> > +    swap = (elem->out_num & 0xFFFF0000) || (elem->in_num & 0xFFFF0000);
> This is interesting, out_num and in_num are 32 bit numbers but there max values
> are both VIRTQUEUE_MAX_SIZE (thanks for explaining this on IRC), so it can be a
> clue for the source using a different endianness.

Yes, good idea.

Paolo


* Re: [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free
  2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
  2015-11-30  2:27   ` Fam Zheng
@ 2015-11-30 16:35   ` Greg Kurz
  1 sibling, 0 replies; 56+ messages in thread
From: Greg Kurz @ 2015-11-30 16:35 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, qemu-block

On Tue, 24 Nov 2015 19:00:52 +0100
Paolo Bonzini <pbonzini@redhat.com> wrote:

> Prepare for moving the allocation to virtqueue_pop.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---

And aside from this series' goal, it makes the code nicer :)

>  hw/9pfs/virtio-9p-device.c |  7 +------
>  hw/9pfs/virtio-9p.c        | 10 +++-------
>  hw/9pfs/virtio-9p.h        |  2 --
>  3 files changed, 4 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c
> index e3abcfa..72a93c2 100644
> --- a/hw/9pfs/virtio-9p-device.c
> +++ b/hw/9pfs/virtio-9p-device.c
> @@ -57,7 +57,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>  {
>      VirtIODevice *vdev = VIRTIO_DEVICE(dev);
>      V9fsState *s = VIRTIO_9P(dev);
> -    int i, len;
> +    int len;
>      struct stat stat;
>      FsDriverEntry *fse;
>      V9fsPath path;
> @@ -65,12 +65,7 @@ static void virtio_9p_device_realize(DeviceState *dev, Error **errp)
>      virtio_init(vdev, "virtio-9p", VIRTIO_ID_9P,
>                  sizeof(struct virtio_9p_config) + MAX_TAG_LEN);
> 
> -    /* initialize pdu allocator */
> -    QLIST_INIT(&s->free_list);
>      QLIST_INIT(&s->active_list);
> -    for (i = 0; i < (MAX_REQ - 1); i++) {
> -        QLIST_INSERT_HEAD(&s->free_list, &s->pdus[i], next);
> -    }
> 
>      s->vq = virtio_add_queue(vdev, MAX_REQ, handle_9p_output);
> 
> diff --git a/hw/9pfs/virtio-9p.c b/hw/9pfs/virtio-9p.c
> index f972731..747306b 100644
> --- a/hw/9pfs/virtio-9p.c
> +++ b/hw/9pfs/virtio-9p.c
> @@ -565,13 +565,9 @@ static int fid_to_qid(V9fsPDU *pdu, V9fsFidState *fidp, V9fsQID *qidp)
> 
>  static V9fsPDU *alloc_pdu(V9fsState *s)
>  {
> -    V9fsPDU *pdu = NULL;
> +    V9fsPDU *pdu = g_new(V9fsPDU, 1);
> 
> -    if (!QLIST_EMPTY(&s->free_list)) {
> -        pdu = QLIST_FIRST(&s->free_list);
> -        QLIST_REMOVE(pdu, next);
> -        QLIST_INSERT_HEAD(&s->active_list, pdu, next);
> -    }
> +    QLIST_INSERT_HEAD(&s->active_list, pdu, next);
>      return pdu;
>  }
> 
> @@ -584,8 +580,8 @@ static void free_pdu(V9fsState *s, V9fsPDU *pdu)
>           */
>          if (!pdu->cancelled) {
>              QLIST_REMOVE(pdu, next);
> -            QLIST_INSERT_HEAD(&s->free_list, pdu, next);
>          }
> +        g_free(pdu);
>      }
>  }
> 
> diff --git a/hw/9pfs/virtio-9p.h b/hw/9pfs/virtio-9p.h
> index d7a4dc1..1fb4ff9 100644
> --- a/hw/9pfs/virtio-9p.h
> +++ b/hw/9pfs/virtio-9p.h
> @@ -201,8 +201,6 @@ typedef struct V9fsState
>  {
>      VirtIODevice parent_obj;
>      VirtQueue *vq;
> -    V9fsPDU pdus[MAX_REQ];
> -    QLIST_HEAD(, V9fsPDU) free_list;
>      QLIST_HEAD(, V9fsPDU) active_list;
>      V9fsFidState *fid_list;
>      FileOperations *ops;


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-11-26 10:39   ` Paolo Bonzini
@ 2015-12-09 20:35     ` Paolo Bonzini
  2015-12-16 12:54       ` Christian Borntraeger
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2015-12-09 20:35 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 26/11/2015 11:39, Paolo Bonzini wrote:
> I would really prefer to get rid of vring.c as soon as the infrastructure
> makes it possible---even if it's faster. We know what makes virtio.c
> slower, and it's simpler to fix virtio.c than to convert all the other
> models to vring.c _plus_ make vring.c safe for migration.

I've now pushed some optimizations of exec.c to the same place (branch
dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
of an address_space_read is constant, and the target ends up being RAM,
the compiler can inline address_space_read in the caller and in
particular eliminate the memcpy.
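
As a toy model of the fast path (all names made up, not the actual
code; 'guest_ram' stands in for the RAMBlock backing the address
space):

	#include <stdint.h>
	#include <string.h>

	extern uint8_t guest_ram[];
	int slow_read(uint64_t addr, void *buf, size_t len);  /* MMIO etc. */

	static inline int fast_read(uint64_t addr, void *buf, size_t len)
	{
	    /* With a compile-time constant 'len' the compiler folds this
	     * branch away and turns the memcpy into a single load in the
	     * caller, which is the effect described above. */
	    if (__builtin_constant_p(len) && len <= 8) {
	        memcpy(buf, guest_ram + addr, len);
	        return 0;
	    }
	    return slow_read(addr, buf, len);
	}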

Paolo


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-09 20:35     ` Paolo Bonzini
@ 2015-12-16 12:54       ` Christian Borntraeger
  2015-12-16 14:40         ` Christian Borntraeger
  2015-12-16 17:42         ` Paolo Bonzini
  0 siblings, 2 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-12-16 12:54 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 12/09/2015 09:35 PM, Paolo Bonzini wrote:
> 
> 
> On 26/11/2015 11:39, Paolo Bonzini wrote:
>> I would really prefer to get rid of vring.c as soon as the infrastructure
>> makes it possible---even if it's faster. We know what makes virtio.c
>> slower, and it's simpler to fix virtio.c than to convert all the other
>> models to vring.c _plus_ make vring.c safe for migration.
> 
> I've now pushed some optimizations of exec.c to the same place (branch
> dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
> of an address_space_read is constant, and the target ends up being RAM,
> the compiler can inline address_space_read in the caller and in
> particular eliminate the memcpy.

Just some quick remarks before I leave for vacation:

Performance seems to be better than with the initial version.  I see
occasional hangs when shutting down (also with your previous
version):

(gdb) thread apply all bt

Thread 10 (Thread 0x3ff95b7f910 (LWP 9700)):
#0  0x000003ff9707cf56 in syscall () from /lib64/libc.so.6
#1  0x0000000010201586 in futex_wait (val=<optimized out>, f=<optimized out>) at /home/cborntra/REPOS/qemu/include/qemu/futex.h:26
#2  qemu_event_wait (ev=ev@entry=0x107f2cb4 <rcu_call_ready_event>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:398
#3  0x0000000010214fe2 in call_rcu_thread (opaque=<optimized out>) at /home/cborntra/REPOS/qemu/util/rcu.c:254
#4  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#5  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 9 (Thread 0x3ff9537f910 (LWP 9701)):
#0  0x000003ff9718f5fa in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9718a582 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000000010201062 in qemu_mutex_lock (mutex=<optimized out>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:69
#3  0x0000000010184ffa in aio_context_acquire (ctx=<optimized out>) at /home/cborntra/REPOS/qemu/async.c:361
#4  0x0000000010185322 in thread_pool_completion_bh (opaque=0x3ff90000ca0) at /home/cborntra/REPOS/qemu/thread-pool.c:168
#5  0x0000000010184b0c in aio_bh_call (bh=<optimized out>) at /home/cborntra/REPOS/qemu/async.c:64
#6  aio_bh_poll (ctx=ctx@entry=0x480e00d0) at /home/cborntra/REPOS/qemu/async.c:96
#7  0x0000000010191ec2 in aio_poll (ctx=0x480e00d0, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:464
#8  0x00000000100c4bfe in iothread_run (opaque=0x480dfc00) at /home/cborntra/REPOS/qemu/iothread.c:42
#9  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#10 0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 8 (Thread 0x3ff94b7f910 (LWP 9702)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e0890, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e05c0) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 7 (Thread 0x3ff8ffff910 (LWP 9703)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e1800, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e0d60) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 6 (Thread 0x3ff8f7ff910 (LWP 9704)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=0x480e1f60, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000100c4bfe in iothread_run (opaque=0x480e1cb0) at /home/cborntra/REPOS/qemu/iothread.c:42
#5  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#6  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 5 (Thread 0x3ff8e2f9910 (LWP 9707)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484ae660) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 4 (Thread 0x3ff8daf9910 (LWP 9708)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484c0900) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 3 (Thread 0x3ff8d2f9910 (LWP 9709)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484d2ba0) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 2 (Thread 0x3ff8ca7f910 (LWP 9710)):
#0  0x000003ff9718f61e in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000003ff9719262e in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0
#2  0x000003ff9718c5be in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#3  0x00000000102011da in qemu_cond_wait (cond=<optimized out>, mutex=mutex@entry=0x103c7780 <qemu_global_mutex>) at /home/cborntra/REPOS/qemu/util/qemu-thread-posix.c:141
#4  0x0000000010044604 in qemu_kvm_wait_io_event (cpu=<optimized out>) at /home/cborntra/REPOS/qemu/cpus.c:1016
#5  qemu_kvm_cpu_thread_fn (arg=0x484e4e40) at /home/cborntra/REPOS/qemu/cpus.c:1055
#6  0x000003ff971884c6 in start_thread () from /lib64/libpthread.so.0
#7  0x000003ff97082ec2 in thread_start () from /lib64/libc.so.6

Thread 1 (Thread 0x3ff98bdcbb0 (LWP 9690)):
#0  0x000003ff970766e6 in ppoll () from /lib64/libc.so.6
#1  0x00000000101908c0 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at /home/cborntra/REPOS/qemu/qemu-timer.c:312
#3  0x0000000010191f44 in aio_poll (ctx=ctx@entry=0x480e00d0, blocking=blocking@entry=true) at /home/cborntra/REPOS/qemu/aio-posix.c:447
#4  0x00000000101c8ff4 in bdrv_flush (bs=bs@entry=0x480fa160) at /home/cborntra/REPOS/qemu/block/io.c:2426
#5  0x000000001018bba6 in bdrv_close (bs=bs@entry=0x480fa160) at /home/cborntra/REPOS/qemu/block.c:1914
#6  0x000000001018c0f6 in bdrv_close_all () at /home/cborntra/REPOS/qemu/block.c:1974
#7  0x0000000010014102 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/cborntra/REPOS/qemu/vl.c:4687
(gdb) 


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-16 12:54       ` Christian Borntraeger
@ 2015-12-16 14:40         ` Christian Borntraeger
  2015-12-16 17:42         ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Christian Borntraeger @ 2015-12-16 14:40 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel, qemu-block; +Cc: mlin, famz, ming.lei, stefanha, mst

On 12/16/2015 01:54 PM, Christian Borntraeger wrote:
> On 12/09/2015 09:35 PM, Paolo Bonzini wrote:
>>
>>
>> On 26/11/2015 11:39, Paolo Bonzini wrote:
>>> I would really prefer to get rid of vring.c as soon as the infrastructure
>>> makes it possible---even if it's faster. We know what makes virtio.c
>>> slower, and it's simpler to fix virtio.c than to convert all the other
>>> models to vring.c _plus_ make vring.c safe for migration.
>>
>> I've now pushed some optimizations of exec.c to the same place (branch
>> dataplane, git://github.com/bonzini/qemu.git).  Basically if the length
>> of an address_space_read is constant, and the target ends up being RAM,
>> the compiler can inline address_space_read in the caller and in
>> particular eliminate the memcpy.
> 
> Just some quick remarks before I leave for vacation:
> 
> Performance seems to be better than with the initial version.

In fact, it seems to be faster than 2.5-rc4, at least with the single
test case on null block devices that I ran.


* Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
  2015-12-16 12:54       ` Christian Borntraeger
  2015-12-16 14:40         ` Christian Borntraeger
@ 2015-12-16 17:42         ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2015-12-16 17:42 UTC (permalink / raw)
  To: Christian Borntraeger, qemu-devel, qemu-block
  Cc: mlin, famz, ming.lei, stefanha, mst



On 16/12/2015 13:54, Christian Borntraeger wrote:
> Just some quick remarks before I leave for vacation:
> 
> Performance seems to be better than with the initial version.  I see
> occasional hangs when shutting down (also with your previous
> version):

Yes, I've seen them too.  To reproduce, just start some "dd" and
repeatedly stop/cont.  It typically hangs after 3-4 tries.

It's like this:

	I/O thread			main thread
	-----------------------------------------------------
					aio_context_acquire
					bdrv_drain
	bh->scheduled = 0
	call bottom half
	   aio_context_acquire
	   <hang>


The bottom half is hung in the I/O thread, and will never be processed
by the main thread because bh->scheduled = 0.
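
(Simplified sketch of the dispatch logic, not the exact aio_bh_poll
code: whichever thread clears 'scheduled' first runs the callback, so
the main thread finds nothing left to dispatch.)

	if (bh->scheduled) {
	    bh->scheduled = 0;      /* the I/O thread got here first... */
	    bh->cb(bh->opaque);     /* ...and now blocks on the AioContext */
	}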

My current idea is that *all* callers of aio_poll have to call it
without the AioContext acquired, including bdrv_drain.  It's not a
problem now that Fam has fixed almost all bdrv_drain users to use
bdrv_drained_begin/end instead.

The nastiness here is two-fold:

1) doing

	aio_context_release
	aio_poll
	aio_context_acquire

in bdrv_drain is not very safe because the mutex is recursive.  (Have I
ever mentioned I don't like recursive mutexes?...)

2) there is still a race in bdrv_drain

                    bdrv_flush_io_queue(bs);
                    if (bdrv_requests_pending(bs)) {
                        busy = true;
                        aio_context_release(aio_context);
                        aio_poll(aio_context, busy);
                        aio_context_acquire(aio_context);

if we release before aio_poll, the last request can complete before
aio_poll is invoked, and aio_poll will then block forever.  In Linux
terms, bdrv_requests_pending would be prepare_to_wait() and aio_poll
would be schedule().  But we don't have prepare_to_wait().  I have a
hackish solution here, but I don't really like it.
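
With a prepare_to_wait()-style primitive it would look something like
this (entirely hypothetical API, neither helper exists today):

	for (;;) {
	    aio_prepare_to_wait(aio_context);      /* hypothetical: register
	                                            * interest *before* the
	                                            * re-check, so a completion
	                                            * that slips in between
	                                            * still wakes the poller */
	    if (!bdrv_requests_pending(bs)) {
	        aio_finish_wait(aio_context);      /* hypothetical */
	        break;
	    }
	    aio_context_release(aio_context);
	    aio_poll(aio_context, true);           /* the "schedule()" */
	    aio_context_acquire(aio_context);
	}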

Paolo


end of thread

Thread overview: 56+ messages
2015-11-24 18:00 [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 01/40] 9pfs: allocate pdus with g_malloc/g_free Paolo Bonzini
2015-11-30  2:27   ` Fam Zheng
2015-11-30  2:33     ` Fam Zheng
2015-11-30 16:35   ` Greg Kurz
2015-11-24 18:00 ` [Qemu-devel] [PATCH 02/40] virtio: move VirtQueueElement at the beginning of the structs Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 03/40] virtio: move allocation to virtqueue_pop/vring_pop Paolo Bonzini
2015-11-30  3:00   ` Fam Zheng
2015-11-24 18:00 ` [Qemu-devel] [PATCH 04/40] virtio: introduce qemu_get/put_virtqueue_element Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 05/40] virtio: read/write the VirtQueueElement a field at a time Paolo Bonzini
2015-11-30  9:47   ` Fam Zheng
2015-11-30 10:37     ` Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 06/40] virtio: introduce virtqueue_alloc_element Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 07/40] virtio: slim down allocation of VirtQueueElements Paolo Bonzini
2015-11-30  3:24   ` Fam Zheng
2015-11-30  8:36     ` Paolo Bonzini
2015-11-24 18:00 ` [Qemu-devel] [PATCH 08/40] vring: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 09/40] vring: make vring_enable_notification return void Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 10/40] virtio: combine the read of a descriptor Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 11/40] virtio: add AioContext-specific function for host notifiers Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 12/40] virtio: export vring_notify as virtio_should_notify Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 13/40] virtio-blk: fix "disabled data plane" mode Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 14/40] virtio-blk: do not use vring in dataplane Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 15/40] virtio-scsi: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 16/40] vring: remove Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 17/40] iothread: release AioContext around aio_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 18/40] qemu-thread: introduce QemuRecMutex Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 19/40] aio: convert from RFifoLock to QemuRecMutex Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 20/40] aio: rename bh_lock to list_lock Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 21/40] qemu-thread: introduce QemuLockCnt Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 22/40] aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 23/40] qemu-thread: optimize QemuLockCnt with futexes on Linux Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 24/40] aio: tweak walking in dispatch phase Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 25/40] aio-posix: remove walking_handlers, protecting AioHandler list with list_lock Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 26/40] aio-win32: " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 27/40] aio: document locking Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 28/40] aio: push aio_context_acquire/release down to dispatching Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 29/40] quorum: use atomics for rewrite_count Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 30/40] quorum: split quorum_fifo_aio_cb from quorum_aio_cb Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 31/40] qed: introduce qed_aio_start_io and qed_aio_next_io_cb Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 32/40] block: explicitly acquire aiocontext in callbacks that need it Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 33/40] block: explicitly acquire aiocontext in bottom halves " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 34/40] block: explicitly acquire aiocontext in timers " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 35/40] block: explicitly acquire aiocontext in aio callbacks " Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 36/40] aio: update locking documentation Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 37/40] async: optimize aio_bh_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 38/40] aio-posix: partially inline aio_dispatch into aio_poll Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 39/40] async: remove unnecessary inc/dec pairs Paolo Bonzini
2015-11-24 18:01 ` [Qemu-devel] [PATCH 40/40] dma-helpers: avoid lock inversion with AioContext Paolo Bonzini
2015-11-26  9:36 ` [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6 Christian Borntraeger
2015-11-26  9:41   ` Christian Borntraeger
2015-11-26 10:39   ` Paolo Bonzini
2015-12-09 20:35     ` Paolo Bonzini
2015-12-16 12:54       ` Christian Borntraeger
2015-12-16 14:40         ` Christian Borntraeger
2015-12-16 17:42         ` Paolo Bonzini
