From: Paolo Bonzini
Date: Tue, 24 Nov 2015 19:00:51 +0100
Message-Id: <1448388091-117282-1-git-send-email-pbonzini@redhat.com>
Subject: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
To: qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: mlin@kernel.org, famz@redhat.com, ming.lei@canonical.com, stefanha@redhat.com, mst@redhat.com

This large series is basically all that I would like to get into 2.6.
It is a combination of several pieces of work on dataplane and the
multithreaded block layer. It's also a large part of why I would like
someone else to look at miscellaneous patches for a while (in case
you've missed that). I can foresee that following the reviews is going
to be a huge time drain.

With it I can get ~1300 Kiops on 8 disks (which I achieve with 2
iothreads and 5 VCPUs). The bulk of the improvement actually comes
from the first 8 patches, but the rest of the series is what prepares
for what's next to come in QEMU 2.7 and later, such as a multiqueue
block layer.

It's tedious to review, with some pretty large patches (3, 32, 33, 35).
That's how you attract reviewers, isn't it? I would like to get the
first virtio and the first block layer part in very soon after 2.6
development starts.

I've split it into four parts, the first two touching virtio mostly,
while the last two are for the block layer. Because it's large, I've
CCed people only on the cover letter. This work is available at
github.com/bonzini/qemu.git, branch dataplane.


A. "LEAN" VIRTQUEUEELEMENT
--------------------------

Patches 1 to 8 modify VirtQueueElement so that the space for
scatter/gather lists is allocated dynamically rather than being fixed
to 4K. VirtQueueElement becomes a sort of "superclass", and the
scatter/gather elements are placed in the same malloc block, which is
laid out like

    VirtQueueElement
    other fields ("subclass" fields)
    in_addr[]
    out_addr[]
    in_sg[]
    out_sg[]

This can provide a large speedup (from 1.3x to 2.3x) with many disks,
due to the 48K-sized VirtQueueElement.

All virtio devices have to be changed (patch 3). I chose to do it all
in a single patch because the changes are well isolated between each
device anyway.

The main issue here is that VirtQueueElement was haphazardly shoveled
straight into the migration stream (in host endianness). :( Patch 5
straightens this out, but at the cost of breaking backwards migration
because it now writes the VirtQueueElement in big endian, consistent
with other migration streams.

This is the least tested part of the series. I nevertheless put it
first because it's the one that is most complicated to rebase, and I
want to get rid of it as fast as possible. Reviewing the general
approach is welcome anyway.

Status: virtio-input, virtio-gpu and migration not tested at all


B. REMOVING VRING.C
-------------------

This is patches 9 to 16. It removes the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane. While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series.

Patch 10 alone mostly brings performance on par between the two. The
remaining 7-8% can be recovered by mostly getting rid of tiny
address_space_* operations, keeping the rings always mapped. Note that
the rest of this big series does bring a little performance
improvement, and already makes up for the lost performance.

This part has a dependency on patches that are not part of this series
(and do not exist yet), which make it possible to write the dirty
bitmap outside the BQL. The dirty bitmap is not yet thread-safe
because, while it is read and written with atomic operations, it may
be resized when there is a memory hotplug operation. There are plans
to fix this using RCU. Nevertheless, this doesn't block part C.

Status: ready, but depends on the missing dirty bitmap support


C. FINE-GRAINED AIO_POLL CRITICAL SECTIONS
------------------------------------------

This is patches 17 to 28. It starts pushing aio_context_acquire down
into aio_poll. This part is more or less independent from A and B, and
it ends with aio_poll calling aio_context_acquire/release around every
callback.

To do this, this part introduces a thread-safe variant of the common
"walking_xxx++/walking_xxx--" idiom already found in several places in
aio*.c and async.c.

Status: ready, except that I haven't tested quorum enough


D. FINE-GRAINED BLOCK LAYER CRITICAL SECTIONS
---------------------------------------------

This is patches 29 to 40. It explicitly acquires the AioContext in all
callbacks that need it (file descriptors, bottom halves, timers, AIO)
rather than in aio_poll. This is the first step towards breaking
AioContext into many small locks, and hence the last prerequisite for
a real multiqueue QEMU block layer.

This has the biggest patches and, unlike patch 3, they are very hard
to split further.

At the end, starting with patch 37, a few patches make some small
optimizations to aio_poll that are now possible, and the last one
makes virtio-scsi dataplane _almost_ thread-safe.

Status: ready


If you've read this far and didn't get bored, you're more than
qualified as a reviewer. :)

Paolo

Paolo Bonzini (40):
  9pfs: allocate pdus with g_malloc/g_free
  virtio: move VirtQueueElement at the beginning of the structs
  virtio: move allocation to virtqueue_pop/vring_pop
  virtio: introduce qemu_get/put_virtqueue_element
  virtio: read/write the VirtQueueElement a field at a time
  virtio: introduce virtqueue_alloc_element
  virtio: slim down allocation of VirtQueueElements
  vring: slim down allocation of VirtQueueElements
  vring: make vring_enable_notification return void
  virtio: combine the read of a descriptor
  virtio: add AioContext-specific function for host notifiers
  virtio: export vring_notify as virtio_should_notify
  virtio-blk: fix "disabled data plane" mode
  virtio-blk: do not use vring in dataplane
  virtio-scsi: do not use vring in dataplane
  vring: remove
  iothread: release AioContext around aio_poll
  qemu-thread: introduce QemuRecMutex
  aio: convert from RFifoLock to QemuRecMutex
  aio: rename bh_lock to list_lock
  qemu-thread: introduce QemuLockCnt
  aio: make ctx->list_lock a QemuLockCnt, subsuming ctx->walking_bh
  qemu-thread: optimize QemuLockCnt with futexes on Linux
  aio: tweak walking in dispatch phase
  aio-posix: remove walking_handlers, protecting AioHandler list with list_lock
  aio-win32: remove walking_handlers, protecting AioHandler list with list_lock
  aio: document locking
  aio: push aio_context_acquire/release down to dispatching
  quorum: use atomics for rewrite_count
  quorum: split quorum_fifo_aio_cb from quorum_aio_cb
  qed: introduce qed_aio_start_io and qed_aio_next_io_cb
  block: explicitly acquire aiocontext in callbacks that need it
  block: explicitly acquire aiocontext in bottom halves that need it
  block: explicitly acquire aiocontext in timers that need it
  block: explicitly acquire aiocontext in aio callbacks that need it
  aio: update locking documentation
  async: optimize aio_bh_poll
  aio-posix: partially inline aio_dispatch into aio_poll
  async: remove unnecessary inc/dec pairs
  dma-helpers: avoid lock inversion with AioContext

 aio-posix.c                                   | 108 +++---
 aio-win32.c                                   | 111 +++---
 async.c                                       |  76 ++--
 block/blkverify.c                             |   6 +-
 block/curl.c                                  |  43 ++-
 block/gluster.c                               |   2 +
 block/io.c                                    |   7 +
 block/iscsi.c                                 |  10 +
 block/linux-aio.c                             |  14 +-
 block/mirror.c                                |  12 +-
 block/nbd-client.c                            |  14 +-
 block/nfs.c                                   |  10 +
 block/qed-cluster.c                           |   2 +
 block/qed-table.c                             |  12 +-
 block/qed.c                                   | 112 ++++--
 block/qed.h                                   |   3 +
 block/quorum.c                                |  60 +--
 block/sheepdog.c                              |  29 +-
 block/ssh.c                                   |  47 ++-
 block/throttle-groups.c                       |   2 +
 block/win32-aio.c                             |   8 +-
 dma-helpers.c                                 |  27 +-
 docs/lockcnt.txt                              | 342 +++++++++++++++++
 docs/multiple-iothreads.txt                   |  95 ++++-
 hw/9pfs/virtio-9p-device.c                    |   7 +-
 hw/9pfs/virtio-9p.c                           |  25 +-
 hw/9pfs/virtio-9p.h                           |   4 +-
 hw/block/dataplane/virtio-blk.c               | 131 +------
 hw/block/dataplane/virtio-blk.h               |   1 +
 hw/block/virtio-blk.c                         |  92 ++---
 hw/char/virtio-serial-bus.c                   |  78 ++--
 hw/display/virtio-gpu.c                       |  25 +-
 hw/input/virtio-input.c                       |  24 +-
 hw/net/virtio-net.c                           |  69 ++--
 hw/scsi/scsi-bus.c                            |   2 +
 hw/scsi/scsi-disk.c                           |  18 +
 hw/scsi/scsi-generic.c                        |  20 +-
 hw/scsi/virtio-scsi-dataplane.c               | 197 ++--------
 hw/scsi/virtio-scsi.c                         |  82 ++--
 hw/virtio/Makefile.objs                       |   1 -
 hw/virtio/dataplane/Makefile.objs             |   1 -
 hw/virtio/dataplane/vring.c                   | 526 --------------------------
 hw/virtio/virtio-balloon.c                    |  22 +-
 hw/virtio/virtio-rng.c                        |  10 +-
 hw/virtio/virtio.c                            | 323 +++++++++++-----
 include/block/aio.h                           |  38 +-
 include/hw/virtio/dataplane/vring-accessors.h |  75 ----
 include/hw/virtio/dataplane/vring.h           |  51 ---
 include/hw/virtio/virtio-balloon.h            |   2 +-
 include/hw/virtio/virtio-blk.h                |   9 +-
 include/hw/virtio/virtio-net.h                |   2 +-
 include/hw/virtio/virtio-scsi.h               |  36 +-
 include/hw/virtio/virtio-serial.h             |   2 +-
 include/hw/virtio/virtio.h                    |  16 +-
 include/qemu/futex.h                          |  36 ++
 include/qemu/rfifolock.h                      |  54 ---
 include/qemu/thread-posix.h                   |   6 +
 include/qemu/thread-win32.h                   |  10 +
 include/qemu/thread.h                         |  23 ++
 iothread.c                                    |  11 +-
 nbd.c                                         |   4 +
 tests/.gitignore                              |   1 -
 tests/Makefile                                |   2 -
 tests/test-aio.c                              |  19 +-
 tests/test-rfifolock.c                        |  91 ----
 thread-pool.c                                 |  14 +-
 trace-events                                  |  13 +-
 util/Makefile.objs                            |   2 +-
 util/lockcnt.c                                | 404 ++++++++++++++++++++
 util/qemu-coroutine-sleep.c                   |   5 +
 util/qemu-thread-posix.c                      |  38 +-
 util/qemu-thread-win32.c                      |  25 ++
 util/rfifolock.c                              |  78 ----
 73 files changed, 2000 insertions(+), 1877 deletions(-)
 create mode 100644 docs/lockcnt.txt
 delete mode 100644 hw/virtio/dataplane/Makefile.objs
 delete mode 100644 hw/virtio/dataplane/vring.c
 delete mode 100644 include/hw/virtio/dataplane/vring-accessors.h
 delete mode 100644 include/hw/virtio/dataplane/vring.h
 create mode 100644 include/qemu/futex.h
 delete mode 100644 include/qemu/rfifolock.h
 delete mode 100644 tests/test-rfifolock.c
 create mode 100644 util/lockcnt.c
 delete mode 100644 util/rfifolock.c

--
1.8.3.1
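
For reference, the single-block layout described in part A can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the code from the series: Sg, Elem, Req, align_up and
elem_alloc are hypothetical names, and the real VirtQueueElement has
different fields. The point is that a device-specific "subclass" and
the four arrays share one allocation sized for the request at hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scatter/gather entry (stand-in for struct iovec). */
typedef struct Sg {
    void *base;
    size_t len;
} Sg;

/* Hypothetical element header; the real VirtQueueElement differs. */
typedef struct Elem {
    unsigned in_num, out_num;
    uint64_t *in_addr, *out_addr;
    Sg *in_sg, *out_sg;
} Elem;

/* Hypothetical device request: the header must be the first member. */
typedef struct Req {
    Elem elem;
    int status;
} Req;

/* Round n up to the next multiple of a. */
static size_t align_up(size_t n, size_t a)
{
    return (n + a - 1) / a * a;
}

/*
 * Allocate one block holding a "subclass" of sz bytes followed by
 * in_addr[], out_addr[], in_sg[] and out_sg[], each sized for this
 * request instead of for the maximum queue size.
 */
static void *elem_alloc(size_t sz, unsigned out_num, unsigned in_num)
{
    size_t in_addr_ofs, out_addr_ofs, in_sg_ofs, out_sg_ofs, total;
    char *p;
    Elem *e;

    assert(sz >= sizeof(Elem));
    in_addr_ofs = align_up(sz, _Alignof(uint64_t));
    out_addr_ofs = in_addr_ofs + in_num * sizeof(uint64_t);
    in_sg_ofs = align_up(out_addr_ofs + out_num * sizeof(uint64_t),
                         _Alignof(Sg));
    out_sg_ofs = in_sg_ofs + in_num * sizeof(Sg);
    total = out_sg_ofs + out_num * sizeof(Sg);

    p = calloc(1, total);
    if (p == NULL) {
        return NULL;
    }

    /* The header sits at offset 0; the arrays point into the tail. */
    e = (Elem *)p;
    e->in_num = in_num;
    e->out_num = out_num;
    e->in_addr = (void *)(p + in_addr_ofs);
    e->out_addr = (void *)(p + out_addr_ofs);
    e->in_sg = (void *)(p + in_sg_ofs);
    e->out_sg = (void *)(p + out_sg_ofs);
    return e;
}
```

A caller would write something like
Req *r = elem_alloc(sizeof(Req), out_num, in_num); and release the
request together with all four arrays with a single free(r).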