* [PATCH v5 00/12] blkio: add libblkio BlockDriver
@ 2022-09-27 19:34 Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 01/12] coroutine: add flag to re-queue at front of CoQueue Stefan Hajnoczi
                   ` (12 more replies)
  0 siblings, 13 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

v5:
- Drop "RFC" since libblkio 1.0 has been released and the library API is stable
- Disable BDRV_REQ_REGISTERED_BUF if we run out of blkio_mem_regions. The
  bounce buffer slow path is taken when there are not enough blkio_mem_regions
  to cover guest RAM. [Hanna & David Hildenbrand]
- Call ram_block_discard_disable() when mem-region-pinned property is true or
  absent [David Hildenbrand]
- Use a bounce buffer pool instead of allocating/freeing a buffer for each
  request. This reduces the number of blkio_mem_regions required for bounce
  buffers to 1 and avoids frequent blkio_mem_region_map/unmap() calls.
- Switch to .bdrv_co_*() instead of .bdrv_aio_*(). Needed for the bounce buffer
  pool's CoQueue.
v4:
- Patch 1:
  - Add virtio-blk-vhost-user driver [Kevin]
  - Drop .bdrv_parse_filename() and .bdrv_needs_filename for virtio-blk-vhost-vdpa [Stefano]
  - Add copyright and license header [Hanna]
  - Drop .bdrv_parse_filename() in favor of --blockdev or json: [Hanna]
  - Clarify that "filename" is always non-NULL for io_uring [Hanna]
  - Check that virtio-blk-vhost-vdpa "path" option is non-NULL [Hanna]
  - Fix virtio-blk-vhost-vdpa cache.direct=off logic [Hanna]
  - Use macros for driver names [Hanna]
  - Assert that the driver name is valid [Hanna]
  - Update "readonly" property name to "read-only" [Hanna]
  - Call blkio_detach_aio_context() in blkio_close() [Hanna]
  - Avoid uint32_t * to int * casts in blkio_refresh_limits() [Hanna]
  - Remove write zeroes and discard from the todo list [Hanna]
  - Use PRIu32 instead of %d for uint32_t [Hanna]
  - Fix error messages with buf-alignment instead of optimal-io-size [Hanna]
  - Call map/unmap APIs since libblkio alloc/free APIs no longer do that
  - Update QAPI schema QEMU version to 7.2
- Patch 5:
  - Expand the BDRV_REQ_REGISTERED_BUF flag passthrough and drop assert(!flags)
    in drivers [Hanna]
- Patch 7:
  - Fix BLK->BDRV typo [Hanna]
  - Make BlockRAMRegistrar handle failure [Hanna]
- Patch 8:
  - Replace memory_region_get_fd() approach with qemu_ram_get_fd()
- Patch 10:
  - Use (void)ret; to discard unused return value [Hanna]
  - libblkio's blkio_unmap_mem_region() API no longer has a return value
  - Check for registered bufs that cross RAMBlocks [Hanna]
- Patch 11:
  - Handle bdrv_register_buf() errors [Hanna]
v3:
- Add virtio-blk-vhost-vdpa for vdpa-blk devices including VDUSE
- Add discard and write zeroes support
- Rebase and adopt latest libblkio APIs
v2:
- Add BDRV_REQ_REGISTERED_BUF to bs.supported_write_flags [Stefano]
- Use new blkioq_get_num_completions() API
- Implement .bdrv_refresh_limits()

This patch series adds a QEMU BlockDriver for libblkio
(https://gitlab.com/libblkio/libblkio/), a library for high-performance block
device I/O. This work was presented at KVM Forum 2022 and slides are available
here:
https://static.sched.com/hosted_files/kvmforum2022/8c/libblkio-kvm-forum-2022.pdf

The second patch adds the core BlockDriver and most of the libblkio API usage.
Three libblkio drivers are included:
- io_uring
- virtio-blk-vhost-user
- virtio-blk-vhost-vdpa

The remainder of the patch series reworks the existing QEMU bdrv_register_buf()
API so that virtio-blk emulation can efficiently map guest RAM for libblkio -
some libblkio drivers require that I/O buffer memory be pre-registered (think
VFIO, vhost, etc).

Vladimir requested performance results that show the effect of the
BDRV_REQ_REGISTERED_BUF flag. I ran the patches against qemu-storage-daemon's
vhost-user-blk export with iodepth=1 bs=512 to see the per-request overhead due
to bounce buffer allocation/mapping:

Name                                   IOPS   Error
bounce-buf                          4373.81 ± 0.01%
registered-buf                     13062.80 ± 0.67%

The BDRV_REQ_REGISTERED_BUF optimization version is about 3x faster.
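
For reference, a vhost-user-blk export of the kind used for such a test can be
set up with qemu-storage-daemon roughly as follows (an illustrative sketch, not
the exact command line used for the numbers above; test.img and the socket path
are placeholders):

  qemu-storage-daemon \
    --blockdev file,filename=test.img,node-name=file0 \
    --export vhost-user-blk,id=vub0,node-name=file0,writable=on,addr.type=unix,addr.path=/tmp/vhost-user-blk.sock

The benchmark then connects to that UNIX domain socket via the
virtio-blk-vhost-user driver added in this series.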

See the BlockDriver struct in block/blkio.c for a list of APIs that still need
to be implemented. The core functionality is covered.

Regarding the design: each libblkio driver is a separately named BlockDriver.
That means there is an "io_uring" BlockDriver and not a generic "libblkio"
BlockDriver. This way the QAPI and open parameters are type-safe per driver,
and QEMU can check mandatory parameters.
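
For example, with the schema added in patch 2 an io_uring node can be created
via QMP along these lines (illustrative only; node-name and filename are
placeholders):

  { "execute": "blockdev-add",
    "arguments": {
      "driver": "io_uring",
      "node-name": "drive0",
      "filename": "test.img"
    } }

A missing mandatory option ("filename" here, "path" for the
virtio-blk-vhost-* drivers) is rejected when the options are parsed, before
the driver's open function runs.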

Stefan Hajnoczi (12):
  coroutine: add flag to re-queue at front of CoQueue
  blkio: add libblkio block driver
  numa: call ->ram_block_removed() in ram_block_notifier_remove()
  block: pass size to bdrv_unregister_buf()
  block: use BdrvRequestFlags type for supported flag fields
  block: add BDRV_REQ_REGISTERED_BUF request flag
  block: return errors from bdrv_register_buf()
  block: add BlockRAMRegistrar
  exec/cpu-common: add qemu_ram_get_fd()
  stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd()
  blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint

 MAINTAINERS                                 |    7 +
 meson_options.txt                           |    2 +
 qapi/block-core.json                        |   53 +-
 meson.build                                 |    9 +
 include/block/block-common.h                |    9 +
 include/block/block-global-state.h          |   10 +-
 include/block/block_int-common.h            |   15 +-
 include/exec/cpu-common.h                   |    1 +
 include/hw/virtio/virtio-blk.h              |    2 +
 include/qemu/coroutine.h                    |   15 +-
 include/sysemu/block-backend-global-state.h |    4 +-
 include/sysemu/block-ram-registrar.h        |   37 +
 block.c                                     |   14 +
 block/blkio.c                               | 1017 +++++++++++++++++++
 block/blkverify.c                           |    4 +-
 block/block-backend.c                       |    8 +-
 block/block-ram-registrar.c                 |   54 +
 block/crypto.c                              |    4 +-
 block/file-posix.c                          |    1 -
 block/gluster.c                             |    1 -
 block/io.c                                  |  101 +-
 block/mirror.c                              |    2 +
 block/nbd.c                                 |    1 -
 block/nvme.c                                |   20 +-
 block/parallels.c                           |    1 -
 block/qcow.c                                |    2 -
 block/qed.c                                 |    1 -
 block/raw-format.c                          |    2 +
 block/replication.c                         |    1 -
 block/ssh.c                                 |    1 -
 block/vhdx.c                                |    1 -
 hw/block/virtio-blk.c                       |   39 +-
 hw/core/numa.c                              |   17 +
 qemu-img.c                                  |    6 +-
 softmmu/physmem.c                           |    5 +
 stubs/physmem.c                             |   13 +
 tests/qtest/modules-test.c                  |    3 +
 util/qemu-coroutine-lock.c                  |    9 +-
 util/vfio-helpers.c                         |    5 +-
 block/meson.build                           |    2 +
 scripts/meson-buildoptions.sh               |    3 +
 stubs/meson.build                           |    1 +
 42 files changed, 1412 insertions(+), 91 deletions(-)
 create mode 100644 include/sysemu/block-ram-registrar.h
 create mode 100644 block/blkio.c
 create mode 100644 block/block-ram-registrar.c
 create mode 100644 stubs/physmem.c

-- 
2.37.3




* [PATCH v5 01/12] coroutine: add flag to re-queue at front of CoQueue
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 02/12] blkio: add libblkio block driver Stefan Hajnoczi
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

When a coroutine wakes up it may determine that it must re-queue.
Normally coroutines are pushed onto the back of the CoQueue, but for
fairness a re-queued coroutine may need to go onto the front instead.

Add a flag to specify that the coroutine should be pushed onto the front
of the CoQueue. A later patch will use this to ensure fairness in the
bounce buffer CoQueue used by the blkio BlockDriver.
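
The intended usage pattern looks roughly like this (a sketch with a
hypothetical wait condition and state fields, not code from this patch):

  CoQueueWaitFlags wait_flags = 0;

  while (!resource_available(s)) {
      qemu_co_queue_wait_flags(&s->queue, &s->lock, wait_flags);
      /* Keep our place in line on subsequent waits */
      wait_flags = CO_QUEUE_WAIT_FRONT;
  }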

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/coroutine.h   | 15 +++++++++++++--
 util/qemu-coroutine-lock.c |  9 +++++++--
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
index 08c5bb3c76..3f418a67f6 100644
--- a/include/qemu/coroutine.h
+++ b/include/qemu/coroutine.h
@@ -198,14 +198,25 @@ typedef struct CoQueue {
  */
 void qemu_co_queue_init(CoQueue *queue);
 
+typedef enum {
+    /*
+     * Enqueue at front instead of back. Use this to re-queue a request when
+     * its wait condition is not satisfied after being woken up.
+     */
+    CO_QUEUE_WAIT_FRONT = 0x1,
+} CoQueueWaitFlags;
+
 /**
  * Adds the current coroutine to the CoQueue and transfers control to the
  * caller of the coroutine.  The mutex is unlocked during the wait and
  * locked again afterwards.
  */
 #define qemu_co_queue_wait(queue, lock) \
-    qemu_co_queue_wait_impl(queue, QEMU_MAKE_LOCKABLE(lock))
-void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock);
+    qemu_co_queue_wait_impl(queue, QEMU_MAKE_LOCKABLE(lock), 0)
+#define qemu_co_queue_wait_flags(queue, lock, flags) \
+    qemu_co_queue_wait_impl(queue, QEMU_MAKE_LOCKABLE(lock), (flags))
+void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock,
+                                          CoQueueWaitFlags flags);
 
 /**
  * Removes the next coroutine from the CoQueue, and queue it to run after
diff --git a/util/qemu-coroutine-lock.c b/util/qemu-coroutine-lock.c
index 9ad24ab1af..0516bd2ff3 100644
--- a/util/qemu-coroutine-lock.c
+++ b/util/qemu-coroutine-lock.c
@@ -39,10 +39,15 @@ void qemu_co_queue_init(CoQueue *queue)
     QSIMPLEQ_INIT(&queue->entries);
 }
 
-void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock)
+void coroutine_fn qemu_co_queue_wait_impl(CoQueue *queue, QemuLockable *lock,
+                                          CoQueueWaitFlags flags)
 {
     Coroutine *self = qemu_coroutine_self();
-    QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
+    if (flags & CO_QUEUE_WAIT_FRONT) {
+        QSIMPLEQ_INSERT_HEAD(&queue->entries, self, co_queue_next);
+    } else {
+        QSIMPLEQ_INSERT_TAIL(&queue->entries, self, co_queue_next);
+    }
 
     if (lock) {
         qemu_lockable_unlock(lock);
-- 
2.37.3




* [PATCH v5 02/12] blkio: add libblkio block driver
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 01/12] coroutine: add flag to re-queue at front of CoQueue Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-28  5:27   ` Markus Armbruster
  2022-10-06 16:41   ` Alberto Faria
  2022-09-27 19:34 ` [PATCH v5 03/12] numa: call ->ram_block_removed() in ram_block_notifier_remove() Stefan Hajnoczi
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
high-performance disk I/O. It currently supports io_uring,
virtio-blk-vhost-user, and virtio-blk-vhost-vdpa with additional drivers
under development.

One of the reasons for developing libblkio is that other applications
besides QEMU can use it. This will be particularly useful for
virtio-blk-vhost-user, which other applications may wish to use to
connect to qemu-storage-daemon.

libblkio also gives us an opportunity to develop in Rust behind a C API
that is easy to consume from QEMU.

This commit adds io_uring, virtio-blk-vhost-user, and
virtio-blk-vhost-vdpa BlockDrivers to QEMU using libblkio. It will be
easy to add other libblkio drivers since they will share the majority of
code.

For now I/O buffers are copied through bounce buffers if the libblkio
driver requires it. Later commits add an optimization for
pre-registering guest RAM to avoid bounce buffers.

The syntax is:

  --blockdev io_uring,node-name=drive0,filename=test.img,read-only=on|off,cache.direct=on|off

and:

  --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,read-only=on|off
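
The virtio-blk-vhost-user driver takes a "path" option pointing at the
backend's vhost-user UNIX domain socket. Both vhost drivers currently
require cache.direct=on, so an invocation would look roughly like this
(socket path is a placeholder):

  --blockdev virtio-blk-vhost-user,node-name=drive0,path=/tmp/vhost-user-blk.sock,cache.direct=on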

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
 MAINTAINERS                   |   6 +
 meson_options.txt             |   2 +
 qapi/block-core.json          |  53 ++-
 meson.build                   |   9 +
 block/blkio.c                 | 849 ++++++++++++++++++++++++++++++++++
 tests/qtest/modules-test.c    |   3 +
 block/meson.build             |   1 +
 scripts/meson-buildoptions.sh |   3 +
 8 files changed, 924 insertions(+), 2 deletions(-)
 create mode 100644 block/blkio.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 789172b2a8..878005f65b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3403,6 +3403,12 @@ L: qemu-block@nongnu.org
 S: Maintained
 F: block/vdi.c
 
+blkio
+M: Stefan Hajnoczi <stefanha@redhat.com>
+L: qemu-block@nongnu.org
+S: Maintained
+F: block/blkio.c
+
 iSCSI
 M: Ronnie Sahlberg <ronniesahlberg@gmail.com>
 M: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/meson_options.txt b/meson_options.txt
index 79c6af18d5..66128178bf 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -117,6 +117,8 @@ option('bzip2', type : 'feature', value : 'auto',
        description: 'bzip2 support for DMG images')
 option('cap_ng', type : 'feature', value : 'auto',
        description: 'cap_ng support')
+option('blkio', type : 'feature', value : 'auto',
+       description: 'libblkio block device driver')
 option('bpf', type : 'feature', value : 'auto',
         description: 'eBPF support')
 option('cocoa', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index f21fa235f2..5aed0dd436 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2951,11 +2951,16 @@
             'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
             {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
             {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
-            'http', 'https', 'iscsi',
+            'http', 'https',
+            { 'name': 'io_uring', 'if': 'CONFIG_BLKIO' },
+            'iscsi',
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+            'ssh', 'throttle', 'vdi', 'vhdx',
+            { 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
+            { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
+            'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile:
@@ -3678,6 +3683,42 @@
             '*debug': 'int',
             '*logfile': 'str' } }
 
+##
+# @BlockdevOptionsIoUring:
+#
+# Driver specific block device options for the io_uring backend.
+#
+# @filename: path to the image file
+#
+# Since: 7.2
+##
+{ 'struct': 'BlockdevOptionsIoUring',
+  'data': { 'filename': 'str' } }
+
+##
+# @BlockdevOptionsVirtioBlkVhostUser:
+#
+# Driver specific block device options for the virtio-blk-vhost-user backend.
+#
+# @path: path to the vhost-user UNIX domain socket.
+#
+# Since: 7.2
+##
+{ 'struct': 'BlockdevOptionsVirtioBlkVhostUser',
+  'data': { 'path': 'str' } }
+
+##
+# @BlockdevOptionsVirtioBlkVhostVdpa:
+#
+# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
+#
+# @path: path to the vhost-vdpa character device.
+#
+# Since: 7.2
+##
+{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
+  'data': { 'path': 'str' } }
+
 ##
 # @IscsiTransport:
 #
@@ -4305,6 +4346,8 @@
                        'if': 'HAVE_HOST_BLOCK_DEVICE' },
       'http':       'BlockdevOptionsCurlHttp',
       'https':      'BlockdevOptionsCurlHttps',
+      'io_uring':   { 'type': 'BlockdevOptionsIoUring',
+                      'if': 'CONFIG_BLKIO' },
       'iscsi':      'BlockdevOptionsIscsi',
       'luks':       'BlockdevOptionsLUKS',
       'nbd':        'BlockdevOptionsNbd',
@@ -4327,6 +4370,12 @@
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
       'vhdx':       'BlockdevOptionsGenericFormat',
+      'virtio-blk-vhost-user':
+                    { 'type': 'BlockdevOptionsVirtioBlkVhostUser',
+                      'if': 'CONFIG_BLKIO' },
+      'virtio-blk-vhost-vdpa':
+                    { 'type': 'BlockdevOptionsVirtioBlkVhostVdpa',
+                      'if': 'CONFIG_BLKIO' },
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
       'vvfat':      'BlockdevOptionsVVFAT'
diff --git a/meson.build b/meson.build
index 8dc661363f..7c656e9c2f 100644
--- a/meson.build
+++ b/meson.build
@@ -738,6 +738,13 @@ if not get_option('virglrenderer').auto() or have_system or have_vhost_user_gpu
                      required: get_option('virglrenderer'),
                      kwargs: static_kwargs)
 endif
+blkio = not_found
+if not get_option('blkio').auto() or have_block
+  blkio = dependency('blkio',
+                     method: 'pkg-config',
+                     required: get_option('blkio'),
+                     kwargs: static_kwargs)
+endif
 curl = not_found
 if not get_option('curl').auto() or have_block
   curl = dependency('libcurl', version: '>=7.29.0',
@@ -1789,6 +1796,7 @@ config_host_data.set('CONFIG_LIBUDEV', libudev.found())
 config_host_data.set('CONFIG_LZO', lzo.found())
 config_host_data.set('CONFIG_MPATH', mpathpersist.found())
 config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
+config_host_data.set('CONFIG_BLKIO', blkio.found())
 config_host_data.set('CONFIG_CURL', curl.found())
 config_host_data.set('CONFIG_CURSES', curses.found())
 config_host_data.set('CONFIG_GBM', gbm.found())
@@ -3841,6 +3849,7 @@ summary_info += {'PAM':               pam}
 summary_info += {'iconv support':     iconv}
 summary_info += {'curses support':    curses}
 summary_info += {'virgl support':     virgl}
+summary_info += {'blkio support':     blkio}
 summary_info += {'curl support':      curl}
 summary_info += {'Multipath support': mpathpersist}
 summary_info += {'PNG support':       png}
diff --git a/block/blkio.c b/block/blkio.c
new file mode 100644
index 0000000000..9244a653ef
--- /dev/null
+++ b/block/blkio.c
@@ -0,0 +1,849 @@
+/* SPDX-License-Identifier: LGPL-2.1-or-later */
+/*
+ * libblkio BlockDriver
+ *
+ * Copyright Red Hat, Inc.
+ *
+ * Author:
+ *   Stefan Hajnoczi <stefanha@redhat.com>
+ */
+
+#include "qemu/osdep.h"
+#include <blkio.h>
+#include "block/block_int.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qemu/module.h"
+
+/*
+ * Keep the QEMU BlockDriver names identical to the libblkio driver names.
+ * Using macros instead of typing out the string literals avoids typos.
+ */
+#define DRIVER_IO_URING "io_uring"
+#define DRIVER_VIRTIO_BLK_VHOST_USER "virtio-blk-vhost-user"
+#define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"
+
+/*
+ * Allocated bounce buffers are kept in a list sorted by buffer address.
+ */
+typedef struct BlkioBounceBuf {
+    QLIST_ENTRY(BlkioBounceBuf) next;
+
+    /* The bounce buffer */
+    struct iovec buf;
+} BlkioBounceBuf;
+
+typedef struct {
+    /*
+     * libblkio is not thread-safe so this lock protects ->blkio and
+     * ->blkioq.
+     */
+    QemuMutex blkio_lock;
+    struct blkio *blkio;
+    struct blkioq *blkioq; /* make this multi-queue in the future... */
+    int completion_fd;
+
+    /*
+     * Polling fetches the next completion into this field.
+     *
+     * No lock is necessary since only one thread calls aio_poll() and invokes
+     * fd and poll handlers.
+     */
+    struct blkio_completion poll_completion;
+
+    /*
+     * Protects ->bounce_pool, ->bounce_bufs, ->bounce_available.
+     *
+     * Lock ordering: ->bounce_lock before ->blkio_lock.
+     */
+    CoMutex bounce_lock;
+
+    /* Bounce buffer pool */
+    struct blkio_mem_region bounce_pool;
+
+    /* Sorted list of allocated bounce buffers */
+    QLIST_HEAD(, BlkioBounceBuf) bounce_bufs;
+
+    /* Queue for coroutines waiting for bounce buffer space */
+    CoQueue bounce_available;
+
+    /* The value of the "mem-region-alignment" property */
+    size_t mem_region_alignment;
+
+    /* Do all I/O buffers have to be registered as blkio_mem_regions? */
+    bool needs_mem_regions;
+} BDRVBlkioState;
+
+/* Called with s->bounce_lock held */
+static int blkio_resize_bounce_pool(BDRVBlkioState *s, int64_t bytes)
+{
+    /* There can be no allocated bounce buffers during resize */
+    assert(QLIST_EMPTY(&s->bounce_bufs));
+
+    /* Pad size to reduce frequency of resize calls */
+    bytes += 128 * 1024;
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        int ret;
+
+        if (s->bounce_pool.addr) {
+            blkio_unmap_mem_region(s->blkio, &s->bounce_pool);
+            blkio_free_mem_region(s->blkio, &s->bounce_pool);
+            memset(&s->bounce_pool, 0, sizeof(s->bounce_pool));
+        }
+
+        /* Automatically freed when s->blkio is destroyed */
+        ret = blkio_alloc_mem_region(s->blkio, &s->bounce_pool, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+
+        ret = blkio_map_mem_region(s->blkio, &s->bounce_pool);
+        if (ret < 0) {
+            blkio_free_mem_region(s->blkio, &s->bounce_pool);
+            memset(&s->bounce_pool, 0, sizeof(s->bounce_pool));
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+/* Called with s->bounce_lock held */
+static bool
+blkio_do_alloc_bounce_buffer(BDRVBlkioState *s, BlkioBounceBuf *bounce,
+                             int64_t bytes)
+{
+    void *addr = s->bounce_pool.addr;
+    BlkioBounceBuf *cur = NULL;
+    BlkioBounceBuf *prev = NULL;
+    ptrdiff_t space;
+
+    /*
+     * This is just a linear search over the holes between requests. An
+     * efficient allocator would be nice.
+     */
+    QLIST_FOREACH(cur, &s->bounce_bufs, next) {
+        space = cur->buf.iov_base - addr;
+        if (bytes <= space) {
+            QLIST_INSERT_BEFORE(cur, bounce, next);
+            bounce->buf.iov_base = addr;
+            bounce->buf.iov_len = bytes;
+            return true;
+        }
+
+        addr = cur->buf.iov_base + cur->buf.iov_len;
+        prev = cur;
+    }
+
+    /* Is there space after the last request? */
+    space = s->bounce_pool.addr + s->bounce_pool.len - addr;
+    if (bytes > space) {
+        return false;
+    }
+    if (prev) {
+        QLIST_INSERT_AFTER(prev, bounce, next);
+    } else {
+        QLIST_INSERT_HEAD(&s->bounce_bufs, bounce, next);
+    }
+    bounce->buf.iov_base = addr;
+    bounce->buf.iov_len = bytes;
+    return true;
+}
+
+static int coroutine_fn
+blkio_alloc_bounce_buffer(BDRVBlkioState *s, BlkioBounceBuf *bounce,
+                          int64_t bytes)
+{
+    /*
+     * Ensure fairness: first time around we join the back of the queue,
+     * subsequently we join the front so we don't lose our place.
+     */
+    CoQueueWaitFlags wait_flags = 0;
+
+    QEMU_LOCK_GUARD(&s->bounce_lock);
+
+    /* Ensure fairness: don't even try if other requests are already waiting */
+    if (!qemu_co_queue_empty(&s->bounce_available)) {
+        qemu_co_queue_wait_flags(&s->bounce_available, &s->bounce_lock,
+                                 wait_flags);
+        wait_flags = CO_QUEUE_WAIT_FRONT;
+    }
+
+    while (true) {
+        if (blkio_do_alloc_bounce_buffer(s, bounce, bytes)) {
+            /* Kick the next queued request since there may be space */
+            qemu_co_queue_next(&s->bounce_available);
+            return 0;
+        }
+
+        /*
+         * If there are no in-flight requests then the pool was simply too
+         * small.
+         */
+        if (QLIST_EMPTY(&s->bounce_bufs)) {
+            bool ok;
+            int ret;
+
+            ret = blkio_resize_bounce_pool(s, bytes);
+            if (ret < 0) {
+                /* Kick the next queued request since that may fail too */
+                qemu_co_queue_next(&s->bounce_available);
+                return ret;
+            }
+
+            ok = blkio_do_alloc_bounce_buffer(s, bounce, bytes);
+            assert(ok); /* must have space this time */
+            return 0;
+        }
+
+        qemu_co_queue_wait_flags(&s->bounce_available, &s->bounce_lock,
+                                 wait_flags);
+        wait_flags = CO_QUEUE_WAIT_FRONT;
+    }
+}
+
+static void coroutine_fn blkio_free_bounce_buffer(BDRVBlkioState *s,
+                                                  BlkioBounceBuf *bounce)
+{
+    QEMU_LOCK_GUARD(&s->bounce_lock);
+
+    QLIST_REMOVE(bounce, next);
+
+    /* Wake up waiting coroutines since space may now be available */
+    qemu_co_queue_next(&s->bounce_available);
+}
+
+/* For async to .bdrv_co_*() conversion */
+typedef struct {
+    Coroutine *coroutine;
+    int ret;
+} BlkioCoData;
+
+static void blkio_completion_fd_read(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVBlkioState *s = bs->opaque;
+    uint64_t val;
+    int ret;
+
+    /* Polling may have already fetched a completion */
+    if (s->poll_completion.user_data != NULL) {
+        BlkioCoData *cod = s->poll_completion.user_data;
+        cod->ret = s->poll_completion.ret;
+
+        /* Clear it in case aio_co_wake() enters a nested event loop */
+        s->poll_completion.user_data = NULL;
+
+        aio_co_wake(cod->coroutine);
+    }
+
+    /* Reset completion fd status */
+    ret = read(s->completion_fd, &val, sizeof(val));
+
+    /* Ignore errors, there's nothing we can do */
+    (void)ret;
+
+    /*
+     * Reading one completion at a time makes nested event loop re-entrancy
+     * simple. Change this loop to get multiple completions in one go if it
+     * becomes a performance bottleneck.
+     */
+    while (true) {
+        struct blkio_completion completion;
+
+        WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+            ret = blkioq_do_io(s->blkioq, &completion, 0, 1, NULL);
+        }
+        if (ret != 1) {
+            break;
+        }
+
+        BlkioCoData *cod = completion.user_data;
+        cod->ret = completion.ret;
+        aio_co_wake(cod->coroutine);
+    }
+}
+
+static bool blkio_completion_fd_poll(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    /* Just in case we already fetched a completion */
+    if (s->poll_completion.user_data != NULL) {
+        return true;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        ret = blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL);
+    }
+    return ret == 1;
+}
+
+static void blkio_completion_fd_poll_ready(void *opaque)
+{
+    blkio_completion_fd_read(opaque);
+}
+
+static void blkio_attach_aio_context(BlockDriverState *bs,
+                                     AioContext *new_context)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    aio_set_fd_handler(new_context,
+                       s->completion_fd,
+                       false,
+                       blkio_completion_fd_read,
+                       NULL,
+                       blkio_completion_fd_poll,
+                       blkio_completion_fd_poll_ready,
+                       bs);
+}
+
+static void blkio_detach_aio_context(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    aio_set_fd_handler(bdrv_get_aio_context(bs),
+                       s->completion_fd,
+                       false, NULL, NULL, NULL, NULL, NULL);
+}
+
+/* Call with s->blkio_lock held to submit I/O after enqueuing a new request */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+    if (qatomic_read(&bs->io_plugged) == 0) {
+        BDRVBlkioState *s = bs->opaque;
+
+        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
+    }
+}
+
+static int coroutine_fn
+blkio_co_pdiscard(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioCoData cod = {
+        .coroutine = qemu_coroutine_self(),
+    };
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkioq_discard(s->blkioq, offset, bytes, &cod, 0);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+    return cod.ret;
+}
+
+static int coroutine_fn
+blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
+                QEMUIOVector *qiov, BdrvRequestFlags flags)
+{
+    BlkioCoData cod = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    BDRVBlkioState *s = bs->opaque;
+    bool use_bounce_buffer = s->needs_mem_regions;
+    BlkioBounceBuf bounce;
+    struct iovec *iov = qiov->iov;
+    int iovcnt = qiov->niov;
+
+    if (use_bounce_buffer) {
+        int ret = blkio_alloc_bounce_buffer(s, &bounce, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+
+        iov = &bounce.buf;
+        iovcnt = 1;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkioq_readv(s->blkioq, offset, iov, iovcnt, &cod, 0);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+
+    if (use_bounce_buffer) {
+        if (cod.ret == 0) {
+            qemu_iovec_from_buf(qiov, 0,
+                                bounce.buf.iov_base,
+                                bounce.buf.iov_len);
+        }
+
+        blkio_free_bounce_buffer(s, &bounce);
+    }
+
+    return cod.ret;
+}
+
+static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
+        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags)
+{
+    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
+    BlkioCoData cod = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    BDRVBlkioState *s = bs->opaque;
+    bool use_bounce_buffer = s->needs_mem_regions;
+    BlkioBounceBuf bounce;
+    struct iovec *iov = qiov->iov;
+    int iovcnt = qiov->niov;
+
+    if (use_bounce_buffer) {
+        int ret = blkio_alloc_bounce_buffer(s, &bounce, bytes);
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_iovec_to_buf(qiov, 0, bounce.buf.iov_base, bytes);
+        iov = &bounce.buf;
+        iovcnt = 1;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkioq_writev(s->blkioq, offset, iov, iovcnt, &cod, blkio_flags);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+
+    if (use_bounce_buffer) {
+        blkio_free_bounce_buffer(s, &bounce);
+    }
+
+    return cod.ret;
+}
+
+static int coroutine_fn blkio_co_flush(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioCoData cod = {
+        .coroutine = qemu_coroutine_self(),
+    };
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkioq_flush(s->blkioq, &cod, 0);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+    return cod.ret;
+}
+
+static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
+    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioCoData cod = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    uint32_t blkio_flags = 0;
+
+    if (flags & BDRV_REQ_FUA) {
+        blkio_flags |= BLKIO_REQ_FUA;
+    }
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        blkio_flags |= BLKIO_REQ_NO_UNMAP;
+    }
+    if (flags & BDRV_REQ_NO_FALLBACK) {
+        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkioq_write_zeroes(s->blkioq, offset, bytes, &cod, blkio_flags);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+    return cod.ret;
+}
+
+static void blkio_io_unplug(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkio_submit_io(bs);
+    }
+}
+
+static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
+                               Error **errp)
+{
+    const char *filename = qdict_get_str(options, "filename");
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    ret = blkio_set_str(s->blkio, "path", filename);
+    qdict_del(options, "filename");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to set path: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (flags & BDRV_O_NOCACHE) {
+        ret = blkio_set_bool(s->blkio, "direct", true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set direct: %s",
+                             blkio_get_error_msg());
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static int blkio_virtio_blk_vhost_user_open(BlockDriverState *bs,
+        QDict *options, int flags, Error **errp)
+{
+    const char *path = qdict_get_try_str(options, "path");
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    if (!path) {
+        error_setg(errp, "missing 'path' option");
+        return -EINVAL;
+    }
+
+    ret = blkio_set_str(s->blkio, "path", path);
+    qdict_del(options, "path");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to set path: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (!(flags & BDRV_O_NOCACHE)) {
+        error_setg(errp, "cache.direct=off is not supported");
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
+        QDict *options, int flags, Error **errp)
+{
+    const char *path = qdict_get_try_str(options, "path");
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    if (!path) {
+        error_setg(errp, "missing 'path' option");
+        return -EINVAL;
+    }
+
+    ret = blkio_set_str(s->blkio, "path", path);
+    qdict_del(options, "path");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to set path: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (!(flags & BDRV_O_NOCACHE)) {
+        error_setg(errp, "cache.direct=off is not supported");
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
+                           Error **errp)
+{
+    const char *blkio_driver = bs->drv->protocol_name;
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    ret = blkio_create(blkio_driver, &s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_create failed: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
+        ret = blkio_io_uring_open(bs, options, flags, errp);
+    } else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_USER) == 0) {
+        ret = blkio_virtio_blk_vhost_user_open(bs, options, flags, errp);
+    } else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
+        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
+    } else {
+        g_assert_not_reached();
+    }
+    if (ret < 0) {
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    if (!(flags & BDRV_O_RDWR)) {
+        ret = blkio_set_bool(s->blkio, "read-only", true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set read-only: %s",
+                             blkio_get_error_msg());
+            blkio_destroy(&s->blkio);
+            return ret;
+        }
+    }
+
+    ret = blkio_connect(s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_get_bool(s->blkio,
+                         "needs-mem-regions",
+                         &s->needs_mem_regions);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get needs-mem-regions: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_get_uint64(s->blkio,
+                           "mem-region-alignment",
+                           &s->mem_region_alignment);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get mem-region-alignment: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_start(s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_start failed: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    bs->supported_write_flags = BDRV_REQ_FUA;
+    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
+                               BDRV_REQ_NO_FALLBACK;
+
+    qemu_mutex_init(&s->blkio_lock);
+    qemu_co_mutex_init(&s->bounce_lock);
+    qemu_co_queue_init(&s->bounce_available);
+    QLIST_INIT(&s->bounce_bufs);
+    s->blkioq = blkio_get_queue(s->blkio, 0);
+    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
+
+    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
+    return 0;
+}
+
+static void blkio_close(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    /* There is no destroy() API for s->bounce_lock */
+
+    qemu_mutex_destroy(&s->blkio_lock);
+    blkio_detach_aio_context(bs);
+    blkio_destroy(&s->blkio);
+}
+
+static int64_t blkio_getlength(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+    uint64_t capacity;
+    int ret;
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
+    }
+    if (ret < 0) {
+        return ret; /* negative errno */
+    }
+
+    return capacity;
+}
+
+static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    return 0;
+}
+
+static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    BDRVBlkioState *s = bs->opaque;
+    QEMU_LOCK_GUARD(&s->blkio_lock);
+    int value;
+    int ret;
+
+    ret = blkio_get_int(s->blkio, "request-alignment", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    bs->bl.request_alignment = value;
+    if (bs->bl.request_alignment < 1 ||
+        bs->bl.request_alignment >= INT_MAX ||
+        !is_power_of_2(bs->bl.request_alignment)) {
+        error_setg(errp, "invalid \"request-alignment\" value %" PRIu32 ", "
+                   "must be a power of 2 less than INT_MAX",
+                   bs->bl.request_alignment);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio, "optimal-io-size", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"optimal-io-size\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    bs->bl.opt_transfer = value;
+    if (bs->bl.opt_transfer > INT_MAX ||
+        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
+        error_setg(errp, "invalid \"optimal-io-size\" value %" PRIu32 ", must "
+                   "be a multiple of %" PRIu32, bs->bl.opt_transfer,
+                   bs->bl.request_alignment);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio, "max-transfer", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    bs->bl.max_transfer = value;
+    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
+        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
+        error_setg(errp, "invalid \"max-transfer\" value %" PRIu32 ", must be "
+                   "a multiple of %" PRIu32 " and %" PRIu32 " (if non-zero)",
+                   bs->bl.max_transfer, bs->bl.request_alignment,
+                   bs->bl.opt_transfer);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (value < 1) {
+        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
+                   "positive", value);
+        return;
+    }
+    bs->bl.min_mem_alignment = value;
+
+    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get \"optimal-buf-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (value < 1) {
+        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
+                   "must be positive", value);
+        return;
+    }
+    bs->bl.opt_mem_alignment = value;
+
+    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (bs->bl.max_iov < 1) {
+        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
+                   bs->bl.max_iov);
+        return;
+    }
+}
+
+/*
+ * TODO
+ * Missing libblkio APIs:
+ * - block_status
+ * - co_invalidate_cache
+ *
+ * Out of scope?
+ * - create
+ * - truncate
+ */
+
+static BlockDriver bdrv_io_uring = {
+    .format_name                = DRIVER_IO_URING,
+    .protocol_name              = DRIVER_IO_URING,
+    .instance_size              = sizeof(BDRVBlkioState),
+    .bdrv_needs_filename        = true,
+    .bdrv_file_open             = blkio_file_open,
+    .bdrv_close                 = blkio_close,
+    .bdrv_getlength             = blkio_getlength,
+    .bdrv_get_info              = blkio_get_info,
+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
+    .bdrv_co_pdiscard           = blkio_co_pdiscard,
+    .bdrv_co_preadv             = blkio_co_preadv,
+    .bdrv_co_pwritev            = blkio_co_pwritev,
+    .bdrv_co_flush_to_disk      = blkio_co_flush,
+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
+    .bdrv_io_unplug             = blkio_io_unplug,
+    .bdrv_refresh_limits        = blkio_refresh_limits,
+};
+
+static BlockDriver bdrv_virtio_blk_vhost_user = {
+    .format_name                = DRIVER_VIRTIO_BLK_VHOST_USER,
+    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_USER,
+    .instance_size              = sizeof(BDRVBlkioState),
+    .bdrv_file_open             = blkio_file_open,
+    .bdrv_close                 = blkio_close,
+    .bdrv_getlength             = blkio_getlength,
+    .bdrv_get_info              = blkio_get_info,
+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
+    .bdrv_co_pdiscard           = blkio_co_pdiscard,
+    .bdrv_co_preadv             = blkio_co_preadv,
+    .bdrv_co_pwritev            = blkio_co_pwritev,
+    .bdrv_co_flush_to_disk      = blkio_co_flush,
+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
+    .bdrv_io_unplug             = blkio_io_unplug,
+    .bdrv_refresh_limits        = blkio_refresh_limits,
+};
+
+static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
+    .format_name                = DRIVER_VIRTIO_BLK_VHOST_VDPA,
+    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_VDPA,
+    .instance_size              = sizeof(BDRVBlkioState),
+    .bdrv_file_open             = blkio_file_open,
+    .bdrv_close                 = blkio_close,
+    .bdrv_getlength             = blkio_getlength,
+    .bdrv_get_info              = blkio_get_info,
+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
+    .bdrv_co_pdiscard           = blkio_co_pdiscard,
+    .bdrv_co_preadv             = blkio_co_preadv,
+    .bdrv_co_pwritev            = blkio_co_pwritev,
+    .bdrv_co_flush_to_disk      = blkio_co_flush,
+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
+    .bdrv_io_unplug             = blkio_io_unplug,
+    .bdrv_refresh_limits        = blkio_refresh_limits,
+};
+
+static void bdrv_blkio_init(void)
+{
+    bdrv_register(&bdrv_io_uring);
+    bdrv_register(&bdrv_virtio_blk_vhost_user);
+    bdrv_register(&bdrv_virtio_blk_vhost_vdpa);
+}
+
+block_init(bdrv_blkio_init);
diff --git a/tests/qtest/modules-test.c b/tests/qtest/modules-test.c
index 88217686e1..be2575ae6d 100644
--- a/tests/qtest/modules-test.c
+++ b/tests/qtest/modules-test.c
@@ -16,6 +16,9 @@ static void test_modules_load(const void *data)
 int main(int argc, char *argv[])
 {
     const char *modules[] = {
+#ifdef CONFIG_BLKIO
+        "block-", "blkio",
+#endif
 #ifdef CONFIG_CURL
         "block-", "curl",
 #endif
diff --git a/block/meson.build b/block/meson.build
index 60bc305597..500878f082 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -92,6 +92,7 @@ block_modules = {}
 
 modsrc = []
 foreach m : [
+  [blkio, 'blkio', files('blkio.c')],
   [curl, 'curl', files('curl.c')],
   [glusterfs, 'gluster', files('gluster.c')],
   [libiscsi, 'iscsi', [files('iscsi.c'), libm]],
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index eb3267bef5..2cb0de5601 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -67,6 +67,7 @@ meson_options_help() {
   printf "%s\n" '  auth-pam        PAM access control'
   printf "%s\n" '  avx2            AVX2 optimizations'
   printf "%s\n" '  avx512f         AVX512F optimizations'
+  printf "%s\n" '  blkio           libblkio block device driver'
   printf "%s\n" '  bochs           bochs image format support'
   printf "%s\n" '  bpf             eBPF support'
   printf "%s\n" '  brlapi          brlapi character device driver'
@@ -198,6 +199,8 @@ _meson_option_parse() {
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
     --disable-lto) printf "%s" -Db_lto=false ;;
+    --enable-blkio) printf "%s" -Dblkio=enabled ;;
+    --disable-blkio) printf "%s" -Dblkio=disabled ;;
     --block-drv-ro-whitelist=*) quote_sh "-Dblock_drv_ro_whitelist=$2" ;;
     --block-drv-rw-whitelist=*) quote_sh "-Dblock_drv_rw_whitelist=$2" ;;
     --enable-block-drv-whitelist-in-tools) printf "%s" -Dblock_drv_whitelist_in_tools=true ;;
-- 
2.37.3




* [PATCH v5 03/12] numa: call ->ram_block_removed() in ram_block_notifier_remove()
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 01/12] coroutine: add flag to re-queue at front of CoQueue Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 02/12] blkio: add libblkio block driver Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 04/12] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

When a RAMBlockNotifier is added, ->ram_block_added() is called with all
existing RAMBlocks. There is no equivalent ->ram_block_removed() call
when a RAMBlockNotifier is removed.

The util/vfio-helpers.c code (the sole user of RAMBlockNotifier) is fine
with this asymmetry because it does not rely on RAMBlockNotifier for
cleanup. It walks its internal list of DMA mappings and unmaps them by
itself.

Future users of RAMBlockNotifier may not have an internal data structure
that records added RAMBlocks so they will need ->ram_block_removed()
callbacks.

This patch makes ram_block_notifier_remove() symmetric with respect to
callbacks. Now util/vfio-helpers.c needs to unmap remaining DMA mappings
after ram_block_notifier_remove() has been called. This is necessary
since users like block/nvme.c may create additional DMA mappings that do
not originate from the RAMBlockNotifier.
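
A minimal sketch of a notifier that relies on this symmetry (the
register_region()/unregister_region() helpers are hypothetical, not part of
this series):

  static void my_ram_block_added(RAMBlockNotifier *n, void *host,
                                 size_t size, size_t max_size)
  {
      register_region(host, max_size);
  }

  static void my_ram_block_removed(RAMBlockNotifier *n, void *host,
                                   size_t size, size_t max_size)
  {
      unregister_region(host, max_size);
  }

  static RAMBlockNotifier my_notifier = {
      .ram_block_added   = my_ram_block_added,
      .ram_block_removed = my_ram_block_removed,
  };

No private list of RAMBlocks is needed because ram_block_notifier_remove()
now invokes ->ram_block_removed() for every remaining RAMBlock.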

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/core/numa.c      | 17 +++++++++++++++++
 util/vfio-helpers.c |  5 ++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 26d8e5f616..31e6fe1caa 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -822,6 +822,19 @@ static int ram_block_notify_add_single(RAMBlock *rb, void *opaque)
     return 0;
 }
 
+static int ram_block_notify_remove_single(RAMBlock *rb, void *opaque)
+{
+    const ram_addr_t max_size = qemu_ram_get_max_length(rb);
+    const ram_addr_t size = qemu_ram_get_used_length(rb);
+    void *host = qemu_ram_get_host_addr(rb);
+    RAMBlockNotifier *notifier = opaque;
+
+    if (host) {
+        notifier->ram_block_removed(notifier, host, size, max_size);
+    }
+    return 0;
+}
+
 void ram_block_notifier_add(RAMBlockNotifier *n)
 {
     QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
@@ -835,6 +848,10 @@ void ram_block_notifier_add(RAMBlockNotifier *n)
 void ram_block_notifier_remove(RAMBlockNotifier *n)
 {
     QLIST_REMOVE(n, next);
+
+    if (n->ram_block_removed) {
+        qemu_ram_foreach_block(ram_block_notify_remove_single, n);
+    }
 }
 
 void ram_block_notify_add(void *host, size_t size, size_t max_size)
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 5ba01177bf..0d1520caac 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -847,10 +847,13 @@ void qemu_vfio_close(QEMUVFIOState *s)
     if (!s) {
         return;
     }
+
+    ram_block_notifier_remove(&s->ram_notifier);
+
     for (i = 0; i < s->nr_mappings; ++i) {
         qemu_vfio_undo_mapping(s, &s->mappings[i], NULL);
     }
-    ram_block_notifier_remove(&s->ram_notifier);
+
     g_free(s->usable_iova_ranges);
     s->nb_iova_ranges = 0;
     qemu_vfio_reset(s);
-- 
2.37.3




* [PATCH v5 04/12] block: pass size to bdrv_unregister_buf()
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 03/12] numa: call ->ram_block_removed() in ram_block_notifier_remove() Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 05/12] block: use BdrvRequestFlags type for supported flag fields Stefan Hajnoczi
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

The only implementor of bdrv_register_buf() is block/nvme.c, where the
size is not needed when unregistering a buffer. This is because
util/vfio-helpers.c can look up mappings by address.

Future block drivers that implement bdrv_register_buf() may not be able
to do their job given only the buffer address. Add a size argument to
bdrv_unregister_buf().

Also document the assumptions about
bdrv_register_buf()/bdrv_unregister_buf() calls. The same <host, size>
values that were given to bdrv_register_buf() must be given to
bdrv_unregister_buf().

gcc 11.2.1 emits a spurious warning that img_bench()'s buf_size local
variable might be uninitialized, so it's necessary to silence the
compiler.
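
The required pairing then looks like this (sketch; buf and buf_size are
whatever the caller registered):

  blk_register_buf(blk, buf, buf_size);
  /* ... perform I/O using buf ... */
  blk_unregister_buf(blk, buf, buf_size); /* same <host, size> pair */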

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-global-state.h          | 5 ++++-
 include/block/block_int-common.h            | 2 +-
 include/sysemu/block-backend-global-state.h | 2 +-
 block/block-backend.c                       | 4 ++--
 block/io.c                                  | 6 +++---
 block/nvme.c                                | 2 +-
 qemu-img.c                                  | 4 ++--
 7 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 21265e3966..7901f35863 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -243,9 +243,12 @@ void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
  * Register/unregister a buffer for I/O. For example, VFIO drivers are
  * interested to know the memory areas that would later be used for I/O, so
  * that they can prepare IOMMU mapping etc., to get better performance.
+ *
+ * Buffers must not overlap and they must be unregistered with the same <host,
+ * size> values that they were registered with.
  */
 void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
-void bdrv_unregister_buf(BlockDriverState *bs, void *host);
+void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size);
 
 void bdrv_cancel_in_flight(BlockDriverState *bs);
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..b7a7cbd3a5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -435,7 +435,7 @@ struct BlockDriver {
      * DMA mapping for hot buffers.
      */
     void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
-    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
+    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host, size_t size);
 
     /*
      * This field is modified only under the BQL, and is part of
diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
index 415f0c91d7..97f7dad2c3 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -107,7 +107,7 @@ void blk_io_limits_update_group(BlockBackend *blk, const char *group);
 void blk_set_force_allow_inactivate(BlockBackend *blk);
 
 void blk_register_buf(BlockBackend *blk, void *host, size_t size);
-void blk_unregister_buf(BlockBackend *blk, void *host);
+void blk_unregister_buf(BlockBackend *blk, void *host, size_t size);
 
 const BdrvChild *blk_root(BlockBackend *blk);
 
diff --git a/block/block-backend.c b/block/block-backend.c
index d4a5df2ac2..99141f8f06 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2551,10 +2551,10 @@ void blk_register_buf(BlockBackend *blk, void *host, size_t size)
     bdrv_register_buf(blk_bs(blk), host, size);
 }
 
-void blk_unregister_buf(BlockBackend *blk, void *host)
+void blk_unregister_buf(BlockBackend *blk, void *host, size_t size)
 {
     GLOBAL_STATE_CODE();
-    bdrv_unregister_buf(blk_bs(blk), host);
+    bdrv_unregister_buf(blk_bs(blk), host, size);
 }
 
 int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..af85fa4bcc 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3305,16 +3305,16 @@ void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size)
     }
 }
 
-void bdrv_unregister_buf(BlockDriverState *bs, void *host)
+void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size)
 {
     BdrvChild *child;
 
     GLOBAL_STATE_CODE();
     if (bs->drv && bs->drv->bdrv_unregister_buf) {
-        bs->drv->bdrv_unregister_buf(bs, host);
+        bs->drv->bdrv_unregister_buf(bs, host, size);
     }
     QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_unregister_buf(child->bs, host);
+        bdrv_unregister_buf(child->bs, host, size);
     }
 }
 
diff --git a/block/nvme.c b/block/nvme.c
index 01fb28aa63..696502acea 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1592,7 +1592,7 @@ static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
     }
 }
 
-static void nvme_unregister_buf(BlockDriverState *bs, void *host)
+static void nvme_unregister_buf(BlockDriverState *bs, void *host, size_t size)
 {
     BDRVNVMeState *s = bs->opaque;
 
diff --git a/qemu-img.c b/qemu-img.c
index 7d4b33b3da..b8a8818ddb 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4368,7 +4368,7 @@ static int img_bench(int argc, char **argv)
     struct timeval t1, t2;
     int i;
     bool force_share = false;
-    size_t buf_size;
+    size_t buf_size = 0;
 
     for (;;) {
         static const struct option long_options[] = {
@@ -4590,7 +4590,7 @@ static int img_bench(int argc, char **argv)
 
 out:
     if (data.buf) {
-        blk_unregister_buf(blk, data.buf);
+        blk_unregister_buf(blk, data.buf, buf_size);
     }
     qemu_vfree(data.buf);
     blk_unref(blk);
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 05/12] block: use BdrvRequestFlags type for supported flag fields
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 04/12] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 06/12] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Use the enum type so GDB displays the enum members instead of printing a
numeric constant.
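
For illustration (output is approximate and depends on the GDB version),
printing a field that holds BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP then looks
something like:

    (gdb) print bs->supported_zero_flags
    $1 = (BDRV_REQ_MAY_UNMAP | BDRV_REQ_FUA)

instead of a bare "$1 = 20".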

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block_int-common.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index b7a7cbd3a5..19798d0e77 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -1051,7 +1051,7 @@ struct BlockDriverState {
     /*
      * Flags honored during pread
      */
-    unsigned int supported_read_flags;
+    BdrvRequestFlags supported_read_flags;
     /*
      * Flags honored during pwrite (so far: BDRV_REQ_FUA,
      * BDRV_REQ_WRITE_UNCHANGED).
@@ -1069,12 +1069,12 @@ struct BlockDriverState {
      * flag), or they have to explicitly take the WRITE permission for
      * their children.
      */
-    unsigned int supported_write_flags;
+    BdrvRequestFlags supported_write_flags;
     /*
      * Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
      * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED)
      */
-    unsigned int supported_zero_flags;
+    BdrvRequestFlags supported_zero_flags;
     /*
      * Flags honoured during truncate (so far: BDRV_REQ_ZERO_WRITE).
      *
@@ -1082,7 +1082,7 @@ struct BlockDriverState {
      * that any added space reads as all zeros. If this can't be guaranteed,
      * the operation must fail.
      */
-    unsigned int supported_truncate_flags;
+    BdrvRequestFlags supported_truncate_flags;
 
     /* the following member gives a name to every node on the bs graph. */
     char node_name[32];
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 06/12] block: add BDRV_REQ_REGISTERED_BUF request flag
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 05/12] block: use BdrvRequestFlags type for supported flag fields Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 07/12] block: return errors from bdrv_register_buf() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Block drivers may optimize I/O requests that access buffers previously
registered with bdrv_register_buf(). Checking whether all elements of a
request's QEMUIOVector lie within previously registered buffers is
expensive, so we need a hint from the user instead.

Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
QEMUIOVector elements in an I/O request are known to be within
previously registered buffers.

Always pass the flag through to driver read/write functions. There is
little harm in passing the flag to a driver that does not use it, and
doing so avoids changes across many block drivers. Otherwise every
filter driver would need to explicitly support the flag and pass it
through to its children when they support it. That is a lot of code
changes, it is hard to remember to do everywhere, and an accidentally
dropped flag silently reduces performance.

The only problematic scenario with the approach in this patch is when a
driver passes the flag through to internal I/O requests that don't use
the same I/O buffer. In that case the hint may be set when it should
actually be clear. This is a rare case, though, so the risk is low.

Some drivers have assert(!flags), which no longer holds when
BDRV_REQ_REGISTERED_BUF is passed in. These assertions aren't very
useful anyway, since the functions are called almost exclusively by
bdrv_driver_preadv/pwritev(); if we get the flags handling right there,
the assertions are not needed.
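
For illustration, a minimal sketch of how a caller is expected to use the
hint (the helper that checks the iovec elements is hypothetical):

    BdrvRequestFlags flags = 0;

    /* Only set the hint when every iovec element lies inside a buffer
     * previously registered with bdrv_register_buf(). */
    if (all_elements_registered(qiov)) { /* hypothetical helper */
        flags |= BDRV_REQ_REGISTERED_BUF;
    }

    ret = bdrv_co_pwritev(child, offset, bytes, qiov, flags);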

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-common.h |  9 ++++++
 block.c                      | 14 +++++++++
 block/blkverify.c            |  4 +--
 block/crypto.c               |  4 +--
 block/file-posix.c           |  1 -
 block/gluster.c              |  1 -
 block/io.c                   | 61 ++++++++++++++++++++++--------------
 block/mirror.c               |  2 ++
 block/nbd.c                  |  1 -
 block/parallels.c            |  1 -
 block/qcow.c                 |  2 --
 block/qed.c                  |  1 -
 block/raw-format.c           |  2 ++
 block/replication.c          |  1 -
 block/ssh.c                  |  1 -
 block/vhdx.c                 |  1 -
 16 files changed, 69 insertions(+), 37 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..061606e867 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -80,6 +80,15 @@ typedef enum {
      */
     BDRV_REQ_MAY_UNMAP          = 0x4,
 
+    /*
+     * An optimization hint when all QEMUIOVector elements are within
+     * previously registered bdrv_register_buf() memory ranges.
+     *
+     * Code that replaces the user's QEMUIOVector elements with bounce buffers
+     * must take care to clear this flag.
+     */
+    BDRV_REQ_REGISTERED_BUF     = 0x8,
+
     BDRV_REQ_FUA                = 0x10,
     BDRV_REQ_WRITE_COMPRESSED   = 0x20,
 
diff --git a/block.c b/block.c
index bc85f46eed..70abbf774e 100644
--- a/block.c
+++ b/block.c
@@ -1640,6 +1640,20 @@ static int bdrv_open_driver(BlockDriverState *bs, BlockDriver *drv,
         goto open_failed;
     }
 
+    assert(!(bs->supported_read_flags & ~BDRV_REQ_MASK));
+    assert(!(bs->supported_write_flags & ~BDRV_REQ_MASK));
+
+    /*
+     * Always allow the BDRV_REQ_REGISTERED_BUF optimization hint. This saves
+     * drivers that pass read/write requests through to a child the trouble of
+     * declaring support explicitly.
+     *
+     * Drivers must not propagate this flag accidentally when they initiate I/O
+     * to a bounce buffer. That case should be rare though.
+     */
+    bs->supported_read_flags |= BDRV_REQ_REGISTERED_BUF;
+    bs->supported_write_flags |= BDRV_REQ_REGISTERED_BUF;
+
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
diff --git a/block/blkverify.c b/block/blkverify.c
index e4a37af3b2..d624f4fd05 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -235,8 +235,8 @@ blkverify_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
     qemu_iovec_init(&raw_qiov, qiov->niov);
     qemu_iovec_clone(&raw_qiov, qiov, buf);
 
-    ret = blkverify_co_prwv(bs, &r, offset, bytes, qiov, &raw_qiov, flags,
-                            false);
+    ret = blkverify_co_prwv(bs, &r, offset, bytes, qiov, &raw_qiov,
+                            flags & ~BDRV_REQ_REGISTERED_BUF, false);
 
     cmp_offset = qemu_iovec_compare(qiov, &raw_qiov);
     if (cmp_offset != -1) {
diff --git a/block/crypto.c b/block/crypto.c
index 7a57774b76..c7365598a7 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -410,7 +410,6 @@ block_crypto_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
     uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
     uint64_t payload_offset = qcrypto_block_get_payload_offset(crypto->block);
 
-    assert(!flags);
     assert(payload_offset < INT64_MAX);
     assert(QEMU_IS_ALIGNED(offset, sector_size));
     assert(QEMU_IS_ALIGNED(bytes, sector_size));
@@ -473,7 +472,8 @@ block_crypto_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes,
     uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
     uint64_t payload_offset = qcrypto_block_get_payload_offset(crypto->block);
 
-    assert(!(flags & ~BDRV_REQ_FUA));
+    flags &= ~BDRV_REQ_REGISTERED_BUF;
+
     assert(payload_offset < INT64_MAX);
     assert(QEMU_IS_ALIGNED(offset, sector_size));
     assert(QEMU_IS_ALIGNED(bytes, sector_size));
diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..51fd57137d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2120,7 +2120,6 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
                                        int64_t bytes, QEMUIOVector *qiov,
                                        BdrvRequestFlags flags)
 {
-    assert(flags == 0);
     return raw_co_prw(bs, offset, bytes, qiov, QEMU_AIO_WRITE);
 }
 
diff --git a/block/gluster.c b/block/gluster.c
index b60213ab80..3894422e23 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1236,7 +1236,6 @@ static coroutine_fn int qemu_gluster_co_writev(BlockDriverState *bs,
                                                QEMUIOVector *qiov,
                                                int flags)
 {
-    assert(!flags);
     return qemu_gluster_co_rw(bs, sector_num, nb_sectors, qiov, 1);
 }
 
diff --git a/block/io.c b/block/io.c
index af85fa4bcc..de04594299 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1139,8 +1139,7 @@ static int coroutine_fn bdrv_driver_preadv(BlockDriverState *bs,
     int ret;
 
     bdrv_check_qiov_request(offset, bytes, qiov, qiov_offset, &error_abort);
-    assert(!(flags & ~BDRV_REQ_MASK));
-    assert(!(flags & BDRV_REQ_NO_FALLBACK));
+    assert(!(flags & ~bs->supported_read_flags));
 
     if (!drv) {
         return -ENOMEDIUM;
@@ -1204,23 +1203,29 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,
                                             BdrvRequestFlags flags)
 {
     BlockDriver *drv = bs->drv;
+    bool emulate_fua = false;
     int64_t sector_num;
     unsigned int nb_sectors;
     QEMUIOVector local_qiov;
     int ret;
 
     bdrv_check_qiov_request(offset, bytes, qiov, qiov_offset, &error_abort);
-    assert(!(flags & ~BDRV_REQ_MASK));
-    assert(!(flags & BDRV_REQ_NO_FALLBACK));
 
     if (!drv) {
         return -ENOMEDIUM;
     }
 
+    if ((flags & BDRV_REQ_FUA) &&
+        (~bs->supported_write_flags & BDRV_REQ_FUA)) {
+        flags &= ~BDRV_REQ_FUA;
+        emulate_fua = true;
+    }
+
+    flags &= bs->supported_write_flags;
+
     if (drv->bdrv_co_pwritev_part) {
         ret = drv->bdrv_co_pwritev_part(bs, offset, bytes, qiov, qiov_offset,
-                                        flags & bs->supported_write_flags);
-        flags &= ~bs->supported_write_flags;
+                                        flags);
         goto emulate_flags;
     }
 
@@ -1230,9 +1235,7 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,
     }
 
     if (drv->bdrv_co_pwritev) {
-        ret = drv->bdrv_co_pwritev(bs, offset, bytes, qiov,
-                                   flags & bs->supported_write_flags);
-        flags &= ~bs->supported_write_flags;
+        ret = drv->bdrv_co_pwritev(bs, offset, bytes, qiov, flags);
         goto emulate_flags;
     }
 
@@ -1242,10 +1245,8 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,
             .coroutine = qemu_coroutine_self(),
         };
 
-        acb = drv->bdrv_aio_pwritev(bs, offset, bytes, qiov,
-                                    flags & bs->supported_write_flags,
+        acb = drv->bdrv_aio_pwritev(bs, offset, bytes, qiov, flags,
                                     bdrv_co_io_em_complete, &co);
-        flags &= ~bs->supported_write_flags;
         if (acb == NULL) {
             ret = -EIO;
         } else {
@@ -1263,12 +1264,10 @@ static int coroutine_fn bdrv_driver_pwritev(BlockDriverState *bs,
     assert(bytes <= BDRV_REQUEST_MAX_BYTES);
 
     assert(drv->bdrv_co_writev);
-    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov,
-                              flags & bs->supported_write_flags);
-    flags &= ~bs->supported_write_flags;
+    ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov, flags);
 
 emulate_flags:
-    if (ret == 0 && (flags & BDRV_REQ_FUA)) {
+    if (ret == 0 && emulate_fua) {
         ret = bdrv_co_flush(bs);
     }
 
@@ -1496,11 +1495,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
     max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
                                    align);
 
-    /* TODO: We would need a per-BDS .supported_read_flags and
+    /*
+     * TODO: We would need a per-BDS .supported_read_flags and
      * potential fallback support, if we ever implement any read flags
      * to pass through to drivers.  For now, there aren't any
-     * passthrough flags.  */
-    assert(!(flags & ~(BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH)));
+     * passthrough flags except the BDRV_REQ_REGISTERED_BUF optimization hint.
+     */
+    assert(!(flags & ~(BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH |
+                       BDRV_REQ_REGISTERED_BUF)));
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -1541,7 +1543,7 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
         goto out;
     }
 
-    assert(!(flags & ~bs->supported_read_flags));
+    assert(!(flags & ~(bs->supported_read_flags | BDRV_REQ_REGISTERED_BUF)));
 
     max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
     if (bytes <= max_bytes && bytes <= max_transfer) {
@@ -1730,7 +1732,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
-                            BdrvRequestPadding *pad, bool *padded)
+                            BdrvRequestPadding *pad, bool *padded,
+                            BdrvRequestFlags *flags)
 {
     int ret;
 
@@ -1758,6 +1761,10 @@ static int bdrv_pad_request(BlockDriverState *bs,
     if (padded) {
         *padded = true;
     }
+    if (flags) {
+        /* Can't use optimization hint with bounce buffer */
+        *flags &= ~BDRV_REQ_REGISTERED_BUF;
+    }
 
     return 0;
 }
@@ -1812,7 +1819,7 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
     }
 
     ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL);
+                           NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -1857,6 +1864,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* By definition there is no user buffer so this flag doesn't make sense */
+    if (flags & BDRV_REQ_REGISTERED_BUF) {
+        return -EINVAL;
+    }
+
     /* Invalidate the cached block-status data range if this write overlaps */
     bdrv_bsc_invalidate_range(bs, offset, bytes);
 
@@ -2142,6 +2154,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
     bool padding;
     BdrvRequestPadding pad;
 
+    /* This flag doesn't make sense for padding or zero writes */
+    flags &= ~BDRV_REQ_REGISTERED_BUF;
+
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
@@ -2259,7 +2274,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
          * alignment only if there is no ZERO flag.
          */
         ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded);
+                               &padded, &flags);
         if (ret < 0) {
             return ret;
         }
diff --git a/block/mirror.c b/block/mirror.c
index 3c4ab1159d..8d3fc3f19b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1477,6 +1477,8 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
         qemu_iovec_init(&bounce_qiov, 1);
         qemu_iovec_add(&bounce_qiov, bounce_buf, bytes);
         qiov = &bounce_qiov;
+
+        flags &= ~BDRV_REQ_REGISTERED_BUF;
     }
 
     ret = bdrv_mirror_top_do_write(bs, MIRROR_METHOD_COPY, offset, bytes, qiov,
diff --git a/block/nbd.c b/block/nbd.c
index 97683cce27..ce2956e083 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1221,7 +1221,6 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
     };
 
     assert(bytes <= NBD_MAX_BUFFER_SIZE);
-    assert(!flags);
 
     if (!bytes) {
         return 0;
diff --git a/block/parallels.c b/block/parallels.c
index a229c06f25..7c04c6a135 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -328,7 +328,6 @@ static coroutine_fn int parallels_co_writev(BlockDriverState *bs,
     QEMUIOVector hd_qiov;
     int ret = 0;
 
-    assert(!flags);
     qemu_iovec_init(&hd_qiov, qiov->niov);
 
     while (nb_sectors > 0) {
diff --git a/block/qcow.c b/block/qcow.c
index 311aaa8705..e9180c7b61 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -628,7 +628,6 @@ static coroutine_fn int qcow_co_preadv(BlockDriverState *bs, int64_t offset,
     uint8_t *buf;
     void *orig_buf;
 
-    assert(!flags);
     if (qiov->niov > 1) {
         buf = orig_buf = qemu_try_blockalign(bs, qiov->size);
         if (buf == NULL) {
@@ -725,7 +724,6 @@ static coroutine_fn int qcow_co_pwritev(BlockDriverState *bs, int64_t offset,
     uint8_t *buf;
     void *orig_buf;
 
-    assert(!flags);
     s->cluster_cache_offset = -1; /* disable compressed cache */
 
     /* We must always copy the iov when encrypting, so we
diff --git a/block/qed.c b/block/qed.c
index 40943e679b..32429682a4 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -1388,7 +1388,6 @@ static int coroutine_fn bdrv_qed_co_writev(BlockDriverState *bs,
                                            int64_t sector_num, int nb_sectors,
                                            QEMUIOVector *qiov, int flags)
 {
-    assert(!flags);
     return qed_co_request(bs, sector_num, qiov, nb_sectors, QED_AIOCB_WRITE);
 }
 
diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..9bae3dd7f2 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -258,6 +258,8 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
         qemu_iovec_add(&local_qiov, buf, 512);
         qemu_iovec_concat(&local_qiov, qiov, 512, qiov->size - 512);
         qiov = &local_qiov;
+
+        flags &= ~BDRV_REQ_REGISTERED_BUF;
     }
 
     ret = raw_adjust_offset(bs, &offset, bytes, true);
diff --git a/block/replication.c b/block/replication.c
index 55c8f894aa..942f88b205 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -260,7 +260,6 @@ static coroutine_fn int replication_co_writev(BlockDriverState *bs,
     int ret;
     int64_t n;
 
-    assert(!flags);
     ret = replication_get_io_status(s);
     if (ret < 0) {
         goto out;
diff --git a/block/ssh.c b/block/ssh.c
index a2dc646536..a3cddc392c 100644
--- a/block/ssh.c
+++ b/block/ssh.c
@@ -1196,7 +1196,6 @@ static coroutine_fn int ssh_co_writev(BlockDriverState *bs,
     BDRVSSHState *s = bs->opaque;
     int ret;
 
-    assert(!flags);
     qemu_co_mutex_lock(&s->lock);
     ret = ssh_write(s, bs, sector_num * BDRV_SECTOR_SIZE,
                     nb_sectors * BDRV_SECTOR_SIZE, qiov);
diff --git a/block/vhdx.c b/block/vhdx.c
index e10e78ebfd..e2344ee0b7 100644
--- a/block/vhdx.c
+++ b/block/vhdx.c
@@ -1342,7 +1342,6 @@ static coroutine_fn int vhdx_co_writev(BlockDriverState *bs, int64_t sector_num,
     uint64_t bat_prior_offset = 0;
     bool bat_update = false;
 
-    assert(!flags);
     qemu_iovec_init(&hd_qiov, qiov->niov);
 
     qemu_co_mutex_lock(&s->lock);
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 07/12] block: return errors from bdrv_register_buf()
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 06/12] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 08/12] block: add BlockRAMRegistrar Stefan Hajnoczi
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Registering an I/O buffer is only a performance optimization hint, but it
is still necessary to return an error when registration fails.

Later patches will need to detect errors when registering buffers, but an
immediate advantage is that error_report() calls are no longer needed in
block driver .bdrv_register_buf() functions.
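
A minimal sketch of a caller that treats registration failure as
non-fatal (the fallback policy is the caller's choice, not something this
patch mandates):

    Error *local_err = NULL;

    if (!blk_register_buf(blk, host, size, &local_err)) {
        /* Report and continue; unregistered I/O still works, just slower */
        error_report_err(local_err);
    }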

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-global-state.h          |  5 ++-
 include/block/block_int-common.h            |  5 ++-
 include/sysemu/block-backend-global-state.h |  2 +-
 block/block-backend.c                       |  4 +--
 block/io.c                                  | 34 +++++++++++++++++++--
 block/nvme.c                                | 18 +++++------
 qemu-img.c                                  |  2 +-
 7 files changed, 52 insertions(+), 18 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 7901f35863..eba4ed23b4 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -246,8 +246,11 @@ void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
  *
  * Buffers must not overlap and they must be unregistered with the same <host,
  * size> values that they were registered with.
+ *
+ * Returns: true on success, false on failure
  */
-void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
+bool bdrv_register_buf(BlockDriverState *bs, void *host, size_t size,
+                       Error **errp);
 void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size);
 
 void bdrv_cancel_in_flight(BlockDriverState *bs);
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 19798d0e77..9c569be162 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -433,8 +433,11 @@ struct BlockDriver {
      * that it can do IOMMU mapping with VFIO etc., in order to get better
      * performance. In the case of VFIO drivers, this callback is used to do
      * DMA mapping for hot buffers.
+     *
+     * Returns: true on success, false on failure
      */
-    void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
+    bool (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size,
+                              Error **errp);
     void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host, size_t size);
 
     /*
diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
index 97f7dad2c3..6858e39cb6 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -106,7 +106,7 @@ void blk_io_limits_enable(BlockBackend *blk, const char *group);
 void blk_io_limits_update_group(BlockBackend *blk, const char *group);
 void blk_set_force_allow_inactivate(BlockBackend *blk);
 
-void blk_register_buf(BlockBackend *blk, void *host, size_t size);
+bool blk_register_buf(BlockBackend *blk, void *host, size_t size, Error **errp);
 void blk_unregister_buf(BlockBackend *blk, void *host, size_t size);
 
 const BdrvChild *blk_root(BlockBackend *blk);
diff --git a/block/block-backend.c b/block/block-backend.c
index 99141f8f06..34399d3b7b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2545,10 +2545,10 @@ static void blk_root_drained_end(BdrvChild *child, int *drained_end_counter)
     }
 }
 
-void blk_register_buf(BlockBackend *blk, void *host, size_t size)
+bool blk_register_buf(BlockBackend *blk, void *host, size_t size, Error **errp)
 {
     GLOBAL_STATE_CODE();
-    bdrv_register_buf(blk_bs(blk), host, size);
+    return bdrv_register_buf(blk_bs(blk), host, size, errp);
 }
 
 void blk_unregister_buf(BlockBackend *blk, void *host, size_t size)
diff --git a/block/io.c b/block/io.c
index de04594299..1c3392c7f6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3307,17 +3307,45 @@ void bdrv_io_unplug(BlockDriverState *bs)
     }
 }
 
-void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size)
+/* Helper that undoes bdrv_register_buf() when it fails partway through */
+static void bdrv_register_buf_rollback(BlockDriverState *bs,
+                                       void *host,
+                                       size_t size,
+                                       BdrvChild *final_child)
+{
+    BdrvChild *child;
+
+    QLIST_FOREACH(child, &bs->children, next) {
+        if (child == final_child) {
+            break;
+        }
+
+        bdrv_unregister_buf(child->bs, host, size);
+    }
+
+    if (bs->drv && bs->drv->bdrv_unregister_buf) {
+        bs->drv->bdrv_unregister_buf(bs, host, size);
+    }
+}
+
+bool bdrv_register_buf(BlockDriverState *bs, void *host, size_t size,
+                       Error **errp)
 {
     BdrvChild *child;
 
     GLOBAL_STATE_CODE();
     if (bs->drv && bs->drv->bdrv_register_buf) {
-        bs->drv->bdrv_register_buf(bs, host, size);
+        if (!bs->drv->bdrv_register_buf(bs, host, size, errp)) {
+            return false;
+        }
     }
     QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_register_buf(child->bs, host, size);
+        if (!bdrv_register_buf(child->bs, host, size, errp)) {
+            bdrv_register_buf_rollback(bs, host, size, child);
+            return false;
+        }
     }
+    return true;
 }
 
 void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size)
diff --git a/block/nvme.c b/block/nvme.c
index 696502acea..1e76fe21d3 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1577,19 +1577,19 @@ static void nvme_aio_unplug(BlockDriverState *bs)
     }
 }
 
-static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
+static bool nvme_register_buf(BlockDriverState *bs, void *host, size_t size,
+                              Error **errp)
 {
     int ret;
-    Error *local_err = NULL;
     BDRVNVMeState *s = bs->opaque;
 
-    ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL, &local_err);
-    if (ret) {
-        /* FIXME: we may run out of IOVA addresses after repeated
-         * bdrv_register_buf/bdrv_unregister_buf, because nvme_vfio_dma_unmap
-         * doesn't reclaim addresses for fixed mappings. */
-        error_reportf_err(local_err, "nvme_register_buf failed: ");
-    }
+    /*
+     * FIXME: we may run out of IOVA addresses after repeated
+     * bdrv_register_buf/bdrv_unregister_buf, because nvme_vfio_dma_unmap
+     * doesn't reclaim addresses for fixed mappings.
+     */
+    ret = qemu_vfio_dma_map(s->vfio, host, size, false, NULL, errp);
+    return ret == 0;
 }
 
 static void nvme_unregister_buf(BlockDriverState *bs, void *host, size_t size)
diff --git a/qemu-img.c b/qemu-img.c
index b8a8818ddb..3f2927f5b4 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4567,7 +4567,7 @@ static int img_bench(int argc, char **argv)
     data.buf = blk_blockalign(blk, buf_size);
     memset(data.buf, pattern, data.nrreq * data.bufsize);
 
-    blk_register_buf(blk, data.buf, buf_size);
+    blk_register_buf(blk, data.buf, buf_size, &error_fatal);
 
     data.qiov = g_new(QEMUIOVector, data.nrreq);
     for (i = 0; i < data.nrreq; i++) {
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 08/12] block: add BlockRAMRegistrar
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 07/12] block: return errors from bdrv_register_buf() Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 09/12] exec/cpu-common: add qemu_ram_get_fd() Stefan Hajnoczi
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Emulated devices and other BlockBackend users wishing to take advantage
of blk_register_buf() all have the same repetitive job: register
RAMBlocks with the BlockBackend using RAMBlockNotifier.

Add a BlockRAMRegistrar API to do this. A later commit will use this
from hw/block/virtio-blk.c.
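
A minimal usage sketch (blk and flags come from the surrounding device
code; virtio-blk gains this in a later commit):

    BlockRAMRegistrar blk_ram_registrar;

    /* realize: register existing and future RAMBlocks with the BlockBackend */
    blk_ram_registrar_init(&blk_ram_registrar, blk);

    /* I/O submission: only use the hint if registration succeeded */
    if (blk_ram_registrar_ok(&blk_ram_registrar)) {
        flags |= BDRV_REQ_REGISTERED_BUF;
    }

    /* unrealize */
    blk_ram_registrar_destroy(&blk_ram_registrar);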

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                          |  1 +
 include/sysemu/block-ram-registrar.h | 37 +++++++++++++++++++
 block/block-ram-registrar.c          | 54 ++++++++++++++++++++++++++++
 block/meson.build                    |  1 +
 4 files changed, 93 insertions(+)
 create mode 100644 include/sysemu/block-ram-registrar.h
 create mode 100644 block/block-ram-registrar.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 878005f65b..1863724fa5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2498,6 +2498,7 @@ F: block*
 F: block/
 F: hw/block/
 F: include/block/
+F: include/sysemu/block-*.h
 F: qemu-img*
 F: docs/tools/qemu-img.rst
 F: qemu-io*
diff --git a/include/sysemu/block-ram-registrar.h b/include/sysemu/block-ram-registrar.h
new file mode 100644
index 0000000000..d8b2f7942b
--- /dev/null
+++ b/include/sysemu/block-ram-registrar.h
@@ -0,0 +1,37 @@
+/*
+ * BlockBackend RAM Registrar
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef BLOCK_RAM_REGISTRAR_H
+#define BLOCK_RAM_REGISTRAR_H
+
+#include "exec/ramlist.h"
+
+/**
+ * struct BlockRAMRegistrar:
+ *
+ * Keeps RAMBlock memory registered with a BlockBackend using
+ * blk_register_buf() including hotplugged memory.
+ *
+ * Emulated devices or other BlockBackend users initialize a BlockRAMRegistrar
+ * with blk_ram_registrar_init() before submitting I/O requests with the
+ * BDRV_REQ_REGISTERED_BUF flag set.
+ */
+typedef struct {
+    BlockBackend *blk;
+    RAMBlockNotifier notifier;
+    bool ok;
+} BlockRAMRegistrar;
+
+void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk);
+void blk_ram_registrar_destroy(BlockRAMRegistrar *r);
+
+/* Have all RAMBlocks been registered successfully? */
+static inline bool blk_ram_registrar_ok(BlockRAMRegistrar *r)
+{
+    return r->ok;
+}
+
+#endif /* BLOCK_RAM_REGISTRAR_H */
diff --git a/block/block-ram-registrar.c b/block/block-ram-registrar.c
new file mode 100644
index 0000000000..32935006c1
--- /dev/null
+++ b/block/block-ram-registrar.c
@@ -0,0 +1,54 @@
+/*
+ * BlockBackend RAM Registrar
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/block-backend.h"
+#include "sysemu/block-ram-registrar.h"
+#include "qapi/error.h"
+
+static void ram_block_added(RAMBlockNotifier *n, void *host, size_t size,
+                            size_t max_size)
+{
+    BlockRAMRegistrar *r = container_of(n, BlockRAMRegistrar, notifier);
+    Error *err = NULL;
+
+    if (!blk_register_buf(r->blk, host, max_size, &err)) {
+        error_report_err(err);
+        ram_block_notifier_remove(&r->notifier);
+        r->ok = false;
+    }
+}
+
+static void ram_block_removed(RAMBlockNotifier *n, void *host, size_t size,
+                              size_t max_size)
+{
+    BlockRAMRegistrar *r = container_of(n, BlockRAMRegistrar, notifier);
+    blk_unregister_buf(r->blk, host, max_size);
+}
+
+void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk)
+{
+    r->blk = blk;
+    r->notifier = (RAMBlockNotifier){
+        .ram_block_added = ram_block_added,
+        .ram_block_removed = ram_block_removed,
+
+        /*
+         * .ram_block_resized() is not necessary because we use the max_size
+         * value that does not change across resize.
+         */
+    };
+    r->ok = true;
+
+    ram_block_notifier_add(&r->notifier);
+}
+
+void blk_ram_registrar_destroy(BlockRAMRegistrar *r)
+{
+    if (r->ok) {
+        ram_block_notifier_remove(&r->notifier);
+    }
+}
diff --git a/block/meson.build b/block/meson.build
index 500878f082..b7c68b83a3 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -46,6 +46,7 @@ block_ss.add(files(
 ), zstd, zlib, gnutls)
 
 softmmu_ss.add(when: 'CONFIG_TCG', if_true: files('blkreplay.c'))
+softmmu_ss.add(files('block-ram-registrar.c'))
 
 if get_option('qcow1').allowed()
   block_ss.add(files('qcow.c'))
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 09/12] exec/cpu-common: add qemu_ram_get_fd()
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (7 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 08/12] block: add BlockRAMRegistrar Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 10/12] stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd() Stefan Hajnoczi
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Add a function to get the file descriptor for a RAMBlock. Device
emulation code typically uses the MemoryRegion APIs, but vhost-style code
may use RAMBlock directly to share guest memory with another process.

This new API will be used by the libblkio block driver so it can share
guest memory via .bdrv_register_buf().
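
A sketch of the lookup the libblkio driver will perform (error handling
elided; -1 means the RAMBlock is not backed by a file descriptor):

    ram_addr_t fd_offset;
    RAMBlock *ram_block = qemu_ram_block_from_host(host, false, &fd_offset);
    int fd = ram_block ? qemu_ram_get_fd(ram_block) : -1;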

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/exec/cpu-common.h | 1 +
 softmmu/physmem.c         | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index d909429427..5bd73a9db5 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -91,6 +91,7 @@ void qemu_ram_set_uf_zeroable(RAMBlock *rb);
 bool qemu_ram_is_migratable(RAMBlock *rb);
 void qemu_ram_set_migratable(RAMBlock *rb);
 void qemu_ram_unset_migratable(RAMBlock *rb);
+int qemu_ram_get_fd(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 56e03e07b5..d9578ccfd4 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -1748,6 +1748,11 @@ void qemu_ram_unset_migratable(RAMBlock *rb)
     rb->flags &= ~RAM_MIGRATABLE;
 }
 
+int qemu_ram_get_fd(RAMBlock *rb)
+{
+    return rb->fd;
+}
+
 /* Called with iothread lock held.  */
 void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
 {
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 10/12] stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd()
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (8 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 09/12] exec/cpu-common: add qemu_ram_get_fd() Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

The blkio block driver will need to look up the file descriptor for a
given pointer. This is possible in softmmu builds where the RAMBlock API
is available for querying guest RAM.

Add stubs so that tools like qemu-img, which link the block layer, still
build successfully. In this case there is no guest RAM, but that is fine:
bounce buffers and their file descriptors will be allocated with
libblkio's blkio_alloc_mem_region(), so we won't rely on QEMU's
qemu_ram_get_fd() in that case.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 stubs/physmem.c   | 13 +++++++++++++
 stubs/meson.build |  1 +
 2 files changed, 14 insertions(+)
 create mode 100644 stubs/physmem.c

diff --git a/stubs/physmem.c b/stubs/physmem.c
new file mode 100644
index 0000000000..1fc5f2df29
--- /dev/null
+++ b/stubs/physmem.c
@@ -0,0 +1,13 @@
+#include "qemu/osdep.h"
+#include "exec/cpu-common.h"
+
+RAMBlock *qemu_ram_block_from_host(void *ptr, bool round_offset,
+                                   ram_addr_t *offset)
+{
+    return NULL;
+}
+
+int qemu_ram_get_fd(RAMBlock *rb)
+{
+    return -1;
+}
diff --git a/stubs/meson.build b/stubs/meson.build
index d8f3fd5c44..4314161f5f 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -29,6 +29,7 @@ stub_ss.add(files('migr-blocker.c'))
 stub_ss.add(files('module-opts.c'))
 stub_ss.add(files('monitor.c'))
 stub_ss.add(files('monitor-core.c'))
+stub_ss.add(files('physmem.c'))
 stub_ss.add(files('qemu-timer-notify-cb.c'))
 stub_ss.add(files('qmp_memory_device.c'))
 stub_ss.add(files('qmp-command-available.c'))
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (9 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 10/12] stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd() Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-09-28 19:21   ` Stefan Hajnoczi
  2022-09-27 19:34 ` [PATCH v5 12/12] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
  2022-10-06 12:18 ` [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefano Garzarella
  12 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Avoid bounce buffers when QEMUIOVector elements are within previously
registered bdrv_register_buf() buffers.

The idea is that emulated storage controllers will register guest RAM
using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF flag on I/O
requests. Therefore no blkio_map_mem_region() calls are necessary in the
performance-critical I/O code path.

This optimization doesn't apply if the I/O buffer is internally
allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
path because BDRV_REQ_REGISTERED_BUF is not set.
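
The request-path decision boils down to the following (a condensed
restatement of the hunks below):

    bool use_bounce_buffer =
        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);

    if (use_bounce_buffer) {
        /* slow path: copy through a pre-mapped bounce buffer */
    } else {
        /* fast path: the guest buffers were already registered via
         * .bdrv_register_buf(), so no mapping is needed at I/O time */
    }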

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/blkio.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 171 insertions(+), 3 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 9244a653ef..ed6ec7f167 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -11,9 +11,13 @@
 #include "qemu/osdep.h"
 #include <blkio.h>
 #include "block/block_int.h"
+#include "exec/memory.h"
+#include "exec/cpu-common.h" /* for qemu_ram_get_fd() */
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
+#include "exec/memory.h" /* for ram_block_discard_disable() */
 
 /*
  * Keep the QEMU BlockDriver names identical to the libblkio driver names.
@@ -72,6 +76,12 @@ typedef struct {
 
     /* Can we skip adding/deleting blkio_mem_regions? */
     bool needs_mem_regions;
+
+    /* Are file descriptors necessary for blkio_mem_regions? */
+    bool needs_mem_region_fd;
+
+    /* Are madvise(MADV_DONTNEED)-style operations unavailable? */
+    bool mem_regions_pinned;
 } BDRVBlkioState;
 
 /* Called with s->bounce_lock held */
@@ -346,7 +356,8 @@ blkio_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
         .coroutine = qemu_coroutine_self(),
     };
     BDRVBlkioState *s = bs->opaque;
-    bool use_bounce_buffer = s->needs_mem_regions;
+    bool use_bounce_buffer =
+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
     BlkioBounceBuf bounce;
     struct iovec *iov = qiov->iov;
     int iovcnt = qiov->niov;
@@ -389,7 +400,8 @@ static int coroutine_fn blkio_co_pwritev(BlockDriverState *bs, int64_t offset,
         .coroutine = qemu_coroutine_self(),
     };
     BDRVBlkioState *s = bs->opaque;
-    bool use_bounce_buffer = s->needs_mem_regions;
+    bool use_bounce_buffer =
+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
     BlkioBounceBuf bounce;
     struct iovec *iov = qiov->iov;
     int iovcnt = qiov->niov;
@@ -472,6 +484,117 @@ static void blkio_io_unplug(BlockDriverState *bs)
     }
 }
 
+typedef enum {
+    BMRR_OK,
+    BMRR_SKIP,
+    BMRR_FAIL,
+} BlkioMemRegionResult;
+
+/*
+ * Produce a struct blkio_mem_region for a given address and size.
+ *
+ * This function produces identical results when called multiple times with the
+ * same arguments. This property is necessary because blkio_unmap_mem_region()
+ * must receive the same struct blkio_mem_region field values that were passed
+ * to blkio_map_mem_region().
+ */
+static BlkioMemRegionResult
+blkio_mem_region_from_host(BlockDriverState *bs,
+                           void *host, size_t size,
+                           struct blkio_mem_region *region,
+                           Error **errp)
+{
+    BDRVBlkioState *s = bs->opaque;
+    int fd = -1;
+    ram_addr_t fd_offset = 0;
+
+    if (((uintptr_t)host | size) % s->mem_region_alignment) {
+        error_setg(errp, "unaligned buf %p with size %zu", host, size);
+        return BMRR_FAIL;
+    }
+
+    /* Attempt to find the fd for the underlying memory */
+    if (s->needs_mem_region_fd) {
+        RAMBlock *ram_block;
+        RAMBlock *end_block;
+        ram_addr_t offset;
+
+        /*
+         * bdrv_register_buf() is called with the BQL held so mr lives at least
+         * until this function returns.
+         */
+        ram_block = qemu_ram_block_from_host(host, false, &fd_offset);
+        if (ram_block) {
+            fd = qemu_ram_get_fd(ram_block);
+        }
+        if (fd == -1) {
+            /*
+             * Ideally every RAMBlock would have an fd. pc-bios and other
+             * things don't. Luckily they are usually not I/O buffers and we
+             * can just ignore them.
+             */
+            return BMRR_SKIP;
+        }
+
+        /* Make sure the fd covers the entire range */
+        end_block = qemu_ram_block_from_host(host + size - 1, false, &offset);
+        if (ram_block != end_block) {
+            error_setg(errp, "registered buffer at %p with size %zu extends "
+                       "beyond RAMBlock", host, size);
+            return BMRR_FAIL;
+        }
+    }
+
+    *region = (struct blkio_mem_region){
+        .addr = host,
+        .len = size,
+        .fd = fd,
+        .fd_offset = fd_offset,
+    };
+    return BMRR_OK;
+}
+
+static bool blkio_register_buf(BlockDriverState *bs, void *host, size_t size,
+                               Error **errp)
+{
+    BDRVBlkioState *s = bs->opaque;
+    struct blkio_mem_region region;
+    BlkioMemRegionResult region_result;
+    int ret;
+
+    region_result = blkio_mem_region_from_host(bs, host, size, &region, errp);
+    if (region_result == BMRR_SKIP) {
+        return true;
+    } else if (region_result != BMRR_OK) {
+        return false;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        ret = blkio_map_mem_region(s->blkio, &region);
+    }
+
+    if (ret < 0) {
+        error_setg(errp, "Failed to add blkio mem region %p with size %zu: %s",
+                   host, size, blkio_get_error_msg());
+        return false;
+    }
+    return true;
+}
+
+static void blkio_unregister_buf(BlockDriverState *bs, void *host, size_t size)
+{
+    BDRVBlkioState *s = bs->opaque;
+    struct blkio_mem_region region;
+
+    if (blkio_mem_region_from_host(bs, host, size, &region, NULL) != BMRR_OK) {
+        return;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->blkio_lock) {
+        blkio_unmap_mem_region(s->blkio, &region);
+    }
+}
+
 static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
                                Error **errp)
 {
@@ -610,6 +733,17 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
         return ret;
     }
 
+    ret = blkio_get_bool(s->blkio,
+                         "needs-mem-region-fd",
+                         &s->needs_mem_region_fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get needs-mem-region-fd: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
     ret = blkio_get_uint64(s->blkio,
                            "mem-region-alignment",
                            &s->mem_region_alignment);
@@ -621,15 +755,39 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
         return ret;
     }
 
+    ret = blkio_get_bool(s->blkio,
+                         "mem-regions-pinned",
+                         &s->mem_regions_pinned);
+    if (ret < 0) {
+        /* Be conservative (assume pinning) if the property is not supported */
+        s->mem_regions_pinned = true;
+    }
+
+    /*
+     * Notify if libblkio drivers pin memory and prevent features like
+     * virtio-mem from working.
+     */
+    if (s->mem_regions_pinned) {
+        ret = ram_block_discard_disable(true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "ram_block_discard_disable() failed");
+            blkio_destroy(&s->blkio);
+            return ret;
+        }
+    }
+
     ret = blkio_start(s->blkio);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "blkio_start failed: %s",
                          blkio_get_error_msg());
         blkio_destroy(&s->blkio);
+        if (s->mem_regions_pinned) {
+            ram_block_discard_disable(false);
+        }
         return ret;
     }
 
-    bs->supported_write_flags = BDRV_REQ_FUA;
+    bs->supported_write_flags = BDRV_REQ_FUA | BDRV_REQ_REGISTERED_BUF;
     bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
                                BDRV_REQ_NO_FALLBACK;
 
@@ -653,6 +811,10 @@ static void blkio_close(BlockDriverState *bs)
     qemu_mutex_destroy(&s->blkio_lock);
     blkio_detach_aio_context(bs);
     blkio_destroy(&s->blkio);
+
+    if (s->mem_regions_pinned) {
+        ram_block_discard_disable(false);
+    }
 }
 
 static int64_t blkio_getlength(BlockDriverState *bs)
@@ -799,6 +961,8 @@ static BlockDriver bdrv_io_uring = {
     .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
     .bdrv_io_unplug             = blkio_io_unplug,
     .bdrv_refresh_limits        = blkio_refresh_limits,
+    .bdrv_register_buf          = blkio_register_buf,
+    .bdrv_unregister_buf        = blkio_unregister_buf,
 };
 
 static BlockDriver bdrv_virtio_blk_vhost_user = {
@@ -818,6 +982,8 @@ static BlockDriver bdrv_virtio_blk_vhost_user = {
     .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
     .bdrv_io_unplug             = blkio_io_unplug,
     .bdrv_refresh_limits        = blkio_refresh_limits,
+    .bdrv_register_buf          = blkio_register_buf,
+    .bdrv_unregister_buf        = blkio_unregister_buf,
 };
 
 static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
@@ -837,6 +1003,8 @@ static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
     .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
     .bdrv_io_unplug             = blkio_io_unplug,
     .bdrv_refresh_limits        = blkio_refresh_limits,
+    .bdrv_register_buf          = blkio_register_buf,
+    .bdrv_unregister_buf        = blkio_unregister_buf,
 };
 
 static void bdrv_blkio_init(void)
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 12/12] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (10 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
@ 2022-09-27 19:34 ` Stefan Hajnoczi
  2022-10-06 12:18 ` [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefano Garzarella
  12 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-27 19:34 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Register guest RAM using BlockRAMRegistrar and set the
BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
accesses in I/O requests.

This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
on DMA mapping/unmapping.
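
The key change in submit_requests() amounts to the following (condensed
from the hunk below):

    BdrvRequestFlags flags = 0;

    if (blk_ram_registrar_ok(&s->blk_ram_registrar)) {
        flags |= BDRV_REQ_REGISTERED_BUF;
    }

    blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov, flags,
                    virtio_blk_rw_complete, mrb->reqs[start]);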

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/hw/virtio/virtio-blk.h |  2 ++
 hw/block/virtio-blk.c          | 39 ++++++++++++++++++++++------------
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index d311c57cca..7f589b4146 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -19,6 +19,7 @@
 #include "hw/block/block.h"
 #include "sysemu/iothread.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/block-ram-registrar.h"
 #include "qom/object.h"
 
 #define TYPE_VIRTIO_BLK "virtio-blk-device"
@@ -64,6 +65,7 @@ struct VirtIOBlock {
     struct VirtIOBlockDataPlane *dataplane;
     uint64_t host_features;
     size_t config_size;
+    BlockRAMRegistrar blk_ram_registrar;
 };
 
 typedef struct VirtIOBlockReq {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e9ba752f6b..907f012c45 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -21,6 +21,7 @@
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
 #include "sysemu/blockdev.h"
+#include "sysemu/block-ram-registrar.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/runstate.h"
 #include "hw/virtio/virtio-blk.h"
@@ -384,12 +385,14 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
     }
 }
 
-static inline void submit_requests(BlockBackend *blk, MultiReqBuffer *mrb,
+static inline void submit_requests(VirtIOBlock *s, MultiReqBuffer *mrb,
                                    int start, int num_reqs, int niov)
 {
+    BlockBackend *blk = s->blk;
     QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
     int64_t sector_num = mrb->reqs[start]->sector_num;
     bool is_write = mrb->is_write;
+    BdrvRequestFlags flags = 0;
 
     if (num_reqs > 1) {
         int i;
@@ -420,12 +423,18 @@ static inline void submit_requests(BlockBackend *blk, MultiReqBuffer *mrb,
                               num_reqs - 1);
     }
 
+    if (blk_ram_registrar_ok(&s->blk_ram_registrar)) {
+        flags |= BDRV_REQ_REGISTERED_BUF;
+    }
+
     if (is_write) {
-        blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov, 0,
-                        virtio_blk_rw_complete, mrb->reqs[start]);
+        blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov,
+                        flags, virtio_blk_rw_complete,
+                        mrb->reqs[start]);
     } else {
-        blk_aio_preadv(blk, sector_num << BDRV_SECTOR_BITS, qiov, 0,
-                       virtio_blk_rw_complete, mrb->reqs[start]);
+        blk_aio_preadv(blk, sector_num << BDRV_SECTOR_BITS, qiov,
+                       flags, virtio_blk_rw_complete,
+                       mrb->reqs[start]);
     }
 }
 
@@ -447,14 +456,14 @@ static int multireq_compare(const void *a, const void *b)
     }
 }
 
-static void virtio_blk_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
+static void virtio_blk_submit_multireq(VirtIOBlock *s, MultiReqBuffer *mrb)
 {
     int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0;
     uint32_t max_transfer;
     int64_t sector_num = 0;
 
     if (mrb->num_reqs == 1) {
-        submit_requests(blk, mrb, 0, 1, -1);
+        submit_requests(s, mrb, 0, 1, -1);
         mrb->num_reqs = 0;
         return;
     }
@@ -474,11 +483,11 @@ static void virtio_blk_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
              * 3. merge would exceed maximum transfer length of backend device
              */
             if (sector_num + nb_sectors != req->sector_num ||
-                niov > blk_get_max_iov(blk) - req->qiov.niov ||
+                niov > blk_get_max_iov(s->blk) - req->qiov.niov ||
                 req->qiov.size > max_transfer ||
                 nb_sectors > (max_transfer -
                               req->qiov.size) / BDRV_SECTOR_SIZE) {
-                submit_requests(blk, mrb, start, num_reqs, niov);
+                submit_requests(s, mrb, start, num_reqs, niov);
                 num_reqs = 0;
             }
         }
@@ -494,7 +503,7 @@ static void virtio_blk_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
         num_reqs++;
     }
 
-    submit_requests(blk, mrb, start, num_reqs, niov);
+    submit_requests(s, mrb, start, num_reqs, niov);
     mrb->num_reqs = 0;
 }
 
@@ -509,7 +518,7 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
      * Make sure all outstanding writes are posted to the backing device.
      */
     if (mrb->is_write && mrb->num_reqs > 0) {
-        virtio_blk_submit_multireq(s->blk, mrb);
+        virtio_blk_submit_multireq(s, mrb);
     }
     blk_aio_flush(s->blk, virtio_blk_flush_complete, req);
 }
@@ -689,7 +698,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
         if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
                                   is_write != mrb->is_write ||
                                   !s->conf.request_merging)) {
-            virtio_blk_submit_multireq(s->blk, mrb);
+            virtio_blk_submit_multireq(s, mrb);
         }
 
         assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS);
@@ -796,7 +805,7 @@ void virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq)
     } while (!virtio_queue_empty(vq));
 
     if (mrb.num_reqs) {
-        virtio_blk_submit_multireq(s->blk, &mrb);
+        virtio_blk_submit_multireq(s, &mrb);
     }
 
     blk_io_unplug(s->blk);
@@ -845,7 +854,7 @@ void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh)
     }
 
     if (mrb.num_reqs) {
-        virtio_blk_submit_multireq(s->blk, &mrb);
+        virtio_blk_submit_multireq(s, &mrb);
     }
     if (is_bh) {
         blk_dec_in_flight(s->conf.conf.blk);
@@ -1227,6 +1236,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     s->change = qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
+    blk_ram_registrar_init(&s->blk_ram_registrar, s->blk);
     blk_set_dev_ops(s->blk, &virtio_block_ops, s);
 
     blk_iostatus_enable(s->blk);
@@ -1252,6 +1262,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev)
         virtio_del_queue(vdev, i);
     }
     qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
+    blk_ram_registrar_destroy(&s->blk_ram_registrar);
     qemu_del_vm_change_state_handler(s->change);
     blockdev_mark_auto_del(s->blk);
     virtio_cleanup(vdev);
-- 
2.37.3



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 02/12] blkio: add libblkio block driver
  2022-09-27 19:34 ` [PATCH v5 02/12] blkio: add libblkio block driver Stefan Hajnoczi
@ 2022-09-28  5:27   ` Markus Armbruster
  2022-09-28 20:10     ` Stefan Hajnoczi
  2022-10-06 16:41   ` Alberto Faria
  1 sibling, 1 reply; 25+ messages in thread
From: Markus Armbruster @ 2022-09-28  5:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

Stefan Hajnoczi <stefanha@redhat.com> writes:

> libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> high-performance disk I/O. It currently supports io_uring,
> virtio-blk-vhost-user, and virtio-blk-vhost-vdpa with additional drivers
> under development.
>
> One of the reasons for developing libblkio is that other applications
> besides QEMU can use it. This will be particularly useful for
> virtio-blk-vhost-user which applications may wish to use for connecting
> to qemu-storage-daemon.
>
> libblkio also gives us an opportunity to develop in Rust behind a C API
> that is easy to consume from QEMU.
>
> This commit adds io_uring, virtio-blk-vhost-user, and
> virtio-blk-vhost-vdpa BlockDrivers to QEMU using libblkio. It will be
> easy to add other libblkio drivers since they will share the majority of
> code.
>
> For now I/O buffers are copied through bounce buffers if the libblkio
> driver requires it. Later commits add an optimization for
> pre-registering guest RAM to avoid bounce buffers.
>
> The syntax is:
>
>   --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
>
> and:
>
>   --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> Acked-by: Markus Armbruster <armbru@redhat.com>

[...]

> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index f21fa235f2..5aed0dd436 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2951,11 +2951,16 @@
>              'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
>              {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
>              {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
> -            'http', 'https', 'iscsi',
> +            'http', 'https',
> +            { 'name': 'io_uring', 'if': 'CONFIG_BLKIO' },
> +            'iscsi',
>              'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>              'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>              { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> +            'ssh', 'throttle', 'vdi', 'vhdx',
> +            { 'name': 'virtio-blk-vhost-user', 'if': 'CONFIG_BLKIO' },
> +            { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
> +            'vmdk', 'vpc', 'vvfat' ] }
>  
>  ##
>  # @BlockdevOptionsFile:
> @@ -3678,6 +3683,42 @@
>              '*debug': 'int',
>              '*logfile': 'str' } }
>  
> +##
> +# @BlockdevOptionsIoUring:
> +#
> +# Driver specific block device options for the io_uring backend.
> +#
> +# @filename: path to the image file
> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'BlockdevOptionsIoUring',
> +  'data': { 'filename': 'str' } }
> +
> +##
> +# @BlockdevOptionsVirtioBlkVhostUser:
> +#
> +# Driver specific block device options for the virtio-blk-vhost-user backend.
> +#
> +# @path: path to the vhost-user UNIX domain socket.
> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'BlockdevOptionsVirtioBlkVhostUser',
> +  'data': { 'path': 'str' } }
> +
> +##
> +# @BlockdevOptionsVirtioBlkVhostVdpa:
> +#
> +# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
> +#
> +# @path: path to the vhost-vdpa character device.
> +#
> +# Since: 7.2
> +##
> +{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
> +  'data': { 'path': 'str' } }
> +

Should these be 'if': 'CONFIG_BLKIO'?
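
(For illustration only, a sketch of one of the definitions with QAPI's
usual per-definition 'if' syntax; this is not taken from the patch:

  { 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
    'data': { 'path': 'str' },
    'if': 'CONFIG_BLKIO' }

and likewise for BlockdevOptionsIoUring and
BlockdevOptionsVirtioBlkVhostUser.)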

>  ##
>  # @IscsiTransport:
>  #
> @@ -4305,6 +4346,8 @@
>                         'if': 'HAVE_HOST_BLOCK_DEVICE' },
>        'http':       'BlockdevOptionsCurlHttp',
>        'https':      'BlockdevOptionsCurlHttps',
> +      'io_uring':   { 'type': 'BlockdevOptionsIoUring',
> +                      'if': 'CONFIG_BLKIO' },
>        'iscsi':      'BlockdevOptionsIscsi',
>        'luks':       'BlockdevOptionsLUKS',
>        'nbd':        'BlockdevOptionsNbd',
> @@ -4327,6 +4370,12 @@
>        'throttle':   'BlockdevOptionsThrottle',
>        'vdi':        'BlockdevOptionsGenericFormat',
>        'vhdx':       'BlockdevOptionsGenericFormat',
> +      'virtio-blk-vhost-user':
> +                    { 'type': 'BlockdevOptionsVirtioBlkVhostUser',
> +                      'if': 'CONFIG_BLKIO' },
> +      'virtio-blk-vhost-vdpa':
> +                    { 'type': 'BlockdevOptionsVirtioBlkVhostVdpa',
> +                      'if': 'CONFIG_BLKIO' },
>        'vmdk':       'BlockdevOptionsGenericCOWFormat',
>        'vpc':        'BlockdevOptionsGenericFormat',
>        'vvfat':      'BlockdevOptionsVVFAT'

[...]



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-09-27 19:34 ` [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
@ 2022-09-28 19:21   ` Stefan Hajnoczi
  2022-09-28 20:12     ` Alberto Campinho Faria
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-28 19:21 UTC (permalink / raw)
  To: qemu-devel
  Cc: Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Tue, Sep 27, 2022 at 03:34:30PM -0400, Stefan Hajnoczi wrote:
> +    ret = blkio_get_bool(s->blkio,
> +                         "mem-regions-pinned",
> +                         &s->mem_regions_pinned);
> +    if (ret < 0) {
> +        /* Be conservative (assume pinning) if the property is not supported */
> +        s->mem_regions_pinned = true;

This is too conservative :). It can be changed to:

  s->mem_regions_pinned = s->needs_mem_regions;

That way we avoid ram_block_discard_disable() for libblkio drivers (like
io_uring in libblkio 1.0) that don't use memory regions and don't
support the "mem-regions-pinned" property yet.

Stefan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 02/12] blkio: add libblkio block driver
  2022-09-28  5:27   ` Markus Armbruster
@ 2022-09-28 20:10     ` Stefan Hajnoczi
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-09-28 20:10 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, John Snow, Denis V. Lunev,
	Xie Changlong, Eric Blake, integration, David Hildenbrand,
	Wen Congyang, Laurent Vivier, Richard W.M. Jones, afaria,
	Fam Zheng, Thomas Huth, Hanna Reitz, Eduardo Habkost, Peter Xu,
	Raphael Norwitz, Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Wed, Sep 28, 2022 at 07:27:34AM +0200, Markus Armbruster wrote:
> > +##
> > +# @BlockdevOptionsVirtioBlkVhostVdpa:
> > +#
> > +# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
> > +#
> > +# @path: path to the vhost-vdpa character device.
> > +#
> > +# Since: 7.2
> > +##
> > +{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
> > +  'data': { 'path': 'str' } }
> > +
> 
> Should these be 'if': 'CONFIG_BLKIO'?

Will fix in the next revision.

Stefan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-09-28 19:21   ` Stefan Hajnoczi
@ 2022-09-28 20:12     ` Alberto Campinho Faria
  2022-10-06 18:00       ` Stefan Hajnoczi
  0 siblings, 1 reply; 25+ messages in thread
From: Alberto Campinho Faria @ 2022-09-28 20:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Wed, Sep 28, 2022 at 8:21 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Tue, Sep 27, 2022 at 03:34:30PM -0400, Stefan Hajnoczi wrote:
> > +    ret = blkio_get_bool(s->blkio,
> > +                         "mem-regions-pinned",
> > +                         &s->mem_regions_pinned);
> > +    if (ret < 0) {
> > +        /* Be conservative (assume pinning) if the property is not supported */
> > +        s->mem_regions_pinned = true;
>
> This is too conservative :). It can be changed to:
>
>   s->mem_regions_pinned = s->needs_mem_regions;
>
> That way we avoid ram_block_discard_disable() for libblkio drivers (like
> io_uring in libblkio 1.0) that don't use memory regions and don't
> support the "mem-regions-pinned" property yet.

Even if a driver doesn't _need_ memory regions to be mapped before
use, it may still do something special with the ones that _are_
mapped, so we may have no choice but to set s->mem_regions_pinned =
true.

(Unless we are assuming that all future libblkio versions will either
not have such drivers, or will provide a "mem-regions-pinned"
property, but that feels brittle.)

Alberto



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 00/12] blkio: add libblkio BlockDriver
  2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (11 preceding siblings ...)
  2022-09-27 19:34 ` [PATCH v5 12/12] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
@ 2022-10-06 12:18 ` Stefano Garzarella
  2022-10-06 17:32   ` Stefan Hajnoczi
  12 siblings, 1 reply; 25+ messages in thread
From: Stefano Garzarella @ 2022-10-06 12:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Stefan Hajnoczi,
	Marcel Apfelbaum, Vladimir Sementsov-Ogievskiy,
	Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin

Hi Stefan,
I tested this series with the vDPA block simulator in Linux v6.0.
It worked well, but I had a segfault when blkio_map_mem_region() fails.

In my case, it failed because I forgot to increase the `memlock` limit 
with `ulimit -l unlimited` since the simulator still requires locking 
all the VM memory.

QEMU failed with this messages:
$ qemu-system-x86_64 ... \
  -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on
qemu-system-x86_64: -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1,iothread=iothread0: Failed to add blkio mem region 0x7f60bfe00000 with size 536870912: Bad address (os error 14)
qemu-system-x86_64: -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1,iothread=iothread0: Failed to add blkio mem region 0x7f60be400000 with size 16777216: Bad address (os error 14)
[1]    20803 segmentation fault

IIUC this could be related to a double call to
ram_block_notifier_remove() in the error path of the ram_block_added()
callback (block/block-ram-registrar.c).

Maybe we should call blk_register_buf() only if r->ok is true.
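
A rough sketch of that idea in ram_block_added()
(block/block-ram-registrar.c); field and helper names are assumed from
the patch series and this is untested:

    static void ram_block_added(RAMBlockNotifier *n, void *host,
                                size_t size, size_t max_size)
    {
        BlockRAMRegistrar *r = container_of(n, BlockRAMRegistrar, notifier);
        Error *err = NULL;

        if (!r->ok) {
            return; /* a previous registration already failed */
        }

        if (!blk_register_buf(r->blk, host, max_size, &err)) {
            error_report_err(err);
            r->ok = false; /* fall back to bounce buffers, keep the notifier */
        }
    }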

The stack trace is the following:
#0  0x00005641a8b097dd in ram_block_notifier_remove (n=n@entry=0x5641ab8354d8) at ../hw/core/numa.c:850
#1  0x00005641a89c4701 in ram_block_added
     (n=0x5641ab8354d8, host=<optimized out>, size=<optimized out>, max_size=<optimized out>)
    at ../block/block-ram-registrar.c:20
#2  0x00005641a8b083af in ram_block_notify_add_single (rb=0x5641ab3d2810, opaque=0x5641ab8354d8)
    at ../hw/core/numa.c:820
#3  0x00005641a8b92d8d in qemu_ram_foreach_block
    (func=func@entry=0x5641a8b08370 <ram_block_notify_add_single>, opaque=opaque@entry=0x5641ab8354d8)
    at ../softmmu/physmem.c:3571
#4  0x00005641a8b097af in ram_block_notifier_add (n=n@entry=0x5641ab8354d8) at ../hw/core/numa.c:844
#5  0x00005641a89c474f in blk_ram_registrar_init (r=r@entry=0x5641ab8354d0, blk=<optimized out>)
    at ../block/block-ram-registrar.c:46
#6  0x00005641a8affe88 in virtio_blk_device_realize (dev=0x5641ab835230, errp=0x7ffc72657190)
    at ../hw/block/virtio-blk.c:1239
#7  0x00005641a8b4b716 in virtio_device_realize (dev=0x5641ab835230, errp=0x7ffc726571f0)
    at ../hw/virtio/virtio.c:3684
#8  0x00005641a8c2049b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7ffc72657380)
    at ../hw/core/qdev.c:553
#9  0x00005641a8c24738 in property_set_bool
    (obj=0x5641ab835230, v=<optimized out>, name=<optimized out>, opaque=0x5641aa518820, errp=0x7ffc72657380)
    at ../qom/object.c:2273
#10 0x00005641a8c27683 in object_property_set
   
     (obj=obj@entry=0x5641ab835230, name=name@entry=0x5641a8e9719a "realized", v=v@entry=0x5641ab83f5b0, errp=errp@entry=0x7ffc72657380) at ../qom/object.c:1408
#11 0x00005641a8c2a97f in object_property_set_qobject
    (obj=obj@entry=0x5641ab835230, name=name@entry=0x5641a8e9719a "realized", value=value@entry=0x5641ab83f4f0, errp=errp@entry=0x7ffc72657380) at ../qom/qom-qobject.c:28
#12 0x00005641a8c27c84 in object_property_set_bool
    (obj=0x5641ab835230, name=0x5641a8e9719a "realized", value=<optimized out>, errp=0x7ffc72657380)
    at ../qom/object.c:1477
#13 0x00005641a8929b74 in pci_qdev_realize (qdev=<optimized out>, errp=<optimized out>) at ../hw/pci/pci.c:2218
#14 0x00005641a8c2049b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7ffc726575b0)
    at ../hw/core/qdev.c:553
#15 0x00005641a8c24738 in property_set_bool
    (obj=0x5641ab82ce70, v=<optimized out>, name=<optimized out>, opaque=0x5641aa518820, errp=0x7ffc726575b0)
    at ../qom/object.c:2273
#16 0x00005641a8c27683 in object_property_set

Thanks,
Stefano

On Tue, Sep 27, 2022 at 9:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> v5:
> - Drop "RFC" since libblkio 1.0 has been released and the library API is stable
> - Disable BDRV_REQ_REGISTERED_BUF if we run out of blkio_mem_regions. The
>   bounce buffer slow path is taken when there are not enough blkio_mem_regions
>   to cover guest RAM. [Hanna & David Hildenbrand]
> - Call ram_block_discard_disable() when mem-region-pinned property is true or
>   absent [David Hildenbrand]
> - Use a bounce buffer pool instead of allocating/freeing a buffer for each
>   request. This reduces the number of blkio_mem_regions required for bounce
>   buffers to 1 and avoids frequent blkio_mem_region_map/unmap() calls.
> - Switch to .bdrv_co_*() instead of .bdrv_aio_*(). Needed for the bounce buffer
>   pool's CoQueue.
> v4:
> - Patch 1:
>   - Add virtio-blk-vhost-user driver [Kevin]
>   - Drop .bdrv_parse_filename() and .bdrv_needs_filename for virtio-blk-vhost-vdpa [Stefano]
>   - Add copyright and license header [Hanna]
>   - Drop .bdrv_parse_filename() in favor of --blockdev or json: [Hanna]
>   - Clarify that "filename" is always non-NULL for io_uring [Hanna]
>   - Check that virtio-blk-vhost-vdpa "path" option is non-NULL [Hanna]
>   - Fix virtio-blk-vhost-vdpa cache.direct=off logic [Hanna]
>   - Use macros for driver names [Hanna]
>   - Assert that the driver name is valid [Hanna]
>   - Update "readonly" property name to "read-only" [Hanna]
>   - Call blkio_detach_aio_context() in blkio_close() [Hanna]
>   - Avoid uint32_t * to int * casts in blkio_refresh_limits() [Hanna]
>   - Remove write zeroes and discard from the todo list [Hanna]
>   - Use PRIu32 instead of %d for uint32_t [Hanna]
>   - Fix error messages with buf-alignment instead of optimal-io-size [Hanna]
>   - Call map/unmap APIs since libblkio alloc/free APIs no longer do that
>   - Update QAPI schema QEMU version to 7.2
> - Patch 5:
>   - Expand the BDRV_REQ_REGISTERED_BUF flag passthrough and drop assert(!flags)
>     in drivers [Hanna]
> - Patch 7:
>   - Fix BLK->BDRV typo [Hanna]
>   - Make BlockRAMRegistrar handle failure [Hanna]
> - Patch 8:
>   - Replace memory_region_get_fd() approach with qemu_ram_get_fd()
> - Patch 10:
>   - Use (void)ret; to discard unused return value [Hanna]
>   - libblkio's blkio_unmap_mem_region() API no longer has a return value
>   - Check for registered bufs that cross RAMBlocks [Hanna]
> - Patch 11:
>   - Handle bdrv_register_buf() errors [Hanna]
> v3:
> - Add virtio-blk-vhost-vdpa for vdpa-blk devices including VDUSE
> - Add discard and write zeroes support
> - Rebase and adopt latest libblkio APIs
> v2:
> - Add BDRV_REQ_REGISTERED_BUF to bs.supported_write_flags [Stefano]
> - Use new blkioq_get_num_completions() API
> - Implement .bdrv_refresh_limits()
>
> This patch series adds a QEMU BlockDriver for libblkio
> (https://gitlab.com/libblkio/libblkio/), a library for high-performance block
> device I/O. This work was presented at KVM Forum 2022 and slides are available
> here:
> https://static.sched.com/hosted_files/kvmforum2022/8c/libblkio-kvm-forum-2022.pdf
>
> The second patch adds the core BlockDriver and most of the libblkio API usage.
> Three libblkio drivers are included:
> - io_uring
> - virtio-blk-vhost-user
> - virtio-blk-vhost-vdpa
>
> The remainder of the patch series reworks the existing QEMU bdrv_register_buf()
> API so virtio-blk emulation can efficiently map guest RAM for libblkio - some
> libblkio drivers require that I/O buffer memory is pre-registered (think VFIO,
> vhost, etc).
>
> Vladimir requested performance results that show the effect of the
> BDRV_REQ_REGISTERED_BUF flag. I ran the patches against qemu-storage-daemon's
> vhost-user-blk export with iodepth=1 bs=512 to see the per-request overhead due
> to bounce buffer allocation/mapping:
>
> Name                                   IOPS   Error
> bounce-buf                          4373.81 ± 0.01%
> registered-buf                     13062.80 ± 0.67%
>
> The BDRV_REQ_REGISTERED_BUF optimization version is about 3x faster.
>
> See the BlockDriver struct in block/blkio.c for a list of APIs that still need
> to be implemented. The core functionality is covered.
>
> Regarding the design: each libblkio driver is a separately named BlockDriver.
> That means there is an "io_uring" BlockDriver and not a generic "libblkio"
> BlockDriver. This way QAPI and open parameters are type-safe and mandatory
> parameters can be checked by QEMU.
>
> Stefan Hajnoczi (12):
>   coroutine: add flag to re-queue at front of CoQueue
>   blkio: add libblkio block driver
>   numa: call ->ram_block_removed() in ram_block_notifer_remove()
>   block: pass size to bdrv_unregister_buf()
>   block: use BdrvRequestFlags type for supported flag fields
>   block: add BDRV_REQ_REGISTERED_BUF request flag
>   block: return errors from bdrv_register_buf()
>   block: add BlockRAMRegistrar
>   exec/cpu-common: add qemu_ram_get_fd()
>   stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd()
>   blkio: implement BDRV_REQ_REGISTERED_BUF optimization
>   virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
>
>  MAINTAINERS                                 |    7 +
>  meson_options.txt                           |    2 +
>  qapi/block-core.json                        |   53 +-
>  meson.build                                 |    9 +
>  include/block/block-common.h                |    9 +
>  include/block/block-global-state.h          |   10 +-
>  include/block/block_int-common.h            |   15 +-
>  include/exec/cpu-common.h                   |    1 +
>  include/hw/virtio/virtio-blk.h              |    2 +
>  include/qemu/coroutine.h                    |   15 +-
>  include/sysemu/block-backend-global-state.h |    4 +-
>  include/sysemu/block-ram-registrar.h        |   37 +
>  block.c                                     |   14 +
>  block/blkio.c                               | 1017 +++++++++++++++++++
>  block/blkverify.c                           |    4 +-
>  block/block-backend.c                       |    8 +-
>  block/block-ram-registrar.c                 |   54 +
>  block/crypto.c                              |    4 +-
>  block/file-posix.c                          |    1 -
>  block/gluster.c                             |    1 -
>  block/io.c                                  |  101 +-
>  block/mirror.c                              |    2 +
>  block/nbd.c                                 |    1 -
>  block/nvme.c                                |   20 +-
>  block/parallels.c                           |    1 -
>  block/qcow.c                                |    2 -
>  block/qed.c                                 |    1 -
>  block/raw-format.c                          |    2 +
>  block/replication.c                         |    1 -
>  block/ssh.c                                 |    1 -
>  block/vhdx.c                                |    1 -
>  hw/block/virtio-blk.c                       |   39 +-
>  hw/core/numa.c                              |   17 +
>  qemu-img.c                                  |    6 +-
>  softmmu/physmem.c                           |    5 +
>  stubs/physmem.c                             |   13 +
>  tests/qtest/modules-test.c                  |    3 +
>  util/qemu-coroutine-lock.c                  |    9 +-
>  util/vfio-helpers.c                         |    5 +-
>  block/meson.build                           |    2 +
>  scripts/meson-buildoptions.sh               |    3 +
>  stubs/meson.build                           |    1 +
>  42 files changed, 1412 insertions(+), 91 deletions(-)
>  create mode 100644 include/sysemu/block-ram-registrar.h
>  create mode 100644 block/blkio.c
>  create mode 100644 block/block-ram-registrar.c
>  create mode 100644 stubs/physmem.c
>
> --
> 2.37.3
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 02/12] blkio: add libblkio block driver
  2022-09-27 19:34 ` [PATCH v5 02/12] blkio: add libblkio block driver Stefan Hajnoczi
  2022-09-28  5:27   ` Markus Armbruster
@ 2022-10-06 16:41   ` Alberto Faria
  2022-10-06 18:56     ` Stefan Hajnoczi
  1 sibling, 1 reply; 25+ messages in thread
From: Alberto Faria @ 2022-10-06 16:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Tue, Sep 27, 2022 at 8:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> +static int blkio_virtio_blk_vhost_user_open(BlockDriverState *bs,
> +        QDict *options, int flags, Error **errp)
> +{
> +    const char *path = qdict_get_try_str(options, "path");
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +
> +    if (!path) {
> +        error_setg(errp, "missing 'path' option");
> +        return -EINVAL;
> +    }
> +
> +    ret = blkio_set_str(s->blkio, "path", path);
> +    qdict_del(options, "path");
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to set path: %s",
> +                         blkio_get_error_msg());
> +        return ret;
> +    }
> +
> +    if (!(flags & BDRV_O_NOCACHE)) {
> +        error_setg(errp, "cache.direct=off is not supported");
> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
> +        QDict *options, int flags, Error **errp)
> +{
> +    const char *path = qdict_get_try_str(options, "path");
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +
> +    if (!path) {
> +        error_setg(errp, "missing 'path' option");
> +        return -EINVAL;
> +    }
> +
> +    ret = blkio_set_str(s->blkio, "path", path);
> +    qdict_del(options, "path");
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to set path: %s",
> +                         blkio_get_error_msg());
> +        return ret;
> +    }
> +
> +    if (!(flags & BDRV_O_NOCACHE)) {
> +        error_setg(errp, "cache.direct=off is not supported");
> +        return -EINVAL;
> +    }
> +    return 0;
> +}

These two functions are (currently) exact duplicates. Should we just
have a single blkio_virtio_blk_open() function?
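
Something like the following, which is just the shared body of the two
functions quoted above under a single (suggested) name:

    static int blkio_virtio_blk_common_open(BlockDriverState *bs,
            QDict *options, int flags, Error **errp)
    {
        const char *path = qdict_get_try_str(options, "path");
        BDRVBlkioState *s = bs->opaque;
        int ret;

        if (!path) {
            error_setg(errp, "missing 'path' option");
            return -EINVAL;
        }

        ret = blkio_set_str(s->blkio, "path", path);
        qdict_del(options, "path");
        if (ret < 0) {
            error_setg_errno(errp, -ret, "failed to set path: %s",
                             blkio_get_error_msg());
            return ret;
        }

        if (!(flags & BDRV_O_NOCACHE)) {
            error_setg(errp, "cache.direct=off is not supported");
            return -EINVAL;
        }
        return 0;
    }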

> +static BlockDriver bdrv_io_uring = {
> +    .format_name                = DRIVER_IO_URING,
> +    .protocol_name              = DRIVER_IO_URING,
> +    .instance_size              = sizeof(BDRVBlkioState),
> +    .bdrv_needs_filename        = true,
> +    .bdrv_file_open             = blkio_file_open,
> +    .bdrv_close                 = blkio_close,
> +    .bdrv_getlength             = blkio_getlength,
> +    .bdrv_get_info              = blkio_get_info,
> +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> +    .bdrv_co_preadv             = blkio_co_preadv,
> +    .bdrv_co_pwritev            = blkio_co_pwritev,
> +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> +    .bdrv_io_unplug             = blkio_io_unplug,
> +    .bdrv_refresh_limits        = blkio_refresh_limits,
> +};
> +
> +static BlockDriver bdrv_virtio_blk_vhost_user = {
> +    .format_name                = DRIVER_VIRTIO_BLK_VHOST_USER,
> +    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_USER,
> +    .instance_size              = sizeof(BDRVBlkioState),
> +    .bdrv_file_open             = blkio_file_open,
> +    .bdrv_close                 = blkio_close,
> +    .bdrv_getlength             = blkio_getlength,
> +    .bdrv_get_info              = blkio_get_info,
> +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> +    .bdrv_co_preadv             = blkio_co_preadv,
> +    .bdrv_co_pwritev            = blkio_co_pwritev,
> +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> +    .bdrv_io_unplug             = blkio_io_unplug,
> +    .bdrv_refresh_limits        = blkio_refresh_limits,
> +};
> +
> +static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
> +    .format_name                = DRIVER_VIRTIO_BLK_VHOST_VDPA,
> +    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_VDPA,
> +    .instance_size              = sizeof(BDRVBlkioState),
> +    .bdrv_file_open             = blkio_file_open,
> +    .bdrv_close                 = blkio_close,
> +    .bdrv_getlength             = blkio_getlength,
> +    .bdrv_get_info              = blkio_get_info,
> +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> +    .bdrv_co_preadv             = blkio_co_preadv,
> +    .bdrv_co_pwritev            = blkio_co_pwritev,
> +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> +    .bdrv_io_unplug             = blkio_io_unplug,
> +    .bdrv_refresh_limits        = blkio_refresh_limits,
> +};

It's difficult to identify the fields that differ between
BlockDrivers. Maybe we could do something like:

    #define DRIVER(name, ...) \
        { \
            .format_name             = name, \
            .protocol_name           = name, \
            .instance_size           = sizeof(BDRVBlkioState), \
            .bdrv_file_open          = blkio_file_open, \
            .bdrv_close              = blkio_close, \
            .bdrv_getlength          = blkio_getlength, \
            .bdrv_get_info           = blkio_get_info, \
            .bdrv_attach_aio_context = blkio_attach_aio_context, \
            .bdrv_detach_aio_context = blkio_detach_aio_context, \
            .bdrv_co_pdiscard        = blkio_co_pdiscard, \
            .bdrv_co_preadv          = blkio_co_preadv, \
            .bdrv_co_pwritev         = blkio_co_pwritev, \
            .bdrv_co_flush_to_disk   = blkio_co_flush, \
            .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
            .bdrv_io_unplug          = blkio_io_unplug, \
            .bdrv_refresh_limits     = blkio_refresh_limits, \
            __VA_ARGS__ \
        }

    static BlockDriver bdrv_io_uring = DRIVER(
        DRIVER_IO_URING,
        .bdrv_needs_filename = true,
    );

    static BlockDriver bdrv_virtio_blk_vhost_user = DRIVER(
        DRIVER_VIRTIO_BLK_VHOST_USER
    );

    static BlockDriver bdrv_virtio_blk_vhost_vdpa = DRIVER(
        DRIVER_VIRTIO_BLK_VHOST_VDPA
    );

Alberto



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 00/12] blkio: add libblkio BlockDriver
  2022-10-06 12:18 ` [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefano Garzarella
@ 2022-10-06 17:32   ` Stefan Hajnoczi
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 17:32 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, afaria, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin

On Thu, Oct 06, 2022 at 02:18:35PM +0200, Stefano Garzarella wrote:
> Hi Stefan,
> I tested this series with the vDPA block simulator in Linux v6.0.
> It worked well, but I had a segfault when blkio_map_mem_region() fails.
> 
> In my case, it failed because I forgot to increase the `memlock` limit 
> with `ulimit -l unlimited` since the simulator still requires locking 
> all the VM memory.
> 
> QEMU failed with this messages:
> $ qemu-system-x86_64 ... \
>   -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on
> qemu-system-x86_64: -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1,iothread=iothread0: Failed to add blkio mem region 0x7f60bfe00000 with size 536870912: Bad address (os error 14)
> qemu-system-x86_64: -device virtio-blk-pci,id=src1,bootindex=2,drive=drive_src1,iothread=iothread0: Failed to add blkio mem region 0x7f60be400000 with size 16777216: Bad address (os error 14)
> [1]    20803 segmentation fault
> 
> IIUC this could be related to a double call to
> ram_block_notifier_remove() in the error path of the ram_block_added()
> callback (block/block-ram-registrar.c).
> 
> Maybe we should call blk_register_buf() only if r->ok is true.

Thanks for pointing this out!

The notifiers don't use QLIST_FOREACH_SAFE() so calling
ram_block_notifier_remove() from within .ram_block_added() is incorrect.
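
To illustrate the hazard in general terms (illustrative code, not the
actual numa.c loop):

    /*
     * Plain QLIST_FOREACH dereferences the current element's 'next'
     * link after the callback returns, so a callback that unlinks
     * itself (e.g. via ram_block_notifier_remove()) leaves the
     * iterator pointing at unlinked state.
     */
    QLIST_FOREACH(notifier, &notifiers, next) {
        notifier->ram_block_added(notifier, host, size, max_size);
    }

    /* Tolerating removal of the current element needs the _SAFE
     * variant, which caches the next pointer up front: */
    QLIST_FOREACH_SAFE(notifier, &notifiers, next, next_notifier) {
        notifier->ram_block_added(notifier, host, size, max_size);
    }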

Stefan

> 
> The stack trace is the following:
> #0  0x00005641a8b097dd in ram_block_notifier_remove (n=n@entry=0x5641ab8354d8) at ../hw/core/numa.c:850
> #1  0x00005641a89c4701 in ram_block_added
>      (n=0x5641ab8354d8, host=<optimized out>, size=<optimized out>, max_size=<optimized out>)
>     at ../block/block-ram-registrar.c:20
> #2  0x00005641a8b083af in ram_block_notify_add_single (rb=0x5641ab3d2810, opaque=0x5641ab8354d8)
>     at ../hw/core/numa.c:820
> #3  0x00005641a8b92d8d in qemu_ram_foreach_block
>     (func=func@entry=0x5641a8b08370 <ram_block_notify_add_single>, opaque=opaque@entry=0x5641ab8354d8)
>     at ../softmmu/physmem.c:3571
> #4  0x00005641a8b097af in ram_block_notifier_add (n=n@entry=0x5641ab8354d8) at ../hw/core/numa.c:844
> #5  0x00005641a89c474f in blk_ram_registrar_init (r=r@entry=0x5641ab8354d0, blk=<optimized out>)
>     at ../block/block-ram-registrar.c:46
> #6  0x00005641a8affe88 in virtio_blk_device_realize (dev=0x5641ab835230, errp=0x7ffc72657190)
>     at ../hw/block/virtio-blk.c:1239
> #7  0x00005641a8b4b716 in virtio_device_realize (dev=0x5641ab835230, errp=0x7ffc726571f0)
>     at ../hw/virtio/virtio.c:3684
> #8  0x00005641a8c2049b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7ffc72657380)
>     at ../hw/core/qdev.c:553
> #9  0x00005641a8c24738 in property_set_bool
>     (obj=0x5641ab835230, v=<optimized out>, name=<optimized out>, opaque=0x5641aa518820, errp=0x7ffc72657380)
>     at ../qom/object.c:2273
> #10 0x00005641a8c27683 in object_property_set
>    
>      (obj=obj@entry=0x5641ab835230, name=name@entry=0x5641a8e9719a "realized", v=v@entry=0x5641ab83f5b0, errp=errp@entry=0x7ffc72657380) at ../qom/object.c:1408
> #11 0x00005641a8c2a97f in object_property_set_qobject
>     (obj=obj@entry=0x5641ab835230, name=name@entry=0x5641a8e9719a "realized", value=value@entry=0x5641ab83f4f0, errp=errp@entry=0x7ffc72657380) at ../qom/qom-qobject.c:28
> #12 0x00005641a8c27c84 in object_property_set_bool
>     (obj=0x5641ab835230, name=0x5641a8e9719a "realized", value=<optimized out>, errp=0x7ffc72657380)
>     at ../qom/object.c:1477
> #13 0x00005641a8929b74 in pci_qdev_realize (qdev=<optimized out>, errp=<optimized out>) at ../hw/pci/pci.c:2218
> #14 0x00005641a8c2049b in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7ffc726575b0)
>     at ../hw/core/qdev.c:553
> #15 0x00005641a8c24738 in property_set_bool
>     (obj=0x5641ab82ce70, v=<optimized out>, name=<optimized out>, opaque=0x5641aa518820, errp=0x7ffc726575b0)
>     at ../qom/object.c:2273
> #16 0x00005641a8c27683 in object_property_set
> 
> Thanks,
> Stefano
> 
> On Tue, Sep 27, 2022 at 9:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > v5:
> > - Drop "RFC" since libblkio 1.0 has been released and the library API is stable
> > - Disable BDRV_REQ_REGISTERED_BUF if we run out of blkio_mem_regions. The
> >   bounce buffer slow path is taken when there are not enough blkio_mem_regions
> >   to cover guest RAM. [Hanna & David Hildenbrand]
> > - Call ram_block_discard_disable() when mem-region-pinned property is true or
> >   absent [David Hildenbrand]
> > - Use a bounce buffer pool instead of allocating/freeing a buffer for each
> >   request. This reduces the number of blkio_mem_regions required for bounce
> >   buffers to 1 and avoids frequent blkio_mem_region_map/unmap() calls.
> > - Switch to .bdrv_co_*() instead of .bdrv_aio_*(). Needed for the bounce buffer
> >   pool's CoQueue.
> > v4:
> > - Patch 1:
> >   - Add virtio-blk-vhost-user driver [Kevin]
> >   - Drop .bdrv_parse_filename() and .bdrv_needs_filename for virtio-blk-vhost-vdpa [Stefano]
> >   - Add copyright and license header [Hanna]
> >   - Drop .bdrv_parse_filename() in favor of --blockdev or json: [Hanna]
> >   - Clarify that "filename" is always non-NULL for io_uring [Hanna]
> >   - Check that virtio-blk-vhost-vdpa "path" option is non-NULL [Hanna]
> >   - Fix virtio-blk-vhost-vdpa cache.direct=off logic [Hanna]
> >   - Use macros for driver names [Hanna]
> >   - Assert that the driver name is valid [Hanna]
> >   - Update "readonly" property name to "read-only" [Hanna]
> >   - Call blkio_detach_aio_context() in blkio_close() [Hanna]
> >   - Avoid uint32_t * to int * casts in blkio_refresh_limits() [Hanna]
> >   - Remove write zeroes and discard from the todo list [Hanna]
> >   - Use PRIu32 instead of %d for uint32_t [Hanna]
> >   - Fix error messages with buf-alignment instead of optimal-io-size [Hanna]
> >   - Call map/unmap APIs since libblkio alloc/free APIs no longer do that
> >   - Update QAPI schema QEMU version to 7.2
> > - Patch 5:
> >   - Expand the BDRV_REQ_REGISTERED_BUF flag passthrough and drop assert(!flags)
> >     in drivers [Hanna]
> > - Patch 7:
> >   - Fix BLK->BDRV typo [Hanna]
> >   - Make BlockRAMRegistrar handle failure [Hanna]
> > - Patch 8:
> >   - Replace memory_region_get_fd() approach with qemu_ram_get_fd()
> > - Patch 10:
> >   - Use (void)ret; to discard unused return value [Hanna]
> >   - libblkio's blkio_unmap_mem_region() API no longer has a return value
> >   - Check for registered bufs that cross RAMBlocks [Hanna]
> > - Patch 11:
> >   - Handle bdrv_register_buf() errors [Hanna]
> > v3:
> > - Add virtio-blk-vhost-vdpa for vdpa-blk devices including VDUSE
> > - Add discard and write zeroes support
> > - Rebase and adopt latest libblkio APIs
> > v2:
> > - Add BDRV_REQ_REGISTERED_BUF to bs.supported_write_flags [Stefano]
> > - Use new blkioq_get_num_completions() API
> > - Implement .bdrv_refresh_limits()
> >
> > This patch series adds a QEMU BlockDriver for libblkio
> > (https://gitlab.com/libblkio/libblkio/), a library for high-performance block
> > device I/O. This work was presented at KVM Forum 2022 and slides are available
> > here:
> > https://static.sched.com/hosted_files/kvmforum2022/8c/libblkio-kvm-forum-2022.pdf
> >
> > The second patch adds the core BlockDriver and most of the libblkio API usage.
> > Three libblkio drivers are included:
> > - io_uring
> > - virtio-blk-vhost-user
> > - virtio-blk-vhost-vdpa
> >
> > The remainder of the patch series reworks the existing QEMU bdrv_register_buf()
> > API so virtio-blk emulation can efficiently map guest RAM for libblkio - some
> > libblkio drivers require that I/O buffer memory is pre-registered (think VFIO,
> > vhost, etc).
> >
> > Vladimir requested performance results that show the effect of the
> > BDRV_REQ_REGISTERED_BUF flag. I ran the patches against qemu-storage-daemon's
> > vhost-user-blk export with iodepth=1 bs=512 to see the per-request overhead due
> > to bounce buffer allocation/mapping:
> >
> > Name                                   IOPS   Error
> > bounce-buf                          4373.81 ± 0.01%
> > registered-buf                     13062.80 ± 0.67%
> >
> > The BDRV_REQ_REGISTERED_BUF optimization version is about 3x faster.
> >
> > See the BlockDriver struct in block/blkio.c for a list of APIs that still need
> > to be implemented. The core functionality is covered.
> >
> > Regarding the design: each libblkio driver is a separately named BlockDriver.
> > That means there is an "io_uring" BlockDriver and not a generic "libblkio"
> > BlockDriver. This way QAPI and open parameters are type-safe and mandatory
> > parameters can be checked by QEMU.
> >
> > Stefan Hajnoczi (12):
> >   coroutine: add flag to re-queue at front of CoQueue
> >   blkio: add libblkio block driver
> >   numa: call ->ram_block_removed() in ram_block_notifer_remove()
> >   block: pass size to bdrv_unregister_buf()
> >   block: use BdrvRequestFlags type for supported flag fields
> >   block: add BDRV_REQ_REGISTERED_BUF request flag
> >   block: return errors from bdrv_register_buf()
> >   block: add BlockRAMRegistrar
> >   exec/cpu-common: add qemu_ram_get_fd()
> >   stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd()
> >   blkio: implement BDRV_REQ_REGISTERED_BUF optimization
> >   virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
> >
> >  MAINTAINERS                                 |    7 +
> >  meson_options.txt                           |    2 +
> >  qapi/block-core.json                        |   53 +-
> >  meson.build                                 |    9 +
> >  include/block/block-common.h                |    9 +
> >  include/block/block-global-state.h          |   10 +-
> >  include/block/block_int-common.h            |   15 +-
> >  include/exec/cpu-common.h                   |    1 +
> >  include/hw/virtio/virtio-blk.h              |    2 +
> >  include/qemu/coroutine.h                    |   15 +-
> >  include/sysemu/block-backend-global-state.h |    4 +-
> >  include/sysemu/block-ram-registrar.h        |   37 +
> >  block.c                                     |   14 +
> >  block/blkio.c                               | 1017 +++++++++++++++++++
> >  block/blkverify.c                           |    4 +-
> >  block/block-backend.c                       |    8 +-
> >  block/block-ram-registrar.c                 |   54 +
> >  block/crypto.c                              |    4 +-
> >  block/file-posix.c                          |    1 -
> >  block/gluster.c                             |    1 -
> >  block/io.c                                  |  101 +-
> >  block/mirror.c                              |    2 +
> >  block/nbd.c                                 |    1 -
> >  block/nvme.c                                |   20 +-
> >  block/parallels.c                           |    1 -
> >  block/qcow.c                                |    2 -
> >  block/qed.c                                 |    1 -
> >  block/raw-format.c                          |    2 +
> >  block/replication.c                         |    1 -
> >  block/ssh.c                                 |    1 -
> >  block/vhdx.c                                |    1 -
> >  hw/block/virtio-blk.c                       |   39 +-
> >  hw/core/numa.c                              |   17 +
> >  qemu-img.c                                  |    6 +-
> >  softmmu/physmem.c                           |    5 +
> >  stubs/physmem.c                             |   13 +
> >  tests/qtest/modules-test.c                  |    3 +
> >  util/qemu-coroutine-lock.c                  |    9 +-
> >  util/vfio-helpers.c                         |    5 +-
> >  block/meson.build                           |    2 +
> >  scripts/meson-buildoptions.sh               |    3 +
> >  stubs/meson.build                           |    1 +
> >  42 files changed, 1412 insertions(+), 91 deletions(-)
> >  create mode 100644 include/sysemu/block-ram-registrar.h
> >  create mode 100644 block/blkio.c
> >  create mode 100644 block/block-ram-registrar.c
> >  create mode 100644 stubs/physmem.c
> >
> > --
> > 2.37.3
> >
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-09-28 20:12     ` Alberto Campinho Faria
@ 2022-10-06 18:00       ` Stefan Hajnoczi
  2022-10-06 18:09         ` Alberto Faria
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 18:00 UTC (permalink / raw)
  To: Alberto Campinho Faria
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Wed, Sep 28, 2022 at 09:12:36PM +0100, Alberto Campinho Faria wrote:
> On Wed, Sep 28, 2022 at 8:21 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Sep 27, 2022 at 03:34:30PM -0400, Stefan Hajnoczi wrote:
> > > +    ret = blkio_get_bool(s->blkio,
> > > +                         "mem-regions-pinned",
> > > +                         &s->mem_regions_pinned);
> > > +    if (ret < 0) {
> > > +        /* Be conservative (assume pinning) if the property is not supported */
> > > +        s->mem_regions_pinned = true;
> >
> > This is too conservative :). It can be changed to:
> >
> >   s->mem_regions_pinned = s->needs_mem_regions;
> >
> > That way we avoid ram_block_discard_disable() for libblkio drivers (like
> > io_uring in libblkio 1.0) that don't use memory regions and don't
> > support the "mem-regions-pinned" property yet.
> 
> Even if a driver doesn't _need_ memory regions to be mapped before
> use, it may still do something special with the ones that _are_
> mapped, so we may have no choice but to set s->mem_regions_pinned =
> true.
> 
> (Unless we are assuming that all future libblkio versions will either
> not have such drivers, or will provide a "mem-regions-pinned"
> property, but that feels brittle.)

s->needs_mem_regions determines if we'll use libblkio memory regions at
all. When it's false we skip blkio_map_mem_region() and therefore it's
safe to set s->mem_regions_pinned to false.

Stefan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-10-06 18:00       ` Stefan Hajnoczi
@ 2022-10-06 18:09         ` Alberto Faria
  2022-10-06 18:46           ` Stefan Hajnoczi
  0 siblings, 1 reply; 25+ messages in thread
From: Alberto Faria @ 2022-10-06 18:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Thu, Oct 6, 2022 at 7:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> s->needs_mem_regions determines if we'll use libblkio memory regions at
> all. When it's false we skip blkio_map_mem_region() and therefore it's
> safe to set s->mem_regions_pinned to false.

blkio_register_buf() calls blkio_map_mem_region(). Is the
bdrv_register_buf callback maybe skipped somehow when
needs_mem_regions is false?

Regardless, I'd say we want to map memory regions even if we don't
strictly need to (in cases where we can do so at no additional cost),
since that may improve performance for some drivers.

Alberto



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-10-06 18:09         ` Alberto Faria
@ 2022-10-06 18:46           ` Stefan Hajnoczi
  2022-10-06 18:54             ` Alberto Faria
  0 siblings, 1 reply; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 18:46 UTC (permalink / raw)
  To: Alberto Faria
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Thu, Oct 06, 2022 at 07:09:36PM +0100, Alberto Faria wrote:
> On Thu, Oct 6, 2022 at 7:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > s->needs_mem_regions determines if we'll use libblkio memory regions at
> > all. When it's false we skip blkio_map_mem_region() and therefore it's
> > safe to set s->mem_regions_pinned to false.
> 
> blkio_register_buf() calls blkio_map_mem_region(). Is the
> bdrv_register_buf callback maybe skipped somehow when
> needs_mem_regions is false?

You're right. blkio_register_buf() will be called by virtio-blk's ram
block registrar even when s->needs_mem_regions is false.

s->mem_regions_pinned therefore has to default to true even when
s->needs_mem_regions is false.

> Regardless, I'd say we want to map memory regions even if we don't
> strictly need to (in cases where we can do so at no additional cost),
> since that may improve performance for some drivers.

The downside is that when s->mem_regions_pinned is true, virtio-mem and
anything else that calls ram discard cannot be used.

I think we can try that for now and see if the policy needs to be
refined.

Stefan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-10-06 18:46           ` Stefan Hajnoczi
@ 2022-10-06 18:54             ` Alberto Faria
  0 siblings, 0 replies; 25+ messages in thread
From: Alberto Faria @ 2022-10-06 18:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Thu, Oct 6, 2022 at 7:46 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > Regardless, I'd say we want to map memory regions even if we don't
> > strictly need to (in cases where we can do so at no additional cost),
> > since that may improve performance for some drivers.
>
> The downside is that when s->mem_regions_pinned is true, virtio-mem and
> anything else that calls ram discard cannot be used.

Hmm right, losing that functionality would probably be worse than
potentially losing some performance for some drivers. Maybe a good
middle ground would be to call blkio_map_mem_region() in
blkio_register_buf() iff s->needs_mem_regions ||
!s->mem_regions_pinned.
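
A hypothetical helper expressing that policy (only the two struct
fields come from the patch; the helper name is made up, and the real
check would sit inside blkio_register_buf() in block/blkio.c):

    static bool blkio_want_mem_region(BDRVBlkioState *s)
    {
        /*
         * Map when the driver requires it, or when mapping does not pin
         * memory and therefore does not conflict with ram discard
         * (e.g. virtio-mem).
         */
        return s->needs_mem_regions || !s->mem_regions_pinned;
    }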

Alberto



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 02/12] blkio: add libblkio block driver
  2022-10-06 16:41   ` Alberto Faria
@ 2022-10-06 18:56     ` Stefan Hajnoczi
  0 siblings, 0 replies; 25+ messages in thread
From: Stefan Hajnoczi @ 2022-10-06 18:56 UTC (permalink / raw)
  To: Alberto Faria
  Cc: qemu-devel, Yanan Wang, Kevin Wolf, Markus Armbruster, John Snow,
	Denis V. Lunev, Xie Changlong, Eric Blake, integration,
	David Hildenbrand, Wen Congyang, Laurent Vivier,
	Richard W.M. Jones, Fam Zheng, Thomas Huth, Hanna Reitz,
	Eduardo Habkost, Peter Xu, Raphael Norwitz, Marcel Apfelbaum,
	Vladimir Sementsov-Ogievskiy, Philippe Mathieu-Daudé,
	Jeff Cody, qemu-block, Paolo Bonzini, Richard Henderson,
	Michael S. Tsirkin, sgarzare

On Thu, Oct 06, 2022 at 05:41:39PM +0100, Alberto Faria wrote:
> On Tue, Sep 27, 2022 at 8:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > +static int blkio_virtio_blk_vhost_user_open(BlockDriverState *bs,
> > +        QDict *options, int flags, Error **errp)
> > +{
> > +    const char *path = qdict_get_try_str(options, "path");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    if (!path) {
> > +        error_setg(errp, "missing 'path' option");
> > +        return -EINVAL;
> > +    }
> > +
> > +    ret = blkio_set_str(s->blkio, "path", path);
> > +    qdict_del(options, "path");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (!(flags & BDRV_O_NOCACHE)) {
> > +        error_setg(errp, "cache.direct=off is not supported");
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
> > +        QDict *options, int flags, Error **errp)
> > +{
> > +    const char *path = qdict_get_try_str(options, "path");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    if (!path) {
> > +        error_setg(errp, "missing 'path' option");
> > +        return -EINVAL;
> > +    }
> > +
> > +    ret = blkio_set_str(s->blkio, "path", path);
> > +    qdict_del(options, "path");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (!(flags & BDRV_O_NOCACHE)) {
> > +        error_setg(errp, "cache.direct=off is not supported");
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> 
> These two functions are (currently) exact duplicates. Should we just
> have a single blkio_virtio_blk_open() function?

Will fix.
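
Something like this, reusing the body the two functions already share
(untested sketch against the same headers as the rest of the file; the
name follows your suggestion):

    static int blkio_virtio_blk_open(BlockDriverState *bs,
            QDict *options, int flags, Error **errp)
    {
        const char *path = qdict_get_try_str(options, "path");
        BDRVBlkioState *s = bs->opaque;
        int ret;

        if (!path) {
            error_setg(errp, "missing 'path' option");
            return -EINVAL;
        }

        ret = blkio_set_str(s->blkio, "path", path);
        qdict_del(options, "path");
        if (ret < 0) {
            error_setg_errno(errp, -ret, "failed to set path: %s",
                             blkio_get_error_msg());
            return ret;
        }

        if (!(flags & BDRV_O_NOCACHE)) {
            error_setg(errp, "cache.direct=off is not supported");
            return -EINVAL;
        }
        return 0;
    }

Both the vhost-user and vhost-vdpa drivers can then use it for their
open path.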

> > +static BlockDriver bdrv_io_uring = {
> > +    .format_name                = DRIVER_IO_URING,
> > +    .protocol_name              = DRIVER_IO_URING,
> > +    .instance_size              = sizeof(BDRVBlkioState),
> > +    .bdrv_needs_filename        = true,
> > +    .bdrv_file_open             = blkio_file_open,
> > +    .bdrv_close                 = blkio_close,
> > +    .bdrv_getlength             = blkio_getlength,
> > +    .bdrv_get_info              = blkio_get_info,
> > +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> > +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> > +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> > +    .bdrv_co_preadv             = blkio_co_preadv,
> > +    .bdrv_co_pwritev            = blkio_co_pwritev,
> > +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> > +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> > +    .bdrv_io_unplug             = blkio_io_unplug,
> > +    .bdrv_refresh_limits        = blkio_refresh_limits,
> > +};
> > +
> > +static BlockDriver bdrv_virtio_blk_vhost_user = {
> > +    .format_name                = DRIVER_VIRTIO_BLK_VHOST_USER,
> > +    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_USER,
> > +    .instance_size              = sizeof(BDRVBlkioState),
> > +    .bdrv_file_open             = blkio_file_open,
> > +    .bdrv_close                 = blkio_close,
> > +    .bdrv_getlength             = blkio_getlength,
> > +    .bdrv_get_info              = blkio_get_info,
> > +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> > +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> > +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> > +    .bdrv_co_preadv             = blkio_co_preadv,
> > +    .bdrv_co_pwritev            = blkio_co_pwritev,
> > +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> > +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> > +    .bdrv_io_unplug             = blkio_io_unplug,
> > +    .bdrv_refresh_limits        = blkio_refresh_limits,
> > +};
> > +
> > +static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
> > +    .format_name                = DRIVER_VIRTIO_BLK_VHOST_VDPA,
> > +    .protocol_name              = DRIVER_VIRTIO_BLK_VHOST_VDPA,
> > +    .instance_size              = sizeof(BDRVBlkioState),
> > +    .bdrv_file_open             = blkio_file_open,
> > +    .bdrv_close                 = blkio_close,
> > +    .bdrv_getlength             = blkio_getlength,
> > +    .bdrv_get_info              = blkio_get_info,
> > +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> > +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> > +    .bdrv_co_pdiscard           = blkio_co_pdiscard,
> > +    .bdrv_co_preadv             = blkio_co_preadv,
> > +    .bdrv_co_pwritev            = blkio_co_pwritev,
> > +    .bdrv_co_flush_to_disk      = blkio_co_flush,
> > +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> > +    .bdrv_io_unplug             = blkio_io_unplug,
> > +    .bdrv_refresh_limits        = blkio_refresh_limits,
> > +};
> 
> It's difficult to identify the fields that differ between
> BlockDrivers. Maybe we could do something like:
> 
>     #define DRIVER(name, ...) \
>         { \
>             .format_name             = name, \
>             .protocol_name           = name, \
>             .instance_size           = sizeof(BDRVBlkioState), \
>             .bdrv_file_open          = blkio_file_open, \
>             .bdrv_close              = blkio_close, \
>             .bdrv_getlength          = blkio_getlength, \
>             .bdrv_get_info           = blkio_get_info, \
>             .bdrv_attach_aio_context = blkio_attach_aio_context, \
>             .bdrv_detach_aio_context = blkio_detach_aio_context, \
>             .bdrv_co_pdiscard        = blkio_co_pdiscard, \
>             .bdrv_co_preadv          = blkio_co_preadv, \
>             .bdrv_co_pwritev         = blkio_co_pwritev, \
>             .bdrv_co_flush_to_disk   = blkio_co_flush, \
>             .bdrv_co_pwrite_zeroes   = blkio_co_pwrite_zeroes, \
>             .bdrv_io_unplug          = blkio_io_unplug, \
>             .bdrv_refresh_limits     = blkio_refresh_limits, \
>             __VA_ARGS__ \
>         }
> 
>     static BlockDriver bdrv_io_uring = DRIVER(
>         DRIVER_IO_URING,
>         .bdrv_needs_filename = true,
>     );
> 
>     static BlockDriver bdrv_virtio_blk_vhost_user = DRIVER(
>         DRIVER_VIRTIO_BLK_VHOST_USER
>     );
> 
>     static BlockDriver bdrv_virtio_blk_vhost_vdpa = DRIVER(
>         DRIVER_VIRTIO_BLK_VHOST_VDPA
>     );

Makes sense to me.

Stefan


end of thread, other threads:[~2022-10-06 19:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
2022-09-27 19:34 [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 01/12] coroutine: add flag to re-queue at front of CoQueue Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 02/12] blkio: add libblkio block driver Stefan Hajnoczi
2022-09-28  5:27   ` Markus Armbruster
2022-09-28 20:10     ` Stefan Hajnoczi
2022-10-06 16:41   ` Alberto Faria
2022-10-06 18:56     ` Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 03/12] numa: call ->ram_block_removed() in ram_block_notifer_remove() Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 04/12] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 05/12] block: use BdrvRequestFlags type for supported flag fields Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 06/12] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 07/12] block: return errors from bdrv_register_buf() Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 08/12] block: add BlockRAMRegistrar Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 09/12] exec/cpu-common: add qemu_ram_get_fd() Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 10/12] stubs: add qemu_ram_block_from_host() and qemu_ram_get_fd() Stefan Hajnoczi
2022-09-27 19:34 ` [PATCH v5 11/12] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
2022-09-28 19:21   ` Stefan Hajnoczi
2022-09-28 20:12     ` Alberto Campinho Faria
2022-10-06 18:00       ` Stefan Hajnoczi
2022-10-06 18:09         ` Alberto Faria
2022-10-06 18:46           ` Stefan Hajnoczi
2022-10-06 18:54             ` Alberto Faria
2022-09-27 19:34 ` [PATCH v5 12/12] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
2022-10-06 12:18 ` [PATCH v5 00/12] blkio: add libblkio BlockDriver Stefano Garzarella
2022-10-06 17:32   ` Stefan Hajnoczi
