* [RFC v3 0/8] blkio: add libblkio BlockDriver
@ 2022-07-08  4:17 Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

v3:
- Add virtio-blk-vhost-vdpa for vdpa-blk devices including VDUSE
- Add discard and write zeroes support
- Rebase and adopt latest libblkio APIs
v2:
- Add BDRV_REQ_REGISTERED_BUF to bs.supported_write_flags [Stefano]
- Use new blkioq_get_num_completions() API
- Implement .bdrv_refresh_limits()

This patch series adds a QEMU BlockDriver for libblkio
(https://gitlab.com/libblkio/libblkio/), a library for high-performance block
device I/O. Currently libblkio has io_uring and virtio-blk-vhost-vdpa support
with additional drivers in development.

The first patch adds the core BlockDriver and most of the libblkio API usage.
The remainder of the patch series reworks the existing QEMU
bdrv_register_buf() API so that virtio-blk emulation can efficiently map
guest RAM for libblkio - some libblkio drivers require that I/O buffer
memory be pre-registered (think VFIO, vhost, etc.).

This block driver is functional enough to boot guests. The libblkio 1.0 release
is expected soon and I will drop the "RFC" once the API is stable.

See the BlockDriver struct in block/blkio.c for a list of APIs that still need
to be implemented.

Regarding the design: each libblkio driver is a separately named BlockDriver.
That means there is an "io_uring" BlockDriver and not a generic "libblkio"
BlockDriver. This way the QAPI schema and open parameters are type-safe, and
QEMU can check mandatory parameters.
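
For example, with the QAPI schema added in patch 1 the following behavior is
expected (a sketch; the exact error message is illustrative):

  # accepted: "filename" is the mandatory io_uring parameter
  --blockdev io_uring,node-name=drive0,filename=test.img

  # rejected at parse time: "path" belongs to virtio-blk-vhost-vdpa
  --blockdev io_uring,node-name=drive0,path=/dev/vhost-vdpa-0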

Stefan Hajnoczi (8):
  blkio: add io_uring block driver using libblkio
  numa: call ->ram_block_removed() in ram_block_notifier_remove()
  block: pass size to bdrv_unregister_buf()
  block: add BDRV_REQ_REGISTERED_BUF request flag
  block: add BlockRAMRegistrar
  stubs: add memory_region_from_host() and memory_region_get_fd()
  blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint

 MAINTAINERS                                 |   7 +
 meson_options.txt                           |   2 +
 qapi/block-core.json                        |  37 +-
 meson.build                                 |   9 +
 include/block/block-common.h                |   9 +
 include/block/block-global-state.h          |   5 +-
 include/block/block_int-common.h            |   2 +-
 include/hw/virtio/virtio-blk.h              |   2 +
 include/sysemu/block-backend-global-state.h |   2 +-
 include/sysemu/block-ram-registrar.h        |  30 +
 block/blkio.c                               | 757 ++++++++++++++++++++
 block/blkverify.c                           |   4 +-
 block/block-backend.c                       |   4 +-
 block/block-ram-registrar.c                 |  39 +
 block/crypto.c                              |   2 +
 block/io.c                                  |  36 +-
 block/mirror.c                              |   2 +
 block/nvme.c                                |   2 +-
 block/raw-format.c                          |   2 +
 hw/block/virtio-blk.c                       |  13 +-
 hw/core/numa.c                              |  17 +
 qemu-img.c                                  |   4 +-
 stubs/memory.c                              |  13 +
 tests/qtest/modules-test.c                  |   3 +
 util/vfio-helpers.c                         |   5 +-
 block/meson.build                           |   2 +
 scripts/meson-buildoptions.sh               |   3 +
 stubs/meson.build                           |   1 +
 28 files changed, 987 insertions(+), 27 deletions(-)
 create mode 100644 include/sysemu/block-ram-registrar.h
 create mode 100644 block/blkio.c
 create mode 100644 block/block-ram-registrar.c
 create mode 100644 stubs/memory.c

-- 
2.36.1




* [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-12 14:23   ` Stefano Garzarella
                     ` (2 more replies)
  2022-07-08  4:17 ` [RFC v3 2/8] numa: call ->ram_block_removed() in ram_block_notifier_remove() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
high-performance disk I/O. It currently supports io_uring and
virtio-blk-vhost-vdpa with additional drivers under development.

One of the reasons for developing libblkio is that other applications
besides QEMU can use it. This will be particularly useful for
vhost-user-blk, which applications may wish to use to connect to
qemu-storage-daemon.

libblkio also gives us an opportunity to develop in Rust behind a C API
that is easy to consume from QEMU.

This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
using libblkio. It will be easy to add other libblkio drivers since they
will share the majority of code.
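
For reference, the shared open path boils down to roughly this libblkio call
sequence (a sketch based on blkio_file_open() below; error handling omitted):

  struct blkio *b;
  struct blkioq *q;

  blkio_create("io_uring", &b);        /* the driver name selects the backend */
  blkio_set_str(b, "path", filename);  /* then driver-specific properties */
  blkio_connect(b);
  blkio_start(b);
  q = blkio_get_queue(b, 0);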

For now I/O buffers are copied through bounce buffers if the libblkio
driver requires it. Later commits add an optimization for
pre-registering guest RAM to avoid bounce buffers.

The syntax is:

  --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off

and:

  --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                   |   6 +
 meson_options.txt             |   2 +
 qapi/block-core.json          |  37 +-
 meson.build                   |   9 +
 block/blkio.c                 | 659 ++++++++++++++++++++++++++++++++++
 tests/qtest/modules-test.c    |   3 +
 block/meson.build             |   1 +
 scripts/meson-buildoptions.sh |   3 +
 8 files changed, 718 insertions(+), 2 deletions(-)
 create mode 100644 block/blkio.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 450abd0252..50f340d9ee 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3395,6 +3395,12 @@ L: qemu-block@nongnu.org
 S: Maintained
 F: block/vdi.c
 
+blkio
+M: Stefan Hajnoczi <stefanha@redhat.com>
+L: qemu-block@nongnu.org
+S: Maintained
+F: block/blkio.c
+
 iSCSI
 M: Ronnie Sahlberg <ronniesahlberg@gmail.com>
 M: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/meson_options.txt b/meson_options.txt
index 97c38109b1..b0b2e0c9b5 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -117,6 +117,8 @@ option('bzip2', type : 'feature', value : 'auto',
        description: 'bzip2 support for DMG images')
 option('cap_ng', type : 'feature', value : 'auto',
        description: 'cap_ng support')
+option('blkio', type : 'feature', value : 'auto',
+       description: 'libblkio block device driver')
 option('bpf', type : 'feature', value : 'auto',
         description: 'eBPF support')
 option('cocoa', type : 'feature', value : 'auto',
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2173e7734a..aa63d5e9bd 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2951,11 +2951,15 @@
             'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
             {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
             {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
-            'http', 'https', 'iscsi',
+            'http', 'https',
+            { 'name': 'io_uring', 'if': 'CONFIG_BLKIO' },
+            'iscsi',
             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
             { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
+            'ssh', 'throttle', 'vdi', 'vhdx',
+            { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
+            'vmdk', 'vpc', 'vvfat' ] }
 
 ##
 # @BlockdevOptionsFile:
@@ -3678,6 +3682,30 @@
             '*debug': 'int',
             '*logfile': 'str' } }
 
+##
+# @BlockdevOptionsIoUring:
+#
+# Driver specific block device options for the io_uring backend.
+#
+# @filename: path to the image file
+#
+# Since: 7.1
+##
+{ 'struct': 'BlockdevOptionsIoUring',
+  'data': { 'filename': 'str' } }
+
+##
+# @BlockdevOptionsVirtioBlkVhostVdpa:
+#
+# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
+#
+# @path: path to the vhost-vdpa character device.
+#
+# Since: 7.1
+##
+{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
+  'data': { 'path': 'str' } }
+
 ##
 # @IscsiTransport:
 #
@@ -4305,6 +4333,8 @@
                        'if': 'HAVE_HOST_BLOCK_DEVICE' },
       'http':       'BlockdevOptionsCurlHttp',
       'https':      'BlockdevOptionsCurlHttps',
+      'io_uring':   { 'type': 'BlockdevOptionsIoUring',
+                      'if': 'CONFIG_BLKIO' },
       'iscsi':      'BlockdevOptionsIscsi',
       'luks':       'BlockdevOptionsLUKS',
       'nbd':        'BlockdevOptionsNbd',
@@ -4327,6 +4357,9 @@
       'throttle':   'BlockdevOptionsThrottle',
       'vdi':        'BlockdevOptionsGenericFormat',
       'vhdx':       'BlockdevOptionsGenericFormat',
+      'virtio-blk-vhost-vdpa':
+                    { 'type': 'BlockdevOptionsVirtioBlkVhostVdpa',
+                      'if': 'CONFIG_BLKIO' },
       'vmdk':       'BlockdevOptionsGenericCOWFormat',
       'vpc':        'BlockdevOptionsGenericFormat',
       'vvfat':      'BlockdevOptionsVVFAT'
diff --git a/meson.build b/meson.build
index bc5569ace1..f09b009428 100644
--- a/meson.build
+++ b/meson.build
@@ -713,6 +713,13 @@ if not get_option('virglrenderer').auto() or have_system or have_vhost_user_gpu
                      required: get_option('virglrenderer'),
                      kwargs: static_kwargs)
 endif
+blkio = not_found
+if not get_option('blkio').auto() or have_block
+  blkio = dependency('blkio',
+                     method: 'pkg-config',
+                     required: get_option('blkio'),
+                     kwargs: static_kwargs)
+endif
 curl = not_found
 if not get_option('curl').auto() or have_block
   curl = dependency('libcurl', version: '>=7.29.0',
@@ -1755,6 +1762,7 @@ config_host_data.set('CONFIG_LIBUDEV', libudev.found())
 config_host_data.set('CONFIG_LZO', lzo.found())
 config_host_data.set('CONFIG_MPATH', mpathpersist.found())
 config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
+config_host_data.set('CONFIG_BLKIO', blkio.found())
 config_host_data.set('CONFIG_CURL', curl.found())
 config_host_data.set('CONFIG_CURSES', curses.found())
 config_host_data.set('CONFIG_GBM', gbm.found())
@@ -3909,6 +3917,7 @@ summary_info += {'PAM':               pam}
 summary_info += {'iconv support':     iconv}
 summary_info += {'curses support':    curses}
 summary_info += {'virgl support':     virgl}
+summary_info += {'blkio support':     blkio}
 summary_info += {'curl support':      curl}
 summary_info += {'Multipath support': mpathpersist}
 summary_info += {'PNG support':       png}
diff --git a/block/blkio.c b/block/blkio.c
new file mode 100644
index 0000000000..7fbdbd7fae
--- /dev/null
+++ b/block/blkio.c
@@ -0,0 +1,659 @@
+#include "qemu/osdep.h"
+#include <blkio.h>
+#include "block/block_int.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+#include "qemu/module.h"
+
+typedef struct BlkioAIOCB {
+    BlockAIOCB common;
+    struct blkio_mem_region mem_region;
+    QEMUIOVector qiov;
+    struct iovec bounce_iov;
+} BlkioAIOCB;
+
+typedef struct {
+    /* Protects ->blkio and request submission on ->blkioq */
+    QemuMutex lock;
+
+    struct blkio *blkio;
+    struct blkioq *blkioq; /* this could be multi-queue in the future */
+    int completion_fd;
+
+    /* Polling fetches the next completion into this field */
+    struct blkio_completion poll_completion;
+
+    /* The value of the "mem-region-alignment" property */
+    size_t mem_region_alignment;
+
+    /* Can we skip adding/deleting blkio_mem_regions? */
+    bool needs_mem_regions;
+} BDRVBlkioState;
+
+static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
+{
+    /* Copy bounce buffer back to qiov */
+    if (acb->qiov.niov > 0) {
+        qemu_iovec_from_buf(&acb->qiov, 0,
+                acb->bounce_iov.iov_base,
+                acb->bounce_iov.iov_len);
+        qemu_iovec_destroy(&acb->qiov);
+    }
+
+    acb->common.cb(acb->common.opaque, ret);
+
+    if (acb->mem_region.len > 0) {
+        BDRVBlkioState *s = acb->common.bs->opaque;
+
+        WITH_QEMU_LOCK_GUARD(&s->lock) {
+            blkio_free_mem_region(s->blkio, &acb->mem_region);
+        }
+    }
+
+    qemu_aio_unref(&acb->common);
+}
+
+/*
+ * Only the thread that calls aio_poll() invokes fd and poll handlers.
+ * Therefore locks are not necessary except when accessing s->blkio.
+ *
+ * No locking is performed around blkioq_get_completions() although other
+ * threads may submit I/O requests on s->blkioq. We're assuming there is no
+ * interference between blkioq_get_completions() and other s->blkioq APIs.
+ */
+
+static void blkio_completion_fd_read(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVBlkioState *s = bs->opaque;
+    struct blkio_completion completion;
+    uint64_t val;
+    ssize_t ret __attribute__((unused));
+
+    /* Polling may have already fetched a completion */
+    if (s->poll_completion.user_data != NULL) {
+        completion = s->poll_completion;
+
+        /* Clear it in case blkio_aiocb_complete() has a nested event loop */
+        s->poll_completion.user_data = NULL;
+
+        blkio_aiocb_complete(completion.user_data, completion.ret);
+    }
+
+    /* Reset completion fd status */
+    ret = read(s->completion_fd, &val, sizeof(val));
+
+    /*
+     * Reading one completion at a time makes nested event loop re-entrancy
+     * simple. Change this loop to get multiple completions in one go if it
+     * becomes a performance bottleneck.
+     */
+    while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
+        blkio_aiocb_complete(completion.user_data, completion.ret);
+    }
+}
+
+static bool blkio_completion_fd_poll(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BDRVBlkioState *s = bs->opaque;
+
+    /* Just in case we already fetched a completion */
+    if (s->poll_completion.user_data != NULL) {
+        return true;
+    }
+
+    return blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL) == 1;
+}
+
+static void blkio_completion_fd_poll_ready(void *opaque)
+{
+    blkio_completion_fd_read(opaque);
+}
+
+static void blkio_attach_aio_context(BlockDriverState *bs,
+                                     AioContext *new_context)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    aio_set_fd_handler(new_context,
+                       s->completion_fd,
+                       false,
+                       blkio_completion_fd_read,
+                       NULL,
+                       blkio_completion_fd_poll,
+                       blkio_completion_fd_poll_ready,
+                       bs);
+}
+
+static void blkio_detach_aio_context(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    aio_set_fd_handler(bdrv_get_aio_context(bs),
+                       s->completion_fd,
+                       false, NULL, NULL, NULL, NULL, NULL);
+}
+
+static const AIOCBInfo blkio_aiocb_info = {
+    .aiocb_size = sizeof(BlkioAIOCB),
+};
+
+/* Create a BlkioAIOCB */
+static BlkioAIOCB *blkio_aiocb_get(BlockDriverState *bs,
+                                   BlockCompletionFunc *cb,
+                                   void *opaque)
+{
+    BlkioAIOCB *acb = qemu_aio_get(&blkio_aiocb_info, bs, cb, opaque);
+
+    /* A few fields need to be initialized, leave the rest... */
+    acb->qiov.niov = 0;
+    acb->mem_region.len = 0;
+    return acb;
+}
+
+/* s->lock must be held */
+static int blkio_aiocb_init_mem_region_locked(BlkioAIOCB *acb, size_t len)
+{
+    BDRVBlkioState *s = acb->common.bs->opaque;
+    size_t mem_region_len = QEMU_ALIGN_UP(len, s->mem_region_alignment);
+    int ret;
+
+    ret = blkio_alloc_mem_region(s->blkio, &acb->mem_region, mem_region_len);
+    if (ret < 0) {
+        return ret;
+    }
+
+    acb->bounce_iov.iov_base = acb->mem_region.addr;
+    acb->bounce_iov.iov_len = len;
+    return 0;
+}
+
+/* Call this to submit I/O after enqueuing a new request */
+static void blkio_submit_io(BlockDriverState *bs)
+{
+    if (qatomic_read(&bs->io_plugged) == 0) {
+        BDRVBlkioState *s = bs->opaque;
+
+        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
+    }
+}
+
+static BlockAIOCB *blkio_aio_pdiscard(BlockDriverState *bs, int64_t offset,
+        int bytes, BlockCompletionFunc *cb, void *opaque)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioAIOCB *acb;
+
+    QEMU_LOCK_GUARD(&s->lock);
+
+    acb = blkio_aiocb_get(bs, cb, opaque);
+    blkioq_discard(s->blkioq, offset, bytes, acb, 0);
+    blkio_submit_io(bs);
+    return &acb->common;
+}
+
+static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
+        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
+        BlockCompletionFunc *cb, void *opaque)
+{
+    BDRVBlkioState *s = bs->opaque;
+    struct iovec *iov = qiov->iov;
+    int iovcnt = qiov->niov;
+    BlkioAIOCB *acb;
+
+    QEMU_LOCK_GUARD(&s->lock);
+
+    acb = blkio_aiocb_get(bs, cb, opaque);
+
+    if (s->needs_mem_regions) {
+        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
+            qemu_aio_unref(&acb->common);
+            return NULL;
+        }
+
+        /* Copy qiov because we'll call qemu_iovec_from_buf() on completion */
+        qemu_iovec_init_slice(&acb->qiov, qiov, 0, qiov->size);
+
+        iov = &acb->bounce_iov;
+        iovcnt = 1;
+    }
+
+    blkioq_readv(s->blkioq, offset, iov, iovcnt, acb, 0);
+    blkio_submit_io(bs);
+    return &acb->common;
+}
+
+static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
+        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
+        BlockCompletionFunc *cb, void *opaque)
+{
+    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
+    BDRVBlkioState *s = bs->opaque;
+    struct iovec *iov = qiov->iov;
+    int iovcnt = qiov->niov;
+    BlkioAIOCB *acb;
+
+    QEMU_LOCK_GUARD(&s->lock);
+
+    acb = blkio_aiocb_get(bs, cb, opaque);
+
+    if (s->needs_mem_regions) {
+        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
+            qemu_aio_unref(&acb->common);
+            return NULL;
+        }
+
+        qemu_iovec_to_buf(qiov, 0, acb->bounce_iov.iov_base, bytes);
+
+        iov = &acb->bounce_iov;
+        iovcnt = 1;
+    }
+
+    blkioq_writev(s->blkioq, offset, iov, iovcnt, acb, blkio_flags);
+    blkio_submit_io(bs);
+    return &acb->common;
+}
+
+static BlockAIOCB *blkio_aio_flush(BlockDriverState *bs,
+                                   BlockCompletionFunc *cb,
+                                   void *opaque)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioAIOCB *acb;
+
+    QEMU_LOCK_GUARD(&s->lock);
+
+    acb = blkio_aiocb_get(bs, cb, opaque);
+
+    blkioq_flush(s->blkioq, acb, 0);
+    blkio_submit_io(bs);
+    return &acb->common;
+}
+
+/* For async to .bdrv_co_*() conversion */
+typedef struct {
+    Coroutine *coroutine;
+    int ret;
+} BlkioCoData;
+
+static void blkio_co_pwrite_zeroes_complete(void *opaque, int ret)
+{
+    BlkioCoData *data = opaque;
+
+    data->ret = ret;
+    aio_co_wake(data->coroutine);
+}
+
+static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
+    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
+{
+    BDRVBlkioState *s = bs->opaque;
+    BlkioCoData data = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    uint32_t blkio_flags = 0;
+
+    if (flags & BDRV_REQ_FUA) {
+        blkio_flags |= BLKIO_REQ_FUA;
+    }
+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+        blkio_flags |= BLKIO_REQ_NO_UNMAP;
+    }
+    if (flags & BDRV_REQ_NO_FALLBACK) {
+        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        BlkioAIOCB *acb =
+            blkio_aiocb_get(bs, blkio_co_pwrite_zeroes_complete, &data);
+        blkioq_write_zeroes(s->blkioq, offset, bytes, acb, blkio_flags);
+        blkio_submit_io(bs);
+    }
+
+    qemu_coroutine_yield();
+    return data.ret;
+}
+
+static void blkio_io_unplug(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        blkio_submit_io(bs);
+    }
+}
+
+static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
+                                          Error **errp)
+{
+    bdrv_parse_filename_strip_prefix(filename, "io_uring:", options);
+}
+
+static void blkio_parse_filename_virtio_blk_vhost_vdpa(
+        const char *filename,
+        QDict *options,
+        Error **errp)
+{
+    bdrv_parse_filename_strip_prefix(filename, "virtio-blk-vhost-vdpa:", options);
+}
+
+static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
+                               Error **errp)
+{
+    const char *filename = qdict_get_try_str(options, "filename");
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    ret = blkio_set_str(s->blkio, "path", filename);
+    qdict_del(options, "filename");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to set path: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (flags & BDRV_O_NOCACHE) {
+        ret = blkio_set_bool(s->blkio, "direct", true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set direct: %s",
+                             blkio_get_error_msg());
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
+        QDict *options, int flags, Error **errp)
+{
+    const char *path = qdict_get_try_str(options, "path");
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    ret = blkio_set_str(s->blkio, "path", path);
+    qdict_del(options, "path");
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to set path: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (flags & BDRV_O_NOCACHE) {
+        error_setg(errp, "cache.direct=off is not supported");
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
+                           Error **errp)
+{
+    const char *blkio_driver = bs->drv->protocol_name;
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+
+    ret = blkio_create(blkio_driver, &s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_create failed: %s",
+                         blkio_get_error_msg());
+        return ret;
+    }
+
+    if (strcmp(blkio_driver, "io_uring") == 0) {
+        ret = blkio_io_uring_open(bs, options, flags, errp);
+    } else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
+        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
+    }
+    if (ret < 0) {
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    if (!(flags & BDRV_O_RDWR)) {
+        ret = blkio_set_bool(s->blkio, "readonly", true);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "failed to set readonly: %s",
+                             blkio_get_error_msg());
+            blkio_destroy(&s->blkio);
+            return ret;
+        }
+    }
+
+    ret = blkio_connect(s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_get_bool(s->blkio,
+                         "needs-mem-regions",
+                         &s->needs_mem_regions);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get needs-mem-regions: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_get_uint64(s->blkio,
+                           "mem-region-alignment",
+                           &s->mem_region_alignment);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get mem-region-alignment: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    ret = blkio_start(s->blkio);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "blkio_start failed: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
+    bs->supported_write_flags = BDRV_REQ_FUA;
+    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
+                               BDRV_REQ_NO_FALLBACK;
+
+    qemu_mutex_init(&s->lock);
+    s->blkioq = blkio_get_queue(s->blkio, 0);
+    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
+
+    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
+    return 0;
+}
+
+static void blkio_close(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+
+    qemu_mutex_destroy(&s->lock);
+    blkio_destroy(&s->blkio);
+}
+
+static int64_t blkio_getlength(BlockDriverState *bs)
+{
+    BDRVBlkioState *s = bs->opaque;
+    uint64_t capacity;
+    int ret;
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
+    }
+    if (ret < 0) {
+        return ret;
+    }
+
+    return capacity;
+}
+
+static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    return 0;
+}
+
+static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+    BDRVBlkioState *s = bs->opaque;
+    int value;
+    int ret;
+
+    ret = blkio_get_int(s->blkio,
+                        "request-alignment",
+                        (int *)&bs->bl.request_alignment);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (bs->bl.request_alignment < 1 ||
+        bs->bl.request_alignment >= INT_MAX ||
+        !is_power_of_2(bs->bl.request_alignment)) {
+        error_setg(errp, "invalid \"request-alignment\" value %d, must be a "
+                   "power of 2 less than INT_MAX", bs->bl.request_alignment);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio,
+                        "optimal-io-size",
+                        (int *)&bs->bl.opt_transfer);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"optimal-io-size\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (bs->bl.opt_transfer > INT_MAX ||
+        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
+        error_setg(errp, "invalid \"optimal-io-size\" value %d, must be a "
+                   "multiple of %d", bs->bl.opt_transfer,
+                   bs->bl.request_alignment);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio,
+                        "max-transfer",
+                        (int *)&bs->bl.max_transfer);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
+        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
+        error_setg(errp, "invalid \"max-transfer\" value %d, must be a "
+                   "multiple of %d and %d (if non-zero)",
+                   bs->bl.max_transfer, bs->bl.request_alignment,
+                   bs->bl.opt_transfer);
+        return;
+    }
+
+    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (value < 1) {
+        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
+                   "positive", value);
+        return;
+    }
+    bs->bl.min_mem_alignment = value;
+
+    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get \"optimal-buf-alignment\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (value < 1) {
+        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
+                   "must be positive", value);
+        return;
+    }
+    bs->bl.opt_mem_alignment = value;
+
+    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
+                         blkio_get_error_msg());
+        return;
+    }
+    if (bs->bl.max_iov < 1) {
+        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
+                   bs->bl.max_iov);
+        return;
+    }
+}
+
+/*
+ * TODO
+ * Missing libblkio APIs:
+ * - block_status
+ * - co_invalidate_cache
+ *
+ * Out of scope?
+ * - create
+ * - truncate
+ */
+
+static BlockDriver bdrv_io_uring = {
+    .format_name                = "io_uring",
+    .protocol_name              = "io_uring",
+    .instance_size              = sizeof(BDRVBlkioState),
+    .bdrv_needs_filename        = true,
+    .bdrv_parse_filename        = blkio_parse_filename_io_uring,
+    .bdrv_file_open             = blkio_file_open,
+    .bdrv_close                 = blkio_close,
+    .bdrv_getlength             = blkio_getlength,
+    .bdrv_get_info              = blkio_get_info,
+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
+    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
+    .bdrv_aio_preadv            = blkio_aio_preadv,
+    .bdrv_aio_pwritev           = blkio_aio_pwritev,
+    .bdrv_aio_flush             = blkio_aio_flush,
+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
+    .bdrv_io_unplug             = blkio_io_unplug,
+    .bdrv_refresh_limits        = blkio_refresh_limits,
+};
+
+static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
+    .format_name                = "virtio-blk-vhost-vdpa",
+    .protocol_name              = "virtio-blk-vhost-vdpa",
+    .instance_size              = sizeof(BDRVBlkioState),
+    .bdrv_needs_filename        = true,
+    .bdrv_parse_filename        = blkio_parse_filename_virtio_blk_vhost_vdpa,
+    .bdrv_file_open             = blkio_file_open,
+    .bdrv_close                 = blkio_close,
+    .bdrv_getlength             = blkio_getlength,
+    .bdrv_get_info              = blkio_get_info,
+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
+    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
+    .bdrv_aio_preadv            = blkio_aio_preadv,
+    .bdrv_aio_pwritev           = blkio_aio_pwritev,
+    .bdrv_aio_flush             = blkio_aio_flush,
+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
+    .bdrv_io_unplug             = blkio_io_unplug,
+    .bdrv_refresh_limits        = blkio_refresh_limits,
+};
+
+static void bdrv_blkio_init(void)
+{
+    bdrv_register(&bdrv_io_uring);
+    bdrv_register(&bdrv_virtio_blk_vhost_vdpa);
+}
+
+block_init(bdrv_blkio_init);
diff --git a/tests/qtest/modules-test.c b/tests/qtest/modules-test.c
index 88217686e1..be2575ae6d 100644
--- a/tests/qtest/modules-test.c
+++ b/tests/qtest/modules-test.c
@@ -16,6 +16,9 @@ static void test_modules_load(const void *data)
 int main(int argc, char *argv[])
 {
     const char *modules[] = {
+#ifdef CONFIG_BLKIO
+        "block-", "blkio",
+#endif
 #ifdef CONFIG_CURL
         "block-", "curl",
 #endif
diff --git a/block/meson.build b/block/meson.build
index 0b2a60c99b..787667384a 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -92,6 +92,7 @@ block_modules = {}
 
 modsrc = []
 foreach m : [
+  [blkio, 'blkio', files('blkio.c')],
   [curl, 'curl', files('curl.c')],
   [glusterfs, 'gluster', files('gluster.c')],
   [libiscsi, 'iscsi', [files('iscsi.c'), libm]],
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index d0e14fd6de..fb0d559eb1 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -69,6 +69,7 @@ meson_options_help() {
   printf "%s\n" '  auth-pam        PAM access control'
   printf "%s\n" '  avx2            AVX2 optimizations'
   printf "%s\n" '  avx512f         AVX512F optimizations'
+  printf "%s\n" '  blkio           libblkio block device driver'
   printf "%s\n" '  bochs           bochs image format support'
   printf "%s\n" '  bpf             eBPF support'
   printf "%s\n" '  brlapi          brlapi character device driver'
@@ -198,6 +199,8 @@ _meson_option_parse() {
     --disable-gcov) printf "%s" -Db_coverage=false ;;
     --enable-lto) printf "%s" -Db_lto=true ;;
     --disable-lto) printf "%s" -Db_lto=false ;;
+    --enable-blkio) printf "%s" -Dblkio=enabled ;;
+    --disable-blkio) printf "%s" -Dblkio=disabled ;;
     --block-drv-ro-whitelist=*) quote_sh "-Dblock_drv_ro_whitelist=$2" ;;
     --block-drv-rw-whitelist=*) quote_sh "-Dblock_drv_rw_whitelist=$2" ;;
     --enable-block-drv-whitelist-in-tools) printf "%s" -Dblock_drv_whitelist_in_tools=true ;;
-- 
2.36.1




* [RFC v3 2/8] numa: call ->ram_block_removed() in ram_block_notifier_remove()
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 3/8] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang, David Hildenbrand

When a RAMBlockNotifier is added, ->ram_block_added() is called with all
existing RAMBlocks. There is no equivalent ->ram_block_removed() call
when a RAMBlockNotifier is removed.

The util/vfio-helpers.c code (the sole user of RAMBlockNotifier) is fine
with this asymmetry because it does not rely on RAMBlockNotifier for
cleanup. It walks its internal list of DMA mappings and unmaps them by
itself.

Future users of RAMBlockNotifier may not have an internal data structure
that records added RAMBlocks so they will need ->ram_block_removed()
callbacks.

This patch makes ram_block_notifier_remove() symmetric with respect to
callbacks. Now util/vfio-helpers.c needs to unmap remaining DMA mappings
after ram_block_notifier_remove() has been called. This is necessary
since users like block/nvme.c may create additional DMA mappings that do
not originate from the RAMBlockNotifier.
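
For illustration, the now-symmetric behavior looks like this (a sketch with
hypothetical my_added()/my_removed() callbacks, not code from this patch):

  static RAMBlockNotifier n = {
      .ram_block_added = my_added,
      .ram_block_removed = my_removed,
  };

  ram_block_notifier_add(&n);    /* my_added() fires for all existing RAMBlocks */
  ...
  ram_block_notifier_remove(&n); /* my_removed() now fires for remaining RAMBlocks */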

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/core/numa.c      | 17 +++++++++++++++++
 util/vfio-helpers.c |  5 ++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/hw/core/numa.c b/hw/core/numa.c
index 26d8e5f616..31e6fe1caa 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -822,6 +822,19 @@ static int ram_block_notify_add_single(RAMBlock *rb, void *opaque)
     return 0;
 }
 
+static int ram_block_notify_remove_single(RAMBlock *rb, void *opaque)
+{
+    const ram_addr_t max_size = qemu_ram_get_max_length(rb);
+    const ram_addr_t size = qemu_ram_get_used_length(rb);
+    void *host = qemu_ram_get_host_addr(rb);
+    RAMBlockNotifier *notifier = opaque;
+
+    if (host) {
+        notifier->ram_block_removed(notifier, host, size, max_size);
+    }
+    return 0;
+}
+
 void ram_block_notifier_add(RAMBlockNotifier *n)
 {
     QLIST_INSERT_HEAD(&ram_list.ramblock_notifiers, n, next);
@@ -835,6 +848,10 @@ void ram_block_notifier_add(RAMBlockNotifier *n)
 void ram_block_notifier_remove(RAMBlockNotifier *n)
 {
     QLIST_REMOVE(n, next);
+
+    if (n->ram_block_removed) {
+        qemu_ram_foreach_block(ram_block_notify_remove_single, n);
+    }
 }
 
 void ram_block_notify_add(void *host, size_t size, size_t max_size)
diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
index 5ba01177bf..0d1520caac 100644
--- a/util/vfio-helpers.c
+++ b/util/vfio-helpers.c
@@ -847,10 +847,13 @@ void qemu_vfio_close(QEMUVFIOState *s)
     if (!s) {
         return;
     }
+
+    ram_block_notifier_remove(&s->ram_notifier);
+
     for (i = 0; i < s->nr_mappings; ++i) {
         qemu_vfio_undo_mapping(s, &s->mappings[i], NULL);
     }
-    ram_block_notifier_remove(&s->ram_notifier);
+
     g_free(s->usable_iova_ranges);
     s->nb_iova_ranges = 0;
     qemu_vfio_reset(s);
-- 
2.36.1




* [RFC v3 3/8] block: pass size to bdrv_unregister_buf()
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 2/8] numa: call ->ram_block_removed() in ram_block_notifier_remove() Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-13 14:08   ` Hanna Reitz
  2022-07-08  4:17 ` [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

The only implementor of bdrv_register_buf() is block/nvme.c, where the
size is not needed when unregistering a buffer. This is because
util/vfio-helpers.c can look up mappings by address.

Future block drivers that implement bdrv_register_buf() may not be able
to do their job given only the buffer address. Add a size argument to
bdrv_unregister_buf().

Also document the assumptions about
bdrv_register_buf()/bdrv_unregister_buf() calls. The same <host, size>
values that were given to bdrv_register_buf() must be given to
bdrv_unregister_buf().
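
A minimal sketch of the contract, assuming a BlockBackend "blk" and an I/O
buffer "buf" of "size" bytes:

  blk_register_buf(blk, buf, size);
  ...
  /* must pass the same <host, size> pair that was registered */
  blk_unregister_buf(blk, buf, size);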

gcc 11.2.1 emits a spurious warning that img_bench()'s buf_size local
variable might be uninitialized, so initialize it to zero to silence the
compiler.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-global-state.h          | 5 ++++-
 include/block/block_int-common.h            | 2 +-
 include/sysemu/block-backend-global-state.h | 2 +-
 block/block-backend.c                       | 4 ++--
 block/io.c                                  | 6 +++---
 block/nvme.c                                | 2 +-
 qemu-img.c                                  | 4 ++--
 7 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/block/block-global-state.h b/include/block/block-global-state.h
index 21265e3966..7901f35863 100644
--- a/include/block/block-global-state.h
+++ b/include/block/block-global-state.h
@@ -243,9 +243,12 @@ void bdrv_del_child(BlockDriverState *parent, BdrvChild *child, Error **errp);
  * Register/unregister a buffer for I/O. For example, VFIO drivers are
  * interested to know the memory areas that would later be used for I/O, so
  * that they can prepare IOMMU mapping etc., to get better performance.
+ *
+ * Buffers must not overlap and they must be unregistered with the same <host,
+ * size> values that they were registered with.
  */
 void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
-void bdrv_unregister_buf(BlockDriverState *bs, void *host);
+void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size);
 
 void bdrv_cancel_in_flight(BlockDriverState *bs);
 
diff --git a/include/block/block_int-common.h b/include/block/block_int-common.h
index 8947abab76..b7a7cbd3a5 100644
--- a/include/block/block_int-common.h
+++ b/include/block/block_int-common.h
@@ -435,7 +435,7 @@ struct BlockDriver {
      * DMA mapping for hot buffers.
      */
     void (*bdrv_register_buf)(BlockDriverState *bs, void *host, size_t size);
-    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host);
+    void (*bdrv_unregister_buf)(BlockDriverState *bs, void *host, size_t size);
 
     /*
      * This field is modified only under the BQL, and is part of
diff --git a/include/sysemu/block-backend-global-state.h b/include/sysemu/block-backend-global-state.h
index 415f0c91d7..97f7dad2c3 100644
--- a/include/sysemu/block-backend-global-state.h
+++ b/include/sysemu/block-backend-global-state.h
@@ -107,7 +107,7 @@ void blk_io_limits_update_group(BlockBackend *blk, const char *group);
 void blk_set_force_allow_inactivate(BlockBackend *blk);
 
 void blk_register_buf(BlockBackend *blk, void *host, size_t size);
-void blk_unregister_buf(BlockBackend *blk, void *host);
+void blk_unregister_buf(BlockBackend *blk, void *host, size_t size);
 
 const BdrvChild *blk_root(BlockBackend *blk);
 
diff --git a/block/block-backend.c b/block/block-backend.c
index f425b00793..44f7c61e0b 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -2581,10 +2581,10 @@ void blk_register_buf(BlockBackend *blk, void *host, size_t size)
     bdrv_register_buf(blk_bs(blk), host, size);
 }
 
-void blk_unregister_buf(BlockBackend *blk, void *host)
+void blk_unregister_buf(BlockBackend *blk, void *host, size_t size)
 {
     GLOBAL_STATE_CODE();
-    bdrv_unregister_buf(blk_bs(blk), host);
+    bdrv_unregister_buf(blk_bs(blk), host, size);
 }
 
 int coroutine_fn blk_co_copy_range(BlockBackend *blk_in, int64_t off_in,
diff --git a/block/io.c b/block/io.c
index 1e9bf09a49..e7f4117fe7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3350,16 +3350,16 @@ void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size)
     }
 }
 
-void bdrv_unregister_buf(BlockDriverState *bs, void *host)
+void bdrv_unregister_buf(BlockDriverState *bs, void *host, size_t size)
 {
     BdrvChild *child;
 
     GLOBAL_STATE_CODE();
     if (bs->drv && bs->drv->bdrv_unregister_buf) {
-        bs->drv->bdrv_unregister_buf(bs, host);
+        bs->drv->bdrv_unregister_buf(bs, host, size);
     }
     QLIST_FOREACH(child, &bs->children, next) {
-        bdrv_unregister_buf(child->bs, host);
+        bdrv_unregister_buf(child->bs, host, size);
     }
 }
 
diff --git a/block/nvme.c b/block/nvme.c
index 01fb28aa63..696502acea 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1592,7 +1592,7 @@ static void nvme_register_buf(BlockDriverState *bs, void *host, size_t size)
     }
 }
 
-static void nvme_unregister_buf(BlockDriverState *bs, void *host)
+static void nvme_unregister_buf(BlockDriverState *bs, void *host, size_t size)
 {
     BDRVNVMeState *s = bs->opaque;
 
diff --git a/qemu-img.c b/qemu-img.c
index 4cf4d2423d..b7ffc37a49 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -4368,7 +4368,7 @@ static int img_bench(int argc, char **argv)
     struct timeval t1, t2;
     int i;
     bool force_share = false;
-    size_t buf_size;
+    size_t buf_size = 0;
 
     for (;;) {
         static const struct option long_options[] = {
@@ -4590,7 +4590,7 @@ static int img_bench(int argc, char **argv)
 
 out:
     if (data.buf) {
-        blk_unregister_buf(blk, data.buf);
+        blk_unregister_buf(blk, data.buf, buf_size);
     }
     qemu_vfree(data.buf);
     blk_unref(blk);
-- 
2.36.1




* [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2022-07-08  4:17 ` [RFC v3 3/8] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-14  8:54   ` Hanna Reitz
  2022-07-08  4:17 ` [RFC v3 5/8] block: add BlockRAMRegistrar Stefan Hajnoczi
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

Block drivers may optimize I/O requests accessing buffers previously
registered with bdrv_register_buf(). Checking whether all elements of a
request's QEMUIOVector are within previously registered buffers is
expensive, so we need a hint from the user to avoid costly checks.

Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
QEMUIOVector elements in an I/O request are known to be within
previously registered buffers.

bdrv_aligned_preadv() is strict in validating supported read flags and
its assertions fail when it sees BDRV_REQ_REGISTERED_BUF. There is no
harm in passing BDRV_REQ_REGISTERED_BUF to block drivers that do not
support it, so update the assertions to ignore BDRV_REQ_REGISTERED_BUF.

Care must be taken to clear the flag when the block layer or filter
drivers replace QEMUIOVector elements with bounce buffers since these
have not been registered with bdrv_register_buf(). A lot of the changes
in this commit deal with clearing the flag in those cases.
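
The pattern applied in those cases (see the block/mirror.c and
block/raw-format.c hunks below) is, in sketch form:

  if (must_use_bounce_buffer) {
      /* the bounce buffer was never registered, so drop the hint */
      qiov = &bounce_qiov;
      flags &= ~BDRV_REQ_REGISTERED_BUF;
  }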

Ensuring that the flag is cleared properly is somewhat invasive to
implement across the block layer and it's hard to spot when future code
changes accidentally break it. Another option might be to add a flag to
QEMUIOVector itself and clear it in qemu_iovec_*() functions that modify
elements. That is more robust but somewhat of a layering violation, so I
haven't attempted that.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/block/block-common.h |  9 +++++++++
 block/blkverify.c            |  4 ++--
 block/crypto.c               |  2 ++
 block/io.c                   | 30 +++++++++++++++++++++++-------
 block/mirror.c               |  2 ++
 block/raw-format.c           |  2 ++
 6 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/include/block/block-common.h b/include/block/block-common.h
index fdb7306e78..061606e867 100644
--- a/include/block/block-common.h
+++ b/include/block/block-common.h
@@ -80,6 +80,15 @@ typedef enum {
      */
     BDRV_REQ_MAY_UNMAP          = 0x4,
 
+    /*
+     * An optimization hint when all QEMUIOVector elements are within
+     * previously registered bdrv_register_buf() memory ranges.
+     *
+     * Code that replaces the user's QEMUIOVector elements with bounce buffers
+     * must take care to clear this flag.
+     */
+    BDRV_REQ_REGISTERED_BUF     = 0x8,
+
     BDRV_REQ_FUA                = 0x10,
     BDRV_REQ_WRITE_COMPRESSED   = 0x20,
 
diff --git a/block/blkverify.c b/block/blkverify.c
index e4a37af3b2..d624f4fd05 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -235,8 +235,8 @@ blkverify_co_preadv(BlockDriverState *bs, int64_t offset, int64_t bytes,
     qemu_iovec_init(&raw_qiov, qiov->niov);
     qemu_iovec_clone(&raw_qiov, qiov, buf);
 
-    ret = blkverify_co_prwv(bs, &r, offset, bytes, qiov, &raw_qiov, flags,
-                            false);
+    ret = blkverify_co_prwv(bs, &r, offset, bytes, qiov, &raw_qiov,
+                            flags & ~BDRV_REQ_REGISTERED_BUF, false);
 
     cmp_offset = qemu_iovec_compare(qiov, &raw_qiov);
     if (cmp_offset != -1) {
diff --git a/block/crypto.c b/block/crypto.c
index 1ba82984ef..c900355adb 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -473,6 +473,8 @@ block_crypto_co_pwritev(BlockDriverState *bs, int64_t offset, int64_t bytes,
     uint64_t sector_size = qcrypto_block_get_sector_size(crypto->block);
     uint64_t payload_offset = qcrypto_block_get_payload_offset(crypto->block);
 
+    flags &= ~BDRV_REQ_REGISTERED_BUF;
+
     assert(!(flags & ~BDRV_REQ_FUA));
     assert(payload_offset < INT64_MAX);
     assert(QEMU_IS_ALIGNED(offset, sector_size));
diff --git a/block/io.c b/block/io.c
index e7f4117fe7..83b8259227 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1541,11 +1541,14 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
     max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
                                    align);
 
-    /* TODO: We would need a per-BDS .supported_read_flags and
+    /*
+     * TODO: We would need a per-BDS .supported_read_flags and
      * potential fallback support, if we ever implement any read flags
      * to pass through to drivers.  For now, there aren't any
-     * passthrough flags.  */
-    assert(!(flags & ~(BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH)));
+     * passthrough flags except the BDRV_REQ_REGISTERED_BUF optimization hint.
+     */
+    assert(!(flags & ~(BDRV_REQ_COPY_ON_READ | BDRV_REQ_PREFETCH |
+                       BDRV_REQ_REGISTERED_BUF)));
 
     /* Handle Copy on Read and associated serialisation */
     if (flags & BDRV_REQ_COPY_ON_READ) {
@@ -1586,7 +1589,7 @@ static int coroutine_fn bdrv_aligned_preadv(BdrvChild *child,
         goto out;
     }
 
-    assert(!(flags & ~bs->supported_read_flags));
+    assert(!(flags & ~(bs->supported_read_flags | BDRV_REQ_REGISTERED_BUF)));
 
     max_bytes = ROUND_UP(MAX(0, total_bytes - offset), align);
     if (bytes <= max_bytes && bytes <= max_transfer) {
@@ -1775,7 +1778,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
-                            BdrvRequestPadding *pad, bool *padded)
+                            BdrvRequestPadding *pad, bool *padded,
+                            BdrvRequestFlags *flags)
 {
     int ret;
 
@@ -1803,6 +1807,10 @@ static int bdrv_pad_request(BlockDriverState *bs,
     if (padded) {
         *padded = true;
     }
+    if (flags) {
+        /* Can't use optimization hint with bounce buffer */
+        *flags &= ~BDRV_REQ_REGISTERED_BUF;
+    }
 
     return 0;
 }
@@ -1857,7 +1865,7 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
     }
 
     ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL);
+                           NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -1902,6 +1910,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
         return -ENOTSUP;
     }
 
+    /* By definition there is no user buffer so this flag doesn't make sense */
+    if (flags & BDRV_REQ_REGISTERED_BUF) {
+        return -EINVAL;
+    }
+
     /* Invalidate the cached block-status data range if this write overlaps */
     bdrv_bsc_invalidate_range(bs, offset, bytes);
 
@@ -2187,6 +2200,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
     bool padding;
     BdrvRequestPadding pad;
 
+    /* This flag doesn't make sense for padding or zero writes */
+    flags &= ~BDRV_REQ_REGISTERED_BUF;
+
     padding = bdrv_init_padding(bs, offset, bytes, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
@@ -2304,7 +2320,7 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
          * alignment only if there is no ZERO flag.
          */
         ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded);
+                               &padded, &flags);
         if (ret < 0) {
             return ret;
         }
diff --git a/block/mirror.c b/block/mirror.c
index 3c4ab1159d..8d3fc3f19b 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1477,6 +1477,8 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
         qemu_iovec_init(&bounce_qiov, 1);
         qemu_iovec_add(&bounce_qiov, bounce_buf, bytes);
         qiov = &bounce_qiov;
+
+        flags &= ~BDRV_REQ_REGISTERED_BUF;
     }
 
     ret = bdrv_mirror_top_do_write(bs, MIRROR_METHOD_COPY, offset, bytes, qiov,
diff --git a/block/raw-format.c b/block/raw-format.c
index 69fd650eaf..9bae3dd7f2 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -258,6 +258,8 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, int64_t offset,
         qemu_iovec_add(&local_qiov, buf, 512);
         qemu_iovec_concat(&local_qiov, qiov, 512, qiov->size - 512);
         qiov = &local_qiov;
+
+        flags &= ~BDRV_REQ_REGISTERED_BUF;
     }
 
     ret = raw_adjust_offset(bs, &offset, bytes, true);
-- 
2.36.1




* [RFC v3 5/8] block: add BlockRAMRegistrar
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2022-07-08  4:17 ` [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-14  9:30   ` Hanna Reitz
  2022-07-08  4:17 ` [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd() Stefan Hajnoczi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

Emulated devices and other BlockBackend users wishing to take advantage
of blk_register_buf() all have the same repetitive job: register
RAMBlocks with the BlockBackend using RAMBlockNotifier.

Add a BlockRAMRegistrar API to do this. A later commit will use this
from hw/block/virtio-blk.c.
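
A minimal usage sketch (mirrors the API added below; a device would embed the
registrar in its state and pass its own BlockBackend):

  BlockRAMRegistrar registrar;

  blk_ram_registrar_init(&registrar, blk); /* covers existing + hotplugged RAM */
  ...
  /* guest RAM is now registered; I/O may set BDRV_REQ_REGISTERED_BUF */
  ...
  blk_ram_registrar_destroy(&registrar);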

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                          |  1 +
 include/sysemu/block-ram-registrar.h | 30 +++++++++++++++++++++
 block/block-ram-registrar.c          | 39 ++++++++++++++++++++++++++++
 block/meson.build                    |  1 +
 4 files changed, 71 insertions(+)
 create mode 100644 include/sysemu/block-ram-registrar.h
 create mode 100644 block/block-ram-registrar.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 50f340d9ee..d16189449f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2490,6 +2490,7 @@ F: block*
 F: block/
 F: hw/block/
 F: include/block/
+F: include/sysemu/block-*.h
 F: qemu-img*
 F: docs/tools/qemu-img.rst
 F: qemu-io*
diff --git a/include/sysemu/block-ram-registrar.h b/include/sysemu/block-ram-registrar.h
new file mode 100644
index 0000000000..09d63f64b2
--- /dev/null
+++ b/include/sysemu/block-ram-registrar.h
@@ -0,0 +1,30 @@
+/*
+ * BlockBackend RAM Registrar
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef BLOCK_RAM_REGISTRAR_H
+#define BLOCK_RAM_REGISTRAR_H
+
+#include "exec/ramlist.h"
+
+/**
+ * struct BlockRAMRegistrar:
+ *
+ * Keeps RAMBlock memory registered with a BlockBackend using
+ * blk_register_buf() including hotplugged memory.
+ *
+ * Emulated devices or other BlockBackend users initialize a BlockRAMRegistrar
+ * with blk_ram_registrar_init() before submitting I/O requests with the
+ * BDRV_REQ_REGISTERED_BUF flag set.
+ */
+typedef struct {
+    BlockBackend *blk;
+    RAMBlockNotifier notifier;
+} BlockRAMRegistrar;
+
+void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk);
+void blk_ram_registrar_destroy(BlockRAMRegistrar *r);
+
+#endif /* BLOCK_RAM_REGISTRAR_H */
diff --git a/block/block-ram-registrar.c b/block/block-ram-registrar.c
new file mode 100644
index 0000000000..32a14b69ae
--- /dev/null
+++ b/block/block-ram-registrar.c
@@ -0,0 +1,39 @@
+/*
+ * BlockBackend RAM Registrar
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/block-backend.h"
+#include "sysemu/block-ram-registrar.h"
+
+static void ram_block_added(RAMBlockNotifier *n, void *host, size_t size,
+                            size_t max_size)
+{
+    BlockRAMRegistrar *r = container_of(n, BlockRAMRegistrar, notifier);
+    blk_register_buf(r->blk, host, max_size);
+}
+
+static void ram_block_removed(RAMBlockNotifier *n, void *host, size_t size,
+                              size_t max_size)
+{
+    BlockRAMRegistrar *r = container_of(n, BlockRAMRegistrar, notifier);
+    blk_unregister_buf(r->blk, host, max_size);
+}
+
+void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk)
+{
+    r->blk = blk;
+    r->notifier = (RAMBlockNotifier){
+        .ram_block_added = ram_block_added,
+        .ram_block_removed = ram_block_removed,
+    };
+
+    ram_block_notifier_add(&r->notifier);
+}
+
+void blk_ram_registrar_destroy(BlockRAMRegistrar *r)
+{
+    ram_block_notifier_remove(&r->notifier);
+}
diff --git a/block/meson.build b/block/meson.build
index 787667384a..b315593054 100644
--- a/block/meson.build
+++ b/block/meson.build
@@ -46,6 +46,7 @@ block_ss.add(files(
 ), zstd, zlib, gnutls)
 
 softmmu_ss.add(when: 'CONFIG_TCG', if_true: files('blkreplay.c'))
+softmmu_ss.add(files('block-ram-registrar.c'))
 
 if get_option('qcow1').allowed()
   block_ss.add(files('qcow.c'))
-- 
2.36.1




* [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd()
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2022-07-08  4:17 ` [RFC v3 5/8] block: add BlockRAMRegistrar Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-14  9:39   ` Hanna Reitz
  2022-07-08  4:17 ` [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
  2022-07-08  4:17 ` [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
  7 siblings, 1 reply; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

The blkio block driver will need to look up the file descriptor for a
given pointer. This is possible in softmmu builds where the memory API
is available for querying guest RAM.

Add stubs so tools like qemu-img that link the block layer still build
successfully. In that case there is no guest RAM, but that is fine:
bounce buffers and their file descriptors will be allocated with
libblkio's blkio_alloc_mem_region(), so we won't rely on QEMU's
memory_region_get_fd().
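
For reference, the lookup in the blkio driver boils down to this
(condensed from a later patch in this series):

  int fd = -1;
  ram_addr_t offset;
  MemoryRegion *mr = memory_region_from_host(host, &offset);

  if (mr) {
      fd = memory_region_get_fd(mr);
  }
  if (fd == -1) {
      return; /* no guest RAM here, fall back to bounce buffers */
  }

With the stubs, memory_region_from_host() returns NULL and the fallback
path is taken.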

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 stubs/memory.c    | 13 +++++++++++++
 stubs/meson.build |  1 +
 2 files changed, 14 insertions(+)
 create mode 100644 stubs/memory.c

diff --git a/stubs/memory.c b/stubs/memory.c
new file mode 100644
index 0000000000..e9ec4e384b
--- /dev/null
+++ b/stubs/memory.c
@@ -0,0 +1,13 @@
+#include "qemu/osdep.h"
+#include "exec/memory.h"
+
+MemoryRegion *memory_region_from_host(void *host, ram_addr_t *offset)
+{
+    return NULL;
+}
+
+int memory_region_get_fd(MemoryRegion *mr)
+{
+    return -1;
+}
+
diff --git a/stubs/meson.build b/stubs/meson.build
index d8f3fd5c44..fbd3dfa7b4 100644
--- a/stubs/meson.build
+++ b/stubs/meson.build
@@ -25,6 +25,7 @@ stub_ss.add(files('is-daemonized.c'))
 if libaio.found()
   stub_ss.add(files('linux-aio.c'))
 endif
+stub_ss.add(files('memory.c'))
 stub_ss.add(files('migr-blocker.c'))
 stub_ss.add(files('module-opts.c'))
 stub_ss.add(files('monitor.c'))
-- 
2.36.1




* [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2022-07-08  4:17 ` [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd() Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-12 14:28   ` Stefano Garzarella
  2022-07-14 10:13   ` Hanna Reitz
  2022-07-08  4:17 ` [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
  7 siblings, 2 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

Avoid bounce buffers when QEMUIOVector elements are within previously
registered bdrv_register_buf() buffers.

The idea is that emulated storage controllers will register guest RAM
using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF flag on
I/O requests. Therefore no blkio_map_mem_region() calls are necessary
in the performance-critical I/O code path.

This optimization doesn't apply if the I/O buffer is internally
allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
path because BDRV_REQ_REGISTERED_BUF is not set.
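
Note that the optimization also requires the buffer to satisfy the
driver's "mem-region-alignment". The check used below reduces to this
standalone sketch (assuming the alignment is a power of two):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  static bool mem_region_is_aligned(const void *host, size_t size,
                                    size_t alignment)
  {
      /* For power-of-two alignments, OR-ing the address and the length
       * lets a single modulo test check both at once.
       */
      return (((uintptr_t)host | size) % alignment) == 0;
  }

Unaligned buffers are simply skipped at registration time (see
blkio_register_buf() below).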

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/blkio.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 101 insertions(+), 3 deletions(-)

diff --git a/block/blkio.c b/block/blkio.c
index 7fbdbd7fae..37d593a20c 100644
--- a/block/blkio.c
+++ b/block/blkio.c
@@ -1,7 +1,9 @@
 #include "qemu/osdep.h"
 #include <blkio.h>
 #include "block/block_int.h"
+#include "exec/memory.h"
 #include "qapi/error.h"
+#include "qemu/error-report.h"
 #include "qapi/qmp/qdict.h"
 #include "qemu/module.h"
 
@@ -28,6 +30,9 @@ typedef struct {
 
     /* Do we need to add/delete blkio_mem_regions? */
     bool needs_mem_regions;
+
+    /* Are file descriptors necessary for blkio_mem_regions? */
+    bool needs_mem_region_fd;
 } BDRVBlkioState;
 
 static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
@@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
         BlockCompletionFunc *cb, void *opaque)
 {
     BDRVBlkioState *s = bs->opaque;
+    bool needs_mem_regions =
+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
     struct iovec *iov = qiov->iov;
     int iovcnt = qiov->niov;
     BlkioAIOCB *acb;
@@ -206,7 +213,7 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
 
     acb = blkio_aiocb_get(bs, cb, opaque);
 
-    if (s->needs_mem_regions) {
+    if (needs_mem_regions) {
         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
             qemu_aio_unref(&acb->common);
             return NULL;
@@ -230,6 +237,8 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
 {
     uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
     BDRVBlkioState *s = bs->opaque;
+    bool needs_mem_regions =
+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
     struct iovec *iov = qiov->iov;
     int iovcnt = qiov->niov;
     BlkioAIOCB *acb;
@@ -238,7 +247,7 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
 
     acb = blkio_aiocb_get(bs, cb, opaque);
 
-    if (s->needs_mem_regions) {
+    if (needs_mem_regions) {
         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
             qemu_aio_unref(&acb->common);
             return NULL;
@@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
     }
 }
 
+static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
+{
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+    struct blkio_mem_region region = (struct blkio_mem_region){
+        .addr = host,
+        .len = size,
+        .fd = -1,
+    };
+
+    if (((uintptr_t)host | size) % s->mem_region_alignment) {
+        error_report_once("%s: skipping unaligned buf %p with size %zu",
+                          __func__, host, size);
+        return; /* skip unaligned */
+    }
+
+    /* Attempt to find the fd for a MemoryRegion */
+    if (s->needs_mem_region_fd) {
+        int fd = -1;
+        ram_addr_t offset;
+        MemoryRegion *mr;
+
+        /*
+         * bdrv_register_buf() is called with the BQL held so mr lives at least
+         * until this function returns.
+         */
+        mr = memory_region_from_host(host, &offset);
+        if (mr) {
+            fd = memory_region_get_fd(mr);
+        }
+        if (fd == -1) {
+            error_report_once("%s: skipping fd-less buf %p with size %zu",
+                              __func__, host, size);
+            return; /* skip if there is no fd */
+        }
+
+        region.fd = fd;
+        region.fd_offset = offset;
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        ret = blkio_map_mem_region(s->blkio, &region);
+    }
+
+    if (ret < 0) {
+        error_report_once("Failed to add blkio mem region %p with size %zu: %s",
+                          host, size, blkio_get_error_msg());
+    }
+}
+
+static void blkio_unregister_buf(BlockDriverState *bs, void *host, size_t size)
+{
+    BDRVBlkioState *s = bs->opaque;
+    int ret;
+    struct blkio_mem_region region = (struct blkio_mem_region){
+        .addr = host,
+        .len = size,
+        .fd = -1,
+    };
+
+    if (((uintptr_t)host | size) % s->mem_region_alignment) {
+        return; /* skip unaligned */
+    }
+
+    WITH_QEMU_LOCK_GUARD(&s->lock) {
+        ret = blkio_unmap_mem_region(s->blkio, &region);
+    }
+
+    if (ret < 0) {
+        error_report_once("Failed to delete blkio mem region %p with size %zu: %s",
+                          host, size, blkio_get_error_msg());
+    }
+}
+
 static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
                                           Error **errp)
 {
@@ -440,6 +523,17 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
         return ret;
     }
 
+    ret = blkio_get_bool(s->blkio,
+                         "needs-mem-region-fd",
+                         &s->needs_mem_region_fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "failed to get needs-mem-region-fd: %s",
+                         blkio_get_error_msg());
+        blkio_destroy(&s->blkio);
+        return ret;
+    }
+
     ret = blkio_get_uint64(s->blkio,
                            "mem-region-alignment",
                            &s->mem_region_alignment);
@@ -459,7 +553,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
         return ret;
     }
 
-    bs->supported_write_flags = BDRV_REQ_FUA;
+    bs->supported_write_flags = BDRV_REQ_FUA | BDRV_REQ_REGISTERED_BUF;
     bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
                                BDRV_REQ_NO_FALLBACK;
 
@@ -627,6 +721,8 @@ static BlockDriver bdrv_io_uring = {
     .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
     .bdrv_io_unplug             = blkio_io_unplug,
     .bdrv_refresh_limits        = blkio_refresh_limits,
+    .bdrv_register_buf          = blkio_register_buf,
+    .bdrv_unregister_buf        = blkio_unregister_buf,
 };
 
 static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
@@ -648,6 +744,8 @@ static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
     .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
     .bdrv_io_unplug             = blkio_io_unplug,
     .bdrv_refresh_limits        = blkio_refresh_limits,
+    .bdrv_register_buf          = blkio_register_buf,
+    .bdrv_unregister_buf        = blkio_unregister_buf,
 };
 
 static void bdrv_blkio_init(void)
-- 
2.36.1




* [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
  2022-07-08  4:17 [RFC v3 0/8] blkio: add libblkio BlockDriver Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2022-07-08  4:17 ` [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
@ 2022-07-08  4:17 ` Stefan Hajnoczi
  2022-07-14 10:16   ` Hanna Reitz
  7 siblings, 1 reply; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-07-08  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alberto Faria, Stefan Hajnoczi, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

Register guest RAM using BlockRAMRegistrar and set the
BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
accesses in I/O requests.

This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
on DMA mapping/unmapping.
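
The virtio-blk side is small because the infrastructure from the
previous patches does the work; in essence (condensed from the diff
below):

  /* At realize time: keep guest RAM registered with the BlockBackend */
  blk_ram_registrar_init(&s->blk_ram_registrar, s->blk);

  /* Per request: hint that the buffers are already registered */
  blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov,
                  BDRV_REQ_REGISTERED_BUF, virtio_blk_rw_complete,
                  mrb->reqs[start]);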

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/hw/virtio/virtio-blk.h |  2 ++
 hw/block/virtio-blk.c          | 13 +++++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index d311c57cca..7f589b4146 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -19,6 +19,7 @@
 #include "hw/block/block.h"
 #include "sysemu/iothread.h"
 #include "sysemu/block-backend.h"
+#include "sysemu/block-ram-registrar.h"
 #include "qom/object.h"
 
 #define TYPE_VIRTIO_BLK "virtio-blk-device"
@@ -64,6 +65,7 @@ struct VirtIOBlock {
     struct VirtIOBlockDataPlane *dataplane;
     uint64_t host_features;
     size_t config_size;
+    BlockRAMRegistrar blk_ram_registrar;
 };
 
 typedef struct VirtIOBlockReq {
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e9ba752f6b..41f8c73453 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -21,6 +21,7 @@
 #include "hw/block/block.h"
 #include "hw/qdev-properties.h"
 #include "sysemu/blockdev.h"
+#include "sysemu/block-ram-registrar.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/runstate.h"
 #include "hw/virtio/virtio-blk.h"
@@ -421,11 +422,13 @@ static inline void submit_requests(BlockBackend *blk, MultiReqBuffer *mrb,
     }
 
     if (is_write) {
-        blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov, 0,
-                        virtio_blk_rw_complete, mrb->reqs[start]);
+        blk_aio_pwritev(blk, sector_num << BDRV_SECTOR_BITS, qiov,
+                        BDRV_REQ_REGISTERED_BUF, virtio_blk_rw_complete,
+                        mrb->reqs[start]);
     } else {
-        blk_aio_preadv(blk, sector_num << BDRV_SECTOR_BITS, qiov, 0,
-                       virtio_blk_rw_complete, mrb->reqs[start]);
+        blk_aio_preadv(blk, sector_num << BDRV_SECTOR_BITS, qiov,
+                       BDRV_REQ_REGISTERED_BUF, virtio_blk_rw_complete,
+                       mrb->reqs[start]);
     }
 }
 
@@ -1227,6 +1230,7 @@ static void virtio_blk_device_realize(DeviceState *dev, Error **errp)
     }
 
     s->change = qemu_add_vm_change_state_handler(virtio_blk_dma_restart_cb, s);
+    blk_ram_registrar_init(&s->blk_ram_registrar, s->blk);
     blk_set_dev_ops(s->blk, &virtio_block_ops, s);
 
     blk_iostatus_enable(s->blk);
@@ -1252,6 +1256,7 @@ static void virtio_blk_device_unrealize(DeviceState *dev)
         virtio_del_queue(vdev, i);
     }
     qemu_coroutine_dec_pool_size(conf->num_queues * conf->queue_size / 2);
+    blk_ram_registrar_destroy(&s->blk_ram_registrar);
     qemu_del_vm_change_state_handler(s->change);
     blockdev_mark_auto_del(s->blk);
     virtio_cleanup(vdev);
-- 
2.36.1




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
@ 2022-07-12 14:23   ` Stefano Garzarella
  2022-08-11 16:51     ` Stefan Hajnoczi
  2022-07-13 12:05   ` Hanna Reitz
  2022-07-27 19:33   ` Kevin Wolf
  2 siblings, 1 reply; 29+ messages in thread
From: Stefano Garzarella @ 2022-07-12 14:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

On Fri, Jul 08, 2022 at 05:17:30AM +0100, Stefan Hajnoczi wrote:
>libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
>high-performance disk I/O. It currently supports io_uring and
>virtio-blk-vhost-vdpa with additional drivers under development.
>
>One of the reasons for developing libblkio is that other applications
>besides QEMU can use it. This will be particularly useful for
>vhost-user-blk which applications may wish to use for connecting to
>qemu-storage-daemon.
>
>libblkio also gives us an opportunity to develop in Rust behind a C API
>that is easy to consume from QEMU.
>
>This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
>using libblkio. It will be easy to add other libblkio drivers since they
>will share the majority of code.
>
>For now I/O buffers are copied through bounce buffers if the libblkio
>driver requires it. Later commits add an optimization for
>pre-registering guest RAM to avoid bounce buffers.
>
>The syntax is:
>
>  --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
>
>and:
>
>  --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
>
>Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>---
> MAINTAINERS                   |   6 +
> meson_options.txt             |   2 +
> qapi/block-core.json          |  37 +-
> meson.build                   |   9 +
> block/blkio.c                 | 659 ++++++++++++++++++++++++++++++++++
> tests/qtest/modules-test.c    |   3 +
> block/meson.build             |   1 +
> scripts/meson-buildoptions.sh |   3 +
> 8 files changed, 718 insertions(+), 2 deletions(-)
> create mode 100644 block/blkio.c
>
>diff --git a/MAINTAINERS b/MAINTAINERS
>index 450abd0252..50f340d9ee 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -3395,6 +3395,12 @@ L: qemu-block@nongnu.org
> S: Maintained
> F: block/vdi.c
>
>+blkio
>+M: Stefan Hajnoczi <stefanha@redhat.com>
>+L: qemu-block@nongnu.org
>+S: Maintained
>+F: block/blkio.c
>+
> iSCSI
> M: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> M: Paolo Bonzini <pbonzini@redhat.com>
>diff --git a/meson_options.txt b/meson_options.txt
>index 97c38109b1..b0b2e0c9b5 100644
>--- a/meson_options.txt
>+++ b/meson_options.txt
>@@ -117,6 +117,8 @@ option('bzip2', type : 'feature', value : 'auto',
>        description: 'bzip2 support for DMG images')
> option('cap_ng', type : 'feature', value : 'auto',
>        description: 'cap_ng support')
>+option('blkio', type : 'feature', value : 'auto',
>+       description: 'libblkio block device driver')
> option('bpf', type : 'feature', value : 'auto',
>         description: 'eBPF support')
> option('cocoa', type : 'feature', value : 'auto',
>diff --git a/qapi/block-core.json b/qapi/block-core.json
>index 2173e7734a..aa63d5e9bd 100644
>--- a/qapi/block-core.json
>+++ b/qapi/block-core.json
>@@ -2951,11 +2951,15 @@
>             'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
>             {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
>             {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
>-            'http', 'https', 'iscsi',
>+            'http', 'https',
>+            { 'name': 'io_uring', 'if': 'CONFIG_BLKIO' },
>+            'iscsi',
>             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
>             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
>             { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
>-            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
>+            'ssh', 'throttle', 'vdi', 'vhdx',
>+            { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
>+            'vmdk', 'vpc', 'vvfat' ] }
>
> ##
> # @BlockdevOptionsFile:
>@@ -3678,6 +3682,30 @@
>             '*debug': 'int',
>             '*logfile': 'str' } }
>
>+##
>+# @BlockdevOptionsIoUring:
>+#
>+# Driver specific block device options for the io_uring backend.
>+#
>+# @filename: path to the image file
>+#
>+# Since: 7.1
>+##
>+{ 'struct': 'BlockdevOptionsIoUring',
>+  'data': { 'filename': 'str' } }
>+
>+##
>+# @BlockdevOptionsVirtioBlkVhostVdpa:
>+#
>+# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
>+#
>+# @path: path to the vhost-vdpa character device.
>+#
>+# Since: 7.1
>+##
>+{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
>+  'data': { 'path': 'str' } }
>+
> ##
> # @IscsiTransport:
> #
>@@ -4305,6 +4333,8 @@
>                        'if': 'HAVE_HOST_BLOCK_DEVICE' },
>       'http':       'BlockdevOptionsCurlHttp',
>       'https':      'BlockdevOptionsCurlHttps',
>+      'io_uring':   { 'type': 'BlockdevOptionsIoUring',
>+                      'if': 'CONFIG_BLKIO' },
>       'iscsi':      'BlockdevOptionsIscsi',
>       'luks':       'BlockdevOptionsLUKS',
>       'nbd':        'BlockdevOptionsNbd',
>@@ -4327,6 +4357,9 @@
>       'throttle':   'BlockdevOptionsThrottle',
>       'vdi':        'BlockdevOptionsGenericFormat',
>       'vhdx':       'BlockdevOptionsGenericFormat',
>+      'virtio-blk-vhost-vdpa':
>+                    { 'type': 'BlockdevOptionsVirtioBlkVhostVdpa',
>+                      'if': 'CONFIG_BLKIO' },
>       'vmdk':       'BlockdevOptionsGenericCOWFormat',
>       'vpc':        'BlockdevOptionsGenericFormat',
>       'vvfat':      'BlockdevOptionsVVFAT'
>diff --git a/meson.build b/meson.build
>index bc5569ace1..f09b009428 100644
>--- a/meson.build
>+++ b/meson.build
>@@ -713,6 +713,13 @@ if not get_option('virglrenderer').auto() or have_system or have_vhost_user_gpu
>                      required: get_option('virglrenderer'),
>                      kwargs: static_kwargs)
> endif
>+blkio = not_found
>+if not get_option('blkio').auto() or have_block
>+  blkio = dependency('blkio',
>+                     method: 'pkg-config',
>+                     required: get_option('blkio'),
>+                     kwargs: static_kwargs)
>+endif
> curl = not_found
> if not get_option('curl').auto() or have_block
>   curl = dependency('libcurl', version: '>=7.29.0',
>@@ -1755,6 +1762,7 @@ config_host_data.set('CONFIG_LIBUDEV', libudev.found())
> config_host_data.set('CONFIG_LZO', lzo.found())
> config_host_data.set('CONFIG_MPATH', mpathpersist.found())
> config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
>+config_host_data.set('CONFIG_BLKIO', blkio.found())
> config_host_data.set('CONFIG_CURL', curl.found())
> config_host_data.set('CONFIG_CURSES', curses.found())
> config_host_data.set('CONFIG_GBM', gbm.found())
>@@ -3909,6 +3917,7 @@ summary_info += {'PAM':               pam}
> summary_info += {'iconv support':     iconv}
> summary_info += {'curses support':    curses}
> summary_info += {'virgl support':     virgl}
>+summary_info += {'blkio support':     blkio}
> summary_info += {'curl support':      curl}
> summary_info += {'Multipath support': mpathpersist}
> summary_info += {'PNG support':       png}
>diff --git a/block/blkio.c b/block/blkio.c
>new file mode 100644
>index 0000000000..7fbdbd7fae
>--- /dev/null
>+++ b/block/blkio.c
>@@ -0,0 +1,659 @@
>+#include "qemu/osdep.h"
>+#include <blkio.h>
>+#include "block/block_int.h"
>+#include "qapi/error.h"
>+#include "qapi/qmp/qdict.h"
>+#include "qemu/module.h"
>+
>+typedef struct BlkAIOCB {
>+    BlockAIOCB common;
>+    struct blkio_mem_region mem_region;
>+    QEMUIOVector qiov;
>+    struct iovec bounce_iov;
>+} BlkioAIOCB;
>+
>+typedef struct {
>+    /* Protects ->blkio and request submission on ->blkioq */
>+    QemuMutex lock;
>+
>+    struct blkio *blkio;
>+    struct blkioq *blkioq; /* this could be multi-queue in the future */
>+    int completion_fd;
>+
>+    /* Polling fetches the next completion into this field */
>+    struct blkio_completion poll_completion;
>+
>+    /* The value of the "mem-region-alignment" property */
>+    size_t mem_region_alignment;
>+
>+    /* Do we need to add/delete blkio_mem_regions? */
>+    bool needs_mem_regions;
>+} BDRVBlkioState;
>+
>+static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
>+{
>+    /* Copy bounce buffer back to qiov */
>+    if (acb->qiov.niov > 0) {
>+        qemu_iovec_from_buf(&acb->qiov, 0,
>+                acb->bounce_iov.iov_base,
>+                acb->bounce_iov.iov_len);
>+        qemu_iovec_destroy(&acb->qiov);
>+    }
>+
>+    acb->common.cb(acb->common.opaque, ret);
>+
>+    if (acb->mem_region.len > 0) {
>+        BDRVBlkioState *s = acb->common.bs->opaque;
>+
>+        WITH_QEMU_LOCK_GUARD(&s->lock) {
>+            blkio_free_mem_region(s->blkio, &acb->mem_region);
>+        }
>+    }
>+
>+    qemu_aio_unref(&acb->common);
>+}
>+
>+/*
>+ * Only the thread that calls aio_poll() invokes fd and poll handlers.
>+ * Therefore locks are not necessary except when accessing s->blkio.
>+ *
>+ * No locking is performed around blkioq_get_completions() although other
>+ * threads may submit I/O requests on s->blkioq. We're assuming there is no
>+ * interference between blkioq_get_completions() and other s->blkioq APIs.
>+ */
>+
>+static void blkio_completion_fd_read(void *opaque)
>+{
>+    BlockDriverState *bs = opaque;
>+    BDRVBlkioState *s = bs->opaque;
>+    struct blkio_completion completion;
>+    uint64_t val;
>+    ssize_t ret __attribute__((unused));
>+
>+    /* Polling may have already fetched a completion */
>+    if (s->poll_completion.user_data != NULL) {
>+        completion = s->poll_completion;
>+
>+        /* Clear it in case blkio_aiocb_complete() has a nested event loop */
>+        s->poll_completion.user_data = NULL;
>+
>+        blkio_aiocb_complete(completion.user_data, completion.ret);
>+    }
>+
>+    /* Reset completion fd status */
>+    ret = read(s->completion_fd, &val, sizeof(val));
>+
>+    /*
>+     * Reading one completion at a time makes nested event loop re-entrancy
>+     * simple. Change this loop to get multiple completions in one go if it
>+     * becomes a performance bottleneck.
>+     */
>+    while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
>+        blkio_aiocb_complete(completion.user_data, completion.ret);
>+    }
>+}
>+
>+static bool blkio_completion_fd_poll(void *opaque)
>+{
>+    BlockDriverState *bs = opaque;
>+    BDRVBlkioState *s = bs->opaque;
>+
>+    /* Just in case we already fetched a completion */
>+    if (s->poll_completion.user_data != NULL) {
>+        return true;
>+    }
>+
>+    return blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL) == 1;
>+}
>+
>+static void blkio_completion_fd_poll_ready(void *opaque)
>+{
>+    blkio_completion_fd_read(opaque);
>+}
>+
>+static void blkio_attach_aio_context(BlockDriverState *bs,
>+                                     AioContext *new_context)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+
>+    aio_set_fd_handler(new_context,
>+                       s->completion_fd,
>+                       false,
>+                       blkio_completion_fd_read,
>+                       NULL,
>+                       blkio_completion_fd_poll,
>+                       blkio_completion_fd_poll_ready,
>+                       bs);
>+}
>+
>+static void blkio_detach_aio_context(BlockDriverState *bs)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+
>+    aio_set_fd_handler(bdrv_get_aio_context(bs),
>+                       s->completion_fd,
>+                       false, NULL, NULL, NULL, NULL, NULL);
>+}
>+
>+static const AIOCBInfo blkio_aiocb_info = {
>+    .aiocb_size = sizeof(BlkioAIOCB),
>+};
>+
>+/* Create a BlkioAIOCB */
>+static BlkioAIOCB *blkio_aiocb_get(BlockDriverState *bs,
>+                                   BlockCompletionFunc *cb,
>+                                   void *opaque)
>+{
>+    BlkioAIOCB *acb = qemu_aio_get(&blkio_aiocb_info, bs, cb, opaque);
>+
>+    /* A few fields need to be initialized, leave the rest... */
>+    acb->qiov.niov = 0;
>+    acb->mem_region.len = 0;
>+    return acb;
>+}
>+
>+/* s->lock must be held */
>+static int blkio_aiocb_init_mem_region_locked(BlkioAIOCB *acb, size_t len)
>+{
>+    BDRVBlkioState *s = acb->common.bs->opaque;
>+    size_t mem_region_len = QEMU_ALIGN_UP(len, s->mem_region_alignment);
>+    int ret;
>+
>+    ret = blkio_alloc_mem_region(s->blkio, &acb->mem_region, mem_region_len);
>+    if (ret < 0) {
>+        return ret;
>+    }
>+
>+    acb->bounce_iov.iov_base = acb->mem_region.addr;
>+    acb->bounce_iov.iov_len = len;
>+    return 0;
>+}
>+
>+/* Call this to submit I/O after enqueuing a new request */
>+static void blkio_submit_io(BlockDriverState *bs)
>+{
>+    if (qatomic_read(&bs->io_plugged) == 0) {
>+        BDRVBlkioState *s = bs->opaque;
>+
>+        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
>+    }
>+}
>+
>+static BlockAIOCB *blkio_aio_pdiscard(BlockDriverState *bs, int64_t offset,
>+        int bytes, BlockCompletionFunc *cb, void *opaque)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    BlkioAIOCB *acb;
>+
>+    QEMU_LOCK_GUARD(&s->lock);
>+
>+    acb = blkio_aiocb_get(bs, cb, opaque);
>+    blkioq_discard(s->blkioq, offset, bytes, acb, 0);
>+    blkio_submit_io(bs);
>+    return &acb->common;
>+}
>+
>+static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
>+        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
>+        BlockCompletionFunc *cb, void *opaque)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    struct iovec *iov = qiov->iov;
>+    int iovcnt = qiov->niov;
>+    BlkioAIOCB *acb;
>+
>+    QEMU_LOCK_GUARD(&s->lock);
>+
>+    acb = blkio_aiocb_get(bs, cb, opaque);
>+
>+    if (s->needs_mem_regions) {
>+        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
>+            qemu_aio_unref(&acb->common);
>+            return NULL;
>+        }
>+
>+        /* Copy qiov because we'll call qemu_iovec_from_buf() on completion */
>+        qemu_iovec_init_slice(&acb->qiov, qiov, 0, qiov->size);
>+
>+        iov = &acb->bounce_iov;
>+        iovcnt = 1;
>+    }
>+
>+    blkioq_readv(s->blkioq, offset, iov, iovcnt, acb, 0);
>+    blkio_submit_io(bs);
>+    return &acb->common;
>+}
>+
>+static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
>+        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
>+        BlockCompletionFunc *cb, void *opaque)
>+{
>+    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
>+    BDRVBlkioState *s = bs->opaque;
>+    struct iovec *iov = qiov->iov;
>+    int iovcnt = qiov->niov;
>+    BlkioAIOCB *acb;
>+
>+    QEMU_LOCK_GUARD(&s->lock);
>+
>+    acb = blkio_aiocb_get(bs, cb, opaque);
>+
>+    if (s->needs_mem_regions) {
>+        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
>+            qemu_aio_unref(&acb->common);
>+            return NULL;
>+        }
>+
>+        qemu_iovec_to_buf(qiov, 0, acb->bounce_iov.iov_base, bytes);
>+
>+        iov = &acb->bounce_iov;
>+        iovcnt = 1;
>+    }
>+
>+    blkioq_writev(s->blkioq, offset, iov, iovcnt, acb, blkio_flags);
>+    blkio_submit_io(bs);
>+    return &acb->common;
>+}
>+
>+static BlockAIOCB *blkio_aio_flush(BlockDriverState *bs,
>+                                   BlockCompletionFunc *cb,
>+                                   void *opaque)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    BlkioAIOCB *acb;
>+
>+    QEMU_LOCK_GUARD(&s->lock);
>+
>+    acb = blkio_aiocb_get(bs, cb, opaque);
>+
>+    blkioq_flush(s->blkioq, acb, 0);
>+    blkio_submit_io(bs);
>+    return &acb->common;
>+}
>+
>+/* For async to .bdrv_co_*() conversion */
>+typedef struct {
>+    Coroutine *coroutine;
>+    int ret;
>+} BlkioCoData;
>+
>+static void blkio_co_pwrite_zeroes_complete(void *opaque, int ret)
>+{
>+    BlkioCoData *data = opaque;
>+
>+    data->ret = ret;
>+    aio_co_wake(data->coroutine);
>+}
>+
>+static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
>+    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    BlkioCoData data = {
>+        .coroutine = qemu_coroutine_self(),
>+    };
>+    uint32_t blkio_flags = 0;
>+
>+    if (flags & BDRV_REQ_FUA) {
>+        blkio_flags |= BLKIO_REQ_FUA;
>+    }
>+    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
>+        blkio_flags |= BLKIO_REQ_NO_UNMAP;
>+    }
>+    if (flags & BDRV_REQ_NO_FALLBACK) {
>+        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
>+    }
>+
>+    WITH_QEMU_LOCK_GUARD(&s->lock) {
>+        BlkioAIOCB *acb =
>+            blkio_aiocb_get(bs, blkio_co_pwrite_zeroes_complete, &data);
>+        blkioq_write_zeroes(s->blkioq, offset, bytes, acb, blkio_flags);
>+        blkio_submit_io(bs);
>+    }
>+
>+    qemu_coroutine_yield();
>+    return data.ret;
>+}
>+
>+static void blkio_io_unplug(BlockDriverState *bs)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+
>+    WITH_QEMU_LOCK_GUARD(&s->lock) {
>+        blkio_submit_io(bs);
>+    }
>+}
>+
>+static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
>+                                          Error **errp)
>+{
>+    bdrv_parse_filename_strip_prefix(filename, "io_uring:", options);
>+}
>+
>+static void blkio_parse_filename_virtio_blk_vhost_vdpa(
>+        const char *filename,
>+        QDict *options,
>+        Error **errp)
>+{
>+    bdrv_parse_filename_strip_prefix(filename, "virtio-blk-vhost-vdpa:", options);
>+}
>+
>+static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
>+                               Error **errp)
>+{
>+    const char *filename = qdict_get_try_str(options, "filename");
>+    BDRVBlkioState *s = bs->opaque;
>+    int ret;
>+
>+    ret = blkio_set_str(s->blkio, "path", filename);
>+    qdict_del(options, "filename");
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to set path: %s",
>+                         blkio_get_error_msg());
>+        return ret;
>+    }
>+
>+    if (flags & BDRV_O_NOCACHE) {
>+        ret = blkio_set_bool(s->blkio, "direct", true);
>+        if (ret < 0) {
>+            error_setg_errno(errp, -ret, "failed to set direct: %s",
>+                             blkio_get_error_msg());
>+            return ret;
>+        }
>+    }
>+
>+    return 0;
>+}
>+
>+static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
>+        QDict *options, int flags, Error **errp)
>+{
>+    const char *path = qdict_get_try_str(options, "path");
>+    BDRVBlkioState *s = bs->opaque;
>+    int ret;
>+
>+    ret = blkio_set_str(s->blkio, "path", path);
>+    qdict_del(options, "path");
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to set path: %s",
>+                         blkio_get_error_msg());
>+        return ret;
>+    }
>+
>+    if (flags & BDRV_O_NOCACHE) {
>+        error_setg(errp, "cache.direct=off is not supported");
>+        return -EINVAL;
>+    }
>+    return 0;
>+}
>+
>+static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
>+                           Error **errp)
>+{
>+    const char *blkio_driver = bs->drv->protocol_name;
>+    BDRVBlkioState *s = bs->opaque;
>+    int ret;
>+
>+    ret = blkio_create(blkio_driver, &s->blkio);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "blkio_create failed: %s",
>+                         blkio_get_error_msg());
>+        return ret;
>+    }
>+
>+    if (strcmp(blkio_driver, "io_uring") == 0) {
>+        ret = blkio_io_uring_open(bs, options, flags, errp);
>+    } else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
>+        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
>+    }
>+    if (ret < 0) {
>+        blkio_destroy(&s->blkio);
>+        return ret;
>+    }
>+
>+    if (!(flags & BDRV_O_RDWR)) {
>+        ret = blkio_set_bool(s->blkio, "readonly", true);
>+        if (ret < 0) {
>+            error_setg_errno(errp, -ret, "failed to set readonly: %s",
>+                             blkio_get_error_msg());
>+            blkio_destroy(&s->blkio);
>+            return ret;
>+        }
>+    }
>+
>+    ret = blkio_connect(s->blkio);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
>+                         blkio_get_error_msg());
>+        blkio_destroy(&s->blkio);
>+        return ret;
>+    }
>+
>+    ret = blkio_get_bool(s->blkio,
>+                         "needs-mem-regions",
>+                         &s->needs_mem_regions);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret,
>+                         "failed to get needs-mem-regions: %s",
>+                         blkio_get_error_msg());
>+        blkio_destroy(&s->blkio);
>+        return ret;
>+    }
>+
>+    ret = blkio_get_uint64(s->blkio,
>+                           "mem-region-alignment",
>+                           &s->mem_region_alignment);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret,
>+                         "failed to get mem-region-alignment: %s",
>+                         blkio_get_error_msg());
>+        blkio_destroy(&s->blkio);
>+        return ret;
>+    }
>+
>+    ret = blkio_start(s->blkio);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "blkio_start failed: %s",
>+                         blkio_get_error_msg());
>+        blkio_destroy(&s->blkio);
>+        return ret;
>+    }
>+
>+    bs->supported_write_flags = BDRV_REQ_FUA;
>+    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
>+                               BDRV_REQ_NO_FALLBACK;
>+
>+    qemu_mutex_init(&s->lock);
>+    s->blkioq = blkio_get_queue(s->blkio, 0);
>+    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
>+
>+    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
>+    return 0;
>+}
>+
>+static void blkio_close(BlockDriverState *bs)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+
>+    qemu_mutex_destroy(&s->lock);
>+    blkio_destroy(&s->blkio);
>+}
>+
>+static int64_t blkio_getlength(BlockDriverState *bs)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    uint64_t capacity;
>+    int ret;
>+
>+    WITH_QEMU_LOCK_GUARD(&s->lock) {
>+        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
>+    }
>+    if (ret < 0) {
>+        return ret;
>+    }
>+
>+    return capacity;
>+}
>+
>+static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
>+{
>+    return 0;
>+}
>+
>+static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    int value;
>+    int ret;
>+
>+    ret = blkio_get_int(s->blkio,
>+                        "request-alignment",
>+                        (int *)&bs->bl.request_alignment);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if (bs->bl.request_alignment < 1 ||
>+        bs->bl.request_alignment >= INT_MAX ||
>+        !is_power_of_2(bs->bl.request_alignment)) {
>+        error_setg(errp, "invalid \"request-alignment\" value %d, must be "
>+                   "a power of 2 less than INT_MAX", bs->bl.request_alignment);
>+        return;
>+    }
>+
>+    ret = blkio_get_int(s->blkio,
>+                        "optimal-io-size",
>+                        (int *)&bs->bl.opt_transfer);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to get \"optimal-io-size\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if (bs->bl.opt_transfer > INT_MAX ||
>+        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
>+        error_setg(errp, "invalid \"optimal-io-size\" value %d, must be a "
>+                   "multiple of %d", bs->bl.opt_transfer,
>+                   bs->bl.request_alignment);
>+        return;
>+    }
>+
>+    ret = blkio_get_int(s->blkio,
>+                        "max-transfer",
>+                        (int *)&bs->bl.max_transfer);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
>+        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
>+        error_setg(errp, "invalid \"max-transfer\" value %d, must be a "
>+                   "multiple of %d and %d (if non-zero)",
>+                   bs->bl.max_transfer, bs->bl.request_alignment,
>+                   bs->bl.opt_transfer);
>+        return;
>+    }
>+
>+    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if (value < 1) {
>+        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
>+                   "positive", value);
>+        return;
>+    }
>+    bs->bl.min_mem_alignment = value;
>+
>+    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret,
>+                         "failed to get \"optimal-buf-alignment\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if (value < 1) {
>+        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
>+                   "must be positive", value);
>+        return;
>+    }
>+    bs->bl.opt_mem_alignment = value;
>+
>+    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
>+    if (ret < 0) {
>+        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
>+                         blkio_get_error_msg());
>+        return;
>+    }
>+    if (bs->bl.max_iov < 1) {
>+        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
>+                   bs->bl.max_iov);
>+        return;
>+    }
>+}
>+
>+/*
>+ * TODO
>+ * Missing libblkio APIs:
>+ * - block_status
>+ * - co_invalidate_cache
>+ *
>+ * Out of scope?
>+ * - create
>+ * - truncate
>+ */
>+
>+static BlockDriver bdrv_io_uring = {
>+    .format_name                = "io_uring",
>+    .protocol_name              = "io_uring",
>+    .instance_size              = sizeof(BDRVBlkioState),
>+    .bdrv_needs_filename        = true,
>+    .bdrv_parse_filename        = blkio_parse_filename_io_uring,
>+    .bdrv_file_open             = blkio_file_open,
>+    .bdrv_close                 = blkio_close,
>+    .bdrv_getlength             = blkio_getlength,
>+    .bdrv_get_info              = blkio_get_info,
>+    .bdrv_attach_aio_context    = blkio_attach_aio_context,
>+    .bdrv_detach_aio_context    = blkio_detach_aio_context,
>+    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
>+    .bdrv_aio_preadv            = blkio_aio_preadv,
>+    .bdrv_aio_pwritev           = blkio_aio_pwritev,
>+    .bdrv_aio_flush             = blkio_aio_flush,
>+    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
>+    .bdrv_io_unplug             = blkio_io_unplug,
>+    .bdrv_refresh_limits        = blkio_refresh_limits,
>+};
>+
>+static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
>+    .format_name                = "virtio-blk-vhost-vdpa",
>+    .protocol_name              = "virtio-blk-vhost-vdpa",
>+    .instance_size              = sizeof(BDRVBlkioState),
>+    .bdrv_needs_filename        = true,

Should we set `.bdrv_needs_filename` to false for 
`bdrv_virtio_blk_vhost_vdpa`?

I have this error:
     qemu-system-x86_64: -blockdev 
     node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0: 
     The 'virtio-blk-vhost-vdpa' block driver requires a file name

>+    .bdrv_parse_filename        = blkio_parse_filename_virtio_blk_vhost_vdpa,

For my education: since virtio-blk-vhost-vdpa doesn't use the filename
parameter, do we still need to set .bdrv_parse_filename?

Thanks,
Stefano




* Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-07-08  4:17 ` [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
@ 2022-07-12 14:28   ` Stefano Garzarella
  2022-08-15 20:52     ` Stefan Hajnoczi
  2022-07-14 10:13   ` Hanna Reitz
  1 sibling, 1 reply; 29+ messages in thread
From: Stefano Garzarella @ 2022-07-12 14:28 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

On Fri, Jul 08, 2022 at 05:17:36AM +0100, Stefan Hajnoczi wrote:
>Avoid bounce buffers when QEMUIOVector elements are within previously
>registered bdrv_register_buf() buffers.
>
>The idea is that emulated storage controllers will register guest RAM
>using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
>requests. Therefore no blkio_map_mem_region() calls are necessary in the
>performance-critical I/O code path.
>
>This optimization doesn't apply if the I/O buffer is internally
>allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
>path because BDRV_REQ_REGISTERED_BUF is not set.
>
>Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>---
> block/blkio.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 101 insertions(+), 3 deletions(-)
>
>diff --git a/block/blkio.c b/block/blkio.c
>index 7fbdbd7fae..37d593a20c 100644
>--- a/block/blkio.c
>+++ b/block/blkio.c
>@@ -1,7 +1,9 @@
> #include "qemu/osdep.h"
> #include <blkio.h>
> #include "block/block_int.h"
>+#include "exec/memory.h"
> #include "qapi/error.h"
>+#include "qemu/error-report.h"
> #include "qapi/qmp/qdict.h"
> #include "qemu/module.h"
>
>@@ -28,6 +30,9 @@ typedef struct {
>
>     /* Do we need to add/delete blkio_mem_regions? */
>     bool needs_mem_regions;
>+
>+    /* Are file descriptors necessary for blkio_mem_regions? */
>+    bool needs_mem_region_fd;
> } BDRVBlkioState;
>
> static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
>@@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
>         BlockCompletionFunc *cb, void *opaque)
> {
>     BDRVBlkioState *s = bs->opaque;
>+    bool needs_mem_regions =
>+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
>     struct iovec *iov = qiov->iov;
>     int iovcnt = qiov->niov;
>     BlkioAIOCB *acb;
>@@ -206,7 +213,7 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
>
>     acb = blkio_aiocb_get(bs, cb, opaque);
>
>-    if (s->needs_mem_regions) {
>+    if (needs_mem_regions) {
>         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
>             qemu_aio_unref(&acb->common);
>             return NULL;
>@@ -230,6 +237,8 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> {
>     uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
>     BDRVBlkioState *s = bs->opaque;
>+    bool needs_mem_regions =
>+        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
>     struct iovec *iov = qiov->iov;
>     int iovcnt = qiov->niov;
>     BlkioAIOCB *acb;
>@@ -238,7 +247,7 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
>
>     acb = blkio_aiocb_get(bs, cb, opaque);
>
>-    if (s->needs_mem_regions) {
>+    if (needs_mem_regions) {
>         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
>             qemu_aio_unref(&acb->common);
>             return NULL;
>@@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
>     }
> }
>
>+static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
>+{
>+    BDRVBlkioState *s = bs->opaque;
>+    int ret;
>+    struct blkio_mem_region region = (struct blkio_mem_region){
>+        .addr = host,
>+        .len = size,
>+        .fd = -1,
>+    };
>+
>+    if (((uintptr_t)host | size) % s->mem_region_alignment) {
>+        error_report_once("%s: skipping unaligned buf %p with size %zu",
>+                          __func__, host, size);
>+        return; /* skip unaligned */
>+    }
>+
>+    /* Attempt to find the fd for a MemoryRegion */
>+    if (s->needs_mem_region_fd) {
>+        int fd = -1;
>+        ram_addr_t offset;
>+        MemoryRegion *mr;
>+
>+        /*
>+         * bdrv_register_buf() is called with the BQL held so mr lives at least
>+         * until this function returns.
>+         */
>+        mr = memory_region_from_host(host, &offset);
>+        if (mr) {
>+            fd = memory_region_get_fd(mr);

If s->needs_mem_region_fd is true, memory_region_get_fd() crashes, I
think because mr->ram_block is not yet set. Indeed, the stack trace
shows that blkio_register_buf() is called inside
qemu_ram_alloc_resizeable(), whose result is only later used to set
mr->ram_block in memory_region_init_resizeable_ram():

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000056235bf1f7a3 in memory_region_get_fd (mr=<optimized out>) at ../softmmu/memory.c:2309
#1  0x000056235c07e54d in blkio_register_buf (bs=<optimized out>, host=0x7f824e200000, size=2097152)
     at ../block/blkio.c:364
#2  0x000056235c0246c6 in bdrv_register_buf (bs=0x56235d606b40, host=0x7f824e200000, size=2097152)
     at ../block/io.c:3362
#3  0x000056235bea44e6 in ram_block_notify_add (host=0x7f824e200000, size=131072, max_size=2097152)
     at ../hw/core/numa.c:863
#4  0x000056235bf22c00 in ram_block_add (new_block=<optimized out>, errp=<optimized out>)
     at ../softmmu/physmem.c:2057
#5  0x000056235bf232e4 in qemu_ram_alloc_internal (size=size@entry=131072, 
     max_size=max_size@entry=2097152, resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>, 
     host=host@entry=0x0, ram_flags=ram_flags@entry=4, mr=mr@entry=0x56235dc3fe00, 
     errp=0x7ffcb21f1be0) at ../softmmu/physmem.c:2180
#6  0x000056235bf26426 in qemu_ram_alloc_resizeable (size=size@entry=131072, 
     maxsz=maxsz@entry=2097152, resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>, 
     mr=mr@entry=0x56235dc3fe00, errp=errp@entry=0x7ffcb21f1be0) at ../softmmu/physmem.c:2209
#7  0x000056235bf1cc99 in memory_region_init_resizeable_ram (mr=0x56235dc3fe00, 
     owner=owner@entry=0x56235d93ffc0, name=name@entry=0x7ffcb21f1ca0 "/rom@etc/acpi/tables", 
     size=131072, max_size=2097152, resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>, 
     errp=0x56235c996490 <error_fatal>) at ../softmmu/memory.c:1586
#8  0x000056235bc0f99c in rom_set_mr (rom=rom@entry=0x56235ddd0200, owner=0x56235d93ffc0, 
     name=name@entry=0x7ffcb21f1ca0 "/rom@etc/acpi/tables", ro=ro@entry=true)
     at ../hw/core/loader.c:961
#9  0x000056235bc12a65 in rom_add_blob (name=name@entry=0x56235c1a2a09 "etc/acpi/tables", 
     blob=0x56235df4f4b0, len=<optimized out>, max_len=max_len@entry=2097152, 
     addr=addr@entry=18446744073709551615, 
     fw_file_name=fw_file_name@entry=0x56235c1a2a09 "etc/acpi/tables", 
     fw_callback=0x56235be47f90 <acpi_build_update>, callback_opaque=0x56235d817830, as=0x0, 
     read_only=true) at ../hw/core/loader.c:1102
#10 0x000056235bbe0990 in acpi_add_rom_blob (
     update=update@entry=0x56235be47f90 <acpi_build_update>, opaque=opaque@entry=0x56235d817830, 
     blob=0x56235d3ab750, name=name@entry=0x56235c1a2a09 "etc/acpi/tables") at ../hw/acpi/utils.c:46
#11 0x000056235be481e6 in acpi_setup () at ../hw/i386/acpi-build.c:2805
#12 0x000056235be3e209 in pc_machine_done (notifier=0x56235d5efce8, data=<optimized out>)
     at ../hw/i386/pc.c:758
#13 0x000056235c12e4a7 in notifier_list_notify (
     list=list@entry=0x56235c963790 <machine_init_done_notifiers>, data=data@entry=0x0)
     at ../util/notify.c:39
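
Maybe memory_region_get_fd() should tolerate being called before the
RAMBlock is set and return -1, so that blkio_register_buf() takes its
existing "skip if there is no fd" path. Untested sketch of what I mean,
with the existing function body paraphrased from memory and the NULL
guard as the only new part:

  int memory_region_get_fd(MemoryRegion *mr)
  {
      RCU_READ_LOCK_GUARD();
      while (mr->alias) {
          mr = mr->alias;
      }
      /* Tolerate being called before the RAMBlock is assigned */
      if (!mr->ram_block) {
          return -1;
      }
      return mr->ram_block->fd;
  }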

Thanks,
Stefano




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
  2022-07-12 14:23   ` Stefano Garzarella
@ 2022-07-13 12:05   ` Hanna Reitz
  2022-08-11 19:08     ` Stefan Hajnoczi
  2022-07-27 19:33   ` Kevin Wolf
  2 siblings, 1 reply; 29+ messages in thread
From: Hanna Reitz @ 2022-07-13 12:05 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> high-performance disk I/O. It currently supports io_uring and
> virtio-blk-vhost-vdpa with additional drivers under development.
>
> One of the reasons for developing libblkio is that other applications
> besides QEMU can use it. This will be particularly useful for
> vhost-user-blk, which applications may wish to use for connecting to
> qemu-storage-daemon.
>
> libblkio also gives us an opportunity to develop in Rust behind a C API
> that is easy to consume from QEMU.
>
> This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> using libblkio. It will be easy to add other libblkio drivers since they
> will share the majority of code.
>
> For now I/O buffers are copied through bounce buffers if the libblkio
> driver requires it. Later commits add an optimization for
> pre-registering guest RAM to avoid bounce buffers.
>
> The syntax is:
>
>    --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
>
> and:
>
>    --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   MAINTAINERS                   |   6 +
>   meson_options.txt             |   2 +
>   qapi/block-core.json          |  37 +-
>   meson.build                   |   9 +
>   block/blkio.c                 | 659 ++++++++++++++++++++++++++++++++++
>   tests/qtest/modules-test.c    |   3 +
>   block/meson.build             |   1 +
>   scripts/meson-buildoptions.sh |   3 +
>   8 files changed, 718 insertions(+), 2 deletions(-)
>   create mode 100644 block/blkio.c

[...]

> diff --git a/block/blkio.c b/block/blkio.c
> new file mode 100644
> index 0000000000..7fbdbd7fae
> --- /dev/null
> +++ b/block/blkio.c
> @@ -0,0 +1,659 @@

Not sure whether it’s necessary, but I would have expected a copyright 
header here.

> +#include "qemu/osdep.h"
> +#include <blkio.h>
> +#include "block/block_int.h"
> +#include "qapi/error.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qemu/module.h"
> +
> +typedef struct BlkAIOCB {
> +    BlockAIOCB common;
> +    struct blkio_mem_region mem_region;
> +    QEMUIOVector qiov;
> +    struct iovec bounce_iov;
> +} BlkioAIOCB;
> +
> +typedef struct {
> +    /* Protects ->blkio and request submission on ->blkioq */
> +    QemuMutex lock;
> +
> +    struct blkio *blkio;
> +    struct blkioq *blkioq; /* this could be multi-queue in the future */
> +    int completion_fd;
> +
> +    /* Polling fetches the next completion into this field */
> +    struct blkio_completion poll_completion;
> +
> +    /* The value of the "mem-region-alignment" property */
> +    size_t mem_region_alignment;
> +
> +    /* Do we need to add/delete blkio_mem_regions? */
> +    bool needs_mem_regions;
> +} BDRVBlkioState;
> +
> +static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
> +{
> +    /* Copy bounce buffer back to qiov */
> +    if (acb->qiov.niov > 0) {
> +        qemu_iovec_from_buf(&acb->qiov, 0,
> +                acb->bounce_iov.iov_base,
> +                acb->bounce_iov.iov_len);
> +        qemu_iovec_destroy(&acb->qiov);
> +    }
> +
> +    acb->common.cb(acb->common.opaque, ret);
> +
> +    if (acb->mem_region.len > 0) {
> +        BDRVBlkioState *s = acb->common.bs->opaque;
> +
> +        WITH_QEMU_LOCK_GUARD(&s->lock) {
> +            blkio_free_mem_region(s->blkio, &acb->mem_region);
> +        }
> +    }
> +
> +    qemu_aio_unref(&acb->common);
> +}
> +
> +/*
> + * Only the thread that calls aio_poll() invokes fd and poll handlers.
> + * Therefore locks are not necessary except when accessing s->blkio.
> + *
> + * No locking is performed around blkioq_get_completions() although other
> + * threads may submit I/O requests on s->blkioq. We're assuming there is no
> + * interference between blkioq_get_completions() and other s->blkioq APIs.
> + */
> +
> +static void blkio_completion_fd_read(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    BDRVBlkioState *s = bs->opaque;
> +    struct blkio_completion completion;
> +    uint64_t val;
> +    ssize_t ret __attribute__((unused));

I’d prefer a `(void)ret;` over this attribute, not least because that 
line would give a nice opportunity to explain in a short comment why we 
ignore this return value that the compiler tells us not to ignore, but 
if you don’t, then this’ll be fine.
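
I.e. something along these lines (the comment text is just my guess at 
the rationale):

    /*
     * The eventfd is reset on a best-effort basis; a failed read() only
     * means a spurious future wakeup, so the return value can be ignored.
     */
    ret = read(s->completion_fd, &val, sizeof(val));
    (void)ret;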

> +
> +    /* Polling may have already fetched a completion */
> +    if (s->poll_completion.user_data != NULL) {
> +        completion = s->poll_completion;
> +
> +        /* Clear it in case blkio_aiocb_complete() has a nested event loop */
> +        s->poll_completion.user_data = NULL;
> +
> +        blkio_aiocb_complete(completion.user_data, completion.ret);
> +    }
> +
> +    /* Reset completion fd status */
> +    ret = read(s->completion_fd, &val, sizeof(val));
> +
> +    /*
> +     * Reading one completion at a time makes nested event loop re-entrancy
> +     * simple. Change this loop to get multiple completions in one go if it
> +     * becomes a performance bottleneck.
> +     */
> +    while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
> +        blkio_aiocb_complete(completion.user_data, completion.ret);
> +    }
> +}
> +
> +static bool blkio_completion_fd_poll(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    BDRVBlkioState *s = bs->opaque;
> +
> +    /* Just in case we already fetched a completion */
> +    if (s->poll_completion.user_data != NULL) {
> +        return true;
> +    }
> +
> +    return blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL) == 1;
> +}
> +
> +static void blkio_completion_fd_poll_ready(void *opaque)
> +{
> +    blkio_completion_fd_read(opaque);
> +}
> +
> +static void blkio_attach_aio_context(BlockDriverState *bs,
> +                                     AioContext *new_context)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +
> +    aio_set_fd_handler(new_context,
> +                       s->completion_fd,
> +                       false,
> +                       blkio_completion_fd_read,
> +                       NULL,
> +                       blkio_completion_fd_poll,
> +                       blkio_completion_fd_poll_ready,
> +                       bs);
> +}
> +
> +static void blkio_detach_aio_context(BlockDriverState *bs)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +
> +    aio_set_fd_handler(bdrv_get_aio_context(bs),
> +                       s->completion_fd,
> +                       false, NULL, NULL, NULL, NULL, NULL);
> +}
> +
> +static const AIOCBInfo blkio_aiocb_info = {
> +    .aiocb_size = sizeof(BlkioAIOCB),
> +};
> +
> +/* Create a BlkioAIOCB */
> +static BlkioAIOCB *blkio_aiocb_get(BlockDriverState *bs,
> +                                   BlockCompletionFunc *cb,
> +                                   void *opaque)
> +{
> +    BlkioAIOCB *acb = qemu_aio_get(&blkio_aiocb_info, bs, cb, opaque);
> +
> +    /* A few fields need to be initialized, leave the rest... */
> +    acb->qiov.niov = 0;
> +    acb->mem_region.len = 0;
> +    return acb;
> +}
> +
> +/* s->lock must be held */
> +static int blkio_aiocb_init_mem_region_locked(BlkioAIOCB *acb, size_t len)
> +{
> +    BDRVBlkioState *s = acb->common.bs->opaque;
> +    size_t mem_region_len = QEMU_ALIGN_UP(len, s->mem_region_alignment);
> +    int ret;
> +
> +    ret = blkio_alloc_mem_region(s->blkio, &acb->mem_region, mem_region_len);

I don’t find the blkio doc clear on whether this function is 
sufficiently fast to be used in an I/O path.  Is it?

(Or is this perhaps addressed in a later function in this series?)

> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    acb->bounce_iov.iov_base = acb->mem_region.addr;
> +    acb->bounce_iov.iov_len = len;
> +    return 0;
> +}
> +
> +/* Call this to submit I/O after enqueuing a new request */
> +static void blkio_submit_io(BlockDriverState *bs)
> +{
> +    if (qatomic_read(&bs->io_plugged) == 0) {
> +        BDRVBlkioState *s = bs->opaque;
> +
> +        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
> +    }
> +}
> +
> +static BlockAIOCB *blkio_aio_pdiscard(BlockDriverState *bs, int64_t offset,
> +        int bytes, BlockCompletionFunc *cb, void *opaque)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    BlkioAIOCB *acb;
> +
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    acb = blkio_aiocb_get(bs, cb, opaque);
> +    blkioq_discard(s->blkioq, offset, bytes, acb, 0);
> +    blkio_submit_io(bs);
> +    return &acb->common;
> +}
> +
> +static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> +        BlockCompletionFunc *cb, void *opaque)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    struct iovec *iov = qiov->iov;
> +    int iovcnt = qiov->niov;
> +    BlkioAIOCB *acb;
> +
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    acb = blkio_aiocb_get(bs, cb, opaque);
> +
> +    if (s->needs_mem_regions) {
> +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> +            qemu_aio_unref(&acb->common);
> +            return NULL;
> +        }
> +
> +        /* Copy qiov because we'll call qemu_iovec_from_buf() on completion */
> +        qemu_iovec_init_slice(&acb->qiov, qiov, 0, qiov->size);
> +
> +        iov = &acb->bounce_iov;
> +        iovcnt = 1;
> +    }
> +
> +    blkioq_readv(s->blkioq, offset, iov, iovcnt, acb, 0);
> +    blkio_submit_io(bs);
> +    return &acb->common;
> +}
> +
> +static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> +        BlockCompletionFunc *cb, void *opaque)
> +{
> +    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
> +    BDRVBlkioState *s = bs->opaque;
> +    struct iovec *iov = qiov->iov;
> +    int iovcnt = qiov->niov;
> +    BlkioAIOCB *acb;
> +
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    acb = blkio_aiocb_get(bs, cb, opaque);
> +
> +    if (s->needs_mem_regions) {
> +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> +            qemu_aio_unref(&acb->common);
> +            return NULL;
> +        }
> +
> +        qemu_iovec_to_buf(qiov, 0, acb->bounce_iov.iov_base, bytes);
> +
> +        iov = &acb->bounce_iov;
> +        iovcnt = 1;
> +    }
> +
> +    blkioq_writev(s->blkioq, offset, iov, iovcnt, acb, blkio_flags);
> +    blkio_submit_io(bs);
> +    return &acb->common;
> +}
> +
> +static BlockAIOCB *blkio_aio_flush(BlockDriverState *bs,
> +                                   BlockCompletionFunc *cb,
> +                                   void *opaque)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    BlkioAIOCB *acb;
> +
> +    QEMU_LOCK_GUARD(&s->lock);
> +
> +    acb = blkio_aiocb_get(bs, cb, opaque);
> +
> +    blkioq_flush(s->blkioq, acb, 0);
> +    blkio_submit_io(bs);
> +    return &acb->common;
> +}
> +
> +/* For async to .bdrv_co_*() conversion */
> +typedef struct {
> +    Coroutine *coroutine;
> +    int ret;
> +} BlkioCoData;
> +
> +static void blkio_co_pwrite_zeroes_complete(void *opaque, int ret)
> +{
> +    BlkioCoData *data = opaque;
> +
> +    data->ret = ret;
> +    aio_co_wake(data->coroutine);
> +}
> +
> +static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
> +    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    BlkioCoData data = {
> +        .coroutine = qemu_coroutine_self(),
> +    };
> +    uint32_t blkio_flags = 0;
> +
> +    if (flags & BDRV_REQ_FUA) {
> +        blkio_flags |= BLKIO_REQ_FUA;
> +    }
> +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> +        blkio_flags |= BLKIO_REQ_NO_UNMAP;
> +    }
> +    if (flags & BDRV_REQ_NO_FALLBACK) {
> +        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        BlkioAIOCB *acb =
> +            blkio_aiocb_get(bs, blkio_co_pwrite_zeroes_complete, &data);
> +        blkioq_write_zeroes(s->blkioq, offset, bytes, acb, blkio_flags);
> +        blkio_submit_io(bs);
> +    }
> +
> +    qemu_coroutine_yield();
> +    return data.ret;
> +}
> +
> +static void blkio_io_unplug(BlockDriverState *bs)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        blkio_submit_io(bs);
> +    }
> +}
> +
> +static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
> +                                          Error **errp)
> +{
> +    bdrv_parse_filename_strip_prefix(filename, "io_uring:", options);
> +}
> +
> +static void blkio_parse_filename_virtio_blk_vhost_vdpa(
> +        const char *filename,
> +        QDict *options,
> +        Error **errp)
> +{
> +    bdrv_parse_filename_strip_prefix(filename, "virtio-blk-vhost-vdpa:", options);
> +}

Besides the fact that this doesn’t work for virtio-blk-vhost-vdpa 
(because it provides a @filename option, but that driver expects a @path 
option), is it really worth implementing these, or should we just expect 
users to use -blockdev (or -drive with blockdev-like options)?
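
If it is kept at all, it would presumably have to produce @path rather 
than @filename, so it can't go through the generic helper (rough sketch):

    static void blkio_parse_filename_virtio_blk_vhost_vdpa(
            const char *filename, QDict *options, Error **errp)
    {
        if (strstart(filename, "virtio-blk-vhost-vdpa:", &filename)) {
            qdict_put_str(options, "path", filename);
        }
    }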

> +
> +static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
> +                               Error **errp)
> +{
> +    const char *filename = qdict_get_try_str(options, "filename");
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +
> +    ret = blkio_set_str(s->blkio, "path", filename);

You don’t check that @filename is non-NULL, and I don’t think that 
libblkio would accept a NULL here.  Admittedly, I can’t produce a case 
where it would be NULL (because -blockdev checks the QAPI schema, and 
-drive expects a @filename parameter thanks to .bdrv_needs_filename), 
but I think it still isn’t ideal.
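
A guard along these lines would be cheap insurance (sketch):

    if (!filename) {
        error_setg(errp, "missing 'filename' option");
        return -EINVAL;
    }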

> +    qdict_del(options, "filename");
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to set path: %s",
> +                         blkio_get_error_msg());
> +        return ret;
> +    }
> +
> +    if (flags & BDRV_O_NOCACHE) {
> +        ret = blkio_set_bool(s->blkio, "direct", true);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "failed to set direct: %s",
> +                             blkio_get_error_msg());
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
> +        QDict *options, int flags, Error **errp)
> +{
> +    const char *path = qdict_get_try_str(options, "path");
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +
> +    ret = blkio_set_str(s->blkio, "path", path);

In contrast to the above, I can make @path NULL here, because 
.bdrv_needs_filename only ensures that there’s a @filename parameter, 
and so:

$ ./qemu-system-x86_64 -drive 
if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo
[1]    49946 segmentation fault (core dumped)  ./qemu-system-x86_64 -drive
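
The same kind of guard as suggested above for @filename would avoid this 
(sketch):

    if (!path) {
        error_setg(errp, "missing 'path' option");
        return -EINVAL;
    }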

> +    qdict_del(options, "path");
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to set path: %s",
> +                         blkio_get_error_msg());
> +        return ret;
> +    }
> +
> +    if (flags & BDRV_O_NOCACHE) {
> +        error_setg(errp, "cache.direct=off is not supported");

The condition is the opposite of that, though, isn’t it?

I.e.:

$ ./qemu-system-x86_64 -drive 
if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo,path=foo,cache.direct=on 

qemu-system-x86_64: -drive 
if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo,path=foo,cache.direct=on: 
cache.direct=off is not supported
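
Presumably the intended check is the inverse, i.e. rejecting non-direct 
access (sketch):

    if (!(flags & BDRV_O_NOCACHE)) {
        error_setg(errp, "cache.direct=off is not supported");
        return -EINVAL;
    }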

> +        return -EINVAL;
> +    }
> +    return 0;
> +}
> +
> +static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
> +                           Error **errp)
> +{
> +    const char *blkio_driver = bs->drv->protocol_name;
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +
> +    ret = blkio_create(blkio_driver, &s->blkio);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "blkio_create failed: %s",
> +                         blkio_get_error_msg());
> +        return ret;
> +    }
> +
> +    if (strcmp(blkio_driver, "io_uring") == 0) {
> +        ret = blkio_io_uring_open(bs, options, flags, errp);
> +    } else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
> +        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
> +    }

First, I’d like to suggest using macros for the driver names (and use 
them here and below for format_name/protocol_name).

Second, what do you think about adding an `else` branch with 
`g_assert_not_reached()` (or just abort)?
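
E.g. (sketch; the macro names are made up):

    #define DRIVER_IO_URING "io_uring"
    #define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"

    if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
        ret = blkio_io_uring_open(bs, options, flags, errp);
    } else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
    } else {
        g_assert_not_reached();
    }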

> +    if (ret < 0) {
> +        blkio_destroy(&s->blkio);
> +        return ret;
> +    }
> +
> +    if (!(flags & BDRV_O_RDWR)) {
> +        ret = blkio_set_bool(s->blkio, "readonly", true);

The libblkio doc says it’s “read-only”, and when I try to set this 
option, I get an error:

$ ./qemu-system-x86_64 -blockdev 
io_uring,node-name=node0,filename=/dev/null,read-only=on
qemu-system-x86_64: -blockdev 
io_uring,node-name=node0,filename=/dev/null,read-only=on: failed to set 
readonly: Unknown property name: No such file or directory
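
I.e., assuming the libblkio doc is right, presumably just:

    ret = blkio_set_bool(s->blkio, "read-only", true);

(with the error message below adjusted to match).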

> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "failed to set readonly: %s",
> +                             blkio_get_error_msg());
> +            blkio_destroy(&s->blkio);
> +            return ret;
> +        }
> +    }
> +
> +    ret = blkio_connect(s->blkio);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
> +                         blkio_get_error_msg());
> +        blkio_destroy(&s->blkio);
> +        return ret;
> +    }
> +
> +    ret = blkio_get_bool(s->blkio,
> +                         "needs-mem-regions",
> +                         &s->needs_mem_regions);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "failed to get needs-mem-regions: %s",
> +                         blkio_get_error_msg());
> +        blkio_destroy(&s->blkio);
> +        return ret;
> +    }
> +
> +    ret = blkio_get_uint64(s->blkio,
> +                           "mem-region-alignment",
> +                           &s->mem_region_alignment);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "failed to get mem-region-alignment: %s",
> +                         blkio_get_error_msg());
> +        blkio_destroy(&s->blkio);
> +        return ret;
> +    }
> +
> +    ret = blkio_start(s->blkio);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "blkio_start failed: %s",
> +                         blkio_get_error_msg());
> +        blkio_destroy(&s->blkio);
> +        return ret;
> +    }
> +
> +    bs->supported_write_flags = BDRV_REQ_FUA;
> +    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
> +                               BDRV_REQ_NO_FALLBACK;
> +
> +    qemu_mutex_init(&s->lock);
> +    s->blkioq = blkio_get_queue(s->blkio, 0);
> +    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
> +
> +    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
> +    return 0;
> +}
> +
> +static void blkio_close(BlockDriverState *bs)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +
> +    qemu_mutex_destroy(&s->lock);
> +    blkio_destroy(&s->blkio);

Should we call blkio_detach_aio_context() here?
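
E.g. (sketch):

    static void blkio_close(BlockDriverState *bs)
    {
        BDRVBlkioState *s = bs->opaque;

        blkio_detach_aio_context(bs);
        qemu_mutex_destroy(&s->lock);
        blkio_destroy(&s->blkio);
    }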

> +}
> +
> +static int64_t blkio_getlength(BlockDriverState *bs)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    uint64_t capacity;
> +    int ret;
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
> +    }
> +    if (ret < 0) {
> +        return -ret;
> +    }
> +
> +    return capacity;
> +}
> +
> +static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> +{
> +    return 0;
> +}
> +
> +static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    int value;
> +    int ret;
> +
> +    ret = blkio_get_int(s->blkio,
> +                        "request-alignment",
> +                        (int *)&bs->bl.request_alignment);

I find this pointer cast and the ones below quite questionable. 
Admittedly, I can’t think of a reasonably common system (nowadays) where 
this would actually cause problems, but I’d prefer just reading all ints 
into `value` and then assigning the respective limit from it.
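
I.e., using the existing `value` variable (sketch):

    ret = blkio_get_int(s->blkio, "request-alignment", &value);
    if (ret < 0) {
        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
                         blkio_get_error_msg());
        return;
    }
    bs->bl.request_alignment = value;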

> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if (bs->bl.request_alignment < 1 ||
> +        bs->bl.request_alignment >= INT_MAX ||
> +        !is_power_of_2(bs->bl.request_alignment)) {
> +        error_setg(errp, "invalid \"request-alignment\" value %d, must be "
> +                   "power of 2 less than INT_MAX", bs->bl.request_alignment);

Minor (because auto-checked by the compiler anyway), but I’d prefer `%" 
PRIu32 "` instead of `%d` (same for other limits below).

> +        return;
> +    }
> +
> +    ret = blkio_get_int(s->blkio,
> +                        "optimal-io-size",
> +                        (int *)&bs->bl.opt_transfer);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if (bs->bl.opt_transfer > INT_MAX ||
> +        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
> +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be a "
> +                   "multiple of %d", bs->bl.opt_transfer,
> +                   bs->bl.request_alignment);

Both error messages call it buf-alignment, but here we’re actually 
querying optimal-io-size.

Second, is it really fatal if we fail to query it?  It was my impression 
that this is optional anyway, so why don’t we just ignore `ret < 0` and 
make it zero then?
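
I.e. perhaps (sketch, treating the property as a pure hint):

    ret = blkio_get_int(s->blkio, "optimal-io-size", &value);
    if (ret < 0) {
        value = 0; /* optional hint, so treat a failed query as "none" */
    }
    bs->bl.opt_transfer = value;

(The multiple-of-request-alignment check would still make sense for a 
non-zero value.)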

> +        return;
> +    }
> +
> +    ret = blkio_get_int(s->blkio,
> +                        "max-transfer",
> +                        (int *)&bs->bl.max_transfer);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
> +        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
> +        error_setg(errp, "invalid \"max-transfer\" value %d, must be a "
> +                   "multiple of %d and %d (if non-zero)",
> +                   bs->bl.max_transfer, bs->bl.request_alignment,
> +                   bs->bl.opt_transfer);
> +        return;
> +    }
> +
> +    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if (value < 1) {
> +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
> +                   "positive", value);
> +        return;
> +    }
> +    bs->bl.min_mem_alignment = value;
> +
> +    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "failed to get \"optimal-buf-alignment\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if (value < 1) {
> +        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
> +                   "must be positive", value);
> +        return;
> +    }
> +    bs->bl.opt_mem_alignment = value;
> +
> +    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
> +                         blkio_get_error_msg());
> +        return;
> +    }
> +    if (value < 1) {
> +        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
> +                   bs->bl.max_iov);
> +        return;
> +    }
> +}
> +
> +/*
> + * TODO
> + * Missing libblkio APIs:
> + * - write zeroes
> + * - discard

But you’ve added functionality for both here, haven’t you?

> + * - block_status
> + * - co_invalidate_cache
> + *
> + * Out of scope?
> + * - create
> + * - truncate

I don’t know why truncate would be out of scope; we even have truncate 
support for block devices so that users can signal size changes to qemu.

I can see that it isn’t important right now, but I don’t think that 
makes it out of scope.

(Creation seems out of scope, because you can just create regular files 
via the “file” driver.)

Hanna

> + */
> +
> +static BlockDriver bdrv_io_uring = {
> +    .format_name                = "io_uring",
> +    .protocol_name              = "io_uring",
> +    .instance_size              = sizeof(BDRVBlkioState),
> +    .bdrv_needs_filename        = true,
> +    .bdrv_parse_filename        = blkio_parse_filename_io_uring,
> +    .bdrv_file_open             = blkio_file_open,
> +    .bdrv_close                 = blkio_close,
> +    .bdrv_getlength             = blkio_getlength,
> +    .bdrv_get_info              = blkio_get_info,
> +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> +    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
> +    .bdrv_aio_preadv            = blkio_aio_preadv,
> +    .bdrv_aio_pwritev           = blkio_aio_pwritev,
> +    .bdrv_aio_flush             = blkio_aio_flush,
> +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> +    .bdrv_io_unplug             = blkio_io_unplug,
> +    .bdrv_refresh_limits        = blkio_refresh_limits,
> +};
> +
> +static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
> +    .format_name                = "virtio-blk-vhost-vdpa",
> +    .protocol_name              = "virtio-blk-vhost-vdpa",
> +    .instance_size              = sizeof(BDRVBlkioState),
> +    .bdrv_needs_filename        = true,
> +    .bdrv_parse_filename        = blkio_parse_filename_virtio_blk_vhost_vdpa,
> +    .bdrv_file_open             = blkio_file_open,
> +    .bdrv_close                 = blkio_close,
> +    .bdrv_getlength             = blkio_getlength,
> +    .bdrv_get_info              = blkio_get_info,
> +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> +    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
> +    .bdrv_aio_preadv            = blkio_aio_preadv,
> +    .bdrv_aio_pwritev           = blkio_aio_pwritev,
> +    .bdrv_aio_flush             = blkio_aio_flush,
> +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> +    .bdrv_io_unplug             = blkio_io_unplug,
> +    .bdrv_refresh_limits        = blkio_refresh_limits,
> +};
> +
> +static void bdrv_blkio_init(void)
> +{
> +    bdrv_register(&bdrv_io_uring);
> +    bdrv_register(&bdrv_virtio_blk_vhost_vdpa);
> +}
> +
> +block_init(bdrv_blkio_init);




* Re: [RFC v3 3/8] block: pass size to bdrv_unregister_buf()
  2022-07-08  4:17 ` [RFC v3 3/8] block: pass size to bdrv_unregister_buf() Stefan Hajnoczi
@ 2022-07-13 14:08   ` Hanna Reitz
  0 siblings, 0 replies; 29+ messages in thread
From: Hanna Reitz @ 2022-07-13 14:08 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> The only implementor of bdrv_register_buf() is block/nvme.c, where the
> size is not needed when unregistering a buffer. This is because
> util/vfio-helpers.c can look up mappings by address.
>
> Future block drivers that implement bdrv_register_buf() may not be able
> to do their job given only the buffer address. Add a size argument to
> bdrv_unregister_buf().
>
> Also document the assumptions about
> bdrv_register_buf()/bdrv_unregister_buf() calls. The same <host, size>
> values that were given to bdrv_register_buf() must be given to
> bdrv_unregister_buf().
>
> gcc 11.2.1 emits a spurious warning that img_bench()'s buf_size local
> variable might be uninitialized, so it's necessary to silence the
> compiler.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   include/block/block-global-state.h          | 5 ++++-
>   include/block/block_int-common.h            | 2 +-
>   include/sysemu/block-backend-global-state.h | 2 +-
>   block/block-backend.c                       | 4 ++--
>   block/io.c                                  | 6 +++---
>   block/nvme.c                                | 2 +-
>   qemu-img.c                                  | 4 ++--
>   7 files changed, 14 insertions(+), 11 deletions(-)

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag
  2022-07-08  4:17 ` [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag Stefan Hajnoczi
@ 2022-07-14  8:54   ` Hanna Reitz
  2022-08-17 20:46     ` Stefan Hajnoczi
  0 siblings, 1 reply; 29+ messages in thread
From: Hanna Reitz @ 2022-07-14  8:54 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> Block drivers may optimize I/O requests accessing buffers previously
> registered with bdrv_register_buf(). Checking whether all elements of a
> request's QEMUIOVector are within previously registered buffers is
> expensive, so we need a hint from the user to avoid costly checks.
>
> Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
> QEMUIOVector elements in an I/O request are known to be within
> previously registered buffers.
>
> bdrv_aligned_preadv() is strict in validating supported read flags and
> its assertions fail when it sees BDRV_REQ_REGISTERED_BUF. There is no
> harm in passing BDRV_REQ_REGISTERED_BUF to block drivers that do not
> support it, so update the assertions to ignore BDRV_REQ_REGISTERED_BUF.
>
> Care must be taken to clear the flag when the block layer or filter
> drivers replace QEMUIOVector elements with bounce buffers since these
> have not been registered with bdrv_register_buf(). A lot of the changes
> in this commit deal with clearing the flag in those cases.
>
> Ensuring that the flag is cleared properly is somewhat invasive to
> implement across the block layer and it's hard to spot when future code
> changes accidentally break it. Another option might be to add a flag to
> QEMUIOVector itself and clear it in qemu_iovec_*() functions that modify
> elements. That is more robust but somewhat of a layering violation, so I
> haven't attempted that.

Yeah...  I will say that most read code already looks quite reasonable 
in that it’ll pass @flags to lower layers basically only if it’s an 
unmodified request, so it seems like in the past most people have 
already adhered to “don’t pass on any flags if you’re reading to a local 
bounce buffer”.

> Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
> ---
>   include/block/block-common.h |  9 +++++++++
>   block/blkverify.c            |  4 ++--
>   block/crypto.c               |  2 ++
>   block/io.c                   | 30 +++++++++++++++++++++++-------
>   block/mirror.c               |  2 ++
>   block/raw-format.c           |  2 ++
>   6 files changed, 40 insertions(+), 9 deletions(-)

Some things not covered here that look a bit wrong:

While bdrv_driver_preadv() asserts that the flags don’t contain anything 
the driver couldn’t handle (and this new flag is made exempt from that 
assertion here in this patch), bdrv_driver_pwritev() just hides those 
flags from drivers silently. I think just like we exempt the new flag 
from the assertion in bdrv_driver_preadv(), we should have 
bdrv_driver_pwritev() always pass it to drivers.

The following driver read/write functions assert that @flags is 0, which 
is probably no longer ideal:
- bdrv_qed_co_writev()
- block_crypto_co_preadv()
- nbd_client_co_preadv()
- parallels_co_writev()
- qcow_co_preadv()
- qcow_co_pwritev()
- qemu_gluster_co_writev()
- raw_co_pwritev() (block/file-posix.c)
- replication_co_writev()
- ssh_co_writev()
- vhdx_co_writev()

snapshot_access_co_preadv_part() returns an error when any flags are 
set, but should probably ignore BDRV_REQ_REGISTERED_BUF for this check.
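
For most of these, the fix would presumably just be to relax the 
assertion, e.g.:

    assert(!(flags & ~BDRV_REQ_REGISTERED_BUF));

and for snapshot_access_co_preadv_part(), to mask the flag out before 
the error check (sketch):

    flags &= ~BDRV_REQ_REGISTERED_BUF;
    if (flags) {
        return -ENOTSUP;
    }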


While looking around, I spotted a couple of places that look like they 
could pass the flag on but currently don’t (just FYI, not asking for 
anything here):

bdrv_co_do_copy_on_readv() never passes the flags through to its calls, 
but I think it could pass this flag on in the one bdrv_driver_preadv() 
call where it doesn’t use a bounce buffer (“Read directly into the 
destination”).

qcow2’s qcow2_co_preadv_task() and qcow2_co_pwritev_task() (besides the 
encryption part) also look like they should pass this flag on, but, 
well, the functions themselves currently don’t get the flag, so they can’t.

qcow1’s qcow_co_preadv() and qcow_co_pwritev() are so-so, sometimes 
using a bounce buffer, and sometimes not, but those function could use 
optimization in general if anyone cared.

vpc_co_preadv()’s and vpc_co_pwritev()’s first 
bdrv_co_preadv()/bdrv_co_pwritev() invocations look straightforward, but 
as with qcow1, not sure if anyone cares.

I’m too lazy to thoroughly check what’s going on with 
qed_aio_write_main().  Passing 0 is safe, and it doesn’t get the 
original request flags, so I guess doing anything about this would be 
difficult.

quorum’s read_fifo_child() probably could pass acb->flags. Probably.  
Perhaps not.  Difficult to say it is.

block/replication.c also looks like a candidate for passing flags, but 
personally, I’d like to refrain from touching it.  (Well, besides the 
fact that replication_co_writev() asserts that @flags is 0.)


(And finally, I found that block/parallels.c invokes bdrv_co_pwritev() 
with a buffer instead of an I/O vector, which looks really wrong, but 
has nothing to do with this patch.)

[...]

> diff --git a/block/io.c b/block/io.c
> index e7f4117fe7..83b8259227 100644
> --- a/block/io.c
> +++ b/block/io.c

[...]

> @@ -1902,6 +1910,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>           return -ENOTSUP;
>       }
>   
> +    /* By definition there is no user buffer so this flag doesn't make sense */
> +    if (flags & BDRV_REQ_REGISTERED_BUF) {
> +        return -EINVAL;
> +    }
> +

Here we return an error when the flag is met...

>       /* Invalidate the cached block-status data range if this write overlaps */
>       bdrv_bsc_invalidate_range(bs, offset, bytes);
>   
> @@ -2187,6 +2200,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
>       bool padding;
>       BdrvRequestPadding pad;
>   
> +    /* This flag doesn't make sense for padding or zero writes */
> +    flags &= ~BDRV_REQ_REGISTERED_BUF;
> +

...and here we just ignore it.  Why don’t we handle this the same in 
both of these functions?  (And what about bdrv_co_pwrite_zeroes()?)

Besides that, if we do make it an error, I wonder if it shouldn’t be an 
assertion instead so the duty of clearing the flag falls on the caller.  
(I personally like just silently clearing it in the zero-write 
functions, though.)
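
I.e. just the same treatment in bdrv_co_do_pwrite_zeroes() as well 
(sketch):

    /* By definition there is no user buffer, so silently drop the hint */
    flags &= ~BDRV_REQ_REGISTERED_BUF;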

Hanna




* Re: [RFC v3 5/8] block: add BlockRAMRegistrar
  2022-07-08  4:17 ` [RFC v3 5/8] block: add BlockRAMRegistrar Stefan Hajnoczi
@ 2022-07-14  9:30   ` Hanna Reitz
  2022-08-17 20:51     ` Stefan Hajnoczi
  0 siblings, 1 reply; 29+ messages in thread
From: Hanna Reitz @ 2022-07-14  9:30 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> Emulated devices and other BlockBackend users wishing to take advantage
> of blk_register_buf() all have the same repetitive job: register
> RAMBlocks with the BlockBackend using RAMBlockNotifier.
>
> Add a BlockRAMRegistrar API to do this. A later commit will use this
> from hw/block/virtio-blk.c.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   MAINTAINERS                          |  1 +
>   include/sysemu/block-ram-registrar.h | 30 +++++++++++++++++++++
>   block/block-ram-registrar.c          | 39 ++++++++++++++++++++++++++++
>   block/meson.build                    |  1 +
>   4 files changed, 71 insertions(+)
>   create mode 100644 include/sysemu/block-ram-registrar.h
>   create mode 100644 block/block-ram-registrar.c

What memory is handled in ram_list?  Is it everything?  If so, won’t 
devices have trouble registering all those buffers, especially if they 
happen to be fragmented in physical memory? (nvme_register_buf() seems 
to say it can run out of slots quite easily.)

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 50f340d9ee..d16189449f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2490,6 +2490,7 @@ F: block*
>   F: block/
>   F: hw/block/
>   F: include/block/
> +F: include/sysemu/block-*.h
>   F: qemu-img*
>   F: docs/tools/qemu-img.rst
>   F: qemu-io*

Sneaky. ;)

> diff --git a/include/sysemu/block-ram-registrar.h b/include/sysemu/block-ram-registrar.h
> new file mode 100644
> index 0000000000..09d63f64b2
> --- /dev/null
> +++ b/include/sysemu/block-ram-registrar.h
> @@ -0,0 +1,30 @@
> +/*
> + * BlockBackend RAM Registrar
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef BLOCK_RAM_REGISTRAR_H
> +#define BLOCK_RAM_REGISTRAR_H
> +
> +#include "exec/ramlist.h"
> +
> +/**
> + * struct BlockRAMRegistrar:
> + *
> + * Keeps RAMBlock memory registered with a BlockBackend using
> + * blk_register_buf() including hotplugged memory.
> + *
> + * Emulated devices or other BlockBackend users initialize a BlockRAMRegistrar
> + * with blk_ram_registrar_init() before submitting I/O requests with the
> + * BLK_REQ_REGISTERED_BUF flag set.

s/BLK/BDRV/, right?

> + */
> +typedef struct {
> +    BlockBackend *blk;
> +    RAMBlockNotifier notifier;
> +} BlockRAMRegistrar;
> +
> +void blk_ram_registrar_init(BlockRAMRegistrar *r, BlockBackend *blk);
> +void blk_ram_registrar_destroy(BlockRAMRegistrar *r);
> +
> +#endif /* BLOCK_RAM_REGISTRAR_H */




* Re: [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd()
  2022-07-08  4:17 ` [RFC v3 6/8] stubs: add memory_region_from_host() and memory_region_get_fd() Stefan Hajnoczi
@ 2022-07-14  9:39   ` Hanna Reitz
  0 siblings, 0 replies; 29+ messages in thread
From: Hanna Reitz @ 2022-07-14  9:39 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> The blkio block driver will need to look up the file descriptor for a
> given pointer. This is possible in softmmu builds where the memory API
> is available for querying guest RAM.
>
> Add stubs so tools like qemu-img that link the block layer still build
> successfully. In this case there is no guest RAM but that is fine.
> Bounce buffers and their file descriptors will be allocated with
> libblkio's blkio_alloc_mem_region() so we won't rely on QEMU's
> memory_region_get_fd() in that case.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   stubs/memory.c    | 13 +++++++++++++
>   stubs/meson.build |  1 +
>   2 files changed, 14 insertions(+)
>   create mode 100644 stubs/memory.c

Reviewed-by: Hanna Reitz <hreitz@redhat.com>




* Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-07-08  4:17 ` [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization Stefan Hajnoczi
  2022-07-12 14:28   ` Stefano Garzarella
@ 2022-07-14 10:13   ` Hanna Reitz
  2022-08-18 19:46     ` Stefan Hajnoczi
  1 sibling, 1 reply; 29+ messages in thread
From: Hanna Reitz @ 2022-07-14 10:13 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> Avoid bounce buffers when QEMUIOVector elements are within previously
> registered bdrv_register_buf() buffers.
>
> The idea is that emulated storage controllers will register guest RAM
> using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
> requests. Therefore no blkio_map_mem_region() calls are necessary in the
> performance-critical I/O code path.
>
> This optimization doesn't apply if the I/O buffer is internally
> allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
> path because BDRV_REQ_REGISTERED_BUF is not set.

Which keeps the question of how slow the slow path actually is relevant, 
i.e. whether it wouldn’t make sense to keep some of the mem regions 
allocated there in a cache instead of allocating/freeing them on every 
I/O request.

> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/blkio.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 101 insertions(+), 3 deletions(-)
>
> diff --git a/block/blkio.c b/block/blkio.c
> index 7fbdbd7fae..37d593a20c 100644
> --- a/block/blkio.c
> +++ b/block/blkio.c

[...]

> @@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
>           BlockCompletionFunc *cb, void *opaque)
>   {
>       BDRVBlkioState *s = bs->opaque;
> +    bool needs_mem_regions =
> +        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);

Is that condition sufficient?  bdrv_register_buf() has no way of 
returning an error, so it’s possible that buffers are silently not 
registered.  (And there are conditions in blkio_register_buf() where the 
buffer will not be registered, e.g. because it isn’t aligned.)

The caller knows nothing of this and will still pass 
BDRV_REQ_REGISTERED_BUF, and then we’ll assume the region is mapped but 
it won’t be.

>       struct iovec *iov = qiov->iov;
>       int iovcnt = qiov->niov;
>       BlkioAIOCB *acb;

[...]

> @@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
>       }
>   }
>   
> +static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +    struct blkio_mem_region region = (struct blkio_mem_region){
> +        .addr = host,
> +        .len = size,
> +        .fd = -1,
> +    };
> +
> +    if (((uintptr_t)host | size) % s->mem_region_alignment) {
> +        error_report_once("%s: skipping unaligned buf %p with size %zu",
> +                          __func__, host, size);
> +        return; /* skip unaligned */
> +    }

How big is mem-region-alignment generally?  Is it like 4k or is it going 
to be a real issue?

(Also, we could probably register a truncated region.  I know, that’ll 
break the BDRV_REQ_REGISTERED_BUF idea because the caller won’t know 
we’ve truncated it, but that’s no different than just not registering 
the buffer at all.)

> +
> +    /* Attempt to find the fd for a MemoryRegion */
> +    if (s->needs_mem_region_fd) {
> +        int fd = -1;
> +        ram_addr_t offset;
> +        MemoryRegion *mr;
> +
> +        /*
> +         * bdrv_register_buf() is called with the BQL held so mr lives at least
> +         * until this function returns.
> +         */
> +        mr = memory_region_from_host(host, &offset);
> +        if (mr) {
> +            fd = memory_region_get_fd(mr);
> +        }

I don’t think it’s specified that buffers registered with 
bdrv_register_buf() must be within a single memory region, is it? So can 
we somehow verify that the memory region covers the whole buffer?
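
Perhaps something like this when looking the region up (sketch, 
untested):

    mr = memory_region_from_host(host, &offset);
    if (mr && size > memory_region_size(mr) - offset) {
        /* The buffer spills over into another region; treat as not found */
        mr = NULL;
    }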

> +        if (fd == -1) {
> +            error_report_once("%s: skipping fd-less buf %p with size %zu",
> +                              __func__, host, size);
> +            return; /* skip if there is no fd */
> +        }
> +
> +        region.fd = fd;
> +        region.fd_offset = offset;
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        ret = blkio_map_mem_region(s->blkio, &region);
> +    }
> +
> +    if (ret < 0) {
> +        error_report_once("Failed to add blkio mem region %p with size %zu: %s",
> +                          host, size, blkio_get_error_msg());
> +    }
> +}
> +
> +static void blkio_unregister_buf(BlockDriverState *bs, void *host, size_t size)
> +{
> +    BDRVBlkioState *s = bs->opaque;
> +    int ret;
> +    struct blkio_mem_region region = (struct blkio_mem_region){
> +        .addr = host,
> +        .len = size,
> +        .fd = -1,
> +    };
> +
> +    if (((uintptr_t)host | size) % s->mem_region_alignment) {
> +        return; /* skip unaligned */
> +    }
> +
> +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> +        ret = blkio_unmap_mem_region(s->blkio, &region);
> +    }

The documentation of libblkio says that “memory regions must be 
unmapped/freed with exactly the same `region` field values that they 
were mapped/allocated with.”  We don’t set .fd here, though.

It’s also unclear whether it’s allowed to unmap a region that wasn’t 
mapped, but I’ll trust libblkio to detect that.

> +
> +    if (ret < 0) {
> +        error_report_once("Failed to delete blkio mem region %p with size %zu: %s",
> +                          host, size, blkio_get_error_msg());
> +    }
> +}
> +
>   static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
>                                             Error **errp)
>   {

[...]

> @@ -459,7 +553,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
>           return ret;
>       }
>   
> -    bs->supported_write_flags = BDRV_REQ_FUA;
> +    bs->supported_write_flags = BDRV_REQ_FUA | BDRV_REQ_REGISTERED_BUF;

Shouldn’t we also report it as a supported read flag then?

Hanna

>       bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
>                                  BDRV_REQ_NO_FALLBACK;




* Re: [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
  2022-07-08  4:17 ` [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint Stefan Hajnoczi
@ 2022-07-14 10:16   ` Hanna Reitz
  2022-08-15 21:24     ` Stefan Hajnoczi
  0 siblings, 1 reply; 29+ messages in thread
From: Hanna Reitz @ 2022-07-14 10:16 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Alberto Faria, Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin,
	Paolo Bonzini, Laurent Vivier, Eric Blake, sgarzare,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On 08.07.22 06:17, Stefan Hajnoczi wrote:
> Register guest RAM using BlockRAMRegistrar and set the
> BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
> accesses in I/O requests.
>
> This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
> on DMA mapping/unmapping.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   include/hw/virtio/virtio-blk.h |  2 ++
>   hw/block/virtio-blk.c          | 13 +++++++++----
>   2 files changed, 11 insertions(+), 4 deletions(-)

Seems fair, but as said on patch 5, I’m quite wary of “register guest 
RAM”.  How can we guarantee that it won’t be too fragmented to be 
registerable with either nvme.c or blkio.c?

Hanna




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-08  4:17 ` [RFC v3 1/8] blkio: add io_uring block driver using libblkio Stefan Hajnoczi
  2022-07-12 14:23   ` Stefano Garzarella
  2022-07-13 12:05   ` Hanna Reitz
@ 2022-07-27 19:33   ` Kevin Wolf
  2022-08-03 12:25     ` Peter Krempa
  2022-08-11 19:09     ` Stefan Hajnoczi
  2 siblings, 2 replies; 29+ messages in thread
From: Kevin Wolf @ 2022-07-27 19:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Markus Armbruster, Hanna Reitz,
	Fam Zheng, Yanan Wang, pkrempa

Am 08.07.2022 um 06:17 hat Stefan Hajnoczi geschrieben:
> libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> high-performance disk I/O. It currently supports io_uring and
> virtio-blk-vhost-vdpa with additional drivers under development.
> 
> One of the reasons for developing libblkio is that other applications
> besides QEMU can use it. This will be particularly useful for
> vhost-user-blk which applications may wish to use for connecting to
> qemu-storage-daemon.
> 
> libblkio also gives us an opportunity to develop in Rust behind a C API
> that is easy to consume from QEMU.
> 
> This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> using libblkio. It will be easy to add other libblkio drivers since they
> will share the majority of code.
> 
> For now I/O buffers are copied through bounce buffers if the libblkio
> driver requires it. Later commits add an optimization for
> pre-registering guest RAM to avoid bounce buffers.
> 
> The syntax is:
> 
>   --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> 
> and:
> 
>   --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

The subject line implies only io_uring, but you actually add vhost-vdpa
support, too. I think the subject line should be changed.

I think it would also make sense to already implement support for
vhost-user-blk on the QEMU side even if support isn't compiled in
libblkio by default and opening vhost-user-blk images would therefore
always fail with a default build.

But then you could run QEMU with a custom build of libblkio to make use
of it without patching QEMU. This is probably useful for getting libvirt
support for using a storage daemon implemented without having to wait
for another QEMU release. (Peter, do you have any opinion on this?)

Kevin




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-27 19:33   ` Kevin Wolf
@ 2022-08-03 12:25     ` Peter Krempa
  2022-08-03 13:30       ` Kevin Wolf
  2022-08-11 19:09     ` Stefan Hajnoczi
  1 sibling, 1 reply; 29+ messages in thread
From: Peter Krempa @ 2022-08-03 12:25 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, qemu-devel, Alberto Faria,
	Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin, Paolo Bonzini,
	Laurent Vivier, Eric Blake, sgarzare, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Markus Armbruster, Hanna Reitz,
	Fam Zheng, Yanan Wang

On Wed, Jul 27, 2022 at 21:33:40 +0200, Kevin Wolf wrote:
> Am 08.07.2022 um 06:17 hat Stefan Hajnoczi geschrieben:
> > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > high-performance disk I/O. It currently supports io_uring and
> > virtio-blk-vhost-vdpa with additional drivers under development.
> > 
> > One of the reasons for developing libblkio is that other applications
> > besides QEMU can use it. This will be particularly useful for
> > vhost-user-blk which applications may wish to use for connecting to
> > qemu-storage-daemon.
> > 
> > libblkio also gives us an opportunity to develop in Rust behind a C API
> > that is easy to consume from QEMU.
> > 
> > This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> > using libblkio. It will be easy to add other libblkio drivers since they
> > will share the majority of code.
> > 
> > For now I/O buffers are copied through bounce buffers if the libblkio
> > driver requires it. Later commits add an optimization for
> > pre-registering guest RAM to avoid bounce buffers.
> > 
> > The syntax is:
> > 
> >   --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> > 
> > and:
> > 
> >   --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> The subject line implies only io_uring, but you actually add vhost-vdpa
> support, too. I think the subject line should be changed.
> 
> I think it would also make sense to already implement support for
> vhost-user-blk on the QEMU side even if support isn't compiled in
> libblkio by default and opening vhost-user-blk images would therefore
> always fail with a default build.
> 
> But then you could run QEMU with a custom build of libblkio to make use
> of it without patching QEMU. This is probably useful for getting libvirt
> support for using a storage daemon implemented without having to wait
> for another QEMU release. (Peter, do you have any opinion on this?)

How will this work in terms of detecting whether that feature is
present?

The issue is that libvirt caches capabilities of qemu, and the cache is
invalidated based on the timestamp of the qemu binary (and a few other
properties, mostly of the host kernel and CPU). When a backend library
is updated or changed, libvirt will probably not be able to detect that
qemu gained support.

If qemu claims the support even though the backend library doesn't
provide it, then we have a problem: we aren't even able to see whether
we can use it.




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-08-03 12:25     ` Peter Krempa
@ 2022-08-03 13:30       ` Kevin Wolf
  0 siblings, 0 replies; 29+ messages in thread
From: Kevin Wolf @ 2022-08-03 13:30 UTC (permalink / raw)
  To: Peter Krempa
  Cc: Stefan Hajnoczi, qemu-devel, Alberto Faria,
	Vladimir Sementsov-Ogievskiy, Michael S. Tsirkin, Paolo Bonzini,
	Laurent Vivier, Eric Blake, sgarzare, Marcel Apfelbaum,
	Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Markus Armbruster, Hanna Reitz,
	Fam Zheng, Yanan Wang

Am 03.08.2022 um 14:25 hat Peter Krempa geschrieben:
> On Wed, Jul 27, 2022 at 21:33:40 +0200, Kevin Wolf wrote:
> > Am 08.07.2022 um 06:17 hat Stefan Hajnoczi geschrieben:
> > > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > > high-performance disk I/O. It currently supports io_uring and
> > > virtio-blk-vhost-vdpa with additional drivers under development.
> > > 
> > > One of the reasons for developing libblkio is that other applications
> > > besides QEMU can use it. This will be particularly useful for
> > > vhost-user-blk which applications may wish to use for connecting to
> > > qemu-storage-daemon.
> > > 
> > > libblkio also gives us an opportunity to develop in Rust behind a C API
> > > that is easy to consume from QEMU.
> > > 
> > > This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> > > using libblkio. It will be easy to add other libblkio drivers since they
> > > will share the majority of code.
> > > 
> > > For now I/O buffers are copied through bounce buffers if the libblkio
> > > driver requires it. Later commits add an optimization for
> > > pre-registering guest RAM to avoid bounce buffers.
> > > 
> > > The syntax is:
> > > 
> > >   --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> > > 
> > > and:
> > > 
> > >   --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> > > 
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > 
> > The subject line implies only io_uring, but you actually add vhost-vdpa
> > support, too. I think the subject line should be changed.
> > 
> > I think it would also make sense to already implement support for
> > vhost-user-blk on the QEMU side even if support isn't compiled in
> > libblkio by default and opening vhost-user-blk images would therefore
> > always fail with a default build.
> > 
> > But then you could run QEMU with a custom build of libblkio to make use
> > of it without patching QEMU. This is probably useful for getting libvirt
> > support for using a storage daemon implemented without having to wait
> > for another QEMU release. (Peter, do you have any opinion on this?)
> 
> How will this work in terms of detecting whether that feature is
> present?
> 
> The issue is that libvirt caches capabilities of qemu and the cache is
> invalidated based on the timestamp of the qemu binary (and few other
> mostly host kernel and cpu properties). In case when a backend library
> is updated/changed this probably means that libvirt will not be able to
> detect that qemu gained support.

How is this done with other libraries? We use a few more storage
libraries, and depending on their version, we may or may not be able to
provide some feature. I assume we have always just ignored this: if you
don't have the right version, you get runtime errors.

> In case when qemu lies about the support even if the backend library
> doesn't suport it then we have a problem in not being even able to see
> whether we can use it.

I'm not sure I would call it "lying"; it's just that we have a static
QAPI schema that can only represent what the QEMU binary could
theoretically handle, but not dynamically what is actually available at
runtime.

Another option would be to either add an API to libblkio that returns a
list of supported drivers or probe it with a pair of blkio_create() and
blkio_destroy() before registering the QEMU drivers. QEMU and qemu-img
can print a list of registered read-write and read-only block drivers
and I think libvirt has been using that?
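
Roughly like this (sketch; blkio_driver_available() is a made-up helper
name):

    static bool blkio_driver_available(const char *driver)
    {
        struct blkio *b;

        if (blkio_create(driver, &b) < 0) {
            return false;
        }
        blkio_destroy(&b);
        return true;
    }

    static void bdrv_blkio_init(void)
    {
        if (blkio_driver_available("io_uring")) {
            bdrv_register(&bdrv_io_uring);
        }
        if (blkio_driver_available("virtio-blk-vhost-vdpa")) {
            bdrv_register(&bdrv_virtio_blk_vhost_vdpa);
        }
    }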

Of course, it doesn't change anything about the fact that this list
can change between two QEMU runs if you replace the library, but don't
touch QEMU.

Kevin




* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-12 14:23   ` Stefano Garzarella
@ 2022-08-11 16:51     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-11 16:51 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang


On Tue, Jul 12, 2022 at 04:23:32PM +0200, Stefano Garzarella wrote:
> On Fri, Jul 08, 2022 at 05:17:30AM +0100, Stefan Hajnoczi wrote:
> > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > high-performance disk I/O. It currently supports io_uring and
> > virtio-blk-vhost-vdpa with additional drivers under development.
> > 
> > One of the reasons for developing libblkio is that other applications
> > besides QEMU can use it. This will be particularly useful for
> > vhost-user-blk which applications may wish to use for connecting to
> > qemu-storage-daemon.
> > 
> > libblkio also gives us an opportunity to develop in Rust behind a C API
> > that is easy to consume from QEMU.
> > 
> > This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> > using libblkio. It will be easy to add other libblkio drivers since they
> > will share the majority of code.
> > 
> > For now I/O buffers are copied through bounce buffers if the libblkio
> > driver requires it. Later commits add an optimization for
> > pre-registering guest RAM to avoid bounce buffers.
> > 
> > The syntax is:
> > 
> >  --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> > 
> > and:
> > 
> >  --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> > MAINTAINERS                   |   6 +
> > meson_options.txt             |   2 +
> > qapi/block-core.json          |  37 +-
> > meson.build                   |   9 +
> > block/blkio.c                 | 659 ++++++++++++++++++++++++++++++++++
> > tests/qtest/modules-test.c    |   3 +
> > block/meson.build             |   1 +
> > scripts/meson-buildoptions.sh |   3 +
> > 8 files changed, 718 insertions(+), 2 deletions(-)
> > create mode 100644 block/blkio.c
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 450abd0252..50f340d9ee 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3395,6 +3395,12 @@ L: qemu-block@nongnu.org
> > S: Maintained
> > F: block/vdi.c
> > 
> > +blkio
> > +M: Stefan Hajnoczi <stefanha@redhat.com>
> > +L: qemu-block@nongnu.org
> > +S: Maintained
> > +F: block/blkio.c
> > +
> > iSCSI
> > M: Ronnie Sahlberg <ronniesahlberg@gmail.com>
> > M: Paolo Bonzini <pbonzini@redhat.com>
> > diff --git a/meson_options.txt b/meson_options.txt
> > index 97c38109b1..b0b2e0c9b5 100644
> > --- a/meson_options.txt
> > +++ b/meson_options.txt
> > @@ -117,6 +117,8 @@ option('bzip2', type : 'feature', value : 'auto',
> >        description: 'bzip2 support for DMG images')
> > option('cap_ng', type : 'feature', value : 'auto',
> >        description: 'cap_ng support')
> > +option('blkio', type : 'feature', value : 'auto',
> > +       description: 'libblkio block device driver')
> > option('bpf', type : 'feature', value : 'auto',
> >         description: 'eBPF support')
> > option('cocoa', type : 'feature', value : 'auto',
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 2173e7734a..aa63d5e9bd 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -2951,11 +2951,15 @@
> >             'file', 'snapshot-access', 'ftp', 'ftps', 'gluster',
> >             {'name': 'host_cdrom', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
> >             {'name': 'host_device', 'if': 'HAVE_HOST_BLOCK_DEVICE' },
> > -            'http', 'https', 'iscsi',
> > +            'http', 'https',
> > +            { 'name': 'io_uring', 'if': 'CONFIG_BLKIO' },
> > +            'iscsi',
> >             'luks', 'nbd', 'nfs', 'null-aio', 'null-co', 'nvme', 'parallels',
> >             'preallocate', 'qcow', 'qcow2', 'qed', 'quorum', 'raw', 'rbd',
> >             { 'name': 'replication', 'if': 'CONFIG_REPLICATION' },
> > -            'ssh', 'throttle', 'vdi', 'vhdx', 'vmdk', 'vpc', 'vvfat' ] }
> > +            'ssh', 'throttle', 'vdi', 'vhdx',
> > +            { 'name': 'virtio-blk-vhost-vdpa', 'if': 'CONFIG_BLKIO' },
> > +            'vmdk', 'vpc', 'vvfat' ] }
> > 
> > ##
> > # @BlockdevOptionsFile:
> > @@ -3678,6 +3682,30 @@
> >             '*debug': 'int',
> >             '*logfile': 'str' } }
> > 
> > +##
> > +# @BlockdevOptionsIoUring:
> > +#
> > +# Driver specific block device options for the io_uring backend.
> > +#
> > +# @filename: path to the image file
> > +#
> > +# Since: 7.1
> > +##
> > +{ 'struct': 'BlockdevOptionsIoUring',
> > +  'data': { 'filename': 'str' } }
> > +
> > +##
> > +# @BlockdevOptionsVirtioBlkVhostVdpa:
> > +#
> > +# Driver specific block device options for the virtio-blk-vhost-vdpa backend.
> > +#
> > +# @path: path to the vhost-vdpa character device.
> > +#
> > +# Since: 7.1
> > +##
> > +{ 'struct': 'BlockdevOptionsVirtioBlkVhostVdpa',
> > +  'data': { 'path': 'str' } }
> > +
> > ##
> > # @IscsiTransport:
> > #
> > @@ -4305,6 +4333,8 @@
> >                        'if': 'HAVE_HOST_BLOCK_DEVICE' },
> >       'http':       'BlockdevOptionsCurlHttp',
> >       'https':      'BlockdevOptionsCurlHttps',
> > +      'io_uring':   { 'type': 'BlockdevOptionsIoUring',
> > +                      'if': 'CONFIG_BLKIO' },
> >       'iscsi':      'BlockdevOptionsIscsi',
> >       'luks':       'BlockdevOptionsLUKS',
> >       'nbd':        'BlockdevOptionsNbd',
> > @@ -4327,6 +4357,9 @@
> >       'throttle':   'BlockdevOptionsThrottle',
> >       'vdi':        'BlockdevOptionsGenericFormat',
> >       'vhdx':       'BlockdevOptionsGenericFormat',
> > +      'virtio-blk-vhost-vdpa':
> > +                    { 'type': 'BlockdevOptionsVirtioBlkVhostVdpa',
> > +                      'if': 'CONFIG_BLKIO' },
> >       'vmdk':       'BlockdevOptionsGenericCOWFormat',
> >       'vpc':        'BlockdevOptionsGenericFormat',
> >       'vvfat':      'BlockdevOptionsVVFAT'
> > diff --git a/meson.build b/meson.build
> > index bc5569ace1..f09b009428 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -713,6 +713,13 @@ if not get_option('virglrenderer').auto() or have_system or have_vhost_user_gpu
> >                      required: get_option('virglrenderer'),
> >                      kwargs: static_kwargs)
> > endif
> > +blkio = not_found
> > +if not get_option('blkio').auto() or have_block
> > +  blkio = dependency('blkio',
> > +                     method: 'pkg-config',
> > +                     required: get_option('blkio'),
> > +                     kwargs: static_kwargs)
> > +endif
> > curl = not_found
> > if not get_option('curl').auto() or have_block
> >   curl = dependency('libcurl', version: '>=7.29.0',
> > @@ -1755,6 +1762,7 @@ config_host_data.set('CONFIG_LIBUDEV', libudev.found())
> > config_host_data.set('CONFIG_LZO', lzo.found())
> > config_host_data.set('CONFIG_MPATH', mpathpersist.found())
> > config_host_data.set('CONFIG_MPATH_NEW_API', mpathpersist_new_api)
> > +config_host_data.set('CONFIG_BLKIO', blkio.found())
> > config_host_data.set('CONFIG_CURL', curl.found())
> > config_host_data.set('CONFIG_CURSES', curses.found())
> > config_host_data.set('CONFIG_GBM', gbm.found())
> > @@ -3909,6 +3917,7 @@ summary_info += {'PAM':               pam}
> > summary_info += {'iconv support':     iconv}
> > summary_info += {'curses support':    curses}
> > summary_info += {'virgl support':     virgl}
> > +summary_info += {'blkio support':     blkio}
> > summary_info += {'curl support':      curl}
> > summary_info += {'Multipath support': mpathpersist}
> > summary_info += {'PNG support':       png}
> > diff --git a/block/blkio.c b/block/blkio.c
> > new file mode 100644
> > index 0000000000..7fbdbd7fae
> > --- /dev/null
> > +++ b/block/blkio.c
> > @@ -0,0 +1,659 @@
> > +#include "qemu/osdep.h"
> > +#include <blkio.h>
> > +#include "block/block_int.h"
> > +#include "qapi/error.h"
> > +#include "qapi/qmp/qdict.h"
> > +#include "qemu/module.h"
> > +
> > +typedef struct BlkAIOCB {
> > +    BlockAIOCB common;
> > +    struct blkio_mem_region mem_region;
> > +    QEMUIOVector qiov;
> > +    struct iovec bounce_iov;
> > +} BlkioAIOCB;
> > +
> > +typedef struct {
> > +    /* Protects ->blkio and request submission on ->blkioq */
> > +    QemuMutex lock;
> > +
> > +    struct blkio *blkio;
> > +    struct blkioq *blkioq; /* this could be multi-queue in the future */
> > +    int completion_fd;
> > +
> > +    /* Polling fetches the next completion into this field */
> > +    struct blkio_completion poll_completion;
> > +
> > +    /* The value of the "mem-region-alignment" property */
> > +    size_t mem_region_alignment;
> > +
> > +    /* Can we skip adding/deleting blkio_mem_regions? */
> > +    bool needs_mem_regions;
> > +} BDRVBlkioState;
> > +
> > +static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
> > +{
> > +    /* Copy bounce buffer back to qiov */
> > +    if (acb->qiov.niov > 0) {
> > +        qemu_iovec_from_buf(&acb->qiov, 0,
> > +                acb->bounce_iov.iov_base,
> > +                acb->bounce_iov.iov_len);
> > +        qemu_iovec_destroy(&acb->qiov);
> > +    }
> > +
> > +    acb->common.cb(acb->common.opaque, ret);
> > +
> > +    if (acb->mem_region.len > 0) {
> > +        BDRVBlkioState *s = acb->common.bs->opaque;
> > +
> > +        WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +            blkio_free_mem_region(s->blkio, &acb->mem_region);
> > +        }
> > +    }
> > +
> > +    qemu_aio_unref(&acb->common);
> > +}
> > +
> > +/*
> > + * Only the thread that calls aio_poll() invokes fd and poll handlers.
> > + * Therefore locks are not necessary except when accessing s->blkio.
> > + *
> > + * No locking is performed around blkioq_get_completions() although other
> > + * threads may submit I/O requests on s->blkioq. We're assuming there is no
> > > + * interference between blkioq_get_completions() and other s->blkioq APIs.
> > + */
> > +
> > +static void blkio_completion_fd_read(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct blkio_completion completion;
> > +    uint64_t val;
> > +    ssize_t ret __attribute__((unused));
> > +
> > +    /* Polling may have already fetched a completion */
> > +    if (s->poll_completion.user_data != NULL) {
> > +        completion = s->poll_completion;
> > +
> > +        /* Clear it in case blkio_aiocb_complete() has a nested event loop */
> > +        s->poll_completion.user_data = NULL;
> > +
> > +        blkio_aiocb_complete(completion.user_data, completion.ret);
> > +    }
> > +
> > +    /* Reset completion fd status */
> > +    ret = read(s->completion_fd, &val, sizeof(val));
> > +
> > +    /*
> > +     * Reading one completion at a time makes nested event loop re-entrancy
> > +     * simple. Change this loop to get multiple completions in one go if it
> > +     * becomes a performance bottleneck.
> > +     */
> > +    while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
> > +        blkio_aiocb_complete(completion.user_data, completion.ret);
> > +    }
> > +}
> > +
> > +static bool blkio_completion_fd_poll(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    /* Just in case we already fetched a completion */
> > +    if (s->poll_completion.user_data != NULL) {
> > +        return true;
> > +    }
> > +
> > +    return blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL) == 1;
> > +}
> > +
> > +static void blkio_completion_fd_poll_ready(void *opaque)
> > +{
> > +    blkio_completion_fd_read(opaque);
> > +}
> > +
> > +static void blkio_attach_aio_context(BlockDriverState *bs,
> > +                                     AioContext *new_context)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    aio_set_fd_handler(new_context,
> > +                       s->completion_fd,
> > +                       false,
> > +                       blkio_completion_fd_read,
> > +                       NULL,
> > +                       blkio_completion_fd_poll,
> > +                       blkio_completion_fd_poll_ready,
> > +                       bs);
> > +}
> > +
> > +static void blkio_detach_aio_context(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    aio_set_fd_handler(bdrv_get_aio_context(bs),
> > +                       s->completion_fd,
> > +                       false, NULL, NULL, NULL, NULL, NULL);
> > +}
> > +
> > +static const AIOCBInfo blkio_aiocb_info = {
> > +    .aiocb_size = sizeof(BlkioAIOCB),
> > +};
> > +
> > +/* Create a BlkioAIOCB */
> > +static BlkioAIOCB *blkio_aiocb_get(BlockDriverState *bs,
> > +                                   BlockCompletionFunc *cb,
> > +                                   void *opaque)
> > +{
> > +    BlkioAIOCB *acb = qemu_aio_get(&blkio_aiocb_info, bs, cb, opaque);
> > +
> > +    /* A few fields need to be initialized, leave the rest... */
> > +    acb->qiov.niov = 0;
> > +    acb->mem_region.len = 0;
> > +    return acb;
> > +}
> > +
> > +/* s->lock must be held */
> > +static int blkio_aiocb_init_mem_region_locked(BlkioAIOCB *acb, size_t len)
> > +{
> > +    BDRVBlkioState *s = acb->common.bs->opaque;
> > +    size_t mem_region_len = QEMU_ALIGN_UP(len, s->mem_region_alignment);
> > +    int ret;
> > +
> > +    ret = blkio_alloc_mem_region(s->blkio, &acb->mem_region, mem_region_len);
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    acb->bounce_iov.iov_base = acb->mem_region.addr;
> > +    acb->bounce_iov.iov_len = len;
> > +    return 0;
> > +}
> > +
> > +/* Call this to submit I/O after enqueuing a new request */
> > +static void blkio_submit_io(BlockDriverState *bs)
> > +{
> > +    if (qatomic_read(&bs->io_plugged) == 0) {
> > +        BDRVBlkioState *s = bs->opaque;
> > +
> > +        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
> > +    }
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_pdiscard(BlockDriverState *bs, int64_t offset,
> > +        int bytes, BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +    blkioq_discard(s->blkioq, offset, bytes, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> > +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> > +        BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct iovec *iov = qiov->iov;
> > +    int iovcnt = qiov->niov;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    if (s->needs_mem_regions) {
> > +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> > +            qemu_aio_unref(&acb->common);
> > +            return NULL;
> > +        }
> > +
> > +        /* Copy qiov because we'll call qemu_iovec_from_buf() on completion */
> > +        qemu_iovec_init_slice(&acb->qiov, qiov, 0, qiov->size);
> > +
> > +        iov = &acb->bounce_iov;
> > +        iovcnt = 1;
> > +    }
> > +
> > +    blkioq_readv(s->blkioq, offset, iov, iovcnt, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> > +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> > +        BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct iovec *iov = qiov->iov;
> > +    int iovcnt = qiov->niov;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    if (s->needs_mem_regions) {
> > +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> > +            qemu_aio_unref(&acb->common);
> > +            return NULL;
> > +        }
> > +
> > +        qemu_iovec_to_buf(qiov, 0, acb->bounce_iov.iov_base, bytes);
> > +
> > +        iov = &acb->bounce_iov;
> > +        iovcnt = 1;
> > +    }
> > +
> > +    blkioq_writev(s->blkioq, offset, iov, iovcnt, acb, blkio_flags);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_flush(BlockDriverState *bs,
> > +                                   BlockCompletionFunc *cb,
> > +                                   void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    blkioq_flush(s->blkioq, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +/* For async to .bdrv_co_*() conversion */
> > +typedef struct {
> > +    Coroutine *coroutine;
> > +    int ret;
> > +} BlkioCoData;
> > +
> > +static void blkio_co_pwrite_zeroes_complete(void *opaque, int ret)
> > +{
> > +    BlkioCoData *data = opaque;
> > +
> > +    data->ret = ret;
> > +    aio_co_wake(data->coroutine);
> > +}
> > +
> > +static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
> > +    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioCoData data = {
> > +        .coroutine = qemu_coroutine_self(),
> > +    };
> > +    uint32_t blkio_flags = 0;
> > +
> > +    if (flags & BDRV_REQ_FUA) {
> > +        blkio_flags |= BLKIO_REQ_FUA;
> > +    }
> > +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> > +        blkio_flags |= BLKIO_REQ_NO_UNMAP;
> > +    }
> > +    if (flags & BDRV_REQ_NO_FALLBACK) {
> > +        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
> > +    }
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        BlkioAIOCB *acb =
> > +            blkio_aiocb_get(bs, blkio_co_pwrite_zeroes_complete, &data);
> > +        blkioq_write_zeroes(s->blkioq, offset, bytes, acb, blkio_flags);
> > +        blkio_submit_io(bs);
> > +    }
> > +
> > +    qemu_coroutine_yield();
> > +    return data.ret;
> > +}
> > +
> > +static void blkio_io_unplug(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        blkio_submit_io(bs);
> > +    }
> > +}
> > +
> > +static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
> > +                                          Error **errp)
> > +{
> > +    bdrv_parse_filename_strip_prefix(filename, "io_uring:", options);
> > +}
> > +
> > +static void blkio_parse_filename_virtio_blk_vhost_vdpa(
> > +        const char *filename,
> > +        QDict *options,
> > +        Error **errp)
> > +{
> > +    bdrv_parse_filename_strip_prefix(filename, "virtio-blk-vhost-vdpa:", options);
> > +}
> > +
> > +static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
> > +                               Error **errp)
> > +{
> > +    const char *filename = qdict_get_try_str(options, "filename");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_set_str(s->blkio, "path", filename);
> > +    qdict_del(options, "filename");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (flags & BDRV_O_NOCACHE) {
> > +        ret = blkio_set_bool(s->blkio, "direct", true);
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "failed to set direct: %s",
> > +                             blkio_get_error_msg());
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
> > +        QDict *options, int flags, Error **errp)
> > +{
> > +    const char *path = qdict_get_try_str(options, "path");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_set_str(s->blkio, "path", path);
> > +    qdict_del(options, "path");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (flags & BDRV_O_NOCACHE) {
> > +        error_setg(errp, "cache.direct=off is not supported");
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
> > +                           Error **errp)
> > +{
> > +    const char *blkio_driver = bs->drv->protocol_name;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_create(blkio_driver, &s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_create failed: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (strcmp(blkio_driver, "io_uring") == 0) {
> > +        ret = blkio_io_uring_open(bs, options, flags, errp);
> > +    } else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
> > +        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
> > +    }
> > +    if (ret < 0) {
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    if (!(flags & BDRV_O_RDWR)) {
> > +        ret = blkio_set_bool(s->blkio, "readonly", true);
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "failed to set readonly: %s",
> > +                             blkio_get_error_msg());
> > +            blkio_destroy(&s->blkio);
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    ret = blkio_connect(s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_get_bool(s->blkio,
> > +                         "needs-mem-regions",
> > +                         &s->needs_mem_regions);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get needs-mem-regions: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_get_uint64(s->blkio,
> > +                           "mem-region-alignment",
> > +                           &s->mem_region_alignment);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get mem-region-alignment: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_start(s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_start failed: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    bs->supported_write_flags = BDRV_REQ_FUA;
> > +    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
> > +                               BDRV_REQ_NO_FALLBACK;
> > +
> > +    qemu_mutex_init(&s->lock);
> > +    s->blkioq = blkio_get_queue(s->blkio, 0);
> > +    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
> > +
> > +    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
> > +    return 0;
> > +}
> > +
> > +static void blkio_close(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    qemu_mutex_destroy(&s->lock);
> > +    blkio_destroy(&s->blkio);
> > +}
> > +
> > +static int64_t blkio_getlength(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    uint64_t capacity;
> > +    int ret;
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
> > +    }
> > +    if (ret < 0) {
> > +        return -ret;
> > +    }
> > +
> > +    return capacity;
> > +}
> > +
> > +static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int value;
> > +    int ret;
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "request-alignment",
> > +                        (int *)&bs->bl.request_alignment);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (bs->bl.request_alignment < 1 ||
> > +        bs->bl.request_alignment >= INT_MAX ||
> > +        !is_power_of_2(bs->bl.request_alignment)) {
> > +        error_setg(errp, "invalid \"request-alignment\" value %d, must be "
> > +                   "power of 2 less than INT_MAX", bs->bl.request_alignment);
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "optimal-io-size",
> > +                        (int *)&bs->bl.opt_transfer);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (bs->bl.opt_transfer > INT_MAX ||
> > +        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
> > +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be a "
> > +                   "multiple of %d", bs->bl.opt_transfer,
> > +                   bs->bl.request_alignment);
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "max-transfer",
> > +                        (int *)&bs->bl.max_transfer);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
> > +        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
> > +        error_setg(errp, "invalid \"max-transfer\" value %d, must be a "
> > +                   "multiple of %d and %d (if non-zero)",
> > +                   bs->bl.max_transfer, bs->bl.request_alignment,
> > +                   bs->bl.opt_transfer);
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
> > +                   "positive", value);
> > +        return;
> > +    }
> > +    bs->bl.min_mem_alignment = value;
> > +
> > +    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get \"optimal-buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
> > +                   "must be positive", value);
> > +        return;
> > +    }
> > +    bs->bl.opt_mem_alignment = value;
> > +
> > +    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
> > +                   bs->bl.max_iov);
> > +        return;
> > +    }
> > +}
> > +
> > +/*
> > + * TODO
> > + * Missing libblkio APIs:
> > + * - write zeroes
> > + * - discard
> > + * - block_status
> > + * - co_invalidate_cache
> > + *
> > + * Out of scope?
> > + * - create
> > + * - truncate
> > + */
> > +
> > +static BlockDriver bdrv_io_uring = {
> > +    .format_name                = "io_uring",
> > +    .protocol_name              = "io_uring",
> > +    .instance_size              = sizeof(BDRVBlkioState),
> > +    .bdrv_needs_filename        = true,
> > +    .bdrv_parse_filename        = blkio_parse_filename_io_uring,
> > +    .bdrv_file_open             = blkio_file_open,
> > +    .bdrv_close                 = blkio_close,
> > +    .bdrv_getlength             = blkio_getlength,
> > +    .bdrv_get_info              = blkio_get_info,
> > +    .bdrv_attach_aio_context    = blkio_attach_aio_context,
> > +    .bdrv_detach_aio_context    = blkio_detach_aio_context,
> > +    .bdrv_aio_pdiscard          = blkio_aio_pdiscard,
> > +    .bdrv_aio_preadv            = blkio_aio_preadv,
> > +    .bdrv_aio_pwritev           = blkio_aio_pwritev,
> > +    .bdrv_aio_flush             = blkio_aio_flush,
> > +    .bdrv_co_pwrite_zeroes      = blkio_co_pwrite_zeroes,
> > +    .bdrv_io_unplug             = blkio_io_unplug,
> > +    .bdrv_refresh_limits        = blkio_refresh_limits,
> > +};
> > +
> > +static BlockDriver bdrv_virtio_blk_vhost_vdpa = {
> > +    .format_name                = "virtio-blk-vhost-vdpa",
> > +    .protocol_name              = "virtio-blk-vhost-vdpa",
> > +    .instance_size              = sizeof(BDRVBlkioState),
> > +    .bdrv_needs_filename        = true,
> 
> Should we set `.bdrv_needs_filename` to false for
> `bdrv_virtio_blk_vhost_vdpa`?
> 
> I have this error:
>     qemu-system-x86_64: -blockdev
> node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0:
> The 'virtio-blk-vhost-vdpa' block driver requires a file name

Yes.

> > +    .bdrv_parse_filename        = blkio_parse_filename_virtio_blk_vhost_vdpa,
> 
> For my education: since virtio-blk-vhost-vdpa doesn't use the filename
> parameter, do we still need to set .bdrv_parse_filename?

.bdrv_parse_filename is for converting filename strings to the richer
options structure that QAPI/JSON interfaces offer. We don't need to do
that here and have no "filename" parameter for virtio-blk-vhost-vdpa. I
think it's safe to drop .bdrv_parse_filename().

Stefan


* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-13 12:05   ` Hanna Reitz
@ 2022-08-11 19:08     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-11 19:08 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On Wed, Jul 13, 2022 at 02:05:18PM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > high-performance disk I/O. It currently supports io_uring and
> > virtio-blk-vhost-vdpa with additional drivers under development.
> > 
> > One of the reasons for developing libblkio is that other applications
> > besides QEMU can use it. This will be particularly useful for
> > vhost-user-blk which applications may wish to use for connecting to
> > qemu-storage-daemon.
> > 
> > libblkio also gives us an opportunity to develop in Rust behind a C API
> > that is easy to consume from QEMU.
> > 
> > This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> > using libblkio. It will be easy to add other libblkio drivers since they
> > will share the majority of code.
> > 
> > For now I/O buffers are copied through bounce buffers if the libblkio
> > driver requires it. Later commits add an optimization for
> > pre-registering guest RAM to avoid bounce buffers.
> > 
> > The syntax is:
> > 
> >    --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> > 
> > and:
> > 
> >    --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   MAINTAINERS                   |   6 +
> >   meson_options.txt             |   2 +
> >   qapi/block-core.json          |  37 +-
> >   meson.build                   |   9 +
> >   block/blkio.c                 | 659 ++++++++++++++++++++++++++++++++++
> >   tests/qtest/modules-test.c    |   3 +
> >   block/meson.build             |   1 +
> >   scripts/meson-buildoptions.sh |   3 +
> >   8 files changed, 718 insertions(+), 2 deletions(-)
> >   create mode 100644 block/blkio.c
> 
> [...]
> 
> > diff --git a/block/blkio.c b/block/blkio.c
> > new file mode 100644
> > index 0000000000..7fbdbd7fae
> > --- /dev/null
> > +++ b/block/blkio.c
> > @@ -0,0 +1,659 @@
> 
> Not sure whether it’s necessary, but I would have expected a copyright
> header here.

Thanks for reminding me, I will add a header.

> 
> > +#include "qemu/osdep.h"
> > +#include <blkio.h>
> > +#include "block/block_int.h"
> > +#include "qapi/error.h"
> > +#include "qapi/qmp/qdict.h"
> > +#include "qemu/module.h"
> > +
> > +typedef struct BlkAIOCB {
> > +    BlockAIOCB common;
> > +    struct blkio_mem_region mem_region;
> > +    QEMUIOVector qiov;
> > +    struct iovec bounce_iov;
> > +} BlkioAIOCB;
> > +
> > +typedef struct {
> > +    /* Protects ->blkio and request submission on ->blkioq */
> > +    QemuMutex lock;
> > +
> > +    struct blkio *blkio;
> > +    struct blkioq *blkioq; /* this could be multi-queue in the future */
> > +    int completion_fd;
> > +
> > +    /* Polling fetches the next completion into this field */
> > +    struct blkio_completion poll_completion;
> > +
> > +    /* The value of the "mem-region-alignment" property */
> > +    size_t mem_region_alignment;
> > +
> > +    /* Can we skip adding/deleting blkio_mem_regions? */
> > +    bool needs_mem_regions;
> > +} BDRVBlkioState;
> > +
> > +static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
> > +{
> > +    /* Copy bounce buffer back to qiov */
> > +    if (acb->qiov.niov > 0) {
> > +        qemu_iovec_from_buf(&acb->qiov, 0,
> > +                acb->bounce_iov.iov_base,
> > +                acb->bounce_iov.iov_len);
> > +        qemu_iovec_destroy(&acb->qiov);
> > +    }
> > +
> > +    acb->common.cb(acb->common.opaque, ret);
> > +
> > +    if (acb->mem_region.len > 0) {
> > +        BDRVBlkioState *s = acb->common.bs->opaque;
> > +
> > +        WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +            blkio_free_mem_region(s->blkio, &acb->mem_region);
> > +        }
> > +    }
> > +
> > +    qemu_aio_unref(&acb->common);
> > +}
> > +
> > +/*
> > + * Only the thread that calls aio_poll() invokes fd and poll handlers.
> > + * Therefore locks are not necessary except when accessing s->blkio.
> > + *
> > + * No locking is performed around blkioq_get_completions() although other
> > + * threads may submit I/O requests on s->blkioq. We're assuming there is no
> > + * interference between blkioq_get_completions() and other s->blkioq APIs.
> > + */
> > +
> > +static void blkio_completion_fd_read(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct blkio_completion completion;
> > +    uint64_t val;
> > +    ssize_t ret __attribute__((unused));
> 
> I’d prefer a `(void)ret;` over this attribute, not least because that line
> would give a nice opportunity to explain in a short comment why we ignore
> this return value that the compiler tells us not to ignore, but if you
> don’t, then this’ll be fine.

Okay, I'll use (void)ret; and add a comment.
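
Something along these lines (a sketch of the comment I have in mind):

    uint64_t val;
    ssize_t ret;

    /*
     * Reset the completion fd status. The return value is ignored on
     * purpose: a short or failed read just means the eventfd was not
     * pending, which is harmless here.
     */
    ret = read(s->completion_fd, &val, sizeof(val));
    (void)ret;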

> 
> > +
> > +    /* Polling may have already fetched a completion */
> > +    if (s->poll_completion.user_data != NULL) {
> > +        completion = s->poll_completion;
> > +
> > +        /* Clear it in case blkio_aiocb_complete() has a nested event loop */
> > +        s->poll_completion.user_data = NULL;
> > +
> > +        blkio_aiocb_complete(completion.user_data, completion.ret);
> > +    }
> > +
> > +    /* Reset completion fd status */
> > +    ret = read(s->completion_fd, &val, sizeof(val));
> > +
> > +    /*
> > +     * Reading one completion at a time makes nested event loop re-entrancy
> > +     * simple. Change this loop to get multiple completions in one go if it
> > +     * becomes a performance bottleneck.
> > +     */
> > +    while (blkioq_do_io(s->blkioq, &completion, 0, 1, NULL) == 1) {
> > +        blkio_aiocb_complete(completion.user_data, completion.ret);
> > +    }
> > +}
> > +
> > +static bool blkio_completion_fd_poll(void *opaque)
> > +{
> > +    BlockDriverState *bs = opaque;
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    /* Just in case we already fetched a completion */
> > +    if (s->poll_completion.user_data != NULL) {
> > +        return true;
> > +    }
> > +
> > +    return blkioq_do_io(s->blkioq, &s->poll_completion, 0, 1, NULL) == 1;
> > +}
> > +
> > +static void blkio_completion_fd_poll_ready(void *opaque)
> > +{
> > +    blkio_completion_fd_read(opaque);
> > +}
> > +
> > +static void blkio_attach_aio_context(BlockDriverState *bs,
> > +                                     AioContext *new_context)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    aio_set_fd_handler(new_context,
> > +                       s->completion_fd,
> > +                       false,
> > +                       blkio_completion_fd_read,
> > +                       NULL,
> > +                       blkio_completion_fd_poll,
> > +                       blkio_completion_fd_poll_ready,
> > +                       bs);
> > +}
> > +
> > +static void blkio_detach_aio_context(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    aio_set_fd_handler(bdrv_get_aio_context(bs),
> > +                       s->completion_fd,
> > +                       false, NULL, NULL, NULL, NULL, NULL);
> > +}
> > +
> > +static const AIOCBInfo blkio_aiocb_info = {
> > +    .aiocb_size = sizeof(BlkioAIOCB),
> > +};
> > +
> > +/* Create a BlkioAIOCB */
> > +static BlkioAIOCB *blkio_aiocb_get(BlockDriverState *bs,
> > +                                   BlockCompletionFunc *cb,
> > +                                   void *opaque)
> > +{
> > +    BlkioAIOCB *acb = qemu_aio_get(&blkio_aiocb_info, bs, cb, opaque);
> > +
> > +    /* A few fields need to be initialized, leave the rest... */
> > +    acb->qiov.niov = 0;
> > +    acb->mem_region.len = 0;
> > +    return acb;
> > +}
> > +
> > +/* s->lock must be held */
> > +static int blkio_aiocb_init_mem_region_locked(BlkioAIOCB *acb, size_t len)
> > +{
> > +    BDRVBlkioState *s = acb->common.bs->opaque;
> > +    size_t mem_region_len = QEMU_ALIGN_UP(len, s->mem_region_alignment);
> > +    int ret;
> > +
> > +    ret = blkio_alloc_mem_region(s->blkio, &acb->mem_region, mem_region_len);
> 
> I don’t find the blkio doc clear on whether this function is sufficiently
> fast to be used in an I/O path.  Is it?
> 
> (Or is this perhaps addressed in a later function in this series?)

It can be used from the I/O path but it may add overhead (depending on
the libblkio driver).

The later patches use .bdrv_register_buf() to avoid calling
blkio_alloc_mem_region() from the I/O path.

> 
> > +    if (ret < 0) {
> > +        return ret;
> > +    }
> > +
> > +    acb->bounce_iov.iov_base = acb->mem_region.addr;
> > +    acb->bounce_iov.iov_len = len;
> > +    return 0;
> > +}
> > +
> > +/* Call this to submit I/O after enqueuing a new request */
> > +static void blkio_submit_io(BlockDriverState *bs)
> > +{
> > +    if (qatomic_read(&bs->io_plugged) == 0) {
> > +        BDRVBlkioState *s = bs->opaque;
> > +
> > +        blkioq_do_io(s->blkioq, NULL, 0, 0, NULL);
> > +    }
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_pdiscard(BlockDriverState *bs, int64_t offset,
> > +        int bytes, BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +    blkioq_discard(s->blkioq, offset, bytes, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> > +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> > +        BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct iovec *iov = qiov->iov;
> > +    int iovcnt = qiov->niov;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    if (s->needs_mem_regions) {
> > +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> > +            qemu_aio_unref(&acb->common);
> > +            return NULL;
> > +        }
> > +
> > +        /* Copy qiov because we'll call qemu_iovec_from_buf() on completion */
> > +        qemu_iovec_init_slice(&acb->qiov, qiov, 0, qiov->size);
> > +
> > +        iov = &acb->bounce_iov;
> > +        iovcnt = 1;
> > +    }
> > +
> > +    blkioq_readv(s->blkioq, offset, iov, iovcnt, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> > +        int64_t bytes, QEMUIOVector *qiov, BdrvRequestFlags flags,
> > +        BlockCompletionFunc *cb, void *opaque)
> > +{
> > +    uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    struct iovec *iov = qiov->iov;
> > +    int iovcnt = qiov->niov;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    if (s->needs_mem_regions) {
> > +        if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> > +            qemu_aio_unref(&acb->common);
> > +            return NULL;
> > +        }
> > +
> > +        qemu_iovec_to_buf(qiov, 0, acb->bounce_iov.iov_base, bytes);
> > +
> > +        iov = &acb->bounce_iov;
> > +        iovcnt = 1;
> > +    }
> > +
> > +    blkioq_writev(s->blkioq, offset, iov, iovcnt, acb, blkio_flags);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +static BlockAIOCB *blkio_aio_flush(BlockDriverState *bs,
> > +                                   BlockCompletionFunc *cb,
> > +                                   void *opaque)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioAIOCB *acb;
> > +
> > +    QEMU_LOCK_GUARD(&s->lock);
> > +
> > +    acb = blkio_aiocb_get(bs, cb, opaque);
> > +
> > +    blkioq_flush(s->blkioq, acb, 0);
> > +    blkio_submit_io(bs);
> > +    return &acb->common;
> > +}
> > +
> > +/* For async to .bdrv_co_*() conversion */
> > +typedef struct {
> > +    Coroutine *coroutine;
> > +    int ret;
> > +} BlkioCoData;
> > +
> > +static void blkio_co_pwrite_zeroes_complete(void *opaque, int ret)
> > +{
> > +    BlkioCoData *data = opaque;
> > +
> > +    data->ret = ret;
> > +    aio_co_wake(data->coroutine);
> > +}
> > +
> > +static int coroutine_fn blkio_co_pwrite_zeroes(BlockDriverState *bs,
> > +    int64_t offset, int64_t bytes, BdrvRequestFlags flags)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    BlkioCoData data = {
> > +        .coroutine = qemu_coroutine_self(),
> > +    };
> > +    uint32_t blkio_flags = 0;
> > +
> > +    if (flags & BDRV_REQ_FUA) {
> > +        blkio_flags |= BLKIO_REQ_FUA;
> > +    }
> > +    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> > +        blkio_flags |= BLKIO_REQ_NO_UNMAP;
> > +    }
> > +    if (flags & BDRV_REQ_NO_FALLBACK) {
> > +        blkio_flags |= BLKIO_REQ_NO_FALLBACK;
> > +    }
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        BlkioAIOCB *acb =
> > +            blkio_aiocb_get(bs, blkio_co_pwrite_zeroes_complete, &data);
> > +        blkioq_write_zeroes(s->blkioq, offset, bytes, acb, blkio_flags);
> > +        blkio_submit_io(bs);
> > +    }
> > +
> > +    qemu_coroutine_yield();
> > +    return data.ret;
> > +}
> > +
> > +static void blkio_io_unplug(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        blkio_submit_io(bs);
> > +    }
> > +}
> > +
> > +static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
> > +                                          Error **errp)
> > +{
> > +    bdrv_parse_filename_strip_prefix(filename, "io_uring:", options);
> > +}
> > +
> > +static void blkio_parse_filename_virtio_blk_vhost_vdpa(
> > +        const char *filename,
> > +        QDict *options,
> > +        Error **errp)
> > +{
> > +    bdrv_parse_filename_strip_prefix(filename, "virtio-blk-vhost-vdpa:", options);
> > +}
> 
> Besides the fact that this doesn’t work for virtio-blk-vhost-vdpa (because
> it provides a @filename option, but that driver expects a @path option), is
> it really worth implementing these, or should we just expect users to use
> -blockdev (or -drive with blockdev-like options)?

Yes, I think you're right. .bdrv_parse_filename() is for legacy
BlockDrivers and we don't need it. I'll remove it.

> 
> > +
> > +static int blkio_io_uring_open(BlockDriverState *bs, QDict *options, int flags,
> > +                               Error **errp)
> > +{
> > +    const char *filename = qdict_get_try_str(options, "filename");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_set_str(s->blkio, "path", filename);
> 
> You don’t check that @filename is non-NULL, and I don’t think that libblkio
> would accept a NULL here.  Admittedly, I can’t produce a case where it would
> be NULL (because -blockdev checks the QAPI schema, and -drive expects a
> @filename parameter thanks to .bdrv_needs_filename), but I still think
> it isn’t ideal.

Due to .bdrv_needs_filename we always have a "filename" QDict entry.
I'll change qdict_get_try_str() to qdict_get_str() so it's clearer that
this is always non-NULL.
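
I.e. something like:

    /* .bdrv_needs_filename guarantees that "filename" is in the QDict */
    const char *filename = qdict_get_str(options, "filename");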

> 
> > +    qdict_del(options, "filename");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (flags & BDRV_O_NOCACHE) {
> > +        ret = blkio_set_bool(s->blkio, "direct", true);
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "failed to set direct: %s",
> > +                             blkio_get_error_msg());
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int blkio_virtio_blk_vhost_vdpa_open(BlockDriverState *bs,
> > +        QDict *options, int flags, Error **errp)
> > +{
> > +    const char *path = qdict_get_try_str(options, "path");
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_set_str(s->blkio, "path", path);
> 
> In contrast to the above, I can make @path NULL here, because
> .bdrv_needs_filename only ensures that there’s a @filename parameter, and
> so:
> 
> $ ./qemu-system-x86_64 -drive
> if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo
> [1]    49946 segmentation fault (core dumped)  ./qemu-system-x86_64 -drive

Thanks, I will add a check.
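
Roughly like this (a sketch; the exact error message may differ):

    const char *path = qdict_get_try_str(options, "path");

    if (!path) {
        error_setg(errp, "missing 'path' option");
        return -EINVAL;
    }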

> 
> > +    qdict_del(options, "path");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to set path: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (flags & BDRV_O_NOCACHE) {
> > +        error_setg(errp, "cache.direct=off is not supported");
> 
> The condition is the opposite of that, though, isn’t it?
> 
> I.e.:
> 
> $ ./qemu-system-x86_64 -drive if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo,path=foo,cache.direct=on
> 
> qemu-system-x86_64: -drive if=none,driver=virtio-blk-vhost-vdpa,id=node0,filename=foo,path=foo,cache.direct=on:
> cache.direct=off is not supported

Will fix, thanks!
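
Presumably it's just the inverted condition, i.e. something like:

    if (!(flags & BDRV_O_NOCACHE)) {
        error_setg(errp, "cache.direct=off is not supported");
        return -EINVAL;
    }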

> 
> > +        return -EINVAL;
> > +    }
> > +    return 0;
> > +}
> > +
> > +static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
> > +                           Error **errp)
> > +{
> > +    const char *blkio_driver = bs->drv->protocol_name;
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +
> > +    ret = blkio_create(blkio_driver, &s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_create failed: %s",
> > +                         blkio_get_error_msg());
> > +        return ret;
> > +    }
> > +
> > +    if (strcmp(blkio_driver, "io_uring") == 0) {
> > +        ret = blkio_io_uring_open(bs, options, flags, errp);
> > +    } else if (strcmp(blkio_driver, "virtio-blk-vhost-vdpa") == 0) {
> > +        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
> > +    }
> 
> First, I’d like to suggest using macros for the driver names (and use them
> here and below for format_name/protocol_name).

Good idea.

> Second, what do you think about adding an `else` branch with
> `g_assert_not_reached()` (or just abort)?

Good idea.
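
Putting both suggestions together, roughly (the macro names below are
placeholders, not final):

    #define DRIVER_IO_URING "io_uring"
    #define DRIVER_VIRTIO_BLK_VHOST_VDPA "virtio-blk-vhost-vdpa"

    if (strcmp(blkio_driver, DRIVER_IO_URING) == 0) {
        ret = blkio_io_uring_open(bs, options, flags, errp);
    } else if (strcmp(blkio_driver, DRIVER_VIRTIO_BLK_VHOST_VDPA) == 0) {
        ret = blkio_virtio_blk_vhost_vdpa_open(bs, options, flags, errp);
    } else {
        g_assert_not_reached();
    }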

> 
> > +    if (ret < 0) {
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    if (!(flags & BDRV_O_RDWR)) {
> > +        ret = blkio_set_bool(s->blkio, "readonly", true);
> 
> The libblkio doc says it’s “read-only”, and when I try to set this option, I
> get an error:
> 
> $ ./qemu-system-x86_64 -blockdev
> io_uring,node-name=node0,filename=/dev/null,read-only=on
> qemu-system-x86_64: -blockdev
> io_uring,node-name=node0,filename=/dev/null,read-only=on: failed to set
> readonly: Unknown property name: No such file or directory

Thanks, this property was renamed in libblkio commit 3b6771d1b049
("Rename property "readonly" to "read-only"") and this patch is
outdated.

Will fix.
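
I.e.:

    ret = blkio_set_bool(s->blkio, "read-only", true);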

> 
> > +        if (ret < 0) {
> > +            error_setg_errno(errp, -ret, "failed to set readonly: %s",
> > +                             blkio_get_error_msg());
> > +            blkio_destroy(&s->blkio);
> > +            return ret;
> > +        }
> > +    }
> > +
> > +    ret = blkio_connect(s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_connect failed: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_get_bool(s->blkio,
> > +                         "needs-mem-regions",
> > +                         &s->needs_mem_regions);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get needs-mem-regions: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_get_uint64(s->blkio,
> > +                           "mem-region-alignment",
> > +                           &s->mem_region_alignment);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get mem-region-alignment: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    ret = blkio_start(s->blkio);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "blkio_start failed: %s",
> > +                         blkio_get_error_msg());
> > +        blkio_destroy(&s->blkio);
> > +        return ret;
> > +    }
> > +
> > +    bs->supported_write_flags = BDRV_REQ_FUA;
> > +    bs->supported_zero_flags = BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP |
> > +                               BDRV_REQ_NO_FALLBACK;
> > +
> > +    qemu_mutex_init(&s->lock);
> > +    s->blkioq = blkio_get_queue(s->blkio, 0);
> > +    s->completion_fd = blkioq_get_completion_fd(s->blkioq);
> > +
> > +    blkio_attach_aio_context(bs, bdrv_get_aio_context(bs));
> > +    return 0;
> > +}
> > +
> > +static void blkio_close(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +
> > +    qemu_mutex_destroy(&s->lock);
> > +    blkio_destroy(&s->blkio);
> 
> Should we call blkio_detach_aio_context() here?

Good catch. I thought that would be called automatically, but I don't
see a .bdrv_detach_aio_context() call in block.c:bdrv_close().
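
So blkio_close() should detach explicitly, something like:

    static void blkio_close(BlockDriverState *bs)
    {
        BDRVBlkioState *s = bs->opaque;

        blkio_detach_aio_context(bs);
        qemu_mutex_destroy(&s->lock);
        blkio_destroy(&s->blkio);
    }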

> 
> > +}
> > +
> > +static int64_t blkio_getlength(BlockDriverState *bs)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    uint64_t capacity;
> > +    int ret;
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        ret = blkio_get_uint64(s->blkio, "capacity", &capacity);
> > +    }
> > +    if (ret < 0) {
> > +        return -ret;
> > +    }
> > +
> > +    return capacity;
> > +}
> > +
> > +static int blkio_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> > +{
> > +    return 0;
> > +}
> > +
> > +static void blkio_refresh_limits(BlockDriverState *bs, Error **errp)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int value;
> > +    int ret;
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "request-alignment",
> > +                        (int *)&bs->bl.request_alignment);
> 
> I find this pointer cast and the ones below quite questionable. Admittedly,
> I can’t think of a reasonably common system (nowadays) where this would
> actually cause problems, but I’d prefer just reading all ints into `value`
> and then assigning the respective limit from it.

Okay, let's do that. It's safer.
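
I.e. something along these lines for each property:

    int value;

    ret = blkio_get_int(s->blkio, "request-alignment", &value);
    if (ret < 0) {
        error_setg_errno(errp, -ret,
                         "failed to get \"request-alignment\": %s",
                         blkio_get_error_msg());
        return;
    }
    bs->bl.request_alignment = value;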

> 
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"request-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (bs->bl.request_alignment < 1 ||
> > +        bs->bl.request_alignment >= INT_MAX ||
> > +        !is_power_of_2(bs->bl.request_alignment)) {
> > +        error_setg(errp, "invalid \"request-alignment\" value %d, must be "
> > +                   "power of 2 less than INT_MAX", bs->bl.request_alignment);
> 
> Minor (because auto-checked by the compiler anyway), but I’d prefer `%"
> PRIu32 "` instead of `%d` (same for other limits below).

Okay.
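
E.g.:

    error_setg(errp, "invalid \"request-alignment\" value %" PRIu32 ", "
               "must be a power of 2 less than INT_MAX",
               bs->bl.request_alignment);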

> 
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "optimal-io-size",
> > +                        (int *)&bs->bl.opt_transfer);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (bs->bl.opt_transfer > INT_MAX ||
> > +        (bs->bl.opt_transfer % bs->bl.request_alignment)) {
> > +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be a "
> > +                   "multiple of %d", bs->bl.opt_transfer,
> > +                   bs->bl.request_alignment);
> 
> Both error messages call it buf-alignment, but here we’re actually querying
> optimal-io-size.
> 
> Second, is it really fatal if we fail to query it?  It was my impression
> that this is optional anyway, so why don’t we just ignore `ret < 0` and make
> it zero then?

The property always exists and blkio_get_int() should never fail.

> 
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio,
> > +                        "max-transfer",
> > +                        (int *)&bs->bl.max_transfer);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"max-transfer\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if ((bs->bl.max_transfer % bs->bl.request_alignment) ||
> > +        (bs->bl.opt_transfer && (bs->bl.max_transfer % bs->bl.opt_transfer))) {
> > +        error_setg(errp, "invalid \"max-transfer\" value %d, must be a "
> > +                   "multiple of %d and %d (if non-zero)",
> > +                   bs->bl.max_transfer, bs->bl.request_alignment,
> > +                   bs->bl.opt_transfer);
> > +        return;
> > +    }
> > +
> > +    ret = blkio_get_int(s->blkio, "buf-alignment", &value);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"buf-alignment\" value %d, must be "
> > +                   "positive", value);
> > +        return;
> > +    }
> > +    bs->bl.min_mem_alignment = value;
> > +
> > +    ret = blkio_get_int(s->blkio, "optimal-buf-alignment", &value);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "failed to get \"optimal-buf-alignment\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"optimal-buf-alignment\" value %d, "
> > +                   "must be positive", value);
> > +        return;
> > +    }
> > +    bs->bl.opt_mem_alignment = value;
> > +
> > +    ret = blkio_get_int(s->blkio, "max-segments", &bs->bl.max_iov);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret, "failed to get \"max-segments\": %s",
> > +                         blkio_get_error_msg());
> > +        return;
> > +    }
> > +    if (value < 1) {
> > +        error_setg(errp, "invalid \"max-segments\" value %d, must be positive",
> > +                   bs->bl.max_iov);
> > +        return;
> > +    }
> > +}
> > +
> > +/*
> > + * TODO
> > + * Missing libblkio APIs:
> > + * - write zeroes
> > + * - discard
> 
> But you’ve added functionality for both here, haven’t you?

Yes, will fix!

> 
> > + * - block_status
> > + * - co_invalidate_cache
> > + *
> > + * Out of scope?
> > + * - create
> > + * - truncate
> 
> I don’t know why truncate would be out of scope, we even have truncate
> support for block devices so that users can signal size changes to qemu.
> 
> I can see that it isn’t important right now, but I don’t think that makes it
> out of scope.
> 
> (Creation seems out of scope, because you can just create regular files via
> the “file” driver.)

You're right, we need to do something for truncate. I have filed an
issue with libblkio, which currently does not support device capacity
changes:
https://gitlab.com/libblkio/libblkio/-/issues/39

Stefan


* Re: [RFC v3 1/8] blkio: add io_uring block driver using libblkio
  2022-07-27 19:33   ` Kevin Wolf
  2022-08-03 12:25     ` Peter Krempa
@ 2022-08-11 19:09     ` Stefan Hajnoczi
  1 sibling, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-11 19:09 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Markus Armbruster, Hanna Reitz,
	Fam Zheng, Yanan Wang, pkrempa

On Wed, Jul 27, 2022 at 09:33:40PM +0200, Kevin Wolf wrote:
> Am 08.07.2022 um 06:17 hat Stefan Hajnoczi geschrieben:
> > libblkio (https://gitlab.com/libblkio/libblkio/) is a library for
> > high-performance disk I/O. It currently supports io_uring and
> > virtio-blk-vhost-vdpa with additional drivers under development.
> > 
> > One of the reasons for developing libblkio is that other applications
> > besides QEMU can use it. This will be particularly useful for
> > vhost-user-blk which applications may wish to use for connecting to
> > qemu-storage-daemon.
> > 
> > libblkio also gives us an opportunity to develop in Rust behind a C API
> > that is easy to consume from QEMU.
> > 
> > This commit adds io_uring and virtio-blk-vhost-vdpa BlockDrivers to QEMU
> > using libblkio. It will be easy to add other libblkio drivers since they
> > will share the majority of code.
> > 
> > For now I/O buffers are copied through bounce buffers if the libblkio
> > driver requires it. Later commits add an optimization for
> > pre-registering guest RAM to avoid bounce buffers.
> > 
> > The syntax is:
> > 
> >   --blockdev io_uring,node-name=drive0,filename=test.img,readonly=on|off,cache.direct=on|off
> > 
> > and:
> > 
> >   --blockdev virtio-blk-vhost-vdpa,node-name=drive0,path=/dev/vdpa...,readonly=on|off
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> 
> The subject line implies only io_uring, but you actually add vhost-vdpa
> support, too. I think the subject line should be changed.
> 
> I think it would also make sense to already implement support for
> vhost-user-blk on the QEMU side even if support isn't compiled into
> libblkio by default and opening vhost-user-blk images would therefore
> always fail with a default build.
> 
> But then you could run QEMU with a custom build of libblkio to make use
> of it without patching QEMU. This is probably useful for getting libvirt
> support for using a storage daemon implemented without having to wait
> for another QEMU release. (Peter, do you have any opinion on this?)

vhost-user-blk is now supported in all builds of libblkio. I'll add it.
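
The syntax will presumably mirror the vdpa one, roughly (driver and property
names still subject to the final libblkio API; the socket path is just an
example):

  --blockdev virtio-blk-vhost-user,node-name=drive0,path=/tmp/vhost-user-blk.sock,readonly=on|off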

Stefan


* Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-07-12 14:28   ` Stefano Garzarella
@ 2022-08-15 20:52     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-15 20:52 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster,
	Hanna Reitz, Fam Zheng, Yanan Wang

On Tue, Jul 12, 2022 at 04:28:02PM +0200, Stefano Garzarella wrote:
> On Fri, Jul 08, 2022 at 05:17:36AM +0100, Stefan Hajnoczi wrote:
> > Avoid bounce buffers when QEMUIOVector elements are within previously
> > registered bdrv_register_buf() buffers.
> > 
> > The idea is that emulated storage controllers will register guest RAM
> > using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
> > requests. Therefore no blkio_map_mem_region() calls are necessary in the
> > performance-critical I/O code path.
> > 
> > This optimization doesn't apply if the I/O buffer is internally
> > allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
> > path because BDRV_REQ_REGISTERED_BUF is not set.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> > block/blkio.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 101 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blkio.c b/block/blkio.c
> > index 7fbdbd7fae..37d593a20c 100644
> > --- a/block/blkio.c
> > +++ b/block/blkio.c
> > @@ -1,7 +1,9 @@
> > #include "qemu/osdep.h"
> > #include <blkio.h>
> > #include "block/block_int.h"
> > +#include "exec/memory.h"
> > #include "qapi/error.h"
> > +#include "qemu/error-report.h"
> > #include "qapi/qmp/qdict.h"
> > #include "qemu/module.h"
> > 
> > @@ -28,6 +30,9 @@ typedef struct {
> > 
> >     /* Can we skip adding/deleting blkio_mem_regions? */
> >     bool needs_mem_regions;
> > +
> > +    /* Are file descriptors necessary for blkio_mem_regions? */
> > +    bool needs_mem_region_fd;
> > } BDRVBlkioState;
> > 
> > static void blkio_aiocb_complete(BlkioAIOCB *acb, int ret)
> > @@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> >         BlockCompletionFunc *cb, void *opaque)
> > {
> >     BDRVBlkioState *s = bs->opaque;
> > +    bool needs_mem_regions =
> > +        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
> >     struct iovec *iov = qiov->iov;
> >     int iovcnt = qiov->niov;
> >     BlkioAIOCB *acb;
> > @@ -206,7 +213,7 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> > 
> >     acb = blkio_aiocb_get(bs, cb, opaque);
> > 
> > -    if (s->needs_mem_regions) {
> > +    if (needs_mem_regions) {
> >         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> >             qemu_aio_unref(&acb->common);
> >             return NULL;
> > @@ -230,6 +237,8 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> > {
> >     uint32_t blkio_flags = (flags & BDRV_REQ_FUA) ? BLKIO_REQ_FUA : 0;
> >     BDRVBlkioState *s = bs->opaque;
> > +    bool needs_mem_regions =
> > +        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
> >     struct iovec *iov = qiov->iov;
> >     int iovcnt = qiov->niov;
> >     BlkioAIOCB *acb;
> > @@ -238,7 +247,7 @@ static BlockAIOCB *blkio_aio_pwritev(BlockDriverState *bs, int64_t offset,
> > 
> >     acb = blkio_aiocb_get(bs, cb, opaque);
> > 
> > -    if (s->needs_mem_regions) {
> > +    if (needs_mem_regions) {
> >         if (blkio_aiocb_init_mem_region_locked(acb, bytes) < 0) {
> >             qemu_aio_unref(&acb->common);
> >             return NULL;
> > @@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
> >     }
> > }
> > 
> > +static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +    struct blkio_mem_region region = (struct blkio_mem_region){
> > +        .addr = host,
> > +        .len = size,
> > +        .fd = -1,
> > +    };
> > +
> > +    if (((uintptr_t)host | size) % s->mem_region_alignment) {
> > +        error_report_once("%s: skipping unaligned buf %p with size %zu",
> > +                          __func__, host, size);
> > +        return; /* skip unaligned */
> > +    }
> > +
> > +    /* Attempt to find the fd for a MemoryRegion */
> > +    if (s->needs_mem_region_fd) {
> > +        int fd = -1;
> > +        ram_addr_t offset;
> > +        MemoryRegion *mr;
> > +
> > +        /*
> > +         * bdrv_register_buf() is called with the BQL held so mr lives at least
> > +         * until this function returns.
> > +         */
> > +        mr = memory_region_from_host(host, &offset);
> > +        if (mr) {
> > +            fd = memory_region_get_fd(mr);
> 
> If s->needs_mem_region_fd is true, memory_region_get_fd() crashes I think
> because mr->ram_block is not yet set, indeed from the stack trace
> blkio_register_buf() is called inside qemu_ram_alloc_resizeable(), and its
> result is used to set mr->ram_block in memory_region_init_resizeable_ram():
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x000056235bf1f7a3 in memory_region_get_fd (mr=<optimized out>) at ../softmmu/memory.c:2309
> #1  0x000056235c07e54d in blkio_register_buf (bs=<optimized out>, host=0x7f824e200000, size=2097152)
>     at ../block/blkio.c:364
> #2  0x000056235c0246c6 in bdrv_register_buf (bs=0x56235d606b40, host=0x7f824e200000, size=2097152)
>     at ../block/io.c:3362
> #3  0x000056235bea44e6 in ram_block_notify_add (host=0x7f824e200000, size=131072, max_size=2097152)
>     at ../hw/core/numa.c:863
> #4  0x000056235bf22c00 in ram_block_add (new_block=<optimized out>, errp=<optimized out>)
>     at ../softmmu/physmem.c:2057
> #5  0x000056235bf232e4 in qemu_ram_alloc_internal (size=size@entry=131072,
>     max_size=max_size@entry=2097152,
>     resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>,
>     host=host@entry=0x0, ram_flags=ram_flags@entry=4,
>     mr=mr@entry=0x56235dc3fe00, errp=0x7ffcb21f1be0)
>     at ../softmmu/physmem.c:2180
> #6  0x000056235bf26426 in qemu_ram_alloc_resizeable (size=size@entry=131072,
>     maxsz=maxsz@entry=2097152,
>     resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>,
>     mr=mr@entry=0x56235dc3fe00, errp=errp@entry=0x7ffcb21f1be0)
>     at ../softmmu/physmem.c:2209
> #7  0x000056235bf1cc99 in memory_region_init_resizeable_ram (mr=0x56235dc3fe00,
>     owner=owner@entry=0x56235d93ffc0,
>     name=name@entry=0x7ffcb21f1ca0 "/rom@etc/acpi/tables",
>     size=131072, max_size=2097152,
>     resized=resized@entry=0x56235bc0f920 <fw_cfg_resized>,
>     errp=0x56235c996490 <error_fatal>) at ../softmmu/memory.c:1586
> #8  0x000056235bc0f99c in rom_set_mr (rom=rom@entry=0x56235ddd0200,
>     owner=0x56235d93ffc0,
>     name=name@entry=0x7ffcb21f1ca0 "/rom@etc/acpi/tables",
>     ro=ro@entry=true) at ../hw/core/loader.c:961
> #9  0x000056235bc12a65 in rom_add_blob (
>     name=name@entry=0x56235c1a2a09 "etc/acpi/tables",
>     blob=0x56235df4f4b0, len=<optimized out>, max_len=max_len@entry=2097152,
>     addr=addr@entry=18446744073709551615,
>     fw_file_name=fw_file_name@entry=0x56235c1a2a09 "etc/acpi/tables",
>     fw_callback=0x56235be47f90 <acpi_build_update>,
>     callback_opaque=0x56235d817830, as=0x0,
>     read_only=true) at ../hw/core/loader.c:1102
> #10 0x000056235bbe0990 in acpi_add_rom_blob (
>     update=update@entry=0x56235be47f90 <acpi_build_update>,
>     opaque=opaque@entry=0x56235d817830, blob=0x56235d3ab750,
>     name=name@entry=0x56235c1a2a09 "etc/acpi/tables") at ../hw/acpi/utils.c:46
> #11 0x000056235be481e6 in acpi_setup () at ../hw/i386/acpi-build.c:2805
> #12 0x000056235be3e209 in pc_machine_done (notifier=0x56235d5efce8, data=<optimized out>)
>     at ../hw/i386/pc.c:758
> #13 0x000056235c12e4a7 in notifier_list_notify (
>     list=list@entry=0x56235c963790 <machine_init_done_notifiers>, data=data@entry=0x0)
>     at ../util/notify.c:39

Hi Stefano,
I have fixed this by using RAMBlock instead of MemoryRegion. The next
revision will call qemu_ram_block_from_host() to fetch the RAMBlock's
fd.
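
Roughly (untested sketch; the final code may differ):

    if (s->needs_mem_region_fd) {
        ram_addr_t offset;
        RAMBlock *ram_block = qemu_ram_block_from_host(host, false, &offset);
        int fd = ram_block ? qemu_ram_get_fd(ram_block) : -1;

        if (fd == -1) {
            error_report_once("%s: skipping fd-less buf %p with size %zu",
                              __func__, host, size);
            return; /* skip if there is no fd */
        }

        region.fd = fd;
        region.fd_offset = offset;
    }

qemu_ram_block_from_host() does not touch mr->ram_block, so it avoids the
crash in the backtrace above.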

Stefan


* Re: [RFC v3 8/8] virtio-blk: use BDRV_REQ_REGISTERED_BUF optimization hint
  2022-07-14 10:16   ` Hanna Reitz
@ 2022-08-15 21:24     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-15 21:24 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On Thu, Jul 14, 2022 at 12:16:16PM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > Register guest RAM using BlockRAMRegistrar and set the
> > BDRV_REQ_REGISTERED_BUF flag so block drivers can optimize memory
> > accesses in I/O requests.
> > 
> > This is for vdpa-blk, vhost-user-blk, and other I/O interfaces that rely
> > on DMA mapping/unmapping.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   include/hw/virtio/virtio-blk.h |  2 ++
> >   hw/block/virtio-blk.c          | 13 +++++++++----
> >   2 files changed, 11 insertions(+), 4 deletions(-)
> 
> Seems fair, but as said on patch 5, I’m quite wary of “register guest RAM”. 
> How can we guarantee that it won’t be too fragmented to be registerable with
> either nvme.c or blkio.c?

We can't guarantee it. blkio instances have a maximum number of mappings
and we might exceed it. This patch doesn't have a smart solution.

Smart solutions are possible, but I haven't had time to work on one yet.
It is necessary to keep track of which mappings are referenced by
in-flight requests. When the maximum number of mappings is hit, a
mapping that currently has no references can be evicted to make space.
When every mapping is referenced by an in-flight request, new requests may
have to wait.
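
For the record, the eviction half of such a scheme could look roughly like
this (purely illustrative; every name below is invented and none of it
exists in QEMU or libblkio today):

    /* Illustrative only */
    typedef struct Mapping {
        void *host;
        size_t size;
        unsigned refcnt;        /* in-flight requests using this mapping */
        struct Mapping *next;
    } Mapping;

    /* Drop one idle mapping to make room; false means the caller must wait */
    static bool evict_idle_mapping(Mapping **head)
    {
        for (Mapping **p = head; *p; p = &(*p)->next) {
            if ((*p)->refcnt == 0) {
                Mapping *victim = *p;
                *p = victim->next;
                /* blkio_unmap_mem_region() would be called here */
                g_free(victim);
                return true;
            }
        }
        return false; /* every mapping is in use by an in-flight request */
    }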

Until we hit the maximum number of mappings in the real world, this doesn't
matter.

Stefan


* Re: [RFC v3 4/8] block: add BDRV_REQ_REGISTERED_BUF request flag
  2022-07-14  8:54   ` Hanna Reitz
@ 2022-08-17 20:46     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-17 20:46 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On Thu, Jul 14, 2022 at 10:54:12AM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > Block drivers may optimize I/O requests accessing buffers previously
> > registered with bdrv_register_buf(). Checking whether all elements of a
> > request's QEMUIOVector are within previously registered buffers is
> > expensive, so we need a hint from the user to avoid costly checks.
> > 
> > Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
> > QEMUIOVector elements in an I/O request are known to be within
> > previously registered buffers.
> > 
> > bdrv_aligned_preadv() is strict in validating supported read flags and
> > its assertions fail when it sees BDRV_REQ_REGISTERED_BUF. There is no
> > harm in passing BDRV_REQ_REGISTERED_BUF to block drivers that do not
> > support it, so update the assertions to ignore BDRV_REQ_REGISTERED_BUF.
> > 
> > Care must be taken to clear the flag when the block layer or filter
> > drivers replace QEMUIOVector elements with bounce buffers since these
> > have not been registered with bdrv_register_buf(). A lot of the changes
> > in this commit deal with clearing the flag in those cases.
> > 
> > Ensuring that the flag is cleared properly is somewhat invasive to
> > implement across the block layer and it's hard to spot when future code
> > changes accidentally break it. Another option might be to add a flag to
> > QEMUIOVector itself and clear it in qemu_iovec_*() functions that modify
> > elements. That is more robust but somewhat of a layering violation, so I
> > haven't attempted that.
> 
> Yeah...  I will say that most read code already looks quite reasonable in
> that it’ll pass @flags to lower layers basically only if it’s an unmodified
> request, so it seems like in the past most people have already adhered to
> “don’t pass on any flags if you’re reading to a local bounce buffer”.
> 
> > Signed-off-by: Stefan Hajnoczi<stefanha@redhat.com>
> > ---
> >   include/block/block-common.h |  9 +++++++++
> >   block/blkverify.c            |  4 ++--
> >   block/crypto.c               |  2 ++
> >   block/io.c                   | 30 +++++++++++++++++++++++-------
> >   block/mirror.c               |  2 ++
> >   block/raw-format.c           |  2 ++
> >   6 files changed, 40 insertions(+), 9 deletions(-)
> 
> Some things not covered here that look a bit wrong:
> 
> While bdrv_driver_preadv() asserts that the flags don’t contain anything the
> driver couldn’t handle (and this new flag is made exempt from that assertion
> here in this patch), bdrv_driver_pwritev() just hides those flags from
> drivers silently. I think just like we exempt the new flag from the
> assertion in bdrv_driver_preadv(), we should have bdrv_driver_pwritev()
> always pass it to drivers.
> 
> The following driver read/write functions assert that @flags is 0, which is
> probably no longer ideal:
> - bdrv_qed_co_writev()
> - block_crypto_co_preadv()
> - nbd_client_co_preadv()
> - parallels_co_writev()
> - qcow_co_preadv()
> - qcow_co_pwritev()
> - qemu_gluster_co_writev()
> - raw_co_pwritev() (block/file-posix.c)
> - replication_co_writev()
> - ssh_co_writev()
> - vhdx_co_writev()
> 
> snapshot_access_co_preadv_part() returns an error when any flags are set,
> but should probably ignore BDRV_REQ_REGISTERED_BUF for this check.

The assert(!flags) checks can be removed without losing much safety
since bdrv_driver_preadv/pwritev() prepare the flags bits appropriately
and calls from other locations are rare.

> 
> 
> While looking around, I spotted a couple of places that look like they could
> pass the flag on but currently don’t (just FYI, not asking for anything
> here):
> 
> bdrv_co_do_copy_on_readv() never passes the flags through to its calls, but
> I think it could pass this flag on in the one bdrv_driver_preadv() call
> where it doesn’t use a bounce buffer (“Read directly into the destination”).
> 
> qcow2’s qcow2_co_preadv_task() and qcow2_co_pwritev_task() (besides the
> encryption part) also look like they should pass this flag on, but, well,
> the functions themselves currently don’t get the flag, so they can’t.
> 
> qcow1’s qcow_co_preadv() and qcow_co_pwritev() are so-so, sometimes using a
> bounce buffer, and sometimes not, but those function could use optimization
> in general if anyone cared.
> 
> vpc_co_preadv()’s and vpc_co_pwritev()’s first
> bdrv_co_preadv()/bdrv_co_pwritev() invocations look straightforward, but as
> with qcow1, not sure if anyone cares.
> 
> I’m too lazy to thoroughly check what’s going on with qed_aio_write_main(). 
> Passing 0 is safe, and it doesn’t get the original request flags, so I guess
> doing anything about this would be difficult.
> 
> quorum’s read_fifo_child() probably could pass acb->flags. Probably. 
> Perhaps not.  Difficult to say it is.
> 
> block/replication.c also looks like a candidate for passing flags, but
> personally, I’d like to refrain from touching it.  (Well, besides the fact
> that replication_co_writev() asserts that @flags is 0.)
> 
> 
> (And finally, I found that block/parallels.c invokes bdrv_co_pwritev() with
> a buffer instead of an I/O vector, which looks really wrong, but has nothing
> to do with this patch.)

Thanks for looking at these. I haven't attempted to propagate
BDRV_REQ_REGISTERED_BUF through image format drivers yet so there are
optimization opportunities here.

> [...]
> 
> > diff --git a/block/io.c b/block/io.c
> > index e7f4117fe7..83b8259227 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> 
> [...]
> 
> > @@ -1902,6 +1910,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
> >           return -ENOTSUP;
> >       }
> > +    /* By definition there is no user buffer so this flag doesn't make sense */
> > +    if (flags & BDRV_REQ_REGISTERED_BUF) {
> > +        return -EINVAL;
> > +    }
> > +
> 
> Here we return an error when the flag is met...
> 
> >       /* Invalidate the cached block-status data range if this write overlaps */
> >       bdrv_bsc_invalidate_range(bs, offset, bytes);
> > @@ -2187,6 +2200,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
> >       bool padding;
> >       BdrvRequestPadding pad;
> > +    /* This flag doesn't make sense for padding or zero writes */
> > +    flags &= ~BDRV_REQ_REGISTERED_BUF;
> > +
> 
> ...and here we just ignore it.  Why don’t we handle this the same in both of
> these functions?  (And what about bdrv_co_pwrite_zeroes()?)
> 
> Besides that, if we do make it an error, I wonder if it shouldn’t be an
> assertion instead so the duty of clearing the flag falls on the caller.  (I
> personally like just silently clearing it in the zero-write functions,
> though.)

Thanks for catching this. Let's consistently clear
BDRV_REQ_REGISTERED_BUF silently for zero writes.
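
Concretely, the -EINVAL branch in bdrv_co_do_pwrite_zeroes() becomes the
same silent clearing that bdrv_co_do_zero_pwritev() already does:

    /* By definition there is no user buffer so this flag doesn't make sense */
    flags &= ~BDRV_REQ_REGISTERED_BUF;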

Stefan


* Re: [RFC v3 5/8] block: add BlockRAMRegistrar
  2022-07-14  9:30   ` Hanna Reitz
@ 2022-08-17 20:51     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-17 20:51 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On Thu, Jul 14, 2022 at 11:30:11AM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > Emulated devices and other BlockBackend users wishing to take advantage
> > of blk_register_buf() all have the same repetitive job: register
> > RAMBlocks with the BlockBackend using RAMBlockNotifier.
> > 
> > Add a BlockRAMRegistrar API to do this. A later commit will use this
> > from hw/block/virtio-blk.c.
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   MAINTAINERS                          |  1 +
> >   include/sysemu/block-ram-registrar.h | 30 +++++++++++++++++++++
> >   block/block-ram-registrar.c          | 39 ++++++++++++++++++++++++++++
> >   block/meson.build                    |  1 +
> >   4 files changed, 71 insertions(+)
> >   create mode 100644 include/sysemu/block-ram-registrar.h
> >   create mode 100644 block/block-ram-registrar.c
> 
> What memory is handled in ram_list?  Is it everything?  If so, won’t devices
> have trouble registering all those buffers, especially if they happen to be
> fragmented in physical memory? (nvme_register_buf() seems to say it can run
> out of slots quite easily.)

I replied to this in another sub-thread. You are right, there is a
possibility of running out of mappings and there's no smart resource
management at the moment.

> 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 50f340d9ee..d16189449f 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -2490,6 +2490,7 @@ F: block*
> >   F: block/
> >   F: hw/block/
> >   F: include/block/
> > +F: include/sysemu/block-*.h
> >   F: qemu-img*
> >   F: docs/tools/qemu-img.rst
> >   F: qemu-io*
> 
> Sneaky. ;)
> 
> > diff --git a/include/sysemu/block-ram-registrar.h b/include/sysemu/block-ram-registrar.h
> > new file mode 100644
> > index 0000000000..09d63f64b2
> > --- /dev/null
> > +++ b/include/sysemu/block-ram-registrar.h
> > @@ -0,0 +1,30 @@
> > +/*
> > + * BlockBackend RAM Registrar
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#ifndef BLOCK_RAM_REGISTRAR_H
> > +#define BLOCK_RAM_REGISTRAR_H
> > +
> > +#include "exec/ramlist.h"
> > +
> > +/**
> > + * struct BlockRAMRegistrar:
> > + *
> > + * Keeps RAMBlock memory registered with a BlockBackend using
> > + * blk_register_buf() including hotplugged memory.
> > + *
> > + * Emulated devices or other BlockBackend users initialize a BlockRAMRegistrar
> > + * with blk_ram_registrar_init() before submitting I/O requests with the
> > + * BLK_REQ_REGISTERED_BUF flag set.
> 
> s/BLK/BDRV/, right?

Thanks, fixed!

Stefan


* Re: [RFC v3 7/8] blkio: implement BDRV_REQ_REGISTERED_BUF optimization
  2022-07-14 10:13   ` Hanna Reitz
@ 2022-08-18 19:46     ` Stefan Hajnoczi
  0 siblings, 0 replies; 29+ messages in thread
From: Stefan Hajnoczi @ 2022-08-18 19:46 UTC (permalink / raw)
  To: Hanna Reitz
  Cc: qemu-devel, Alberto Faria, Vladimir Sementsov-Ogievskiy,
	Michael S. Tsirkin, Paolo Bonzini, Laurent Vivier, Eric Blake,
	sgarzare, Marcel Apfelbaum, Philippe Mathieu-Daudé,
	qemu-block, Eduardo Habkost, Vladimir Sementsov-Ogievskiy,
	John Snow, Thomas Huth, Kevin Wolf, Markus Armbruster, Fam Zheng,
	Yanan Wang

On Thu, Jul 14, 2022 at 12:13:53PM +0200, Hanna Reitz wrote:
> On 08.07.22 06:17, Stefan Hajnoczi wrote:
> > Avoid bounce buffers when QEMUIOVector elements are within previously
> > registered bdrv_register_buf() buffers.
> > 
> > The idea is that emulated storage controllers will register guest RAM
> > using bdrv_register_buf() and set the BDRV_REQ_REGISTERED_BUF on I/O
> > requests. Therefore no blkio_map_mem_region() calls are necessary in the
> > performance-critical I/O code path.
> > 
> > This optimization doesn't apply if the I/O buffer is internally
> > allocated by QEMU (e.g. qcow2 metadata). There we still take the slow
> > path because BDRV_REQ_REGISTERED_BUF is not set.
> 
> Which keeps the question relevant of how slow the slow path is, i.e. whether
> it wouldn’t make sense to keep some of the mem regions allocated there in a
> cache instead of allocating/freeing them on every I/O request.

Yes, bounce buffer reuse would be possible, but let's keep it simple for
now.

> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >   block/blkio.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >   1 file changed, 101 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blkio.c b/block/blkio.c
> > index 7fbdbd7fae..37d593a20c 100644
> > --- a/block/blkio.c
> > +++ b/block/blkio.c
> 
> [...]
> 
> > @@ -198,6 +203,8 @@ static BlockAIOCB *blkio_aio_preadv(BlockDriverState *bs, int64_t offset,
> >           BlockCompletionFunc *cb, void *opaque)
> >   {
> >       BDRVBlkioState *s = bs->opaque;
> > +    bool needs_mem_regions =
> > +        s->needs_mem_regions && !(flags & BDRV_REQ_REGISTERED_BUF);
> 
> Is that condition sufficient?  bdrv_register_buf() has no way of returning
> an error, so it’s possible that buffers are silently not registered.  (And
> there are conditions in blkio_register_buf() where the buffer will not be
> registered, e.g. because it isn’t aligned.)
> 
> The caller knows nothing of this and will still pass
> BDRV_REQ_REGISTERED_BUF, and then we’ll assume the region is mapped but it
> won’t be.
> 
> >       struct iovec *iov = qiov->iov;
> >       int iovcnt = qiov->niov;
> >       BlkioAIOCB *acb;
> 
> [...]
> 
> > @@ -324,6 +333,80 @@ static void blkio_io_unplug(BlockDriverState *bs)
> >       }
> >   }
> > +static void blkio_register_buf(BlockDriverState *bs, void *host, size_t size)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +    struct blkio_mem_region region = (struct blkio_mem_region){
> > +        .addr = host,
> > +        .len = size,
> > +        .fd = -1,
> > +    };
> > +
> > +    if (((uintptr_t)host | size) % s->mem_region_alignment) {
> > +        error_report_once("%s: skipping unaligned buf %p with size %zu",
> > +                          __func__, host, size);
> > +        return; /* skip unaligned */
> > +    }
> 
> How big is mem-region-alignment generally?  Is it like 4k or is it going to
> be a real issue?

Yes, it's usually the page size of the MMU/IOMMU. vhost-user and VFIO
have the same requirements so I don't think anything special is
necessary.

> (Also, we could probably register a truncated region.  I know, that’ll break
> the BDRV_REQ_REGISTERED_BUF idea because the caller won’t know we’ve
> truncated it, but that’s no different than just not registering the buffer
> at all.)
> 
> > +
> > +    /* Attempt to find the fd for a MemoryRegion */
> > +    if (s->needs_mem_region_fd) {
> > +        int fd = -1;
> > +        ram_addr_t offset;
> > +        MemoryRegion *mr;
> > +
> > +        /*
> > +         * bdrv_register_buf() is called with the BQL held so mr lives at least
> > +         * until this function returns.
> > +         */
> > +        mr = memory_region_from_host(host, &offset);
> > +        if (mr) {
> > +            fd = memory_region_get_fd(mr);
> > +        }
> 
> I don’t think it’s specified that buffers registered with
> bdrv_register_buf() must be within a single memory region, is it? So can we
> somehow verify that the memory region covers the whole buffer?

You are right, there is no guarantee. However, the range will always be
within a RAMBlock at the moment because the bdrv_register_buf() calls
are driven by a RAMBlock notifier and match the boundaries of the
RAMBlocks.

I will add a check so this starts failing when that assumption is
violated.
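
Something along these lines, perhaps (untested sketch; it assumes the
RAMBlock-based lookup this series is switching to):

    ram_addr_t offset;
    RAMBlock *ram_block = qemu_ram_block_from_host(host, false, &offset);

    if (!ram_block || offset + size > qemu_ram_get_used_length(ram_block)) {
        error_report_once("%s: buf %p with size %zu is not contained within "
                          "a single RAMBlock", __func__, host, size);
        return;
    }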

> 
> > +        if (fd == -1) {
> > +            error_report_once("%s: skipping fd-less buf %p with size %zu",
> > +                              __func__, host, size);
> > +            return; /* skip if there is no fd */
> > +        }
> > +
> > +        region.fd = fd;
> > +        region.fd_offset = offset;
> > +    }
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        ret = blkio_map_mem_region(s->blkio, &region);
> > +    }
> > +
> > +    if (ret < 0) {
> > +        error_report_once("Failed to add blkio mem region %p with size %zu: %s",
> > +                          host, size, blkio_get_error_msg());
> > +    }
> > +}
> > +
> > +static void blkio_unregister_buf(BlockDriverState *bs, void *host, size_t size)
> > +{
> > +    BDRVBlkioState *s = bs->opaque;
> > +    int ret;
> > +    struct blkio_mem_region region = (struct blkio_mem_region){
> > +        .addr = host,
> > +        .len = size,
> > +        .fd = -1,
> > +    };
> > +
> > +    if (((uintptr_t)host | size) % s->mem_region_alignment) {
> > +        return; /* skip unaligned */
> > +    }
> > +
> > +    WITH_QEMU_LOCK_GUARD(&s->lock) {
> > +        ret = blkio_unmap_mem_region(s->blkio, &region);
> > +    }
> 
> The documentation of libblkio says that “memory regions must be
> unmapped/freed with exactly the same `region` field values that they were
> mapped/allocated with.”  We don’t set .fd here, though.

That's a bug. The memory region will not be unmapped because libblkio's
HashSet won't match. I'll fix the QEMU code to pass the exact same
struct blkio_mem_region fields.
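
Concretely, blkio_unregister_buf() will repeat the same fd/fd_offset lookup
that blkio_register_buf() performs, something like (sketch, mirroring the
registration code quoted above):

    if (s->needs_mem_region_fd) {
        ram_addr_t offset;
        MemoryRegion *mr = memory_region_from_host(host, &offset);
        int fd = mr ? memory_region_get_fd(mr) : -1;

        if (fd == -1) {
            return; /* nothing was mapped at registration time */
        }

        region.fd = fd;
        region.fd_offset = offset;
    }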

> 
> It’s also unclear whether it’s allowed to unmap a region that wasn’t mapped,
> but I’ll trust libblkio to detect that.

Yes, it's a nop.

> 
> > +
> > +    if (ret < 0) {
> > +        error_report_once("Failed to delete blkio mem region %p with size %zu: %s",
> > +                          host, size, blkio_get_error_msg());
> > +    }
> > +}
> > +
> >   static void blkio_parse_filename_io_uring(const char *filename, QDict *options,
> >                                             Error **errp)
> >   {
> 
> [...]
> 
> > @@ -459,7 +553,7 @@ static int blkio_file_open(BlockDriverState *bs, QDict *options, int flags,
> >           return ret;
> >       }
> > -    bs->supported_write_flags = BDRV_REQ_FUA;
> > +    bs->supported_write_flags = BDRV_REQ_FUA | BDRV_REQ_REGISTERED_BUF;
> 
> Shouldn’t we also report it as a supported read flag then?

Yes, thank you!
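
I.e. something like (sketch; assuming the supported_read_flags field in
BlockDriverState covers this):

    bs->supported_write_flags = BDRV_REQ_FUA | BDRV_REQ_REGISTERED_BUF;
    bs->supported_read_flags = BDRV_REQ_REGISTERED_BUF;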

Stefan

