* [Qemu-devel] [PATCH v5 00/12] Add support for io_uring
@ 2019-06-10 13:48 Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 01/12] configure: permit use of io_uring Aarushi Mehta
                   ` (12 more replies)
  0 siblings, 13 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

This patch series adds support for the newly developed io_uring Linux
asynchronous I/O interface. io_uring is faster than the existing Linux AIO
code path, offers efficient buffered asynchronous I/O, allows I/O to be
performed without a system call via polled I/O, and brings other efficiency
enhancements.

Testing it requires a Linux 5.1+ host kernel and the liburing library.
Use the option -drive aio=io_uring to enable it.
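
For example, an illustrative invocation (disk.img is a placeholder image):

  qemu-system-x86_64 -drive file=disk.img,format=raw,aio=io_uring

Unlike aio=native, this does not require cache.direct=on; the qemu-iotests
change in this series exercises the non-O_DIRECT path.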

v5:
- Adds completion polling
- Extends qemu-io
- Adds qemu-iotest

v4:
- Add error handling
- Add trace events
- Remove aio submission based code

v3:
- Fix major errors in io_uring (sorry)
- Option now enumerates for CONFIG_LINUX_IO_URING
- pkg-config support added

Aarushi Mehta (12):
  configure: permit use of io_uring
  qapi/block-core: add option for io_uring
  block/block: add BDRV flag for io_uring
  block/io_uring: implements interfaces for io_uring
  stubs: add stubs for io_uring interface
  util/async: add aio interfaces for io_uring
  blockdev: accept io_uring as option
  block/file-posix.c: extend to use io_uring
  block: add trace events for io_uring
  block/io_uring: adds userspace completion polling
  qemu-io: adds support for io_uring
  qemu-iotests/087: checks for io_uring

 MAINTAINERS                |   8 +
 block/Makefile.objs        |   3 +
 block/file-posix.c         |  85 ++++++++--
 block/io_uring.c           | 339 +++++++++++++++++++++++++++++++++++++
 block/trace-events         |   8 +
 blockdev.c                 |   4 +-
 configure                  |  27 +++
 include/block/aio.h        |  16 +-
 include/block/block.h      |   1 +
 include/block/raw-aio.h    |  12 ++
 qapi/block-core.json       |   4 +-
 qemu-io.c                  |  13 ++
 stubs/Makefile.objs        |   1 +
 stubs/io_uring.c           |  32 ++++
 tests/qemu-iotests/087     |  26 +++
 tests/qemu-iotests/087.out |  10 ++
 util/async.c               |  36 ++++
 17 files changed, 606 insertions(+), 19 deletions(-)
 create mode 100644 block/io_uring.c
 create mode 100644 stubs/io_uring.c

-- 
2.17.1




* [Qemu-devel] [PATCH v5 01/12] configure: permit use of io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring Aarushi Mehta
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Maxim Levitsky <maximlevitsky@gmail.com>
---
 configure | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/configure b/configure
index b091b82cb3..7aa18d308d 100755
--- a/configure
+++ b/configure
@@ -365,6 +365,7 @@ xen=""
 xen_ctrl_version=""
 xen_pci_passthrough=""
 linux_aio=""
+linux_io_uring=""
 cap_ng=""
 attr=""
 libattr=""
@@ -1266,6 +1267,10 @@ for opt do
   ;;
   --enable-linux-aio) linux_aio="yes"
   ;;
+  --disable-linux-io-uring) linux_io_uring="no"
+  ;;
+  --enable-linux-io-uring) linux_io_uring="yes"
+  ;;
   --disable-attr) attr="no"
   ;;
   --enable-attr) attr="yes"
@@ -1784,6 +1789,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   vde             support for vde network
   netmap          support for netmap network
   linux-aio       Linux AIO support
+  linux-io-uring  Linux io_uring support
   cap-ng          libcap-ng support
   attr            attr and xattr support
   vhost-net       vhost-net kernel acceleration support
@@ -3973,6 +3979,21 @@ EOF
     linux_aio=no
   fi
 fi
+##########################################
+# linux-io-uring probe
+
+if test "$linux_io_uring" != "no" ; then
+  if $pkg_config liburing; then
+    linux_io_uring_cflags=$($pkg_config --cflags liburing)
+    linux_io_uring_libs=$($pkg_config --libs liburing)
+    linux_io_uring=yes
+  else
+    if test "$linux_io_uring" = "yes" ; then
+      feature_not_found "linux io_uring" "Install liburing devel"
+    fi
+    linux_io_uring=no
+  fi
+fi
 
 ##########################################
 # TPM emulation is only on POSIX
@@ -6396,6 +6417,7 @@ echo "PIE               $pie"
 echo "vde support       $vde"
 echo "netmap support    $netmap"
 echo "Linux AIO support $linux_aio"
+echo "Linux io_uring support $linux_io_uring"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
@@ -6874,6 +6896,11 @@ fi
 if test "$linux_aio" = "yes" ; then
   echo "CONFIG_LINUX_AIO=y" >> $config_host_mak
 fi
+if test "$linux_io_uring" = "yes" ; then
+  echo "CONFIG_LINUX_IO_URING=y" >> $config_host_mak
+  echo "LINUX_IO_URING_CFLAGS=$linux_io_uring_cflags" >> $config_host_mak
+  echo "LINUX_IO_URING_LIBS=$linux_io_uring_libs" >> $config_host_mak
+fi
 if test "$attr" = "yes" ; then
   echo "CONFIG_ATTR=y" >> $config_host_mak
 fi
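
With this probe in place, the feature can be forced on at configure time;
an illustrative invocation:

  ./configure --enable-linux-io-uring

If the liburing development files are missing this fails with the
feature_not_found() message above, whereas the default auto-probe simply
disables the feature.
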
-- 
2.17.1




* [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 01/12] configure: permit use of io_uring Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-11  7:36   ` Fam Zheng
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 03/12] block/block: add BDRV flag " Aarushi Mehta
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

The option is only enumerated on hosts that support it.

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 qapi/block-core.json | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1defcde048..db7eedd058 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2792,11 +2792,13 @@
 #
 # @threads:     Use qemu's thread pool
 # @native:      Use native AIO backend (only Linux and Windows)
+# @io_uring:    Use linux io_uring (since 4.1)
 #
 # Since: 2.9
 ##
 { 'enum': 'BlockdevAioOptions',
-  'data': [ 'threads', 'native' ] }
+  'data': [ 'threads', 'native',
+            { 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } ] }
 
 ##
 # @BlockdevCacheOptions:
-- 
2.17.1




* [Qemu-devel] [PATCH v5 03/12] block/block: add BDRV flag for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 01/12] configure: permit use of io_uring Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces " Aarushi Mehta
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Maxim Levitsky <maximlevitsky@gmail.com>
---
 include/block/block.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/block/block.h b/include/block/block.h
index f9415ed740..5e08df716f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -121,6 +121,7 @@ typedef struct HDGeometry {
                                       ignoring the format layer */
 #define BDRV_O_NO_IO       0x10000 /* don't initialize for I/O */
 #define BDRV_O_AUTO_RDONLY 0x20000 /* degrade to read-only if opening read-write fails */
+#define BDRV_O_IO_URING    0x40000 /* use io_uring instead of the thread pool */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
 
-- 
2.17.1




* [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (2 preceding siblings ...)
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 03/12] block/block: add BDRV flag " Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-11 11:17   ` Fam Zheng
  2019-06-17 12:26   ` Maxim Levitsky
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface Aarushi Mehta
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Aborts when an sqe fails to be set up, as sqes cannot be returned to the ring.

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 MAINTAINERS             |   7 +
 block/Makefile.objs     |   3 +
 block/io_uring.c        | 314 ++++++++++++++++++++++++++++++++++++++++
 include/block/aio.h     |  16 +-
 include/block/raw-aio.h |  12 ++
 5 files changed, 351 insertions(+), 1 deletion(-)
 create mode 100644 block/io_uring.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7be1225415..49f896796e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2516,6 +2516,13 @@ F: block/file-posix.c
 F: block/file-win32.c
 F: block/win32-aio.c
 
+Linux io_uring
+M: Aarushi Mehta <mehta.aaru20@gmail.com>
+R: Stefan Hajnoczi <stefanha@redhat.com>
+L: qemu-block@nongnu.org
+S: Maintained
+F: block/io_uring.c
+
 qcow2
 M: Kevin Wolf <kwolf@redhat.com>
 M: Max Reitz <mreitz@redhat.com>
diff --git a/block/Makefile.objs b/block/Makefile.objs
index ae11605c9f..8fde7a23a5 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
 block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
 block-obj-$(CONFIG_POSIX) += file-posix.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
+block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
 block-obj-y += null.o mirror.o commit.o io.o create.o
 block-obj-y += throttle-groups.o
 block-obj-$(CONFIG_LINUX) += nvme.o
@@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
 dmg-lzfse.o-libs   := $(LZFSE_LIBS)
 qcow.o-libs        := -lz
 linux-aio.o-libs   := -laio
+io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
+io_uring.o-libs    := $(LINUX_IO_URING_LIBS)
 parallels.o-cflags := $(LIBXML2_CFLAGS)
 parallels.o-libs   := $(LIBXML2_LIBS)
diff --git a/block/io_uring.c b/block/io_uring.c
new file mode 100644
index 0000000000..f327c7ef96
--- /dev/null
+++ b/block/io_uring.c
@@ -0,0 +1,314 @@
+/*
+ * Linux io_uring support.
+ *
+ * Copyright (C) 2009 IBM, Corp.
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Copyright (C) 2019 Aarushi Mehta
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <liburing.h>
+#include "qemu-common.h"
+#include "block/aio.h"
+#include "qemu/queue.h"
+#include "block/block.h"
+#include "block/raw-aio.h"
+#include "qemu/coroutine.h"
+#include "qapi/error.h"
+
+#define MAX_EVENTS 128
+
+typedef struct LuringAIOCB {
+    Coroutine *co;
+    struct io_uring_sqe sqeq;
+    ssize_t ret;
+    QEMUIOVector *qiov;
+    bool is_read;
+    QSIMPLEQ_ENTRY(LuringAIOCB) next;
+} LuringAIOCB;
+
+typedef struct LuringQueue {
+    int plugged;
+    unsigned int in_queue;
+    unsigned int in_flight;
+    bool blocked;
+    QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
+} LuringQueue;
+
+typedef struct LuringState {
+    AioContext *aio_context;
+
+    struct io_uring ring;
+
+    /* io queue for submit at batch.  Protected by AioContext lock. */
+    LuringQueue io_q;
+
+    /* I/O completion processing.  Only runs in I/O thread.  */
+    QEMUBH *completion_bh;
+} LuringState;
+
+/**
+ * ioq_submit:
+ * @s: AIO state
+ *
+ * Queues pending sqes and submits them
+ *
+ */
+static int ioq_submit(LuringState *s);
+
+/**
+ * qemu_luring_process_completions:
+ * @s: AIO state
+ *
+ * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
+ *
+ */
+static void qemu_luring_process_completions(LuringState *s)
+{
+    struct io_uring_cqe *cqes;
+    int ret;
+
+    /*
+     * Request completion callbacks can run the nested event loop.
+     * Schedule ourselves so the nested event loop will "see" remaining
+     * completed requests and process them.  Without this, completion
+     * callbacks that wait for other requests using a nested event loop
+     * would hang forever.
+     */
+    qemu_bh_schedule(s->completion_bh);
+
+    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
+        if (!cqes) {
+            break;
+        }
+        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
+        ret = cqes->res;
+
+        if (ret == luringcb->qiov->size) {
+            ret = 0;
+        } else if (ret >= 0) {
+            /* Short Read/Write */
+            if (luringcb->is_read) {
+                /* Read, pad with zeroes */
+                qemu_iovec_memset(luringcb->qiov, ret, 0,
+                luringcb->qiov->size - ret);
+            } else {
+                ret = -ENOSPC;
+            }
+        }
+        luringcb->ret = ret;
+
+        io_uring_cqe_seen(&s->ring, cqes);
+        cqes = NULL;
+        /* Change counters one-by-one because we can be nested. */
+        s->io_q.in_flight--;
+
+        /*
+         * If the coroutine is already entered it must be in ioq_submit()
+         * and will notice luringcb->ret has been filled in when it
+         * eventually runs later. Coroutines cannot be entered recursively
+         * so avoid doing that!
+         */
+        if (!qemu_coroutine_entered(luringcb->co)) {
+            aio_co_wake(luringcb->co);
+        }
+    }
+    qemu_bh_cancel(s->completion_bh);
+}
+
+static void qemu_luring_process_completions_and_submit(LuringState *s)
+{
+    aio_context_acquire(s->aio_context);
+    qemu_luring_process_completions(s);
+
+    if (!s->io_q.plugged && s->io_q.in_queue > 0) {
+        ioq_submit(s);
+    }
+    aio_context_release(s->aio_context);
+}
+
+static void qemu_luring_completion_bh(void *opaque)
+{
+    LuringState *s = opaque;
+    qemu_luring_process_completions_and_submit(s);
+}
+
+static void qemu_luring_completion_cb(void *opaque)
+{
+    LuringState *s = opaque;
+    qemu_luring_process_completions_and_submit(s);
+}
+
+static void ioq_init(LuringQueue *io_q)
+{
+    QSIMPLEQ_INIT(&io_q->sq_overflow);
+    io_q->plugged = 0;
+    io_q->in_queue = 0;
+    io_q->in_flight = 0;
+    io_q->blocked = false;
+}
+
+static int ioq_submit(LuringState *s)
+{
+    int ret = 0;
+    LuringAIOCB *luringcb, *luringcb_next;
+
+    while (s->io_q.in_queue > 0) {
+        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
+                              luringcb_next) {
+            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
+            if (!sqes) {
+                break;
+            }
+            /* Prep sqe for submission */
+            *sqes = luringcb->sqeq;
+            QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
+        }
+        ret =  io_uring_submit(&s->ring);
+        /* Prevent infinite loop if submission is refused */
+        if (ret <= 0) {
+            if (ret == -EAGAIN) {
+                continue;
+            }
+            break;
+        }
+        s->io_q.in_flight += ret;
+        s->io_q.in_queue  -= ret;
+    }
+    s->io_q.blocked = (s->io_q.in_queue > 0);
+
+    if (s->io_q.in_flight) {
+        /*
+         * We can try to complete something just right away if there are
+         * still requests in-flight.
+         */
+        qemu_luring_process_completions(s);
+    }
+    return ret;
+}
+
+void luring_io_plug(BlockDriverState *bs, LuringState *s)
+{
+    s->io_q.plugged++;
+}
+
+void luring_io_unplug(BlockDriverState *bs, LuringState *s)
+{
+    assert(s->io_q.plugged);
+    if (--s->io_q.plugged == 0 &&
+        !s->io_q.blocked && s->io_q.in_queue > 0) {
+        ioq_submit(s);
+    }
+}
+
+/**
+ * luring_do_submit:
+ * @fd: file descriptor for I/O
+ * @luringcb: AIO control block
+ * @s: AIO state
+ * @offset: offset for request
+ * @type: type of request
+ *
+ * Fetches sqes from ring, adds to pending queue and preps them
+ *
+ */
+static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
+                            uint64_t offset, int type)
+{
+    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
+    if (!sqes) {
+        sqes = &luringcb->sqeq;
+        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
+    }
+
+    switch (type) {
+    case QEMU_AIO_WRITE:
+        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
+                             luringcb->qiov->niov, offset);
+        break;
+    case QEMU_AIO_READ:
+        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
+                            luringcb->qiov->niov, offset);
+        break;
+    case QEMU_AIO_FLUSH:
+        io_uring_prep_fsync(sqes, fd, 0);
+        break;
+    default:
+        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
+                        __func__, type);
+        abort();
+    }
+    io_uring_sqe_set_data(sqes, luringcb);
+    s->io_q.in_queue++;
+
+    if (!s->io_q.blocked &&
+        (!s->io_q.plugged ||
+         s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
+        return ioq_submit(s);
+    }
+    return 0;
+}
+
+int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
+                                uint64_t offset, QEMUIOVector *qiov, int type)
+{
+    int ret;
+    LuringAIOCB luringcb = {
+        .co         = qemu_coroutine_self(),
+        .ret        = -EINPROGRESS,
+        .qiov       = qiov,
+        .is_read    = (type == QEMU_AIO_READ),
+    };
+
+    ret = luring_do_submit(fd, &luringcb, s, offset, type);
+    if (ret < 0) {
+        return ret;
+    }
+
+    if (luringcb.ret == -EINPROGRESS) {
+        qemu_coroutine_yield();
+    }
+    return luringcb.ret;
+}
+
+void luring_detach_aio_context(LuringState *s, AioContext *old_context)
+{
+    aio_set_fd_handler(old_context, s->ring.ring_fd, false, NULL, NULL, NULL,
+                       s);
+    qemu_bh_delete(s->completion_bh);
+    s->aio_context = NULL;
+}
+
+void luring_attach_aio_context(LuringState *s, AioContext *new_context)
+{
+    s->aio_context = new_context;
+    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
+    aio_set_fd_handler(s->aio_context, s->ring.ring_fd, false,
+                       qemu_luring_completion_cb, NULL, NULL, s);
+}
+
+LuringState *luring_init(Error **errp)
+{
+    int rc;
+    LuringState *s;
+    s = g_malloc0(sizeof(*s));
+    struct io_uring *ring = &s->ring;
+    rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
+    if (rc < 0) {
+        error_setg_errno(errp, errno, "failed to init linux io_uring ring");
+        g_free(s);
+        return NULL;
+    }
+
+    ioq_init(&s->io_q);
+    return s;
+
+}
+
+void luring_cleanup(LuringState *s)
+{
+    io_uring_queue_exit(&s->ring);
+    g_free(s);
+}
diff --git a/include/block/aio.h b/include/block/aio.h
index 0ca25dfec6..9da3fd9793 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -50,6 +50,7 @@ typedef void IOHandler(void *opaque);
 struct Coroutine;
 struct ThreadPool;
 struct LinuxAioState;
+struct LuringState;
 
 struct AioContext {
     GSource source;
@@ -118,11 +119,19 @@ struct AioContext {
     struct ThreadPool *thread_pool;
 
 #ifdef CONFIG_LINUX_AIO
-    /* State for native Linux AIO.  Uses aio_context_acquire/release for
+    /*
+     * State for native Linux AIO.  Uses aio_context_acquire/release for
      * locking.
      */
     struct LinuxAioState *linux_aio;
 #endif
+#ifdef CONFIG_LINUX_IO_URING
+    /*
+     * State for Linux io_uring.  Uses aio_context_acquire/release for
+     * locking.
+     */
+    struct LuringState *linux_io_uring;
+#endif
 
     /* TimerLists for calling timers - one per clock type.  Has its own
      * locking.
@@ -387,6 +396,11 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
 /* Return the LinuxAioState bound to this AioContext */
 struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
 
+/* Setup the LuringState bound to this AioContext */
+struct LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
+
+/* Return the LuringState bound to this AioContext */
+struct LuringState *aio_get_linux_io_uring(AioContext *ctx);
 /**
  * aio_timer_new_with_attrs:
  * @ctx: the aio context
diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
index 0cb7cc74a2..71d7d1395f 100644
--- a/include/block/raw-aio.h
+++ b/include/block/raw-aio.h
@@ -55,6 +55,18 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
 void laio_io_plug(BlockDriverState *bs, LinuxAioState *s);
 void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s);
 #endif
+/* io_uring.c - Linux io_uring implementation */
+#ifdef CONFIG_LINUX_IO_URING
+typedef struct LuringState LuringState;
+LuringState *luring_init(Error **errp);
+void luring_cleanup(LuringState *s);
+int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
+                                uint64_t offset, QEMUIOVector *qiov, int type);
+void luring_detach_aio_context(LuringState *s, AioContext *old_context);
+void luring_attach_aio_context(LuringState *s, AioContext *new_context);
+void luring_io_plug(BlockDriverState *bs, LuringState *s);
+void luring_io_unplug(BlockDriverState *bs, LuringState *s);
+#endif
 
 #ifdef _WIN32
 typedef struct QEMUWin32AIOState QEMUWin32AIOState;
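
For readers unfamiliar with liburing, the code above wraps a small API
cycle: get an sqe, prep it, submit, then reap cqes. A minimal standalone
sketch of that cycle (illustrative only, not part of this series; test.img
is a placeholder and error handling is trimmed):

/* uring-read.c: minimal liburing example (illustrative, not QEMU code) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int fd = open("test.img", O_RDONLY);      /* placeholder file */

    if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0) {
        perror("setup");
        return 1;
    }

    sqe = io_uring_get_sqe(&ring);            /* NULL when the SQ is full */
    if (!sqe) {                               /* QEMU parks these in sq_overflow */
        return 1;
    }
    io_uring_prep_readv(sqe, fd, &iov, 1, 0); /* read 4 KiB from offset 0 */
    io_uring_sqe_set_data(sqe, buf);          /* echoed back via the cqe */

    if (io_uring_submit(&ring) < 0 ||         /* one syscall for all queued sqes */
        io_uring_wait_cqe(&ring, &cqe) < 0) {
        perror("submit/wait");
        return 1;
    }
    printf("read returned %d\n", cqe->res);   /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);            /* mark the cqe consumed */

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

Something like "gcc uring-read.c -o uring-read -luring" should build it
against liburing.
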
-- 
2.17.1




* [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (3 preceding siblings ...)
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces " Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-17 12:33   ` Maxim Levitsky
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring Aarushi Mehta
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS         |  1 +
 stubs/Makefile.objs |  1 +
 stubs/io_uring.c    | 32 ++++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+)
 create mode 100644 stubs/io_uring.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 49f896796e..bc38175124 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2522,6 +2522,7 @@ R: Stefan Hajnoczi <stefanha@redhat.com>
 L: qemu-block@nongnu.org
 S: Maintained
 F: block/io_uring.c
+F: stubs/io_uring.c
 
 qcow2
 M: Kevin Wolf <kwolf@redhat.com>
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 9c7393b08c..5cf160a9c8 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -13,6 +13,7 @@ stub-obj-y += iothread.o
 stub-obj-y += iothread-lock.o
 stub-obj-y += is-daemonized.o
 stub-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
+stub-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
 stub-obj-y += machine-init-done.o
 stub-obj-y += migr-blocker.o
 stub-obj-y += change-state-handler.o
diff --git a/stubs/io_uring.c b/stubs/io_uring.c
new file mode 100644
index 0000000000..622d1e4648
--- /dev/null
+++ b/stubs/io_uring.c
@@ -0,0 +1,32 @@
+/*
+ * Linux io_uring support.
+ *
+ * Copyright (C) 2009 IBM, Corp.
+ * Copyright (C) 2009 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include "block/aio.h"
+#include "block/raw-aio.h"
+
+void luring_detach_aio_context(LuringState *s, AioContext *old_context)
+{
+    abort();
+}
+
+void luring_attach_aio_context(LuringState *s, AioContext *new_context)
+{
+    abort();
+}
+
+LuringState *luring_init(Error **errp)
+{
+    abort();
+}
+
+void luring_cleanup(LuringState *s)
+{
+    abort();
+}
-- 
2.17.1




* [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (4 preceding siblings ...)
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface Aarushi Mehta
@ 2019-06-10 13:48 ` Aarushi Mehta
  2019-06-17 12:56   ` Maxim Levitsky
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option Aarushi Mehta
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 util/async.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/util/async.c b/util/async.c
index c10642a385..2709f0edc3 100644
--- a/util/async.c
+++ b/util/async.c
@@ -277,6 +277,14 @@ aio_ctx_finalize(GSource     *source)
     }
 #endif
 
+#ifdef CONFIG_LINUX_IO_URING
+    if (ctx->linux_io_uring) {
+        luring_detach_aio_context(ctx->linux_io_uring, ctx);
+        luring_cleanup(ctx->linux_io_uring);
+        ctx->linux_io_uring = NULL;
+    }
+#endif
+
     assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
     qemu_bh_delete(ctx->co_schedule_bh);
 
@@ -341,6 +349,29 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 }
 #endif
 
+#ifdef CONFIG_LINUX_IO_URING
+LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
+{
+    if (ctx->linux_io_uring) {
+        return ctx->linux_io_uring;
+    }
+
+    ctx->linux_io_uring = luring_init(errp);
+    if (!ctx->linux_io_uring) {
+        return NULL;
+    }
+
+    luring_attach_aio_context(ctx->linux_io_uring, ctx);
+    return ctx->linux_io_uring;
+}
+
+LuringState *aio_get_linux_io_uring(AioContext *ctx)
+{
+    assert(ctx->linux_io_uring);
+    return ctx->linux_io_uring;
+}
+#endif
+
 void aio_notify(AioContext *ctx)
 {
     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
@@ -432,6 +463,11 @@ AioContext *aio_context_new(Error **errp)
 #ifdef CONFIG_LINUX_AIO
     ctx->linux_aio = NULL;
 #endif
+
+#ifdef CONFIG_LINUX_IO_URING
+    ctx->linux_io_uring = NULL;
+#endif
+
     ctx->thread_pool = NULL;
     qemu_rec_mutex_init(&ctx->lock);
     timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
-- 
2.17.1




* [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (5 preceding siblings ...)
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-17 13:01   ` Maxim Levitsky
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring Aarushi Mehta
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 blockdev.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/blockdev.c b/blockdev.c
index 3f44b891eb..a2a5b32604 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts *opts, int *bdrv_flags,
         if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
             if (!strcmp(aio, "native")) {
                 *bdrv_flags |= BDRV_O_NATIVE_AIO;
+            } else if (!strcmp(aio, "io_uring")) {
+                *bdrv_flags |= BDRV_O_IO_URING;
             } else if (!strcmp(aio, "threads")) {
                 /* this is the default */
             } else {
@@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
         },{
             .name = "aio",
             .type = QEMU_OPT_STRING,
-            .help = "host AIO implementation (threads, native)",
+            .help = "host AIO implementation (threads, native, io_uring)",
         },{
             .name = BDRV_OPT_CACHE_WB,
             .type = QEMU_OPT_BOOL,
-- 
2.17.1




* [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (6 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-17 14:12   ` Maxim Levitsky
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring Aarushi Mehta
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 block/file-posix.c | 85 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 69 insertions(+), 16 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index d018429672..211dfe5337 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -154,6 +154,7 @@ typedef struct BDRVRawState {
     bool has_write_zeroes:1;
     bool discard_zeroes:1;
     bool use_linux_aio:1;
+    bool use_linux_io_uring:1;
     bool page_cache_inconsistent:1;
     bool has_fallocate;
     bool needs_alignment;
@@ -423,7 +424,7 @@ static QemuOptsList raw_runtime_opts = {
         {
             .name = "aio",
             .type = QEMU_OPT_STRING,
-            .help = "host AIO implementation (threads, native)",
+            .help = "host AIO implementation (threads, native, io_uring)",
         },
         {
             .name = "locking",
@@ -482,9 +483,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
         goto fail;
     }
 
-    aio_default = (bdrv_flags & BDRV_O_NATIVE_AIO)
-                  ? BLOCKDEV_AIO_OPTIONS_NATIVE
-                  : BLOCKDEV_AIO_OPTIONS_THREADS;
+    if (bdrv_flags & BDRV_O_NATIVE_AIO) {
+        aio_default = BLOCKDEV_AIO_OPTIONS_NATIVE;
+#ifdef CONFIG_LINUX_IO_URING
+    } else if (bdrv_flags & BDRV_O_IO_URING) {
+        aio_default = BLOCKDEV_AIO_OPTIONS_IO_URING;
+#endif
+    } else {
+        aio_default = BLOCKDEV_AIO_OPTIONS_THREADS;
+    }
     aio = qapi_enum_parse(&BlockdevAioOptions_lookup,
                           qemu_opt_get(opts, "aio"),
                           aio_default, &local_err);
@@ -493,7 +500,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
         ret = -EINVAL;
         goto fail;
     }
+
     s->use_linux_aio = (aio == BLOCKDEV_AIO_OPTIONS_NATIVE);
+#ifdef CONFIG_LINUX_IO_URING
+    s->use_linux_io_uring = (aio == BLOCKDEV_AIO_OPTIONS_IO_URING);
+#endif
 
     locking = qapi_enum_parse(&OnOffAuto_lookup,
                               qemu_opt_get(opts, "locking"),
@@ -557,7 +568,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     s->shared_perm = BLK_PERM_ALL;
 
 #ifdef CONFIG_LINUX_AIO
-     /* Currently Linux does AIO only for files opened with O_DIRECT */
+    /* Currently Linux does AIO only for files opened with O_DIRECT */
     if (s->use_linux_aio) {
         if (!(s->open_flags & O_DIRECT)) {
             error_setg(errp, "aio=native was specified, but it requires "
@@ -579,6 +590,22 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
 #endif /* !defined(CONFIG_LINUX_AIO) */
 
+#ifdef CONFIG_LINUX_IO_URING
+    if (s->use_linux_io_uring) {
+        if (!aio_setup_linux_io_uring(bdrv_get_aio_context(bs), errp)) {
+            error_prepend(errp, "Unable to use io_uring: ");
+            goto fail;
+        }
+    }
+#else
+    if (s->use_linux_io_uring) {
+        error_setg(errp, "aio=io_uring was specified, but is not supported "
+                         "in this build.");
+        ret = -EINVAL;
+        goto fail;
+    }
+#endif /* !defined(CONFIG_LINUX_IO_URING) */
+
     s->has_discard = true;
     s->has_write_zeroes = true;
     if ((bs->open_flags & BDRV_O_NOCACHE) != 0) {
@@ -1875,16 +1902,20 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
      * If this is the case tell the low-level driver that it needs
      * to copy the buffer.
      */
-    if (s->needs_alignment) {
-        if (!bdrv_qiov_is_aligned(bs, qiov)) {
-            type |= QEMU_AIO_MISALIGNED;
+    if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
+                type |= QEMU_AIO_MISALIGNED;
+#ifdef CONFIG_LINUX_IO_URING
+    } else if (s->use_linux_io_uring) {
+        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
+        assert(qiov->size == bytes);
+        return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
+#endif
 #ifdef CONFIG_LINUX_AIO
-        } else if (s->use_linux_aio) {
-            LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
-            assert(qiov->size == bytes);
-            return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
+    } else if (s->use_linux_aio && s->needs_alignment) {
+        LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
+        assert(qiov->size == bytes);
+        return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
 #endif
-        }
     }
 
     acb = (RawPosixAIOData) {
@@ -1920,24 +1951,36 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
 
 static void raw_aio_plug(BlockDriverState *bs)
 {
+    BDRVRawState __attribute__((unused)) *s = bs->opaque;
 #ifdef CONFIG_LINUX_AIO
-    BDRVRawState *s = bs->opaque;
     if (s->use_linux_aio) {
         LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
         laio_io_plug(bs, aio);
     }
 #endif
+#ifdef CONFIG_LINUX_IO_URING
+    if (s->use_linux_io_uring) {
+        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
+        luring_io_plug(bs, aio);
+    }
+#endif
 }
 
 static void raw_aio_unplug(BlockDriverState *bs)
 {
+    BDRVRawState __attribute__((unused)) *s = bs->opaque;
 #ifdef CONFIG_LINUX_AIO
-    BDRVRawState *s = bs->opaque;
     if (s->use_linux_aio) {
         LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
         laio_io_unplug(bs, aio);
     }
 #endif
+#ifdef CONFIG_LINUX_IO_URING
+    if (s->use_linux_io_uring) {
+        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
+        luring_io_unplug(bs, aio);
+    }
+#endif
 }
 
 static int raw_co_flush_to_disk(BlockDriverState *bs)
@@ -1963,8 +2006,8 @@ static int raw_co_flush_to_disk(BlockDriverState *bs)
 static void raw_aio_attach_aio_context(BlockDriverState *bs,
                                        AioContext *new_context)
 {
+    BDRVRawState __attribute__((unused)) *s = bs->opaque;
 #ifdef CONFIG_LINUX_AIO
-    BDRVRawState *s = bs->opaque;
     if (s->use_linux_aio) {
         Error *local_err;
         if (!aio_setup_linux_aio(new_context, &local_err)) {
@@ -1974,6 +2017,16 @@ static void raw_aio_attach_aio_context(BlockDriverState *bs,
         }
     }
 #endif
+#ifdef CONFIG_LINUX_IO_URING
+    if (s->use_linux_io_uring) {
+        Error *local_err;
+        if (!aio_setup_linux_io_uring(new_context, &local_err)) {
+            error_reportf_err(local_err, "Unable to use linux io_uring, "
+                                         "falling back to thread pool: ");
+            s->use_linux_io_uring = false;
+        }
+    }
+#endif
 }
 
 static void raw_close(BlockDriverState *bs)
-- 
2.17.1




* [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (7 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-11  9:47   ` Stefan Hajnoczi
  2019-06-11 11:34   ` Fam Zheng
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling Aarushi Mehta
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 block/io_uring.c   | 14 ++++++++++++--
 block/trace-events |  8 ++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index f327c7ef96..47e027364a 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -17,6 +17,7 @@
 #include "block/raw-aio.h"
 #include "qemu/coroutine.h"
 #include "qapi/error.h"
+#include "trace.h"
 
 #define MAX_EVENTS 128
 
@@ -191,12 +192,15 @@ static int ioq_submit(LuringState *s)
 
 void luring_io_plug(BlockDriverState *bs, LuringState *s)
 {
+    trace_luring_io_plug();
     s->io_q.plugged++;
 }
 
 void luring_io_unplug(BlockDriverState *bs, LuringState *s)
 {
     assert(s->io_q.plugged);
+    trace_luring_io_unplug(s->io_q.blocked, s->io_q.plugged,
+                                 s->io_q.in_queue, s->io_q.in_flight);
     if (--s->io_q.plugged == 0 &&
         !s->io_q.blocked && s->io_q.in_queue > 0) {
         ioq_submit(s);
@@ -217,6 +221,7 @@ void luring_io_unplug(BlockDriverState *bs, LuringState *s)
 static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
                             uint64_t offset, int type)
 {
+    int ret;
     struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
     if (!sqes) {
         sqes = &luringcb->sqeq;
@@ -242,11 +247,14 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
     }
     io_uring_sqe_set_data(sqes, luringcb);
     s->io_q.in_queue++;
-
+    trace_luring_do_submit(s->io_q.blocked, s->io_q.plugged,
+                           s->io_q.in_queue, s->io_q.in_flight);
     if (!s->io_q.blocked &&
         (!s->io_q.plugged ||
          s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
-        return ioq_submit(s);
+        ret = ioq_submit(s);
+        trace_luring_do_submit_done(ret);
+        return ret;
     }
     return 0;
 }
@@ -294,6 +302,7 @@ LuringState *luring_init(Error **errp)
     int rc;
     LuringState *s;
     s = g_malloc0(sizeof(*s));
+    trace_luring_init_state((void *)s, sizeof(*s));
     struct io_uring *ring = &s->ring;
     rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
     if (rc < 0) {
@@ -311,4 +320,5 @@ void luring_cleanup(LuringState *s)
 {
     io_uring_queue_exit(&s->ring);
     g_free(s);
+    trace_luring_cleanup_state();
 }
diff --git a/block/trace-events b/block/trace-events
index eab51497fc..c4564dcd96 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -60,6 +60,14 @@ qmp_block_stream(void *bs, void *job) "bs %p job %p"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
 
+#io_uring.c
+luring_init_state(void *s, size_t size) "s %p size %zu"
+luring_cleanup_state(void) "s freed"
+disable luring_io_plug(void) "plug"
+disable luring_io_unplug(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
+disable luring_do_submit(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
+disable luring_do_submit_done(int ret) "submitted to kernel %d"
+
 # qcow2.c
 qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
 qcow2_writev_done_req(void *co, int ret) "co %p ret %d"
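
Once enabled (note that the 'disable' prefix compiles an event out of the
build), the new events could be inspected with something like the
illustrative command below, where disk.img is a placeholder image:

  qemu-system-x86_64 -trace 'luring_*' \
      -drive file=disk.img,format=raw,aio=io_uring
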
-- 
2.17.1




* [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (8 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-11  9:51   ` Stefan Hajnoczi
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring Aarushi Mehta
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 block/io_uring.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index 47e027364a..acfaa48151 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -142,6 +142,21 @@ static void qemu_luring_completion_cb(void *opaque)
     qemu_luring_process_completions_and_submit(s);
 }
 
+static bool qemu_luring_poll_cb(void *opaque)
+{
+    LuringState *s = opaque;
+    struct io_uring_cqe *cqes;
+
+    if (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
+        if (!cqes) {
+            qemu_luring_process_completions_and_submit(s);
+            return true;
+        }
+    }
+
+    return false;
+}
+
 static void ioq_init(LuringQueue *io_q)
 {
     QSIMPLEQ_INIT(&io_q->sq_overflow);
@@ -294,7 +309,7 @@ void luring_attach_aio_context(LuringState *s, AioContext *new_context)
     s->aio_context = new_context;
     s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
     aio_set_fd_handler(s->aio_context, s->ring.ring_fd, false,
-                       qemu_luring_completion_cb, NULL, NULL, s);
+                       qemu_luring_completion_cb, NULL, qemu_luring_poll_cb, s);
 }
 
 LuringState *luring_init(Error **errp)
-- 
2.17.1




* [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (9 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-11  9:54   ` Stefan Hajnoczi
  2019-06-11 12:26   ` Fam Zheng
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks " Aarushi Mehta
  2019-06-11  9:56 ` [Qemu-devel] [PATCH v5 00/12] Add support " Stefan Hajnoczi
  12 siblings, 2 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 qemu-io.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/qemu-io.c b/qemu-io.c
index 8d5d5911cb..54b82151c4 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -129,6 +129,7 @@ static void open_help(void)
 " -n, -- disable host cache, short for -t none\n"
 " -U, -- force shared permissions\n"
 " -k, -- use kernel AIO implementation (on Linux only)\n"
+" -i  -- use kernel io_uring (Linux 5.1+)\n"
 " -t, -- use the given cache mode for the image\n"
 " -d, -- use the given discard mode for the image\n"
 " -o, -- options to be given to the block driver"
@@ -188,6 +189,11 @@ static int open_f(BlockBackend *blk, int argc, char **argv)
         case 'k':
             flags |= BDRV_O_NATIVE_AIO;
             break;
+#ifdef CONFIG_LINUX_IO_URING
+        case 'i':
+            flags |= BDRV_O_IO_URING;
+            break;
+#endif
         case 't':
             if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
                 error_report("Invalid cache option: %s", optarg);
@@ -290,6 +296,7 @@ static void usage(const char *name)
 "  -C, --copy-on-read   enable copy-on-read\n"
 "  -m, --misalign       misalign allocations for O_DIRECT\n"
 "  -k, --native-aio     use kernel AIO implementation (on Linux only)\n"
+"  -i  --io_uring       use kernel io_uring (Linux 5.1+)\n"
 "  -t, --cache=MODE     use the given cache mode for the image\n"
 "  -d, --discard=MODE   use the given discard mode for the image\n"
 "  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
@@ -499,6 +506,7 @@ int main(int argc, char **argv)
         { "copy-on-read", no_argument, NULL, 'C' },
         { "misalign", no_argument, NULL, 'm' },
         { "native-aio", no_argument, NULL, 'k' },
+        { "io_uring", no_argument, NULL, 'i' },
         { "discard", required_argument, NULL, 'd' },
         { "cache", required_argument, NULL, 't' },
         { "trace", required_argument, NULL, 'T' },
@@ -566,6 +574,11 @@ int main(int argc, char **argv)
         case 'k':
             flags |= BDRV_O_NATIVE_AIO;
             break;
+#ifdef CONFIG_LINUX_IO_URING
+        case 'i':
+            flags |= BDRV_O_IO_URING;
+            break;
+#endif
         case 't':
             if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
                 error_report("Invalid cache option: %s", optarg);
-- 
2.17.1




* [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (10 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring Aarushi Mehta
@ 2019-06-10 13:49 ` Aarushi Mehta
  2019-06-11  9:58   ` Stefan Hajnoczi
  2019-06-17 14:21   ` Maxim Levitsky
  2019-06-11  9:56 ` [Qemu-devel] [PATCH v5 00/12] Add support " Stefan Hajnoczi
  12 siblings, 2 replies; 40+ messages in thread
From: Aarushi Mehta @ 2019-06-10 13:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Maxim Levitsky, Julia Suvorova, Aarushi Mehta

Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
---
 tests/qemu-iotests/087     | 26 ++++++++++++++++++++++++++
 tests/qemu-iotests/087.out | 10 ++++++++++
 2 files changed, 36 insertions(+)

diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index d6c8613419..0cc7283ad8 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -124,6 +124,32 @@ run_qemu_filter_aio <<EOF
 { "execute": "quit" }
 EOF
 
+echo
+echo === aio=io_uring without O_DIRECT ===
+echo
+
+# Skip this test if io_uring is not enabled in this build
+run_qemu_filter_io_uring()
+{
+    run_qemu "$@"
+}
+
+run_qemu_filter_io_uring <<EOF
+{ "execute": "qmp_capabilities" }
+{ "execute": "blockdev-add",
+  "arguments": {
+      "driver": "$IMGFMT",
+      "node-name": "disk",
+      "file": {
+          "driver": "file",
+          "filename": "$TEST_IMG",
+          "aio": "io_uring"
+      }
+    }
+  }
+{ "execute": "quit" }
+EOF
+
 echo
 echo === Encrypted image QCow ===
 echo
diff --git a/tests/qemu-iotests/087.out b/tests/qemu-iotests/087.out
index 2d92ea847b..f0557d425f 100644
--- a/tests/qemu-iotests/087.out
+++ b/tests/qemu-iotests/087.out
@@ -32,6 +32,16 @@ QMP_VERSION
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 
 
+=== aio=io_uring without O_DIRECT ===
+
+Testing:
+QMP_VERSION
+{"return": {}}
+{"return": {}}
+{"return": {}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
+
+
 === Encrypted image QCow ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on encrypt.key-secret=sec0
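
For reference, the updated case can be run on its own from a built tree
with something like:

  cd tests/qemu-iotests && ./check -qcow2 087
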
-- 
2.17.1




* Re: [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring Aarushi Mehta
@ 2019-06-11  7:36   ` Fam Zheng
  2019-06-11  9:32     ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Fam Zheng @ 2019-06-11  7:36 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Stefan Hajnoczi, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, 06/10 19:18, Aarushi Mehta wrote:
> Option only enumerates for hosts that support it.
> 
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  qapi/block-core.json | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 1defcde048..db7eedd058 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2792,11 +2792,13 @@
>  #
>  # @threads:     Use qemu's thread pool
>  # @native:      Use native AIO backend (only Linux and Windows)
> +# @io_uring:    Use linux io_uring (since 4.1)
>  #
>  # Since: 2.9
>  ##
>  { 'enum': 'BlockdevAioOptions',
> -  'data': [ 'threads', 'native' ] }
> +  'data': [ 'threads', 'native',
> +            { 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } ] }

Question: 'native' has a dependency on libaio but it doesn't have the
condition.  Is the inconsistency intended?

>  
>  ##
>  # @BlockdevCacheOptions:
> -- 
> 2.17.1
> 




* Re: [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring
  2019-06-11  7:36   ` Fam Zheng
@ 2019-06-11  9:32     ` Stefan Hajnoczi
  2019-07-03  6:26       ` Markus Armbruster
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:32 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Paolo Bonzini, Maxim Levitsky, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster


On Tue, Jun 11, 2019 at 03:36:53PM +0800, Fam Zheng wrote:
> On Mon, 06/10 19:18, Aarushi Mehta wrote:
> > Option only enumerates for hosts that support it.
> > 
> > Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  qapi/block-core.json | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/qapi/block-core.json b/qapi/block-core.json
> > index 1defcde048..db7eedd058 100644
> > --- a/qapi/block-core.json
> > +++ b/qapi/block-core.json
> > @@ -2792,11 +2792,13 @@
> >  #
> >  # @threads:     Use qemu's thread pool
> >  # @native:      Use native AIO backend (only Linux and Windows)
> > +# @io_uring:    Use linux io_uring (since 4.1)
> >  #
> >  # Since: 2.9
> >  ##
> >  { 'enum': 'BlockdevAioOptions',
> > -  'data': [ 'threads', 'native' ] }
> > +  'data': [ 'threads', 'native',
> > +            { 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } ] }
> 
> Question: 'native' has a dependency on libaio but it doesn't have the
> condition.  Is the inconsistency intended?

'native' could be conditional too but I guess it's a historical thing.
Either QAPI 'if' didn't exist when BlockdevAioOptions was defined or we
simply forgot to use it :).

It doesn't need to be changed in this patch series.

Stefan



* Re: [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring Aarushi Mehta
@ 2019-06-11  9:47   ` Stefan Hajnoczi
  2019-06-11 11:34   ` Fam Zheng
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:47 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster


On Mon, Jun 10, 2019 at 07:19:02PM +0530, Aarushi Mehta wrote:
> @@ -294,6 +302,7 @@ LuringState *luring_init(Error **errp)
>      int rc;
>      LuringState *s;
>      s = g_malloc0(sizeof(*s));
> +    trace_luring_init_state((void *)s, sizeof(*s));

In C conversion to void * is automatic and doesn't need to be done
manually.

> diff --git a/block/trace-events b/block/trace-events
> index eab51497fc..c4564dcd96 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -60,6 +60,14 @@ qmp_block_stream(void *bs, void *job) "bs %p job %p"
>  file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
>  file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
>  
> +#io_uring.c
> +luring_init_state(void *s, size_t size) "s %p size %zu"
> +luring_cleanup_state(void) "s freed"
> +disable luring_io_plug(void) "plug"
> +disable luring_io_unplug(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit_done(int ret) "submitted to kernel %d"

Why are these disabled?  "disable" compiles them out and they won't be
available at runtime.  "disable" should probably be dropped here.

Please include the LuringState *s pointer in trace events since there
can be multiple LuringStates at any given time and it should be possible
to correlate trace events.
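
For example, the unplug call site could then look something like this
(just a sketch, assuming the trace event definitions gain a void *s as
their first argument and lose the "disable" keyword):

    trace_luring_io_unplug(s, s->io_q.blocked, s->io_q.plugged,
                           s->io_q.in_queue, s->io_q.in_flight);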

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling Aarushi Mehta
@ 2019-06-11  9:51   ` Stefan Hajnoczi
  2019-06-17 14:14     ` Maxim Levitsky
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:51 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, Jun 10, 2019 at 07:19:03PM +0530, Aarushi Mehta wrote:
> +static bool qemu_luring_poll_cb(void *opaque)
> +{
> +    LuringState *s = opaque;
> +    struct io_uring_cqe *cqes;
> +
> +    if (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> +        if (!cqes) {
> +            qemu_luring_process_completions_and_submit(s);
> +            return true;
> +        }

Is this logic inverted?  We have a completion when cqes != NULL.
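
I would have expected something along these lines instead (untested
sketch, assuming io_uring_peek_cqe() leaves cqes NULL when the ring is
empty):

    if (io_uring_peek_cqe(&s->ring, &cqes) == 0 && cqes) {
        /* At least one completion is pending, process it */
        qemu_luring_process_completions_and_submit(s);
        return true;
    }
    return false;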

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring Aarushi Mehta
@ 2019-06-11  9:54   ` Stefan Hajnoczi
  2019-06-11 12:26   ` Fam Zheng
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:54 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, Jun 10, 2019 at 07:19:04PM +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  qemu-io.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/qemu-io.c b/qemu-io.c
> index 8d5d5911cb..54b82151c4 100644
> --- a/qemu-io.c
> +++ b/qemu-io.c
> @@ -129,6 +129,7 @@ static void open_help(void)
>  " -n, -- disable host cache, short for -t none\n"
>  " -U, -- force shared permissions\n"
>  " -k, -- use kernel AIO implementation (on Linux only)\n"
> +" -i  -- use kernel io_uring (Linux 5.1+)\n"
>  " -t, -- use the given cache mode for the image\n"
>  " -d, -- use the given discard mode for the image\n"
>  " -o, -- options to be given to the block driver"
> @@ -188,6 +189,11 @@ static int open_f(BlockBackend *blk, int argc, char **argv)
>          case 'k':
>              flags |= BDRV_O_NATIVE_AIO;
>              break;
> +#ifdef CONFIG_LINUX_IO_URING
> +        case 'i':
> +            flags |= BDRV_O_IO_URING;
> +            break;
> +#endif
>          case 't':
>              if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
>                  error_report("Invalid cache option: %s", optarg);
> @@ -290,6 +296,7 @@ static void usage(const char *name)
>  "  -C, --copy-on-read   enable copy-on-read\n"
>  "  -m, --misalign       misalign allocations for O_DIRECT\n"
>  "  -k, --native-aio     use kernel AIO implementation (on Linux only)\n"
> +"  -i  --io_uring       use kernel io_uring (Linux 5.1+)\n"
>  "  -t, --cache=MODE     use the given cache mode for the image\n"
>  "  -d, --discard=MODE   use the given discard mode for the image\n"
>  "  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
> @@ -499,6 +506,7 @@ int main(int argc, char **argv)
>          { "copy-on-read", no_argument, NULL, 'C' },
>          { "misalign", no_argument, NULL, 'm' },
>          { "native-aio", no_argument, NULL, 'k' },
> +        { "io_uring", no_argument, NULL, 'i' },
>          { "discard", required_argument, NULL, 'd' },
>          { "cache", required_argument, NULL, 't' },
>          { "trace", required_argument, NULL, 'T' },
> @@ -566,6 +574,11 @@ int main(int argc, char **argv)
>          case 'k':
>              flags |= BDRV_O_NATIVE_AIO;
>              break;
> +#ifdef CONFIG_LINUX_IO_URING
> +        case 'i':
> +            flags |= BDRV_O_IO_URING;
> +            break;
> +#endif

An --aio=threads|native|io_uring option would be more general than
adding --io_uring.  That way new AIO engines do not require their own
command-line options.

Can you implement something like the -drive aio= parameter so that a
single option can specify threads, native, or io_uring?
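
Something along these lines, perhaps (untested sketch; the short option
letter and the error message are only placeholders):

    case 'A': /* hypothetical --aio=MODE option */
        if (!strcmp(optarg, "native")) {
            flags |= BDRV_O_NATIVE_AIO;
#ifdef CONFIG_LINUX_IO_URING
        } else if (!strcmp(optarg, "io_uring")) {
            flags |= BDRV_O_IO_URING;
#endif
        } else if (strcmp(optarg, "threads") != 0) {
            error_report("Invalid aio option: %s", optarg);
            exit(1);
        }
        break;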

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 00/12] Add support for io_uring
  2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
                   ` (11 preceding siblings ...)
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks " Aarushi Mehta
@ 2019-06-11  9:56 ` Stefan Hajnoczi
  2019-06-22 15:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  12 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:56 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, Jun 10, 2019 at 07:18:53PM +0530, Aarushi Mehta wrote:
> This patch series adds support for the newly developed io_uring Linux AIO
> interface. Linux io_uring is faster than Linux's AIO asynchronous I/O code,
> offers efficient buffered asynchronous I/O support, the ability to do I/O
> without performing a system call via polled I/O, and other efficiency enhancements.
> 
> Testing it requires a host kernel (5.1+) and the liburing library.
> Use the option -drive aio=io_uring to enable it.
> 
> v5:
> - Adds completion polling
> - Extends qemu-io
> - Adds qemu-iotest

Flush is not hooked up.  Please use the io_uring IORING_OP_FSYNC support that
you've already written and connect it to file-posix.c.

When doing this watch out for the qiov->size check during completion
processing.  Flush doesn't have a qiov so it may be NULL.
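
I.e. the completion code would need a guard roughly like this (sketch
only):

    if (!luringcb->qiov) {
        /* Flush carries no data, any non-negative res means success */
        if (ret > 0) {
            ret = 0;
        }
    } else if (ret == luringcb->qiov->size) {
        ret = 0;
    } else if (ret >= 0) {
        /* existing short read/write handling */
    }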

Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks " Aarushi Mehta
@ 2019-06-11  9:58   ` Stefan Hajnoczi
  2019-06-17 14:21   ` Maxim Levitsky
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-11  9:58 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, Jun 10, 2019 at 07:19:05PM +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  tests/qemu-iotests/087     | 26 ++++++++++++++++++++++++++
>  tests/qemu-iotests/087.out | 10 ++++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
> index d6c8613419..0cc7283ad8 100755
> --- a/tests/qemu-iotests/087
> +++ b/tests/qemu-iotests/087
> @@ -124,6 +124,32 @@ run_qemu_filter_aio <<EOF
>  { "execute": "quit" }
>  EOF
>  
> +echo
> +echo === aio=io_uring without O_DIRECT ===
> +echo
> +
> +# Skip this test if io_uring is not enabled in this build

Is this comment a todo?  I see nothing that skips the test.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces " Aarushi Mehta
@ 2019-06-11 11:17   ` Fam Zheng
  2019-06-12 14:43     ` Stefan Hajnoczi
  2019-06-17 12:26   ` Maxim Levitsky
  1 sibling, 1 reply; 40+ messages in thread
From: Fam Zheng @ 2019-06-11 11:17 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Stefan Hajnoczi, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, 06/10 19:18, Aarushi Mehta wrote:
> Aborts when sqe fails to be set as sqes cannot be returned to the ring.
> 
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  MAINTAINERS             |   7 +
>  block/Makefile.objs     |   3 +
>  block/io_uring.c        | 314 ++++++++++++++++++++++++++++++++++++++++
>  include/block/aio.h     |  16 +-
>  include/block/raw-aio.h |  12 ++
>  5 files changed, 351 insertions(+), 1 deletion(-)
>  create mode 100644 block/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7be1225415..49f896796e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2516,6 +2516,13 @@ F: block/file-posix.c
>  F: block/file-win32.c
>  F: block/win32-aio.c
>  
> +Linux io_uring
> +M: Aarushi Mehta <mehta.aaru20@gmail.com>
> +R: Stefan Hajnoczi <stefan@redhat.com>
> +L: qemu-block@nongnu.org
> +S: Maintained
> +F: block/io_uring.c
> +
>  qcow2
>  M: Kevin Wolf <kwolf@redhat.com>
>  M: Max Reitz <mreitz@redhat.com>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..8fde7a23a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += file-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  block-obj-y += null.o mirror.o commit.o io.o create.o
>  block-obj-y += throttle-groups.o
>  block-obj-$(CONFIG_LINUX) += nvme.o
> @@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
>  dmg-lzfse.o-libs   := $(LZFSE_LIBS)
>  qcow.o-libs        := -lz
>  linux-aio.o-libs   := -laio
> +io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
> +io_uring.o-libs    := $(LINUX_IO_URING_LIBS)
>  parallels.o-cflags := $(LIBXML2_CFLAGS)
>  parallels.o-libs   := $(LIBXML2_LIBS)
> diff --git a/block/io_uring.c b/block/io_uring.c
> new file mode 100644
> index 0000000000..f327c7ef96
> --- /dev/null
> +++ b/block/io_uring.c
> @@ -0,0 +1,314 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2019 Aarushi Mehta
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <liburing.h>
> +#include "qemu-common.h"
> +#include "block/aio.h"
> +#include "qemu/queue.h"
> +#include "block/block.h"
> +#include "block/raw-aio.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/error.h"
> +
> +#define MAX_EVENTS 128
> +
> +typedef struct LuringAIOCB {

I have to say it is a good name.

> +    Coroutine *co;
> +    struct io_uring_sqe sqeq;
> +    ssize_t ret;
> +    QEMUIOVector *qiov;
> +    bool is_read;
> +    QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +} LuringAIOCB;
> +
> +typedef struct LuringQueue {
> +    int plugged;
> +    unsigned int in_queue;
> +    unsigned int in_flight;
> +    bool blocked;
> +    QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
> +} LuringQueue;
> +
> +typedef struct LuringState {
> +    AioContext *aio_context;
> +
> +    struct io_uring ring;
> +
> +    /* io queue for submit at batch.  Protected by AioContext lock. */
> +    LuringQueue io_q;
> +
> +    /* I/O completion processing.  Only runs in I/O thread.  */
> +    QEMUBH *completion_bh;
> +} LuringState;
> +
> +/**
> + * ioq_submit:
> + * @s: AIO state
> + *
> + * Queues pending sqes and submits them
> + *
> + */
> +static int ioq_submit(LuringState *s);
> +
> +/**
> + * qemu_luring_process_completions:
> + * @s: AIO state
> + *
> + * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
> + *
> + */
> +static void qemu_luring_process_completions(LuringState *s)
> +{
> +    struct io_uring_cqe *cqes;
> +    int ret;
> +
> +    /*
> +     * Request completion callbacks can run the nested event loop.
> +     * Schedule ourselves so the nested event loop will "see" remaining
> +     * completed requests and process them.  Without this, completion
> +     * callbacks that wait for other requests using a nested event loop
> +     * would hang forever.
> +     */
> +    qemu_bh_schedule(s->completion_bh);
> +
> +    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> +        if (!cqes) {
> +            break;
> +        }
> +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> +        ret = cqes->res;

Declarations should be at the beginning of the code block.

> +
> +        if (ret == luringcb->qiov->size) {
> +            ret = 0;
> +        } else if (ret >= 0) {
> +            /* Short Read/Write */
> +            if (luringcb->is_read) {
> +                /* Read, pad with zeroes */
> +                qemu_iovec_memset(luringcb->qiov, ret, 0,
> +                luringcb->qiov->size - ret);

Shouldn't you check that (ret < luringcb->qiov->size), since ret comes from an external source?

Either way, ret should be assigned 0, I think.
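
Something like this, perhaps (sketch only):

    } else if (ret >= 0) {
        if (luringcb->is_read && ret < luringcb->qiov->size) {
            /* Short read: pad the tail with zeroes and report success */
            qemu_iovec_memset(luringcb->qiov, ret, 0,
                              luringcb->qiov->size - ret);
            ret = 0;
        } else if (!luringcb->is_read) {
            /* Short write */
            ret = -ENOSPC;
        }
    }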

> +            } else {
> +                ret = -ENOSPC;;

s/;;/;/

> +            }
> +        }
> +        luringcb->ret = ret;
> +
> +        io_uring_cqe_seen(&s->ring, cqes);
> +        cqes = NULL;
> +        /* Change counters one-by-one because we can be nested. */
> +        s->io_q.in_flight--;
> +
> +        /*
> +         * If the coroutine is already entered it must be in ioq_submit()
> +         * and will notice luringcb->ret has been filled in when it
> +         * eventually runs later. Coroutines cannot be entered recursively
> +         * so avoid doing that!
> +         */
> +        if (!qemu_coroutine_entered(luringcb->co)) {
> +            aio_co_wake(luringcb->co);
> +        }
> +    }
> +    qemu_bh_cancel(s->completion_bh);
> +}
> +
> +static void qemu_luring_process_completions_and_submit(LuringState *s)
> +{
> +    aio_context_acquire(s->aio_context);
> +    qemu_luring_process_completions(s);
> +
> +    if (!s->io_q.plugged && s->io_q.in_queue > 0) {
> +        ioq_submit(s);
> +    }
> +    aio_context_release(s->aio_context);
> +}
> +
> +static void qemu_luring_completion_bh(void *opaque)
> +{
> +    LuringState *s = opaque;
> +    qemu_luring_process_completions_and_submit(s);
> +}
> +
> +static void qemu_luring_completion_cb(void *opaque)
> +{
> +    LuringState *s = opaque;
> +    qemu_luring_process_completions_and_submit(s);
> +}
> +
> +static void ioq_init(LuringQueue *io_q)
> +{
> +    QSIMPLEQ_INIT(&io_q->sq_overflow);
> +    io_q->plugged = 0;
> +    io_q->in_queue = 0;
> +    io_q->in_flight = 0;
> +    io_q->blocked = false;
> +}
> +
> +static int ioq_submit(LuringState *s)
> +{
> +    int ret = 0;
> +    LuringAIOCB *luringcb, *luringcb_next;
> +
> +    while (s->io_q.in_queue > 0) {
> +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> +                              luringcb_next) {
> +            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> +            if (!sqes) {
> +                break;
> +            }
> +            /* Prep sqe for submission */
> +            *sqes = luringcb->sqeq;
> +            QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
> +        }
> +        ret =  io_uring_submit(&s->ring);

s/  / /

> +        /* Prevent infinite loop if submission is refused */
> +        if (ret <= 0) {
> +            if (ret == -EAGAIN) {
> +                continue;
> +            }
> +            break;
> +        }
> +        s->io_q.in_flight += ret;
> +        s->io_q.in_queue  -= ret;
> +    }
> +    s->io_q.blocked = (s->io_q.in_queue > 0);

I'm confused about s->io_q.blocked. ioq_submit is where it gets updated, but
if it becomes true, calling ioq_submit will be fenced. So how does it get
cleared?

> +
> +    if (s->io_q.in_flight) {
> +        /*
> +         * We can try to complete something just right away if there are
> +         * still requests in-flight.
> +         */
> +        qemu_luring_process_completions(s);
> +    }
> +    return ret;
> +}
> +
> +void luring_io_plug(BlockDriverState *bs, LuringState *s)
> +{
> +    s->io_q.plugged++;
> +}
> +
> +void luring_io_unplug(BlockDriverState *bs, LuringState *s)
> +{
> +    assert(s->io_q.plugged);
> +    if (--s->io_q.plugged == 0 &&
> +        !s->io_q.blocked && s->io_q.in_queue > 0) {
> +        ioq_submit(s);
> +    }
> +}
> +
> +/**
> + * luring_do_submit:
> + * @fd: file descriptor for I/O
> + * @luringcb: AIO control block
> + * @s: AIO state
> + * @offset: offset for request
> + * @type: type of request
> + *
> + * Fetches sqes from ring, adds to pending queue and preps them
> + *
> + */
> +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> +                            uint64_t offset, int type)
> +{
> +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> +    if (!sqes) {
> +        sqes = &luringcb->sqeq;
> +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> +    }
> +
> +    switch (type) {
> +    case QEMU_AIO_WRITE:
> +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> +                             luringcb->qiov->niov, offset);
> +        break;
> +    case QEMU_AIO_READ:
> +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> +                            luringcb->qiov->niov, offset);
> +        break;
> +    case QEMU_AIO_FLUSH:
> +        io_uring_prep_fsync(sqes, fd, 0);
> +        break;
> +    default:
> +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> +                        __func__, type);
> +        abort();
> +    }
> +    io_uring_sqe_set_data(sqes, luringcb);
> +    s->io_q.in_queue++;
> +
> +    if (!s->io_q.blocked &&
> +        (!s->io_q.plugged ||
> +         s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> +        return ioq_submit(s);
> +    }
> +    return 0;
> +}
> +
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
> +                                uint64_t offset, QEMUIOVector *qiov, int type)
> +{
> +    int ret;
> +    LuringAIOCB luringcb = {
> +        .co         = qemu_coroutine_self(),
> +        .ret        = -EINPROGRESS,
> +        .qiov       = qiov,
> +        .is_read    = (type == QEMU_AIO_READ),
> +    };
> +
> +    ret = luring_do_submit(fd, &luringcb, s, offset, type);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (luringcb.ret == -EINPROGRESS) {
> +        qemu_coroutine_yield();
> +    }
> +    return luringcb.ret;
> +}
> +
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> +{
> +    aio_set_fd_handler(old_context, s->ring.ring_fd, false, NULL, NULL, NULL,
> +                       s);
> +    qemu_bh_delete(s->completion_bh);
> +    s->aio_context = NULL;
> +}
> +
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> +{
> +    s->aio_context = new_context;
> +    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
> +    aio_set_fd_handler(s->aio_context, s->ring.ring_fd, false,
> +                       qemu_luring_completion_cb, NULL, NULL, s);
> +}
> +
> +LuringState *luring_init(Error **errp)
> +{
> +    int rc;
> +    LuringState *s;
> +    s = g_malloc0(sizeof(*s));

You can use g_new0() to be more concise.

> +    struct io_uring *ring = &s->ring;
> +    rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);

s/  / /

> +    if (rc < 0) {
> +        error_setg_errno(errp, errno, "failed to init linux io_uring ring");
> +        g_free(s);
> +        return NULL;
> +    }
> +
> +    ioq_init(&s->io_q);
> +    return s;
> +
> +}
> +
> +void luring_cleanup(LuringState *s)
> +{
> +    io_uring_queue_exit(&s->ring);
> +    g_free(s);
> +}
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..9da3fd9793 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -50,6 +50,7 @@ typedef void IOHandler(void *opaque);
>  struct Coroutine;
>  struct ThreadPool;
>  struct LinuxAioState;
> +struct LuringState;
>  
>  struct AioContext {
>      GSource source;
> @@ -118,11 +119,19 @@ struct AioContext {
>      struct ThreadPool *thread_pool;
>  
>  #ifdef CONFIG_LINUX_AIO
> -    /* State for native Linux AIO.  Uses aio_context_acquire/release for
> +    /*
> +     * State for native Linux AIO.  Uses aio_context_acquire/release for
>       * locking.
>       */
>      struct LinuxAioState *linux_aio;
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +    /*
> +     * State for Linux io_uring.  Uses aio_context_acquire/release for
> +     * locking.
> +     */
> +    struct LuringState *linux_io_uring;
> +#endif
>  
>      /* TimerLists for calling timers - one per clock type.  Has its own
>       * locking.
> @@ -387,6 +396,11 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
>  /* Return the LinuxAioState bound to this AioContext */
>  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
>  
> +/* Setup the LuringState bound to this AioContext */
> +struct LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> +
> +/* Return the LuringState bound to this AioContext */
> +struct LuringState *aio_get_linux_io_uring(AioContext *ctx);
>  /**
>   * aio_timer_new_with_attrs:
>   * @ctx: the aio context
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 0cb7cc74a2..71d7d1395f 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -55,6 +55,18 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
>  void laio_io_plug(BlockDriverState *bs, LinuxAioState *s);
>  void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s);
>  #endif
> +/* io_uring.c - Linux io_uring implementation */
> +#ifdef CONFIG_LINUX_IO_URING
> +typedef struct LuringState LuringState;
> +LuringState *luring_init(Error **errp);
> +void luring_cleanup(LuringState *s);
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
> +                                uint64_t offset, QEMUIOVector *qiov, int type);
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> +void luring_io_plug(BlockDriverState *bs, LuringState *s);
> +void luring_io_unplug(BlockDriverState *bs, LuringState *s);
> +#endif
>  
>  #ifdef _WIN32
>  typedef struct QEMUWin32AIOState QEMUWin32AIOState;
> -- 
> 2.17.1
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring Aarushi Mehta
  2019-06-11  9:47   ` Stefan Hajnoczi
@ 2019-06-11 11:34   ` Fam Zheng
  1 sibling, 0 replies; 40+ messages in thread
From: Fam Zheng @ 2019-06-11 11:34 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Stefan Hajnoczi, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, 06/10 19:19, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  block/io_uring.c   | 14 ++++++++++++--
>  block/trace-events |  8 ++++++++
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/block/io_uring.c b/block/io_uring.c
> index f327c7ef96..47e027364a 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -17,6 +17,7 @@
>  #include "block/raw-aio.h"
>  #include "qemu/coroutine.h"
>  #include "qapi/error.h"
> +#include "trace.h"
>  
>  #define MAX_EVENTS 128
>  
> @@ -191,12 +192,15 @@ static int ioq_submit(LuringState *s)
>  
>  void luring_io_plug(BlockDriverState *bs, LuringState *s)
>  {
> +    trace_luring_io_plug();
>      s->io_q.plugged++;
>  }
>  
>  void luring_io_unplug(BlockDriverState *bs, LuringState *s)
>  {
>      assert(s->io_q.plugged);
> +    trace_luring_io_unplug(s->io_q.blocked, s->io_q.plugged,
> +                                 s->io_q.in_queue, s->io_q.in_flight);
>      if (--s->io_q.plugged == 0 &&
>          !s->io_q.blocked && s->io_q.in_queue > 0) {
>          ioq_submit(s);
> @@ -217,6 +221,7 @@ void luring_io_unplug(BlockDriverState *bs, LuringState *s)
>  static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
>                              uint64_t offset, int type)
>  {
> +    int ret;
>      struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
>      if (!sqes) {
>          sqes = &luringcb->sqeq;
> @@ -242,11 +247,14 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
>      }
>      io_uring_sqe_set_data(sqes, luringcb);
>      s->io_q.in_queue++;
> -
> +    trace_luring_do_submit(s->io_q.blocked, s->io_q.plugged,
> +                           s->io_q.in_queue, s->io_q.in_flight);
>      if (!s->io_q.blocked &&
>          (!s->io_q.plugged ||
>           s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> -        return ioq_submit(s);
> +        ret = ioq_submit(s);
> +        trace_luring_do_submit_done(ret);
> +        return ret;
>      }
>      return 0;
>  }
> @@ -294,6 +302,7 @@ LuringState *luring_init(Error **errp)
>      int rc;
>      LuringState *s;
>      s = g_malloc0(sizeof(*s));
> +    trace_luring_init_state((void *)s, sizeof(*s));

The pointer type cast is unnecessary.

>      struct io_uring *ring = &s->ring;
>      rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
>      if (rc < 0) {
> @@ -311,4 +320,5 @@ void luring_cleanup(LuringState *s)
>  {
>      io_uring_queue_exit(&s->ring);
>      g_free(s);
> +    trace_luring_cleanup_state();

Pass @s?

>  }
> diff --git a/block/trace-events b/block/trace-events
> index eab51497fc..c4564dcd96 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -60,6 +60,14 @@ qmp_block_stream(void *bs, void *job) "bs %p job %p"
>  file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
>  file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
>  
> +#io_uring.c
> +luring_init_state(void *s, size_t size) "s %p size %zu"
> +luring_cleanup_state(void) "s freed"
> +disable luring_io_plug(void) "plug"
> +disable luring_io_unplug(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit(int blocked, int plugged, int queued, int inflight) "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit_done(int ret) "submitted to kernel %d"
> +
>  # qcow2.c
>  qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
>  qcow2_writev_done_req(void *co, int ret) "co %p ret %d"
> -- 
> 2.17.1
> 

Fam



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring Aarushi Mehta
  2019-06-11  9:54   ` Stefan Hajnoczi
@ 2019-06-11 12:26   ` Fam Zheng
  1 sibling, 0 replies; 40+ messages in thread
From: Fam Zheng @ 2019-06-11 12:26 UTC (permalink / raw)
  To: Aarushi Mehta
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Stefan Hajnoczi, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Markus Armbruster

On Mon, 06/10 19:19, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  qemu-io.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/qemu-io.c b/qemu-io.c
> index 8d5d5911cb..54b82151c4 100644
> --- a/qemu-io.c
> +++ b/qemu-io.c
> @@ -129,6 +129,7 @@ static void open_help(void)
>  " -n, -- disable host cache, short for -t none\n"
>  " -U, -- force shared permissions\n"
>  " -k, -- use kernel AIO implementation (on Linux only)\n"
> +" -i  -- use kernel io_uring (Linux 5.1+)\n"
>  " -t, -- use the given cache mode for the image\n"
>  " -d, -- use the given discard mode for the image\n"
>  " -o, -- options to be given to the block driver"
> @@ -188,6 +189,11 @@ static int open_f(BlockBackend *blk, int argc, char **argv)
>          case 'k':
>              flags |= BDRV_O_NATIVE_AIO;
>              break;
> +#ifdef CONFIG_LINUX_IO_URING
> +        case 'i':

Maybe printing an error message saying the feature is not compiled in is
slightly better than just saying the argument is unknown?
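
E.g. something like this (sketch only, the exact error/cleanup path in
open_f() would need checking):

    case 'i':
#ifdef CONFIG_LINUX_IO_URING
        flags |= BDRV_O_IO_URING;
        break;
#else
        error_report("io_uring support is not compiled into this binary");
        return -1;
#endif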

Fam

> +            flags |= BDRV_O_IO_URING;
> +            break;
> +#endif
>          case 't':
>              if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
>                  error_report("Invalid cache option: %s", optarg);
> @@ -290,6 +296,7 @@ static void usage(const char *name)
>  "  -C, --copy-on-read   enable copy-on-read\n"
>  "  -m, --misalign       misalign allocations for O_DIRECT\n"
>  "  -k, --native-aio     use kernel AIO implementation (on Linux only)\n"
> +"  -i  --io_uring       use kernel io_uring (Linux 5.1+)\n"
>  "  -t, --cache=MODE     use the given cache mode for the image\n"
>  "  -d, --discard=MODE   use the given discard mode for the image\n"
>  "  -T, --trace [[enable=]<pattern>][,events=<file>][,file=<file>]\n"
> @@ -499,6 +506,7 @@ int main(int argc, char **argv)
>          { "copy-on-read", no_argument, NULL, 'C' },
>          { "misalign", no_argument, NULL, 'm' },
>          { "native-aio", no_argument, NULL, 'k' },
> +        { "io_uring", no_argument, NULL, 'i' },
>          { "discard", required_argument, NULL, 'd' },
>          { "cache", required_argument, NULL, 't' },
>          { "trace", required_argument, NULL, 'T' },
> @@ -566,6 +574,11 @@ int main(int argc, char **argv)
>          case 'k':
>              flags |= BDRV_O_NATIVE_AIO;
>              break;
> +#ifdef CONFIG_LINUX_IO_URING
> +        case 'i':
> +            flags |= BDRV_O_IO_URING;
> +            break;
> +#endif
>          case 't':
>              if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
>                  error_report("Invalid cache option: %s", optarg);
> -- 
> 2.17.1
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-11 11:17   ` Fam Zheng
@ 2019-06-12 14:43     ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-12 14:43 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel, Max Reitz,
	saket.sinha89, Paolo Bonzini, Maxim Levitsky, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Tue, Jun 11, 2019 at 07:17:14PM +0800, Fam Zheng wrote:
> On Mon, 06/10 19:18, Aarushi Mehta wrote:
> > +        /* Prevent infinite loop if submission is refused */
> > +        if (ret <= 0) {
> > +            if (ret == -EAGAIN) {
> > +                continue;
> > +            }
> > +            break;
> > +        }
> > +        s->io_q.in_flight += ret;
> > +        s->io_q.in_queue  -= ret;
> > +    }
> > +    s->io_q.blocked = (s->io_q.in_queue > 0);
> 
> I'm confused about s->io_q.blocked. ioq_submit is where it gets updated, but
> if it becomes true, calling ioq_submit will be fenced. So how does it get
> cleared?

When blocked, additional I/O requests are not submitted until the next
completion.  See qemu_luring_process_completions_and_submit() for the
code path where ioq_submit() gets called again.

Stefan

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces " Aarushi Mehta
  2019-06-11 11:17   ` Fam Zheng
@ 2019-06-17 12:26   ` Maxim Levitsky
  2019-06-19 10:14     ` Stefan Hajnoczi
  2019-06-22 15:10     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 2 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 12:26 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Aborts when sqe fails to be set as sqes cannot be returned to the ring.
> 
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  MAINTAINERS             |   7 +
>  block/Makefile.objs     |   3 +
>  block/io_uring.c        | 314 ++++++++++++++++++++++++++++++++++++++++
>  include/block/aio.h     |  16 +-
>  include/block/raw-aio.h |  12 ++
>  5 files changed, 351 insertions(+), 1 deletion(-)
>  create mode 100644 block/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7be1225415..49f896796e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2516,6 +2516,13 @@ F: block/file-posix.c
>  F: block/file-win32.c
>  F: block/win32-aio.c
>  
> +Linux io_uring
> +M: Aarushi Mehta <mehta.aaru20@gmail.com>
> +R: Stefan Hajnoczi <stefan@redhat.com>
> +L: qemu-block@nongnu.org
> +S: Maintained
> +F: block/io_uring.c
> +
>  qcow2
>  M: Kevin Wolf <kwolf@redhat.com>
>  M: Max Reitz <mreitz@redhat.com>
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..8fde7a23a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += file-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  block-obj-y += null.o mirror.o commit.o io.o create.o
>  block-obj-y += throttle-groups.o
>  block-obj-$(CONFIG_LINUX) += nvme.o
> @@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
>  dmg-lzfse.o-libs   := $(LZFSE_LIBS)
>  qcow.o-libs        := -lz
>  linux-aio.o-libs   := -laio
> +io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
> +io_uring.o-libs    := $(LINUX_IO_URING_LIBS)
>  parallels.o-cflags := $(LIBXML2_CFLAGS)
>  parallels.o-libs   := $(LIBXML2_LIBS)
> diff --git a/block/io_uring.c b/block/io_uring.c
> new file mode 100644
> index 0000000000..f327c7ef96
> --- /dev/null
> +++ b/block/io_uring.c
> @@ -0,0 +1,314 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2019 Aarushi Mehta
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <liburing.h>
> +#include "qemu-common.h"
> +#include "block/aio.h"
> +#include "qemu/queue.h"
> +#include "block/block.h"
> +#include "block/raw-aio.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/error.h"
> +
> +#define MAX_EVENTS 128
> +
> +typedef struct LuringAIOCB {
> +    Coroutine *co;
> +    struct io_uring_sqe sqeq;
> +    ssize_t ret;
> +    QEMUIOVector *qiov;
> +    bool is_read;
> +    QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +} LuringAIOCB;
> +
> +typedef struct LuringQueue {
> +    int plugged;
> +    unsigned int in_queue;
> +    unsigned int in_flight;
> +    bool blocked;
> +    QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
> +} LuringQueue;
> +
> +typedef struct LuringState {
> +    AioContext *aio_context;
> +
> +    struct io_uring ring;
> +
> +    /* io queue for submit at batch.  Protected by AioContext lock. */
> +    LuringQueue io_q;
> +
> +    /* I/O completion processing.  Only runs in I/O thread.  */
> +    QEMUBH *completion_bh;
> +} LuringState;
> +
> +/**
> + * ioq_submit:
> + * @s: AIO state
> + *
> + * Queues pending sqes and submits them
> + *
> + */
> +static int ioq_submit(LuringState *s);
> +
> +/**
> + * qemu_luring_process_completions:
> + * @s: AIO state
> + *
> + * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
> + *
> + */
> +static void qemu_luring_process_completions(LuringState *s)
> +{
> +    struct io_uring_cqe *cqes;
> +    int ret;
> +
> +    /*
> +     * Request completion callbacks can run the nested event loop.
> +     * Schedule ourselves so the nested event loop will "see" remaining
> +     * completed requests and process them.  Without this, completion
> +     * callbacks that wait for other requests using a nested event loop
> +     * would hang forever.
> +     */

About that qemu_bh_schedule:
The code is copied from linux-aio.c, where it was added with the commit below.

Author: Stefan Hajnoczi <stefanha@redhat.com>
Date:   Mon Aug 4 16:56:33 2014 +0100

    linux-aio: avoid deadlock in nested aio_poll() calls
    
    If two Linux AIO request completions are fetched in the same
    io_getevents() call, QEMU will deadlock if request A's callback waits
    for request B to complete using an aio_poll() loop.  This was reported
    to happen with the mirror blockjob.
    
    This patch moves completion processing into a BH and makes it resumable.
    Nested event loops can resume completion processing so that request B
    will complete and the deadlock will not occur.
    
    Cc: Kevin Wolf <kwolf@redhat.com>
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Ming Lei <ming.lei@canonical.com>
    Cc: Marcin Gibuła <m.gibula@beyond.pl>
    Reported-by: Marcin Gibuła <m.gibula@beyond.pl>
    Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
    Tested-by: Marcin Gibuła <m.gibula@beyond.pl>


I kind of opened a Pandora's box by researching that area, suspecting that the
same treatment is needed in other block drivers, but in the end this is correct
behaviour, and this is why:

The reason the bottom half workaround is needed in linux-aio is that aio uses an
eventfd which notifies it of completions only once; if the co-routine which
handles the response calls aio_poll, the same fd won't be returned again, at
least not unless more events are received, which is not guaranteed.


Here in io_uring, I think the same would happen.  Looking at the kernel source I
see that the poll implementation uses 'poll_wait', which is basically a wait
queue that is woken up when new completion events are added to the uring; thus
attempting to poll again on the same uring fd will indeed block, even if there
are events that have not yet been processed.

All other leaf block drivers (drivers that access the data, rather than
forwarding requests to another block driver) are networking based, so they poll
the communication socket.  When the same situation occurs, the nested aio_poll
will notice that the socket still has data, run the corresponding co-routine,
and so avoid the deadlock.

I think the two comments above should be added to the source in some form to
document this, so that the next person doesn't need to spend time figuring it out.


BTW, the nvme userspace driver also solves this issue by not entering the
co-routine directly from the aio fd handler, but doing so from a bottom half
which the handler schedules.  This works because the nested aio_poll will run
the bottom halves again, but I suspect this adds overhead that could be avoided.
I'll look at that later.


> +    qemu_bh_schedule(s->completion_bh);
> +
> +    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {


Maybe consider using io_uring_for_each_cqe and then io_uring_cq_advance
to avoid acking one ring entry at a time?
However, this would probably break the nesting.

> +        if (!cqes) {
> +            break;
> +        }
> +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> +        ret = cqes->res;
> +
> +        if (ret == luringcb->qiov->size) {
> +            ret = 0;
> +        } else if (ret >= 0) {


You should very carefully check the allowed return values here.

It looks like you can get '-EINTR' here, which would ask you to rerun the read
operation; otherwise you will get the number of bytes read, which might be less
than what was asked for.  That implies you need to retry the read operation with
the remainder of the buffer rather than zero the end of the buffer, IMHO.

(0 is returned on EOF according to 'read' semantics, which I think are used
here, so a short read might not be an EOF.)


Looking at linux-aio.c, though, I see that it just passes the returned value
through with no special treatment, including no check for -EINTR.

I assume that since aio is Linux specific and only supports direct I/O, it
happens to assume no short reads/-EINTR (but since libaio has very sparse
documentation I can't verify this).

On the other hand, the aio=threads implementation actually does everything as
specified in the 'write' manpage, retrying reads on -EINTR and doing additional
reads if fewer than the required number of bytes were read.

Looking at the io_uring implementation in the kernel I see that it does support
synchronous (non-O_DIRECT) mode, and in this case it goes through the same
->read_iter, which is pretty much the same path a regular read() takes, so it
might return short reads and/or -EINTR.
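
For illustration only, this is roughly the retry behaviour I have in
mind, written for a plain pread() on a single buffer (needs <unistd.h>
and <errno.h>; not a proposal for this patch, just to spell out the
'read' semantics):

    static ssize_t read_full(int fd, void *buf, size_t len, off_t offset)
    {
        size_t done = 0;

        while (done < len) {
            ssize_t ret = pread(fd, (char *)buf + done, len - done,
                                offset + done);
            if (ret < 0 && errno == EINTR) {
                continue;                   /* interrupted, just retry */
            }
            if (ret <= 0) {
                return done ? done : ret;   /* real error or EOF */
            }
            done += ret;                    /* short read, keep reading */
        }
        return done;
    }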


> +            /* Short Read/Write */
> +            if (luringcb->is_read) {
> +                /* Read, pad with zeroes */
> +                qemu_iovec_memset(luringcb->qiov, ret, 0,
> +                luringcb->qiov->size - ret);
> +            } else {
> +                ret = -ENOSPC;;
> +            }
> +        }
> +        luringcb->ret = ret;
> +
> +        io_uring_cqe_seen(&s->ring, cqes);
> +        cqes = NULL;
> +        /* Change counters one-by-one because we can be nested. */
> +        s->io_q.in_flight--;
> +
> +        /*
> +         * If the coroutine is already entered it must be in ioq_submit()
> +         * and will notice luringcb->ret has been filled in when it
> +         * eventually runs later. Coroutines cannot be entered recursively
> +         * so avoid doing that!
> +         */
> +        if (!qemu_coroutine_entered(luringcb->co)) {
> +            aio_co_wake(luringcb->co);
> +        }
> +    }
> +    qemu_bh_cancel(s->completion_bh);
> +}
> +
> +static void qemu_luring_process_completions_and_submit(LuringState *s)
> +{
> +    aio_context_acquire(s->aio_context);
> +    qemu_luring_process_completions(s);
> +
> +    if (!s->io_q.plugged && s->io_q.in_queue > 0) {
> +        ioq_submit(s);
> +    }
> +    aio_context_release(s->aio_context);
> +}
> +
> +static void qemu_luring_completion_bh(void *opaque)
> +{
> +    LuringState *s = opaque;
> +    qemu_luring_process_completions_and_submit(s);
> +}
> +
> +static void qemu_luring_completion_cb(void *opaque)
> +{
> +    LuringState *s = opaque;
> +    qemu_luring_process_completions_and_submit(s);
> +}
> +
> +static void ioq_init(LuringQueue *io_q)
> +{
> +    QSIMPLEQ_INIT(&io_q->sq_overflow);
> +    io_q->plugged = 0;
> +    io_q->in_queue = 0;
> +    io_q->in_flight = 0;
> +    io_q->blocked = false;
> +}
> +
> +static int ioq_submit(LuringState *s)
> +{
> +    int ret = 0;
> +    LuringAIOCB *luringcb, *luringcb_next;
> +
> +    while (s->io_q.in_queue > 0) {
> +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> +                              luringcb_next) {

I am torn about the 'sq_overflow' name.  It seems to me that it's not
immediately clear that these are the requests that are waiting because the
io_uring got full, but I can't think of a better name right now.

Maybe add a comment here to explain what is going on?

Also, maybe we could somehow utilize the plug/unplug facility to avoid reaching
that state in the first place?  Maybe the block layer has some kind of 'max
outstanding requests' limit that could be used?

In my nvme-mdev I opted not to process the input queues when such a condition is
detected, but here you can't, as the block layer pretty much calls you to
process the requests.


> +            struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> +            if (!sqes) {
> +                break;
> +            }
> +            /* Prep sqe for submission */
> +            *sqes = luringcb->sqeq;
> +            QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
> +        }
> +        ret =  io_uring_submit(&s->ring);
> +        /* Prevent infinite loop if submission is refused */
> +        if (ret <= 0) {
> +            if (ret == -EAGAIN) {
> +                continue;
> +            }
> +            break;
> +        }
> +        s->io_q.in_flight += ret;
> +        s->io_q.in_queue  -= ret;
> +    }
> +    s->io_q.blocked = (s->io_q.in_queue > 0);
> +
> +    if (s->io_q.in_flight) {
> +        /*
> +         * We can try to complete something just right away if there are
> +         * still requests in-flight.
> +         */
> +        qemu_luring_process_completions(s);
> +    }
> +    return ret;
> +}
> +
> +void luring_io_plug(BlockDriverState *bs, LuringState *s)
> +{
> +    s->io_q.plugged++;
> +}
> +
> +void luring_io_unplug(BlockDriverState *bs, LuringState *s)
> +{
> +    assert(s->io_q.plugged);
> +    if (--s->io_q.plugged == 0 &&
> +        !s->io_q.blocked && s->io_q.in_queue > 0) {
> +        ioq_submit(s);
> +    }
> +}
> +
> +/**
> + * luring_do_submit:
> + * @fd: file descriptor for I/O
> + * @luringcb: AIO control block
> + * @s: AIO state
> + * @offset: offset for request
> + * @type: type of request
> + *
> + * Fetches sqes from ring, adds to pending queue and preps them
> + *
> + */
> +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> +                            uint64_t offset, int type)
> +{
> +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> +    if (!sqes) {
> +        sqes = &luringcb->sqeq;
> +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> +    }
> +
> +    switch (type) {
> +    case QEMU_AIO_WRITE:
> +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> +                             luringcb->qiov->niov, offset);
> +        break;
> +    case QEMU_AIO_READ:
> +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> +                            luringcb->qiov->niov, offset);
> +        break;
> +    case QEMU_AIO_FLUSH:
> +        io_uring_prep_fsync(sqes, fd, 0);
> +        break;
> +    default:
> +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> +                        __func__, type);

Nitpick: don't we use some kind of error printing function like 'error_setg' rather than fprintf?


> +        abort();
> +    }
> +    io_uring_sqe_set_data(sqes, luringcb);
> +    s->io_q.in_queue++;
> +
> +    if (!s->io_q.blocked &&
> +        (!s->io_q.plugged ||
> +         s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> +        return ioq_submit(s);
> +    }
> +    return 0;
> +}
> +
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
> +                                uint64_t offset, QEMUIOVector *qiov, int type)
> +{
> +    int ret;
> +    LuringAIOCB luringcb = {
> +        .co         = qemu_coroutine_self(),
> +        .ret        = -EINPROGRESS,
> +        .qiov       = qiov,
> +        .is_read    = (type == QEMU_AIO_READ),
> +    };
> +
> +    ret = luring_do_submit(fd, &luringcb, s, offset, type);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    if (luringcb.ret == -EINPROGRESS) {
> +        qemu_coroutine_yield();
> +    }
> +    return luringcb.ret;
> +}
> +
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> +{
> +    aio_set_fd_handler(old_context, s->ring.ring_fd, false, NULL, NULL, NULL,
> +                       s);
> +    qemu_bh_delete(s->completion_bh);
> +    s->aio_context = NULL;
> +}
> +
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> +{
> +    s->aio_context = new_context;
> +    s->completion_bh = aio_bh_new(new_context, qemu_luring_completion_bh, s);
> +    aio_set_fd_handler(s->aio_context, s->ring.ring_fd, false,
> +                       qemu_luring_completion_cb, NULL, NULL, s);
> +}
> +
> +LuringState *luring_init(Error **errp)
> +{
> +    int rc;
> +    LuringState *s;
> +    s = g_malloc0(sizeof(*s));
> +    struct io_uring *ring = &s->ring;
> +    rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
> +    if (rc < 0) {
> +        error_setg_errno(errp, errno, "failed to init linux io_uring ring");
> +        g_free(s);
> +        return NULL;
> +    }
> +



> +    ioq_init(&s->io_q);

Another nitpick: maybe inline that function, as it is used just here?
(That will save the static declaration up front as well.)
Feel free to leave this as is if you think it is clearer this way.

> +    return s;
> +
> +}
> +
> +void luring_cleanup(LuringState *s)
> +{
> +    io_uring_queue_exit(&s->ring);
> +    g_free(s);
> +}
> diff --git a/include/block/aio.h b/include/block/aio.h
> index 0ca25dfec6..9da3fd9793 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -50,6 +50,7 @@ typedef void IOHandler(void *opaque);
>  struct Coroutine;
>  struct ThreadPool;
>  struct LinuxAioState;
> +struct LuringState;
>  
>  struct AioContext {
>      GSource source;
> @@ -118,11 +119,19 @@ struct AioContext {
>      struct ThreadPool *thread_pool;
>  
>  #ifdef CONFIG_LINUX_AIO
> -    /* State for native Linux AIO.  Uses aio_context_acquire/release for
> +    /*
> +     * State for native Linux AIO.  Uses aio_context_acquire/release for
>       * locking.
>       */
>      struct LinuxAioState *linux_aio;
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +    /*
> +     * State for Linux io_uring.  Uses aio_context_acquire/release for
> +     * locking.
> +     */
> +    struct LuringState *linux_io_uring;
> +#endif
>  
>      /* TimerLists for calling timers - one per clock type.  Has its own
>       * locking.
> @@ -387,6 +396,11 @@ struct LinuxAioState *aio_setup_linux_aio(AioContext *ctx, Error **errp);
>  /* Return the LinuxAioState bound to this AioContext */
>  struct LinuxAioState *aio_get_linux_aio(AioContext *ctx);
>  
> +/* Setup the LuringState bound to this AioContext */
> +struct LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp);
> +
> +/* Return the LuringState bound to this AioContext */
> +struct LuringState *aio_get_linux_io_uring(AioContext *ctx);
>  /**
>   * aio_timer_new_with_attrs:
>   * @ctx: the aio context
> diff --git a/include/block/raw-aio.h b/include/block/raw-aio.h
> index 0cb7cc74a2..71d7d1395f 100644
> --- a/include/block/raw-aio.h
> +++ b/include/block/raw-aio.h
> @@ -55,6 +55,18 @@ void laio_attach_aio_context(LinuxAioState *s, AioContext *new_context);
>  void laio_io_plug(BlockDriverState *bs, LinuxAioState *s);
>  void laio_io_unplug(BlockDriverState *bs, LinuxAioState *s);
>  #endif
> +/* io_uring.c - Linux io_uring implementation */
> +#ifdef CONFIG_LINUX_IO_URING
> +typedef struct LuringState LuringState;
> +LuringState *luring_init(Error **errp);
> +void luring_cleanup(LuringState *s);
> +int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
> +                                uint64_t offset, QEMUIOVector *qiov, int type);
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context);
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context);
> +void luring_io_plug(BlockDriverState *bs, LuringState *s);
> +void luring_io_unplug(BlockDriverState *bs, LuringState *s);
> +#endif
>  
>  #ifdef _WIN32
>  typedef struct QEMUWin32AIOState QEMUWin32AIOState;


I plan to do some benchmarks of the code this week or next, and I will share
the results as soon as I have them.

Please pardon me if I made some mistakes in the review; most of QEMU is still
new to me, so I don't yet know much of the code here.

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface Aarushi Mehta
@ 2019-06-17 12:33   ` Maxim Levitsky
  2019-06-19 10:21     ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 12:33 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  MAINTAINERS         |  1 +
>  stubs/Makefile.objs |  1 +
>  stubs/io_uring.c    | 32 ++++++++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+)
>  create mode 100644 stubs/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 49f896796e..bc38175124 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2522,6 +2522,7 @@ R: Stefan Hajnoczi <stefan@redhat.com>
>  L: qemu-block@nongnu.org
>  S: Maintained
>  F: block/io_uring.c
> +F: stubs/io_uring.c
>  
>  qcow2
>  M: Kevin Wolf <kwolf@redhat.com>
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 9c7393b08c..5cf160a9c8 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -13,6 +13,7 @@ stub-obj-y += iothread.o
>  stub-obj-y += iothread-lock.o
>  stub-obj-y += is-daemonized.o
>  stub-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +stub-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  stub-obj-y += machine-init-done.o
>  stub-obj-y += migr-blocker.o
>  stub-obj-y += change-state-handler.o
> diff --git a/stubs/io_uring.c b/stubs/io_uring.c
> new file mode 100644
> index 0000000000..622d1e4648
> --- /dev/null
> +++ b/stubs/io_uring.c
> @@ -0,0 +1,32 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include "block/aio.h"
> +#include "block/raw-aio.h"
> +
> +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> +{
> +    abort();
> +}
> +
> +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> +{
> +    abort();
> +}
> +
> +LuringState *luring_init(Error **errp)
> +{
> +    abort();
> +}
> +
> +void luring_cleanup(LuringState *s)
> +{
> +    abort();
> +}

I do wonder if there is any value in these stubs (and the linux-aio stubs as
well), since any attempt to use them will abort the test, even basic aio
initialization.

I am not yet familiar with the qemu unit tests, so I won't put an ack on this
patch yet.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring
  2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring Aarushi Mehta
@ 2019-06-17 12:56   ` Maxim Levitsky
  0 siblings, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 12:56 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  util/async.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/util/async.c b/util/async.c
> index c10642a385..2709f0edc3 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -277,6 +277,14 @@ aio_ctx_finalize(GSource     *source)
>      }
>  #endif
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (ctx->linux_io_uring) {
> +        luring_detach_aio_context(ctx->linux_io_uring, ctx);
> +        luring_cleanup(ctx->linux_io_uring);
> +        ctx->linux_io_uring = NULL;
> +    }
> +#endif
> +
>      assert(QSLIST_EMPTY(&ctx->scheduled_coroutines));
>      qemu_bh_delete(ctx->co_schedule_bh);
>  
> @@ -341,6 +349,29 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>  }
>  #endif
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +LuringState *aio_setup_linux_io_uring(AioContext *ctx, Error **errp)
> +{
> +    if (ctx->linux_io_uring) {
> +        return ctx->linux_io_uring;
> +    }
> +
> +    ctx->linux_io_uring = luring_init(errp);
> +    if (!ctx->linux_io_uring) {
> +        return NULL;
> +    }
> +
> +    luring_attach_aio_context(ctx->linux_io_uring, ctx);
> +    return ctx->linux_io_uring;
> +}
> +
> +LuringState *aio_get_linux_io_uring(AioContext *ctx)
> +{
> +    assert(ctx->linux_io_uring);
> +    return ctx->linux_io_uring;
> +}
> +#endif
> +
>  void aio_notify(AioContext *ctx)
>  {

Minor nitpick: maybe we can memset the whole private area of the AioContext to 0,
and then only set up the fields that are not zero? That would remove most of the code below.
This is an old habit from kernel code.

(I assume that g_source_new doesn't do this)


>      /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> @@ -432,6 +463,11 @@ AioContext *aio_context_new(Error **errp)
>  #ifdef CONFIG_LINUX_AIO
>      ctx->linux_aio = NULL;
>  #endif
> +
> +#ifdef CONFIG_LINUX_IO_URING
> +    ctx->linux_io_uring = NULL;
> +#endif
> +
>      ctx->thread_pool = NULL;
>      qemu_rec_mutex_init(&ctx->lock);
>      timerlistgroup_init(&ctx->tlg, aio_timerlist_notify, ctx);
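
Just to make the nitpick above concrete, a rough sketch of the memset variant
(this assumes the embedded GSource member of AioContext is called 'source' and
that g_source_new() does not already zero the private area):

    /* right after g_source_new() in aio_context_new() */
    memset((char *)ctx + sizeof(ctx->source), 0,
           sizeof(*ctx) - sizeof(ctx->source));
    /* ctx->linux_aio, ctx->linux_io_uring and ctx->thread_pool are now
     * already NULL, so the individual assignments below could be dropped */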


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option Aarushi Mehta
@ 2019-06-17 13:01   ` Maxim Levitsky
  2019-06-19 10:24     ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 13:01 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  blockdev.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/blockdev.c b/blockdev.c
> index 3f44b891eb..a2a5b32604 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts *opts, int *bdrv_flags,
>          if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
>              if (!strcmp(aio, "native")) {
>                  *bdrv_flags |= BDRV_O_NATIVE_AIO;
> +            } else if (!strcmp(aio, "io_uring")) {
> +                *bdrv_flags |= BDRV_O_IO_URING;
>              } else if (!strcmp(aio, "threads")) {
>                  /* this is the default */
>              } else {
> @@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
>          },{
>              .name = "aio",
>              .type = QEMU_OPT_STRING,
> -            .help = "host AIO implementation (threads, native)",
> +            .help = "host AIO implementation (threads, native, io_uring)",
>          },{
>              .name = BDRV_OPT_CACHE_WB,
>              .type = QEMU_OPT_BOOL,

Nitpick: Maybe we should rename 'native' to 'libaio' (accept both, but print a deprecation warning)?
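
A rough sketch of what that could look like in extract_common_blockdev_options()
(purely illustrative; it assumes warn_report() is fine to use here):

            if (!strcmp(aio, "libaio") || !strcmp(aio, "native")) {
                if (!strcmp(aio, "native")) {
                    /* hypothetical deprecation notice for the old spelling */
                    warn_report("aio=native is deprecated, use aio=libaio");
                }
                *bdrv_flags |= BDRV_O_NATIVE_AIO;
            } else if (!strcmp(aio, "io_uring")) {
                *bdrv_flags |= BDRV_O_IO_URING;
            } else if (!strcmp(aio, "threads")) {
                /* this is the default */
            }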


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring Aarushi Mehta
@ 2019-06-17 14:12   ` Maxim Levitsky
  0 siblings, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 14:12 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  block/file-posix.c | 85 +++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 69 insertions(+), 16 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index d018429672..211dfe5337 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -154,6 +154,7 @@ typedef struct BDRVRawState {
>      bool has_write_zeroes:1;
>      bool discard_zeroes:1;
>      bool use_linux_aio:1;
> +    bool use_linux_io_uring:1;
Maybe an enum instead?
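
Something along these lines, for example (names are only illustrative):

    typedef enum {
        RAW_AIO_THREADS,    /* thread pool (default) */
        RAW_AIO_NATIVE,     /* Linux AIO */
        RAW_AIO_IO_URING,   /* io_uring */
    } RawAioEngine;

i.e. a single "RawAioEngine aio_engine;" field in BDRVRawState instead of the
two use_linux_aio / use_linux_io_uring bools.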


>      bool page_cache_inconsistent:1;
>      bool has_fallocate;
>      bool needs_alignment;
> @@ -423,7 +424,7 @@ static QemuOptsList raw_runtime_opts = {
>          {
>              .name = "aio",
>              .type = QEMU_OPT_STRING,
> -            .help = "host AIO implementation (threads, native)",
> +            .help = "host AIO implementation (threads, native, io_uring)",
>          },
>          {
>              .name = "locking",
> @@ -482,9 +483,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>          goto fail;
>      }
>  
> -    aio_default = (bdrv_flags & BDRV_O_NATIVE_AIO)
> -                  ? BLOCKDEV_AIO_OPTIONS_NATIVE
> -                  : BLOCKDEV_AIO_OPTIONS_THREADS;
> +    if (bdrv_flags & BDRV_O_NATIVE_AIO) {
> +        aio_default = BLOCKDEV_AIO_OPTIONS_NATIVE;
> +#ifdef CONFIG_LINUX_IO_URING
> +    } else if (bdrv_flags & BDRV_O_IO_URING) {
> +        aio_default = BLOCKDEV_AIO_OPTIONS_IO_URING;
> +#endif
> +    } else {
> +        aio_default = BLOCKDEV_AIO_OPTIONS_THREADS;
> +    }
>      aio = qapi_enum_parse(&BlockdevAioOptions_lookup,
>                            qemu_opt_get(opts, "aio"),
>                            aio_default, &local_err);
> @@ -493,7 +500,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>          ret = -EINVAL;
>          goto fail;
>      }
> +
>      s->use_linux_aio = (aio == BLOCKDEV_AIO_OPTIONS_NATIVE);
> +#ifdef CONFIG_LINUX_IO_URING
> +    s->use_linux_io_uring = (aio == BLOCKDEV_AIO_OPTIONS_IO_URING);
> +#endif
Same as above

>  
>      locking = qapi_enum_parse(&OnOffAuto_lookup,
>                                qemu_opt_get(opts, "locking"),
> @@ -557,7 +568,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      s->shared_perm = BLK_PERM_ALL;
>  
>  #ifdef CONFIG_LINUX_AIO
> -     /* Currently Linux does AIO only for files opened with O_DIRECT */
> +    /* Currently Linux does AIO only for files opened with O_DIRECT */
>      if (s->use_linux_aio) {
>          if (!(s->open_flags & O_DIRECT)) {
>              error_setg(errp, "aio=native was specified, but it requires "
> @@ -579,6 +590,22 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>  #endif /* !defined(CONFIG_LINUX_AIO) */
>  
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (s->use_linux_io_uring) {
> +        if (!aio_setup_linux_io_uring(bdrv_get_aio_context(bs), errp)) {
> +            error_prepend(errp, "Unable to use io_uring: ");
> +            goto fail;
> +        }
> +    }
> +#else
> +    if (s->use_linux_io_uring) {
> +        error_setg(errp, "aio=io_uring was specified, but is not supported "
> +                         "in this build.");
> +        ret = -EINVAL;
> +        goto fail;
> +    }
> +#endif /* !defined(CONFIG_LINUX_IO_URING) */
> +
>      s->has_discard = true;
>      s->has_write_zeroes = true;
>      if ((bs->open_flags & BDRV_O_NOCACHE) != 0) {
> @@ -1875,16 +1902,20 @@ static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
>       * If this is the case tell the low-level driver that it needs
>       * to copy the buffer.
>       */
> -    if (s->needs_alignment) {
> -        if (!bdrv_qiov_is_aligned(bs, qiov)) {
> -            type |= QEMU_AIO_MISALIGNED;
> +    if (s->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)) {
> +                type |= QEMU_AIO_MISALIGNED;
The indentation is wrong here.

> +#ifdef CONFIG_LINUX_IO_URING
> +    } else if (s->use_linux_io_uring) {
> +        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> +        assert(qiov->size == bytes);
> +        return luring_co_submit(bs, aio, s->fd, offset, qiov, type);
> +#endif
>  #ifdef CONFIG_LINUX_AIO
> -        } else if (s->use_linux_aio) {
> -            LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
> -            assert(qiov->size == bytes);
> -            return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
> +    } else if (s->use_linux_aio && s->needs_alignment) {
You can probably drop the s->needs_alignment check here, as it is set iff the file is opened with O_DIRECT, which is a requirement
for Linux AIO.

While at it, maybe also update the comment above the 's->needs_alignment && !bdrv_qiov_is_aligned(bs, qiov)' check,
since it is somewhat outdated.

I think it should state:

"When using O_DIRECT, the request must be aligned to be able to use either libaio or io_uring interface.
If not fail back to regular thread pool read/write code which emulates this for us if we set QEMU_AIO_MISALIGNED"

Or something like that.



> +        LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
> +        assert(qiov->size == bytes);
> +        return laio_co_submit(bs, aio, s->fd, offset, qiov, type);
>  #endif
> -        }
>      }
>  
>      acb = (RawPosixAIOData) {
> @@ -1920,24 +1951,36 @@ static int coroutine_fn raw_co_pwritev(BlockDriverState *bs, uint64_t offset,
>  
>  static void raw_aio_plug(BlockDriverState *bs)
>  {
> +    BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -    BDRVRawState *s = bs->opaque;
>      if (s->use_linux_aio) {
>          LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
>          laio_io_plug(bs, aio);
>      }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (s->use_linux_io_uring) {
> +        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> +        luring_io_plug(bs, aio);
> +    }
> +#endif
>  }
>  
>  static void raw_aio_unplug(BlockDriverState *bs)
>  {
> +    BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -    BDRVRawState *s = bs->opaque;
>      if (s->use_linux_aio) {
>          LinuxAioState *aio = aio_get_linux_aio(bdrv_get_aio_context(bs));
>          laio_io_unplug(bs, aio);
>      }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (s->use_linux_io_uring) {
> +        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
> +        luring_io_unplug(bs, aio);
> +    }
> +#endif
>  }
>  
>  static int raw_co_flush_to_disk(BlockDriverState *bs)
> @@ -1963,8 +2006,8 @@ static int raw_co_flush_to_disk(BlockDriverState *bs)
>  static void raw_aio_attach_aio_context(BlockDriverState *bs,
>                                         AioContext *new_context)
>  {
> +    BDRVRawState __attribute__((unused)) *s = bs->opaque;
>  #ifdef CONFIG_LINUX_AIO
> -    BDRVRawState *s = bs->opaque;
>      if (s->use_linux_aio) {
>          Error *local_err;
>          if (!aio_setup_linux_aio(new_context, &local_err)) {
> @@ -1974,6 +2017,16 @@ static void raw_aio_attach_aio_context(BlockDriverState *bs,
>          }
>      }
>  #endif
> +#ifdef CONFIG_LINUX_IO_URING
> +    if (s->use_linux_io_uring) {
> +        Error *local_err;
> +        if (!aio_setup_linux_io_uring(new_context, &local_err)) {
> +            error_reportf_err(local_err, "Unable to use linux io_uring, "
> +                                         "falling back to thread pool: ");
> +            s->use_linux_io_uring = false;
> +        }
> +    }
> +#endif
>  }
>  
>  static void raw_close(BlockDriverState *bs)


Other than these minor notes,
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling
  2019-06-11  9:51   ` Stefan Hajnoczi
@ 2019-06-17 14:14     ` Maxim Levitsky
  0 siblings, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 14:14 UTC (permalink / raw)
  To: Stefan Hajnoczi, Aarushi Mehta
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Markus Armbruster

On Tue, 2019-06-11 at 10:51 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 10, 2019 at 07:19:03PM +0530, Aarushi Mehta wrote:
> > +static bool qemu_luring_poll_cb(void *opaque)
> > +{
> > +    LuringState *s = opaque;
> > +    struct io_uring_cqe *cqes;
> > +
> > +    if (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> > +        if (!cqes) {
> > +            qemu_luring_process_completions_and_submit(s);
> > +            return true;
> > +        }
> 
> Is this logic inverted?  We have a completion when cqes != NULL.

This indeed looks inverted to me.
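
For reference, the non-inverted check would be something like this (a sketch,
assuming io_uring_peek_cqe() signals a ready completion by returning 0 with a
non-NULL cqe):

static bool qemu_luring_poll_cb(void *opaque)
{
    LuringState *s = opaque;
    struct io_uring_cqe *cqes;

    /* a completion is ready only when peek succeeds *and* yields a cqe */
    if (io_uring_peek_cqe(&s->ring, &cqes) == 0 && cqes) {
        qemu_luring_process_completions_and_submit(s);
        return true;
    }
    return false;
}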

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks for io_uring
  2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks " Aarushi Mehta
  2019-06-11  9:58   ` Stefan Hajnoczi
@ 2019-06-17 14:21   ` Maxim Levitsky
  1 sibling, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-17 14:21 UTC (permalink / raw)
  To: Aarushi Mehta, qemu-devel
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block,
	Markus Armbruster, Max Reitz, saket.sinha89, Stefan Hajnoczi,
	Paolo Bonzini, Julia Suvorova

On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> ---
>  tests/qemu-iotests/087     | 26 ++++++++++++++++++++++++++
>  tests/qemu-iotests/087.out | 10 ++++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
> index d6c8613419..0cc7283ad8 100755
> --- a/tests/qemu-iotests/087
> +++ b/tests/qemu-iotests/087
> @@ -124,6 +124,32 @@ run_qemu_filter_aio <<EOF
>  { "execute": "quit" }
>  EOF
>  
> +echo
> +echo === aio=io_uring without O_DIRECT ===
> +echo
> +
> +# Skip this test if io_uring is not enabled in this build
> +run_qemu_filter_io_uring()
> +{
> +    run_qemu "$@"
> +}
> +
> +run_qemu_filter_io_uring <<EOF
> +{ "execute": "qmp_capabilities" }
> +{ "execute": "blockdev-add",
> +  "arguments": {
> +      "driver": "$IMGFMT",
> +      "node-name": "disk",
> +      "file": {
> +          "driver": "file",
> +          "filename": "$TEST_IMG",
> +          "aio": "io_uring"
> +      }
> +    }
> +  }
> +{ "execute": "quit" }
> +EOF
> +
>  echo
>  echo === Encrypted image QCow ===
>  echo
> diff --git a/tests/qemu-iotests/087.out b/tests/qemu-iotests/087.out
> index 2d92ea847b..f0557d425f 100644
> --- a/tests/qemu-iotests/087.out
> +++ b/tests/qemu-iotests/087.out
> @@ -32,6 +32,16 @@ QMP_VERSION
>  {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
>  
>  
> +=== aio=io_uring without O_DIRECT ===
> +
> +Testing:
> +QMP_VERSION
> +{"return": {}}
> +{"return": {}}
> +{"return": {}}
> +{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
> +
> +

It looks like a wrong copy & paste happened here:

The "aio=native without O_DIRECT" test is a negative test that tries to enable libaio support without O_DIRECT,
which QEMU checks for and fails on.

io_uring, however, can be enabled without O_DIRECT, so a negative test doesn't make sense here. What is needed IMHO is a positive
test that not only enables this mode but actually does some I/O in it to see that it works.



>  === Encrypted image QCow ===
>  
>  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on encrypt.key-secret=sec0


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-17 12:26   ` Maxim Levitsky
@ 2019-06-19 10:14     ` Stefan Hajnoczi
  2019-06-19 10:47       ` Maxim Levitsky
  2019-06-22 15:10     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  1 sibling, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-19 10:14 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > +        if (!cqes) {
> > +            break;
> > +        }
> > +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > +        ret = cqes->res;
> > +
> > +        if (ret == luringcb->qiov->size) {
> > +            ret = 0;
> > +        } else if (ret >= 0) {
> 
> 
> You should very carefully check the allowed return values here.
> 
> It looks like you can get '-EINTR' here, which would ask you to rerun the read operation, and otherwise
> you will get the number of bytes read, which might be less than what was asked for, which implies that you
> need to retry the read operation with the remainder of the buffer rather than zero the end of the buffer IMHO
> 
> (0 is returned on EOF according to 'read' semantics, which I think are used here, thus a short read might not be an EOF)
> 
> 
> Looking at linux-aio.c though, I do see that it just passes through the returned value with no special treatment,
> including no check for -EINTR.
> 
> I assume that since aio is linux specific, and it only supports direct IO, it happens
> to get away with the assumption of no short reads/-EINTR (but since libaio has very sparse documentation I can't verify this).
> 
> On the other hand, the aio=threads implementation actually does everything as specified on the 'write' manpage,
> retrying the reads on -EINTR, and doing additional reads if less than the required number of bytes were read.
> 
> Looking at the io_uring implementation in the kernel, I see that it does support synchronous (non-O_DIRECT) mode,
> and in this case it goes through the same ->read_iter, which is pretty much the same path that
> regular read() takes, and so it might return short reads and/or -EINTR.

Interesting point.  Investigating EINTR should at least be a TODO
comment and needs to be resolved before io_uring lands in a QEMU
release.
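
To make this concrete, the completion path could do something along these lines
(a sketch only; luring_resubmit() and luring_resubmit_short() are hypothetical
helpers, not part of the posted patch):

        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
        int ret = cqes->res;

        if (ret == luringcb->qiov->size) {
            ret = 0;                        /* request fully completed */
        } else if (ret == -EINTR) {
            /* interrupted: grab a fresh sqe and submit the request again */
            luring_resubmit(s, luringcb);
            continue;
        } else if (ret >= 0) {
            /* short transfer: resubmit for the remaining bytes instead of
             * zero-filling the tail of the buffer */
            luring_resubmit_short(s, luringcb, ret);
            continue;
        }
        /* ret < 0 and not -EINTR: pass the error through to the caller */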

> > +static int ioq_submit(LuringState *s)
> > +{
> > +    int ret = 0;
> > +    LuringAIOCB *luringcb, *luringcb_next;
> > +
> > +    while (s->io_q.in_queue > 0) {
> > +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > +                              luringcb_next) {
> 
> I am torn about the 'sq_overflow' name. It seems to me that it's not immediately clear that these
> are the requests that are waiting because the io_uring got full, but I can't think of a better name right now.
> 
> Maybe add a comment here to explain what is going on?

Hmm...I suggested this name because I thought it was clear.  But the
fact that it puzzled you proves it wasn't clear :-).

Can anyone think of a better name?  It's the queue we keep in QEMU to
hold requests while the io_uring sq ring is full.

> Also, maybe we could somehow utilize the plug/unplug facility to avoid reaching that state in the first place?
> Maybe the block layer has some kind of 'max outstanding requests' limit that could be used?
> 
> In my nvme-mdev I opted not to process the input queues when such a condition is detected, but here you can't, as the block layer
> pretty much calls you to process the requests.

Block layer callers are allowed to submit as many I/O requests as they
like and there is no feedback mechanism.  It's up to linux-aio.c and
io_uring.c to handle the case where host kernel I/O submission resources
are exhausted.

Plug/unplug is a batching performance optimization to reduce the number
of io_uring_enter() calls but it does not stop the callers from
submitting more I/O requests.  So plug/unplug isn't directly applicable
here.
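
For illustration, the plug/unplug pattern is roughly the following (a sketch
modeled on the aio=native code; the 'plugged' counter is an assumption, while
'in_queue' and ioq_submit() come from the patch):

void luring_io_plug(BlockDriverState *bs, LuringState *s)
{
    s->io_q.plugged++;        /* defer ring submission while plugged */
}

void luring_io_unplug(BlockDriverState *bs, LuringState *s)
{
    assert(s->io_q.plugged);
    if (--s->io_q.plugged == 0 && s->io_q.in_queue > 0) {
        ioq_submit(s);        /* one io_uring_enter() for the whole batch */
    }
}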

> > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > +                            uint64_t offset, int type)
> > +{
> > +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > +    if (!sqes) {
> > +        sqes = &luringcb->sqeq;
> > +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > +    }
> > +
> > +    switch (type) {
> > +    case QEMU_AIO_WRITE:
> > +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > +                             luringcb->qiov->niov, offset);
> > +        break;
> > +    case QEMU_AIO_READ:
> > +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > +                            luringcb->qiov->niov, offset);
> > +        break;
> > +    case QEMU_AIO_FLUSH:
> > +        io_uring_prep_fsync(sqes, fd, 0);
> > +        break;
> > +    default:
> > +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > +                        __func__, type);
> 
> Nitpick: Don't we use some kind of error printing function like 'error_setg' rather than fprintf?

Here we're not in a context where an Error object can be returned (e.g.
printed by the QMP monitor).  We only have an errno return value that
the emulated storage controller may squash down further to a single
EIO-type error code.

'type' is a QEMU-internal value so the default case is basically
assert(false); /* we should never get here */

For these reasons the fprintf() seems okay here.

> I plan to do some benchmarks of the code this week or next, and I will share the results as soon
> as I have them.

Excellent, Aarushi has been benchmarking too.  Perhaps you can share the
QEMU command-line and fio configuration so that the results can be
compared.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface
  2019-06-17 12:33   ` Maxim Levitsky
@ 2019-06-19 10:21     ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-19 10:21 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Mon, Jun 17, 2019 at 03:33:01PM +0300, Maxim Levitsky wrote:
> On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > diff --git a/stubs/io_uring.c b/stubs/io_uring.c
> > new file mode 100644
> > index 0000000000..622d1e4648
> > --- /dev/null
> > +++ b/stubs/io_uring.c
> > @@ -0,0 +1,32 @@
> > +/*
> > + * Linux io_uring support.
> > + *
> > + * Copyright (C) 2009 IBM, Corp.
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + */
> > +#include "qemu/osdep.h"
> > +#include "block/aio.h"
> > +#include "block/raw-aio.h"
> > +
> > +void luring_detach_aio_context(LuringState *s, AioContext *old_context)
> > +{
> > +    abort();
> > +}
> > +
> > +void luring_attach_aio_context(LuringState *s, AioContext *new_context)
> > +{
> > +    abort();
> > +}
> > +
> > +LuringState *luring_init(Error **errp)
> > +{
> > +    abort();
> > +}
> > +
> > +void luring_cleanup(LuringState *s)
> > +{
> > +    abort();
> > +}
> 
> I do wonder if there is any value in these stubs (and the linux-aio stubs as well), since any attempt to use them
> will abort the test, even basic aio initialization.
> 
> I am not yet familiar with the qemu unit tests, so I won't put an ack on this patch yet.

From docs/devel/build-system.txt:

  The utility code that is used by all binaries is built into a
  static archive called libqemuutil.a, which is then linked to all the
  binaries. In order to provide hooks that are only needed by some of the
  binaries, code in libqemuutil.a may depend on other functions that are
  not fully implemented by all QEMU binaries.  Dummy stubs for all these
  functions are also provided by this library, and will only be linked
  into the binary if the real implementation is not present.  In a way,
  the stubs can be thought of as a portable implementation of the weak
  symbols concept.

Take util/async.c as an example.  It calls laio_init(), which lives in
block/.  But our binary using libqemuutil.a (which contains async.o)
never actually calls into it.  We need a stub just to make the linker
happy, but don't worry, it will never be called by the program.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option
  2019-06-17 13:01   ` Maxim Levitsky
@ 2019-06-19 10:24     ` Stefan Hajnoczi
  2019-06-19 10:48       ` Maxim Levitsky
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-19 10:24 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Mon, Jun 17, 2019 at 04:01:45PM +0300, Maxim Levitsky wrote:
> On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> > Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  blockdev.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/blockdev.c b/blockdev.c
> > index 3f44b891eb..a2a5b32604 100644
> > --- a/blockdev.c
> > +++ b/blockdev.c
> > @@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts *opts, int *bdrv_flags,
> >          if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
> >              if (!strcmp(aio, "native")) {
> >                  *bdrv_flags |= BDRV_O_NATIVE_AIO;
> > +            } else if (!strcmp(aio, "io_uring")) {
> > +                *bdrv_flags |= BDRV_O_IO_URING;
> >              } else if (!strcmp(aio, "threads")) {
> >                  /* this is the default */
> >              } else {
> > @@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
> >          },{
> >              .name = "aio",
> >              .type = QEMU_OPT_STRING,
> > -            .help = "host AIO implementation (threads, native)",
> > +            .help = "host AIO implementation (threads, native, io_uring)",
> >          },{
> >              .name = BDRV_OPT_CACHE_WB,
> >              .type = QEMU_OPT_BOOL,
> 
> > Nitpick: Maybe we should rename 'native' to 'libaio' (accept both, but print a deprecation warning)?

"libaio" is a clearer name but I'm afraid changing it or introducing a
new name is not worth it with so many users, command-lines, scripts, and
management tools that know about "native".  Having two names that mean
the same thing might cause confusion.

Let's leave it as is.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-19 10:14     ` Stefan Hajnoczi
@ 2019-06-19 10:47       ` Maxim Levitsky
  0 siblings, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-19 10:47 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Wed, 2019-06-19 at 11:14 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > > +        if (!cqes) {
> > > +            break;
> > > +        }
> > > +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > > +        ret = cqes->res;
> > > +
> > > +        if (ret == luringcb->qiov->size) {
> > > +            ret = 0;
> > > +        } else if (ret >= 0) {
> > 
> > 
> > You should very carefully check the allowed return values here.
> > 
> > It looks like you can get '-EINTR' here, which would ask you to rerun the read operation, and otherwise
> > you will get the number of bytes read, which might be less than what was asked for, which implies that you
> > need to retry the read operation with the remainder of the buffer rather than zero the end of the buffer IMHO
> > 
> > (0 is returned on EOF according to 'read' semantics, which I think are used here, thus a short read might not be an EOF)
> > 
> > 
> > Looking at linux-aio.c though, I do see that it just passes through the returned value with no special treatment,
> > including no check for -EINTR.
> > 
> > I assume that since aio is linux specific, and it only supports direct IO, it happens
> > to get away with the assumption of no short reads/-EINTR (but since libaio has very sparse documentation I can't verify this).
> > 
> > On the other hand, the aio=threads implementation actually does everything as specified on the 'write' manpage,
> > retrying the reads on -EINTR, and doing additional reads if less than the required number of bytes were read.
> > 
> > Looking at the io_uring implementation in the kernel, I see that it does support synchronous (non-O_DIRECT) mode,
> > and in this case it goes through the same ->read_iter, which is pretty much the same path that
> > regular read() takes, and so it might return short reads and/or -EINTR.
> 
> Interesting point.  Investigating EINTR should at least be a TODO
> comment and needs to be resolved before io_uring lands in a QEMU
> release.
> 
> > > +static int ioq_submit(LuringState *s)
> > > +{
> > > +    int ret = 0;
> > > +    LuringAIOCB *luringcb, *luringcb_next;
> > > +
> > > +    while (s->io_q.in_queue > 0) {
> > > +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > > +                              luringcb_next) {
> > 
> > I am torn about the 'sq_overflow' name. It seems to me that it's not immediately clear that these
> > are the requests that are waiting because the io_uring got full, but I can't think of a better name right now.
> > 
> > Maybe add a comment here to explain what is going on?
> 
> Hmm...I suggested this name because I thought it was clear.  But the
> fact that it puzzled you proves it wasn't clear :-).
> 
> Can anyone think of a better name?  It's the queue we keep in QEMU to
> hold requests while the io_uring sq ring is full.
> 
> > Also, maybe we could somehow utilize the plug/unplug facility to avoid reaching that state in the first place?
> > Maybe the block layer has some kind of 'max outstanding requests' limit that could be used?
> > 
> > In my nvme-mdev I opted not to process the input queues when such a condition is detected, but here you can't, as the block layer
> > pretty much calls you to process the requests.
> 
> Block layer callers are allowed to submit as many I/O requests as they
> like and there is no feedback mechanism.  It's up to linux-aio.c and
> io_uring.c to handle the case where host kernel I/O submission resources
> are exhausted.
> 
> Plug/unplug is a batching performance optimization to reduce the number
> of io_uring_enter() calls but it does not stop the callers from
> submitting more I/O requests.  So plug/unplug isn't directly applicable
> here.

Thanks for the explanation! I guess we can leave that name as is, but add some comment or so
in the place where the queue is accessed.
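
Something like this above the queue declaration would already help, I think
(wording is only a suggestion, and it assumes sq_overflow is the QSIMPLEQ of
LuringAIOCBs that the QSIMPLEQ_* calls in the patch imply):

    /*
     * Requests that could not get an sqe because the io_uring submission
     * queue was full when they were issued; they wait here and are flushed
     * into the ring by ioq_submit() once sq entries free up again.
     */
    QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;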



> 
> > > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > > +                            uint64_t offset, int type)
> > > +{
> > > +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > > +    if (!sqes) {
> > > +        sqes = &luringcb->sqeq;
> > > +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > > +    }
> > > +
> > > +    switch (type) {
> > > +    case QEMU_AIO_WRITE:
> > > +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > > +                             luringcb->qiov->niov, offset);
> > > +        break;
> > > +    case QEMU_AIO_READ:
> > > +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > > +                            luringcb->qiov->niov, offset);
> > > +        break;
> > > +    case QEMU_AIO_FLUSH:
> > > +        io_uring_prep_fsync(sqes, fd, 0);
> > > +        break;
> > > +    default:
> > > +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > > +                        __func__, type);
> > 
> > Nitpick: Don't we use some kind of error printing function like 'error_setg' rather than fprintf?
> 
> Here we're not in a context where an Error object can be returned (e.g.
> printed by the QMP monitor).  We only have an errno return value that
> the emulated storage controller may squash down further to a single
> EIO-type error code.
> 
> 'type' is a QEMU-internal value so the default case is basically
> assert(false); /* we should never get here */
> 
> For these reasons the fprintf() seems okay here.
All right then.

> 
> > I plan to do some benchmarks of the code this week or next, and I will share the results as soon
> > as I have them.
> 
> Excellent, Aarushi has been benchmarking too.  Perhaps you can share the
> QEMU command-line and fio configuration so that the results can be
> compared.

I'll do that soon. I already did some initial benchmarks (I didn't save the results yet), and in its current state io_uring is a tiny bit slower
than libaio, which is not a problem, because io_uring has a lot of benefits in addition to performance, plus the current code can be optimized further.

I'll post all the benchmarks I am redoing now as soon as I have them.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option
  2019-06-19 10:24     ` Stefan Hajnoczi
@ 2019-06-19 10:48       ` Maxim Levitsky
  0 siblings, 0 replies; 40+ messages in thread
From: Maxim Levitsky @ 2019-06-19 10:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta, Markus Armbruster

On Wed, 2019-06-19 at 11:24 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 04:01:45PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:19 +0530, Aarushi Mehta wrote:
> > > Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
> > > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >  blockdev.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/blockdev.c b/blockdev.c
> > > index 3f44b891eb..a2a5b32604 100644
> > > --- a/blockdev.c
> > > +++ b/blockdev.c
> > > @@ -386,6 +386,8 @@ static void extract_common_blockdev_options(QemuOpts *opts, int *bdrv_flags,
> > >          if ((aio = qemu_opt_get(opts, "aio")) != NULL) {
> > >              if (!strcmp(aio, "native")) {
> > >                  *bdrv_flags |= BDRV_O_NATIVE_AIO;
> > > +            } else if (!strcmp(aio, "io_uring")) {
> > > +                *bdrv_flags |= BDRV_O_IO_URING;
> > >              } else if (!strcmp(aio, "threads")) {
> > >                  /* this is the default */
> > >              } else {
> > > @@ -4579,7 +4581,7 @@ QemuOptsList qemu_common_drive_opts = {
> > >          },{
> > >              .name = "aio",
> > >              .type = QEMU_OPT_STRING,
> > > -            .help = "host AIO implementation (threads, native)",
> > > +            .help = "host AIO implementation (threads, native, io_uring)",
> > >          },{
> > >              .name = BDRV_OPT_CACHE_WB,
> > >              .type = QEMU_OPT_BOOL,
> > 
> > Nitpick: Maybe we should rename 'native' to 'libaio' (accept both, but print a deprecation warning)?
> 
> "libaio" is a clearer name but I'm afraid changing it or introducing a
> new name is not worth it with so many users, command-lines, scripts, and
> management tools that know about "native".  Having two names that mean
> the same thing might cause confusion.
> 
> Let's leave it as is.
> 
I won't argue about this; it can also be done later.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring
  2019-06-17 12:26   ` Maxim Levitsky
  2019-06-19 10:14     ` Stefan Hajnoczi
@ 2019-06-22 15:10     ` Stefan Hajnoczi
  1 sibling, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-22 15:10 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Fam Zheng, Kevin Wolf, qemu block, Markus Armbruster, qemu-devel,
	saket.sinha89, Stefan Hajnoczi, Paolo Bonzini, Max Reitz,
	Julia Suvorova, Aarushi Mehta

On Mon, Jun 17, 2019 at 1:27 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > +        if (!cqes) {
> > +            break;
> > +        }
> > +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > +        ret = cqes->res;
> > +
> > +        if (ret == luringcb->qiov->size) {
> > +            ret = 0;
> > +        } else if (ret >= 0) {
>
>
> You should very carefully check the allowed return values here.
>
> It looks like you can get '-EINTR' here, which would ask you to rerun the read operation, and otherwise
> you will get the number of bytes read, which might be less than what was asked for, which implies that you
> need to retry the read operation with the remainder of the buffer rather than zero the end of the buffer IMHO
>
> (0 is returned on EOF according to 'read' semantics, which I think are used here, thus a short read might not be an EOF)
>
>
> Looking at linux-aio.c though, I do see that it just passes through the returned value with no special treatment,
> including no check for -EINTR.
>
> I assume that since aio is linux specific, and it only supports direct IO, it happens
> to get away with the assumption of no short reads/-EINTR (but since libaio has very sparse documentation I can't verify this).
>
> On the other hand, the aio=threads implementation actually does everything as specified on the 'write' manpage,
> retrying the reads on -EINTR, and doing additional reads if less than the required number of bytes were read.
>
> Looking at the io_uring implementation in the kernel, I see that it does support synchronous (non-O_DIRECT) mode,
> and in this case it goes through the same ->read_iter, which is pretty much the same path that
> regular read() takes, and so it might return short reads and/or -EINTR.

Thanks, Maxim.  I've confirmed that -EINTR needs to be handled based
on fs/io_uring.c.  We need to get a new sqe (since the old one is no
longer usable) and submit the request again.  There are no ordering
guarantees between pending requests so doing this is permissible.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH v5 00/12] Add support for io_uring
  2019-06-11  9:56 ` [Qemu-devel] [PATCH v5 00/12] Add support " Stefan Hajnoczi
@ 2019-06-22 15:13   ` Stefan Hajnoczi
  2019-06-23 14:03     ` Stefan Hajnoczi
  0 siblings, 1 reply; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-22 15:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, qemu block, Markus Armbruster, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta

On Tue, Jun 11, 2019 at 10:57 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> On Mon, Jun 10, 2019 at 07:18:53PM +0530, Aarushi Mehta wrote:
> > This patch series adds support for the newly developed io_uring Linux AIO
> > interface. Linux io_uring is faster than Linux's AIO asynchronous I/O code,
> > offers efficient buffered asynchronous I/O support, the ability to do I/O
> > without performing a system call via polled I/O, and other efficiency enhancements.
> >
> > Testing it requires a host kernel (5.1+) and the liburing library.
> > Use the option -drive aio=io_uring to enable it.
> >
> > v5:
> > - Adds completion polling
> > - Extends qemu-io
> > - Adds qemu-iotest
>
> Flush is not hooked up.  Please use the io_uring IORING_OP_FSYNC that
> you've already written and connect it to file-posix.c.

IORING_OP_FSYNC is in fact synchronous.  This means io_uring_enter()
blocks until this operation completes.  This is not desirable since
the AIO engine should not block the QEMU thread it's running from for
a long time (e.g. network file system that is not responding).

I think it's best *not* to use io_uring for fsync.  Instead we can
continue to use the thread pool, just like Linux AIO.

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH v5 00/12] Add support for io_uring
  2019-06-22 15:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2019-06-23 14:03     ` Stefan Hajnoczi
  0 siblings, 0 replies; 40+ messages in thread
From: Stefan Hajnoczi @ 2019-06-23 14:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, qemu block, Markus Armbruster, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Julia Suvorova,
	Aarushi Mehta

On Sat, Jun 22, 2019 at 4:13 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Jun 11, 2019 at 10:57 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Jun 10, 2019 at 07:18:53PM +0530, Aarushi Mehta wrote:
> > > This patch series adds support for the newly developed io_uring Linux AIO
> > > interface. Linux io_uring is faster than Linux's AIO asynchronous I/O code,
> > > offers efficient buffered asynchronous I/O support, the ability to do I/O
> > > without performing a system call via polled I/O, and other efficiency enhancements.
> > >
> > > Testing it requires a host kernel (5.1+) and the liburing library.
> > > Use the option -drive aio=io_uring to enable it.
> > >
> > > v5:
> > > - Adds completion polling
> > > - Extends qemu-io
> > > - Adds qemu-iotest
> >
> > Flush is not hooked up.  Please use the io_uring IORING_OP_FSYNC that
> > you've already written and connect it to file-posix.c.
>
> IORING_OP_FSYNC is in fact synchronous.  This means io_uring_enter()
> blocks until this operation completes.  This is not desirable since
> the AIO engine should not block the QEMU thread it's running from for
> a long time (e.g. network file system that is not responding).
>
> I think it's best *not* to use io_uring for fsync.  Instead we can
> continue to use the thread pool, just like Linux AIO.

Looking more closely, this is wrong.  Although fsync is synchronous,
io_uring takes care to bounce it to the workqueue when submitted via
io_uring_enter().  Therefore it appears asynchronous to userspace and
we can and should use io_uring for fsync.
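
For what it's worth, hooking it up in file-posix.c's raw_co_flush_to_disk()
could look roughly like this (a sketch only; it assumes luring_co_submit()
and the completion path are taught to cope with a NULL qiov for QEMU_AIO_FLUSH):

#ifdef CONFIG_LINUX_IO_URING
    if (s->use_linux_io_uring) {
        LuringState *aio = aio_get_linux_io_uring(bdrv_get_aio_context(bs));
        /* luring_do_submit() turns QEMU_AIO_FLUSH into io_uring_prep_fsync() */
        return luring_co_submit(bs, aio, s->fd, 0, NULL, QEMU_AIO_FLUSH);
    }
#endif
    /* otherwise fall through to the existing thread pool flush path */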

Stefan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring
  2019-06-11  9:32     ` Stefan Hajnoczi
@ 2019-07-03  6:26       ` Markus Armbruster
  0 siblings, 0 replies; 40+ messages in thread
From: Markus Armbruster @ 2019-07-03  6:26 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Fam Zheng, Kevin Wolf, Sergio Lopez, qemu-block, qemu-devel,
	Max Reitz, saket.sinha89, Paolo Bonzini, Maxim Levitsky,
	Julia Suvorova, Aarushi Mehta

I apologize for replying so late.

Stefan Hajnoczi <stefanha@redhat.com> writes:

> On Tue, Jun 11, 2019 at 03:36:53PM +0800, Fam Zheng wrote:
>> On Mon, 06/10 19:18, Aarushi Mehta wrote:
>> > Option only enumerates for hosts that support it.
>> > 
>> > Signed-off-by: Aarushi Mehta <mehta.aaru20@gmail.com>
>> > Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
>> > ---
>> >  qapi/block-core.json | 4 +++-
>> >  1 file changed, 3 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/qapi/block-core.json b/qapi/block-core.json
>> > index 1defcde048..db7eedd058 100644
>> > --- a/qapi/block-core.json
>> > +++ b/qapi/block-core.json
>> > @@ -2792,11 +2792,13 @@
>> >  #
>> >  # @threads:     Use qemu's thread pool
>> >  # @native:      Use native AIO backend (only Linux and Windows)
>> > +# @io_uring:    Use linux io_uring (since 4.1)
>> >  #
>> >  # Since: 2.9
>> >  ##
>> >  { 'enum': 'BlockdevAioOptions',
>> > -  'data': [ 'threads', 'native' ] }
>> > +  'data': [ 'threads', 'native',
>> > +            { 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } ] }
>> 
>> Question: 'native' has a dependency on libaio but it doesn't have the
>> condition.  Is the inconsistency intended?
>
> 'native' could be conditional too but I guess it's a historical thing.
> Either QAPI 'if' didn't exist when BlockdevAioOptions was defined or we
> simply forgot to use it :).

The former.  QAPI has supported 'if' partially since 3.0, and fully since
4.0.

> It doesn't need to be changed in this patch series.

Making the QAPI schema reflect the underlying C code's ifdeffery can
help QMP clients.  Compare:

* query-qmp-schema reports member "native" even when attempting to use
  it will always fail.

* It reports member "io_uring" only when it can work.

This is particularly useful when a QMP client has logic like "do it this
way if X is available, else do it that way".  With accurate
introspection, you can check whether X is available safely and easily.
Without introspection, you'll likely have to try "this way", and if it
fails, figure out why, so you can decide whether to try "that way".
It's even worse when a multi-step "this way" fails halfway through, and you get
to revert its initial steps.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-07-03  6:27 UTC | newest]

Thread overview: 40+ messages
2019-06-10 13:48 [Qemu-devel] [PATCH v5 00/12] Add support for io_uring Aarushi Mehta
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 01/12] configure: permit use of io_uring Aarushi Mehta
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 02/12] qapi/block-core: add option for io_uring Aarushi Mehta
2019-06-11  7:36   ` Fam Zheng
2019-06-11  9:32     ` Stefan Hajnoczi
2019-07-03  6:26       ` Markus Armbruster
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 03/12] block/block: add BDRV flag " Aarushi Mehta
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 04/12] block/io_uring: implements interfaces " Aarushi Mehta
2019-06-11 11:17   ` Fam Zheng
2019-06-12 14:43     ` Stefan Hajnoczi
2019-06-17 12:26   ` Maxim Levitsky
2019-06-19 10:14     ` Stefan Hajnoczi
2019-06-19 10:47       ` Maxim Levitsky
2019-06-22 15:10     ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 05/12] stubs: add stubs for io_uring interface Aarushi Mehta
2019-06-17 12:33   ` Maxim Levitsky
2019-06-19 10:21     ` Stefan Hajnoczi
2019-06-10 13:48 ` [Qemu-devel] [PATCH v5 06/12] util/async: add aio interfaces for io_uring Aarushi Mehta
2019-06-17 12:56   ` Maxim Levitsky
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 07/12] blockdev: accept io_uring as option Aarushi Mehta
2019-06-17 13:01   ` Maxim Levitsky
2019-06-19 10:24     ` Stefan Hajnoczi
2019-06-19 10:48       ` Maxim Levitsky
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 08/12] block/file-posix.c: extend to use io_uring Aarushi Mehta
2019-06-17 14:12   ` Maxim Levitsky
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 09/12] block: add trace events for io_uring Aarushi Mehta
2019-06-11  9:47   ` Stefan Hajnoczi
2019-06-11 11:34   ` Fam Zheng
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 10/12] block/io_uring: adds userspace completion polling Aarushi Mehta
2019-06-11  9:51   ` Stefan Hajnoczi
2019-06-17 14:14     ` Maxim Levitsky
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 11/12] qemu-io: adds support for io_uring Aarushi Mehta
2019-06-11  9:54   ` Stefan Hajnoczi
2019-06-11 12:26   ` Fam Zheng
2019-06-10 13:49 ` [Qemu-devel] [PATCH v5 12/12] qemu-iotests/087: checks " Aarushi Mehta
2019-06-11  9:58   ` Stefan Hajnoczi
2019-06-17 14:21   ` Maxim Levitsky
2019-06-11  9:56 ` [Qemu-devel] [PATCH v5 00/12] Add support " Stefan Hajnoczi
2019-06-22 15:13   ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2019-06-23 14:03     ` Stefan Hajnoczi
