* [Qemu-devel] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads
@ 2019-07-15 20:19 Stefan Hajnoczi
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events Stefan Hajnoczi
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2019-07-15 20:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Stefan Hajnoczi, Julia Suvorova, Aarushi Mehta

Short reads are possible with cache=writeback (see Patch 3 for details).
Handle this by resubmitting requests until the read is completed.

Patch 1 adds trace events useful for debugging io_uring.

Patch 2 fixes EINTR.  This lays the groundwork for resubmitting requests in
Patch 3.

Aarushi: Feel free to squash this into your patch series if you are happy with
the code; I don't mind if the authorship information is lost.  After applying
these patches I can successfully boot a Fedora 30 guest qcow2 file with
cache=writeback.

Based-on: <20190610134905.22294-1-mehta.aaru20@gmail.com>

Stefan Hajnoczi (3):
  block/io_uring: add submission and completion trace events
  block/io_uring: fix EINTR request resubmission
  block/io_uring: resubmit short buffered reads

 block/io_uring.c   | 136 ++++++++++++++++++++++++++++++++++-----------
 block/trace-events |   6 +-
 2 files changed, 108 insertions(+), 34 deletions(-)

-- 
2.21.0




* [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events
  2019-07-15 20:19 [Qemu-devel] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads Stefan Hajnoczi
@ 2019-07-15 20:19 ` Stefan Hajnoczi
  2019-07-16  7:04   ` Philippe Mathieu-Daudé
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 2/3] block/io_uring: fix EINTR request resubmission Stefan Hajnoczi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2019-07-15 20:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Stefan Hajnoczi, Julia Suvorova, Aarushi Mehta

It is useful to follow individual requests as they are submitted.  Add
trace events that show details of each request.
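
As a usage note: each declaration added to block/trace-events generates a
trace_<name>() helper that the C code calls directly, and the events can be
enabled at run time by pattern (for example via QEMU's -trace option,
depending on which trace backend QEMU was configured with).  A minimal
sketch of the mapping, using one of the events added below:

    /*
     * block/trace-events declaration (see the hunk below):
     *   luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
     * generated helper called from block/io_uring.c:
     */
    trace_luring_io_uring_submit(s, ret);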

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io_uring.c   | 5 +++++
 block/trace-events | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/block/io_uring.c b/block/io_uring.c
index 22e8d3d9ca..19919da4c9 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -128,6 +128,8 @@ static void luring_process_completions(LuringState *s)
         LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
         ret = cqes->res;
 
+        trace_luring_process_completion(s, luringcb, ret);
+
         if (ret == luringcb->qiov->size) {
             ret = 0;
         } else if (ret >= 0) {
@@ -233,6 +235,7 @@ static int ioq_submit(LuringState *s)
             QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
         }
         ret = io_uring_submit(&s->ring);
+        trace_luring_io_uring_submit(s, ret);
         /* Prevent infinite loop if submission is refused */
         if (ret <= 0) {
             if (ret == -EAGAIN) {
@@ -339,6 +342,8 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
         .is_read    = (type == QEMU_AIO_READ),
     };
 
+    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0, type);
+
     ret = luring_do_submit(fd, &luringcb, s, offset, type);
     if (ret < 0) {
         return ret;
diff --git a/block/trace-events b/block/trace-events
index 069779773b..02952fe4cb 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -67,6 +67,9 @@ luring_io_plug(void *s) "LuringState %p plug"
 luring_io_unplug(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
 luring_do_submit(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
 luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
+luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
+luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
+luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
 
 # qcow2.c
 qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
-- 
2.21.0




* [Qemu-devel] [PATCH 2/3] block/io_uring: fix EINTR request resubmission
  2019-07-15 20:19 [Qemu-devel] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads Stefan Hajnoczi
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events Stefan Hajnoczi
@ 2019-07-15 20:19 ` Stefan Hajnoczi
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 3/3] block/io_uring: resubmit short buffered reads Stefan Hajnoczi
  2019-08-21 22:20 ` [Qemu-devel] [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads John Snow
  3 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2019-07-15 20:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Stefan Hajnoczi, Julia Suvorova, Aarushi Mehta

Adding the request to sq_overflow isn't enough:
1. luringcb->sqeq is uninitialized if there was space in the sq ring at
   submission time.
2. Not all code paths invoke ioq_submit() after processing completions,
   so the request could hang.

Additional bugs include checking for EINTR instead of -EINTR and
forgetting to skip the completion callback when a request is
resubmitted.

Fix this by always initializing luringcb->sqeq and ensuring that all
code paths invoke ioq_submit() after appending to sq_overflow.  Ensure
that luring_process_completions() marks the cqe seen and decrements
in_flight before resubmitting the request.
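
In outline, the completion loop in luring_process_completions() behaves as
in the following simplified sketch after this patch (short read/write
handling is omitted here; see the diff below for the real code):

    while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
        LuringAIOCB *luringcb;
        int ret;

        if (!cqes) {
            break;
        }

        luringcb = io_uring_cqe_get_data(cqes);
        ret = cqes->res;

        /* Mark the cqe seen and drop in_flight *before* any resubmission */
        io_uring_cqe_seen(&s->ring, cqes);
        cqes = NULL;
        s->io_q.in_flight--;

        if (ret == -EINTR) {
            /* Re-queue on sq_overflow; a later ioq_submit() restarts it.
             * The completion callback is skipped for resubmitted requests. */
            luring_resubmit(s, luringcb);
            continue;
        }

        luringcb->ret = ret;
        /* ... wake the waiting coroutine or run the completion callback ... */
    }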

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io_uring.c | 64 ++++++++++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 27 deletions(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index 19919da4c9..97e4f876d7 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -87,6 +87,18 @@ int luring_register_fd(LuringState *s, unsigned int fd)
                              s->fd.head, s->fd.size);
 }
 
+/**
+ * luring_resubmit:
+ *
+ * Resubmit a request by appending it to sq_overflow.  The caller must ensure
+ * that ioq_submit() is called later so that sq_overflow requests are started.
+ */
+static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
+{
+    QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
+    s->io_q.in_queue++;
+}
+
 /**
  * luring_process_completions:
  * @s: AIO state
@@ -102,7 +114,6 @@ int luring_register_fd(LuringState *s, unsigned int fd)
 static void luring_process_completions(LuringState *s)
 {
     struct io_uring_cqe *cqes;
-    int ret;
 
     /*
      * Request completion callbacks can run the nested event loop.
@@ -122,11 +133,20 @@ static void luring_process_completions(LuringState *s)
     qemu_bh_schedule(s->completion_bh);
 
     while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
+        LuringAIOCB *luringcb;
+        int ret;
+
         if (!cqes) {
             break;
         }
-        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
+
+        luringcb = io_uring_cqe_get_data(cqes);
         ret = cqes->res;
+        io_uring_cqe_seen(&s->ring, cqes);
+        cqes = NULL;
+
+        /* Change counters one-by-one because we can be nested. */
+        s->io_q.in_flight--;
 
         trace_luring_process_completion(s, luringcb, ret);
 
@@ -143,17 +163,12 @@ static void luring_process_completions(LuringState *s)
                 ret = -ENOSPC;;
             }
         /* Add to overflow queue to be resubmitted later */
-        } else if (ret == EINTR) {
-            QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
+        } else if (ret == -EINTR) {
+            luring_resubmit(s, luringcb);
+            continue;
         }
         luringcb->ret = ret;
 
-
-        io_uring_cqe_seen(&s->ring, cqes);
-        cqes = NULL;
-        /* Change counters one-by-one because we can be nested. */
-        s->io_q.in_flight--;
-
         /*
          * If the coroutine is already entered it must be in ioq_submit()
          * and will notice luringcb->ret has been filled in when it
@@ -245,16 +260,16 @@ static int ioq_submit(LuringState *s)
         }
         s->io_q.in_flight += ret;
         s->io_q.in_queue  -= ret;
+
+        if (s->io_q.in_flight) {
+            /*
+             * We can try to complete something just right away if there are
+             * still requests in-flight.
+             */
+            luring_process_completions(s);
+        }
     }
     s->io_q.blocked = (s->io_q.in_queue > 0);
-
-    if (s->io_q.in_flight) {
-        /*
-         * We can try to complete something just right away if there are
-         * still requests in-flight.
-         */
-        luring_process_completions(s);
-    }
     return ret;
 }
 
@@ -290,15 +305,7 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
                             uint64_t offset, int type)
 {
     int ret;
-    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
-    /* 
-     *If the ring is full and cannot fetch new sqes, add the request to
-     * to an overflow queue to be submitted later
-     */
-    if (!sqes) {
-        sqes = &luringcb->sqeq;
-        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
-    }
+    struct io_uring_sqe *sqes = &luringcb->sqeq;
 
     switch (type) {
     case QEMU_AIO_WRITE:
@@ -318,7 +325,10 @@ static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
         abort();
     }
     io_uring_sqe_set_data(sqes, luringcb);
+
+    QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
     s->io_q.in_queue++;
+
     trace_luring_do_submit(s, s->io_q.blocked, s->io_q.plugged,
                            s->io_q.in_queue, s->io_q.in_flight);
     if (!s->io_q.blocked &&
-- 
2.21.0




* [Qemu-devel] [PATCH 3/3] block/io_uring: resubmit short buffered reads
  2019-07-15 20:19 [Qemu-devel] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads Stefan Hajnoczi
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events Stefan Hajnoczi
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 2/3] block/io_uring: fix EINTR request resubmission Stefan Hajnoczi
@ 2019-07-15 20:19 ` Stefan Hajnoczi
  2019-07-16  7:15   ` Philippe Mathieu-Daudé
  2019-08-21 22:20 ` [Qemu-devel] [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads John Snow
  3 siblings, 1 reply; 8+ messages in thread
From: Stefan Hajnoczi @ 2019-07-15 20:19 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Stefan Hajnoczi, Julia Suvorova, Aarushi Mehta

The io_uring API had unusual read behavior up until recently, where
short reads could occur when the start of the file range was in the page
cache and a later portion was not in the page cache.  Normally read(2)
does not expose this detail to applications and this behavior has been
fixed in Linux commit 9d93a3f5a0c ("io_uring: punt short reads to async
* context").

In the meantime Linux distros have shipped kernels where io_uring
exhibits the old behavior and there is no simple way to detect it.

Add a slow path for resubmitting short read requests.  The idea is
simple: shorten the iovecs and increment the file offset each time a
short read occurs and then resubmit the request.  The implementation
requires adding fields to LuringAIOCB to keep track of where
we were.
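
As a concrete (hypothetical) example: if a 64 KiB buffered read completes
with nread = 16384, total_read advances to 16384, the sqe file offset moves
forward by the same amount, and the request is resubmitted with an iovec
covering only the remaining 48 KiB.  A simplified sketch of the slow path
(the function name here is only for illustration; see
luring_resubmit_short_read() in the diff below for the real code):

    static void resubmit_short_read_sketch(LuringState *s,
                                           LuringAIOCB *luringcb, int nread)
    {
        size_t remaining;

        /* Account for the bytes already read and compute what is left */
        luringcb->total_read += nread;
        remaining = luringcb->qiov->size - luringcb->total_read;

        /* Build a shortened iovec covering only the remaining bytes */
        if (luringcb->resubmit_qiov.iov == NULL) {
            qemu_iovec_init(&luringcb->resubmit_qiov, luringcb->qiov->niov);
        } else {
            qemu_iovec_reset(&luringcb->resubmit_qiov);
        }
        qemu_iovec_concat(&luringcb->resubmit_qiov, luringcb->qiov,
                          luringcb->total_read, remaining);

        /* Advance the file offset, point the sqe at the new iovec, and
         * append the request to sq_overflow for resubmission */
        luringcb->sqeq.off += nread;
        luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
        luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
        luring_resubmit(s, luringcb);
    }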

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io_uring.c   | 75 +++++++++++++++++++++++++++++++++++++++-------
 block/trace-events |  3 +-
 2 files changed, 67 insertions(+), 11 deletions(-)

diff --git a/block/io_uring.c b/block/io_uring.c
index 97e4f876d7..12cef71175 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -28,6 +28,12 @@ typedef struct LuringAIOCB {
     QEMUIOVector *qiov;
     bool is_read;
     QSIMPLEQ_ENTRY(LuringAIOCB) next;
+
+    /* Buffered reads may require resubmission, see
+     * luring_resubmit_short_read().
+     */
+    int total_read;
+    QEMUIOVector resubmit_qiov;
 } LuringAIOCB;
 
 typedef struct LuringQueue {
@@ -99,6 +105,43 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
     s->io_q.in_queue++;
 }
 
+/* Before Linux commit 9d93a3f5a0c ("io_uring: punt short reads to async
+ * context") a buffered I/O request with the start of the file range in the
+ * page cache could result in a short read.  Applications need to resubmit the
+ * remaining read request.
+ *
+ * This is a slow path but recent kernels never take it.
+ */
+static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
+                                       int nread)
+{
+    QEMUIOVector *resubmit_qiov;
+    size_t remaining;
+
+    trace_luring_resubmit_short_read(s, luringcb, nread);
+
+    /* Update read position */
+    luringcb->total_read += nread;
+    remaining = luringcb->qiov->size - luringcb->total_read;
+
+    /* Shorten qiov */
+    resubmit_qiov = &luringcb->resubmit_qiov;
+    if (resubmit_qiov->iov == NULL) {
+        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
+    } else {
+        qemu_iovec_reset(resubmit_qiov);
+    }
+    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
+                      remaining);
+
+    /* Update sqe */
+    luringcb->sqeq.off += nread;
+    luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
+    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
+
+    luring_resubmit(s, luringcb);
+}
+
 /**
  * luring_process_completions:
  * @s: AIO state
@@ -135,6 +178,7 @@ static void luring_process_completions(LuringState *s)
     while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
         LuringAIOCB *luringcb;
         int ret;
+        int total_bytes;
 
         if (!cqes) {
             break;
@@ -150,25 +194,36 @@ static void luring_process_completions(LuringState *s)
 
         trace_luring_process_completion(s, luringcb, ret);
 
-        if (ret == luringcb->qiov->size) {
+        /* total_read is non-zero only for resubmitted read requests */
+        total_bytes = ret + luringcb->total_read;
+
+        if (ret < 0) {
+            if (ret == -EINTR) {
+                luring_resubmit(s, luringcb);
+                continue;
+            }
+        } else if (total_bytes == luringcb->qiov->size) {
             ret = 0;
-        } else if (ret >= 0) {
+        } else {
             /* Short Read/Write */
             if (luringcb->is_read) {
-                /* Read, pad with zeroes */
-                qemu_iovec_memset(luringcb->qiov, ret, 0,
-                luringcb->qiov->size - ret);
-                ret = 0;
+                if (ret > 0) {
+                    luring_resubmit_short_read(s, luringcb, ret);
+                    continue;
+                } else {
+                    /* Pad with zeroes */
+                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
+                                      luringcb->qiov->size - total_bytes);
+                    ret = 0;
+                }
             } else {
                 ret = -ENOSPC;;
             }
-        /* Add to overflow queue to be resubmitted later */
-        } else if (ret == -EINTR) {
-            luring_resubmit(s, luringcb);
-            continue;
         }
         luringcb->ret = ret;
 
+        qemu_iovec_destroy(&luringcb->resubmit_qiov);
+
         /*
          * If the coroutine is already entered it must be in ioq_submit()
          * and will notice luringcb->ret has been filled in when it
diff --git a/block/trace-events b/block/trace-events
index 02952fe4cb..f434cac634 100644
--- a/block/trace-events
+++ b/block/trace-events
@@ -60,7 +60,7 @@ qmp_block_stream(void *bs) "bs %p"
 file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
 file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
 
-#io_uring.c
+# io_uring.c
 luring_init_state(void *s, size_t size) "s %p size %zu"
 luring_cleanup_state(void *s) "%p freed"
 luring_io_plug(void *s) "LuringState %p plug"
@@ -70,6 +70,7 @@ luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
 luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
 luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
 luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
+luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
 
 # qcow2.c
 qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
-- 
2.21.0




* Re: [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 1/3] block/io_uring: add submission and completion trace events Stefan Hajnoczi
@ 2019-07-16  7:04   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 8+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-07-16  7:04 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Julia Suvorova, Aarushi Mehta

On 7/15/19 10:19 PM, Stefan Hajnoczi wrote:
> It is useful to follow individual requests as they are submitted.  Add
> trace events that show details of each request.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> ---
>  block/io_uring.c   | 5 +++++
>  block/trace-events | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/block/io_uring.c b/block/io_uring.c
> index 22e8d3d9ca..19919da4c9 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -128,6 +128,8 @@ static void luring_process_completions(LuringState *s)
>          LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
>          ret = cqes->res;
>  
> +        trace_luring_process_completion(s, luringcb, ret);
> +
>          if (ret == luringcb->qiov->size) {
>              ret = 0;
>          } else if (ret >= 0) {
> @@ -233,6 +235,7 @@ static int ioq_submit(LuringState *s)
>              QSIMPLEQ_REMOVE_HEAD(&s->io_q.sq_overflow, next);
>          }
>          ret = io_uring_submit(&s->ring);
> +        trace_luring_io_uring_submit(s, ret);
>          /* Prevent infinite loop if submission is refused */
>          if (ret <= 0) {
>              if (ret == -EAGAIN) {
> @@ -339,6 +342,8 @@ int coroutine_fn luring_co_submit(BlockDriverState *bs, LuringState *s, int fd,
>          .is_read    = (type == QEMU_AIO_READ),
>      };
>  
> +    trace_luring_co_submit(bs, s, &luringcb, fd, offset, qiov ? qiov->size : 0, type);
> +
>      ret = luring_do_submit(fd, &luringcb, s, offset, type);
>      if (ret < 0) {
>          return ret;
> diff --git a/block/trace-events b/block/trace-events
> index 069779773b..02952fe4cb 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -67,6 +67,9 @@ luring_io_plug(void *s) "LuringState %p plug"
>  luring_io_unplug(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
>  luring_do_submit(void *s, int blocked, int plugged, int queued, int inflight) "LuringState %p blocked %d plugged %d queued %d inflight %d"
>  luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
> +luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
> +luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
> +luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
>  
>  # qcow2.c
>  qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
> 



* Re: [Qemu-devel] [PATCH 3/3] block/io_uring: resubmit short buffered reads
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 3/3] block/io_uring: resubmit short buffered reads Stefan Hajnoczi
@ 2019-07-16  7:15   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 8+ messages in thread
From: Philippe Mathieu-Daudé @ 2019-07-16  7:15 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Julia Suvorova, Aarushi Mehta

On 7/15/19 10:19 PM, Stefan Hajnoczi wrote:
> The io_uring API had unusual read behavior up until recently, where
> short reads could occur when the start of the file range was in the page
> cache and a later portion was not in the page cache.  Normally read(2)
> does not expose this detail to applications and this behavior has been
> fixed in Linux commit 9d93a3f5a0c ("io_uring: punt short reads to async
> context").
> 
> In the meantime Linux distros have shipped kernels where io_uring
> exhibits the old behavior and there is no simple way to detect it.
> 
> Add a slow path for resubmitting short read requests.  The idea is
> simple: shorten the iovecs and increment the file offset each time a
> short read occurs and then resubmit the request.  The implementation
> requires adding additional fields to LuringAIOCB to keep track of where
> we were.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block/io_uring.c   | 75 +++++++++++++++++++++++++++++++++++++++-------
>  block/trace-events |  3 +-
>  2 files changed, 67 insertions(+), 11 deletions(-)
> 
> diff --git a/block/io_uring.c b/block/io_uring.c
> index 97e4f876d7..12cef71175 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -28,6 +28,12 @@ typedef struct LuringAIOCB {
>      QEMUIOVector *qiov;
>      bool is_read;
>      QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +
> +    /* Buffered reads may require resubmission, see
> +     * luring_resubmit_short_read().
> +     */
> +    int total_read;
> +    QEMUIOVector resubmit_qiov;
>  } LuringAIOCB;
>  
>  typedef struct LuringQueue {
> @@ -99,6 +105,43 @@ static void luring_resubmit(LuringState *s, LuringAIOCB *luringcb)
>      s->io_q.in_queue++;
>  }
>  
> +/* Before Linux commit 9d93a3f5a0c ("io_uring: punt short reads to async
> + * context") a buffered I/O request with the start of the file range in the
> + * page cache could result in a short read.  Applications need to resubmit the
> + * remaining read request.
> + *
> + * This is a slow path but recent kernels never take it.
> + */
> +static void luring_resubmit_short_read(LuringState *s, LuringAIOCB *luringcb,
> +                                       int nread)
> +{
> +    QEMUIOVector *resubmit_qiov;
> +    size_t remaining;
> +
> +    trace_luring_resubmit_short_read(s, luringcb, nread);
> +
> +    /* Update read position */
> +    luringcb->total_read += nread;
> +    remaining = luringcb->qiov->size - luringcb->total_read;
> +
> +    /* Shorten qiov */
> +    resubmit_qiov = &luringcb->resubmit_qiov;
> +    if (resubmit_qiov->iov == NULL) {
> +        qemu_iovec_init(resubmit_qiov, luringcb->qiov->niov);
> +    } else {
> +        qemu_iovec_reset(resubmit_qiov);
> +    }
> +    qemu_iovec_concat(resubmit_qiov, luringcb->qiov, luringcb->total_read,
> +                      remaining);
> +
> +    /* Update sqe */
> +    luringcb->sqeq.off += nread;
> +    luringcb->sqeq.addr = (__u64)(uintptr_t)luringcb->resubmit_qiov.iov;
> +    luringcb->sqeq.len = luringcb->resubmit_qiov.niov;
> +
> +    luring_resubmit(s, luringcb);
> +}
> +
>  /**
>   * luring_process_completions:
>   * @s: AIO state
> @@ -135,6 +178,7 @@ static void luring_process_completions(LuringState *s)
>      while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
>          LuringAIOCB *luringcb;
>          int ret;
> +        int total_bytes;
>  
>          if (!cqes) {
>              break;
> @@ -150,25 +194,36 @@ static void luring_process_completions(LuringState *s)
>  
>          trace_luring_process_completion(s, luringcb, ret);
>  
> -        if (ret == luringcb->qiov->size) {
> +        /* total_read is non-zero only for resubmitted read requests */
> +        total_bytes = ret + luringcb->total_read;
> +
> +        if (ret < 0) {
> +            if (ret == -EINTR) {
> +                luring_resubmit(s, luringcb);
> +                continue;
> +            }

Else fail with ret = -errno. OK.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> +        } else if (total_bytes == luringcb->qiov->size) {
>              ret = 0;
> -        } else if (ret >= 0) {
> +        } else {
>              /* Short Read/Write */
>              if (luringcb->is_read) {
> -                /* Read, pad with zeroes */
> -                qemu_iovec_memset(luringcb->qiov, ret, 0,
> -                luringcb->qiov->size - ret);
> -                ret = 0;
> +                if (ret > 0) {
> +                    luring_resubmit_short_read(s, luringcb, ret);
> +                    continue;
> +                } else {
> +                    /* Pad with zeroes */
> +                    qemu_iovec_memset(luringcb->qiov, total_bytes, 0,
> +                                      luringcb->qiov->size - total_bytes);
> +                    ret = 0;
> +                }
>              } else {
>                  ret = -ENOSPC;;
>              }
> -        /* Add to overflow queue to be resubmitted later */
> -        } else if (ret == -EINTR) {
> -            luring_resubmit(s, luringcb);
> -            continue;
>          }
>          luringcb->ret = ret;
>  
> +        qemu_iovec_destroy(&luringcb->resubmit_qiov);
> +
>          /*
>           * If the coroutine is already entered it must be in ioq_submit()
>           * and will notice luringcb->ret has been filled in when it
> diff --git a/block/trace-events b/block/trace-events
> index 02952fe4cb..f434cac634 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -60,7 +60,7 @@ qmp_block_stream(void *bs) "bs %p"
>  file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int type) "acb %p opaque %p offset %"PRId64" count %d type %d"
>  file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
>  
> -#io_uring.c
> +# io_uring.c

(left over from patch #1)

>  luring_init_state(void *s, size_t size) "s %p size %zu"
>  luring_cleanup_state(void *s) "%p freed"
>  luring_io_plug(void *s) "LuringState %p plug"
> @@ -70,6 +70,7 @@ luring_do_submit_done(void *s, int ret) "LuringState %p submitted to kernel %d"
>  luring_co_submit(void *bs, void *s, void *luringcb, int fd, uint64_t offset, size_t nbytes, int type) "bs %p s %p luringcb %p fd %d offset %" PRId64 " nbytes %zd type %d"
>  luring_process_completion(void *s, void *aiocb, int ret) "LuringState %p luringcb %p ret %d"
>  luring_io_uring_submit(void *s, int ret) "LuringState %p ret %d"
> +luring_resubmit_short_read(void *s, void *luringcb, int nread) "LuringState %p luringcb %p nread %d"
>  
>  # qcow2.c
>  qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 0x%" PRIx64 " bytes %d"
> 



* Re: [Qemu-devel] [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads
  2019-07-15 20:19 [Qemu-devel] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2019-07-15 20:19 ` [Qemu-devel] [PATCH 3/3] block/io_uring: resubmit short buffered reads Stefan Hajnoczi
@ 2019-08-21 22:20 ` John Snow
  2019-08-22  9:16   ` Stefan Hajnoczi
  3 siblings, 1 reply; 8+ messages in thread
From: John Snow @ 2019-08-21 22:20 UTC (permalink / raw)
  To: Stefan Hajnoczi, qemu-devel
  Cc: Kevin Wolf, qemu-block, Max Reitz, Stefan Hajnoczi,
	Julia Suvorova, Aarushi Mehta



On 7/15/19 4:19 PM, Stefan Hajnoczi wrote:
> Short reads are possible with cache=writeback (see Patch 3 for details).
> Handle this by resubmitting requests until the read is completed.
> 
> Patch 1 adds trace events useful for debugging io_uring.
> 
> Patch 2 fixes EINTR.  This lays the groundwork for resubmitting requests in
> Patch 3.
> 
> Aarushi: Feel free to squash this into your patch series if you are happy with
> > the code; I don't mind if the authorship information is lost.  After applying
> these patches I can successfully boot a Fedora 30 guest qcow2 file with
> cache=writeback.
> 
> Based-on: <20190610134905.22294-1-mehta.aaru20@gmail.com>
> 
> Stefan Hajnoczi (3):
>   block/io_uring: add submission and completion trace events
>   block/io_uring: fix EINTR request resubmission
>   block/io_uring: resubmit short buffered reads
> 
>  block/io_uring.c   | 136 ++++++++++++++++++++++++++++++++++-----------
>  block/trace-events |   6 +-
>  2 files changed, 108 insertions(+), 34 deletions(-)
> 

Since this is over the 30-day mark, I'm going to assume this WAS
squashed into Aarushi's patchset, and it's safe to drop this from the
review queue for now?

--js



* Re: [Qemu-devel] [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads
  2019-08-21 22:20 ` [Qemu-devel] [Qemu-block] [PATCH 0/3] block/io_uring: fix EINTR and resubmit short reads John Snow
@ 2019-08-22  9:16   ` Stefan Hajnoczi
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2019-08-22  9:16 UTC (permalink / raw)
  To: John Snow
  Cc: Kevin Wolf, qemu-block, qemu-devel, Max Reitz, Stefan Hajnoczi,
	Julia Suvorova, Aarushi Mehta


On Wed, Aug 21, 2019 at 06:20:39PM -0400, John Snow wrote:
> 
> 
> On 7/15/19 4:19 PM, Stefan Hajnoczi wrote:
> > Short reads are possible with cache=writeback (see Patch 3 for details).
> > Handle this by resubmitting requests until the read is completed.
> > 
> > Patch 1 adds trace events useful for debugging io_uring.
> > 
> > Patch 2 fixes EINTR.  This lays the groundwork for resubmitting requests in
> > Patch 3.
> > 
> > Aarushi: Feel free to squash this into your patch series if you are happy with
> > > the code; I don't mind if the authorship information is lost.  After applying
> > these patches I can successfully boot a Fedora 30 guest qcow2 file with
> > cache=writeback.
> > 
> > Based-on: <20190610134905.22294-1-mehta.aaru20@gmail.com>
> > 
> > Stefan Hajnoczi (3):
> >   block/io_uring: add submission and completion trace events
> >   block/io_uring: fix EINTR request resubmission
> >   block/io_uring: resubmit short buffered reads
> > 
> >  block/io_uring.c   | 136 ++++++++++++++++++++++++++++++++++-----------
> >  block/trace-events |   6 +-
> >  2 files changed, 108 insertions(+), 34 deletions(-)
> > 
> 
> Since this is over the 30-day mark, I'm going to assume this WAS
> squashed into Aarushi's patchset, and it's safe to drop this from the
> review queue for now?

Yes, Aarushi included it in the io_uring patch series.

Stefan
