* [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX
@ 2023-03-15 12:13 Hanna Czenczek
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-15 12:13 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
	Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng

Hi,

We accept I/O vectors with up to 1024 (IOV_MAX) elements from guests.
When a guest request does not conform to the underlying storage's
alignment requirements, we pad it with head and/or tail buffers as
necessary, which are simply appended to the I/O vector.

As of 4c002cef, we (sensibly) check that such-padded vectors do not then
exceed IOV_MAX.  However, there seems to be no sensible error handling;
instead, the guest simply receives an I/O error.

That’s a problem, because it submitted a perfectly sensible I/O request
(which adhered to the alignment the guest device has announced), but
receives an error back.  That shouldn’t happen.

I think we need to handle such excess lengths internally, but I’m not
sure how exactly.  What I can think of is either breaking up the request
into two (which seems like it might cause side effects) or joining up to
three vector elements into one, using a bounce buffer.  The latter
seemed simpler to me, so this RFC does that.  Still, it’s an RFC because
I don’t know whether that’s the ideal solution.
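
To illustrate the arithmetic: a guest can legally submit a vector of
exactly IOV_MAX (1024) elements that is unaligned at both ends.  Padding
then adds a head and a tail entry, for 1026 elements, i.e. a surplus of
two.  Merging that surplus plus one more element (hence up to three
elements in total) into a single bounce-buffer entry brings the vector
back down to 1024.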


Hanna Czenczek (2):
  block: Split padded I/O vectors exceeding IOV_MAX
  iotests/iov-padding: New test

 block/io.c                               | 139 ++++++++++++++++++++++-
 util/iov.c                               |   4 -
 tests/qemu-iotests/tests/iov-padding     |  85 ++++++++++++++
 tests/qemu-iotests/tests/iov-padding.out |  59 ++++++++++
 4 files changed, 277 insertions(+), 10 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/iov-padding
 create mode 100644 tests/qemu-iotests/tests/iov-padding.out

-- 
2.39.1




* [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 12:13 [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
@ 2023-03-15 12:13 ` Hanna Czenczek
  2023-03-15 18:25   ` Eric Blake
                     ` (2 more replies)
  2023-03-15 12:13 ` [RFC 2/2] iotests/iov-padding: New test Hanna Czenczek
  2023-03-15 15:29 ` [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
  2 siblings, 3 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-15 12:13 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
	Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng

When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter).  However,
that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
   shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.
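
Schematically, assuming a request that needs both head and tail padding
(with g0 ... g1023 being the guest's 1024 vector elements):

  before:  [head] [g0] [g1] [g2] [g3] ... [g1023] [tail]      (1026 entries)
  after:   [head] [g0+g1+g2 bounce] [g3] ... [g1023] [tail]   (1024 entries)

The bounce entry is a single buffer of the three merged elements'
combined length; for writes, it is filled from them up front, and for
reads, its content is copied back into them afterwards.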

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 util/iov.c |   4 --
 2 files changed, 133 insertions(+), 10 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8974d46941..ee226d23d6 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,6 +1435,12 @@ out:
  * @merge_reads is true for small requests,
  * if @buf_len == @head + bytes + @tail. In this case it is possible that both
  * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one.  @collapse_buf acts as the
+ * bounce buffer in such cases.
  */
 typedef struct BdrvRequestPadding {
     uint8_t *buf;
@@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
     size_t head;
     size_t tail;
     bool merge_reads;
+    bool write;
     QEMUIOVector local_qiov;
+
+    uint8_t *collapse_buf;
+    size_t collapse_len;
+    QEMUIOVector collapsed_qiov;
 } BdrvRequestPadding;
 
 static bool bdrv_init_padding(BlockDriverState *bs,
                               int64_t offset, int64_t bytes,
+                              bool write,
                               BdrvRequestPadding *pad)
 {
     int64_t align = bs->bl.request_alignment;
@@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
         pad->tail_buf = pad->buf + pad->buf_len - align;
     }
 
+    pad->write = write;
+
     return true;
 }
 
+/*
+ * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
+ * elements), collapse some elements into a single one so that it adheres to the
+ * IOV_MAX limit again.
+ *
+ * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
+ * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
+ * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
+ * buffer content back for read requests.
+ *
+ * Note that we will not touch the padding head or tail entries here.  We cannot
+ * move them to a bounce buffer, because for RMWs, both head and tail expect to
+ * be in an aligned buffer with scratch space after (head) or before (tail) to
+ * perform the read into (because the whole buffer must be aligned, but head's
+ * and tail's lengths naturally cannot be aligned, because they provide padding
+ * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
+ * cannot provide such scratch space.
+ *
+ * Therefore, this function collapses the first IOV elements after the
+ * (potential) head element.
+ */
+static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
+{
+    int surplus_count, collapse_count;
+    struct iovec *collapse_iovs;
+    QEMUIOVector collapse_qiov;
+    size_t move_count;
+
+    surplus_count = pad->local_qiov.niov - IOV_MAX;
+    /* Not exceeding the limit?  Nothing to collapse. */
+    if (surplus_count <= 0) {
+        return;
+    }
+
+    /*
+     * Only head and tail can have led to the number of entries exceeding
+     * IOV_MAX, so we can exceed it by the head and tail at most
+     */
+    assert(surplus_count <= !!pad->head + !!pad->tail);
+
+    /*
+     * We merge (collapse) `surplus_count` entries into the first entry that is
+     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
+     * there is no head, or entry 1 if there is one.
+     */
+    collapse_count = surplus_count + 1;
+    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
+
+    /* There must be no previously collapsed buffers in `pad` */
+    assert(pad->collapse_len == 0);
+    for (int i = 0; i < collapse_count; i++) {
+        pad->collapse_len += collapse_iovs[i].iov_len;
+    }
+    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
+
+    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
+    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
+    qemu_iovec_init_slice(&pad->collapsed_qiov,
+                          &collapse_qiov, 0, pad->collapse_len);
+    if (pad->write) {
+        /* For writes: Copy all to-be-collapsed data into collapse_buf */
+        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
+                          pad->collapse_buf, pad->collapse_len);
+    }
+
+    /* Create the collapsed entry in pad->local_qiov */
+    collapse_iovs[0] = (struct iovec){
+        .iov_base = pad->collapse_buf,
+        .iov_len = pad->collapse_len,
+    };
+
+    /*
+     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
+     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
+     * immediately after the collapse target.
+     *
+     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
+     * `collapse_iovs[collapse_count]`.  The number of elements to move is the
+     * number of elements remaining in `pad->local_qiov` after and including
+     * `collapse_iovs[collapse_count]`.
+     */
+    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
+        &collapse_iovs[collapse_count];
+    memmove(&collapse_iovs[1],
+            &collapse_iovs[collapse_count],
+            move_count * sizeof(pad->local_qiov.iov[0]));
+
+    pad->local_qiov.niov -= surplus_count;
+}
+
 static int coroutine_fn GRAPH_RDLOCK
 bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
                       BdrvRequestPadding *pad, bool zero_middle)
@@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
         qemu_vfree(pad->buf);
         qemu_iovec_destroy(&pad->local_qiov);
     }
+    if (pad->collapse_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into a
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
+                                pad->collapse_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_buf);
+        qemu_iovec_destroy(&pad->collapsed_qiov);
+    }
     memset(pad, 0, sizeof(*pad));
 }
 
@@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
  * read of padding, bdrv_padding_rmw_read() should be called separately if
  * needed.
  *
+ * @write is true for write requests, false for read requests.
+ *
  * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
  *  - on function start they represent original request
  *  - on failure or when padding is not needed they are unchanged
@@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
+                            bool write,
                             BdrvRequestPadding *pad, bool *padded,
                             BdrvRequestFlags *flags)
 {
@@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
 
     bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
 
-    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
         if (padded) {
             *padded = false;
         }
@@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
         bdrv_padding_destroy(pad);
         return ret;
     }
+
+    /*
+     * If the IOV is too long after padding, merge (collapse) some entries to
+     * make it not exceed IOV_MAX
+     */
+    bdrv_padding_collapse(pad, bs);
+    assert(pad->local_qiov.niov <= IOV_MAX);
+
     *bytes += pad->head + pad->tail;
     *offset -= pad->head;
     *qiov = &pad->local_qiov;
@@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
         flags |= BDRV_REQ_COPY_ON_READ;
     }
 
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     /* This flag doesn't make sense for padding or zero writes */
     flags &= ~BDRV_REQ_REGISTERED_BUF;
 
-    padding = bdrv_init_padding(bs, offset, bytes, &pad);
+    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
         bdrv_make_request_serialising(req, align);
@@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
          * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
          * alignment only if there is no ZERO flag.
          */
-        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded, &flags);
+        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+                               &pad, &padded, &flags);
         if (ret < 0) {
             return ret;
         }
diff --git a/util/iov.c b/util/iov.c
index b4be580022..4d0d381949 100644
--- a/util/iov.c
+++ b/util/iov.c
@@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
     }
 
     total_niov = !!head_len + mid_niov + !!tail_len;
-    if (total_niov > IOV_MAX) {
-        return -EINVAL;
-    }
-
     if (total_niov == 1) {
         qemu_iovec_init_buf(qiov, NULL, 0);
         p = &qiov->local_iov;
-- 
2.39.1




* [RFC 2/2] iotests/iov-padding: New test
  2023-03-15 12:13 [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
@ 2023-03-15 12:13 ` Hanna Czenczek
  2023-03-15 15:29 ` [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
  2 siblings, 0 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-15 12:13 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Hanna Czenczek, Vladimir Sementsov-Ogievskiy,
	Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng

Test that even vectored IO requests with 1024 vector elements that are
not aligned to the device's request alignment will succeed.
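
For instance, one of the requests issued below is, schematically:

  writev -P 2 512 512 512 ... 512 4096

i.e. an offset of 512 (unaligned for the 4k alignment enforced via
blkdebug), followed by 1023 512-byte buffer lengths and one final
4096-byte length, for 1024 vector elements in total.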

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 tests/qemu-iotests/tests/iov-padding     | 85 ++++++++++++++++++++++++
 tests/qemu-iotests/tests/iov-padding.out | 59 ++++++++++++++++
 2 files changed, 144 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/iov-padding
 create mode 100644 tests/qemu-iotests/tests/iov-padding.out

diff --git a/tests/qemu-iotests/tests/iov-padding b/tests/qemu-iotests/tests/iov-padding
new file mode 100755
index 0000000000..b9604900c7
--- /dev/null
+++ b/tests/qemu-iotests/tests/iov-padding
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# group: rw quick
+#
+# Check the interaction of request padding (to fit alignment restrictions) with
+# vectored I/O from the guest
+#
+# Copyright Red Hat
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+seq=$(basename $0)
+echo "QA output created by $seq"
+
+status=1	# failure is the default!
+
+_cleanup()
+{
+    _cleanup_test_img
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+cd ..
+. ./common.rc
+. ./common.filter
+
+_supported_fmt raw
+_supported_proto file
+
+_make_test_img 1M
+
+IMGSPEC="driver=blkdebug,align=4096,image.driver=file,image.filename=$TEST_IMG"
+
+# Four combinations:
+# - Offset 4096, length 1023 * 512 + 512: Fully aligned to 4k
+# - Offset 4096, length 1023 * 512 + 4096: Head is aligned, tail is not
+# - Offset 512, length 1023 * 512 + 512: Neither head nor tail are aligned
+# - Offset 512, length 1023 * 512 + 4096: Tail is aligned, head is not
+for start_offset in 4096 512; do
+    for last_element_length in 512 4096; do
+        length=$((1023 * 512 + $last_element_length))
+
+        echo
+        echo "== performing 1024-element vectored requests to image (offset: $start_offset; length: $length) =="
+
+        # Fill with data for testing
+        $QEMU_IO -c 'write -P 1 0 1M' "$TEST_IMG" | _filter_qemu_io
+
+        # 1023 512-byte buffers, and then one with length $last_element_length
+        cmd_params="-P 2 $start_offset $(yes 512 | head -n 1023 | tr '\n' ' ') $last_element_length"
+        QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
+            -c "writev $cmd_params" \
+            --image-opts \
+            "$IMGSPEC" \
+            | _filter_qemu_io
+
+        # Read all patterns -- read the part we just wrote with writev twice,
+        # once "normally", and once with a readv, so we see that that works, too
+        QEMU_IO_OPTIONS="$QEMU_IO_OPTIONS_NO_FMT" $QEMU_IO \
+            -c "read -P 1 0 $start_offset" \
+            -c "read -P 2 $start_offset $length" \
+            -c "readv $cmd_params" \
+            -c "read -P 1 $((start_offset + length)) $((1024 * 1024 - length - start_offset))" \
+            --image-opts \
+            "$IMGSPEC" \
+            | _filter_qemu_io
+    done
+done
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/tests/iov-padding.out b/tests/qemu-iotests/tests/iov-padding.out
new file mode 100644
index 0000000000..e07a91fac7
--- /dev/null
+++ b/tests/qemu-iotests/tests/iov-padding.out
@@ -0,0 +1,59 @@
+QA output created by iov-padding
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
+
+== performing 1024-element vectored requests to image (offset: 4096; length: 524288) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 4096
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 520192/520192 bytes at offset 528384
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 4096; length: 527872) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 4096/4096 bytes at offset 0
+4 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 4096
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 516608/516608 bytes at offset 531968
+504.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 512; length: 524288) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 524288/524288 bytes at offset 512
+512 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 523776/523776 bytes at offset 524800
+511.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== performing 1024-element vectored requests to image (offset: 512; length: 527872) ==
+wrote 1048576/1048576 bytes at offset 0
+1 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 527872/527872 bytes at offset 512
+515.500 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 520192/520192 bytes at offset 528384
+508 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+*** done
-- 
2.39.1




* Re: [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 12:13 [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
  2023-03-15 12:13 ` [RFC 2/2] iotests/iov-padding: New test Hanna Czenczek
@ 2023-03-15 15:29 ` Stefan Hajnoczi
  2023-03-15 16:05   ` Hanna Czenczek
  2 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-03-15 15:29 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Eric Blake,
	Kevin Wolf, Fam Zheng

On Wed, Mar 15, 2023 at 01:13:28PM +0100, Hanna Czenczek wrote:
> Hi,
> 
> We accept I/O vectors with up to 1024 (IOV_MAX) elements from guests.
> When a guest request does not conform to the underlying storage's
> alignment requirements, we pad it with head and/or tail buffers as
> necessary, which are simply appended to the I/O vector.
> 
> As of 4c002cef, we (sensibly) check that such-padded vectors do not then
> exceed IOV_MAX.  However, there seems to be no sensible error handling;
> instead, the guest simply receives an I/O error.
> 
> That???s a problem, because it submitted a perfectly sensible I/O request

Looks like there is an encoding issue. I get 3 question marks instead of
an apostrophe. lore.kernel.org also renders mojibake:
https://lore.kernel.org/qemu-devel/20230315121330.29679-1-hreitz@redhat.com/

Stefan


* Re: [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 15:29 ` [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
@ 2023-03-15 16:05   ` Hanna Czenczek
  0 siblings, 0 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-15 16:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Eric Blake,
	Kevin Wolf, Fam Zheng

On 15.03.23 16:29, Stefan Hajnoczi wrote:
> On Wed, Mar 15, 2023 at 01:13:28PM +0100, Hanna Czenczek wrote:
>> Hi,
>>
>> We accept I/O vectors with up to 1024 (IOV_MAX) elements from guests.
>> When a guest request does not conform to the underlying storage's
>> alignment requirements, we pad it with head and/or tail buffers as
>> necessary, which are simply appended to the I/O vector.
>>
>> As of 4c002cef, we (sensibly) check that such-padded vectors do not then
>> exceed IOV_MAX.  However, there seems to be no sensible error handling;
>> instead, the guest simply receives an I/O error.
>>
>> That???s a problem, because it submitted a perfectly sensible I/O request
> Looks like there is an encoding issue. I get 3 question marks instead of
> an apostrophe. lore.kernel.org also renders mojibake:
> https://lore.kernel.org/qemu-devel/20230315121330.29679-1-hreitz@redhat.com/

Hm, yeah, no “charset=UTF-8” in that mail’s Content-type...  I think 
that’s because it just uses what’s in the patch file and sends it, and I 
needed to force it to be format.headers="Content-type: text/plain" in 
.gitconfig at some point because of some Mimecast thing.  I hope putting 
format.headers="Content-type: text/plain; charset=UTF-8" will fix that 
for the future.  Thanks for telling me!

Hanna




* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
@ 2023-03-15 18:25   ` Eric Blake
  2023-03-16  9:43     ` Hanna Czenczek
  2023-03-15 18:48   ` Stefan Hajnoczi
  2023-03-16 17:44   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2023-03-15 18:25 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
	Stefan Hajnoczi, Fam Zheng

On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
> 
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit.  As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
> 
> To the guest, this appears as a random I/O error.  We should not return
> an I/O error to the guest when it issued a perfectly valid request.
> 
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter).  However,
> that does not seem exactly great.
> 
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>    shorter.
> 
> I am wary of (1), because it seems like it may have unintended side
> effects.
> 
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.

Agreed that approach 2 is more conservative.

> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  util/iov.c |   4 --
>  2 files changed, 133 insertions(+), 10 deletions(-)
>  
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.
> + *
> + * Note that we will not touch the padding head or tail entries here.  We cannot
> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
> + * be in an aligned buffer with scratch space after (head) or before (tail) to
> + * perform the read into (because the whole buffer must be aligned, but head's
> + * and tail's lengths naturally cannot be aligned, because they provide padding
> + * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
> + * cannot provide such scratch space.
> + *
> + * Therefore, this function collapses the first IOV elements after the
> + * (potential) head element.

It looks like you blindly pick the first one or two non-padding iovs
at the front of the array.  Would it be any wiser (in terms of less
memmove() action or even a smaller bounce buffer) to pick iovs at the
end of the array, and/or a sequential search for the smallest
neighboring iovs?  Or is that a micro-optimization that costs more
than it saves?

Would it be any easier to swap the order of padding vs. collapsing?
That is, we already know the user is giving us a long list of iovs; if
it is 1024 elements long, and we can detect that padding will be
needed, should we collapse before padding instead of padding, finding
that we now have 1026, and memmove'ing back into 1024?

But logic-wise, your patch looks correct to me.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
  2023-03-15 18:25   ` Eric Blake
@ 2023-03-15 18:48   ` Stefan Hajnoczi
  2023-03-16 10:10     ` Hanna Czenczek
  2023-03-16 17:44   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-03-15 18:48 UTC (permalink / raw)
  To: Hanna Czenczek
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Eric Blake,
	Kevin Wolf, Fam Zheng

On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
> 
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit.  As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
> 
> To the guest, this appears as a random I/O error.  We should not return
> an I/O error to the guest when it issued a perfectly valid request.
> 
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter).  However,
> that does not seem exactly great.
> 
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>    shorter.
> 
> I am wary of (1), because it seems like it may have unintended side
> effects.
> 
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.

Looks like a reasonable solution. I think the code is correct and I
posted ideas for making it easier to understand.

> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  util/iov.c |   4 --
>  2 files changed, 133 insertions(+), 10 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..ee226d23d6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1435,6 +1435,12 @@ out:
>   * @merge_reads is true for small requests,
>   * if @buf_len == @head + bytes + @tail. In this case it is possible that both
>   * head and tail exist but @buf_len == align and @tail_buf == @buf.
> + *
> + * @write is true for write requests, false for read requests.
> + *
> + * If padding makes the vector too long (exceeding IOV_MAX), then we need to
> + * merge existing vector elements into a single one.  @collapse_buf acts as the
> + * bounce buffer in such cases.
>   */
>  typedef struct BdrvRequestPadding {
>      uint8_t *buf;
> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
>      size_t head;
>      size_t tail;
>      bool merge_reads;
> +    bool write;
>      QEMUIOVector local_qiov;
> +
> +    uint8_t *collapse_buf;
> +    size_t collapse_len;
> +    QEMUIOVector collapsed_qiov;
>  } BdrvRequestPadding;
>  
>  static bool bdrv_init_padding(BlockDriverState *bs,
>                                int64_t offset, int64_t bytes,
> +                              bool write,
>                                BdrvRequestPadding *pad)
>  {
>      int64_t align = bs->bl.request_alignment;
> @@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
>          pad->tail_buf = pad->buf + pad->buf_len - align;
>      }
>  
> +    pad->write = write;
> +
>      return true;
>  }
>  
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.

The distinction between "collapse" and "collapsed" is subtle. I didn't
guess it right: I thought collapsed_qiov was a QEMUIOVector for
collapse_buf/collapse_len.

Please choose a name for collapsed_qiov that makes this clearer. Maybe
pre_collapse_qiov (i.e. the local_qiov iovecs that were replaced by
bdrv_padding_collapse)?

> + *
> + * Note that we will not touch the padding head or tail entries here.  We cannot
> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
> + * be in an aligned buffer with scratch space after (head) or before (tail) to
> + * perform the read into (because the whole buffer must be aligned, but head's
> + * and tail's lengths naturally cannot be aligned, because they provide padding
> + * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
> + * cannot provide such scratch space.

As someone who hasn't looked at this code for a while, I don't
understand this paragraph. Can you expand on why RMW is problematic
here? If not, don't worry, it's hard to explain iov juggling.

> + *
> + * Therefore, this function collapses the first IOV elements after the
> + * (potential) head element.
> + */
> +static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
> +{
> +    int surplus_count, collapse_count;
> +    struct iovec *collapse_iovs;
> +    QEMUIOVector collapse_qiov;
> +    size_t move_count;
> +
> +    surplus_count = pad->local_qiov.niov - IOV_MAX;
> +    /* Not exceeding the limit?  Nothing to collapse. */
> +    if (surplus_count <= 0) {
> +        return;
> +    }
> +
> +    /*
> +     * Only head and tail can have led to the number of entries exceeding
> +     * IOV_MAX, so we can exceed it by the head and tail at most
> +     */
> +    assert(surplus_count <= !!pad->head + !!pad->tail);
> +
> +    /*
> +     * We merge (collapse) `surplus_count` entries into the first entry that is
> +     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
> +     * there is no head, or entry 1 if there is one.
> +     */
> +    collapse_count = surplus_count + 1;
> +    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
> +
> +    /* There must be no previously collapsed buffers in `pad` */
> +    assert(pad->collapse_len == 0);
> +    for (int i = 0; i < collapse_count; i++) {
> +        pad->collapse_len += collapse_iovs[i].iov_len;
> +    }
> +    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
> +
> +    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
> +    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
> +    qemu_iovec_init_slice(&pad->collapsed_qiov,

Having collapse_qiov and collapsed_qiov in the same function is
confusing. IIUC collapse_qiov and collapsed_qiov reference the same
buffers, except that the last iovec in collapse_qiov has its original
length from local_qiov, whereas collapsed_qiov may shrink the last
iovec.

Maybe just naming collapse_qiov "qiov" or "tmp_qiov" would be less
confusing because it avoids making collapse_buf/collapse_len vs
collapsed_qiov more confusing.

> +                          &collapse_qiov, 0, pad->collapse_len);
> +    if (pad->write) {
> +        /* For writes: Copy all to-be-collapsed data into collapse_buf */
> +        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
> +                          pad->collapse_buf, pad->collapse_len);
> +    }
> +
> +    /* Create the collapsed entry in pad->local_qiov */
> +    collapse_iovs[0] = (struct iovec){
> +        .iov_base = pad->collapse_buf,
> +        .iov_len = pad->collapse_len,
> +    };
> +
> +    /*
> +     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
> +     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
> +     * immediately after the collapse target.
> +     *
> +     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
> +     * `collapse_iovs[collapse_count]`.  The number of elements to move is the
> +     * number of elements remaining in `pad->local_qiov` after and including
> +     * `collapse_iovs[collapse_count]`.
> +     */
> +    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
> +        &collapse_iovs[collapse_count];
> +    memmove(&collapse_iovs[1],
> +            &collapse_iovs[collapse_count],
> +            move_count * sizeof(pad->local_qiov.iov[0]));
> +
> +    pad->local_qiov.niov -= surplus_count;
> +}
> +
>  static int coroutine_fn GRAPH_RDLOCK
>  bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
>                        BdrvRequestPadding *pad, bool zero_middle)
> @@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>          qemu_vfree(pad->buf);
>          qemu_iovec_destroy(&pad->local_qiov);
>      }
> +    if (pad->collapse_buf) {
> +        if (!pad->write) {
> +            /*
> +             * If padding required elements in the vector to be collapsed into a
> +             * bounce buffer, copy the bounce buffer content back
> +             */
> +            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
> +                                pad->collapse_buf, pad->collapse_len);
> +        }
> +        qemu_vfree(pad->collapse_buf);
> +        qemu_iovec_destroy(&pad->collapsed_qiov);
> +    }

This is safe because qemu_iovec_init_slice() took copies of local_qiov
iovecs instead of references, but the code requires less thinking if
collapsed_qiov is destroyed before local_qiov. This change would be nice
if you respin.

>      memset(pad, 0, sizeof(*pad));
>  }
>  
> @@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>   * read of padding, bdrv_padding_rmw_read() should be called separately if
>   * needed.
>   *
> + * @write is true for write requests, false for read requests.
> + *
>   * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
>   *  - on function start they represent original request
>   *  - on failure or when padding is not needed they are unchanged
> @@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>  static int bdrv_pad_request(BlockDriverState *bs,
>                              QEMUIOVector **qiov, size_t *qiov_offset,
>                              int64_t *offset, int64_t *bytes,
> +                            bool write,
>                              BdrvRequestPadding *pad, bool *padded,
>                              BdrvRequestFlags *flags)
>  {
> @@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>  
>      bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>  
> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>          if (padded) {
>              *padded = false;
>          }
> @@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
>          bdrv_padding_destroy(pad);
>          return ret;
>      }
> +
> +    /*
> +     * If the IOV is too long after padding, merge (collapse) some entries to
> +     * make it not exceed IOV_MAX
> +     */
> +    bdrv_padding_collapse(pad, bs);
> +    assert(pad->local_qiov.niov <= IOV_MAX);
> +
>      *bytes += pad->head + pad->tail;
>      *offset -= pad->head;
>      *qiov = &pad->local_qiov;
> @@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>          flags |= BDRV_REQ_COPY_ON_READ;
>      }
>  
> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                           NULL, &flags);
> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> +                           &pad, NULL, &flags);
>      if (ret < 0) {
>          goto fail;
>      }
> @@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
>      /* This flag doesn't make sense for padding or zero writes */
>      flags &= ~BDRV_REQ_REGISTERED_BUF;
>  
> -    padding = bdrv_init_padding(bs, offset, bytes, &pad);
> +    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
>      if (padding) {
>          assert(!(flags & BDRV_REQ_NO_WAIT));
>          bdrv_make_request_serialising(req, align);
> @@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
>           * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
>           * alignment only if there is no ZERO flag.
>           */
> -        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                               &padded, &flags);
> +        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
> +                               &pad, &padded, &flags);
>          if (ret < 0) {
>              return ret;
>          }
> diff --git a/util/iov.c b/util/iov.c
> index b4be580022..4d0d381949 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
>      }
>  
>      total_niov = !!head_len + mid_niov + !!tail_len;
> -    if (total_niov > IOV_MAX) {
> -        return -EINVAL;
> -    }

Perhaps an assertion is good here, just to make sure it's detected if a
new caller is added some day. qemu_iovec_init_extended() is a public
API, so something unrelated to block layer padding might use it.
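
Something like this, as a sketch:

    assert(total_niov <= IOV_MAX);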

> -
>      if (total_niov == 1) {
>          qemu_iovec_init_buf(qiov, NULL, 0);
>          p = &qiov->local_iov;
> -- 
> 2.39.1
> 


* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 18:25   ` Eric Blake
@ 2023-03-16  9:43     ` Hanna Czenczek
  0 siblings, 0 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-16  9:43 UTC (permalink / raw)
  To: Eric Blake
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Kevin Wolf,
	Stefan Hajnoczi, Fam Zheng

On 15.03.23 19:25, Eric Blake wrote:
> On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit.  As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error.  We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally seems to work (because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
>> so emulate the request, so that the IOV_MAX does not matter).  However,
>> that does not seem exactly great.
>>
>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>>     shorter.
>>
>> I am wary of (1), because it seems like it may have unintended side
>> effects.
>>
>> (2) on the other hand seems relatively simple to implement, with
>> hopefully few side effects, so this patch does that.
> Agreed that approach 2 is more conservative.
>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   util/iov.c |   4 --
>>   2 files changed, 133 insertions(+), 10 deletions(-)
>>   
>> +/*
>> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
>> + * elements), collapse some elements into a single one so that it adheres to the
>> + * IOV_MAX limit again.
>> + *
>> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
>> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
>> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
>> + * buffer content back for read requests.
>> + *
>> + * Note that we will not touch the padding head or tail entries here.  We cannot
>> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
>> + * be in an aligned buffer with scratch space after (head) or before (tail) to
>> + * perform the read into (because the whole buffer must be aligned, but head's
>> + * and tail's lengths naturally cannot be aligned, because they provide padding
>> + * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
>> + * cannot provide such scratch space.
>> + *
>> + * Therefore, this function collapses the first IOV elements after the
>> + * (potential) head element.
> It looks like you blindly pick the first one or two non-padding iovs
> at the front of the array.  Would it be any wiser (in terms of less
> memmove() action or even a smaller bounce buffer) to pick iovs at the
> end of the array, and/or a sequential search for the smallest
> neighboring iovs?  Or is that a micro-optimization that costs more
> than it saves?

Right, I didn’t think of picking near the end.  That makes sense 
indeed!  If not for performance, at least it allows dropping the 
non-trivial comment for the memmove().

As for searching for the smallest buffers, I’m not sure.  I think it can make
sense performance-wise – iterating over 1024 elements will probably pay 
off quickly when you indeed have differences in buffer size there.  My 
main concern is that it would be more complicated, and I just don’t 
think that’s worth it for such a rare case.

> Would it be any easier to swap the order of padding vs. collapsing?
> That is, we already know the user is giving us a long list of iovs; if
> it is 1024 elements long, and we can detect that padding will be
> needed, should we collapse before padding instead of padding, finding
> that we now have 1026, and memmove'ing back into 1024?

I’d prefer that, but it’s difficult.  We need the temporary QIOV 
(pad->local_qiov) so we can merge entries, but this is only created by 
qemu_iovec_init_extended().

We can try to move the collapsing into qemu_iovec_init_extended(), but 
it would be a bit awkward still, and probably blow up that function’s 
interface (it’s in util/iov.c, so we can’t really immediately use the 
BdrvRequestPadding object).  I think, in the end, functionally not much 
would change, so I’d rather keep the order as it is (unless someone has 
a good idea here).

> But logic-wise, your patch looks correct to me.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!

Hanna




* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 18:48   ` Stefan Hajnoczi
@ 2023-03-16 10:10     ` Hanna Czenczek
  0 siblings, 0 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-16 10:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: qemu-block, qemu-devel, Vladimir Sementsov-Ogievskiy, Eric Blake,
	Kevin Wolf, Fam Zheng

On 15.03.23 19:48, Stefan Hajnoczi wrote:
> On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit.  As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error.  We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally seems to work (because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
>> so emulate the request, so that the IOV_MAX does not matter).  However,
>> that does not seem exactly great.
>>
>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>>     shorter.
>>
>> I am wary of (1), because it seems like it may have unintended side
>> effects.
>>
>> (2) on the other hand seems relatively simple to implement, with
>> hopefully few side effects, so this patch does that.
> Looks like a reasonable solution. I think the code is correct and I
> posted ideas for making it easier to understand.
>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   util/iov.c |   4 --
>>   2 files changed, 133 insertions(+), 10 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 8974d46941..ee226d23d6 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1435,6 +1435,12 @@ out:
>>    * @merge_reads is true for small requests,
>>    * if @buf_len == @head + bytes + @tail. In this case it is possible that both
>>    * head and tail exist but @buf_len == align and @tail_buf == @buf.
>> + *
>> + * @write is true for write requests, false for read requests.
>> + *
>> + * If padding makes the vector too long (exceeding IOV_MAX), then we need to
>> + * merge existing vector elements into a single one.  @collapse_buf acts as the
>> + * bounce buffer in such cases.
>>    */
>>   typedef struct BdrvRequestPadding {
>>       uint8_t *buf;
>> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
>>       size_t head;
>>       size_t tail;
>>       bool merge_reads;
>> +    bool write;
>>       QEMUIOVector local_qiov;
>> +
>> +    uint8_t *collapse_buf;
>> +    size_t collapse_len;
>> +    QEMUIOVector collapsed_qiov;
>>   } BdrvRequestPadding;
>>   
>>   static bool bdrv_init_padding(BlockDriverState *bs,
>>                                 int64_t offset, int64_t bytes,
>> +                              bool write,
>>                                 BdrvRequestPadding *pad)
>>   {
>>       int64_t align = bs->bl.request_alignment;
>> @@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
>>           pad->tail_buf = pad->buf + pad->buf_len - align;
>>       }
>>   
>> +    pad->write = write;
>> +
>>       return true;
>>   }
>>   
>> +/*
>> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
>> + * elements), collapse some elements into a single one so that it adheres to the
>> + * IOV_MAX limit again.
>> + *
>> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
>> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
>> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
>> + * buffer content back for read requests.
> The distinction between "collapse" and "collapsed" is subtle. I didn't
> guess it right, I thought collapsed_qiov is a QEMUIOVector for
> collapse_buf/collapse_len.
>
> Please choose a name for collapsed_qiov that makes this clearer. Maybe
> pre_collapse_qiov (i.e. the local_qiov iovecs that were replaced by
> bdrv_padding_collapse)?

Yes, sounds good!

>> + *
>> + * Note that we will not touch the padding head or tail entries here.  We cannot
>> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
>> + * be in an aligned buffer with scratch space after (head) or before (tail) to
>> + * perform the read into (because the whole buffer must be aligned, but head's
>> + * and tail's lengths naturally cannot be aligned, because they provide padding
>> + * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
>> + * cannot provide such scratch space.
> As someone who hasn't looked at this code for a while, I don't
> understand this paragraph. Can you expand on why RMW is problematic
> here? If not, don't worry, it's hard to explain iov juggling.

The read in the RMW cycle is done using bdrv_aligned_preadv(), so head
and tail must be fully aligned; their buffers’ addresses and lengths
both must be aligned.  However, head and tail themselves can’t have an
aligned length, because they exist precisely to pad an unaligned request
up to alignment.

Therefore, there must be some scratch space in those buffers that 
overlaps with adjacent elements in the I/O vector.  The 
bdrv_aligned_preadv() calls directly read into the padding buffer, so 
they will not overwrite anything in the I/O vector.  Instead, in the I/O 
vector, the references to the padding buffer are shortened to match the 
pure lengths of head and tail, so that when we finally do the actual 
write, the scratch space is hidden.

So merging head or tail requires special consideration of this scratch 
space.  It probably isn’t impossible (I should fix the comment on that), 
but it’s just more complicated than to merge internal elements.  Also, 
if you merge head or tail, you need to adjust some existing fields in 
BdrvRequestPadding (free `buf`, adjust `buf_len`, adjust `tail_buf`).  
And qemu_iovec_{to,from}_buf() would need to skip the head/tail.

I began by trying to merge into head/tail, but faced problem after
problem, and finally decided to give up on that, finding that just
merging internal buffers is simpler.
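
As a rough sketch of the head entry’s layout (assuming align = 4096 and
a request starting at offset 512, so pad->head = 512):

  pad->buf (4096 bytes, fully aligned):
    |<-- head: 512 B -->|<---------- scratch: 3584 B ---------->|

bdrv_aligned_preadv() fills all 4096 bytes of pad->buf, but
pad->local_qiov references only the first 512 of them; in the final
write, the byte range behind the scratch space is covered by the
guest’s own vector elements instead.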

>> + *
>> + * Therefore, this function collapses the first IOV elements after the
>> + * (potential) head element.
>> + */
>> +static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
>> +{
>> +    int surplus_count, collapse_count;
>> +    struct iovec *collapse_iovs;
>> +    QEMUIOVector collapse_qiov;
>> +    size_t move_count;
>> +
>> +    surplus_count = pad->local_qiov.niov - IOV_MAX;
>> +    /* Not exceeding the limit?  Nothing to collapse. */
>> +    if (surplus_count <= 0) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Only head and tail can have led to the number of entries exceeding
>> +     * IOV_MAX, so we can exceed it by the head and tail at most
>> +     */
>> +    assert(surplus_count <= !!pad->head + !!pad->tail);
>> +
>> +    /*
>> +     * We merge (collapse) `surplus_count` entries into the first entry that is
>> +     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
>> +     * there is no head, or entry 1 if there is one.
>> +     */
>> +    collapse_count = surplus_count + 1;
>> +    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
>> +
>> +    /* There must be no previously collapsed buffers in `pad` */
>> +    assert(pad->collapse_len == 0);
>> +    for (int i = 0; i < collapse_count; i++) {
>> +        pad->collapse_len += collapse_iovs[i].iov_len;
>> +    }
>> +    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
>> +
>> +    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
>> +    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
>> +    qemu_iovec_init_slice(&pad->collapsed_qiov,
> Having collapse_qiov and collapsed_qiov in the same function is
> confusing. IIUC collapse_qiov and collapsed_qiov reference the same
> buffers, except that the last iovec in collapse_qiov has its original
> length from local_qiov, whereas collapsed_qiov may shrink the last
> iovec.
>
> Maybe just naming collapse_qiov "qiov" or "tmp_qiov" would be less
> confusing because it avoids making collapse_buf/collapse_len vs
> collapsed_qiov more confusing.

Sure!

>> +                          &collapse_qiov, 0, pad->collapse_len);
>> +    if (pad->write) {
>> +        /* For writes: Copy all to-be-collapsed data into collapse_buf */
>> +        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
>> +                          pad->collapse_buf, pad->collapse_len);
>> +    }
>> +
>> +    /* Create the collapsed entry in pad->local_qiov */
>> +    collapse_iovs[0] = (struct iovec){
>> +        .iov_base = pad->collapse_buf,
>> +        .iov_len = pad->collapse_len,
>> +    };
>> +
>> +    /*
>> +     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
>> +     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
>> +     * immediately after the collapse target.
>> +     *
>> +     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
>> +     * `collapse_iovs[collapse_count]`.  The number of elements to move is the
>> +     * number of elements remaining in `pad->local_qiov` after and including
>> +     * `collapse_iovs[collapse_count]`.
>> +     */
>> +    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
>> +        &collapse_iovs[collapse_count];
>> +    memmove(&collapse_iovs[1],
>> +            &collapse_iovs[collapse_count],
>> +            move_count * sizeof(pad->local_qiov.iov[0]));
>> +
>> +    pad->local_qiov.niov -= surplus_count;
>> +}
>> +
>>   static int coroutine_fn GRAPH_RDLOCK
>>   bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
>>                         BdrvRequestPadding *pad, bool zero_middle)
>> @@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>           qemu_vfree(pad->buf);
>>           qemu_iovec_destroy(&pad->local_qiov);
>>       }
>> +    if (pad->collapse_buf) {
>> +        if (!pad->write) {
>> +            /*
>> +             * If padding required elements in the vector to be collapsed into a
>> +             * bounce buffer, copy the bounce buffer content back
>> +             */
>> +            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
>> +                                pad->collapse_buf, pad->collapse_len);
>> +        }
>> +        qemu_vfree(pad->collapse_buf);
>> +        qemu_iovec_destroy(&pad->collapsed_qiov);
>> +    }
> This is safe because qemu_iovec_init_slice() took copies of local_qiov
> iovecs instead of references, but the code requires less thinking if
> collapsed_qiov is destroyed before local_qiov. This change would be nice
> if you respin.

Sure.
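
I.e., roughly (just sketching the reordering, based on the hunk above):

    if (pad->collapse_buf) {
        if (!pad->write) {
            /* copy the bounce buffer content back for reads */
            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
                                pad->collapse_buf, pad->collapse_len);
        }
        qemu_iovec_destroy(&pad->collapsed_qiov);
        qemu_vfree(pad->collapse_buf);
    }
    if (pad->buf) {
        qemu_vfree(pad->buf);
        qemu_iovec_destroy(&pad->local_qiov);
    }
    memset(pad, 0, sizeof(*pad));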

>>       memset(pad, 0, sizeof(*pad));
>>   }
>>   
>> @@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>    * read of padding, bdrv_padding_rmw_read() should be called separately if
>>    * needed.
>>    *
>> + * @write is true for write requests, false for read requests.
>> + *
>>    * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
>>    *  - on function start they represent original request
>>    *  - on failure or when padding is not needed they are unchanged
>> @@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>   static int bdrv_pad_request(BlockDriverState *bs,
>>                               QEMUIOVector **qiov, size_t *qiov_offset,
>>                               int64_t *offset, int64_t *bytes,
>> +                            bool write,
>>                               BdrvRequestPadding *pad, bool *padded,
>>                               BdrvRequestFlags *flags)
>>   {
>> @@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>>   
>>       bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>>   
>> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
>> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>>           if (padded) {
>>               *padded = false;
>>           }
>> @@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
>>           bdrv_padding_destroy(pad);
>>           return ret;
>>       }
>> +
>> +    /*
>> +     * If the IOV is too long after padding, merge (collapse) some entries to
>> +     * make it not exceed IOV_MAX
>> +     */
>> +    bdrv_padding_collapse(pad, bs);
>> +    assert(pad->local_qiov.niov <= IOV_MAX);
>> +
>>       *bytes += pad->head + pad->tail;
>>       *offset -= pad->head;
>>       *qiov = &pad->local_qiov;
>> @@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>           flags |= BDRV_REQ_COPY_ON_READ;
>>       }
>>   
>> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>> -                           NULL, &flags);
>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>> +                           &pad, NULL, &flags);
>>       if (ret < 0) {
>>           goto fail;
>>       }
>> @@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
>>       /* This flag doesn't make sense for padding or zero writes */
>>       flags &= ~BDRV_REQ_REGISTERED_BUF;
>>   
>> -    padding = bdrv_init_padding(bs, offset, bytes, &pad);
>> +    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
>>       if (padding) {
>>           assert(!(flags & BDRV_REQ_NO_WAIT));
>>           bdrv_make_request_serialising(req, align);
>> @@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
>>            * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
>>            * alignment only if there is no ZERO flag.
>>            */
>> -        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>> -                               &padded, &flags);
>> +        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
>> +                               &pad, &padded, &flags);
>>           if (ret < 0) {
>>               return ret;
>>           }
>> diff --git a/util/iov.c b/util/iov.c
>> index b4be580022..4d0d381949 100644
>> --- a/util/iov.c
>> +++ b/util/iov.c
>> @@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
>>       }
>>   
>>       total_niov = !!head_len + mid_niov + !!tail_len;
>> -    if (total_niov > IOV_MAX) {
>> -        return -EINVAL;
>> -    }
> Perhaps an assertion is good here, just to make sure it's detected if a
> new caller is added some day. qemu_iovec_init_extended() is a public
> API, so something unrelated to block layer padding might use it.

The problem is that I’m not removing this because it’s become 
unnecessary, but because I need this function to happily create I/O 
vectors exceeding IOV_MAX.  After this patch, it will create I/O vectors 
with up to 1026 elements, which are only shrunk afterwards.

What I can do is add a comment that this function can create I/O vectors 
exceeding 1024 elements, and that callers must somehow deal with this, 
either by knowing that it can’t happen in their case (and asserting 
that), or by shrinking/splitting the resulting vector.
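
Roughly like this (just a sketch of the wording, nothing final):

    /*
     * Note: may create an I/O vector with more than IOV_MAX elements!
     * Callers must either know that this cannot happen with their
     * parameters (and may then assert qiov->niov <= IOV_MAX), or
     * shrink or split the resulting vector before submitting it.
     */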

Hanna



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
  2023-03-15 18:25   ` Eric Blake
  2023-03-15 18:48   ` Stefan Hajnoczi
@ 2023-03-16 17:44   ` Vladimir Sementsov-Ogievskiy
  2023-03-17  8:05     ` Hanna Czenczek
  2 siblings, 1 reply; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-03-16 17:44 UTC (permalink / raw)
  To: Hanna Czenczek, qemu-block
  Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng

On 15.03.23 15:13, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
> 
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit.  As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
> 
> To the guest, this appears as a random I/O error.  We should not return
> an I/O error to the guest when it issued a perfectly valid request.
> 
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter).  However,
> that does not seem exactly great.
> 
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>     shorter.
> 
> I am wary of (1), because it seems like it may have unintended side
> effects.
> 
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>   block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>   util/iov.c |   4 --
>   2 files changed, 133 insertions(+), 10 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..ee226d23d6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1435,6 +1435,12 @@ out:
>    * @merge_reads is true for small requests,
>    * if @buf_len == @head + bytes + @tail. In this case it is possible that both
>    * head and tail exist but @buf_len == align and @tail_buf == @buf.
> + *
> + * @write is true for write requests, false for read requests.
> + *
> + * If padding makes the vector too long (exceeding IOV_MAX), then we need to
> + * merge existing vector elements into a single one.  @collapse_buf acts as the
> + * bounce buffer in such cases.
>    */
>   typedef struct BdrvRequestPadding {
>       uint8_t *buf;
> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
>       size_t head;
>       size_t tail;
>       bool merge_reads;
> +    bool write;
>       QEMUIOVector local_qiov;
> +
> +    uint8_t *collapse_buf;
> +    size_t collapse_len;
> +    QEMUIOVector collapsed_qiov;
>   } BdrvRequestPadding;
>   
>   static bool bdrv_init_padding(BlockDriverState *bs,
>                                 int64_t offset, int64_t bytes,
> +                              bool write,
>                                 BdrvRequestPadding *pad)
>   {
>       int64_t align = bs->bl.request_alignment;
> @@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
>           pad->tail_buf = pad->buf + pad->buf_len - align;
>       }
>   
> +    pad->write = write;
> +
>       return true;
>   }
>   
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the previous entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.
> + *
> + * Note that we will not touch the padding head or tail entries here.  We cannot
> + * move them to a bounce buffer, because for RMWs, both head and tail expect to
> + * be in an aligned buffer with scratch space after (head) or before (tail) to
> + * perform the read into (because the whole buffer must be aligned, but head's
> + * and tail's lengths naturally cannot be aligned, because they provide padding
> + * for unaligned requests).  A collapsed bounce buffer for multiple IOV elements
> + * cannot provide such scratch space.
> + *
> + * Therefore, this function collapses the first IOV elements after the
> + * (potential) head element.
> + */
> +static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs)
> +{
> +    int surplus_count, collapse_count;
> +    struct iovec *collapse_iovs;
> +    QEMUIOVector collapse_qiov;
> +    size_t move_count;
> +
> +    surplus_count = pad->local_qiov.niov - IOV_MAX;
> +    /* Not exceeding the limit?  Nothing to collapse. */
> +    if (surplus_count <= 0) {
> +        return;
> +    }
> +
> +    /*
> +     * Only head and tail can have led to the number of entries exceeding
> +     * IOV_MAX, so we can exceed it by the head and tail at most
> +     */
> +    assert(surplus_count <= !!pad->head + !!pad->tail);
> +
> +    /*
> +     * We merge (collapse) `surplus_count` entries into the first entry that is
> +     * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if
> +     * there is no head, or entry 1 if there is one.
> +     */
> +    collapse_count = surplus_count + 1;
> +    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
> +
> +    /* There must be no previously collapsed buffers in `pad` */
> +    assert(pad->collapse_len == 0);
> +    for (int i = 0; i < collapse_count; i++) {
> +        pad->collapse_len += collapse_iovs[i].iov_len;
> +    }
> +    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
> +
> +    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
> +    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count);
> +    qemu_iovec_init_slice(&pad->collapsed_qiov,
> +                          &collapse_qiov, 0, pad->collapse_len);
> +    if (pad->write) {
> +        /* For writes: Copy all to-be-collapsed data into collapse_buf */
> +        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
> +                          pad->collapse_buf, pad->collapse_len);
> +    }
> +
> +    /* Create the collapsed entry in pad->local_qiov */
> +    collapse_iovs[0] = (struct iovec){
> +        .iov_base = pad->collapse_buf,
> +        .iov_len = pad->collapse_len,
> +    };
> +
> +    /*
> +     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
> +     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
> +     * immediately after the collapse target.
> +     *
> +     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
> +     * `collapse_iovs[collapse_count]`.  The number of elements to move is the
> +     * number of elements remaining in `pad->local_qiov` after and including
> +     * `collapse_iovs[collapse_count]`.
> +     */
> +    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
> +        &collapse_iovs[collapse_count];
> +    memmove(&collapse_iovs[1],
> +            &collapse_iovs[collapse_count],
> +            move_count * sizeof(pad->local_qiov.iov[0]));
> +
> +    pad->local_qiov.niov -= surplus_count;
> +}


What I don't like is that qemu_iovec_init_extended() is really complex, yet it is used only here [I mean bdrv_pad_request()] (qemu_iovec_init_slice() uses only a small subset of qemu_iovec_init_extended()'s possibilities). And in the end, we use qemu_iovec_init_extended() only to rewrite the resulting qiov by hand, using direct access to the iov[] array and memmove(). I think such direct manipulation had better be done in util/iov.c. Anyway, all this shows that qemu_iovec_init_extended(), complex as it is, doesn't meet our needs.

Hmm. *Improving* qemu_iovec_init_external() by allowing it to allocate an additional bounce buffer and do the collapsing doesn't look good.

Maybe, instead, do the logic of qemu_iovec_init_extended() together with bdrv_padding_collapse() directly in bdrv_pad_request(), using the simpler qemu_iovec_* APIs?

Something like:

1. prepare the bounce buffer if we want to collapse
2. allocate local_qiov with the calculated size
3. build local_qiov:

   - if head: qemu_iovec_add(local_qiov, head)
   - if collapse_buf: qemu_iovec_add(local_qiov, collapse_buf)
   - qemu_iovec_concat(local_qiov, remaining part of qiov)
   - if tail: qemu_iovec_add(local_qiov, tail)
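
In code, roughly (a completely untested sketch of the idea; the exact
variable names and the offset bookkeeping would need care):

    qemu_iovec_init(&pad->local_qiov,
                    !!pad->head + !!pad->collapse_buf +
                    qiov->niov + !!pad->tail);
    if (pad->head) {
        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
    }
    if (pad->collapse_buf) {
        /* for writes, filled from the collapsed elements beforehand */
        qemu_iovec_add(&pad->local_qiov,
                       pad->collapse_buf, pad->collapse_len);
    }
    /* remaining part of the guest qiov, after the collapsed bytes */
    qemu_iovec_concat(&pad->local_qiov, qiov,
                      qiov_offset + pad->collapse_len,
                      bytes - pad->collapse_len);
    if (pad->tail) {
        qemu_iovec_add(&pad->local_qiov,
                       pad->buf + pad->buf_len - pad->tail, pad->tail);
    }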

> +
>   static int coroutine_fn GRAPH_RDLOCK
>   bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
>                         BdrvRequestPadding *pad, bool zero_middle)
> @@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>           qemu_vfree(pad->buf);
>           qemu_iovec_destroy(&pad->local_qiov);
>       }
> +    if (pad->collapse_buf) {
> +        if (!pad->write) {
> +            /*
> +             * If padding required elements in the vector to be collapsed into a
> +             * bounce buffer, copy the bounce buffer content back
> +             */
> +            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
> +                                pad->collapse_buf, pad->collapse_len);
> +        }
> +        qemu_vfree(pad->collapse_buf);
> +        qemu_iovec_destroy(&pad->collapsed_qiov);
> +    }
>       memset(pad, 0, sizeof(*pad));
>   }
>   
> @@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>    * read of padding, bdrv_padding_rmw_read() should be called separately if
>    * needed.
>    *
> + * @write is true for write requests, false for read requests.
> + *
>    * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
>    *  - on function start they represent original request
>    *  - on failure or when padding is not needed they are unchanged
> @@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>   static int bdrv_pad_request(BlockDriverState *bs,
>                               QEMUIOVector **qiov, size_t *qiov_offset,
>                               int64_t *offset, int64_t *bytes,
> +                            bool write,
>                               BdrvRequestPadding *pad, bool *padded,
>                               BdrvRequestFlags *flags)
>   {
> @@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>   
>       bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>   
> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>           if (padded) {
>               *padded = false;
>           }
> @@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
>           bdrv_padding_destroy(pad);
>           return ret;
>       }
> +
> +    /*
> +     * If the IOV is too long after padding, merge (collapse) some entries to
> +     * make it not exceed IOV_MAX
> +     */
> +    bdrv_padding_collapse(pad, bs);
> +    assert(pad->local_qiov.niov <= IOV_MAX);
> +
>       *bytes += pad->head + pad->tail;
>       *offset -= pad->head;
>       *qiov = &pad->local_qiov;
> @@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>           flags |= BDRV_REQ_COPY_ON_READ;
>       }
>   
> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                           NULL, &flags);
> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> +                           &pad, NULL, &flags);
>       if (ret < 0) {
>           goto fail;
>       }
> @@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
>       /* This flag doesn't make sense for padding or zero writes */
>       flags &= ~BDRV_REQ_REGISTERED_BUF;
>   
> -    padding = bdrv_init_padding(bs, offset, bytes, &pad);
> +    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
>       if (padding) {
>           assert(!(flags & BDRV_REQ_NO_WAIT));
>           bdrv_make_request_serialising(req, align);
> @@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
>            * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
>            * alignment only if there is no ZERO flag.
>            */
> -        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                               &padded, &flags);
> +        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
> +                               &pad, &padded, &flags);
>           if (ret < 0) {
>               return ret;
>           }
> diff --git a/util/iov.c b/util/iov.c
> index b4be580022..4d0d381949 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
>       }
>   
>       total_niov = !!head_len + mid_niov + !!tail_len;
> -    if (total_niov > IOV_MAX) {
> -        return -EINVAL;
> -    }
> -
>       if (total_niov == 1) {
>           qemu_iovec_init_buf(qiov, NULL, 0);
>           p = &qiov->local_iov;

-- 
Best regards,
Vladimir



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] block: Split padded I/O vectors exceeding IOV_MAX
  2023-03-16 17:44   ` Vladimir Sementsov-Ogievskiy
@ 2023-03-17  8:05     ` Hanna Czenczek
  0 siblings, 0 replies; 11+ messages in thread
From: Hanna Czenczek @ 2023-03-17  8:05 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: qemu-devel, Eric Blake, Kevin Wolf, Stefan Hajnoczi, Fam Zheng

On 16.03.23 18:44, Vladimir Sementsov-Ogievskiy wrote:
> On 15.03.23 15:13, Hanna Czenczek wrote:
>> When processing vectored guest requests that are not aligned to the
>> storage request alignment, we pad them by adding head and/or tail
>> buffers for a read-modify-write cycle.
>>
>> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
>> with this padding, the vector can exceed that limit.  As of
>> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
>> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
>> limit, instead returning an error to the guest.
>>
>> To the guest, this appears as a random I/O error.  We should not return
>> an I/O error to the guest when it issued a perfectly valid request.
>>
>> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
>> longer than IOV_MAX, which generally seems to work (because the guest
>> assumes a smaller alignment than we really have, file-posix's
>> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
>> so emulate the request, so that the IOV_MAX does not matter). However,
>> that does not seem exactly great.
>>
>> I see two ways to fix this problem:
>> 1. We split such long requests into two requests.
>> 2. We join some elements of the vector into new buffers to make it
>>     shorter.
>>
>> I am wary of (1), because it seems like it may have unintended side
>> effects.
>>
>> (2) on the other hand seems relatively simple to implement, with
>> hopefully few side effects, so this patch does that.
>>
>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   util/iov.c |   4 --
>>   2 files changed, 133 insertions(+), 10 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 8974d46941..ee226d23d6 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1435,6 +1435,12 @@ out:
>>    * @merge_reads is true for small requests,
>>    * if @buf_len == @head + bytes + @tail. In this case it is 
>> possible that both
>>    * head and tail exist but @buf_len == align and @tail_buf == @buf.
>> + *
>> + * @write is true for write requests, false for read requests.
>> + *
>> + * If padding makes the vector too long (exceeding IOV_MAX), then we 
>> need to
>> + * merge existing vector elements into a single one. @collapse_buf 
>> acts as the
>> + * bounce buffer in such cases.
>>    */
>>   typedef struct BdrvRequestPadding {
>>       uint8_t *buf;
>> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
>>       size_t head;
>>       size_t tail;
>>       bool merge_reads;
>> +    bool write;
>>       QEMUIOVector local_qiov;
>> +
>> +    uint8_t *collapse_buf;
>> +    size_t collapse_len;
>> +    QEMUIOVector collapsed_qiov;
>>   } BdrvRequestPadding;
>>     static bool bdrv_init_padding(BlockDriverState *bs,
>>                                 int64_t offset, int64_t bytes,
>> +                              bool write,
>>                                 BdrvRequestPadding *pad)
>>   {
>>       int64_t align = bs->bl.request_alignment;
>> @@ -1479,9 +1491,101 @@ static bool 
>> bdrv_init_padding(BlockDriverState *bs,
>>           pad->tail_buf = pad->buf + pad->buf_len - align;
>>       }
>>   +    pad->write = write;
>> +
>>       return true;
>>   }
>>   +/*
>> + * If padding has made the IOV (`pad->local_qiov`) too long (more 
>> than IOV_MAX
>> + * elements), collapse some elements into a single one so that it 
>> adheres to the
>> + * IOV_MAX limit again.
>> + *
>> + * If collapsing, `pad->collapse_buf` will be used as a bounce 
>> buffer of length
>> + * `pad->collapse_len`.  `pad->collapsed_qiov` will contain the 
>> previous entries
>> + * (before collapsing), so that bdrv_padding_destroy() can copy the 
>> bounce
>> + * buffer content back for read requests.
>> + *
>> + * Note that we will not touch the padding head or tail entries 
>> here.  We cannot
>> + * move them to a bounce buffer, because for RMWs, both head and 
>> tail expect to
>> + * be in an aligned buffer with scratch space after (head) or before 
>> (tail) to
>> + * perform the read into (because the whole buffer must be aligned, 
>> but head's
>> + * and tail's lengths naturally cannot be aligned, because they 
>> provide padding
>> + * for unaligned requests).  A collapsed bounce buffer for multiple 
>> IOV elements
>> + * cannot provide such scratch space.
>> + *
>> + * Therefore, this function collapses the first IOV elements after the
>> + * (potential) head element.
>> + */
>> +static void bdrv_padding_collapse(BdrvRequestPadding *pad, 
>> BlockDriverState *bs)
>> +{
>> +    int surplus_count, collapse_count;
>> +    struct iovec *collapse_iovs;
>> +    QEMUIOVector collapse_qiov;
>> +    size_t move_count;
>> +
>> +    surplus_count = pad->local_qiov.niov - IOV_MAX;
>> +    /* Not exceeding the limit?  Nothing to collapse. */
>> +    if (surplus_count <= 0) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Only head and tail can have led to the number of entries 
>> exceeding
>> +     * IOV_MAX, so we can exceed it by the head and tail at most
>> +     */
>> +    assert(surplus_count <= !!pad->head + !!pad->tail);
>> +
>> +    /*
>> +     * We merge (collapse) `surplus_count` entries into the first 
>> entry that is
>> +     * not padding, i.e. we merge `surplus_count + 1` entries into 
>> entry 0 if
>> +     * there is no head, or entry 1 if there is one.
>> +     */
>> +    collapse_count = surplus_count + 1;
>> +    collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0];
>> +
>> +    /* There must be no previously collapsed buffers in `pad` */
>> +    assert(pad->collapse_len == 0);
>> +    for (int i = 0; i < collapse_count; i++) {
>> +        pad->collapse_len += collapse_iovs[i].iov_len;
>> +    }
>> +    pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len);
>> +
>> +    /* Save the to-be-collapsed IOV elements in collapsed_qiov */
>> +    qemu_iovec_init_external(&collapse_qiov, collapse_iovs, 
>> collapse_count);
>> +    qemu_iovec_init_slice(&pad->collapsed_qiov,
>> +                          &collapse_qiov, 0, pad->collapse_len);
>> +    if (pad->write) {
>> +        /* For writes: Copy all to-be-collapsed data into 
>> collapse_buf */
>> +        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
>> +                          pad->collapse_buf, pad->collapse_len);
>> +    }
>> +
>> +    /* Create the collapsed entry in pad->local_qiov */
>> +    collapse_iovs[0] = (struct iovec){
>> +        .iov_base = pad->collapse_buf,
>> +        .iov_len = pad->collapse_len,
>> +    };
>> +
>> +    /*
>> +     * To finalize collapsing, we must shift the rest of 
>> pad->local_qiov left by
>> +     * `surplus_count`, i.e. we must move all elements after 
>> `collapse_iovs` to
>> +     * immediately after the collapse target.
>> +     *
>> +     * Therefore, the memmove() target is `collapse_iovs[1]` and the 
>> source is
>> +     * `collapse_iovs[collapse_count]`.  The number of elements to 
>> move is the
>> +     * number of elements remaining in `pad->local_qiov` after and 
>> including
>> +     * `collapse_iovs[collapse_count]`.
>> +     */
>> +    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
>> +        &collapse_iovs[collapse_count];
>> +    memmove(&collapse_iovs[1],
>> +            &collapse_iovs[collapse_count],
>> +            move_count * sizeof(pad->local_qiov.iov[0]));
>> +
>> +    pad->local_qiov.niov -= surplus_count;
>> +}
>
>
> What I don't like is that qemu_iovec_init_extended() is really 
> complex, yet it is used only here [I mean bdrv_pad_request()] 
> (qemu_iovec_init_slice() uses only a small subset of 
> qemu_iovec_init_extended()'s possibilities). And in the end, we use 
> qemu_iovec_init_extended() only to rewrite the resulting qiov by 
> hand, using direct access to the iov[] array and memmove(). I think 
> such direct manipulation had better be done in util/iov.c. Anyway, 
> all this shows that qemu_iovec_init_extended(), complex as it is, 
> doesn't meet our needs.
>
> Hmm. *Improving* qemu_iovec_init_external() by allowing it to 
> allocate an additional bounce buffer and do the collapsing doesn't 
> look good.
>
> Maybe, instead, do the logic of qemu_iovec_init_extended() together 
> with bdrv_padding_collapse() directly in bdrv_pad_request(), using 
> the simpler qemu_iovec_* APIs?
>
> Something like:
>
> 1. prepare the bounce buffer if we want to collapse
> 2. allocate local_qiov with the calculated size
> 3. build local_qiov:
>
>   - if head: qemu_iovec_add(local_qiov, head)
>   - if collapse_buf: qemu_iovec_add(local_qiov, collapse_buf)
>   - qemu_iovec_concat(local_qiov, remaining part of qiov)
>   - if tail: qemu_iovec_add(local_qiov, tail)

Sure, I’ll give it a try!

Hanna



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2023-03-17  8:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-15 12:13 [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Hanna Czenczek
2023-03-15 12:13 ` [RFC 1/2] block: " Hanna Czenczek
2023-03-15 18:25   ` Eric Blake
2023-03-16  9:43     ` Hanna Czenczek
2023-03-15 18:48   ` Stefan Hajnoczi
2023-03-16 10:10     ` Hanna Czenczek
2023-03-16 17:44   ` Vladimir Sementsov-Ogievskiy
2023-03-17  8:05     ` Hanna Czenczek
2023-03-15 12:13 ` [RFC 2/2] iotests/iov-padding: New test Hanna Czenczek
2023-03-15 15:29 ` [RFC 0/2] Split padded I/O vectors exceeding IOV_MAX Stefan Hajnoczi
2023-03-15 16:05   ` Hanna Czenczek
