On Wed, Mar 15, 2023 at 01:13:29PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
>
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit. As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
>
> To the guest, this appears as a random I/O error. We should not return
> an I/O error to the guest when it issued a perfectly valid request.
>
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter). However,
> that does not seem exactly great.
>
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>    shorter.
>
> I am wary of (1), because it seems like it may have unintended side
> effects.
>
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.

Looks like a reasonable solution. I think the code is correct and I
posted ideas for making it easier to understand.

>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek
> ---
>  block/io.c | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  util/iov.c |   4 --
>  2 files changed, 133 insertions(+), 10 deletions(-)
>
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..ee226d23d6 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1435,6 +1435,12 @@ out:
>   * @merge_reads is true for small requests,
>   * if @buf_len == @head + bytes + @tail. In this case it is possible that both
>   * head and tail exist but @buf_len == align and @tail_buf == @buf.
> + *
> + * @write is true for write requests, false for read requests.
> + *
> + * If padding makes the vector too long (exceeding IOV_MAX), then we need to
> + * merge existing vector elements into a single one. @collapse_buf acts as the
> + * bounce buffer in such cases.
>   */
>  typedef struct BdrvRequestPadding {
>      uint8_t *buf;
> @@ -1443,11 +1449,17 @@ typedef struct BdrvRequestPadding {
>      size_t head;
>      size_t tail;
>      bool merge_reads;
> +    bool write;
>      QEMUIOVector local_qiov;
> +
> +    uint8_t *collapse_buf;
> +    size_t collapse_len;
> +    QEMUIOVector collapsed_qiov;
>  } BdrvRequestPadding;
>
>  static bool bdrv_init_padding(BlockDriverState *bs,
>                                int64_t offset, int64_t bytes,
> +                              bool write,
>                                BdrvRequestPadding *pad)
>  {
>      int64_t align = bs->bl.request_alignment;
> @@ -1479,9 +1491,101 @@ static bool bdrv_init_padding(BlockDriverState *bs,
>          pad->tail_buf = pad->buf + pad->buf_len - align;
>      }
>
> +    pad->write = write;
> +
>      return true;
>  }
>
> +/*
> + * If padding has made the IOV (`pad->local_qiov`) too long (more than IOV_MAX
> + * elements), collapse some elements into a single one so that it adheres to the
> + * IOV_MAX limit again.
> + *
> + * If collapsing, `pad->collapse_buf` will be used as a bounce buffer of length
> + * `pad->collapse_len`. `pad->collapsed_qiov` will contain the previous entries
> + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce
> + * buffer content back for read requests.

The distinction between "collapse" and "collapsed" is subtle. I didn't
guess it right; I thought collapsed_qiov is a QEMUIOVector for
collapse_buf/collapse_len. Please choose a name for collapsed_qiov that
makes this clearer. Maybe pre_collapse_qiov (i.e. the local_qiov iovecs
that were replaced by bdrv_padding_collapse)?
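To illustrate what I mean, the struct could then read something like
this (pre_collapse_qiov being the hypothetical name, not what the patch
uses):

    typedef struct BdrvRequestPadding {
        ...
        uint8_t *collapse_buf;
        size_t collapse_len;
        /* the local_qiov elements that collapse_buf replaced */
        QEMUIOVector pre_collapse_qiov;
    } BdrvRequestPadding;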
`pad->collapsed_qiov` will contain the previous entries > + * (before collapsing), so that bdrv_padding_destroy() can copy the bounce > + * buffer content back for read requests. The distinction between "collapse" and "collapsed" is subtle. I didn't guess it right, I thought collapsed_qiov is a QEMUIOVector for collapse_buf/collapse_len. Please choose a name for collapsed_qiov that makes this clearer. Maybe pre_collapse_qiov (i.e. the local_qiov iovecs that were replaced by bdrv_padding_collapse)? > + * > + * Note that we will not touch the padding head or tail entries here. We cannot > + * move them to a bounce buffer, because for RMWs, both head and tail expect to > + * be in an aligned buffer with scratch space after (head) or before (tail) to > + * perform the read into (because the whole buffer must be aligned, but head's > + * and tail's lengths naturally cannot be aligned, because they provide padding > + * for unaligned requests). A collapsed bounce buffer for multiple IOV elements > + * cannot provide such scratch space. As someone who hasn't looked at this code for a while, I don't understand this paragraph. Can you expand on why RMW is problematic here? If not, don't worry, it's hard to explain iov juggling. > + * > + * Therefore, this function collapses the first IOV elements after the > + * (potential) head element. > + */ > +static void bdrv_padding_collapse(BdrvRequestPadding *pad, BlockDriverState *bs) > +{ > + int surplus_count, collapse_count; > + struct iovec *collapse_iovs; > + QEMUIOVector collapse_qiov; > + size_t move_count; > + > + surplus_count = pad->local_qiov.niov - IOV_MAX; > + /* Not exceeding the limit? Nothing to collapse. */ > + if (surplus_count <= 0) { > + return; > + } > + > + /* > + * Only head and tail can have lead to the number of entries exceeding > + * IOV_MAX, so we can exceed it by the head and tail at most > + */ > + assert(surplus_count <= !!pad->head + !!pad->tail); > + > + /* > + * We merge (collapse) `surplus_count` entries into the first entry that is > + * not padding, i.e. we merge `surplus_count + 1` entries into entry 0 if > + * there is no head, or entry 1 if there is one. > + */ > + collapse_count = surplus_count + 1; > + collapse_iovs = &pad->local_qiov.iov[pad->head ? 1 : 0]; > + > + /* There must be no previously collapsed buffers in `pad` */ > + assert(pad->collapse_len == 0); > + for (int i = 0; i < collapse_count; i++) { > + pad->collapse_len += collapse_iovs[i].iov_len; > + } > + pad->collapse_buf = qemu_blockalign(bs, pad->collapse_len); > + > + /* Save the to-be-collapsed IOV elements in collapsed_qiov */ > + qemu_iovec_init_external(&collapse_qiov, collapse_iovs, collapse_count); > + qemu_iovec_init_slice(&pad->collapsed_qiov, Having collapse_qiov and collapsed_qiov in the same function is confusing. IIUC collapse_qiov and collapsed_qiov have the same buffers the same, except that the last iovec in collapse_qiov has its original length from local_qiov, whereas collapsed_qiov may shrink the last iovec. Maybe just naming collapse_qiov "qiov" or "tmp_qiov" would be less confusing because it avoids making collapse_buf/collapse_len vs collapsed_qiov more confusing. 
> +                          &collapse_qiov, 0, pad->collapse_len);
> +    if (pad->write) {
> +        /* For writes: Copy all to-be-collapsed data into collapse_buf */
> +        qemu_iovec_to_buf(&pad->collapsed_qiov, 0,
> +                          pad->collapse_buf, pad->collapse_len);
> +    }
> +
> +    /* Create the collapsed entry in pad->local_qiov */
> +    collapse_iovs[0] = (struct iovec){
> +        .iov_base = pad->collapse_buf,
> +        .iov_len = pad->collapse_len,
> +    };
> +
> +    /*
> +     * To finalize collapsing, we must shift the rest of pad->local_qiov left by
> +     * `surplus_count`, i.e. we must move all elements after `collapse_iovs` to
> +     * immediately after the collapse target.
> +     *
> +     * Therefore, the memmove() target is `collapse_iovs[1]` and the source is
> +     * `collapse_iovs[collapse_count]`. The number of elements to move is the
> +     * number of elements remaining in `pad->local_qiov` after and including
> +     * `collapse_iovs[collapse_count]`.
> +     */
> +    move_count = &pad->local_qiov.iov[pad->local_qiov.niov] -
> +                 &collapse_iovs[collapse_count];
> +    memmove(&collapse_iovs[1],
> +            &collapse_iovs[collapse_count],
> +            move_count * sizeof(pad->local_qiov.iov[0]));
> +
> +    pad->local_qiov.niov -= surplus_count;
> +}
> +
>  static int coroutine_fn GRAPH_RDLOCK
>  bdrv_padding_rmw_read(BdrvChild *child, BdrvTrackedRequest *req,
>                        BdrvRequestPadding *pad, bool zero_middle)
> @@ -1549,6 +1653,18 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>          qemu_vfree(pad->buf);
>          qemu_iovec_destroy(&pad->local_qiov);
>      }
> +    if (pad->collapse_buf) {
> +        if (!pad->write) {
> +            /*
> +             * If padding required elements in the vector to be collapsed into a
> +             * bounce buffer, copy the bounce buffer content back
> +             */
> +            qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
> +                                pad->collapse_buf, pad->collapse_len);
> +        }
> +        qemu_vfree(pad->collapse_buf);
> +        qemu_iovec_destroy(&pad->collapsed_qiov);
> +    }

This is safe because qemu_iovec_init_slice() took copies of local_qiov
iovecs instead of references, but the code requires less thinking if
collapsed_qiov is destroyed before local_qiov. This change would be
nice if you respin.
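In other words, just swapping the two blocks (untested sketch):

    static void bdrv_padding_destroy(BdrvRequestPadding *pad)
    {
        if (pad->collapse_buf) {
            if (!pad->write) {
                /* Copy the bounce buffer content back for reads */
                qemu_iovec_from_buf(&pad->collapsed_qiov, 0,
                                    pad->collapse_buf, pad->collapse_len);
            }
            qemu_vfree(pad->collapse_buf);
            /* Destroy the slice before the qiov it was taken from */
            qemu_iovec_destroy(&pad->collapsed_qiov);
        }
        if (pad->buf) {
            qemu_vfree(pad->buf);
            qemu_iovec_destroy(&pad->local_qiov);
        }
        memset(pad, 0, sizeof(*pad));
    }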
>      memset(pad, 0, sizeof(*pad));
>  }
>
> @@ -1559,6 +1675,8 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>   * read of padding, bdrv_padding_rmw_read() should be called separately if
>   * needed.
>   *
> + * @write is true for write requests, false for read requests.
> + *
>   * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
>   * - on function start they represent original request
>   * - on failure or when padding is not needed they are unchanged
> @@ -1567,6 +1685,7 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>  static int bdrv_pad_request(BlockDriverState *bs,
>                              QEMUIOVector **qiov, size_t *qiov_offset,
>                              int64_t *offset, int64_t *bytes,
> +                            bool write,
>                              BdrvRequestPadding *pad, bool *padded,
>                              BdrvRequestFlags *flags)
>  {
> @@ -1574,7 +1693,7 @@ static int bdrv_pad_request(BlockDriverState *bs,
>
>      bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
>
> -    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
> +    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
>          if (padded) {
>              *padded = false;
>          }
> @@ -1589,6 +1708,14 @@ static int bdrv_pad_request(BlockDriverState *bs,
>          bdrv_padding_destroy(pad);
>          return ret;
>      }
> +
> +    /*
> +     * If the IOV is too long after padding, merge (collapse) some entries to
> +     * make it not exceed IOV_MAX
> +     */
> +    bdrv_padding_collapse(pad, bs);
> +    assert(pad->local_qiov.niov <= IOV_MAX);
> +
>      *bytes += pad->head + pad->tail;
>      *offset -= pad->head;
>      *qiov = &pad->local_qiov;
> @@ -1653,8 +1780,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>          flags |= BDRV_REQ_COPY_ON_READ;
>      }
>
> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                           NULL, &flags);
> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> +                           &pad, NULL, &flags);
>      if (ret < 0) {
>          goto fail;
>      }
> @@ -1996,7 +2123,7 @@ bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
>      /* This flag doesn't make sense for padding or zero writes */
>      flags &= ~BDRV_REQ_REGISTERED_BUF;
>
> -    padding = bdrv_init_padding(bs, offset, bytes, &pad);
> +    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
>      if (padding) {
>          assert(!(flags & BDRV_REQ_NO_WAIT));
>          bdrv_make_request_serialising(req, align);
> @@ -2112,8 +2239,8 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
>       * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
>       * alignment only if there is no ZERO flag.
>       */
> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                           &padded, &flags);
> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
> +                           &pad, &padded, &flags);
>      if (ret < 0) {
>          return ret;
>      }
> diff --git a/util/iov.c b/util/iov.c
> index b4be580022..4d0d381949 100644
> --- a/util/iov.c
> +++ b/util/iov.c
> @@ -444,10 +444,6 @@ int qemu_iovec_init_extended(
>      }
>
>      total_niov = !!head_len + mid_niov + !!tail_len;
> -    if (total_niov > IOV_MAX) {
> -        return -EINVAL;
> -    }

Perhaps an assertion is good here, just to make sure it's detected if a
new caller is added some day. qemu_iovec_init_extended() is a public
API, so something unrelated to block layer padding might use it.

> -
>      if (total_niov == 1) {
>          qemu_iovec_init_buf(qiov, NULL, 0);
>          p = &qiov->local_iov;
> --
> 2.39.1
>
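P.S. By "assertion" above I mean something along these lines
(untested):

    total_niov = !!head_len + mid_niov + !!tail_len;
    /* Catch any future caller that would exceed the IOV_MAX limit */
    assert(total_niov <= IOV_MAX);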