io-uring.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCHSET v2 0/3] Add ability to save/restore iov_iter state
@ 2021-09-14 14:17 Jens Axboe
  2021-09-14 14:17 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro

Hi,

Linus didn't particularly love the iov_iter->truncated addition and how
it was used, and it was hard to disagree with that. Instead of relying
on tracking ->truncated, add a few pieces of state so we can safely
handle partial or errored read/write attempts (that we want to retry).

Then we can get rid of the iov_iter addition, and at the same time
handle cases that weren't handled correctly before.

I've run this through vectored read/write with io_uring on the commonly
problematic cases (dm and low depth SCSI device) which trigger these
conditions often, and it seems to pass muster.

For a discussion on this topic, see the thread here:

https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/

You can find these patches here:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter.2

Changes since v1:
- Drop 'did_bytes' from iov_iter_restore(). Only two cases in io_uring
  used it, and one of them had to be changed with v2. Better to just
  make the subsequent iov_iter_advance() explicit at that point.
- Cleanup and sanitize the io_uring side, and ensure it's sane around
  worker retries. No more digging into iov_iter_state from io_uring, we
  use it just for save/restore purposes.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-14 14:17 [PATCHSET v2 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-14 14:17 ` Jens Axboe
  2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
  2021-09-14 14:17 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
  2 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 15 +++++++++++++++
 lib/iov_iter.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..984c4ab74859 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,12 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+struct iov_iter_state {
+	size_t iov_offset;
+	size_t count;
+	unsigned long nr_segs;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
@@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	state->iov_offset = iter->iov_offset;
+	state->count = iter->count;
+	state->nr_segs = iter->nr_segs;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -233,6 +247,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..755c10c5138c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1972,3 +1972,39 @@ int import_single_range(int rw, void __user *buf, size_t len,
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
+
+/**
+ * iov_iter_restore() - Restore a &struct iov_iter to the same state as when
+ *     iov_iter_save_state() was called.
+ *
+ * @i: &struct iov_iter to restore
+ * @state: state to restore from
+ *
+ * Used after iov_iter_save_state() to bring restore @i, if operations may
+ * have advanced it.
+ *
+ * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC
+ */
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
+{
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
+			 !iov_iter_is_kvec(i))
+		return;
+	i->iov_offset = state->iov_offset;
+	i->count = state->count;
+	/*
+	 * For the *vec iters, nr_segs + iov is constant - if we increment
+	 * the vec, then we also decrement the nr_segs count. Hence we don't
+	 * need to track both of these, just one is enough and we can deduct
+	 * the other from that. ITER_KVEC and ITER_IOVEC are the same struct
+	 * size, so we can just increment the iov pointer as they are unionzed.
+	 * ITER_BVEC _may_ be the same size on some archs, but on others it is
+	 * not. Be safe and handle it separately.
+	 */
+	BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec));
+	if (iov_iter_is_bvec(i))
+		i->bvec -= state->nr_segs - i->nr_segs;
+	else
+		i->iov -= state->nr_segs - i->nr_segs;
+	i->nr_segs = state->nr_segs;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 14:17 [PATCHSET v2 0/3] Add ability to save/restore iov_iter state Jens Axboe
  2021-09-14 14:17 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
@ 2021-09-14 14:17 ` Jens Axboe
  2021-09-14 18:45   ` Linus Torvalds
  2021-09-14 14:17 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO, or failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries, if we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 62 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..dbc97d440801 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state);
 	return true;
 }
 
@@ -3310,12 +3310,16 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (!req->async_data) {
+		struct io_async_rw *iorw;
+
 		if (io_alloc_async_data(req)) {
 			kfree(iovec);
 			return -ENOMEM;
 		}
 
 		io_req_map_rw(req, iovec, fast_iov, iter);
+		iorw = req->async_data;
+		iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	}
 	return 0;
 }
@@ -3334,6 +3338,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
 	iorw->free_iovec = iov;
 	if (iov)
 		req->flags |= REQ_F_NEED_CLEANUP;
+	iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	return 0;
 }
 
@@ -3437,19 +3442,23 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3472,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,30 +3488,36 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == req->result || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
 
 	iovec = NULL;
 	rw = req->async_data;
-	/* now use our persistent iterator, if we aren't already */
-	iter = &rw->iter;
+	/* now use our persistent iterator and state, if we aren't already */
+	if (iter != &rw->iter) {
+		iter = &rw->iter;
+		state = &rw->iter_state;
+	}
 
 	do {
-		io_size -= ret;
 		rw->bytes_done += ret;
+		iov_iter_advance(iter, ret);
+		if (!iov_iter_count(iter))
+			break;
+		iov_iter_save_state(iter, state);
+
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
 			kiocb->ki_flags &= ~IOCB_WAITQ;
@@ -3520,7 +3535,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3558,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3592,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3639,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state);
+		if (ret2 > 0)
+			iov_iter_advance(iter, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/3] Revert "iov_iter: track truncated size"
  2021-09-14 14:17 [PATCHSET v2 0/3] Add ability to save/restore iov_iter state Jens Axboe
  2021-09-14 14:17 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
  2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
@ 2021-09-14 14:17 ` Jens Axboe
  2 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

This reverts commit 2112ff5ce0c1128fe7b4d19cfe7f2b8ce5b595fa.

We no longer need to track the truncation count, the one user that did
need it has been converted to using iov_iter_restore() instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 984c4ab74859..207101a9c5c3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -53,7 +53,6 @@ struct iov_iter {
 		};
 		loff_t xarray_start;
 	};
-	size_t truncated;
 };
 
 static inline enum iter_type iov_iter_type(const struct iov_iter *i)
@@ -270,10 +269,8 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
 	 * conversion in assignement is by definition greater than all
 	 * values of size_t, including old i->count.
 	 */
-	if (i->count > count) {
-		i->truncated += i->count - count;
+	if (i->count > count)
 		i->count = count;
-	}
 }
 
 /*
@@ -282,7 +279,6 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
  */
 static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 {
-	i->truncated -= count - i->count;
 	i->count = count;
 }
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
@ 2021-09-14 18:45   ` Linus Torvalds
  2021-09-14 19:37     ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2021-09-14 18:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro

On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>
>
> +       iov_iter_restore(iter, state);
> +
...
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);

Ok, so now you keep iovb_iter and the state always in sync by just
always resetting the iter back and then walking it forward explicitly
- and re-saving the state.

That seems safe, if potentially unnecessarily expensive.

I guess re-walking lots of iovec entries is actually very unlikely in
practice, so maybe this "stupid brute-force" model is the right one.

I do find the odd "use __state vs rw->state" to be very confusing,
though. Particularly in io_read(), where you do this:

+       iov_iter_restore(iter, state);
+
        ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
        if (ret2)
                return ret2;

        iovec = NULL;
        rw = req->async_data;
-       /* now use our persistent iterator, if we aren't already */
-       iter = &rw->iter;
+       /* now use our persistent iterator and state, if we aren't already */
+       if (iter != &rw->iter) {
+               iter = &rw->iter;
+               state = &rw->iter_state;
+       }

        do {
-               io_size -= ret;
                rw->bytes_done += ret;
+               iov_iter_advance(iter, ret);
+               if (!iov_iter_count(iter))
+                       break;
+               iov_iter_save_state(iter, state);


Note how it first does that iov_iter_restore() on iter/state, buit
then it *replaces&* the iter/state pointers, and then it does
iov_iter_advance() on the replacement ones.

I don't see how that could be right. You're doing iov_iter_advance()
on something else than the one you restored to the original values.

And if it is right, it's sure confusing as hell.

           Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 18:45   ` Linus Torvalds
@ 2021-09-14 19:37     ` Jens Axboe
  2021-09-14 23:02       ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 19:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro

On 9/14/21 12:45 PM, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>
>> +       iov_iter_restore(iter, state);
>> +
> ...
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
> 
> Ok, so now you keep iovb_iter and the state always in sync by just
> always resetting the iter back and then walking it forward explicitly
> - and re-saving the state.
> 
> That seems safe, if potentially unnecessarily expensive.

Right, it's not ideal if it's a big range of IO, then it'll definitely
be noticeable. But not too worried about it, at least not for now...

> I guess re-walking lots of iovec entries is actually very unlikely in
> practice, so maybe this "stupid brute-force" model is the right one.

Not sure what the alternative is here. We could do something similar to
__io_import_fixed() as we're only dealing with iter types where we can
do that, but probably best left as a later optimization if it's deemed
necessary.

> I do find the odd "use __state vs rw->state" to be very confusing,
> though. Particularly in io_read(), where you do this:
> 
> +       iov_iter_restore(iter, state);
> +
>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>         if (ret2)
>                 return ret2;
> 
>         iovec = NULL;
>         rw = req->async_data;
> -       /* now use our persistent iterator, if we aren't already */
> -       iter = &rw->iter;
> +       /* now use our persistent iterator and state, if we aren't already */
> +       if (iter != &rw->iter) {
> +               iter = &rw->iter;
> +               state = &rw->iter_state;
> +       }
> 
>         do {
> -               io_size -= ret;
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);
> 
> 
> Note how it first does that iov_iter_restore() on iter/state, buit
> then it *replaces&* the iter/state pointers, and then it does
> iov_iter_advance() on the replacement ones.

We restore the iter so it's the same as before we did the read_iter
call, and then setup a consistent copy of the iov/iter in case we need
to punt this request for retry. rw->iter should have the same state as
iter at this point, and since rw->iter is the copy we'll use going
forward, we're advancing that one in case ret > 0.

The other case is that no persistent state is needed, and then iter
remains the same.

I'll take a second look at this part and see if I can make it a bit more
straight forward, or at least comment it properly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 19:37     ` Jens Axboe
@ 2021-09-14 23:02       ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-09-14 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro

On 9/14/21 1:37 PM, Jens Axboe wrote:
> On 9/14/21 12:45 PM, Linus Torvalds wrote:
>> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>
>>> +       iov_iter_restore(iter, state);
>>> +
>> ...
>>>                 rw->bytes_done += ret;
>>> +               iov_iter_advance(iter, ret);
>>> +               if (!iov_iter_count(iter))
>>> +                       break;
>>> +               iov_iter_save_state(iter, state);
>>
>> Ok, so now you keep iovb_iter and the state always in sync by just
>> always resetting the iter back and then walking it forward explicitly
>> - and re-saving the state.
>>
>> That seems safe, if potentially unnecessarily expensive.
> 
> Right, it's not ideal if it's a big range of IO, then it'll definitely
> be noticeable. But not too worried about it, at least not for now...
> 
>> I guess re-walking lots of iovec entries is actually very unlikely in
>> practice, so maybe this "stupid brute-force" model is the right one.
> 
> Not sure what the alternative is here. We could do something similar to
> __io_import_fixed() as we're only dealing with iter types where we can
> do that, but probably best left as a later optimization if it's deemed
> necessary.
> 
>> I do find the odd "use __state vs rw->state" to be very confusing,
>> though. Particularly in io_read(), where you do this:
>>
>> +       iov_iter_restore(iter, state);
>> +
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>         if (ret2)
>>                 return ret2;
>>
>>         iovec = NULL;
>>         rw = req->async_data;
>> -       /* now use our persistent iterator, if we aren't already */
>> -       iter = &rw->iter;
>> +       /* now use our persistent iterator and state, if we aren't already */
>> +       if (iter != &rw->iter) {
>> +               iter = &rw->iter;
>> +               state = &rw->iter_state;
>> +       }
>>
>>         do {
>> -               io_size -= ret;
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>>
>>
>> Note how it first does that iov_iter_restore() on iter/state, buit
>> then it *replaces&* the iter/state pointers, and then it does
>> iov_iter_advance() on the replacement ones.
> 
> We restore the iter so it's the same as before we did the read_iter
> call, and then setup a consistent copy of the iov/iter in case we need
> to punt this request for retry. rw->iter should have the same state as
> iter at this point, and since rw->iter is the copy we'll use going
> forward, we're advancing that one in case ret > 0.
> 
> The other case is that no persistent state is needed, and then iter
> remains the same.
> 
> I'll take a second look at this part and see if I can make it a bit more
> straight forward, or at least comment it properly.

I hacked up something that shortens the iter for the initial IO, so we
could more easily test the retry path and the state. It really is a
hack, but the idea was to issue 64K io from fio, and then the initial
attempt would be anywhere from 4K-60K truncated. That forces retry.
I ran this with both 16 segments and 8 segments, verifying that it
hits both the UIO_FASTIOV and alloc path.

I did find one issue with that, see the last hunk in the hack. We
need to increment rw->bytes_done if we don't break, or set ret to
0 if we do. Otherwise that last ret ends up being accounted twice.
But apart from that, it passes data verification runs.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc1ff47e3221..484c86252f9d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -744,6 +744,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_TRUNCATED_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -797,6 +798,7 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_TRUNCATED		= BIT(REQ_F_TRUNCATED_BIT),
 };
 
 struct async_poll {
@@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	struct iov_iter __iter, *iter = &__iter;
+	struct iov_iter __i, __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct iov_iter_state __state, *state;
 	ssize_t ret, ret2;
+	bool do_restore = false;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	}
 
+	if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) {
+		int nr_vecs;
+
+		__i = *iter;
+		nr_vecs = 1 + (prandom_u32() % iter->nr_segs);
+		iter->nr_segs = nr_vecs;
+		iter->count = nr_vecs * 8192;
+		req->flags |= REQ_F_TRUNCATED;
+		do_restore = true;
+	}
+
 	ret = io_iter_do_read(req, iter);
 
+	if (ret == -EAGAIN) {
+		req->flags &= ~REQ_F_TRUNCATED;
+		*iter = __i;
+		do_restore = false;
+	}
+
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
 		/* IOPOLL retry should happen for io-wq threads */
@@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	iov_iter_restore(iter, state);
 
+	if (do_restore)
+		*iter = __i;
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	do {
-		rw->bytes_done += ret;
 		iov_iter_advance(iter, ret);
 		if (!iov_iter_count(iter))
 			break;
+		rw->bytes_done += ret;
 		iov_iter_save_state(iter, state);
 
 		/* if we can retry, do so with the callbacks armed */

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-15 16:29 ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 15 +++++++++++++++
 lib/iov_iter.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..984c4ab74859 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,12 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+struct iov_iter_state {
+	size_t iov_offset;
+	size_t count;
+	unsigned long nr_segs;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
@@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	state->iov_offset = iter->iov_offset;
+	state->count = iter->count;
+	state->nr_segs = iter->nr_segs;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -233,6 +247,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..755c10c5138c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1972,3 +1972,39 @@ int import_single_range(int rw, void __user *buf, size_t len,
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
+
+/**
+ * iov_iter_restore() - Restore a &struct iov_iter to the same state as when
+ *     iov_iter_save_state() was called.
+ *
+ * @i: &struct iov_iter to restore
+ * @state: state to restore from
+ *
+ * Used after iov_iter_save_state() to bring restore @i, if operations may
+ * have advanced it.
+ *
+ * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC
+ */
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
+{
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
+			 !iov_iter_is_kvec(i))
+		return;
+	i->iov_offset = state->iov_offset;
+	i->count = state->count;
+	/*
+	 * For the *vec iters, nr_segs + iov is constant - if we increment
+	 * the vec, then we also decrement the nr_segs count. Hence we don't
+	 * need to track both of these, just one is enough and we can deduct
+	 * the other from that. ITER_KVEC and ITER_IOVEC are the same struct
+	 * size, so we can just increment the iov pointer as they are unionzed.
+	 * ITER_BVEC _may_ be the same size on some archs, but on others it is
+	 * not. Be safe and handle it separately.
+	 */
+	BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec));
+	if (iov_iter_is_bvec(i))
+		i->bvec -= state->nr_segs - i->nr_segs;
+	else
+		i->iov -= state->nr_segs - i->nr_segs;
+	i->nr_segs = state->nr_segs;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
@ 2021-09-10 18:50   ` Al Viro
  0 siblings, 0 replies; 10+ messages in thread
From: Al Viro @ 2021-09-10 18:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-fsdevel, torvalds

On Fri, Sep 10, 2021 at 12:25:34PM -0600, Jens Axboe wrote:
> +void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
> +		      ssize_t did_bytes)
> +{
> +	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
> +			 !iov_iter_is_kvec(i))
> +		return;
> +	i->iov_offset = state->iov_offset;
> +	i->count = state->count;
> +	/*
> +	 * For the *vec iters, nr_segs + iov is constant - if we increment
> +	 * the vec, then we also decrement the nr_segs count. Hence we don't
> +	 * need to track both of these, just one is enough and we can deduct
> +	 * the other from that. ITER_{BVEC,IOVEC,KVEC} all have their pointers
> +	 * unionized, so we don't need to handle them individually.
> +	 */
> +	i->iov -= state->nr_segs - i->nr_segs;
> +	i->nr_segs = state->nr_segs;
> +}

No.  You can have iovec and kvec handled jointly (struct iovec and struct kvec
always have the same size).  You can *not* do that to bvec - check sizeof of
struct bio_vec and struct iovec on 32bit targets.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore " Jens Axboe
@ 2021-09-10 18:25 ` Jens Axboe
  2021-09-10 18:50   ` Al Viro
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncate of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncate state, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need, support for other
iterator types are left as an exercise for the reader.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 16 ++++++++++++++++
 lib/iov_iter.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..6eaedae5ea2f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,12 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+struct iov_iter_state {
+	size_t iov_offset;
+	size_t count;
+	unsigned long nr_segs;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
@@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	state->iov_offset = iter->iov_offset;
+	state->count = iter->count;
+	state->nr_segs = iter->nr_segs;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -233,6 +247,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+			ssize_t did_bytes);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..280dbcc523e5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1972,3 +1972,39 @@ int import_single_range(int rw, void __user *buf, size_t len,
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
+
+/**
+ * iov_iter_restore() - Restore a &struct iov_iter to the same state as when
+ *     iov_iter_save_state() was called.
+ *
+ * @i: &struct iov_iter to restore
+ * @state: state to restore from
+ * @did_bytes: bytes to advance @i after restoring it
+ *
+ * Used after iov_iter_save_state() to bring restore @i, if operations may
+ * have advanced it. If @did_bytes is a positive value, then after restoring
+ * @i it is advanced accordingly. This is useful for handling short reads or
+ * writes for retry, if lower down the stack @i was advanced further than the
+ * returned value. If @did_bytes is negative (eg an error), then only the
+ * state restore is done.
+ *
+ * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC
+ */
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+		      ssize_t did_bytes)
+{
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i)) &&
+			 !iov_iter_is_kvec(i))
+		return;
+	i->iov_offset = state->iov_offset;
+	i->count = state->count;
+	/*
+	 * For the *vec iters, nr_segs + iov is constant - if we increment
+	 * the vec, then we also decrement the nr_segs count. Hence we don't
+	 * need to track both of these, just one is enough and we can deduct
+	 * the other from that. ITER_{BVEC,IOVEC,KVEC} all have their pointers
+	 * unionized, so we don't need to handle them individually.
+	 */
+	i->iov -= state->nr_segs - i->nr_segs;
+	i->nr_segs = state->nr_segs;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-09-15 16:29 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-14 14:17 [PATCHSET v2 0/3] Add ability to save/restore iov_iter state Jens Axboe
2021-09-14 14:17 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
2021-09-14 18:45   ` Linus Torvalds
2021-09-14 19:37     ` Jens Axboe
2021-09-14 23:02       ` Jens Axboe
2021-09-14 14:17 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe
2021-09-15 16:29 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore " Jens Axboe
2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
2021-09-10 18:50   ` Al Viro

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).