* [PATCHSET 0/3] Add ability to save/restore iov_iter state
@ 2021-09-10 18:25 Jens Axboe
  2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro

Hi,

Linus didn't particularly love the iov_iter->truncated addition and how
it was used, and it was hard to disagree with that. Instead of relying
on tracking ->truncated, add a few pieces of state so we can safely
handle partial or errored read/write attempts (that we want to retry).

Then we can get rid of the iov_iter addition, and at the same time
handle cases that weren't handled correctly before.

I've run this through vectored read/write with io_uring on the commonly
problematic cases (dm and low depth SCSI device) which trigger these
conditions often, and it seems to pass muster.

For a discussion on this topic, see the thread here:

https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/

You can find these patches here:

https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

-- 
Jens Axboe




* [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-10 18:25 ` Jens Axboe
  2021-09-10 18:50   ` Al Viro
  2021-09-10 18:25 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncation of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncated size, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need; support for other
iterator types is left as an exercise for the reader.
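
As a rough illustration, the intended calling pattern looks like the
below. This is just a sketch (the surrounding names are hypothetical),
not part of the patch:

	struct iov_iter_state state;
	ssize_t ret;

	iov_iter_save_state(iter, &state);
	ret = call_read_iter(file, kiocb, iter);
	if (ret == -EAGAIN) {
		/* lower layers may have advanced iter arbitrarily */
		iov_iter_restore(iter, &state, 0);
		/* ... arm a retry with a consistent iter ... */
	}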

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 16 ++++++++++++++++
 lib/iov_iter.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..6eaedae5ea2f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,12 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+struct iov_iter_state {
+	size_t iov_offset;
+	size_t count;
+	unsigned long nr_segs;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
@@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	state->iov_offset = iter->iov_offset;
+	state->count = iter->count;
+	state->nr_segs = iter->nr_segs;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -233,6 +247,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+			ssize_t did_bytes);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..280dbcc523e5 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1972,3 +1972,39 @@ int import_single_range(int rw, void __user *buf, size_t len,
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
+
+/**
+ * iov_iter_restore() - Restore a &struct iov_iter to the same state as when
+ *     iov_iter_save_state() was called.
+ *
+ * @i: &struct iov_iter to restore
+ * @state: state to restore from
+ * @did_bytes: bytes to advance @i after restoring it
+ *
+ * Used after iov_iter_save_state() to restore @i, if operations may
+ * have advanced it. If @did_bytes is a positive value, then after restoring
+ * @i it is advanced accordingly. This is useful for handling short reads or
+ * writes for retry, if lower down the stack @i was advanced further than the
+ * returned value. If @did_bytes is negative (e.g. an error), then only the
+ * state restore is done.
+ *
+ * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC
+ */
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+		      ssize_t did_bytes)
+{
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
+			 !iov_iter_is_kvec(i)))
+		return;
+	i->iov_offset = state->iov_offset;
+	i->count = state->count;
+	/*
+	 * For the *vec iters, nr_segs + iov is constant - if we increment
+	 * the vec, then we also decrement the nr_segs count. Hence we don't
+	 * need to track both of these, just one is enough and we can deduce
+	 * the other from that. ITER_{BVEC,IOVEC,KVEC} all have their pointers
+	 * unionized, so we don't need to handle them individually.
+	 */
+	i->iov -= state->nr_segs - i->nr_segs;
+	i->nr_segs = state->nr_segs;
+}
-- 
2.33.0



* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
  2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
@ 2021-09-10 18:25 ` Jens Axboe
  2021-09-10 18:25 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
  2021-09-13 22:43 ` [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
  3 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO or a failure that warrants a retry. Use the new
state save/restore helpers instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..84e33f751372 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state, 0);
 	return true;
 }
 
@@ -3437,19 +3437,22 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, state);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3466,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), state->count);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,18 +3482,17 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == state->count || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state, ret);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3501,7 +3503,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	iter = &rw->iter;
 
 	do {
-		io_size -= ret;
+		state->count -= ret;
 		rw->bytes_done += ret;
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
@@ -3520,7 +3522,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0 && ret < state->count);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3545,23 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	iov_iter_save_state(iter, state);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3578,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), state->count);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3625,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
-- 
2.33.0



* [PATCH 3/3] Revert "iov_iter: track truncated size"
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
  2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
  2021-09-10 18:25 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
@ 2021-09-10 18:25 ` Jens Axboe
  2021-09-13 22:43 ` [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
  3 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-10 18:25 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

This reverts commit 2112ff5ce0c1128fe7b4d19cfe7f2b8ce5b595fa.

We no longer need to track the truncation count; the one user that did
need it has been converted to using iov_iter_restore() instead.
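
In essence, that conversion swapped the open-coded rewind that depended
on ->truncated:

	/* some cases will consume bytes even on error returns */
	iov_iter_reexpand(iter, iter->count + iter->truncated);
	iov_iter_revert(iter, io_size - iov_iter_count(iter));

for a plain:

	iov_iter_restore(iter, state, ret2);

so nothing reads or writes ->truncated anymore.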

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6eaedae5ea2f..7c8aeacc6fa7 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -53,7 +53,6 @@ struct iov_iter {
 		};
 		loff_t xarray_start;
 	};
-	size_t truncated;
 };
 
 static inline enum iter_type iov_iter_type(const struct iov_iter *i)
@@ -271,10 +270,8 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
 	 * conversion in assignement is by definition greater than all
 	 * values of size_t, including old i->count.
 	 */
-	if (i->count > count) {
-		i->truncated += i->count - count;
+	if (i->count > count)
 		i->count = count;
-	}
 }
 
 /*
@@ -283,7 +280,6 @@ static inline void iov_iter_truncate(struct iov_iter *i, u64 count)
  */
 static inline void iov_iter_reexpand(struct iov_iter *i, size_t count)
 {
-	i->truncated -= count - i->count;
 	i->count = count;
 }
 
-- 
2.33.0



* Re: [PATCH 1/3] iov_iter: add helper to save iov_iter state
  2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
@ 2021-09-10 18:50   ` Al Viro
  2021-09-10 19:15     ` [PATCH v2 " Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Al Viro @ 2021-09-10 18:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-fsdevel, torvalds

On Fri, Sep 10, 2021 at 12:25:34PM -0600, Jens Axboe wrote:
> +void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
> +		      ssize_t did_bytes)
> +{
> +	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
> +			 !iov_iter_is_kvec(i)))
> +		return;
> +	i->iov_offset = state->iov_offset;
> +	i->count = state->count;
> +	/*
> +	 * For the *vec iters, nr_segs + iov is constant - if we increment
> +	 * the vec, then we also decrement the nr_segs count. Hence we don't
> +	 * need to track both of these, just one is enough and we can deduce
> +	 * the other from that. ITER_{BVEC,IOVEC,KVEC} all have their pointers
> +	 * unionized, so we don't need to handle them individually.
> +	 */
> +	i->iov -= state->nr_segs - i->nr_segs;
> +	i->nr_segs = state->nr_segs;
> +}

No.  You can have iovec and kvec handled jointly (struct iovec and struct kvec
always have the same size).  You can *not* do that to bvec - check sizeof of
struct bio_vec and struct iovec on 32bit targets.
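
For reference, roughly the 5.15-era layouts in question - on a 32-bit
target struct iovec and struct kvec are both 8 bytes, while struct
bio_vec is 12, so stepping the unionized pointer back by a common stride
goes wrong for bvecs:

	struct iovec {			/* 32-bit: 4 + 4 = 8 bytes */
		void __user *iov_base;
		__kernel_size_t iov_len;
	};
	struct kvec {			/* 32-bit: 4 + 4 = 8 bytes */
		void *iov_base;
		size_t iov_len;
	};
	struct bio_vec {		/* 32-bit: 4 + 4 + 4 = 12 bytes */
		struct page	*bv_page;
		unsigned int	bv_len;
		unsigned int	bv_offset;
	};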


* [PATCH v2 1/3] iov_iter: add helper to save iov_iter state
  2021-09-10 18:50   ` Al Viro
@ 2021-09-10 19:15     ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-10 19:15 UTC (permalink / raw)
  To: Al Viro; +Cc: io-uring, linux-fsdevel, torvalds

In an ideal world, when someone is passed an iov_iter and returns X bytes,
then X bytes would have been consumed/advanced from the iov_iter. But we
have use cases that always consume the entire iterator, a few examples
of that are iomap and bdev O_DIRECT. This means we cannot rely on the
state of the iov_iter once we've called ->read_iter() or ->write_iter().

This would be easier if we didn't always have to deal with truncation of
the iov_iter, as rewinding would be trivial without that. We recently
added a commit to track the truncated size, but that grew the iov_iter
by 8 bytes and wasn't the best solution.

Implement a helper to save enough of the iov_iter state to sanely restore
it after we've called the read/write iterator helpers. This currently
only works for IOVEC/BVEC/KVEC as that's all we need; support for other
iterator types is left as an exercise for the reader.

Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

V2: Don't assume bvec is the same size as iovec/kvec. Add a special
    case for it, and a BUILD_BUG_ON() for iovec/kvec sizing as well.

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5265024e8b90..6eaedae5ea2f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,12 @@ enum iter_type {
 	ITER_DISCARD,
 };
 
+struct iov_iter_state {
+	size_t iov_offset;
+	size_t count;
+	unsigned long nr_segs;
+};
+
 struct iov_iter {
 	u8 iter_type;
 	bool data_source;
@@ -55,6 +61,14 @@ static inline enum iter_type iov_iter_type(const struct iov_iter *i)
 	return i->iter_type;
 }
 
+static inline void iov_iter_save_state(struct iov_iter *iter,
+				       struct iov_iter_state *state)
+{
+	state->iov_offset = iter->iov_offset;
+	state->count = iter->count;
+	state->nr_segs = iter->nr_segs;
+}
+
 static inline bool iter_is_iovec(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_IOVEC;
@@ -233,6 +247,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+			ssize_t did_bytes);
 
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f2d50d69a6c3..b46173eb61c7 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1972,3 +1972,45 @@ int import_single_range(int rw, void __user *buf, size_t len,
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
+
+/**
+ * iov_iter_restore() - Restore a &struct iov_iter to the same state as when
+ *     iov_iter_save_state() was called.
+ *
+ * @i: &struct iov_iter to restore
+ * @state: state to restore from
+ * @did_bytes: bytes to advance @i after restoring it
+ *
+ * Used after iov_iter_save_state() to restore @i, if operations may
+ * have advanced it. If @did_bytes is a positive value, then after restoring
+ * @i it is advanced accordingly. This is useful for handling short reads or
+ * writes for retry, if lower down the stack @i was advanced further than the
+ * returned value. If @did_bytes is negative (e.g. an error), then only the
+ * state restore is done.
+ *
+ * Note: only works on ITER_IOVEC, ITER_BVEC, and ITER_KVEC
+ */
+void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state,
+		      ssize_t did_bytes)
+{
+	if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
+			 !iov_iter_is_kvec(i)))
+		return;
+	i->iov_offset = state->iov_offset;
+	i->count = state->count;
+	/*
+	 * For the *vec iters, nr_segs + iov is constant - if we increment
+	 * the vec, then we also decrement the nr_segs count. Hence we don't
+	 * need to track both of these, just one is enough and we can deduce
+	 * the other from that. ITER_KVEC and ITER_IOVEC are the same struct
+	 * size, so we can just adjust the iov pointer as they are unionized.
+	 * ITER_BVEC _may_ be the same size on some archs, but on others it is
+	 * not. Be safe and handle it separately.
+	 */
+	BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec));
+	if (iov_iter_is_bvec(i))
+		i->bvec -= state->nr_segs - i->nr_segs;
+	else
+		i->iov -= state->nr_segs - i->nr_segs;
+	i->nr_segs = state->nr_segs;
+}



* Re: [PATCHSET 0/3] Add ability to save/restore iov_iter state
  2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
                   ` (2 preceding siblings ...)
  2021-09-10 18:25 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
@ 2021-09-13 22:43 ` Jens Axboe
  2021-09-13 23:23   ` Linus Torvalds
  3 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2021-09-13 22:43 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro

On 9/10/21 12:25 PM, Jens Axboe wrote:
> Hi,
> 
> Linus didn't particularly love the iov_iter->truncated addition and how
> it was used, and it was hard to disagree with that. Instead of relying
> on tracking ->truncated, add a few pieces of state so we can safely
> handle partial or errored read/write attempts (that we want to retry).
> 
> Then we can get rid of the iov_iter addition, and at the same time
> handle cases that weren't handled correctly before.
> 
> I've run this through vectored read/write with io_uring on the commonly
> problematic cases (dm and low depth SCSI device) which trigger these
> conditions often, and it seems to pass muster.
> 
> For a discussion on this topic, see the thread here:
> 
> https://lore.kernel.org/linux-fsdevel/CAHk-=wiacKV4Gh-MYjteU0LwNBSGpWrK-Ov25HdqB1ewinrFPg@mail.gmail.com/
> 
> You can find these patches here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=iov_iter

Al, Linus, are you OK with this? I think we should get this in for 5.15.
I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
vs iovec issue. Let me know if you want the whole thing resent.

-- 
Jens Axboe



* Re: [PATCHSET 0/3] Add ability to save/restore iov_iter state
  2021-09-13 22:43 ` [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-13 23:23   ` Linus Torvalds
  2021-09-14  1:54     ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2021-09-13 23:23 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro

On Mon, Sep 13, 2021 at 3:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Al, Linus, are you OK with this? I think we should get this in for 5.15.
> I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
> vs iovec issue. Let me know if you want the whole thing resent.

So I'm ok with the iov_iter side, but the io_uring side seems still
positively buggy, and very confused.

It also messes with the state in bad ways and has internal knowledge.
And some of it looks completely bogus.

For example, I see

        state->count -= ret;
        rw->bytes_done += ret;

and I go "that's BS". There's no way it's ok to start messing with the
byte count inside the state like that. That just means that the state
is now no longer the saved state, and it's some random garbage.

I also think that the "bytes_done += ret" is a big hint there: any
time you restore the iovec state, you should then forward it by
"bytes_done". But that's not what the code does.

Instead, it will now restore the iovec state with the *wrong* number
of bytes remaining, but will start from the beginning of the iovec.

So I think the fs/io_uring.c use of this state buffer is completely wrong.

What *may* be the right thing to do is to

 (a) not mess with state->count

 (b) when you restore the state you always use

        iov_iter_restore(iter, state, bytes_done);

to actually restore the *correct* state.

Because modifying the iovec save state like that cannot be right, and
if it's right it's still too ugly and fragile for words. That save
state should be treated as a snapshot, not as a random buffer that you
can make arbitrary changes to.
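
Schematically, the shape would be something like this (a sketch, not
tested code):

	iov_iter_save_state(iter, state);	/* snapshot once, up front */
	ret = io_iter_do_read(req, iter);	/* may consume any amount */
	...
	/* on any retry path - never edit state->count by hand */
	iov_iter_restore(iter, state, rw->bytes_done);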

See what I'm saying?

I'd like Al to take a look at the io_uring.c usage too, since this was
just my reaction from looking at that diff a bit more.

           Linus


* Re: [PATCHSET 0/3] Add ability to save/restore iov_iter state
  2021-09-13 23:23   ` Linus Torvalds
@ 2021-09-14  1:54     ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-14  1:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro

On 9/13/21 5:23 PM, Linus Torvalds wrote:
> On Mon, Sep 13, 2021 at 3:43 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Al, Linus, are you OK with this? I think we should get this in for 5.15.
>> I didn't resend the whole series, just a v2 of patch 1/3 to fix that bvec
>> vs iovec issue. Let me know if you want the whole thing resent.
> 
> So I'm ok with the iov_iter side, but the io_uring side seems still
> positively buggy, and very confused.
> 
> It also messes with the state in bad ways and has internal knowledge.
> And some of it looks completely bogus.
> 
> For example, I see
> 
>         state->count -= ret;
>         rw->bytes_done += ret;
> 
> and I go "that's BS". There's no way it's ok to start messing with the
> byte count inside the state like that. That just means that the state
> is now no longer the saved state, and it's some random garbage.
>
> I also think that the "bytes_done += ret" is a big hint there: any
> time you restore the iovec state, you should then forward it by
> "bytes_done". But that's not what the code does.
> 
> Instead, it will now restore the iovec state with the *wrong* number
> of bytes remaining, but will start from the beginning of the iovec.
> 
> So I think the fs/io_uring.c use of this state buffer is completely wrong.
> 
> What *may* be the right thing to do is to
> 
>  (a) not mess with state->count
> 
>  (b) when you restore the state you always use
> 
>         iov_iter_restore(iter, state, bytes_done);
> 
> to actually restore the *correct* state.
> 
> Because modifying the iovec save state like that cannot be right, and
> if it's right it's still too ugly and fragile for words. That save
> state should be treated as a snapshot, not as a random buffer that you
> can make arbitrary changes to.
> 
> See what I'm saying?

OK, for the do while loop itself, I do think we should be more
consistent and that would also get rid of the state->count manipulation.
I do agree that messing with that state is not something that should be
done, and we can do this more cleanly and consistently instead. Once we
hit the do {} while loop, state should be &rw->iter_state and we can
consistently handle it that way.
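
Roughly the shape I have in mind (untested):

	do {
		iov_iter_advance(iter, ret);
		if (!iov_iter_count(iter))
			break;
		rw->bytes_done += ret;
		iov_iter_save_state(iter, state);
		...
	} while (ret > 0);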

Let me rework that bit and run the tests, and I'll post a v2 tomorrow.
Thanks for taking a closer look.

-- 
Jens Axboe



* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe
@ 2021-09-15 16:29 ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-15 16:29 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO or a failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries in case we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.
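
Schematically, the invariant is (a sketch of the flow below):

	/* on entry, if we have async_data: state was saved at last exit */
	iov_iter_restore(iter, state);
	/* ... issue the IO ... */
	/* before any exit that can lead to a retry */
	iov_iter_save_state(iter, state);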

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 82 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..25bda8a5a4e5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state);
 	return true;
 }
 
@@ -3310,12 +3310,17 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (!req->async_data) {
+		struct io_async_rw *iorw;
+
 		if (io_alloc_async_data(req)) {
 			kfree(iovec);
 			return -ENOMEM;
 		}
 
 		io_req_map_rw(req, iovec, fast_iov, iter);
+		iorw = req->async_data;
+		/* we've copied and mapped the iter, ensure state is saved */
+		iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	}
 	return 0;
 }
@@ -3334,6 +3339,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
 	iorw->free_iovec = iov;
 	if (iov)
 		req->flags |= REQ_F_NEED_CLEANUP;
+	iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	return 0;
 }
 
@@ -3437,19 +3443,28 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		/*
+		 * We come here from an earlier attempt, restore our state to
+		 * match in case it doesn't. It's cheap enough that we don't
+		 * need to make this conditional.
+		 */
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3478,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,30 +3494,49 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == req->result || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	/*
+	 * Don't depend on the iter state matching what was consumed, or being
+	 * untouched in case of error. Restore it and we'll advance it
+	 * manually if we need to.
+	 */
+	iov_iter_restore(iter, state);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
 
 	iovec = NULL;
 	rw = req->async_data;
-	/* now use our persistent iterator, if we aren't already */
-	iter = &rw->iter;
+	/*
+	 * Now use our persistent iterator and state, if we aren't already.
+	 * We've restored and mapped the iter to match.
+	 */
+	if (iter != &rw->iter) {
+		iter = &rw->iter;
+		state = &rw->iter_state;
+	}
 
 	do {
-		io_size -= ret;
+		/*
+		 * We end up here because of a partial read, either from
+		 * above or inside this loop. Advance the iter by the bytes
+		 * that were consumed.
+		 */
+		iov_iter_advance(iter, ret);
+		if (!iov_iter_count(iter))
+			break;
 		rw->bytes_done += ret;
+		iov_iter_save_state(iter, state);
+
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
 			kiocb->ki_flags &= ~IOCB_WAITQ;
@@ -3520,7 +3554,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+		iov_iter_restore(iter, state);
+	} while (ret > 0);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3578,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3612,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3659,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state);
+		if (ret2 > 0)
+			iov_iter_advance(iter, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
-- 
2.33.0



* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 19:37     ` Jens Axboe
@ 2021-09-14 23:02       ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2021-09-14 23:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro

On 9/14/21 1:37 PM, Jens Axboe wrote:
> On 9/14/21 12:45 PM, Linus Torvalds wrote:
>> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>>
>>> +       iov_iter_restore(iter, state);
>>> +
>> ...
>>>                 rw->bytes_done += ret;
>>> +               iov_iter_advance(iter, ret);
>>> +               if (!iov_iter_count(iter))
>>> +                       break;
>>> +               iov_iter_save_state(iter, state);
>>
>> Ok, so now you keep iov_iter and the state always in sync by just
>> always resetting the iter back and then walking it forward explicitly
>> - and re-saving the state.
>>
>> That seems safe, if potentially unnecessarily expensive.
> 
> Right, it's not ideal if it's a big range of IO, then it'll definitely
> be noticeable. But not too worried about it, at least not for now...
> 
>> I guess re-walking lots of iovec entries is actually very unlikely in
>> practice, so maybe this "stupid brute-force" model is the right one.
> 
> Not sure what the alternative is here. We could do something similar to
> __io_import_fixed() as we're only dealing with iter types where we can
> do that, but probably best left as a later optimization if it's deemed
> necessary.
> 
>> I do find the odd "use __state vs rw->state" to be very confusing,
>> though. Particularly in io_read(), where you do this:
>>
>> +       iov_iter_restore(iter, state);
>> +
>>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>>         if (ret2)
>>                 return ret2;
>>
>>         iovec = NULL;
>>         rw = req->async_data;
>> -       /* now use our persistent iterator, if we aren't already */
>> -       iter = &rw->iter;
>> +       /* now use our persistent iterator and state, if we aren't already */
>> +       if (iter != &rw->iter) {
>> +               iter = &rw->iter;
>> +               state = &rw->iter_state;
>> +       }
>>
>>         do {
>> -               io_size -= ret;
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
>>
>>
>> Note how it first does that iov_iter_restore() on iter/state, but
>> then it *replaces* the iter/state pointers, and then it does
>> iov_iter_advance() on the replacement ones.
> 
> We restore the iter so it's the same as before we did the read_iter
> call, and then setup a consistent copy of the iov/iter in case we need
> to punt this request for retry. rw->iter should have the same state as
> iter at this point, and since rw->iter is the copy we'll use going
> forward, we're advancing that one in case ret > 0.
> 
> The other case is that no persistent state is needed, and then iter
> remains the same.
> 
> I'll take a second look at this part and see if I can make it a bit more
> straightforward, or at least comment it properly.

I hacked up something that shortens the iter for the initial IO, so we
could more easily test the retry path and the state. It really is a
hack, but the idea was to issue 64K io from fio, and then the initial
attempt would be anywhere from 4K-60K truncated. That forces retry.
I ran this with both 16 segments and 8 segments, verifying that it
hits both the UIO_FASTIOV and alloc path.

I did find one issue with that, see the last hunk in the hack. We
need to increment rw->bytes_done if we don't break, or set ret to
0 if we do. Otherwise that last ret ends up being accounted twice.
But apart from that, it passes data verification runs.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc1ff47e3221..484c86252f9d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -744,6 +744,7 @@ enum {
 	REQ_F_NOWAIT_READ_BIT,
 	REQ_F_NOWAIT_WRITE_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_TRUNCATED_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -797,6 +798,7 @@ enum {
 	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
 	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_TRUNCATED		= BIT(REQ_F_TRUNCATED_BIT),
 };
 
 struct async_poll {
@@ -3454,11 +3456,12 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw.kiocb;
-	struct iov_iter __iter, *iter = &__iter;
+	struct iov_iter __i, __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct iov_iter_state __state, *state;
 	ssize_t ret, ret2;
+	bool do_restore = false;
 
 	if (rw) {
 		iter = &rw->iter;
@@ -3492,8 +3495,25 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	}
 
+	if (!(req->flags & REQ_F_TRUNCATED) && !(iov_iter_count(iter) & 4095)) {
+		int nr_vecs;
+
+		__i = *iter;
+		nr_vecs = 1 + (prandom_u32() % iter->nr_segs);
+		iter->nr_segs = nr_vecs;
+		iter->count = nr_vecs * 8192;
+		req->flags |= REQ_F_TRUNCATED;
+		do_restore = true;
+	}
+
 	ret = io_iter_do_read(req, iter);
 
+	if (ret == -EAGAIN) {
+		req->flags &= ~REQ_F_TRUNCATED;
+		*iter = __i;
+		do_restore = false;
+	}
+
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
 		/* IOPOLL retry should happen for io-wq threads */
@@ -3513,6 +3533,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	iov_iter_restore(iter, state);
 
+	if (do_restore)
+		*iter = __i;
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
@@ -3526,10 +3549,10 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	do {
-		rw->bytes_done += ret;
 		iov_iter_advance(iter, ret);
 		if (!iov_iter_count(iter))
 			break;
+		rw->bytes_done += ret;
 		iov_iter_save_state(iter, state);
 
 		/* if we can retry, do so with the callbacks armed */

-- 
Jens Axboe



* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 18:45   ` Linus Torvalds
@ 2021-09-14 19:37     ` Jens Axboe
  2021-09-14 23:02       ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2021-09-14 19:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: io-uring, linux-fsdevel, Al Viro

On 9/14/21 12:45 PM, Linus Torvalds wrote:
> On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>
>> +       iov_iter_restore(iter, state);
>> +
> ...
>>                 rw->bytes_done += ret;
>> +               iov_iter_advance(iter, ret);
>> +               if (!iov_iter_count(iter))
>> +                       break;
>> +               iov_iter_save_state(iter, state);
> 
> Ok, so now you keep iov_iter and the state always in sync by just
> always resetting the iter back and then walking it forward explicitly
> - and re-saving the state.
> 
> That seems safe, if potentially unnecessarily expensive.

Right, it's not ideal if it's a big range of IO, then it'll definitely
be noticeable. But not too worried about it, at least not for now...

> I guess re-walking lots of iovec entries is actually very unlikely in
> practice, so maybe this "stupid brute-force" model is the right one.

Not sure what the alternative is here. We could do something similar to
__io_import_fixed() as we're only dealing with iter types where we can
do that, but probably best left as a later optimization if it's deemed
necessary.

> I do find the odd "use __state vs rw->state" to be very confusing,
> though. Particularly in io_read(), where you do this:
> 
> +       iov_iter_restore(iter, state);
> +
>         ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
>         if (ret2)
>                 return ret2;
> 
>         iovec = NULL;
>         rw = req->async_data;
> -       /* now use our persistent iterator, if we aren't already */
> -       iter = &rw->iter;
> +       /* now use our persistent iterator and state, if we aren't already */
> +       if (iter != &rw->iter) {
> +               iter = &rw->iter;
> +               state = &rw->iter_state;
> +       }
> 
>         do {
> -               io_size -= ret;
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);
> 
> 
> Note how it first does that iov_iter_restore() on iter/state, but
> then it *replaces* the iter/state pointers, and then it does
> iov_iter_advance() on the replacement ones.

We restore the iter so it's the same as before we did the read_iter
call, and then setup a consistent copy of the iov/iter in case we need
to punt this request for retry. rw->iter should have the same state as
iter at this point, and since rw->iter is the copy we'll use going
forward, we're advancing that one in case ret > 0.

The other case is that no persistent state is needed, and then iter
remains the same.

I'll take a second look at this part and see if I can make it a bit more
straightforward, or at least comment it properly.

-- 
Jens Axboe



* Re: [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
@ 2021-09-14 18:45   ` Linus Torvalds
  2021-09-14 19:37     ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2021-09-14 18:45 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, linux-fsdevel, Al Viro

On Tue, Sep 14, 2021 at 7:18 AM Jens Axboe <axboe@kernel.dk> wrote:
>
>
> +       iov_iter_restore(iter, state);
> +
...
>                 rw->bytes_done += ret;
> +               iov_iter_advance(iter, ret);
> +               if (!iov_iter_count(iter))
> +                       break;
> +               iov_iter_save_state(iter, state);

Ok, so now you keep iov_iter and the state always in sync by just
always resetting the iter back and then walking it forward explicitly
- and re-saving the state.

That seems safe, if potentially unnecessarily expensive.

I guess re-walking lots of iovec entries is actually very unlikely in
practice, so maybe this "stupid brute-force" model is the right one.

I do find the odd "use __state vs rw->state" to be very confusing,
though. Particularly in io_read(), where you do this:

+       iov_iter_restore(iter, state);
+
        ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
        if (ret2)
                return ret2;

        iovec = NULL;
        rw = req->async_data;
-       /* now use our persistent iterator, if we aren't already */
-       iter = &rw->iter;
+       /* now use our persistent iterator and state, if we aren't already */
+       if (iter != &rw->iter) {
+               iter = &rw->iter;
+               state = &rw->iter_state;
+       }

        do {
-               io_size -= ret;
                rw->bytes_done += ret;
+               iov_iter_advance(iter, ret);
+               if (!iov_iter_count(iter))
+                       break;
+               iov_iter_save_state(iter, state);


Note how it first does that iov_iter_restore() on iter/state, but
then it *replaces* the iter/state pointers, and then it does
iov_iter_advance() on the replacement ones.

I don't see how that could be right. You're doing iov_iter_advance()
on something else than the one you restored to the original values.

And if it is right, it's sure confusing as hell.

           Linus


* [PATCH 2/3] io_uring: use iov_iter state save/restore helpers
  2021-09-14 14:17 [PATCHSET v2 " Jens Axboe
@ 2021-09-14 14:17 ` Jens Axboe
  2021-09-14 18:45   ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2021-09-14 14:17 UTC (permalink / raw)
  To: io-uring, linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Get rid of the need to do re-expand and revert on an iterator when we
encounter a short IO or a failure that warrants a retry. Use the new
state save/restore helpers instead.

We keep the iov_iter_state persistent across retries in case we need to
restart the read or write operation. If there's a pending retry, the
operation will always exit with the state correctly saved.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 62 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 855ea544807f..dbc97d440801 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -712,6 +712,7 @@ struct io_async_rw {
 	struct iovec			fast_iov[UIO_FASTIOV];
 	const struct iovec		*free_iovec;
 	struct iov_iter			iter;
+	struct iov_iter_state		iter_state;
 	size_t				bytes_done;
 	struct wait_page_queue		wpq;
 };
@@ -2608,8 +2609,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 
 	if (!rw)
 		return !io_req_prep_async(req);
-	/* may have left rw->iter inconsistent on -EIOCBQUEUED */
-	iov_iter_revert(&rw->iter, req->result - iov_iter_count(&rw->iter));
+	iov_iter_restore(&rw->iter, &rw->iter_state);
 	return true;
 }
 
@@ -3310,12 +3310,16 @@ static int io_setup_async_rw(struct io_kiocb *req, const struct iovec *iovec,
 	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (!req->async_data) {
+		struct io_async_rw *iorw;
+
 		if (io_alloc_async_data(req)) {
 			kfree(iovec);
 			return -ENOMEM;
 		}
 
 		io_req_map_rw(req, iovec, fast_iov, iter);
+		iorw = req->async_data;
+		iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	}
 	return 0;
 }
@@ -3334,6 +3338,7 @@ static inline int io_rw_prep_async(struct io_kiocb *req, int rw)
 	iorw->free_iovec = iov;
 	if (iov)
 		req->flags |= REQ_F_NEED_CLEANUP;
+	iov_iter_save_state(&iorw->iter, &iorw->iter_state);
 	return 0;
 }
 
@@ -3437,19 +3442,23 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t io_size, ret, ret2;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(READ, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3463,7 +3472,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3479,30 +3488,36 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		/* no retry on NONBLOCK nor RWF_NOWAIT */
 		if (req->flags & REQ_F_NOWAIT)
 			goto done;
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
 		ret = 0;
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
-	} else if (ret <= 0 || ret == io_size || !force_nonblock ||
+	} else if (ret <= 0 || ret == req->result || !force_nonblock ||
 		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
 
+	iov_iter_restore(iter, state);
+
 	ret2 = io_setup_async_rw(req, iovec, inline_vecs, iter, true);
 	if (ret2)
 		return ret2;
 
 	iovec = NULL;
 	rw = req->async_data;
-	/* now use our persistent iterator, if we aren't already */
-	iter = &rw->iter;
+	/* now use our persistent iterator and state, if we aren't already */
+	if (iter != &rw->iter) {
+		iter = &rw->iter;
+		state = &rw->iter_state;
+	}
 
 	do {
-		io_size -= ret;
 		rw->bytes_done += ret;
+		iov_iter_advance(iter, ret);
+		if (!iov_iter_count(iter))
+			break;
+		iov_iter_save_state(iter, state);
+
 		/* if we can retry, do so with the callbacks armed */
 		if (!io_rw_should_retry(req)) {
 			kiocb->ki_flags &= ~IOCB_WAITQ;
@@ -3520,7 +3535,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return 0;
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
-	} while (ret > 0 && ret < io_size);
+	} while (ret > 0);
 done:
 	kiocb_done(kiocb, ret, issue_flags);
 out_free:
@@ -3543,19 +3558,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct iov_iter __iter, *iter = &__iter;
 	struct io_async_rw *rw = req->async_data;
-	ssize_t ret, ret2, io_size;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	struct iov_iter_state __state, *state;
+	ssize_t ret, ret2;
 
 	if (rw) {
 		iter = &rw->iter;
+		state = &rw->iter_state;
+		iov_iter_restore(iter, state);
 		iovec = NULL;
 	} else {
 		ret = io_import_iovec(WRITE, req, &iovec, iter, !force_nonblock);
 		if (ret < 0)
 			return ret;
+		state = &__state;
+		iov_iter_save_state(iter, state);
 	}
-	io_size = iov_iter_count(iter);
-	req->result = io_size;
+	req->result = iov_iter_count(iter);
+	ret2 = 0;
 
 	/* Ensure we clear previously set non-block flag */
 	if (!force_nonblock)
@@ -3572,7 +3592,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	    (req->flags & REQ_F_ISREG))
 		goto copy_iov;
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), io_size);
+	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -3619,9 +3639,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb_done(kiocb, ret2, issue_flags);
 	} else {
 copy_iov:
-		/* some cases will consume bytes even on error returns */
-		iov_iter_reexpand(iter, iter->count + iter->truncated);
-		iov_iter_revert(iter, io_size - iov_iter_count(iter));
+		iov_iter_restore(iter, state);
+		if (ret2 > 0)
+			iov_iter_advance(iter, ret2);
 		ret = io_setup_async_rw(req, iovec, inline_vecs, iter, false);
 		return ret ?: -EAGAIN;
 	}
-- 
2.33.0



Thread overview: 14+ messages
2021-09-10 18:25 [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
2021-09-10 18:25 ` [PATCH 1/3] iov_iter: add helper to save " Jens Axboe
2021-09-10 18:50   ` Al Viro
2021-09-10 19:15     ` [PATCH v2 " Jens Axboe
2021-09-10 18:25 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
2021-09-10 18:25 ` [PATCH 3/3] Revert "iov_iter: track truncated size" Jens Axboe
2021-09-13 22:43 ` [PATCHSET 0/3] Add ability to save/restore iov_iter state Jens Axboe
2021-09-13 23:23   ` Linus Torvalds
2021-09-14  1:54     ` Jens Axboe
2021-09-14 14:17 [PATCHSET v2 " Jens Axboe
2021-09-14 14:17 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
2021-09-14 18:45   ` Linus Torvalds
2021-09-14 19:37     ` Jens Axboe
2021-09-14 23:02       ` Jens Axboe
2021-09-15 16:29 [PATCHSET v3 0/3] Add ability to save/restore iov_iter state Jens Axboe
2021-09-15 16:29 ` [PATCH 2/3] io_uring: use iov_iter state save/restore helpers Jens Axboe
