* [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write
@ 2022-02-21 14:16 Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 1/4] io_uring: remove duplicated calls to io_kiocb_ppos Dylan Yudaken
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 14:16 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team, Dylan Yudaken

Currently submitting multiple read/write for one file with offset = -1 will
not behave as if calling read(2)/write(2) multiple times. The offset may be
pinned to the same value for each submission (for example if they are
punted to the async worker) and so each read/write will have the same
offset.

This patch series fixes this.

Patches 1 and 3 clean up the code a bit

Patch 2 grabs the file position at execution time, rather than when the job
is queued to run, which fixes inconsistencies when jobs are run asynchronously.

Patch 4 increments the file's f_pos when reading it, which fixes
inconsistencies with concurrent runs.

A test for this will be submitted to liburing separately.
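For illustration, the kind of submission that hits this looks roughly like
the liburing sketch below (an open fd is assumed, and completion/error
handling is omitted; this is a sketch, not the actual test):

#include <liburing.h>

/* Two linked reads with offset = -1 ("use and update the file position").
 * Before this series, both reads could be punted with the same pinned
 * offset and so return the same bytes. */
static void linked_reads(int fd)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	static char buf1[4096], buf2[4096];

	io_uring_queue_init(8, &ring, 0);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf1, sizeof(buf1), -1);
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf2, sizeof(buf2), -1);

	io_uring_submit(&ring);
	io_uring_queue_exit(&ring);
}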

v2:
  * added patch 4 which fixes cases where IOSQE_IO_LINK is not set

Dylan Yudaken (4):
  io_uring: remove duplicated calls to io_kiocb_ppos
  io_uring: update kiocb->ki_pos at execution time
  io_uring: do not recalculate ppos unnecessarily
  io_uring: pre-increment f_pos on rw

 fs/io_uring.c | 103 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 88 insertions(+), 15 deletions(-)


base-commit: 754e0b0e35608ed5206d6a67a791563c631cec07
-- 
2.30.2



* [PATCH v2 1/4] io_uring: remove duplicated calls to io_kiocb_ppos
  2022-02-21 14:16 [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Dylan Yudaken
@ 2022-02-21 14:16 ` Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 2/4] io_uring: update kiocb->ki_pos at execution time Dylan Yudaken
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 14:16 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team, Dylan Yudaken

io_kiocb_ppos() is called in both branches, and it seems that the compiler
does not fuse the two calls. Fusing them removes a few bytes from
loop_rw_iter().
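For reference, io_kiocb_ppos() at this point in the tree is roughly the
following (quoted here for context only, not part of the patch):

static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
{
	return (kiocb->ki_filp->f_mode & FMODE_STREAM) ? NULL : &kiocb->ki_pos;
}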

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 fs/io_uring.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 77b9c7e4793b..1f9b4466c269 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3400,6 +3400,7 @@ static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	struct file *file = req->file;
 	ssize_t ret = 0;
+	loff_t *ppos;
 
 	/*
 	 * Don't support polled IO through this interface, and we can't
@@ -3412,6 +3413,8 @@ static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
 	    !(kiocb->ki_filp->f_flags & O_NONBLOCK))
 		return -EAGAIN;
 
+	ppos = io_kiocb_ppos(kiocb);
+
 	while (iov_iter_count(iter)) {
 		struct iovec iovec;
 		ssize_t nr;
@@ -3425,10 +3428,10 @@ static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
 
 		if (rw == READ) {
 			nr = file->f_op->read(file, iovec.iov_base,
-					      iovec.iov_len, io_kiocb_ppos(kiocb));
+					      iovec.iov_len, ppos);
 		} else {
 			nr = file->f_op->write(file, iovec.iov_base,
-					       iovec.iov_len, io_kiocb_ppos(kiocb));
+					       iovec.iov_len, ppos);
 		}
 
 		if (nr < 0) {
-- 
2.30.2



* [PATCH v2 2/4] io_uring: update kiocb->ki_pos at execution time
  2022-02-21 14:16 [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 1/4] io_uring: remove duplicated calls to io_kiocb_ppos Dylan Yudaken
@ 2022-02-21 14:16 ` Dylan Yudaken
  2022-02-21 16:32   ` Jens Axboe
  2022-02-21 14:16 ` [PATCH v2 3/4] io_uring: do not recalculate ppos unnecessarily Dylan Yudaken
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 14:16 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team, Dylan Yudaken

Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker, and so the
offset might be read multiple times before the request is executed.

This ensures that the file position in a set of _linked_ SQEs will only be
obtained after earlier SQEs have completed, and so will include their
incremented file position.
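For two linked reads at offset -1, the old and new orderings are roughly
(commented pseudocode, not actual kernel code):

/*
 * Before:
 *   io_prep_rw(req1);   // ki_pos = f_pos (say 0)
 *   io_prep_rw(req2);   // ki_pos = f_pos = 0 as well - captured too early
 *   <req1 executes>     // reads at 0; f_pos advanced at completion
 *   <req2 executes>     // still reads at its stale ki_pos of 0
 *
 * After: ki_pos is resolved at execution time in io_read()/io_write(),
 * so req2 picks up the f_pos that req1's completion advanced.
 */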

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 fs/io_uring.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1f9b4466c269..50b93ff2ee12 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3000,14 +3000,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		req->flags |= io_file_get_flags(file) << REQ_F_SUPPORT_NOWAIT_BIT;
 
 	kiocb->ki_pos = READ_ONCE(sqe->off);
-	if (kiocb->ki_pos == -1) {
-		if (!(file->f_mode & FMODE_STREAM)) {
-			req->flags |= REQ_F_CUR_POS;
-			kiocb->ki_pos = file->f_pos;
-		} else {
-			kiocb->ki_pos = 0;
-		}
-	}
 	kiocb->ki_flags = iocb_flags(file);
 	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
 	if (unlikely(ret))
@@ -3074,6 +3066,19 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static inline void
+io_kiocb_update_pos(struct io_kiocb *req, struct kiocb *kiocb)
+{
+	if (kiocb->ki_pos == -1) {
+		if (!(req->file->f_mode & FMODE_STREAM)) {
+			req->flags |= REQ_F_CUR_POS;
+			kiocb->ki_pos = req->file->f_pos;
+		} else {
+			kiocb->ki_pos = 0;
+		}
+	}
+}
+
 static void kiocb_done(struct io_kiocb *req, ssize_t ret,
 		       unsigned int issue_flags)
 {
@@ -3662,6 +3667,8 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
+	io_kiocb_update_pos(req, kiocb);
+
 	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
@@ -3791,6 +3798,8 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
+	io_kiocb_update_pos(req, kiocb);
+
 	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
 	if (unlikely(ret))
 		goto out_free;
-- 
2.30.2



* [PATCH v2 3/4] io_uring: do not recalculate ppos unnecessarily
  2022-02-21 14:16 [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 1/4] io_uring: remove duplicated calls to io_kiocb_ppos Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 2/4] io_uring: update kiocb->ki_pos at execution time Dylan Yudaken
@ 2022-02-21 14:16 ` Dylan Yudaken
  2022-02-21 14:16 ` [PATCH v2 4/4] io_uring: pre-increment f_pos on rw Dylan Yudaken
  2022-02-21 16:33 ` [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Jens Axboe
  4 siblings, 0 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 14:16 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team, Dylan Yudaken

There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.

It seems code size drops by a bit.

Before:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

After:
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write

Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 fs/io_uring.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 50b93ff2ee12..abd8c739988e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3066,17 +3066,21 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
-static inline void
+static inline loff_t*
 io_kiocb_update_pos(struct io_kiocb *req, struct kiocb *kiocb)
 {
+	bool is_stream = req->file->f_mode & FMODE_STREAM;
 	if (kiocb->ki_pos == -1) {
-		if (!(req->file->f_mode & FMODE_STREAM)) {
+		if (!is_stream) {
 			req->flags |= REQ_F_CUR_POS;
 			kiocb->ki_pos = req->file->f_pos;
+			return &kiocb->ki_pos;
 		} else {
 			kiocb->ki_pos = 0;
+			return NULL;
 		}
 	}
+	return is_stream ? NULL : &kiocb->ki_pos;
 }
 
 static void kiocb_done(struct io_kiocb *req, ssize_t ret,
@@ -3637,6 +3641,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct io_async_rw *rw;
 	ssize_t ret, ret2;
+	loff_t *ppos;
 
 	if (!req_has_async_data(req)) {
 		ret = io_import_iovec(READ, req, &iovec, s, issue_flags);
@@ -3667,9 +3672,9 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
-	io_kiocb_update_pos(req, kiocb);
+	ppos = io_kiocb_update_pos(req, kiocb);
 
-	ret = rw_verify_area(READ, req->file, io_kiocb_ppos(kiocb), req->result);
+	ret = rw_verify_area(READ, req->file, ppos, req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
 		return ret;
@@ -3768,6 +3773,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	struct kiocb *kiocb = &req->rw.kiocb;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	ssize_t ret, ret2;
+	loff_t *ppos;
 
 	if (!req_has_async_data(req)) {
 		ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags);
@@ -3798,9 +3804,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
-	io_kiocb_update_pos(req, kiocb);
+	ppos = io_kiocb_update_pos(req, kiocb);
 
-	ret = rw_verify_area(WRITE, req->file, io_kiocb_ppos(kiocb), req->result);
+	ret = rw_verify_area(WRITE, req->file, ppos, req->result);
 	if (unlikely(ret))
 		goto out_free;
 
-- 
2.30.2



* [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-21 14:16 [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Dylan Yudaken
                   ` (2 preceding siblings ...)
  2022-02-21 14:16 ` [PATCH v2 3/4] io_uring: do not recalculate ppos unnecessarily Dylan Yudaken
@ 2022-02-21 14:16 ` Dylan Yudaken
  2022-02-21 18:00   ` Pavel Begunkov
  2022-02-22  7:34   ` Hao Xu
  2022-02-21 16:33 ` [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Jens Axboe
  4 siblings, 2 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 14:16 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team, Dylan Yudaken

In read/write ops, preincrement f_pos when no offset is specified, and
then attempt to fix up the position after IO completes if it completed less
than expected. This fixes the problem where multiple queued up IO will all
obtain the same f_pos, and so perform the same read/write.

This is still not as consistent as sync r/w, as it is able to advance the
file offset past the end of the file. It seems it would be quite a
performance hit to work around this limitation - such as by keeping track
of concurrent operations - and the downside does not seem to be too
problematic.

The attempt to fix up the f_pos afterwards will at least mean that in
situations where a single operation is run, the position will be consistent.
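As a rough sketch, the scheme is (simplified pseudocode of the diff below,
with the f_pos_lock handling elided):

/*
 * Issue time (io_kiocb_update_pos), offset == -1 on a regular file:
 *   ki_pos  = f_pos;
 *   f_pos  += expected;          // pre-increment by req->rw.len
 *
 * The IO itself then advances ki_pos by 'actual' bytes.
 *
 * Completion time (io_kiocb_done_pos), only if actual < expected:
 *   if (f_pos == ki_pos + (expected - actual))
 *           f_pos = ki_pos;      // give back the unconsumed bytes
 *   // else something else moved f_pos; leave it alone
 */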

Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
---
 fs/io_uring.c | 81 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index abd8c739988e..a951d0754899 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3066,21 +3066,71 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
-static inline loff_t*
-io_kiocb_update_pos(struct io_kiocb *req, struct kiocb *kiocb)
+static inline bool
+io_kiocb_update_pos(struct io_kiocb *req, struct kiocb *kiocb,
+		loff_t **ppos, u64 expected, bool force_nonblock)
 {
 	bool is_stream = req->file->f_mode & FMODE_STREAM;
 	if (kiocb->ki_pos == -1) {
 		if (!is_stream) {
-			req->flags |= REQ_F_CUR_POS;
+			*ppos = &kiocb->ki_pos;
+			WARN_ON(req->flags & REQ_F_CUR_POS);
+			if (req->file->f_mode & FMODE_ATOMIC_POS) {
+				if (force_nonblock) {
+					if (!mutex_trylock(&req->file->f_pos_lock))
+						return true;
+				} else {
+					mutex_lock(&req->file->f_pos_lock);
+				}
+			}
 			kiocb->ki_pos = req->file->f_pos;
-			return &kiocb->ki_pos;
+			req->flags |= REQ_F_CUR_POS;
+			req->file->f_pos += expected;
+			if (req->file->f_mode & FMODE_ATOMIC_POS)
+				mutex_unlock(&req->file->f_pos_lock);
+			return false;
 		} else {
 			kiocb->ki_pos = 0;
-			return NULL;
+			*ppos = NULL;
+			return false;
 		}
 	}
-	return is_stream ? NULL : &kiocb->ki_pos;
+	*ppos = is_stream ? NULL : &kiocb->ki_pos;
+	return false;
+}
+
+static inline void
+io_kiocb_done_pos(struct io_kiocb *req, struct kiocb *kiocb, u64 actual)
+{
+	u64 expected;
+
+	if (likely(!(req->flags & REQ_F_CUR_POS)))
+		return;
+
+	expected = req->rw.len;
+	if (actual >= expected)
+		return;
+
+	/*
+	 * It's not definitely safe to lock here, and the assumption is,
+	 * that if we cannot lock the position that it will be changing,
+	 * and if it will be changing - then we can't update it anyway
+	 */
+	if (req->file->f_mode & FMODE_ATOMIC_POS
+		&& !mutex_trylock(&req->file->f_pos_lock))
+		return;
+
+	/*
+	 * now we want to move the pointer, but only if everything is consistent
+	 * with how we left it originally
+	 */
+	if (req->file->f_pos == kiocb->ki_pos + (expected - actual))
+		req->file->f_pos = kiocb->ki_pos;
+
+	/* else something else messed with f_pos and we can't do anything */
+
+	if (req->file->f_mode & FMODE_ATOMIC_POS)
+		mutex_unlock(&req->file->f_pos_lock);
 }
 
 static void kiocb_done(struct io_kiocb *req, ssize_t ret,
@@ -3096,8 +3146,7 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
 			ret += io->bytes_done;
 	}
 
-	if (req->flags & REQ_F_CUR_POS)
-		req->file->f_pos = req->rw.kiocb.ki_pos;
+	io_kiocb_done_pos(req, &req->rw.kiocb, ret >= 0 ? ret : 0);
 	if (ret >= 0 && (req->rw.kiocb.ki_complete == io_complete_rw))
 		__io_complete_rw(req, ret, issue_flags);
 	else
@@ -3662,21 +3711,23 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (force_nonblock) {
 		/* If the file doesn't support async, just async punt */
-		if (unlikely(!io_file_supports_nowait(req))) {
+		if (unlikely(!io_file_supports_nowait(req) ||
+				io_kiocb_update_pos(req, kiocb, &ppos,
+						req->rw.len, true))) {
 			ret = io_setup_async_rw(req, iovec, s, true);
 			return ret ?: -EAGAIN;
 		}
 		kiocb->ki_flags |= IOCB_NOWAIT;
 	} else {
+		io_kiocb_update_pos(req, kiocb, &ppos, req->rw.len, false);
 		/* Ensure we clear previously set non-block flag */
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
-	ppos = io_kiocb_update_pos(req, kiocb);
-
 	ret = rw_verify_area(READ, req->file, ppos, req->result);
 	if (unlikely(ret)) {
 		kfree(iovec);
+		io_kiocb_done_pos(req, kiocb, 0);
 		return ret;
 	}
 
@@ -3798,14 +3849,17 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		    (req->flags & REQ_F_ISREG))
 			goto copy_iov;
 
+		/* if we cannot lock the file position then punt */
+		if (unlikely(io_kiocb_update_pos(req, kiocb, &ppos, req->rw.len, true)))
+			goto copy_iov;
+
 		kiocb->ki_flags |= IOCB_NOWAIT;
 	} else {
+		io_kiocb_update_pos(req, kiocb, &ppos, req->rw.len, false);
 		/* Ensure we clear previously set non-block flag */
 		kiocb->ki_flags &= ~IOCB_NOWAIT;
 	}
 
-	ppos = io_kiocb_update_pos(req, kiocb);
-
 	ret = rw_verify_area(WRITE, req->file, ppos, req->result);
 	if (unlikely(ret))
 		goto out_free;
@@ -3858,6 +3912,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		return ret ?: -EAGAIN;
 	}
 out_free:
+	io_kiocb_done_pos(req, kiocb, 0);
 	/* it's reportedly faster than delegating the null check to kfree() */
 	if (iovec)
 		kfree(iovec);
-- 
2.30.2



* Re: [PATCH v2 2/4] io_uring: update kiocb->ki_pos at execution time
  2022-02-21 14:16 ` [PATCH v2 2/4] io_uring: update kiocb->ki_pos at execution time Dylan Yudaken
@ 2022-02-21 16:32   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2022-02-21 16:32 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov, io-uring; +Cc: linux-kernel, kernel-team

On 2/21/22 7:16 AM, Dylan Yudaken wrote:
> Update kiocb->ki_pos at execution time rather than in io_prep_rw().
> io_prep_rw() happens before the job is enqueued to a worker, and so the
> offset might be read multiple times before the request is executed.
> 
> This ensures that the file position in a set of _linked_ SQEs will only be
> obtained after earlier SQEs have completed, and so will include their
> incremented file position.
> 
> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---
>  fs/io_uring.c | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 1f9b4466c269..50b93ff2ee12 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3000,14 +3000,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  		req->flags |= io_file_get_flags(file) << REQ_F_SUPPORT_NOWAIT_BIT;
>  
>  	kiocb->ki_pos = READ_ONCE(sqe->off);
> -	if (kiocb->ki_pos == -1) {
> -		if (!(file->f_mode & FMODE_STREAM)) {
> -			req->flags |= REQ_F_CUR_POS;
> -			kiocb->ki_pos = file->f_pos;
> -		} else {
> -			kiocb->ki_pos = 0;
> -		}
> -	}
>  	kiocb->ki_flags = iocb_flags(file);
>  	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
>  	if (unlikely(ret))
> @@ -3074,6 +3066,19 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
>  	}
>  }
>  
> +static inline void
> +io_kiocb_update_pos(struct io_kiocb *req, struct kiocb *kiocb)
> +{
> +	if (kiocb->ki_pos == -1) {
> +		if (!(req->file->f_mode & FMODE_STREAM)) {
> +			req->flags |= REQ_F_CUR_POS;
> +			kiocb->ki_pos = req->file->f_pos;
> +		} else {
> +			kiocb->ki_pos = 0;
> +		}
> +	}
> +}



Can we just drop the kiocb argument? It'll always be req->rw.kiocb.
Should generate the same code if you do:

static inline void io_kiocb_update_pos(struct io_kiocb *req)
{
	struct kiocb *kiocb = &req->rw.kiocb;

	...
}

Apart from that minor thing, looks good to me.

-- 
Jens Axboe



* Re: [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write
  2022-02-21 14:16 [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Dylan Yudaken
                   ` (3 preceding siblings ...)
  2022-02-21 14:16 ` [PATCH v2 4/4] io_uring: pre-increment f_pos on rw Dylan Yudaken
@ 2022-02-21 16:33 ` Jens Axboe
  2022-02-21 17:48   ` Dylan Yudaken
  4 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2022-02-21 16:33 UTC (permalink / raw)
  To: Dylan Yudaken, Pavel Begunkov, io-uring; +Cc: linux-kernel, kernel-team

On 2/21/22 7:16 AM, Dylan Yudaken wrote:
> Currently submitting multiple read/write for one file with offset = -1 will
> not behave as if calling read(2)/write(2) multiple times. The offset may be
> pinned to the same value for each submission (for example if they are
> punted to the async worker) and so each read/write will have the same
> offset.
> 
> This patch series fixes this.
> 
> Patches 1 and 3 clean up the code a bit
> 
> Patch 2 grabs the file position at execution time, rather than when the job
> is queued to run, which fixes inconsistencies when jobs are run asynchronously.
> 
> Patch 4 increments the file's f_pos when reading it, which fixes
> inconsistencies with concurrent runs.
> 
> A test for this will be submitted to liburing separately.

Looks good to me, but the patch 2 change will bubble through to patch 3
and 4 as well. Care to respin a v3?

-- 
Jens Axboe



* Re: [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write
  2022-02-21 16:33 ` [PATCH v2 0/4] io_uring: consistent behaviour with linked read/write Jens Axboe
@ 2022-02-21 17:48   ` Dylan Yudaken
  0 siblings, 0 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-21 17:48 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Kernel Team, linux-kernel

On Mon, 2022-02-21 at 09:33 -0700, Jens Axboe wrote:
> On 2/21/22 7:16 AM, Dylan Yudaken wrote:
> > Currently submitting multiple read/write for one file with offset =
> > -1 will
> > not behave as if calling read(2)/write(2) multiple times. The
> > offset may be
> > pinned to the same value for each submission (for example if they
> > are
> > punted to the async worker) and so each read/write will have the
> > same
> > offset.
> > 
> > This patch series fixes this.
> > 
> > Patches 1 and 3 clean up the code a bit
> > 
> > Patch 2 grabs the file position at execution time, rather than when
> > the job
> > is queued to run, which fixes inconsistencies when jobs are run
> > asynchronously.
> > 
> > Patch 4 increments the file's f_pos when reading it, which fixes
> > inconsistencies with concurrent runs.
> > 
> > A test for this will be submitted to liburing separately.
> 
> Looks good to me, but the patch 2 change will bubble through to patch
> 3
> and 4 as well. Care to respin a v3?
> 

Yes sure - will do it combined with the test v3


* Re: [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-21 14:16 ` [PATCH v2 4/4] io_uring: pre-increment f_pos on rw Dylan Yudaken
@ 2022-02-21 18:00   ` Pavel Begunkov
  2022-02-22  7:20     ` Hao Xu
  2022-02-22  8:26     ` Dylan Yudaken
  2022-02-22  7:34   ` Hao Xu
  1 sibling, 2 replies; 13+ messages in thread
From: Pavel Begunkov @ 2022-02-21 18:00 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe, io-uring; +Cc: linux-kernel, kernel-team

On 2/21/22 14:16, Dylan Yudaken wrote:
> In read/write ops, preincrement f_pos when no offset is specified, and
> then attempt to fix up the position after IO completes if it completed less
> than expected. This fixes the problem where multiple queued up IO will all
> obtain the same f_pos, and so perform the same read/write.
> 
> This is still not as consistent as sync r/w, as it is able to advance the
> file offset past the end of the file. It seems it would be quite a
> performance hit to work around this limitation - such as by keeping track
> of concurrent operations - and the downside does not seem to be too
> problematic.
> 
> The attempt to fix up the f_pos afterwards will at least mean that in
> situations where a single operation is run, the position will be consistent.
> 
> Co-developed-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> Signed-off-by: Dylan Yudaken <dylany@fb.com>
> ---
>   fs/io_uring.c | 81 ++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 68 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index abd8c739988e..a951d0754899 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3066,21 +3066,71 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)

[...]

> +			return false;
>   		}
>   	}
> -	return is_stream ? NULL : &kiocb->ki_pos;
> +	*ppos = is_stream ? NULL : &kiocb->ki_pos;
> +	return false;
> +}
> +
> +static inline void
> +io_kiocb_done_pos(struct io_kiocb *req, struct kiocb *kiocb, u64 actual)

That's a lot of inlining, I wouldn't be surprised if the compiler
will even refuse to do that.

io_kiocb_done_pos() {
	// rest of it
}

inline io_kiocb_done_pos() {
	if (!(flags & CUR_POS))
		return;
	__io_kiocb_done_pos();
}

io_kiocb_update_pos() is huge as well

> +{
> +	u64 expected;
> +
> +	if (likely(!(req->flags & REQ_F_CUR_POS)))
> +		return;
> +
> +	expected = req->rw.len;
> +	if (actual >= expected)
> +		return;
> +
> +	/*
> +	 * It's not definitely safe to lock here, and the assumption is,
> +	 * that if we cannot lock the position that it will be changing,
> +	 * and if it will be changing - then we can't update it anyway
> +	 */
> +	if (req->file->f_mode & FMODE_ATOMIC_POS
> +		&& !mutex_trylock(&req->file->f_pos_lock))
> +		return;
> +
> +	/*
> +	 * now we want to move the pointer, but only if everything is consistent
> +	 * with how we left it originally
> +	 */
> +	if (req->file->f_pos == kiocb->ki_pos + (expected - actual))
> +		req->file->f_pos = kiocb->ki_pos;

I wonder, is it good enough / safe to just assign it considering that
the request was executed outside of locks? vfs_seek()?

> +
> +	/* else something else messed with f_pos and we can't do anything */
> +
> +	if (req->file->f_mode & FMODE_ATOMIC_POS)
> +		mutex_unlock(&req->file->f_pos_lock);
>   }

Do we even care about races while reading it? E.g.
pos = READ_ONCE();

>   
> -	ppos = io_kiocb_update_pos(req, kiocb);
> -
>   	ret = rw_verify_area(READ, req->file, ppos, req->result);
>   	if (unlikely(ret)) {
>   		kfree(iovec);
> +		io_kiocb_done_pos(req, kiocb, 0);

Why do we update it on failure?

[...]

> -	ppos = io_kiocb_update_pos(req, kiocb);
> -
>   	ret = rw_verify_area(WRITE, req->file, ppos, req->result);
>   	if (unlikely(ret))
>   		goto out_free;
> @@ -3858,6 +3912,7 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
>   		return ret ?: -EAGAIN;
>   	}
>   out_free:
> +	io_kiocb_done_pos(req, kiocb, 0);

Looks weird. It appears we don't need it on failure and
successes are covered by kiocb_done() / ->ki_complete

>   	/* it's reportedly faster than delegating the null check to kfree() */
>   	if (iovec)
>   		kfree(iovec);

-- 
Pavel Begunkov


* Re: [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-21 18:00   ` Pavel Begunkov
@ 2022-02-22  7:20     ` Hao Xu
  2022-02-22  8:26     ` Dylan Yudaken
  1 sibling, 0 replies; 13+ messages in thread
From: Hao Xu @ 2022-02-22  7:20 UTC (permalink / raw)
  To: Pavel Begunkov, Dylan Yudaken, Jens Axboe, io-uring
  Cc: linux-kernel, kernel-team


On 2/22/22 02:00, Pavel Begunkov wrote:
> On 2/21/22 14:16, Dylan Yudaken wrote:
>> In read/write ops, preincrement f_pos when no offset is specified, and
>> then attempt to fix up the position after IO completes if it completed less
>> than expected. This fixes the problem where multiple queued up IO 
>> will all
>> obtain the same f_pos, and so perform the same read/write.
>>
>> This is still not as consistent as sync r/w, as it is able to advance 
>> the
>> file offset past the end of the file. It seems it would be quite a
>> performance hit to work around this limitation - such as by keeping 
>> track
>> of concurrent operations - and the downside does not seem to be too
>> problematic.
>>
>> The attempt to fix up the f_pos afterwards will at least mean that in
>> situations where a single operation is run, the position will be
>> consistent.
>>
>> Co-developed-by: Jens Axboe <axboe@kernel.dk>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> Signed-off-by: Dylan Yudaken <dylany@fb.com>
>> ---
>>   fs/io_uring.c | 81 ++++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 68 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index abd8c739988e..a951d0754899 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -3066,21 +3066,71 @@ static inline void io_rw_done(struct kiocb 
>> *kiocb, ssize_t ret)
>
> [...]
>
>> +            return false;
>>           }
>>       }
>> -    return is_stream ? NULL : &kiocb->ki_pos;
>> +    *ppos = is_stream ? NULL : &kiocb->ki_pos;
>> +    return false;
>> +}
>> +
>> +static inline void
>> +io_kiocb_done_pos(struct io_kiocb *req, struct kiocb *kiocb, u64 
>> actual)
>
> That's a lot of inlining, I wouldn't be surprised if the compiler
> will even refuse to do that.
>
> io_kiocb_done_pos() {
>     // rest of it
> }
>
> inline io_kiocb_done_pos() {
>     if (!(flags & CUR_POS))
>         return;
>     __io_kiocb_done_pos();
> }
>
> io_kiocb_update_pos() is huge as well
>
>> +{
>> +    u64 expected;
>> +
>> +    if (likely(!(req->flags & REQ_F_CUR_POS)))
>> +        return;
>> +
>> +    expected = req->rw.len;
>> +    if (actual >= expected)
>> +        return;
>> +
>> +    /*
>> +     * It's not definitely safe to lock here, and the assumption is,
>> +     * that if we cannot lock the position that it will be changing,
>> +     * and if it will be changing - then we can't update it anyway
>> +     */
>> +    if (req->file->f_mode & FMODE_ATOMIC_POS
>> +        && !mutex_trylock(&req->file->f_pos_lock))
>> +        return;
>> +
>> +    /*
>> +     * now we want to move the pointer, but only if everything is 
>> consistent
>> +     * with how we left it originally
>> +     */
>> +    if (req->file->f_pos == kiocb->ki_pos + (expected - actual))
>> +        req->file->f_pos = kiocb->ki_pos;
>
> I wonder, is it good enough / safe to just assign it considering that
> the request was executed outside of locks? vfs_seek()?
>
>> +
>> +    /* else something else messed with f_pos and we can't do 
>> anything */
>> +
>> +    if (req->file->f_mode & FMODE_ATOMIC_POS)
>> +        mutex_unlock(&req->file->f_pos_lock);
>>   }
>
> Do we even care about races while reading it? E.g.
> pos = READ_ONCE();
>
>>   -    ppos = io_kiocb_update_pos(req, kiocb);
>> -
>>       ret = rw_verify_area(READ, req->file, ppos, req->result);
>>       if (unlikely(ret)) {
>>           kfree(iovec);
>> +        io_kiocb_done_pos(req, kiocb, 0);
>
> Why do we update it on failure?
It seems to be a fallback: if nothing else changed the position, it rolls
file->f_pos back to the original place.
>
> [...]
>
>> -    ppos = io_kiocb_update_pos(req, kiocb);
>> -
>>       ret = rw_verify_area(WRITE, req->file, ppos, req->result);
>>       if (unlikely(ret))
>>           goto out_free;
>> @@ -3858,6 +3912,7 @@ static int io_write(struct io_kiocb *req, 
>> unsigned int issue_flags)
>>           return ret ?: -EAGAIN;
>>       }
>>   out_free:
>> +    io_kiocb_done_pos(req, kiocb, 0);
>
> Looks weird. It appears we don't need it on failure and
> successes are covered by kiocb_done() / ->ki_complete
>
>>       /* it's reportedly faster than delegating the null check to 
>> kfree() */
>>       if (iovec)
>>           kfree(iovec);
>


* Re: [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-21 14:16 ` [PATCH v2 4/4] io_uring: pre-increment f_pos on rw Dylan Yudaken
  2022-02-21 18:00   ` Pavel Begunkov
@ 2022-02-22  7:34   ` Hao Xu
  2022-02-22 10:52     ` Dylan Yudaken
  1 sibling, 1 reply; 13+ messages in thread
From: Hao Xu @ 2022-02-22  7:34 UTC (permalink / raw)
  To: Dylan Yudaken, Jens Axboe, Pavel Begunkov, io-uring
  Cc: linux-kernel, kernel-team


On 2/21/22 22:16, Dylan Yudaken wrote:
> In read/write ops, preincrement f_pos when no offset is specified, and
> then attempt to fix up the position after IO completes if it completed less
> than expected. This fixes the problem where multiple queued up IO will all
> obtain the same f_pos, and so perform the same read/write.
>
> This is still not as consistent as sync r/w, as it is able to advance the
> file offset past the end of the file. It seems it would be quite a
> performance hit to work around this limitation - such as by keeping track
> of concurrent operations - and the downside does not seem to be too
> problematic.
>
> The attempt to fix up the f_pos afterwards will at least mean that in
> situations where a single operation is run, the position will be
> consistent.
>
It's a little bit weird when a read req returns x bytes read while f_pos
moves ahead y bytes, where x isn't equal to y. Don't know if this causes
problems..



* Re: [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-21 18:00   ` Pavel Begunkov
  2022-02-22  7:20     ` Hao Xu
@ 2022-02-22  8:26     ` Dylan Yudaken
  1 sibling, 0 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-22  8:26 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Kernel Team, linux-kernel

On Mon, 2022-02-21 at 18:00 +0000, Pavel Begunkov wrote:
> On 2/21/22 14:16, Dylan Yudaken wrote:
> > In read/write ops, preincrement f_pos when no offset is specified,
> > and
> > then attempt to fix up the position after IO completes if it completed
> > less
> > than expected. This fixes the problem where multiple queued up IO
> > will all
> > obtain the same f_pos, and so perform the same read/write.
> > 
> > This is still not as consistent as sync r/w, as it is able to
> > advance the
> > file offset past the end of the file. It seems it would be quite a
> > performance hit to work around this limitation - such as by keeping
> > track
> > of concurrent operations - and the downside does not seem to be too
> > problematic.
> > 
> > The attempt to fix up the f_pos afterwards will at least mean that in
> > situations where a single operation is run, the position will be
> > consistent.
> > 
> > Co-developed-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > Signed-off-by: Dylan Yudaken <dylany@fb.com>
> > ---
> >   fs/io_uring.c | 81 ++++++++++++++++++++++++++++++++++++++++++----
> > -----
> >   1 file changed, 68 insertions(+), 13 deletions(-)
> > 
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index abd8c739988e..a951d0754899 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -3066,21 +3066,71 @@ static inline void io_rw_done(struct kiocb
> > *kiocb, ssize_t ret)
> 
> [...]
> 
> > +                       return false;
> >                 }
> >         }
> > -       return is_stream ? NULL : &kiocb->ki_pos;
> > +       *ppos = is_stream ? NULL : &kiocb->ki_pos;
> > +       return false;
> > +}
> > +
> > +static inline void
> > +io_kiocb_done_pos(struct io_kiocb *req, struct kiocb *kiocb, u64
> > actual)
> 
> That's a lot of inlining, I wouldn't be surprised if the compiler
> will even refuse to do that.
> 
> io_kiocb_done_pos() {
>         // rest of it
> }
> 
> inline io_kiocb_done_pos() {
>         if (!(flags & CUR_POS));
>                 return;
>         __io_kiocb_done_pos();
> }
> 
> io_kiocb_update_pos() is huge as well

Good idea, will split the slower paths out.

> 
> > +{
> > +       u64 expected;
> > +
> > +       if (likely(!(req->flags & REQ_F_CUR_POS)))
> > +               return;
> > +
> > +       expected = req->rw.len;
> > +       if (actual >= expected)
> > +               return;
> > +
> > +       /*
> > +        * It's not definitely safe to lock here, and the
> > assumption is,
> > +        * that if we cannot lock the position that it will be
> > changing,
> > +        * and if it will be changing - then we can't update it
> > anyway
> > +        */
> > +       if (req->file->f_mode & FMODE_ATOMIC_POS
> > +               && !mutex_trylock(&req->file->f_pos_lock))
> > +               return;
> > +
> > +       /*
> > +        * now we want to move the pointer, but only if everything
> > is consistent
> > +        * with how we left it originally
> > +        */
> > +       if (req->file->f_pos == kiocb->ki_pos + (expected -
> > actual))
> > +               req->file->f_pos = kiocb->ki_pos;
> 
> I wonder, is it good enough / safe to just assign it considering that
> the request was executed outside of locks? vfs_seek()?

No, I do not think so - in the case of multiple r/w the same thing will
happen, even with no vfs_seek().

> 
> > +
> > +       /* else something else messed with f_pos and we can't do
> > anything */
> > +
> > +       if (req->file->f_mode & FMODE_ATOMIC_POS)
> > +               mutex_unlock(&req->file->f_pos_lock);
> >   }
> 
> Do we even care about races while reading it? E.g.
> pos = READ_ONCE();

I think so - if I remove all the locks the test cases fail.

> 
> >   
> > -       ppos = io_kiocb_update_pos(req, kiocb);
> > -
> >         ret = rw_verify_area(READ, req->file, ppos, req->result);
> >         if (unlikely(ret)) {
> >                 kfree(iovec);
> > +               io_kiocb_done_pos(req, kiocb, 0);
> 
> Why do we update it on failure?
> 
> [...]
> 
> > -       ppos = io_kiocb_update_pos(req, kiocb);
> > -
> >         ret = rw_verify_area(WRITE, req->file, ppos, req->result);
> >         if (unlikely(ret))
> >                 goto out_free;
> > @@ -3858,6 +3912,7 @@ static int io_write(struct io_kiocb *req,
> > unsigned int issue_flags)
> >                 return ret ?: -EAGAIN;
> >         }
> >   out_free:
> > +       io_kiocb_done_pos(req, kiocb, 0);
> 
> Looks weird. It appears we don't need it on failure and
> successes are covered by kiocb_done() / ->ki_complete
> 
> >         /* it's reportedly faster than delegating the null check to
> > kfree() */
> >         if (iovec)
> >                 kfree(iovec);
> 



* Re: [PATCH v2 4/4] io_uring: pre-increment f_pos on rw
  2022-02-22  7:34   ` Hao Xu
@ 2022-02-22 10:52     ` Dylan Yudaken
  0 siblings, 0 replies; 13+ messages in thread
From: Dylan Yudaken @ 2022-02-22 10:52 UTC (permalink / raw)
  To: haoxu, axboe, asml.silence, io-uring; +Cc: Kernel Team, linux-kernel

On Tue, 2022-02-22 at 15:34 +0800, Hao Xu wrote:
> 
> On 2/21/22 22:16, Dylan Yudaken wrote:
> > In read/write ops, preincrement f_pos when no offset is specified,
> > and
> > then attempt to fix up the position after IO completes if it completed
> > less
> > than expected. This fixes the problem where multiple queued up IO
> > will all
> > obtain the same f_pos, and so perform the same read/write.
> > 
> > This is still not as consistent as sync r/w, as it is able to
> > advance the
> > file offset past the end of the file. It seems it would be quite a
> > performance hit to work around this limitation - such as by keeping
> > track
> > of concurrent operations - and the downside does not seem to be too
> > problematic.
> > 
> > The attempt to fix up the f_pos afterwards will at least mean that in
> > situations where a single operation is run, the position will be
> > consistent.
> > 
> It's a little bit weird when a read req returns x bytes read while f_pos
> moves ahead y bytes, where x isn't equal to y. Don't know if this causes
> problems..
> 

It seems to be ok - as in nothing crashes when f_pos is past the end of
the file - but I really am not an expert on these things so am happy to
receive feedback on this. 

