IO-Uring Archive on lore.kernel.org
* (no subject)
@ 2020-06-18 14:43 Jens Axboe
  2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
                   ` (15 more replies)
  0 siblings, 16 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

We technically support this already through io_uring, but it's
implemented with a thread backend to support cases where we would
block. This isn't ideal.

After a few prep patches, the core of this patchset is adding support
for async callbacks on page unlock. With this primitive, we can simply
retry the IO operation. With io_uring, this works a lot like poll based
retry for files that support it. If a page is currently locked and
needed, -EIOCBQUEUED is returned with a callback armed. The caller's
callback is responsible for restarting the operation.

With this callback primitive, we can add support for
generic_file_buffered_read(), which is what most file systems end up
using for buffered reads. XFS/ext4/btrfs/bdev are wired up, and it
should be trivial to add more.

A file signals support for this by setting FMODE_BUF_RASYNC, similar
to what we do with FMODE_NOWAIT. Open to suggestions on whether this is
the preferred method or not.

In terms of results, I wrote a small test app that randomly reads 4G
of data in 4K chunks from a file hosted by ext4. The app uses a queue
depth of 32. If you want to test yourself, you can just use buffered=1
with ioengine=io_uring with fio. No application changes are needed to
use the more optimized buffered async read.

preadv for comparison:
	real    1m13.821s
	user    0m0.558s
	sys     0m11.125s
	CPU	~13%

Mainline:
	real    0m12.054s
	user    0m0.111s
	sys     0m5.659s
	CPU	~32% + ~50% == ~82%

This patchset:
	real    0m9.283s
	user    0m0.147s
	sys     0m4.619s
	CPU	~52%

The CPU numbers are just a rough estimate. For the mainline io_uring
run, this includes the app itself and all the threads doing IO on its
behalf (32% for the app, ~1.6% per worker and 32 of them). Context
switch rate is much smaller with the patchset, since we only have the
one task performing IO.

Also ran a simple fio based test case, varying the queue depth from 1
to 16, doubling every time:

[buf-test]
filename=/data/file
direct=0
ioengine=io_uring
norandommap
rw=randread
bs=4k
iodepth=${QD}
randseed=89
runtime=10s

QD/Test		Patchset IOPS		Mainline IOPS
1		9046			8294
2		19.8k			18.9k
4		39.2k			28.5k
8		64.4k			31.4k
16		65.7k			37.8k

Outside of my usual environment, so this is just running on a virtualized
NVMe device in qemu, using ext4 as the file system. NVMe isn't very
efficient virtualized, so we run out of steam at ~65K which is why we
flatline on the patched side (nvme_submit_cmd() eats ~75% of the test app
CPU). Before that happens, it's a linear increase. Not shown is context
switch rate, which is massively lower with the new code. The old thread
offload adds a blocking thread per pending IO, so context rate quickly
goes through the roof.

The goal here is efficiency. Async thread offload adds latency, and
it also adds noticeable overhead on items such as adding pages to the
page cache. By allowing proper async buffered read support, we don't
have X threads hammering on the same inode page cache, we have just
the single app actually doing IO.

Been beating on this and it's solid for me, and I'm now pretty happy
with how it all turned out. Not aware of any missing bits/pieces or
code cleanups that need doing.

Series can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.7

or pull from:

git://git.kernel.dk/linux-block async-buffered.7

 block/blk-core.c        |   6 +
 fs/block_dev.c          |   2 +-
 fs/btrfs/file.c         |   2 +-
 fs/ext4/file.c          |   2 +-
 fs/io_uring.c           | 336 ++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_file.c       |   2 +-
 include/linux/blkdev.h  |   1 +
 include/linux/fs.h      |  10 +-
 include/linux/pagemap.h |  75 +++++++++
 mm/filemap.c            | 110 ++++++++-----
 10 files changed, 454 insertions(+), 92 deletions(-)

Changes since v6:
- Properly catch and resubmit requests that would have blocked in
  submission.
- Fix sqthread mm usage for async buffered retry
- Fix a retry condition iter setup
- Rebase on master + for-5.9/io_uring
Changes since v5:
- Correct commit message, iocb->private -> iocb->ki_waitq
- Get rid of io_uring goto, use an iter read helper
Changes since v3:
- io_uring: don't retry if REQ_F_NOWAIT is set
- io_uring: alloc req->io if the request type didn't already allocate it
- Add iocb->ki_waitq instead of (ab)using iocb->private
Changes since v2:
- Get rid of unnecessary wait_page_async struct, just use wait_page_queue
- Add another prep handler, adding wake_page_match()
- Use wake_page_match() in both callers
Changes since v1:
- Fix an issue with inline page locking
- Fix a potential race with __wait_on_page_locked_async()
- Fix a hang related to not setting page_match, thus missing a wakeup

-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics
  2020-06-18 14:43 Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 02/15] io_uring: always plug for any number of IOs Jens Axboe
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/blk-core.c       | 6 ++++++
 include/linux/blkdev.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03252af8c82c..62a4904db921 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -958,6 +958,7 @@ generic_make_request_checks(struct bio *bio)
 	struct request_queue *q;
 	int nr_sectors = bio_sectors(bio);
 	blk_status_t status = BLK_STS_IOERR;
+	struct blk_plug *plug;
 	char b[BDEVNAME_SIZE];
 
 	might_sleep();
@@ -971,6 +972,10 @@ generic_make_request_checks(struct bio *bio)
 		goto end_io;
 	}
 
+	plug = blk_mq_plug(q, bio);
+	if (plug && plug->nowait)
+		bio->bi_opf |= REQ_NOWAIT;
+
 	/*
 	 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 	 * if queue is not a request based queue.
@@ -1800,6 +1805,7 @@ void blk_start_plug(struct blk_plug *plug)
 	INIT_LIST_HEAD(&plug->cb_list);
 	plug->rq_count = 0;
 	plug->multiple_queues = false;
+	plug->nowait = false;
 
 	/*
 	 * Store ordering should not be needed here, since a potential
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd900998b4e..6e067dca94cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1189,6 +1189,7 @@ struct blk_plug {
 	struct list_head cb_list; /* md requires an unplug callback */
 	unsigned short rq_count;
 	bool multiple_queues;
+	bool nowait;
 };
 #define BLK_MAX_REQUEST_COUNT 16
 #define BLK_PLUG_FLUSH_SIZE (128 * 1024)
-- 
2.27.0



* [PATCH 02/15] io_uring: always plug for any number of IOs
  2020-06-18 14:43 Jens Axboe
  2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure Jens Axboe
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

Currently we only plug if we're doing more than two requests. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b14a8e6a0e15..ca78dd7c79da 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -676,7 +676,6 @@ struct io_kiocb {
 	};
 };
 
-#define IO_PLUG_THRESHOLD		2
 #define IO_IOPOLL_BATCH			8
 
 struct io_submit_state {
@@ -5914,7 +5913,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			  struct file *ring_file, int ring_fd)
 {
-	struct io_submit_state state, *statep = NULL;
+	struct io_submit_state state;
 	struct io_kiocb *link = NULL;
 	int i, submitted = 0;
 
@@ -5931,10 +5930,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	if (!percpu_ref_tryget_many(&ctx->refs, nr))
 		return -EAGAIN;
 
-	if (nr > IO_PLUG_THRESHOLD) {
-		io_submit_state_start(&state, nr);
-		statep = &state;
-	}
+	io_submit_state_start(&state, nr);
 
 	ctx->ring_fd = ring_fd;
 	ctx->ring_file = ring_file;
@@ -5949,14 +5945,14 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 			io_consume_sqe(ctx);
 			break;
 		}
-		req = io_alloc_req(ctx, statep);
+		req = io_alloc_req(ctx, &state);
 		if (unlikely(!req)) {
 			if (!submitted)
 				submitted = -EAGAIN;
 			break;
 		}
 
-		err = io_init_req(ctx, req, sqe, statep);
+		err = io_init_req(ctx, req, sqe, &state);
 		io_consume_sqe(ctx);
 		/* will complete beyond this point, count as submitted */
 		submitted++;
@@ -5982,8 +5978,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
 	}
 	if (link)
 		io_queue_link_head(link);
-	if (statep)
-		io_submit_state_end(&state);
+	io_submit_state_end(&state);
 
 	 /* Commit SQ ring head once we've consumed and submitted all SQEs */
 	io_commit_sqring(ctx);
-- 
2.27.0



* [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure
  2020-06-18 14:43 Jens Axboe
  2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
  2020-06-18 14:43 ` [PATCH 02/15] io_uring: always plug for any number of IOs Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.

Catch some of these early for block based files, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca78dd7c79da..2e257c5a1866 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2088,6 +2088,15 @@ static struct file *__io_file_get(struct io_submit_state *state, int fd)
 	return state->file;
 }
 
+static bool io_bdev_nowait(struct block_device *bdev)
+{
+#ifdef CONFIG_BLOCK
+	return !bdev || queue_is_mq(bdev_get_queue(bdev));
+#else
+	return true;
+#endif
+}
+
 /*
  * If we tracked the file through the SCM inflight mechanism, we could support
  * any file. For now, just ensure that anything potentially problematic is done
@@ -2097,10 +2106,19 @@ static bool io_file_supports_async(struct file *file, int rw)
 {
 	umode_t mode = file_inode(file)->i_mode;
 
-	if (S_ISBLK(mode) || S_ISCHR(mode) || S_ISSOCK(mode))
-		return true;
-	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
+	if (S_ISBLK(mode)) {
+		if (io_bdev_nowait(file->f_inode->i_bdev))
+			return true;
+		return false;
+	}
+	if (S_ISCHR(mode) || S_ISSOCK(mode))
 		return true;
+	if (S_ISREG(mode)) {
+		if (io_bdev_nowait(file->f_inode->i_sb->s_bdev) &&
+		    file->f_op != &io_uring_fops)
+			return true;
+		return false;
+	}
 
 	/* any ->read/write should understand O_NONBLOCK */
 	if (file->f_flags & O_NONBLOCK)
@@ -2650,7 +2668,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	iov_count = iov_iter_count(&iter);
 	ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
-		ssize_t ret2;
+		ssize_t ret2 = 0;
 
 		if (req->file->f_op->read_iter)
 			ret2 = call_read_iter(req->file, kiocb, &iter);
@@ -2658,7 +2676,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
 
 		/* Catch -EAGAIN return for forced non-blocking submission */
-		if (!force_nonblock || ret2 != -EAGAIN) {
+		if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
 			kiocb_done(kiocb, ret2);
 		} else {
 copy_iov:
-- 
2.27.0



* [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
  2020-06-18 14:43 Jens Axboe
                   ` (2 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-19 14:12   ` Pavel Begunkov
  2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If a request would have blocked, we catch
it and reissue it from a task_work based handler.

Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB chunks if that's the max size for
the device. The first split issues just fine, but we run into -EAGAIN for
some later splits of the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads completes, and hence
we cannot handle it inline as part of submission.

This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 124 insertions(+), 24 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e257c5a1866..40413fb9d07b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
 static void __io_queue_sqe(struct io_kiocb *req,
 			   const struct io_uring_sqe *sqe);
 
+static ssize_t io_import_iovec(int rw, struct io_kiocb *req,
+			       struct iovec **iovec, struct iov_iter *iter,
+			       bool needs_lock);
+static int io_setup_async_rw(struct io_kiocb *req, ssize_t io_size,
+			     struct iovec *iovec, struct iovec *fast_iov,
+			     struct iov_iter *iter);
+
 static struct kmem_cache *req_cachep;
 
 static const struct file_operations io_uring_fops;
@@ -1978,12 +1985,115 @@ static void io_complete_rw_common(struct kiocb *kiocb, long res)
 	__io_cqring_add_event(req, res, cflags);
 }
 
+static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (mm) {
+		kthread_unuse_mm(mm);
+		mmput(mm);
+	}
+}
+
+static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx,
+				   struct io_kiocb *req)
+{
+	if (io_op_defs[req->opcode].needs_mm && !current->mm) {
+		if (unlikely(!mmget_not_zero(ctx->sqo_mm)))
+			return -EFAULT;
+		kthread_use_mm(ctx->sqo_mm);
+	}
+
+	return 0;
+}
+
+#ifdef CONFIG_BLOCK
+static bool io_resubmit_prep(struct io_kiocb *req, int error)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	ssize_t ret = -ECANCELED;
+	struct iov_iter iter;
+	int rw;
+
+	if (error) {
+		ret = error;
+		goto end_req;
+	}
+
+	switch (req->opcode) {
+	case IORING_OP_READV:
+	case IORING_OP_READ_FIXED:
+	case IORING_OP_READ:
+		rw = READ;
+		break;
+	case IORING_OP_WRITEV:
+	case IORING_OP_WRITE_FIXED:
+	case IORING_OP_WRITE:
+		rw = WRITE;
+		break;
+	default:
+		printk_once(KERN_WARNING "io_uring: bad opcode in resubmit %d\n",
+				req->opcode);
+		goto end_req;
+	}
+
+	ret = io_import_iovec(rw, req, &iovec, &iter, false);
+	if (ret < 0)
+		goto end_req;
+	ret = io_setup_async_rw(req, ret, iovec, inline_vecs, &iter);
+	if (!ret)
+		return true;
+	kfree(iovec);
+end_req:
+	io_cqring_add_event(req, ret);
+	req_set_fail_links(req);
+	io_put_req(req);
+	return false;
+}
+
+static void io_rw_resubmit(struct callback_head *cb)
+{
+	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
+	struct io_ring_ctx *ctx = req->ctx;
+	int err;
+
+	__set_current_state(TASK_RUNNING);
+
+	err = io_sq_thread_acquire_mm(ctx, req);
+
+	if (io_resubmit_prep(req, err)) {
+		refcount_inc(&req->refs);
+		io_queue_async_work(req);
+	}
+}
+#endif
+
+static bool io_rw_reissue(struct io_kiocb *req, long res)
+{
+#ifdef CONFIG_BLOCK
+	struct task_struct *tsk;
+	int ret;
+
+	if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
+		return false;
+
+	tsk = req->task;
+	init_task_work(&req->task_work, io_rw_resubmit);
+	ret = task_work_add(tsk, &req->task_work, true);
+	if (!ret)
+		return true;
+#endif
+	return false;
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw.kiocb);
 
-	io_complete_rw_common(kiocb, res);
-	io_put_req(req);
+	if (!io_rw_reissue(req, res)) {
+		io_complete_rw_common(kiocb, res);
+		io_put_req(req);
+	}
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -2169,6 +2279,9 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (kiocb->ki_flags & IOCB_NOWAIT)
 		req->flags |= REQ_F_NOWAIT;
 
+	if (kiocb->ki_flags & IOCB_DIRECT)
+		io_get_req_task(req);
+
 	if (force_nonblock)
 		kiocb->ki_flags |= IOCB_NOWAIT;
 
@@ -2668,6 +2781,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	iov_count = iov_iter_count(&iter);
 	ret = rw_verify_area(READ, req->file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
+		unsigned long nr_segs = iter.nr_segs;
 		ssize_t ret2 = 0;
 
 		if (req->file->f_op->read_iter)
@@ -2679,6 +2793,8 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 		if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
 			kiocb_done(kiocb, ret2);
 		} else {
+			iter.count = iov_count;
+			iter.nr_segs = nr_segs;
 copy_iov:
 			ret = io_setup_async_rw(req, io_size, iovec,
 						inline_vecs, &iter);
@@ -2765,6 +2881,7 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
 	iov_count = iov_iter_count(&iter);
 	ret = rw_verify_area(WRITE, req->file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
+		unsigned long nr_segs = iter.nr_segs;
 		ssize_t ret2;
 
 		/*
@@ -2802,6 +2919,8 @@ static int io_write(struct io_kiocb *req, bool force_nonblock)
 		if (!force_nonblock || ret2 != -EAGAIN) {
 			kiocb_done(kiocb, ret2);
 		} else {
+			iter.count = iov_count;
+			iter.nr_segs = nr_segs;
 copy_iov:
 			ret = io_setup_async_rw(req, io_size, iovec,
 						inline_vecs, &iter);
@@ -4282,28 +4401,6 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
 	__io_queue_proc(&pt->req->apoll->poll, pt, head);
 }
 
-static void io_sq_thread_drop_mm(struct io_ring_ctx *ctx)
-{
-	struct mm_struct *mm = current->mm;
-
-	if (mm) {
-		kthread_unuse_mm(mm);
-		mmput(mm);
-	}
-}
-
-static int io_sq_thread_acquire_mm(struct io_ring_ctx *ctx,
-				   struct io_kiocb *req)
-{
-	if (io_op_defs[req->opcode].needs_mm && !current->mm) {
-		if (unlikely(!mmget_not_zero(ctx->sqo_mm)))
-			return -EFAULT;
-		kthread_use_mm(ctx->sqo_mm);
-	}
-
-	return 0;
-}
-
 static void io_async_task_func(struct callback_head *cb)
 {
 	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
@@ -5814,6 +5911,9 @@ static void io_submit_state_start(struct io_submit_state *state,
 				  unsigned int max_ios)
 {
 	blk_start_plug(&state->plug);
+#ifdef CONFIG_BLOCK
+	state->plug.nowait = true;
+#endif
 	state->free_reqs = 0;
 	state->file = NULL;
 	state->ios_left = max_ios;
-- 
2.27.0



* [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-18 14:43 Jens Axboe
                   ` (3 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-24  1:02   ` Dave Chinner
  2020-06-24  4:38   ` Dave Chinner
  2020-06-18 14:43 ` [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Johannes Weiner

The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f0ae9a6308cb..3378d4fca883 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2028,8 +2028,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		page = find_get_page(mapping, index);
 		if (!page) {
-			if (iocb->ki_flags & IOCB_NOWAIT)
-				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-- 
2.27.0



* [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function()
  2020-06-18 14:43 Jens Axboe
                   ` (4 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Johannes Weiner

No functional changes in this patch, just in preparation for allowing
more callers.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 37 +++++++++++++++++++++++++++++++++++++
 mm/filemap.c            | 35 ++++-------------------------------
 2 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cf2468da68e9..2f18221bb5c8 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -496,6 +496,43 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 	return pgoff;
 }
 
+/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
+struct wait_page_key {
+	struct page *page;
+	int bit_nr;
+	int page_match;
+};
+
+struct wait_page_queue {
+	struct page *page;
+	int bit_nr;
+	wait_queue_entry_t wait;
+};
+
+static inline int wake_page_match(struct wait_page_queue *wait_page,
+				  struct wait_page_key *key)
+{
+	if (wait_page->page != key->page)
+	       return 0;
+	key->page_match = 1;
+
+	if (wait_page->bit_nr != key->bit_nr)
+		return 0;
+
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
+	if (test_bit(key->bit_nr, &key->page->flags))
+		return -1;
+
+	return 1;
+}
+
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/filemap.c b/mm/filemap.c
index 3378d4fca883..c3175dbd8fba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -987,43 +987,16 @@ void __init pagecache_init(void)
 	page_writeback_init();
 }
 
-/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
-struct wait_page_key {
-	struct page *page;
-	int bit_nr;
-	int page_match;
-};
-
-struct wait_page_queue {
-	struct page *page;
-	int bit_nr;
-	wait_queue_entry_t wait;
-};
-
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
 		= container_of(wait, struct wait_page_queue, wait);
+	int ret;
 
-	if (wait_page->page != key->page)
-	       return 0;
-	key->page_match = 1;
-
-	if (wait_page->bit_nr != key->bit_nr)
-		return 0;
-
-	/*
-	 * Stop walking if it's locked.
-	 * Is this safe if put_and_wait_on_page_locked() is in use?
-	 * Yes: the waker must hold a reference to this page, and if PG_locked
-	 * has now already been set by another task, that task must also hold
-	 * a reference to the *same usage* of this page; so there is no need
-	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
-	 */
-	if (test_bit(key->bit_nr, &key->page->flags))
-		return -1;
-
+	ret = wake_page_match(wait_page, key);
+	if (ret != 1)
+		return ret;
 	return autoremove_wake_function(wait, mode, sync, key);
 }
 
-- 
2.27.0



* [PATCH 07/15] mm: add support for async page locking
  2020-06-18 14:43 Jens Axboe
                   ` (5 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-07-07 11:32   ` Andreas Grünbacher
  2020-06-18 14:43 ` [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Johannes Weiner

Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, since
polled IO is (by definition) completed by polling rather than through a
callback. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  7 ++++++-
 include/linux/pagemap.h | 17 ++++++++++++++++
 mm/filemap.c            | 45 ++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6c4ab4dc1cd7..6ac919b40596 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -315,6 +315,8 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+/* iocb->ki_waitq is valid */
+#define IOCB_WAITQ		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
@@ -328,7 +330,10 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	unsigned int		ki_cookie; /* for ->iopoll */
+	union {
+		unsigned int		ki_cookie; /* for ->iopoll */
+		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	};
 
 	randomized_struct_fields_end
 };
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2f18221bb5c8..e053e1d9a4d7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -535,6 +535,7 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
 
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
+extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
@@ -571,6 +572,22 @@ static inline int lock_page_killable(struct page *page)
 	return 0;
 }
 
+/*
+ * lock_page_async - Lock the page, unless this would block. If the page
+ * is already locked, then queue a callback when the page becomes unlocked.
+ * This callback can then retry the operation.
+ *
+ * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
+ * was already locked and the callback defined in 'wait' was queued.
+ */
+static inline int lock_page_async(struct page *page,
+				  struct wait_page_queue *wait)
+{
+	if (!trylock_page(page))
+		return __lock_page_async(page, wait);
+	return 0;
+}
+
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index c3175dbd8fba..e8aaf43bee9f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1180,6 +1180,36 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
+static int __wait_on_page_locked_async(struct page *page,
+				       struct wait_page_queue *wait, bool set)
+{
+	struct wait_queue_head *q = page_waitqueue(page);
+	int ret = 0;
+
+	wait->page = page;
+	wait->bit_nr = PG_locked;
+
+	spin_lock_irq(&q->lock);
+	__add_wait_queue_entry_tail(q, &wait->wait);
+	SetPageWaiters(page);
+	if (set)
+		ret = !trylock_page(page);
+	else
+		ret = PageLocked(page);
+	/*
+	 * If we were succesful now, we know we're still on the
+	 * waitqueue as we're still under the lock. This means it's
+	 * safe to remove and return success, we know the callback
+	 * isn't going to trigger.
+	 */
+	if (!ret)
+		__remove_wait_queue(q, &wait->wait);
+	else
+		ret = -EIOCBQUEUED;
+	spin_unlock_irq(&q->lock);
+	return ret;
+}
+
 /**
  * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
  * @page: The page to wait for.
@@ -1342,6 +1372,11 @@ int __lock_page_killable(struct page *__page)
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+{
+	return __wait_on_page_locked_async(page, wait, true);
+}
+
 /*
  * Return values:
  * 1 - page is locked; mmap_lock is still held.
@@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		}
 
 readpage:
+		if (iocb->ki_flags & IOCB_NOWAIT) {
+			unlock_page(page);
+			put_page(page);
+			goto would_block;
+		}
 		/*
 		 * A previous I/O error may have been due to temporary
 		 * failures, eg. multipath errors.
@@ -2150,7 +2190,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		}
 
 		if (!PageUptodate(page)) {
-			error = lock_page_killable(page);
+			if (iocb->ki_flags & IOCB_WAITQ)
+				error = lock_page_async(page, iocb->ki_waitq);
+			else
+				error = lock_page_killable(page);
 			if (unlikely(error))
 				goto readpage_error;
 			if (!PageUptodate(page)) {
-- 
2.27.0



* [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read()
  2020-06-18 14:43 Jens Axboe
                   ` (6 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 09/15] fs: add FMODE_BUF_RASYNC Jens Axboe
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Johannes Weiner

Use the async page-locking infrastructure if IOCB_WAITQ is set in the
passed-in kiocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO has been started but is not yet done. This is
similar to how O_DIRECT signals the same condition. Once the caller
receives the IO-completion callback, it must retry the operation.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e8aaf43bee9f..a5b1fa8f7ce4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1210,6 +1210,14 @@ static int __wait_on_page_locked_async(struct page *page,
 	return ret;
 }
 
+static int wait_on_page_locked_async(struct page *page,
+				     struct wait_page_queue *wait)
+{
+	if (!PageLocked(page))
+		return 0;
+	return __wait_on_page_locked_async(compound_head(page), wait, false);
+}
+
 /**
  * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
  * @page: The page to wait for.
@@ -2049,17 +2057,25 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
-			if (iocb->ki_flags & IOCB_NOWAIT) {
-				put_page(page);
-				goto would_block;
-			}
-
 			/*
 			 * See comment in do_read_cache_page on why
 			 * wait_on_page_locked is used to avoid unnecessarily
 			 * serialisations and why it's safe.
 			 */
-			error = wait_on_page_locked_killable(page);
+			if (iocb->ki_flags & IOCB_WAITQ) {
+				if (written) {
+					put_page(page);
+					goto out;
+				}
+				error = wait_on_page_locked_async(page,
+								iocb->ki_waitq);
+			} else {
+				if (iocb->ki_flags & IOCB_NOWAIT) {
+					put_page(page);
+					goto would_block;
+				}
+				error = wait_on_page_locked_killable(page);
+			}
 			if (unlikely(error))
 				goto readpage_error;
 			if (PageUptodate(page))
@@ -2147,7 +2163,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 page_not_up_to_date:
 		/* Get exclusive access to the page ... */
-		error = lock_page_killable(page);
+		if (iocb->ki_flags & IOCB_WAITQ)
+			error = lock_page_async(page, iocb->ki_waitq);
+		else
+			error = lock_page_killable(page);
 		if (unlikely(error))
 			goto readpage_error;
 
@@ -2190,10 +2209,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		}
 
 		if (!PageUptodate(page)) {
-			if (iocb->ki_flags & IOCB_WAITQ)
-				error = lock_page_async(page, iocb->ki_waitq);
-			else
-				error = lock_page_killable(page);
+			error = lock_page_killable(page);
 			if (unlikely(error))
 				goto readpage_error;
 			if (!PageUptodate(page)) {
-- 
2.27.0



* [PATCH 09/15] fs: add FMODE_BUF_RASYNC
  2020-06-18 14:43 Jens Axboe
                   ` (7 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6ac919b40596..3f9de90e0266 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports async buffered reads */
+#define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
-- 
2.27.0



* [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ
  2020-06-18 14:43 Jens Axboe
                   ` (8 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 09/15] fs: add FMODE_BUF_RASYNC Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 11/15] xfs: flag files as supporting buffered async reads Jens Axboe
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 47860e589388..54720c90dad0 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1850,7 +1850,7 @@ static int blkdev_open(struct inode * inode, struct file * filp)
 	 */
 	filp->f_flags |= O_LARGEFILE;
 
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 
 	if (filp->f_flags & O_NDELAY)
 		filp->f_mode |= FMODE_NDELAY;
-- 
2.27.0



* [PATCH 11/15] xfs: flag files as supporting buffered async reads
  2020-06-18 14:43 Jens Axboe
                   ` (9 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
	Darrick J . Wong

XFS uses generic_file_read_iter(), which already supports this.

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d..fdbff4860d61 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1080,7 +1080,7 @@ xfs_file_open(
 		return -EFBIG;
 	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
 		return -EIO;
-	file->f_mode |= FMODE_NOWAIT;
+	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return 0;
 }
 
-- 
2.27.0



* [PATCH 12/15] btrfs: flag files as supporting buffered async reads
  2020-06-18 14:43 Jens Axboe
                   ` (10 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 11/15] xfs: flag files as supporting buffered async reads Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-19 11:11   ` David Sterba
  2020-06-18 14:43 ` [PATCH 13/15] ext4: flag " Jens Axboe
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Chris Mason

btrfs uses generic_file_read_iter(), which already supports this.

Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2c14312b05e8..234a418eb6da 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3472,7 +3472,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return generic_file_open(inode, filp);
 }
 
-- 
2.27.0



* [PATCH 13/15] ext4: flag as supporting buffered async reads
  2020-06-18 14:43 Jens Axboe
                   ` (11 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
	Theodore Ts'o

ext4 uses generic_file_read_iter(), which already supports this.

Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..1e827410e9e1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -839,7 +839,7 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
 			return ret;
 	}
 
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return dquot_file_open(inode, filp);
 }
 
-- 
2.27.0



* [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper
  2020-06-18 14:43 Jens Axboe
                   ` (12 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 13/15] ext4: flag " Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
  2020-06-18 14:45 ` [PATCHSET v7 0/12] Add support for async buffered reads Jens Axboe
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring
  Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe, Johannes Weiner

Check whether the file supports IOCB_WAITQ for buffered reads, and
initialize the wait_page_queue values we need. The caller passes in the
'data' pointer, if any, and the callback function to be used.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e053e1d9a4d7..7386bc67cc5a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -533,6 +533,27 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
 	return 1;
 }
 
+static inline int kiocb_wait_page_queue_init(struct kiocb *kiocb,
+					     struct wait_page_queue *wait,
+					     wait_queue_func_t func,
+					     void *data)
+{
+	/* Can't support async wakeup with polled IO */
+	if (kiocb->ki_flags & IOCB_HIPRI)
+		return -EINVAL;
+	if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
+		wait->wait.func = func;
+		wait->wait.private = data;
+		wait->wait.flags = 0;
+		INIT_LIST_HEAD(&wait->wait.entry);
+		kiocb->ki_flags |= IOCB_WAITQ;
+		kiocb->ki_waitq = wait;
+		return 0;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
-- 
2.27.0



* [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
  2020-06-18 14:43 Jens Axboe
                   ` (13 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
@ 2020-06-18 14:43 ` Jens Axboe
  2020-06-23 12:39   ` Pavel Begunkov
  2020-06-18 14:45 ` [PATCHSET v7 0/12] Add support for async buffered reads Jens Axboe
  15 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:43 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page-unlock
callbacks to support retry-based async IO. This is a lot more efficient
than async thread offload.

The retry is done similarly to how we handle poll-based retry: from the
unlock callback, we simply queue the retry to a task_work-based handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 137 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 40413fb9d07b..94282be1c413 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
 #include <linux/fs_struct.h>
 #include <linux/splice.h>
 #include <linux/task_work.h>
+#include <linux/pagemap.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -503,6 +504,8 @@ struct io_async_rw {
 	struct iovec			*iov;
 	ssize_t				nr_segs;
 	ssize_t				size;
+	struct wait_page_queue		wpq;
+	struct callback_head		task_work;
 };
 
 struct io_async_ctx {
@@ -2750,6 +2753,126 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+static void __io_async_buf_error(struct io_kiocb *req, int error)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock_irq(&ctx->completion_lock);
+	io_cqring_fill_event(req, error);
+	io_commit_cqring(ctx);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
+	req_set_fail_links(req);
+	io_double_put_req(req);
+}
+
+static void io_async_buf_cancel(struct callback_head *cb)
+{
+	struct io_async_rw *rw;
+	struct io_kiocb *req;
+
+	rw = container_of(cb, struct io_async_rw, task_work);
+	req = rw->wpq.wait.private;
+	__io_async_buf_error(req, -ECANCELED);
+}
+
+static void io_async_buf_retry(struct callback_head *cb)
+{
+	struct io_async_rw *rw;
+	struct io_ring_ctx *ctx;
+	struct io_kiocb *req;
+
+	rw = container_of(cb, struct io_async_rw, task_work);
+	req = rw->wpq.wait.private;
+	ctx = req->ctx;
+
+	__set_current_state(TASK_RUNNING);
+	if (!io_sq_thread_acquire_mm(ctx, req)) {
+		mutex_lock(&ctx->uring_lock);
+		__io_queue_sqe(req, NULL);
+		mutex_unlock(&ctx->uring_lock);
+	} else {
+		__io_async_buf_error(req, -EFAULT);
+	}
+}
+
+static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode,
+			     int sync, void *arg)
+{
+	struct wait_page_queue *wpq;
+	struct io_kiocb *req = wait->private;
+	struct io_async_rw *rw = &req->io->rw;
+	struct wait_page_key *key = arg;
+	struct task_struct *tsk;
+	int ret;
+
+	wpq = container_of(wait, struct wait_page_queue, wait);
+
+	ret = wake_page_match(wpq, key);
+	if (ret != 1)
+		return ret;
+
+	list_del_init(&wait->entry);
+
+	init_task_work(&rw->task_work, io_async_buf_retry);
+	/* submit ref gets dropped, acquire a new one */
+	refcount_inc(&req->refs);
+	tsk = req->task;
+	ret = task_work_add(tsk, &rw->task_work, true);
+	if (unlikely(ret)) {
+		/* queue just for cancelation */
+		init_task_work(&rw->task_work, io_async_buf_cancel);
+		tsk = io_wq_get_task(req->ctx->io_wq);
+		task_work_add(tsk, &rw->task_work, true);
+	}
+	wake_up_process(tsk);
+	return 1;
+}
+
+static bool io_rw_should_retry(struct io_kiocb *req)
+{
+	struct kiocb *kiocb = &req->rw.kiocb;
+	int ret;
+
+	/* never retry for NOWAIT, we just complete with -EAGAIN */
+	if (req->flags & REQ_F_NOWAIT)
+		return false;
+
+	/* already tried, or we're doing O_DIRECT */
+	if (kiocb->ki_flags & (IOCB_DIRECT | IOCB_WAITQ))
+		return false;
+	/*
+	 * just use poll if we can, and don't attempt if the fs doesn't
+	 * support callback based unlocks
+	 */
+	if (file_can_poll(req->file) || !(req->file->f_mode & FMODE_BUF_RASYNC))
+		return false;
+
+	/*
+	 * If request type doesn't require req->io to defer in general,
+	 * we need to allocate it here
+	 */
+	if (!req->io && __io_alloc_async_ctx(req))
+		return false;
+
+	ret = kiocb_wait_page_queue_init(kiocb, &req->io->rw.wpq,
+						io_async_buf_func, req);
+	if (!ret) {
+		io_get_req_task(req);
+		return true;
+	}
+
+	return false;
+}
+
+static int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
+{
+	if (req->file->f_op->read_iter)
+		return call_read_iter(req->file, &req->rw.kiocb, iter);
+	return loop_rw_iter(READ, req->file, &req->rw.kiocb, iter);
+}
+
 static int io_read(struct io_kiocb *req, bool force_nonblock)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
@@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 		unsigned long nr_segs = iter.nr_segs;
 		ssize_t ret2 = 0;
 
-		if (req->file->f_op->read_iter)
-			ret2 = call_read_iter(req->file, kiocb, &iter);
-		else
-			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
+		ret2 = io_iter_do_read(req, &iter);
 
 		/* Catch -EAGAIN return for forced non-blocking submission */
 		if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
@@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 			ret = io_setup_async_rw(req, io_size, iovec,
 						inline_vecs, &iter);
 			if (ret)
-				goto out_free;
+				goto out;
 			/* any defer here is final, must blocking retry */
 			if (!(req->flags & REQ_F_NOWAIT) &&
 			    !file_can_poll(req->file))
 				req->flags |= REQ_F_MUST_PUNT;
+			/* if we can retry, do so with the callbacks armed */
+			if (io_rw_should_retry(req)) {
+				ret2 = io_iter_do_read(req, &iter);
+				if (ret2 == -EIOCBQUEUED) {
+					goto out;
+				} else if (ret2 != -EAGAIN) {
+					kiocb_done(kiocb, ret2);
+					goto out;
+				}
+			}
+			kiocb->ki_flags &= ~IOCB_WAITQ;
 			return -EAGAIN;
 		}
 	}
-out_free:
-	kfree(iovec);
-	req->flags &= ~REQ_F_NEED_CLEANUP;
+out:
 	return ret;
 }
 
-- 
2.27.0



* [PATCHSET v7 0/12] Add support for async buffered reads
  2020-06-18 14:43 Jens Axboe
                   ` (14 preceding siblings ...)
  2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-06-18 14:45 ` Jens Axboe
  15 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-18 14:45 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 6/18/20 8:43 AM, Jens Axboe wrote:
> We technically support this already through io_uring, but it's
> implemented with a thread backend to support cases where we would
> block. This isn't ideal.

Apparently I'm cursed when I send these out, corrected the subject
line to indicate this is v7 of the series...

-- 
Jens Axboe



* Re: [PATCH 12/15] btrfs: flag files as supporting buffered async reads
  2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
@ 2020-06-19 11:11   ` David Sterba
  0 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2020-06-19 11:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
	Jens Axboe, Chris Mason

On Thu, Jun 18, 2020 at 08:43:52AM -0600, Jens Axboe wrote:
> btrfs uses generic_file_read_iter(), which already supports this.
> 
> Acked-by: Chris Mason <clm@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Can you please CC the mailing list for btrfs patches? Thanks.


* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
  2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
@ 2020-06-19 14:12   ` Pavel Begunkov
  2020-06-19 14:22     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2020-06-19 14:12 UTC (permalink / raw)
  To: io-uring, Jens Axboe; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 18/06/2020 17:43, Jens Axboe wrote:
> Mark the plug with nowait == true, which will cause requests to avoid
> blocking on request allocation. If they do, we catch them and reissue
> them from a task_work based handler.
> 
> Normally we can catch -EAGAIN directly, but the hard case is for split
> requests. As an example, the application issues a 512KB request. The
> block core will split this into 128KB if that's the max size for the
> device. The first request issues just fine, but we run into -EAGAIN for
> some latter splits for the same request. As the bio is split, we don't
> get to see the -EAGAIN until one of the actual reads complete, and hence
> we cannot handle it inline as part of submission.
> 
> This does potentially cause re-reads of parts of the range, as the whole
> request is reissued. There's currently no better way to handle this.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 124 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2e257c5a1866..40413fb9d07b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>  static void __io_queue_sqe(struct io_kiocb *req,
>  			   const struct io_uring_sqe *sqe);
>  
...> +
> +static void io_rw_resubmit(struct callback_head *cb)
> +{
> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
> +	struct io_ring_ctx *ctx = req->ctx;
> +	int err;
> +
> +	__set_current_state(TASK_RUNNING);
> +
> +	err = io_sq_thread_acquire_mm(ctx, req);
> +
> +	if (io_resubmit_prep(req, err)) {
> +		refcount_inc(&req->refs);
> +		io_queue_async_work(req);
> +	}

Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
linked reqs and some extra. I think I'll rebase on top of this.

> +}
> +#endif
> +
> +static bool io_rw_reissue(struct io_kiocb *req, long res)
> +{
> +#ifdef CONFIG_BLOCK
> +	struct task_struct *tsk;
> +	int ret;
> +
> +	if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
> +		return false;
> +
> +	tsk = req->task;
> +	init_task_work(&req->task_work, io_rw_resubmit);
> +	ret = task_work_add(tsk, &req->task_work, true);

I don't like that the request becomes un-discoverable for cancellation
awhile sitting in the task_work list. Poll stuff at least have hash_node
for that.

> +	if (!ret)
> +		return true;
> +#endif
> +	return false;
> +}
> +

-- 
Pavel Begunkov


* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
  2020-06-19 14:12   ` Pavel Begunkov
@ 2020-06-19 14:22     ` Jens Axboe
  2020-06-19 14:30       ` Pavel Begunkov
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-19 14:22 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 6/19/20 8:12 AM, Pavel Begunkov wrote:
> On 18/06/2020 17:43, Jens Axboe wrote:
>> Mark the plug with nowait == true, which will cause requests to avoid
>> blocking on request allocation. If they do, we catch them and reissue
>> them from a task_work based handler.
>>
>> Normally we can catch -EAGAIN directly, but the hard case is for split
>> requests. As an example, the application issues a 512KB request. The
>> block core will split this into 128KB if that's the max size for the
>> device. The first request issues just fine, but we run into -EAGAIN for
>> some latter splits for the same request. As the bio is split, we don't
>> get to see the -EAGAIN until one of the actual reads complete, and hence
>> we cannot handle it inline as part of submission.
>>
>> This does potentially cause re-reads of parts of the range, as the whole
>> request is reissued. There's currently no better way to handle this.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 124 insertions(+), 24 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 2e257c5a1866..40413fb9d07b 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>>  static void __io_queue_sqe(struct io_kiocb *req,
>>  			   const struct io_uring_sqe *sqe);
>>  
> ...> +
>> +static void io_rw_resubmit(struct callback_head *cb)
>> +{
>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>> +	struct io_ring_ctx *ctx = req->ctx;
>> +	int err;
>> +
>> +	__set_current_state(TASK_RUNNING);
>> +
>> +	err = io_sq_thread_acquire_mm(ctx, req);
>> +
>> +	if (io_resubmit_prep(req, err)) {
>> +		refcount_inc(&req->refs);
>> +		io_queue_async_work(req);
>> +	}
> 
> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
> linked reqs and some extra. I think I'll rebase on top of this.

Yes, there's certainly overlap there. I consider this series basically
wrapped up, so feel free to just base on top of it.

>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>> +{
>> +#ifdef CONFIG_BLOCK
>> +	struct task_struct *tsk;
>> +	int ret;
>> +
>> +	if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>> +		return false;
>> +
>> +	tsk = req->task;
>> +	init_task_work(&req->task_work, io_rw_resubmit);
>> +	ret = task_work_add(tsk, &req->task_work, true);
> 
> I don't like that the request becomes un-discoverable for cancellation
> awhile sitting in the task_work list. Poll stuff at least have hash_node
> for that.

Async buffered IO was never cancelable, so it doesn't really matter.
It's tied to the task, so we know it'll get executed - either run, or
canceled if the task is going away. This is really not that different
from having the work discoverable through io-wq queueing before, since
the latter could never be canceled anyway as it sits there
uninterruptibly waiting for IO completion.

-- 
Jens Axboe



* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
  2020-06-19 14:22     ` Jens Axboe
@ 2020-06-19 14:30       ` Pavel Begunkov
  2020-06-19 14:36         ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2020-06-19 14:30 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 19/06/2020 17:22, Jens Axboe wrote:
> On 6/19/20 8:12 AM, Pavel Begunkov wrote:
>> On 18/06/2020 17:43, Jens Axboe wrote:
>>> Mark the plug with nowait == true, which will cause requests to avoid
>>> blocking on request allocation. If they do, we catch them and reissue
>>> them from a task_work based handler.
>>>
>>> Normally we can catch -EAGAIN directly, but the hard case is for split
>>> requests. As an example, the application issues a 512KB request. The
>>> block core will split this into 128KB if that's the max size for the
>>> device. The first request issues just fine, but we run into -EAGAIN for
>>> some latter splits for the same request. As the bio is split, we don't
>>> get to see the -EAGAIN until one of the actual reads complete, and hence
>>> we cannot handle it inline as part of submission.
>>>
>>> This does potentially cause re-reads of parts of the range, as the whole
>>> request is reissued. There's currently no better way to handle this.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>>>  1 file changed, 124 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 2e257c5a1866..40413fb9d07b 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>>>  static void __io_queue_sqe(struct io_kiocb *req,
>>>  			   const struct io_uring_sqe *sqe);
>>>  
>> ...> +
>>> +static void io_rw_resubmit(struct callback_head *cb)
>>> +{
>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>>> +	struct io_ring_ctx *ctx = req->ctx;
>>> +	int err;
>>> +
>>> +	__set_current_state(TASK_RUNNING);
>>> +
>>> +	err = io_sq_thread_acquire_mm(ctx, req);
>>> +
>>> +	if (io_resubmit_prep(req, err)) {
>>> +		refcount_inc(&req->refs);
>>> +		io_queue_async_work(req);
>>> +	}
>>
>> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
>> linked reqs and some extra. I think I'll rebase on top of this.
> 
> Yes, there's certainly overlap there. I consider this series basically
> wrapped up, so feel free to just base on top of it.
> 
>>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>>> +{
>>> +#ifdef CONFIG_BLOCK
>>> +	struct task_struct *tsk;
>>> +	int ret;
>>> +
>>> +	if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>>> +		return false;
>>> +
>>> +	tsk = req->task;
>>> +	init_task_work(&req->task_work, io_rw_resubmit);
>>> +	ret = task_work_add(tsk, &req->task_work, true);
>>
>> I don't like that the request becomes un-discoverable for cancellation
>> awhile sitting in the task_work list. Poll stuff at least have hash_node
>> for that.
> 
> Async buffered IO was never cancelable, so it doesn't really matter.
> It's tied to the task, so we know it'll get executed - either run, or
> canceled if the task is going away. This is really not that different
> from having the work discoverable through io-wq queueing before, since
> the latter could never be canceled anyway as it sits there
> uninterruptibly waiting for IO completion.

Makes sense. I was thinking about using this task-requeue for all kinds
of requests. Though, instead of speculating it'd be better for me to embody
ideas into patches and see.

-- 
Pavel Begunkov


* Re: [PATCH 04/15] io_uring: re-issue block requests that failed because of resources
  2020-06-19 14:30       ` Pavel Begunkov
@ 2020-06-19 14:36         ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-19 14:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 6/19/20 8:30 AM, Pavel Begunkov wrote:
> On 19/06/2020 17:22, Jens Axboe wrote:
>> On 6/19/20 8:12 AM, Pavel Begunkov wrote:
>>> On 18/06/2020 17:43, Jens Axboe wrote:
>>>> Mark the plug with nowait == true, which will cause requests to avoid
>>>> blocking on request allocation. If they do, we catch them and reissue
>>>> them from a task_work based handler.
>>>>
>>>> Normally we can catch -EAGAIN directly, but the hard case is for split
>>>> requests. As an example, the application issues a 512KB request. The
>>>> block core will split this into 128KB if that's the max size for the
>>>> device. The first request issues just fine, but we run into -EAGAIN for
>>>> some latter splits for the same request. As the bio is split, we don't
>>>> get to see the -EAGAIN until one of the actual reads complete, and hence
>>>> we cannot handle it inline as part of submission.
>>>>
>>>> This does potentially cause re-reads of parts of the range, as the whole
>>>> request is reissued. There's currently no better way to handle this.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  fs/io_uring.c | 148 ++++++++++++++++++++++++++++++++++++++++++--------
>>>>  1 file changed, 124 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 2e257c5a1866..40413fb9d07b 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -900,6 +900,13 @@ static int io_file_get(struct io_submit_state *state, struct io_kiocb *req,
>>>>  static void __io_queue_sqe(struct io_kiocb *req,
>>>>  			   const struct io_uring_sqe *sqe);
>>>>  
>>> ...> +
>>>> +static void io_rw_resubmit(struct callback_head *cb)
>>>> +{
>>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
>>>> +	struct io_ring_ctx *ctx = req->ctx;
>>>> +	int err;
>>>> +
>>>> +	__set_current_state(TASK_RUNNING);
>>>> +
>>>> +	err = io_sq_thread_acquire_mm(ctx, req);
>>>> +
>>>> +	if (io_resubmit_prep(req, err)) {
>>>> +		refcount_inc(&req->refs);
>>>> +		io_queue_async_work(req);
>>>> +	}
>>>
>>> Hmm, I have similar stuff but for iopoll. On top removing grab_env* for
>>> linked reqs and some extra. I think I'll rebase on top of this.
>>
>> Yes, there's certainly overlap there. I consider this series basically
>> wrapped up, so feel free to just base on top of it.
>>
>>>> +static bool io_rw_reissue(struct io_kiocb *req, long res)
>>>> +{
>>>> +#ifdef CONFIG_BLOCK
>>>> +	struct task_struct *tsk;
>>>> +	int ret;
>>>> +
>>>> +	if ((res != -EAGAIN && res != -EOPNOTSUPP) || io_wq_current_is_worker())
>>>> +		return false;
>>>> +
>>>> +	tsk = req->task;
>>>> +	init_task_work(&req->task_work, io_rw_resubmit);
>>>> +	ret = task_work_add(tsk, &req->task_work, true);
>>>
>>> I don't like that the request becomes un-discoverable for cancellation
>>> awhile sitting in the task_work list. Poll stuff at least have hash_node
>>> for that.
>>
>> Async buffered IO was never cancelable, so it doesn't really matter.
>> It's tied to the task, so we know it'll get executed - either run, or
>> canceled if the task is going away. This is really not that different
>> from having the work discoverable through io-wq queueing before, since
>> the latter could never be canceled anyway as it sits there
>> uninterruptibly waiting for IO completion.
> 
> Makes sense. I was thinking about using this task-requeue for all kinds
> of requests. Though, instead of speculating, it'd be better for me to turn
> these ideas into patches and see.

And that's fine, for requests where it matters, on-the-side
discoverability can still be a thing. If we're in the task itself where
it is queued, that provides us safety from the work going away from under
us. Then we just have to mark it appropriately, if it needs to get
canceled instead of run to completion.

Some care needed, of course, but there's nothing that would prevent this
from working. Ideally we'd be able to peel off a task_work entry, but
that's kind of difficult with the singly linked non-locked list.

-- 
Jens Axboe



* Re: [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
  2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-06-23 12:39   ` Pavel Begunkov
  2020-06-23 14:38     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Pavel Begunkov @ 2020-06-23 12:39 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

On 18/06/2020 17:43, Jens Axboe wrote:
> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
> the buffered read to an io-wq worker. Instead we can rely on page
> unlocking callbacks to support retry based async IO. This is a lot more
> efficient than doing async thread offload.
> 
> The retry is done similarly to how we handle poll based retry. From
> the unlock callback, we simply queue the retry to a task_work based
> handler.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 137 insertions(+), 8 deletions(-)
> 
...
>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>  {
>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> @@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  		unsigned long nr_segs = iter.nr_segs;
>  		ssize_t ret2 = 0;
>  
> -		if (req->file->f_op->read_iter)
> -			ret2 = call_read_iter(req->file, kiocb, &iter);
> -		else
> -			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
> +		ret2 = io_iter_do_read(req, &iter);
>  
>  		/* Catch -EAGAIN return for forced non-blocking submission */
>  		if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
> @@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  			ret = io_setup_async_rw(req, io_size, iovec,
>  						inline_vecs, &iter);
>  			if (ret)
> -				goto out_free;
> +				goto out;
>  			/* any defer here is final, must blocking retry */
>  			if (!(req->flags & REQ_F_NOWAIT) &&
>  			    !file_can_poll(req->file))
>  				req->flags |= REQ_F_MUST_PUNT;
> +			/* if we can retry, do so with the callbacks armed */
> +			if (io_rw_should_retry(req)) {
> +				ret2 = io_iter_do_read(req, &iter);
> +				if (ret2 == -EIOCBQUEUED) {
> +					goto out;
> +				} else if (ret2 != -EAGAIN) {
> +					kiocb_done(kiocb, ret2);
> +					goto out;
> +				}
> +			}
> +			kiocb->ki_flags &= ~IOCB_WAITQ;
>  			return -EAGAIN;
>  		}
>  	}
> -out_free:
> -	kfree(iovec);
> -	req->flags &= ~REQ_F_NEED_CLEANUP;

This looks fishy. For instance, if it fails early in rw_verify_area(), how would
it free an iovec that is still on-stack? Is that handled somehow?

> +out:
>  	return ret;
>  }
>  
> 

-- 
Pavel Begunkov


* Re: [PATCH 15/15] io_uring: support true async buffered reads, if file provides it
  2020-06-23 12:39   ` Pavel Begunkov
@ 2020-06-23 14:38     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-23 14:38 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm

On 6/23/20 6:39 AM, Pavel Begunkov wrote:
> On 18/06/2020 17:43, Jens Axboe wrote:
>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>> the buffered read to an io-wq worker. Instead we can rely on page
>> unlocking callbacks to support retry based async IO. This is a lot more
>> efficient than doing async thread offload.
>>
>> The retry is done similarly to how we handle poll based retry. From
>> the unlock callback, we simply queue the retry to a task_work based
>> handler.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/io_uring.c | 145 +++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 137 insertions(+), 8 deletions(-)
>>
> ...
>>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  {
>>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>> @@ -2784,10 +2907,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  		unsigned long nr_segs = iter.nr_segs;
>>  		ssize_t ret2 = 0;
>>  
>> -		if (req->file->f_op->read_iter)
>> -			ret2 = call_read_iter(req->file, kiocb, &iter);
>> -		else
>> -			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
>> +		ret2 = io_iter_do_read(req, &iter);
>>  
>>  		/* Catch -EAGAIN return for forced non-blocking submission */
>>  		if (!force_nonblock || (ret2 != -EAGAIN && ret2 != -EIO)) {
>> @@ -2799,17 +2919,26 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  			ret = io_setup_async_rw(req, io_size, iovec,
>>  						inline_vecs, &iter);
>>  			if (ret)
>> -				goto out_free;
>> +				goto out;
>>  			/* any defer here is final, must blocking retry */
>>  			if (!(req->flags & REQ_F_NOWAIT) &&
>>  			    !file_can_poll(req->file))
>>  				req->flags |= REQ_F_MUST_PUNT;
>> +			/* if we can retry, do so with the callbacks armed */
>> +			if (io_rw_should_retry(req)) {
>> +				ret2 = io_iter_do_read(req, &iter);
>> +				if (ret2 == -EIOCBQUEUED) {
>> +					goto out;
>> +				} else if (ret2 != -EAGAIN) {
>> +					kiocb_done(kiocb, ret2);
>> +					goto out;
>> +				}
>> +			}
>> +			kiocb->ki_flags &= ~IOCB_WAITQ;
>>  			return -EAGAIN;
>>  		}
>>  	}
>> -out_free:
>> -	kfree(iovec);
>> -	req->flags &= ~REQ_F_NEED_CLEANUP;
> 
> This looks fishy. For instance, if it fails early in rw_verify_area(), how would
> it free an iovec that is still on-stack? Is that handled somehow?

This was tweaked and rebased on top of the REQ_F_NEED_CLEANUP change;
it should be correct in the tree:

https://git.kernel.dk/cgit/linux-block/tree/fs/io_uring.c?h=for-5.9/io_uring#n2908

-- 
Jens Axboe



* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
@ 2020-06-24  1:02   ` Dave Chinner
  2020-06-24  1:46     ` Matthew Wilcox
  2020-06-24  4:38   ` Dave Chinner
  1 sibling, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-06-24  1:02 UTC (permalink / raw)
  To: Add, support, for, async, buffered, reads
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
	Jens Axboe, Johannes Weiner

On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> The read-ahead shouldn't block, so allow it to be done even if
> IOCB_NOWAIT is set in the kiocb.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  mm/filemap.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f0ae9a6308cb..3378d4fca883 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2028,8 +2028,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>  
>  		page = find_get_page(mapping, index);
>  		if (!page) {
> -			if (iocb->ki_flags & IOCB_NOWAIT)
> -				goto would_block;
>  			page_cache_sync_readahead(mapping,
>  					ra, filp,
>  					index, last_index - index);

Doesn't this break preadv2(RWF_NOWAIT) semantics for buffered
reads? i.e. this can now block on memory allocation for the page
cache, which is something RWF_NOWAIT IO should not do....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24  1:02   ` Dave Chinner
@ 2020-06-24  1:46     ` Matthew Wilcox
  2020-06-24 15:00       ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2020-06-24  1:46 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Add, support, for, async, buffered, reads, io-uring,
	linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe,
	Johannes Weiner

On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> > The read-ahead shouldn't block, so allow it to be done even if
> > IOCB_NOWAIT is set in the kiocb.
> 
> Doesn't this break preadv2(RWF_NOWAIT) semantics for buffered
> reads? i.e. this can now block on memory allocation for the page
> cache, which is something RWF_NOWAIT IO should not do....

Yes.  This eventually ends up in page_cache_readahead_unbounded()
which gets its gfp flags from readahead_gfp_mask(mapping).

I'd be quite happy to add a gfp_t to struct readahead_control.
The other thing I've been looking into for other reasons is adding
a memalloc_nowait_{save,restore}, which would avoid passing down
the gfp_t.


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
  2020-06-24  1:02   ` Dave Chinner
@ 2020-06-24  4:38   ` Dave Chinner
  2020-06-24 15:01     ` Jens Axboe
  1 sibling, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2020-06-24  4:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm,
	Jens Axboe, Johannes Weiner

On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
> The read-ahead shouldn't block, so allow it to be done even if
> IOCB_NOWAIT is set in the kiocb.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

BTW, Jens, in case nobody had mentioned it, the Reply-To field for
the patches in this patchset is screwed up:

| Reply-To: Add@vger.kernel.org, support@vger.kernel.org, for@vger.kernel.org,
|         async@vger.kernel.org, buffered@vger.kernel.org,
| 	        reads@vger.kernel.org

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24  1:46     ` Matthew Wilcox
@ 2020-06-24 15:00       ` Jens Axboe
  2020-06-24 15:35         ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-24 15:00 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Johannes Weiner

On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
>> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>>> The read-ahead shouldn't block, so allow it to be done even if
>>> IOCB_NOWAIT is set in the kiocb.
>>
>> Doesn't this break preadv2(RWF_NOWAIT) semantics for buffered
>> reads? i.e. this can now block on memory allocation for the page
>> cache, which is something RWF_NOWAIT IO should not do....
> 
> Yes.  This eventually ends up in page_cache_readahead_unbounded()
> which gets its gfp flags from readahead_gfp_mask(mapping).
> 
> I'd be quite happy to add a gfp_t to struct readahead_control.
> The other thing I've been looking into for other reasons is adding
> a memalloc_nowait_{save,restore}, which would avoid passing down
> the gfp_t.

That was my first thought, having the memalloc_foo_save/restore for
this. I don't think adding a gfp_t to readahead_control is going
to be super useful, seems like the kind of thing that should be
non-blocking by default.

-- 
Jens Axboe



* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24  4:38   ` Dave Chinner
@ 2020-06-24 15:01     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-06-24 15:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Johannes Weiner

On 6/23/20 10:38 PM, Dave Chinner wrote:
> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>> The read-ahead shouldn't block, so allow it to be done even if
>> IOCB_NOWAIT is set in the kiocb.
>>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> BTW, Jens, in case nobody had mentioned it, the Reply-To field for
> the patches in this patchset is screwed up:
> 
> | Reply-To: Add@vger.kernel.org, support@vger.kernel.org, for@vger.kernel.org,
> |         async@vger.kernel.org, buffered@vger.kernel.org,
> | 	        reads@vger.kernel.org

Yeah, I pasted the subject line into the wrong spot for git send-email,
hence the reply-to is boogered, and the subject line was empty for the
cover letter...

-- 
Jens Axboe



* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24 15:00       ` Jens Axboe
@ 2020-06-24 15:35         ` Jens Axboe
  2020-06-24 16:41           ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-24 15:35 UTC (permalink / raw)
  To: Matthew Wilcox, Dave Chinner
  Cc: io-uring, linux-fsdevel, linux-kernel, linux-mm, akpm, Johannes Weiner

On 6/24/20 9:00 AM, Jens Axboe wrote:
> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>> On Wed, Jun 24, 2020 at 11:02:53AM +1000, Dave Chinner wrote:
>>> On Thu, Jun 18, 2020 at 08:43:45AM -0600, Jens Axboe wrote:
>>>> The read-ahead shouldn't block, so allow it to be done even if
>>>> IOCB_NOWAIT is set in the kiocb.
>>>
>>> Doesn't this break preadv2(RWF_NOWAIT) semantics for buffered
>>> reads? i.e. this can now block on memory allocation for the page
>>> cache, which is something RWF_NOWAIT IO should not do....
>>
>> Yes.  This eventually ends up in page_cache_readahead_unbounded()
>> which gets its gfp flags from readahead_gfp_mask(mapping).
>>
>> I'd be quite happy to add a gfp_t to struct readahead_control.
>> The other thing I've been looking into for other reasons is adding
>> a memalloc_nowait_{save,restore}, which would avoid passing down
>> the gfp_t.
> 
> That was my first thought, having the memalloc_foo_save/restore for
> this. I don't think adding a gfp_t to readahead_control is going
> to be super useful, seems like the kind of thing that should be
> non-blocking by default.

We're already doing memalloc_nofs_save/restore in
page_cache_readahead_unbounded(), so I think all we need is to just do a
noio dance in generic_file_buffered_read() and that should be enough.

diff --git a/mm/filemap.c b/mm/filemap.c
index a5b1fa8f7ce4..c29d4b310ed6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -41,6 +41,7 @@
 #include <linux/delayacct.h>
 #include <linux/psi.h>
 #include <linux/ramfs.h>
+#include <linux/sched/mm.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -2011,6 +2012,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
 	struct file_ra_state *ra = &filp->f_ra;
+	const bool nowait = (iocb->ki_flags & IOCB_NOWAIT) != 0;
 	loff_t *ppos = &iocb->ki_pos;
 	pgoff_t index;
 	pgoff_t last_index;
@@ -2044,9 +2046,15 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		page = find_get_page(mapping, index);
 		if (!page) {
+			unsigned int flags;
+
+			if (nowait)
+				flags = memalloc_noio_save();
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
+			if (nowait)
+				memalloc_noio_restore(flags);
 			page = find_get_page(mapping, index);
 			if (unlikely(page == NULL))
 				goto no_cached_page;
@@ -2070,7 +2078,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 				error = wait_on_page_locked_async(page,
 								iocb->ki_waitq);
 			} else {
-				if (iocb->ki_flags & IOCB_NOWAIT) {
+				if (nowait) {
 					put_page(page);
 					goto would_block;
 				}
@@ -2185,7 +2193,7 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
 		}
 
 readpage:
-		if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (nowait) {
 			unlock_page(page);
 			put_page(page);
 			goto would_block;

-- 
Jens Axboe



* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24 15:35         ` Jens Axboe
@ 2020-06-24 16:41           ` Matthew Wilcox
  2020-06-24 16:44             ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2020-06-24 16:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, io-uring, linux-fsdevel, linux-kernel, linux-mm,
	akpm, Johannes Weiner

On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
> On 6/24/20 9:00 AM, Jens Axboe wrote:
> > On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> >> I'd be quite happy to add a gfp_t to struct readahead_control.
> >> The other thing I've been looking into for other reasons is adding
> >> a memalloc_nowait_{save,restore}, which would avoid passing down
> >> the gfp_t.
> > 
> > That was my first thought, having the memalloc_foo_save/restore for
> > this. I don't think adding a gfp_t to readahead_control is going
> > to be super useful, seems like the kind of thing that should be
> > non-blocking by default.
> 
> We're already doing memalloc_nofs_save/restore in
> page_cache_readahead_unbounded(), so I think all we need is to just do a
> noio dance in generic_file_buffered_read() and that should be enough.

I think we can still sleep though, right?  I was thinking more
like this:

http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24 16:41           ` Matthew Wilcox
@ 2020-06-24 16:44             ` Jens Axboe
  2020-07-07 11:38               ` Andreas Grünbacher
  0 siblings, 1 reply; 36+ messages in thread
From: Jens Axboe @ 2020-06-24 16:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Dave Chinner, io-uring, linux-fsdevel, linux-kernel, linux-mm,
	akpm, Johannes Weiner

On 6/24/20 10:41 AM, Matthew Wilcox wrote:
> On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
>> On 6/24/20 9:00 AM, Jens Axboe wrote:
>>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>>>> I'd be quite happy to add a gfp_t to struct readahead_control.
>>>> The other thing I've been looking into for other reasons is adding
>>>> a memalloc_nowait_{save,restore}, which would avoid passing down
>>>> the gfp_t.
>>>
>>> That was my first thought, having the memalloc_foo_save/restore for
>>> this. I don't think adding a gfp_t to readahead_control is going
>>> to be super useful, seems like the kind of thing that should be
>>> non-blocking by default.
>>
>> We're already doing memalloc_nofs_save/restore in
>> page_cache_readahead_unbounded(), so I think all we need is to just do a
>> noio dance in generic_file_buffered_read() and that should be enough.
> 
> I think we can still sleep though, right?  I was thinking more
> like this:
> 
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc

Yeah, that's probably better. How do we want to handle this? I've already
got the other bits queued up. I can either add them to the series, or
pull a branch that'll go into Linus as well.

-- 
Jens Axboe



* Re: [PATCH 07/15] mm: add support for async page locking
  2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
@ 2020-07-07 11:32   ` Andreas Grünbacher
  2020-07-07 14:32     ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Grünbacher @ 2020-07-07 11:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, Linux FS-devel Mailing List, Linux Kernel Mailing List,
	Linux-MM, Andrew Morton, Johannes Weiner

Jens,

Am Do., 18. Juni 2020 um 16:47 Uhr schrieb Jens Axboe <axboe@kernel.dk>:
> Normally waiting for a page to become unlocked, or locking the page,
> requires waiting for IO to complete. Add support for lock_page_async()
> and wait_on_page_locked_async(), which are callback based instead. This
> allows a caller to get notified when a page becomes unlocked, rather
> than wait for it.
>
> We add a new iocb field, ki_waitq, to pass in the necessary data for this
> to happen. We can unionize this with ki_cookie, since that is only used
> for polled IO. Polled IO can never co-exist with async callbacks, as it is
> (by definition) polled completions. struct wait_page_key is made public,
> and we define struct wait_page_async as the interface between the caller
> and the core.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/fs.h      |  7 ++++++-
>  include/linux/pagemap.h | 17 ++++++++++++++++
>  mm/filemap.c            | 45 ++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 67 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6c4ab4dc1cd7..6ac919b40596 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -315,6 +315,8 @@ enum rw_hint {
>  #define IOCB_SYNC              (1 << 5)
>  #define IOCB_WRITE             (1 << 6)
>  #define IOCB_NOWAIT            (1 << 7)
> +/* iocb->ki_waitq is valid */
> +#define IOCB_WAITQ             (1 << 8)
>
>  struct kiocb {
>         struct file             *ki_filp;
> @@ -328,7 +330,10 @@ struct kiocb {
>         int                     ki_flags;
>         u16                     ki_hint;
>         u16                     ki_ioprio; /* See linux/ioprio.h */
> -       unsigned int            ki_cookie; /* for ->iopoll */
> +       union {
> +               unsigned int            ki_cookie; /* for ->iopoll */
> +               struct wait_page_queue  *ki_waitq; /* for async buffered IO */
> +       };
>
>         randomized_struct_fields_end
>  };
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2f18221bb5c8..e053e1d9a4d7 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -535,6 +535,7 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
>
>  extern void __lock_page(struct page *page);
>  extern int __lock_page_killable(struct page *page);
> +extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
>  extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>                                 unsigned int flags);
>  extern void unlock_page(struct page *page);
> @@ -571,6 +572,22 @@ static inline int lock_page_killable(struct page *page)
>         return 0;
>  }
>
> +/*
> + * lock_page_async - Lock the page, unless this would block. If the page
> + * is already locked, then queue a callback when the page becomes unlocked.
> + * This callback can then retry the operation.
> + *
> + * Returns 0 if the page is locked successfully, or -EIOCBQUEUED if the page
> + * was already locked and the callback defined in 'wait' was queued.
> + */
> +static inline int lock_page_async(struct page *page,
> +                                 struct wait_page_queue *wait)
> +{
> +       if (!trylock_page(page))
> +               return __lock_page_async(page, wait);
> +       return 0;
> +}
> +
>  /*
>   * lock_page_or_retry - Lock the page, unless this would block and the
>   * caller indicated that it can handle a retry.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c3175dbd8fba..e8aaf43bee9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1180,6 +1180,36 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
>  }
>  EXPORT_SYMBOL(wait_on_page_bit_killable);
>
> +static int __wait_on_page_locked_async(struct page *page,
> +                                      struct wait_page_queue *wait, bool set)
> +{
> +       struct wait_queue_head *q = page_waitqueue(page);
> +       int ret = 0;
> +
> +       wait->page = page;
> +       wait->bit_nr = PG_locked;
> +
> +       spin_lock_irq(&q->lock);
> +       __add_wait_queue_entry_tail(q, &wait->wait);
> +       SetPageWaiters(page);
> +       if (set)
> +               ret = !trylock_page(page);
> +       else
> +               ret = PageLocked(page);
> +       /*
> > +        * If we were successful now, we know we're still on the
> +        * waitqueue as we're still under the lock. This means it's
> +        * safe to remove and return success, we know the callback
> +        * isn't going to trigger.
> +        */
> +       if (!ret)
> +               __remove_wait_queue(q, &wait->wait);
> +       else
> +               ret = -EIOCBQUEUED;
> +       spin_unlock_irq(&q->lock);
> +       return ret;
> +}
> +
>  /**
>   * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
>   * @page: The page to wait for.
> @@ -1342,6 +1372,11 @@ int __lock_page_killable(struct page *__page)
>  }
>  EXPORT_SYMBOL_GPL(__lock_page_killable);
>
> +int __lock_page_async(struct page *page, struct wait_page_queue *wait)
> +{
> +       return __wait_on_page_locked_async(page, wait, true);
> +}
> +
>  /*
>   * Return values:
>   * 1 - page is locked; mmap_lock is still held.
> @@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>                 }
>
>  readpage:
> +               if (iocb->ki_flags & IOCB_NOWAIT) {
> +                       unlock_page(page);
> +                       put_page(page);
> +                       goto would_block;
> +               }

This hunk should have been part of "mm: allow read-ahead with
IOCB_NOWAIT set" ...

>                 /*
>                  * A previous I/O error may have been due to temporary
>                  * failures, eg. multipath errors.
> @@ -2150,7 +2190,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>                 }
>
>                 if (!PageUptodate(page)) {
> -                       error = lock_page_killable(page);
> +                       if (iocb->ki_flags & IOCB_WAITQ)
> +                               error = lock_page_async(page, iocb->ki_waitq);
> +                       else
> +                               error = lock_page_killable(page);
>                         if (unlikely(error))
>                                 goto readpage_error;
>                         if (!PageUptodate(page)) {
> --
> 2.27.0
>

Andreas


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-06-24 16:44             ` Jens Axboe
@ 2020-07-07 11:38               ` Andreas Grünbacher
  2020-07-07 14:31                 ` Jens Axboe
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Grünbacher @ 2020-07-07 11:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Matthew Wilcox, Dave Chinner, io-uring,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux-MM,
	Andrew Morton, Johannes Weiner

Am Mi., 24. Juni 2020 um 18:48 Uhr schrieb Jens Axboe <axboe@kernel.dk>:
>
> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
> > On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
> >> On 6/24/20 9:00 AM, Jens Axboe wrote:
> >>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
> >>>> I'd be quite happy to add a gfp_t to struct readahead_control.
> >>>> The other thing I've been looking into for other reasons is adding
> >>>> a memalloc_nowait_{save,restore}, which would avoid passing down
> >>>> the gfp_t.
> >>>
> >>> That was my first thought, having the memalloc_foo_save/restore for
> >>> this. I don't think adding a gfp_t to readahead_control is going
> >>> to be super useful, seems like the kind of thing that should be
> >>> non-blocking by default.
> >>
> >> We're already doing memalloc_nofs_save/restore in
> >> page_cache_readahead_unbounded(), so I think all we need is to just do a
> >> noio dance in generic_file_buffered_read() and that should be enough.
> >
> > I think we can still sleep though, right?  I was thinking more
> > like this:
> >
> > http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>
> Yeah, that's probably better. How do we want to handle this? I've already
> got the other bits queued up. I can either add them to the series, or
> pull a branch that'll go into Linus as well.

Also note my conflicting patch that introduces an IOCB_NOIO flag for
fixing a gfs2 regression:

https://lore.kernel.org/linux-fsdevel/20200703095325.1491832-2-agruenba@redhat.com/

Thanks,
Andreas


* Re: [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set
  2020-07-07 11:38               ` Andreas Grünbacher
@ 2020-07-07 14:31                 ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-07-07 14:31 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: Matthew Wilcox, Dave Chinner, io-uring,
	Linux FS-devel Mailing List, Linux Kernel Mailing List, Linux-MM,
	Andrew Morton, Johannes Weiner

On 7/7/20 5:38 AM, Andreas Grünbacher wrote:
> Am Mi., 24. Juni 2020 um 18:48 Uhr schrieb Jens Axboe <axboe@kernel.dk>:
>>
>> On 6/24/20 10:41 AM, Matthew Wilcox wrote:
>>> On Wed, Jun 24, 2020 at 09:35:19AM -0600, Jens Axboe wrote:
>>>> On 6/24/20 9:00 AM, Jens Axboe wrote:
>>>>> On 6/23/20 7:46 PM, Matthew Wilcox wrote:
>>>>>> I'd be quite happy to add a gfp_t to struct readahead_control.
>>>>>> The other thing I've been looking into for other reasons is adding
>>>>>> a memalloc_nowait_{save,restore}, which would avoid passing down
>>>>>> the gfp_t.
>>>>>
>>>>> That was my first thought, having the memalloc_foo_save/restore for
>>>>> this. I don't think adding a gfp_t to readahead_control is going
>>>>> to be super useful, seems like the kind of thing that should be
>>>>> non-blocking by default.
>>>>
>>>> We're already doing memalloc_nofs_save/restore in
>>>> page_cache_readahead_unbounded(), so I think all we need is to just do a
>>>> noio dance in generic_file_buffered_read() and that should be enough.
>>>
>>> I think we can still sleep though, right?  I was thinking more
>>> like this:
>>>
>>> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
>>
>> Yeah, that's probably better. How do we want to handle this? I've already
>> got the other bits queued up. I can either add them to the series, or
>> pull a branch that'll go into Linus as well.
> 
> Also note my conflicting patch that introduces an IOCB_NOIO flag for
> fixing a gfs2 regression:
> 
> https://lore.kernel.org/linux-fsdevel/20200703095325.1491832-2-agruenba@redhat.com/

Yeah I noticed, pretty easy to resolve though.

-- 
Jens Axboe



* Re: [PATCH 07/15] mm: add support for async page locking
  2020-07-07 11:32   ` Andreas Grünbacher
@ 2020-07-07 14:32     ` Jens Axboe
  0 siblings, 0 replies; 36+ messages in thread
From: Jens Axboe @ 2020-07-07 14:32 UTC (permalink / raw)
  To: Andreas Grünbacher
  Cc: io-uring, Linux FS-devel Mailing List, Linux Kernel Mailing List,
	Linux-MM, Andrew Morton, Johannes Weiner

On 7/7/20 5:32 AM, Andreas Grünbacher wrote:
>> @@ -2131,6 +2166,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
>>                 }
>>
>>  readpage:
>> +               if (iocb->ki_flags & IOCB_NOWAIT) {
>> +                       unlock_page(page);
>> +                       put_page(page);
>> +                       goto would_block;
>> +               }
> 
> This hunk should have been part of "mm: allow read-ahead with
> IOCB_NOWAIT set" ...

It probably should have... Oh well, it's been queued up for multiple
weeks at this point, not worth rebasing to move this hunk.

-- 
Jens Axboe




Thread overview: 36+ messages
-- links below jump to the message on this page --
2020-06-18 14:43 Jens Axboe
2020-06-18 14:43 ` [PATCH 01/15] block: provide plug based way of signaling forced no-wait semantics Jens Axboe
2020-06-18 14:43 ` [PATCH 02/15] io_uring: always plug for any number of IOs Jens Axboe
2020-06-18 14:43 ` [PATCH 03/15] io_uring: catch -EIO from buffered issue request failure Jens Axboe
2020-06-18 14:43 ` [PATCH 04/15] io_uring: re-issue block requests that failed because of resources Jens Axboe
2020-06-19 14:12   ` Pavel Begunkov
2020-06-19 14:22     ` Jens Axboe
2020-06-19 14:30       ` Pavel Begunkov
2020-06-19 14:36         ` Jens Axboe
2020-06-18 14:43 ` [PATCH 05/15] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
2020-06-24  1:02   ` Dave Chinner
2020-06-24  1:46     ` Matthew Wilcox
2020-06-24 15:00       ` Jens Axboe
2020-06-24 15:35         ` Jens Axboe
2020-06-24 16:41           ` Matthew Wilcox
2020-06-24 16:44             ` Jens Axboe
2020-07-07 11:38               ` Andreas Grünbacher
2020-07-07 14:31                 ` Jens Axboe
2020-06-24  4:38   ` Dave Chinner
2020-06-24 15:01     ` Jens Axboe
2020-06-18 14:43 ` [PATCH 06/15] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
2020-06-18 14:43 ` [PATCH 07/15] mm: add support for async page locking Jens Axboe
2020-07-07 11:32   ` Andreas Grünbacher
2020-07-07 14:32     ` Jens Axboe
2020-06-18 14:43 ` [PATCH 08/15] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
2020-06-18 14:43 ` [PATCH 09/15] fs: add FMODE_BUF_RASYNC Jens Axboe
2020-06-18 14:43 ` [PATCH 10/15] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
2020-06-18 14:43 ` [PATCH 11/15] xfs: flag files as supporting buffered async reads Jens Axboe
2020-06-18 14:43 ` [PATCH 12/15] btrfs: " Jens Axboe
2020-06-19 11:11   ` David Sterba
2020-06-18 14:43 ` [PATCH 13/15] ext4: flag " Jens Axboe
2020-06-18 14:43 ` [PATCH 14/15] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
2020-06-18 14:43 ` [PATCH 15/15] io_uring: support true async buffered reads, if file provides it Jens Axboe
2020-06-23 12:39   ` Pavel Begunkov
2020-06-23 14:38     ` Jens Axboe
2020-06-18 14:45 ` [PATCHSET v7 0/12] Add support for async buffered reads Jens Axboe
