IO-Uring Archive on lore.kernel.org
 help / color / Atom feed
* [PATCHSET v2 0/12] Add support for async buffered reads
@ 2020-05-23 18:57 Jens Axboe
  2020-05-23 18:57 ` [PATCH 01/12] block: read-ahead submission should imply no-wait as well Jens Axboe
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

We technically support this already through io_uring, but it's
implemented with a thread backend to support cases where we would
block. This isn't ideal.

After a few prep patches, the core of this patchset is adding support
for async callbacks on page unlock. With this primitive, we can simply
retry the IO operation. With io_uring, this works a lot like poll based
retry for files that support it. If a page is currently locked and
needed, -EIOCBQUEUED is returned with a callback armed. The callers
callback is responsible for restarting the operation.

With this callback primitive, we can add support for
generic_file_buffered_read(), which is what most file systems end up
using for buffered reads. XFS/ext4/btrfs/bdev is wired up, but probably
trivial to add more.

The file flags support for this by setting FMODE_BUF_RASYNC, similar
to what we do for FMODE_NOWAIT. Open to suggestions here if this is
the preferred method or not.

In terms of results, I wrote a small test app that randomly reads 4G
of data in 4K chunks from a file hosted by ext4. The app uses a queue
depth of 32. If you want to test yourself, you can just use buffered=1
with ioengine=io_uring with fio. No application changes are needed to
use the more optimized buffered async read.

preadv for comparison:
	real    1m13.821s
	user    0m0.558s
	sys     0m11.125s
	CPU	~13%

Mainline:
	real    0m12.054s
	user    0m0.111s
	sys     0m5.659s
	CPU	~32% + ~50% == ~82%

This patchset:
	real    0m9.283s
	user    0m0.147s
	sys     0m4.619s
	CPU	~52%

The CPU numbers are just a rough estimate. For the mainline io_uring
run, this includes the app itself and all the threads doing IO on its
behalf (32% for the app, ~1.6% per worker and 32 of them). Context
switch rate is much smaller with the patchset, since we only have the
one task performing IO.

The goal here is efficiency. Async thread offload adds latency, and
it also adds noticable overhead on items such as adding pages to the
page cache. By allowing proper async buffered read support, we don't
have X threads hammering on the same inode page cache, we have just
the single app actually doing IO.

Been beating on this and it's solid for me, and I'm now pretty happy
with how it all turned out. Not aware of any missing bits/pieces or
code cleanups that need doing.

Series can also be found here:

https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.3

or pull from:

git://git.kernel.dk/linux-block async-buffered.3

 fs/block_dev.c            |   2 +-
 fs/btrfs/file.c           |   2 +-
 fs/ext4/file.c            |   2 +-
 fs/io_uring.c             |  99 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_file.c         |   2 +-
 include/linux/blk_types.h |   3 +-
 include/linux/fs.h        |   5 ++
 include/linux/pagemap.h   |  64 ++++++++++++++++++++++
 mm/filemap.c              | 111 ++++++++++++++++++++++++--------------
 9 files changed, 245 insertions(+), 45 deletions(-)

Changes since v2:
- Get rid of unnecessary wait_page_async struct, just use wait_page_async
- Add another prep handler, adding wake_page_match()
- Use wake_page_match() in both callers
Changes since v1:
- Fix an issue with inline page locking
- Fix a potential race with __wait_on_page_locked_async()
- Fix a hang related to not setting page_match, thus missing a wakeup

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/12] block: read-ahead submission should imply no-wait as well
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 02/12] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

As read-ahead is opportunistic, don't block for request allocation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/blk_types.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index ccb895f911b1..c296463c15eb 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -374,7 +374,8 @@ enum req_flag_bits {
 #define REQ_INTEGRITY		(1ULL << __REQ_INTEGRITY)
 #define REQ_FUA			(1ULL << __REQ_FUA)
 #define REQ_PREFLUSH		(1ULL << __REQ_PREFLUSH)
-#define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
+#define REQ_RAHEAD		\
+	((1ULL << __REQ_RAHEAD) | (1ULL << __REQ_NOWAIT))
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
 #define REQ_CGROUP_PUNT		(1ULL << __REQ_CGROUP_PUNT)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 02/12] mm: allow read-ahead with IOCB_NOWAIT set
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
  2020-05-23 18:57 ` [PATCH 01/12] block: read-ahead submission should imply no-wait as well Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 03/12] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 23a051a7ef0f..80747f1377d5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2031,8 +2031,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 		page = find_get_page(mapping, index);
 		if (!page) {
-			if (iocb->ki_flags & IOCB_NOWAIT)
-				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 03/12] mm: abstract out wake_page_match() from wake_page_function()
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
  2020-05-23 18:57 ` [PATCH 01/12] block: read-ahead submission should imply no-wait as well Jens Axboe
  2020-05-23 18:57 ` [PATCH 02/12] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 04/12] mm: add support for async page locking Jens Axboe
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

No functional changes in this patch, just in preparation for allowing
more callers.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 37 +++++++++++++++++++++++++++++++++++++
 mm/filemap.c            | 35 ++++-------------------------------
 2 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a8f7bd8ea1c6..53d980f2208d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -456,6 +456,43 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 	return pgoff;
 }
 
+/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
+struct wait_page_key {
+	struct page *page;
+	int bit_nr;
+	int page_match;
+};
+
+struct wait_page_queue {
+	struct page *page;
+	int bit_nr;
+	wait_queue_entry_t wait;
+};
+
+static inline int wake_page_match(struct wait_page_queue *wait_page,
+				  struct wait_page_key *key)
+{
+	if (wait_page->page != key->page)
+	       return 0;
+	key->page_match = 1;
+
+	if (wait_page->bit_nr != key->bit_nr)
+		return 0;
+
+	/*
+	 * Stop walking if it's locked.
+	 * Is this safe if put_and_wait_on_page_locked() is in use?
+	 * Yes: the waker must hold a reference to this page, and if PG_locked
+	 * has now already been set by another task, that task must also hold
+	 * a reference to the *same usage* of this page; so there is no need
+	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 */
+	if (test_bit(key->bit_nr, &key->page->flags))
+		return -1;
+
+	return 1;
+}
+
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/filemap.c b/mm/filemap.c
index 80747f1377d5..e891b5bee8fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -990,43 +990,16 @@ void __init pagecache_init(void)
 	page_writeback_init();
 }
 
-/* This has the same layout as wait_bit_key - see fs/cachefiles/rdwr.c */
-struct wait_page_key {
-	struct page *page;
-	int bit_nr;
-	int page_match;
-};
-
-struct wait_page_queue {
-	struct page *page;
-	int bit_nr;
-	wait_queue_entry_t wait;
-};
-
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
 		= container_of(wait, struct wait_page_queue, wait);
+	int ret;
 
-	if (wait_page->page != key->page)
-	       return 0;
-	key->page_match = 1;
-
-	if (wait_page->bit_nr != key->bit_nr)
-		return 0;
-
-	/*
-	 * Stop walking if it's locked.
-	 * Is this safe if put_and_wait_on_page_locked() is in use?
-	 * Yes: the waker must hold a reference to this page, and if PG_locked
-	 * has now already been set by another task, that task must also hold
-	 * a reference to the *same usage* of this page; so there is no need
-	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
-	 */
-	if (test_bit(key->bit_nr, &key->page->flags))
-		return -1;
-
+	ret = wake_page_match(wait_page, key);
+	if (ret != 1)
+		return ret;
 	return autoremove_wake_function(wait, mode, sync, key);
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 04/12] mm: add support for async page locking
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (2 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 03/12] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We use the iocb->private field to pass in this necessary data for this
to happen. struct wait_page_key is made public, and we define struct
wait_page_async as the interface between the caller and the core.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h      |  2 ++
 include/linux/pagemap.h |  9 +++++++++
 mm/filemap.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7e84d823c6a8..82b989695ab9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,8 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+/* iocb->private holds wait_page_async struct */
+#define IOCB_WAITQ		(1 << 8)
 
 struct kiocb {
 	struct file		*ki_filp;
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 53d980f2208d..d3e63c9c61ae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -495,6 +495,7 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
 
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
+extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 				unsigned int flags);
 extern void unlock_page(struct page *page);
@@ -531,6 +532,14 @@ static inline int lock_page_killable(struct page *page)
 	return 0;
 }
 
+static inline int lock_page_async(struct page *page,
+				  struct wait_page_queue *wait)
+{
+	if (!trylock_page(page))
+		return __lock_page_async(page, wait);
+	return 0;
+}
+
 /*
  * lock_page_or_retry - Lock the page, unless this would block and the
  * caller indicated that it can handle a retry.
diff --git a/mm/filemap.c b/mm/filemap.c
index e891b5bee8fd..c746541b1d49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1183,6 +1183,42 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 }
 EXPORT_SYMBOL(wait_on_page_bit_killable);
 
+static int __wait_on_page_locked_async(struct page *page,
+				       struct wait_page_queue *wait, bool set)
+{
+	struct wait_queue_head *q = page_waitqueue(page);
+	int ret = 0;
+
+	wait->page = page;
+	wait->bit_nr = PG_locked;
+
+	spin_lock_irq(&q->lock);
+	if (set)
+		ret = !trylock_page(page);
+	else
+		ret = PageLocked(page);
+	if (ret) {
+		__add_wait_queue_entry_tail(q, &wait->wait);
+		SetPageWaiters(page);
+		if (set)
+			ret = !trylock_page(page);
+		else
+			ret = PageLocked(page);
+		/*
+		 * If we were succesful now, we know we're still on the
+		 * waitqueue as we're still under the lock. This means it's
+		 * safe to remove and return success, we know the callback
+		 * isn't going to trigger.
+		 */
+		if (!ret)
+			__remove_wait_queue(q, &wait->wait);
+		else
+			ret = -EIOCBQUEUED;
+	}
+	spin_unlock_irq(&q->lock);
+	return ret;
+}
+
 /**
  * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
  * @page: The page to wait for.
@@ -1345,6 +1381,11 @@ int __lock_page_killable(struct page *__page)
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int __lock_page_async(struct page *page, struct wait_page_queue *wait)
+{
+	return __wait_on_page_locked_async(page, wait, true);
+}
+
 /*
  * Return values:
  * 1 - page is locked; mmap_sem is still held.
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (3 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 04/12] mm: add support for async page locking Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-24 14:05   ` Trond Myklebust
  2020-05-23 18:57 ` [PATCH 06/12] fs: add FMODE_BUF_RASYNC Jens Axboe
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

Use the async page locking infrastructure, if IOCB_WAITQ is set in the
passed in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO is started but not done yet. This is similar to how
O_DIRECT signals the same operation. Once the callback is received by
the caller for IO completion, the caller must retry the operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 mm/filemap.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c746541b1d49..a3b86c9acdc8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1219,6 +1219,14 @@ static int __wait_on_page_locked_async(struct page *page,
 	return ret;
 }
 
+static int wait_on_page_locked_async(struct page *page,
+				     struct wait_page_queue *wait)
+{
+	if (!PageLocked(page))
+		return 0;
+	return __wait_on_page_locked_async(compound_head(page), wait, false);
+}
+
 /**
  * put_and_wait_on_page_locked - Drop a reference and wait for it to be unlocked
  * @page: The page to wait for.
@@ -2058,17 +2066,25 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					index, last_index - index);
 		}
 		if (!PageUptodate(page)) {
-			if (iocb->ki_flags & IOCB_NOWAIT) {
-				put_page(page);
-				goto would_block;
-			}
-
 			/*
 			 * See comment in do_read_cache_page on why
 			 * wait_on_page_locked is used to avoid unnecessarily
 			 * serialisations and why it's safe.
 			 */
-			error = wait_on_page_locked_killable(page);
+			if (iocb->ki_flags & IOCB_WAITQ) {
+				if (written) {
+					put_page(page);
+					goto out;
+				}
+				error = wait_on_page_locked_async(page,
+								iocb->private);
+			} else {
+				if (iocb->ki_flags & IOCB_NOWAIT) {
+					put_page(page);
+					goto would_block;
+				}
+				error = wait_on_page_locked_killable(page);
+			}
 			if (unlikely(error))
 				goto readpage_error;
 			if (PageUptodate(page))
@@ -2156,7 +2172,10 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 
 page_not_up_to_date:
 		/* Get exclusive access to the page ... */
-		error = lock_page_killable(page);
+		if (iocb->ki_flags & IOCB_WAITQ)
+			error = lock_page_async(page, iocb->private);
+		else
+			error = lock_page_killable(page);
 		if (unlikely(error))
 			goto readpage_error;
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 06/12] fs: add FMODE_BUF_RASYNC
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (4 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 07/12] ext4: flag as supporting buffered async reads Jens Axboe
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 82b989695ab9..0ef5f5973b1c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports async buffered reads */
+#define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 07/12] ext4: flag as supporting buffered async reads
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (5 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 06/12] fs: add FMODE_BUF_RASYNC Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 08/12] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/ext4/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d624250a62b..9f7d9bf427b4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -826,7 +826,7 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
 			return ret;
 	}
 
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return dquot_file_open(inode, filp);
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 08/12] block: flag block devices as supporting IOCB_WAITQ
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (6 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 07/12] ext4: flag as supporting buffered async reads Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 09/12] xfs: flag files as supporting buffered async reads Jens Axboe
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 86e2a7134513..ec8dccc81b65 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1851,7 +1851,7 @@ static int blkdev_open(struct inode * inode, struct file * filp)
 	 */
 	filp->f_flags |= O_LARGEFILE;
 
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 
 	if (filp->f_flags & O_NDELAY)
 		filp->f_mode |= FMODE_NDELAY;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 09/12] xfs: flag files as supporting buffered async reads
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (7 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 08/12] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 10/12] btrfs: " Jens Axboe
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

XFS uses generic_file_read_iter(), which already supports this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/xfs/xfs_file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4b8bdecc3863..97f44fbf17f2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1080,7 +1080,7 @@ xfs_file_open(
 		return -EFBIG;
 	if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
 		return -EIO;
-	file->f_mode |= FMODE_NOWAIT;
+	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return 0;
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 10/12] btrfs: flag files as supporting buffered async reads
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (8 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 09/12] xfs: flag files as supporting buffered async reads Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 11/12] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

btrfs uses generic_file_read_iter(), which already supports this.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 719e68ab552c..c933b6a1b4a8 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3480,7 +3480,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence)
 
 static int btrfs_file_open(struct inode *inode, struct file *filp)
 {
-	filp->f_mode |= FMODE_NOWAIT;
+	filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
 	return generic_file_open(inode, filp);
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 11/12] mm: add kiocb_wait_page_queue_init() helper
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (9 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 10/12] btrfs: " Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-23 18:57 ` [PATCH 12/12] io_uring: support true async buffered reads, if file provides it Jens Axboe
  2020-05-23 19:20 ` [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
  12 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

Checks if the file supports it, and initializes the values that we need.
Caller passes in 'data' pointer, if any, and the callback function to
be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/pagemap.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d3e63c9c61ae..def58de92053 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -493,6 +493,24 @@ static inline int wake_page_match(struct wait_page_queue *wait_page,
 	return 1;
 }
 
+static inline int kiocb_wait_page_queue_init(struct kiocb *kiocb,
+					     struct wait_page_queue *wait,
+					     wait_queue_func_t func,
+					     void *data)
+{
+	if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
+		wait->wait.func = func;
+		wait->wait.private = data;
+		wait->wait.flags = 0;
+		INIT_LIST_HEAD(&wait->wait.entry);
+		kiocb->ki_flags |= IOCB_WAITQ;
+		kiocb->private = wait;
+		return 0;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_async(struct page *page, struct wait_page_queue *wait);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (10 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 11/12] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
@ 2020-05-23 18:57 ` Jens Axboe
  2020-05-25  7:29   ` Pavel Begunkov
  2020-05-26  7:38   ` Pavel Begunkov
  2020-05-23 19:20 ` [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
  12 siblings, 2 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 18:57 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.

The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e95481c552ff..dd532d2634c2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -498,6 +498,8 @@ struct io_async_rw {
 	struct iovec			*iov;
 	ssize_t				nr_segs;
 	ssize_t				size;
+	struct wait_page_queue		wpq;
+	struct callback_head		task_work;
 };
 
 struct io_async_ctx {
@@ -2568,6 +2570,99 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+static void io_async_buf_cancel(struct callback_head *cb)
+{
+	struct io_async_rw *rw;
+	struct io_ring_ctx *ctx;
+	struct io_kiocb *req;
+
+	rw = container_of(cb, struct io_async_rw, task_work);
+	req = rw->wpq.wait.private;
+	ctx = req->ctx;
+
+	spin_lock_irq(&ctx->completion_lock);
+	io_cqring_fill_event(req, -ECANCELED);
+	io_commit_cqring(ctx);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_ev_posted(ctx);
+	req_set_fail_links(req);
+	io_double_put_req(req);
+}
+
+static void io_async_buf_retry(struct callback_head *cb)
+{
+	struct io_async_rw *rw;
+	struct io_ring_ctx *ctx;
+	struct io_kiocb *req;
+
+	rw = container_of(cb, struct io_async_rw, task_work);
+	req = rw->wpq.wait.private;
+	ctx = req->ctx;
+
+	__set_current_state(TASK_RUNNING);
+	mutex_lock(&ctx->uring_lock);
+	__io_queue_sqe(req, NULL);
+	mutex_unlock(&ctx->uring_lock);
+}
+
+static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode,
+			     int sync, void *arg)
+{
+	struct wait_page_queue *wpq;
+	struct io_kiocb *req = wait->private;
+	struct io_async_rw *rw = &req->io->rw;
+	struct wait_page_key *key = arg;
+	struct task_struct *tsk;
+	int ret;
+
+	wpq = container_of(wait, struct wait_page_queue, wait);
+
+	ret = wake_page_match(wpq, key);
+	if (ret != 1)
+		return ret;
+
+	list_del_init(&wait->entry);
+
+	init_task_work(&rw->task_work, io_async_buf_retry);
+	/* submit ref gets dropped, acquire a new one */
+	refcount_inc(&req->refs);
+	tsk = req->task;
+	ret = task_work_add(tsk, &rw->task_work, true);
+	if (unlikely(ret)) {
+		/* queue just for cancelation */
+		init_task_work(&rw->task_work, io_async_buf_cancel);
+		tsk = io_wq_get_task(req->ctx->io_wq);
+		task_work_add(tsk, &rw->task_work, true);
+	}
+	wake_up_process(tsk);
+	return 1;
+}
+
+static bool io_rw_should_retry(struct io_kiocb *req)
+{
+	struct kiocb *kiocb = &req->rw.kiocb;
+	int ret;
+
+	/* already tried, or we're doing O_DIRECT */
+	if (kiocb->ki_flags & (IOCB_DIRECT | IOCB_WAITQ))
+		return false;
+	/*
+	 * just use poll if we can, and don't attempt if the fs doesn't
+	 * support callback based unlocks
+	 */
+	if (file_can_poll(req->file) || !(req->file->f_mode & FMODE_BUF_RASYNC))
+		return false;
+
+	ret = kiocb_wait_page_queue_init(kiocb, &req->io->rw.wpq,
+						io_async_buf_func, req);
+	if (ret)
+		return false;
+	get_task_struct(current);
+	req->task = current;
+	return true;
+}
+
 static int io_read(struct io_kiocb *req, bool force_nonblock)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
@@ -2601,6 +2696,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	if (!ret) {
 		ssize_t ret2;
 
+retry:
 		if (req->file->f_op->read_iter)
 			ret2 = call_read_iter(req->file, kiocb, &iter);
 		else
@@ -2619,6 +2715,9 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 			if (!(req->flags & REQ_F_NOWAIT) &&
 			    !file_can_poll(req->file))
 				req->flags |= REQ_F_MUST_PUNT;
+			if (io_rw_should_retry(req))
+				goto retry;
+			kiocb->ki_flags &= ~IOCB_WAITQ;
 			return -EAGAIN;
 		}
 	}
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCHSET v2 0/12] Add support for async buffered reads
  2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
                   ` (11 preceding siblings ...)
  2020-05-23 18:57 ` [PATCH 12/12] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-05-23 19:20 ` Jens Axboe
  2020-05-24  9:46   ` Chris Panayis
  12 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2020-05-23 19:20 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

And this one is v3, obviously, not v2...


On 5/23/20 12:57 PM, Jens Axboe wrote:
> We technically support this already through io_uring, but it's
> implemented with a thread backend to support cases where we would
> block. This isn't ideal.
> 
> After a few prep patches, the core of this patchset is adding support
> for async callbacks on page unlock. With this primitive, we can simply
> retry the IO operation. With io_uring, this works a lot like poll based
> retry for files that support it. If a page is currently locked and
> needed, -EIOCBQUEUED is returned with a callback armed. The callers
> callback is responsible for restarting the operation.
> 
> With this callback primitive, we can add support for
> generic_file_buffered_read(), which is what most file systems end up
> using for buffered reads. XFS/ext4/btrfs/bdev is wired up, but probably
> trivial to add more.
> 
> The file flags support for this by setting FMODE_BUF_RASYNC, similar
> to what we do for FMODE_NOWAIT. Open to suggestions here if this is
> the preferred method or not.
> 
> In terms of results, I wrote a small test app that randomly reads 4G
> of data in 4K chunks from a file hosted by ext4. The app uses a queue
> depth of 32. If you want to test yourself, you can just use buffered=1
> with ioengine=io_uring with fio. No application changes are needed to
> use the more optimized buffered async read.
> 
> preadv for comparison:
> 	real    1m13.821s
> 	user    0m0.558s
> 	sys     0m11.125s
> 	CPU	~13%
> 
> Mainline:
> 	real    0m12.054s
> 	user    0m0.111s
> 	sys     0m5.659s
> 	CPU	~32% + ~50% == ~82%
> 
> This patchset:
> 	real    0m9.283s
> 	user    0m0.147s
> 	sys     0m4.619s
> 	CPU	~52%
> 
> The CPU numbers are just a rough estimate. For the mainline io_uring
> run, this includes the app itself and all the threads doing IO on its
> behalf (32% for the app, ~1.6% per worker and 32 of them). Context
> switch rate is much smaller with the patchset, since we only have the
> one task performing IO.
> 
> The goal here is efficiency. Async thread offload adds latency, and
> it also adds noticable overhead on items such as adding pages to the
> page cache. By allowing proper async buffered read support, we don't
> have X threads hammering on the same inode page cache, we have just
> the single app actually doing IO.
> 
> Been beating on this and it's solid for me, and I'm now pretty happy
> with how it all turned out. Not aware of any missing bits/pieces or
> code cleanups that need doing.
> 
> Series can also be found here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.3
> 
> or pull from:
> 
> git://git.kernel.dk/linux-block async-buffered.3
> 
>  fs/block_dev.c            |   2 +-
>  fs/btrfs/file.c           |   2 +-
>  fs/ext4/file.c            |   2 +-
>  fs/io_uring.c             |  99 ++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_file.c         |   2 +-
>  include/linux/blk_types.h |   3 +-
>  include/linux/fs.h        |   5 ++
>  include/linux/pagemap.h   |  64 ++++++++++++++++++++++
>  mm/filemap.c              | 111 ++++++++++++++++++++++++--------------
>  9 files changed, 245 insertions(+), 45 deletions(-)
> 
> Changes since v2:
> - Get rid of unnecessary wait_page_async struct, just use wait_page_async
> - Add another prep handler, adding wake_page_match()
> - Use wake_page_match() in both callers
> Changes since v1:
> - Fix an issue with inline page locking
> - Fix a potential race with __wait_on_page_locked_async()
> - Fix a hang related to not setting page_match, thus missing a wakeup
> 


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCHSET v2 0/12] Add support for async buffered reads
  2020-05-23 19:20 ` [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
@ 2020-05-24  9:46   ` Chris Panayis
  2020-05-24 19:24     ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Chris Panayis @ 2020-05-24  9:46 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

Yes! Jens & Team! Yes!

My code has never looked so beautiful, been so efficient and run so well 
since switching to io_uring/async awesome-ness.. Really, really is a 
game-changer in terms of software design, control, performance, 
expressiveness... so many levels. Really, really great work! Thank you!

Chris


On 23/05/2020 20:20, Jens Axboe wrote:
> And this one is v3, obviously, not v2...
>
>
> On 5/23/20 12:57 PM, Jens Axboe wrote:
>> We technically support this already through io_uring, but it's
>> implemented with a thread backend to support cases where we would
>> block. This isn't ideal.
>>
>> After a few prep patches, the core of this patchset is adding support
>> for async callbacks on page unlock. With this primitive, we can simply
>> retry the IO operation. With io_uring, this works a lot like poll based
>> retry for files that support it. If a page is currently locked and
>> needed, -EIOCBQUEUED is returned with a callback armed. The callers
>> callback is responsible for restarting the operation.
>>
>> With this callback primitive, we can add support for
>> generic_file_buffered_read(), which is what most file systems end up
>> using for buffered reads. XFS/ext4/btrfs/bdev is wired up, but probably
>> trivial to add more.
>>
>> The file flags support for this by setting FMODE_BUF_RASYNC, similar
>> to what we do for FMODE_NOWAIT. Open to suggestions here if this is
>> the preferred method or not.
>>
>> In terms of results, I wrote a small test app that randomly reads 4G
>> of data in 4K chunks from a file hosted by ext4. The app uses a queue
>> depth of 32. If you want to test yourself, you can just use buffered=1
>> with ioengine=io_uring with fio. No application changes are needed to
>> use the more optimized buffered async read.
>>
>> preadv for comparison:
>> 	real    1m13.821s
>> 	user    0m0.558s
>> 	sys     0m11.125s
>> 	CPU	~13%
>>
>> Mainline:
>> 	real    0m12.054s
>> 	user    0m0.111s
>> 	sys     0m5.659s
>> 	CPU	~32% + ~50% == ~82%
>>
>> This patchset:
>> 	real    0m9.283s
>> 	user    0m0.147s
>> 	sys     0m4.619s
>> 	CPU	~52%
>>
>> The CPU numbers are just a rough estimate. For the mainline io_uring
>> run, this includes the app itself and all the threads doing IO on its
>> behalf (32% for the app, ~1.6% per worker and 32 of them). Context
>> switch rate is much smaller with the patchset, since we only have the
>> one task performing IO.
>>
>> The goal here is efficiency. Async thread offload adds latency, and
>> it also adds noticable overhead on items such as adding pages to the
>> page cache. By allowing proper async buffered read support, we don't
>> have X threads hammering on the same inode page cache, we have just
>> the single app actually doing IO.
>>
>> Been beating on this and it's solid for me, and I'm now pretty happy
>> with how it all turned out. Not aware of any missing bits/pieces or
>> code cleanups that need doing.
>>
>> Series can also be found here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.3
>>
>> or pull from:
>>
>> git://git.kernel.dk/linux-block async-buffered.3
>>
>>   fs/block_dev.c            |   2 +-
>>   fs/btrfs/file.c           |   2 +-
>>   fs/ext4/file.c            |   2 +-
>>   fs/io_uring.c             |  99 ++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_file.c         |   2 +-
>>   include/linux/blk_types.h |   3 +-
>>   include/linux/fs.h        |   5 ++
>>   include/linux/pagemap.h   |  64 ++++++++++++++++++++++
>>   mm/filemap.c              | 111 ++++++++++++++++++++++++--------------
>>   9 files changed, 245 insertions(+), 45 deletions(-)
>>
>> Changes since v2:
>> - Get rid of unnecessary wait_page_async struct, just use wait_page_async
>> - Add another prep handler, adding wake_page_match()
>> - Use wake_page_match() in both callers
>> Changes since v1:
>> - Fix an issue with inline page locking
>> - Fix a potential race with __wait_on_page_locked_async()
>> - Fix a hang related to not setting page_match, thus missing a wakeup
>>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-23 18:57 ` [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
@ 2020-05-24 14:05   ` Trond Myklebust
  2020-05-24 16:30     ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2020-05-24 14:05 UTC (permalink / raw)
  To: axboe, io-uring; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
> Use the async page locking infrastructure, if IOCB_WAITQ is set in
> the
> passed in iocb. The caller must expect an -EIOCBQUEUED return value,
> which means that IO is started but not done yet. This is similar to
> how
> O_DIRECT signals the same operation. Once the callback is received by
> the caller for IO completion, the caller must retry the operation.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  mm/filemap.c | 33 ++++++++++++++++++++++++++-------
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c746541b1d49..a3b86c9acdc8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1219,6 +1219,14 @@ static int __wait_on_page_locked_async(struct
> page *page,
>  	return ret;
>  }
>  
> +static int wait_on_page_locked_async(struct page *page,
> +				     struct wait_page_queue *wait)
> +{
> +	if (!PageLocked(page))
> +		return 0;
> +	return __wait_on_page_locked_async(compound_head(page), wait,
> false);
> +}
> +
>  /**
>   * put_and_wait_on_page_locked - Drop a reference and wait for it to
> be unlocked
>   * @page: The page to wait for.
> @@ -2058,17 +2066,25 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>  					index, last_index - index);
>  		}
>  		if (!PageUptodate(page)) {
> -			if (iocb->ki_flags & IOCB_NOWAIT) {
> -				put_page(page);
> -				goto would_block;
> -			}
> -
>  			/*
>  			 * See comment in do_read_cache_page on why
>  			 * wait_on_page_locked is used to avoid
> unnecessarily
>  			 * serialisations and why it's safe.
>  			 */
> -			error = wait_on_page_locked_killable(page);
> +			if (iocb->ki_flags & IOCB_WAITQ) {
> +				if (written) {
> +					put_page(page);
> +					goto out;
> +				}
> +				error = wait_on_page_locked_async(page,
> +								iocb-
> >private);

If it is being used in 'generic_file_buffered_read()' as storage for a
wait queue, then it is hard to consider this a 'private' field.

Perhaps either rename and add type checking, or else add a separate
field altogether to struct kiocb?

> +			} else {
> +				if (iocb->ki_flags & IOCB_NOWAIT) {
> +					put_page(page);
> +					goto would_block;
> +				}
> +				error =
> wait_on_page_locked_killable(page);
> +			}
>  			if (unlikely(error))
>  				goto readpage_error;
>  			if (PageUptodate(page))
> @@ -2156,7 +2172,10 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>  
>  page_not_up_to_date:
>  		/* Get exclusive access to the page ... */
> -		error = lock_page_killable(page);
> +		if (iocb->ki_flags & IOCB_WAITQ)
> +			error = lock_page_async(page, iocb->private);
> +		else
> +			error = lock_page_killable(page);
>  		if (unlikely(error))
>  			goto readpage_error;
>  
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-24 14:05   ` Trond Myklebust
@ 2020-05-24 16:30     ` Jens Axboe
  2020-05-24 16:40       ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2020-05-24 16:30 UTC (permalink / raw)
  To: Trond Myklebust, io-uring; +Cc: linux-mm, linux-fsdevel, linux-kernel

On 5/24/20 8:05 AM, Trond Myklebust wrote:
> On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
>> Use the async page locking infrastructure, if IOCB_WAITQ is set in
>> the
>> passed in iocb. The caller must expect an -EIOCBQUEUED return value,
>> which means that IO is started but not done yet. This is similar to
>> how
>> O_DIRECT signals the same operation. Once the callback is received by
>> the caller for IO completion, the caller must retry the operation.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  mm/filemap.c | 33 ++++++++++++++++++++++++++-------
>>  1 file changed, 26 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index c746541b1d49..a3b86c9acdc8 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1219,6 +1219,14 @@ static int __wait_on_page_locked_async(struct
>> page *page,
>>  	return ret;
>>  }
>>  
>> +static int wait_on_page_locked_async(struct page *page,
>> +				     struct wait_page_queue *wait)
>> +{
>> +	if (!PageLocked(page))
>> +		return 0;
>> +	return __wait_on_page_locked_async(compound_head(page), wait,
>> false);
>> +}
>> +
>>  /**
>>   * put_and_wait_on_page_locked - Drop a reference and wait for it to
>> be unlocked
>>   * @page: The page to wait for.
>> @@ -2058,17 +2066,25 @@ static ssize_t
>> generic_file_buffered_read(struct kiocb *iocb,
>>  					index, last_index - index);
>>  		}
>>  		if (!PageUptodate(page)) {
>> -			if (iocb->ki_flags & IOCB_NOWAIT) {
>> -				put_page(page);
>> -				goto would_block;
>> -			}
>> -
>>  			/*
>>  			 * See comment in do_read_cache_page on why
>>  			 * wait_on_page_locked is used to avoid
>> unnecessarily
>>  			 * serialisations and why it's safe.
>>  			 */
>> -			error = wait_on_page_locked_killable(page);
>> +			if (iocb->ki_flags & IOCB_WAITQ) {
>> +				if (written) {
>> +					put_page(page);
>> +					goto out;
>> +				}
>> +				error = wait_on_page_locked_async(page,
>> +								iocb-
>>> private);
> 
> If it is being used in 'generic_file_buffered_read()' as storage for a
> wait queue, then it is hard to consider this a 'private' field.

private isn't the prettiest, and in fact this one in particular is a bit
of a mess. It's not clear if it's caller or callee owned. It's generally
not used, outside of the old usb gadget code, iomap O_DIRECT, and ocfs2.
With FMODE_BUF_RASYNC, the fs obviously can't set it if it uses
->private for buffered IO.

> Perhaps either rename and add type checking, or else add a separate
> field altogether to struct kiocb?

I'd hate to add a new field and increase the size of the kiocb... One
alternative is to do:

	union {
		void *private;
		struct wait_page_queue *ki_waitq;
	};

and still use IOCB_WAITQ to say that ->ki_waitq is valid.

There's also 4 bytes of padding in the kiocb struct. And some fields are
only used for O_DIRECT as well, eg ->ki_cookie which is just used for
polled O_DIRECT. So we could also do:

	union {
		unsigned int ki_cookie;
		struct wait_page_queue *ki_waitq;
	};

and still not grow the kiocb. How about we go with this approach, and
also add:

	if (kiocb->ki_flags & IOCB_HIPRI)
		return -EOPNOTSUPP;

to kiocb_wait_page_queue_init() to make sure that this combination isn't
valid?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-24 16:30     ` Jens Axboe
@ 2020-05-24 16:40       ` Jens Axboe
  2020-05-24 17:11         ` Trond Myklebust
  0 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2020-05-24 16:40 UTC (permalink / raw)
  To: Trond Myklebust, io-uring; +Cc: linux-mm, linux-fsdevel, linux-kernel

On 5/24/20 10:30 AM, Jens Axboe wrote:
> On 5/24/20 8:05 AM, Trond Myklebust wrote:
>> On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
>>> Use the async page locking infrastructure, if IOCB_WAITQ is set in
>>> the
>>> passed in iocb. The caller must expect an -EIOCBQUEUED return value,
>>> which means that IO is started but not done yet. This is similar to
>>> how
>>> O_DIRECT signals the same operation. Once the callback is received by
>>> the caller for IO completion, the caller must retry the operation.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  mm/filemap.c | 33 ++++++++++++++++++++++++++-------
>>>  1 file changed, 26 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index c746541b1d49..a3b86c9acdc8 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1219,6 +1219,14 @@ static int __wait_on_page_locked_async(struct
>>> page *page,
>>>  	return ret;
>>>  }
>>>  
>>> +static int wait_on_page_locked_async(struct page *page,
>>> +				     struct wait_page_queue *wait)
>>> +{
>>> +	if (!PageLocked(page))
>>> +		return 0;
>>> +	return __wait_on_page_locked_async(compound_head(page), wait,
>>> false);
>>> +}
>>> +
>>>  /**
>>>   * put_and_wait_on_page_locked - Drop a reference and wait for it to
>>> be unlocked
>>>   * @page: The page to wait for.
>>> @@ -2058,17 +2066,25 @@ static ssize_t
>>> generic_file_buffered_read(struct kiocb *iocb,
>>>  					index, last_index - index);
>>>  		}
>>>  		if (!PageUptodate(page)) {
>>> -			if (iocb->ki_flags & IOCB_NOWAIT) {
>>> -				put_page(page);
>>> -				goto would_block;
>>> -			}
>>> -
>>>  			/*
>>>  			 * See comment in do_read_cache_page on why
>>>  			 * wait_on_page_locked is used to avoid
>>> unnecessarily
>>>  			 * serialisations and why it's safe.
>>>  			 */
>>> -			error = wait_on_page_locked_killable(page);
>>> +			if (iocb->ki_flags & IOCB_WAITQ) {
>>> +				if (written) {
>>> +					put_page(page);
>>> +					goto out;
>>> +				}
>>> +				error = wait_on_page_locked_async(page,
>>> +								iocb-
>>>> private);
>>
>> If it is being used in 'generic_file_buffered_read()' as storage for a
>> wait queue, then it is hard to consider this a 'private' field.
> 
> private isn't the prettiest, and in fact this one in particular is a bit
> of a mess. It's not clear if it's caller or callee owned. It's generally
> not used, outside of the old usb gadget code, iomap O_DIRECT, and ocfs2.
> With FMODE_BUF_RASYNC, the fs obviously can't set it if it uses
> ->private for buffered IO.
> 
>> Perhaps either rename and add type checking, or else add a separate
>> field altogether to struct kiocb?
> 
> I'd hate to add a new field and increase the size of the kiocb... One
> alternative is to do:
> 
> 	union {
> 		void *private;
> 		struct wait_page_queue *ki_waitq;
> 	};
> 
> and still use IOCB_WAITQ to say that ->ki_waitq is valid.
> 
> There's also 4 bytes of padding in the kiocb struct. And some fields are
> only used for O_DIRECT as well, eg ->ki_cookie which is just used for
> polled O_DIRECT. So we could also do:
> 
> 	union {
> 		unsigned int ki_cookie;
> 		struct wait_page_queue *ki_waitq;
> 	};
> 
> and still not grow the kiocb. How about we go with this approach, and
> also add:
> 
> 	if (kiocb->ki_flags & IOCB_HIPRI)
> 		return -EOPNOTSUPP;
> 
> to kiocb_wait_page_queue_init() to make sure that this combination isn't
> valid?

Here's the incremental, which is spread over 3 patches. I think this one
makes sense, as polled IO doesn't support buffered IO. And because doing
an async callback for completion is not how polled IO operates anyway,
so even if buffered IO supported it, we'd not use the callback for
polled IO anyway. kiocb_wait_page_queue_init() checks and backs this
up.


diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0ef5f5973b1c..f7b1eb765c6e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -317,7 +317,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
-/* iocb->private holds wait_page_async struct */
+/* iocb->ki_waitq is valid */
 #define IOCB_WAITQ		(1 << 8)
 
 struct kiocb {
@@ -332,7 +332,10 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
-	unsigned int		ki_cookie; /* for ->iopoll */
+	union {
+		unsigned int		ki_cookie; /* for ->iopoll */
+		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
+	};
 
 	randomized_struct_fields_end
 };
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index def58de92053..8b65420410ee 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -498,13 +498,16 @@ static inline int kiocb_wait_page_queue_init(struct kiocb *kiocb,
 					     wait_queue_func_t func,
 					     void *data)
 {
+	/* Can't support async wakeup with polled IO */
+	if (kiocb->ki_flags & IOCB_HIPRI)
+		return -EINVAL;
 	if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
 		wait->wait.func = func;
 		wait->wait.private = data;
 		wait->wait.flags = 0;
 		INIT_LIST_HEAD(&wait->wait.entry);
 		kiocb->ki_flags |= IOCB_WAITQ;
-		kiocb->private = wait;
+		kiocb->ki_waitq = wait;
 		return 0;
 	}
 
diff --git a/mm/filemap.c b/mm/filemap.c
index a3b86c9acdc8..18022de7dc33 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2077,7 +2077,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 					goto out;
 				}
 				error = wait_on_page_locked_async(page,
-								iocb->private);
+								iocb->ki_waitq);
 			} else {
 				if (iocb->ki_flags & IOCB_NOWAIT) {
 					put_page(page);
@@ -2173,7 +2173,7 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb,
 page_not_up_to_date:
 		/* Get exclusive access to the page ... */
 		if (iocb->ki_flags & IOCB_WAITQ)
-			error = lock_page_async(page, iocb->private);
+			error = lock_page_async(page, iocb->ki_waitq);
 		else
 			error = lock_page_killable(page);
 		if (unlikely(error))

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-24 16:40       ` Jens Axboe
@ 2020-05-24 17:11         ` Trond Myklebust
  2020-05-24 17:12           ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Trond Myklebust @ 2020-05-24 17:11 UTC (permalink / raw)
  To: axboe, io-uring; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Sun, 2020-05-24 at 10:40 -0600, Jens Axboe wrote:
> On 5/24/20 10:30 AM, Jens Axboe wrote:
> > On 5/24/20 8:05 AM, Trond Myklebust wrote:
> > > On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
> > > > Use the async page locking infrastructure, if IOCB_WAITQ is set
> > > > in
> > > > the
> > > > passed in iocb. The caller must expect an -EIOCBQUEUED return
> > > > value,
> > > > which means that IO is started but not done yet. This is
> > > > similar to
> > > > how
> > > > O_DIRECT signals the same operation. Once the callback is
> > > > received by
> > > > the caller for IO completion, the caller must retry the
> > > > operation.
> > > > 
> > > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > > ---
> > > >  mm/filemap.c | 33 ++++++++++++++++++++++++++-------
> > > >  1 file changed, 26 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index c746541b1d49..a3b86c9acdc8 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1219,6 +1219,14 @@ static int
> > > > __wait_on_page_locked_async(struct
> > > > page *page,
> > > >  	return ret;
> > > >  }
> > > >  
> > > > +static int wait_on_page_locked_async(struct page *page,
> > > > +				     struct wait_page_queue
> > > > *wait)
> > > > +{
> > > > +	if (!PageLocked(page))
> > > > +		return 0;
> > > > +	return __wait_on_page_locked_async(compound_head(page),
> > > > wait,
> > > > false);
> > > > +}
> > > > +
> > > >  /**
> > > >   * put_and_wait_on_page_locked - Drop a reference and wait for
> > > > it to
> > > > be unlocked
> > > >   * @page: The page to wait for.
> > > > @@ -2058,17 +2066,25 @@ static ssize_t
> > > > generic_file_buffered_read(struct kiocb *iocb,
> > > >  					index, last_index -
> > > > index);
> > > >  		}
> > > >  		if (!PageUptodate(page)) {
> > > > -			if (iocb->ki_flags & IOCB_NOWAIT) {
> > > > -				put_page(page);
> > > > -				goto would_block;
> > > > -			}
> > > > -
> > > >  			/*
> > > >  			 * See comment in do_read_cache_page on
> > > > why
> > > >  			 * wait_on_page_locked is used to avoid
> > > > unnecessarily
> > > >  			 * serialisations and why it's safe.
> > > >  			 */
> > > > -			error =
> > > > wait_on_page_locked_killable(page);
> > > > +			if (iocb->ki_flags & IOCB_WAITQ) {
> > > > +				if (written) {
> > > > +					put_page(page);
> > > > +					goto out;
> > > > +				}
> > > > +				error =
> > > > wait_on_page_locked_async(page,
> > > > +								
> > > > iocb-
> > > > > private);
> > > 
> > > If it is being used in 'generic_file_buffered_read()' as storage
> > > for a
> > > wait queue, then it is hard to consider this a 'private' field.
> > 
> > private isn't the prettiest, and in fact this one in particular is
> > a bit
> > of a mess. It's not clear if it's caller or callee owned. It's
> > generally
> > not used, outside of the old usb gadget code, iomap O_DIRECT, and
> > ocfs2.
> > With FMODE_BUF_RASYNC, the fs obviously can't set it if it uses
> > ->private for buffered IO.
> > 
> > > Perhaps either rename and add type checking, or else add a
> > > separate
> > > field altogether to struct kiocb?
> > 
> > I'd hate to add a new field and increase the size of the kiocb...
> > One
> > alternative is to do:
> > 
> > 	union {
> > 		void *private;
> > 		struct wait_page_queue *ki_waitq;
> > 	};
> > 
> > and still use IOCB_WAITQ to say that ->ki_waitq is valid.
> > 
> > There's also 4 bytes of padding in the kiocb struct. And some
> > fields are
> > only used for O_DIRECT as well, eg ->ki_cookie which is just used
> > for
> > polled O_DIRECT. So we could also do:
> > 
> > 	union {
> > 		unsigned int ki_cookie;
> > 		struct wait_page_queue *ki_waitq;
> > 	};
> > 
> > and still not grow the kiocb. How about we go with this approach,
> > and
> > also add:
> > 
> > 	if (kiocb->ki_flags & IOCB_HIPRI)
> > 		return -EOPNOTSUPP;
> > 
> > to kiocb_wait_page_queue_init() to make sure that this combination
> > isn't
> > valid?
> 
> Here's the incremental, which is spread over 3 patches. I think this
> one
> makes sense, as polled IO doesn't support buffered IO. And because
> doing
> an async callback for completion is not how polled IO operates
> anyway,
> so even if buffered IO supported it, we'd not use the callback for
> polled IO anyway. kiocb_wait_page_queue_init() checks and backs this
> up.
> 
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0ef5f5973b1c..f7b1eb765c6e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -317,7 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC		(1 << 5)
>  #define IOCB_WRITE		(1 << 6)
>  #define IOCB_NOWAIT		(1 << 7)
> -/* iocb->private holds wait_page_async struct */
> +/* iocb->ki_waitq is valid */
>  #define IOCB_WAITQ		(1 << 8)
>  
>  struct kiocb {
> @@ -332,7 +332,10 @@ struct kiocb {
>  	int			ki_flags;
>  	u16			ki_hint;
>  	u16			ki_ioprio; /* See linux/ioprio.h */
> -	unsigned int		ki_cookie; /* for ->iopoll */
> +	union {
> +		unsigned int		ki_cookie; /* for ->iopoll */
> +		struct wait_page_queue	*ki_waitq; /* for async
> buffered IO */
> +	};
>  
>  	randomized_struct_fields_end
>  };
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index def58de92053..8b65420410ee 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -498,13 +498,16 @@ static inline int
> kiocb_wait_page_queue_init(struct kiocb *kiocb,
>  					     wait_queue_func_t func,
>  					     void *data)
>  {
> +	/* Can't support async wakeup with polled IO */
> +	if (kiocb->ki_flags & IOCB_HIPRI)
> +		return -EINVAL;
>  	if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
>  		wait->wait.func = func;
>  		wait->wait.private = data;
>  		wait->wait.flags = 0;
>  		INIT_LIST_HEAD(&wait->wait.entry);
>  		kiocb->ki_flags |= IOCB_WAITQ;
> -		kiocb->private = wait;
> +		kiocb->ki_waitq = wait;
>  		return 0;
>  	}
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a3b86c9acdc8..18022de7dc33 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2077,7 +2077,7 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>  					goto out;
>  				}
>  				error = wait_on_page_locked_async(page,
> -								iocb-
> >private);
> +								iocb-
> >ki_waitq);
>  			} else {
>  				if (iocb->ki_flags & IOCB_NOWAIT) {
>  					put_page(page);
> @@ -2173,7 +2173,7 @@ static ssize_t
> generic_file_buffered_read(struct kiocb *iocb,
>  page_not_up_to_date:
>  		/* Get exclusive access to the page ... */
>  		if (iocb->ki_flags & IOCB_WAITQ)
> -			error = lock_page_async(page, iocb->private);
> +			error = lock_page_async(page, iocb->ki_waitq);
>  		else
>  			error = lock_page_killable(page);
>  		if (unlikely(error))

Ack. That seems cleaner to me.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read()
  2020-05-24 17:11         ` Trond Myklebust
@ 2020-05-24 17:12           ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-24 17:12 UTC (permalink / raw)
  To: Trond Myklebust, io-uring; +Cc: linux-mm, linux-fsdevel, linux-kernel

On 5/24/20 11:11 AM, Trond Myklebust wrote:
> On Sun, 2020-05-24 at 10:40 -0600, Jens Axboe wrote:
>> On 5/24/20 10:30 AM, Jens Axboe wrote:
>>> On 5/24/20 8:05 AM, Trond Myklebust wrote:
>>>> On Sat, 2020-05-23 at 12:57 -0600, Jens Axboe wrote:
>>>>> Use the async page locking infrastructure, if IOCB_WAITQ is set
>>>>> in
>>>>> the
>>>>> passed in iocb. The caller must expect an -EIOCBQUEUED return
>>>>> value,
>>>>> which means that IO is started but not done yet. This is
>>>>> similar to
>>>>> how
>>>>> O_DIRECT signals the same operation. Once the callback is
>>>>> received by
>>>>> the caller for IO completion, the caller must retry the
>>>>> operation.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>  mm/filemap.c | 33 ++++++++++++++++++++++++++-------
>>>>>  1 file changed, 26 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index c746541b1d49..a3b86c9acdc8 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -1219,6 +1219,14 @@ static int
>>>>> __wait_on_page_locked_async(struct
>>>>> page *page,
>>>>>  	return ret;
>>>>>  }
>>>>>  
>>>>> +static int wait_on_page_locked_async(struct page *page,
>>>>> +				     struct wait_page_queue
>>>>> *wait)
>>>>> +{
>>>>> +	if (!PageLocked(page))
>>>>> +		return 0;
>>>>> +	return __wait_on_page_locked_async(compound_head(page),
>>>>> wait,
>>>>> false);
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * put_and_wait_on_page_locked - Drop a reference and wait for
>>>>> it to
>>>>> be unlocked
>>>>>   * @page: The page to wait for.
>>>>> @@ -2058,17 +2066,25 @@ static ssize_t
>>>>> generic_file_buffered_read(struct kiocb *iocb,
>>>>>  					index, last_index -
>>>>> index);
>>>>>  		}
>>>>>  		if (!PageUptodate(page)) {
>>>>> -			if (iocb->ki_flags & IOCB_NOWAIT) {
>>>>> -				put_page(page);
>>>>> -				goto would_block;
>>>>> -			}
>>>>> -
>>>>>  			/*
>>>>>  			 * See comment in do_read_cache_page on
>>>>> why
>>>>>  			 * wait_on_page_locked is used to avoid
>>>>> unnecessarily
>>>>>  			 * serialisations and why it's safe.
>>>>>  			 */
>>>>> -			error =
>>>>> wait_on_page_locked_killable(page);
>>>>> +			if (iocb->ki_flags & IOCB_WAITQ) {
>>>>> +				if (written) {
>>>>> +					put_page(page);
>>>>> +					goto out;
>>>>> +				}
>>>>> +				error =
>>>>> wait_on_page_locked_async(page,
>>>>> +								
>>>>> iocb-
>>>>>> private);
>>>>
>>>> If it is being used in 'generic_file_buffered_read()' as storage
>>>> for a
>>>> wait queue, then it is hard to consider this a 'private' field.
>>>
>>> private isn't the prettiest, and in fact this one in particular is
>>> a bit
>>> of a mess. It's not clear if it's caller or callee owned. It's
>>> generally
>>> not used, outside of the old usb gadget code, iomap O_DIRECT, and
>>> ocfs2.
>>> With FMODE_BUF_RASYNC, the fs obviously can't set it if it uses
>>> ->private for buffered IO.
>>>
>>>> Perhaps either rename and add type checking, or else add a
>>>> separate
>>>> field altogether to struct kiocb?
>>>
>>> I'd hate to add a new field and increase the size of the kiocb...
>>> One
>>> alternative is to do:
>>>
>>> 	union {
>>> 		void *private;
>>> 		struct wait_page_queue *ki_waitq;
>>> 	};
>>>
>>> and still use IOCB_WAITQ to say that ->ki_waitq is valid.
>>>
>>> There's also 4 bytes of padding in the kiocb struct. And some
>>> fields are
>>> only used for O_DIRECT as well, eg ->ki_cookie which is just used
>>> for
>>> polled O_DIRECT. So we could also do:
>>>
>>> 	union {
>>> 		unsigned int ki_cookie;
>>> 		struct wait_page_queue *ki_waitq;
>>> 	};
>>>
>>> and still not grow the kiocb. How about we go with this approach,
>>> and
>>> also add:
>>>
>>> 	if (kiocb->ki_flags & IOCB_HIPRI)
>>> 		return -EOPNOTSUPP;
>>>
>>> to kiocb_wait_page_queue_init() to make sure that this combination
>>> isn't
>>> valid?
>>
>> Here's the incremental, which is spread over 3 patches. I think this
>> one
>> makes sense, as polled IO doesn't support buffered IO. And because
>> doing
>> an async callback for completion is not how polled IO operates
>> anyway,
>> so even if buffered IO supported it, we'd not use the callback for
>> polled IO anyway. kiocb_wait_page_queue_init() checks and backs this
>> up.
>>
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 0ef5f5973b1c..f7b1eb765c6e 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -317,7 +317,7 @@ enum rw_hint {
>>  #define IOCB_SYNC		(1 << 5)
>>  #define IOCB_WRITE		(1 << 6)
>>  #define IOCB_NOWAIT		(1 << 7)
>> -/* iocb->private holds wait_page_async struct */
>> +/* iocb->ki_waitq is valid */
>>  #define IOCB_WAITQ		(1 << 8)
>>  
>>  struct kiocb {
>> @@ -332,7 +332,10 @@ struct kiocb {
>>  	int			ki_flags;
>>  	u16			ki_hint;
>>  	u16			ki_ioprio; /* See linux/ioprio.h */
>> -	unsigned int		ki_cookie; /* for ->iopoll */
>> +	union {
>> +		unsigned int		ki_cookie; /* for ->iopoll */
>> +		struct wait_page_queue	*ki_waitq; /* for async
>> buffered IO */
>> +	};
>>  
>>  	randomized_struct_fields_end
>>  };
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index def58de92053..8b65420410ee 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -498,13 +498,16 @@ static inline int
>> kiocb_wait_page_queue_init(struct kiocb *kiocb,
>>  					     wait_queue_func_t func,
>>  					     void *data)
>>  {
>> +	/* Can't support async wakeup with polled IO */
>> +	if (kiocb->ki_flags & IOCB_HIPRI)
>> +		return -EINVAL;
>>  	if (kiocb->ki_filp->f_mode & FMODE_BUF_RASYNC) {
>>  		wait->wait.func = func;
>>  		wait->wait.private = data;
>>  		wait->wait.flags = 0;
>>  		INIT_LIST_HEAD(&wait->wait.entry);
>>  		kiocb->ki_flags |= IOCB_WAITQ;
>> -		kiocb->private = wait;
>> +		kiocb->ki_waitq = wait;
>>  		return 0;
>>  	}
>>  
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a3b86c9acdc8..18022de7dc33 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2077,7 +2077,7 @@ static ssize_t
>> generic_file_buffered_read(struct kiocb *iocb,
>>  					goto out;
>>  				}
>>  				error = wait_on_page_locked_async(page,
>> -								iocb-
>>> private);
>> +								iocb-
>>> ki_waitq);
>>  			} else {
>>  				if (iocb->ki_flags & IOCB_NOWAIT) {
>>  					put_page(page);
>> @@ -2173,7 +2173,7 @@ static ssize_t
>> generic_file_buffered_read(struct kiocb *iocb,
>>  page_not_up_to_date:
>>  		/* Get exclusive access to the page ... */
>>  		if (iocb->ki_flags & IOCB_WAITQ)
>> -			error = lock_page_async(page, iocb->private);
>> +			error = lock_page_async(page, iocb->ki_waitq);
>>  		else
>>  			error = lock_page_killable(page);
>>  		if (unlikely(error))
> 
> Ack. That seems cleaner to me.

I agree, this is better. Ran it through the testing and updated the
series accordingly:

https://git.kernel.dk/cgit/linux-block/log/?h=async-buffered.4

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCHSET v2 0/12] Add support for async buffered reads
  2020-05-24  9:46   ` Chris Panayis
@ 2020-05-24 19:24     ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-24 19:24 UTC (permalink / raw)
  To: Chris Panayis, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 5/24/20 3:46 AM, Chris Panayis wrote:
> Yes! Jens & Team! Yes!
> 
> My code has never looked so beautiful, been so efficient and run so well 
> since switching to io_uring/async awesome-ness.. Really, really is a 
> game-changer in terms of software design, control, performance, 
> expressiveness... so many levels. Really, really great work! Thank you!

Thanks! Glad you like it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-23 18:57 ` [PATCH 12/12] io_uring: support true async buffered reads, if file provides it Jens Axboe
@ 2020-05-25  7:29   ` Pavel Begunkov
  2020-05-25 19:59     ` Jens Axboe
  2020-05-26  7:38   ` Pavel Begunkov
  1 sibling, 1 reply; 29+ messages in thread
From: Pavel Begunkov @ 2020-05-25  7:29 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 23/05/2020 21:57, Jens Axboe wrote:
> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
> the buffered read to an io-wq worker. Instead we can rely on page
> unlocking callbacks to support retry based async IO. This is a lot more
> efficient than doing async thread offload.
> 
> The retry is done similarly to how we handle poll based retry. From
> the unlock callback, we simply queue the retry to a task_work based
> handler.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 99 insertions(+)
> 
...
> +
> +	init_task_work(&rw->task_work, io_async_buf_retry);
> +	/* submit ref gets dropped, acquire a new one */
> +	refcount_inc(&req->refs);
> +	tsk = req->task;
> +	ret = task_work_add(tsk, &rw->task_work, true);
> +	if (unlikely(ret)) {
> +		/* queue just for cancelation */
> +		init_task_work(&rw->task_work, io_async_buf_cancel);
> +		tsk = io_wq_get_task(req->ctx->io_wq);

IIRC, task will be put somewhere around io_free_req(). Then shouldn't here be
some juggling with reassigning req->task with task_{get,put}()?

> +		task_work_add(tsk, &rw->task_work, true);
> +	}
> +	wake_up_process(tsk);
> +	return 1;
> +}
...
>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>  {
>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> @@ -2601,6 +2696,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  	if (!ret) {
>  		ssize_t ret2;
>  
> +retry:
>  		if (req->file->f_op->read_iter)
>  			ret2 = call_read_iter(req->file, kiocb, &iter);
>  		else
> @@ -2619,6 +2715,9 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  			if (!(req->flags & REQ_F_NOWAIT) &&
>  			    !file_can_poll(req->file))
>  				req->flags |= REQ_F_MUST_PUNT;
> +			if (io_rw_should_retry(req))

It looks like a state machine with IOCB_WAITQ and gotos. Wouldn't it be cleaner
to call call_read_iter()/loop_rw_iter() here directly instead of "goto retry" ?

BTW, can this async stuff return -EAGAIN ?

> +				goto retry;
> +			kiocb->ki_flags &= ~IOCB_WAITQ;
>  			return -EAGAIN;
>  		}
>  	}
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-25  7:29   ` Pavel Begunkov
@ 2020-05-25 19:59     ` Jens Axboe
  2020-05-26  7:44       ` Pavel Begunkov
  0 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2020-05-25 19:59 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 5/25/20 1:29 AM, Pavel Begunkov wrote:
> On 23/05/2020 21:57, Jens Axboe wrote:
>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>> the buffered read to an io-wq worker. Instead we can rely on page
>> unlocking callbacks to support retry based async IO. This is a lot more
>> efficient than doing async thread offload.
>>
>> The retry is done similarly to how we handle poll based retry. From
>> the unlock callback, we simply queue the retry to a task_work based
>> handler.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 99 insertions(+)
>>
> ...
>> +
>> +	init_task_work(&rw->task_work, io_async_buf_retry);
>> +	/* submit ref gets dropped, acquire a new one */
>> +	refcount_inc(&req->refs);
>> +	tsk = req->task;
>> +	ret = task_work_add(tsk, &rw->task_work, true);
>> +	if (unlikely(ret)) {
>> +		/* queue just for cancelation */
>> +		init_task_work(&rw->task_work, io_async_buf_cancel);
>> +		tsk = io_wq_get_task(req->ctx->io_wq);
> 
> IIRC, task will be put somewhere around io_free_req(). Then shouldn't here be
> some juggling with reassigning req->task with task_{get,put}()?

Not sure I follow? Yes, we'll put this task again when the request
is freed, but not sure what you mean with juggling?

>> +		task_work_add(tsk, &rw->task_work, true);
>> +	}
>> +	wake_up_process(tsk);
>> +	return 1;
>> +}
> ...
>>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  {
>>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>> @@ -2601,6 +2696,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  	if (!ret) {
>>  		ssize_t ret2;
>>  
>> +retry:
>>  		if (req->file->f_op->read_iter)
>>  			ret2 = call_read_iter(req->file, kiocb, &iter);
>>  		else
>> @@ -2619,6 +2715,9 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  			if (!(req->flags & REQ_F_NOWAIT) &&
>>  			    !file_can_poll(req->file))
>>  				req->flags |= REQ_F_MUST_PUNT;
>> +			if (io_rw_should_retry(req))
> 
> It looks like a state machine with IOCB_WAITQ and gotos. Wouldn't it be cleaner
> to call call_read_iter()/loop_rw_iter() here directly instead of "goto retry" ?

We could, probably making that part a separate helper then. How about the
below incremental?

> BTW, can this async stuff return -EAGAIN ?

Probably? Prefer not to make any definitive calls on that being possible or
not, as it's sure to disappoint. If it does and IOCB_WAITQ is already set,
then we'll punt to a thread like before.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index a5a4d9602915..669dccd81207 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2677,6 +2677,13 @@ static bool io_rw_should_retry(struct io_kiocb *req)
 	return false;
 }
 
+static int __io_read(struct io_kiocb *req, struct iov_iter *iter)
+{
+	if (req->file->f_op->read_iter)
+		return call_read_iter(req->file, &req->rw.kiocb, iter);
+	return loop_rw_iter(READ, req->file, &req->rw.kiocb, iter);
+}
+
 static int io_read(struct io_kiocb *req, bool force_nonblock)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
@@ -2710,11 +2717,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 	if (!ret) {
 		ssize_t ret2;
 
-retry:
-		if (req->file->f_op->read_iter)
-			ret2 = call_read_iter(req->file, kiocb, &iter);
-		else
-			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
+		ret2 = __io_read(req, &iter);
 
 		/* Catch -EAGAIN return for forced non-blocking submission */
 		if (!force_nonblock || ret2 != -EAGAIN) {
@@ -2729,8 +2732,11 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
 			if (!(req->flags & REQ_F_NOWAIT) &&
 			    !file_can_poll(req->file))
 				req->flags |= REQ_F_MUST_PUNT;
-			if (io_rw_should_retry(req))
-				goto retry;
+			if (io_rw_should_retry(req)) {
+				ret2 = __io_read(req, &iter);
+				if (ret2 != -EAGAIN)
+					goto out_free;
+			}
 			kiocb->ki_flags &= ~IOCB_WAITQ;
 			return -EAGAIN;
 		}

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-23 18:57 ` [PATCH 12/12] io_uring: support true async buffered reads, if file provides it Jens Axboe
  2020-05-25  7:29   ` Pavel Begunkov
@ 2020-05-26  7:38   ` Pavel Begunkov
  2020-05-26 13:47     ` Jens Axboe
  1 sibling, 1 reply; 29+ messages in thread
From: Pavel Begunkov @ 2020-05-26  7:38 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 23/05/2020 21:57, Jens Axboe wrote:
> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
> the buffered read to an io-wq worker. Instead we can rely on page
> unlocking callbacks to support retry based async IO. This is a lot more
> efficient than doing async thread offload.
> 
> The retry is done similarly to how we handle poll based retry. From
> the unlock callback, we simply queue the retry to a task_work based
> handler.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 99 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e95481c552ff..dd532d2634c2 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -498,6 +498,8 @@ struct io_async_rw {
>  	struct iovec			*iov;
>  	ssize_t				nr_segs;
>  	ssize_t				size;
> +	struct wait_page_queue		wpq;
> +	struct callback_head		task_work;
>  };
>  
>  struct io_async_ctx {
> @@ -2568,6 +2570,99 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  	return 0;
>  }
>  
> +static void io_async_buf_cancel(struct callback_head *cb)
> +{
> +	struct io_async_rw *rw;
> +	struct io_ring_ctx *ctx;
> +	struct io_kiocb *req;
> +
> +	rw = container_of(cb, struct io_async_rw, task_work);
> +	req = rw->wpq.wait.private;
> +	ctx = req->ctx;
> +
> +	spin_lock_irq(&ctx->completion_lock);
> +	io_cqring_fill_event(req, -ECANCELED);

It seems like it should go through kiocb_done()/io_complete_rw_common().
My concern is missing io_put_kbuf().

> +	io_commit_cqring(ctx);
> +	spin_unlock_irq(&ctx->completion_lock);
> +
> +	io_cqring_ev_posted(ctx);
> +	req_set_fail_links(req);
> +	io_double_put_req(req);
> +}


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-25 19:59     ` Jens Axboe
@ 2020-05-26  7:44       ` Pavel Begunkov
  2020-05-26 13:50         ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Pavel Begunkov @ 2020-05-26  7:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 25/05/2020 22:59, Jens Axboe wrote:
> On 5/25/20 1:29 AM, Pavel Begunkov wrote:
>> On 23/05/2020 21:57, Jens Axboe wrote:
>>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>>> the buffered read to an io-wq worker. Instead we can rely on page
>>> unlocking callbacks to support retry based async IO. This is a lot more
>>> efficient than doing async thread offload.
>>>
>>> The retry is done similarly to how we handle poll based retry. From
>>> the unlock callback, we simply queue the retry to a task_work based
>>> handler.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 99 insertions(+)
>>>
>> ...
>>> +
>>> +	init_task_work(&rw->task_work, io_async_buf_retry);
>>> +	/* submit ref gets dropped, acquire a new one */
>>> +	refcount_inc(&req->refs);
>>> +	tsk = req->task;
>>> +	ret = task_work_add(tsk, &rw->task_work, true);
>>> +	if (unlikely(ret)) {
>>> +		/* queue just for cancelation */
>>> +		init_task_work(&rw->task_work, io_async_buf_cancel);
>>> +		tsk = io_wq_get_task(req->ctx->io_wq);
>>
>> IIRC, task will be put somewhere around io_free_req(). Then shouldn't here be
>> some juggling with reassigning req->task with task_{get,put}()?
> 
> Not sure I follow? Yes, we'll put this task again when the request
> is freed, but not sure what you mean with juggling?

I meant something like:

...
/* queue just for cancelation */
init_task_work(&rw->task_work, io_async_buf_cancel);
+ put_task_struct(req->task);
+ req->task = get_task_struct(io_wq_task);


but, thinking twice, if I got the whole idea right, it should be ok as is --
io-wq won't go away before the request anyway, and leaving req->task pinned down
for a bit is not a problem.

>>> +		task_work_add(tsk, &rw->task_work, true);
>>> +	}
>>> +	wake_up_process(tsk);
>>> +	return 1;
>>> +}
>> ...
>>>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>  {
>>>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>>> @@ -2601,6 +2696,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>  	if (!ret) {
>>>  		ssize_t ret2;
>>>  
>>> +retry:
>>>  		if (req->file->f_op->read_iter)
>>>  			ret2 = call_read_iter(req->file, kiocb, &iter);
>>>  		else
>>> @@ -2619,6 +2715,9 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>  			if (!(req->flags & REQ_F_NOWAIT) &&
>>>  			    !file_can_poll(req->file))
>>>  				req->flags |= REQ_F_MUST_PUNT;
>>> +			if (io_rw_should_retry(req))
>>
>> It looks like a state machine with IOCB_WAITQ and gotos. Wouldn't it be cleaner
>> to call call_read_iter()/loop_rw_iter() here directly instead of "goto retry" ?
> 
> We could, probably making that part a separate helper then. How about the
> below incremental?

IMHO, it was easy to get lost with such implicit state switching.
Looks better now! See a small comment below.

> 
>> BTW, can this async stuff return -EAGAIN ?
> 
> Probably? Prefer not to make any definitive calls on that being possible or
> not, as it's sure to disappoint. If it does and IOCB_WAITQ is already set,
> then we'll punt to a thread like before.

Sounds reasonable

> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index a5a4d9602915..669dccd81207 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2677,6 +2677,13 @@ static bool io_rw_should_retry(struct io_kiocb *req)
>  	return false;
>  }
>  
> +static int __io_read(struct io_kiocb *req, struct iov_iter *iter)
> +{
> +	if (req->file->f_op->read_iter)
> +		return call_read_iter(req->file, &req->rw.kiocb, iter);
> +	return loop_rw_iter(READ, req->file, &req->rw.kiocb, iter);
> +}
> +
>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>  {
>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> @@ -2710,11 +2717,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  	if (!ret) {
>  		ssize_t ret2;
>  
> -retry:
> -		if (req->file->f_op->read_iter)
> -			ret2 = call_read_iter(req->file, kiocb, &iter);
> -		else
> -			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
> +		ret2 = __io_read(req, &iter);
>  
>  		/* Catch -EAGAIN return for forced non-blocking submission */
>  		if (!force_nonblock || ret2 != -EAGAIN) {
> @@ -2729,8 +2732,11 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>  			if (!(req->flags & REQ_F_NOWAIT) &&
>  			    !file_can_poll(req->file))
>  				req->flags |= REQ_F_MUST_PUNT;
> -			if (io_rw_should_retry(req))
> -				goto retry;
> +			if (io_rw_should_retry(req)) {
> +				ret2 = __io_read(req, &iter);
> +				if (ret2 != -EAGAIN)
> +					goto out_free;

"goto out_free" returns ret=0, so someone should add a cqe

if (ret2 != -EAGAIN) {
	kiocb_done(kiocb, ret2);
	goto free_out;
}


> +			}
>  			kiocb->ki_flags &= ~IOCB_WAITQ;
>  			return -EAGAIN;
>  		}
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-26  7:38   ` Pavel Begunkov
@ 2020-05-26 13:47     ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-26 13:47 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 5/26/20 1:38 AM, Pavel Begunkov wrote:
> On 23/05/2020 21:57, Jens Axboe wrote:
>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>> the buffered read to an io-wq worker. Instead we can rely on page
>> unlocking callbacks to support retry based async IO. This is a lot more
>> efficient than doing async thread offload.
>>
>> The retry is done similarly to how we handle poll based retry. From
>> the unlock callback, we simply queue the retry to a task_work based
>> handler.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 99 insertions(+)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index e95481c552ff..dd532d2634c2 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -498,6 +498,8 @@ struct io_async_rw {
>>  	struct iovec			*iov;
>>  	ssize_t				nr_segs;
>>  	ssize_t				size;
>> +	struct wait_page_queue		wpq;
>> +	struct callback_head		task_work;
>>  };
>>  
>>  struct io_async_ctx {
>> @@ -2568,6 +2570,99 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>  	return 0;
>>  }
>>  
>> +static void io_async_buf_cancel(struct callback_head *cb)
>> +{
>> +	struct io_async_rw *rw;
>> +	struct io_ring_ctx *ctx;
>> +	struct io_kiocb *req;
>> +
>> +	rw = container_of(cb, struct io_async_rw, task_work);
>> +	req = rw->wpq.wait.private;
>> +	ctx = req->ctx;
>> +
>> +	spin_lock_irq(&ctx->completion_lock);
>> +	io_cqring_fill_event(req, -ECANCELED);
> 
> It seems like it should go through kiocb_done()/io_complete_rw_common().
> My concern is missing io_put_kbuf().

Yeah, I noticed that too after sending it out. If you look at the
current one that I updated yesterday, it does add that (and also
renames the iter read helper):

https://git.kernel.dk/cgit/linux-block/commit/?h=async-buffered.5&id=6f4e3a4066d0db3e3478e58cc250afb16d8d4d91

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 12/12] io_uring: support true async buffered reads, if file provides it
  2020-05-26  7:44       ` Pavel Begunkov
@ 2020-05-26 13:50         ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-26 13:50 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm

On 5/26/20 1:44 AM, Pavel Begunkov wrote:
> On 25/05/2020 22:59, Jens Axboe wrote:
>> On 5/25/20 1:29 AM, Pavel Begunkov wrote:
>>> On 23/05/2020 21:57, Jens Axboe wrote:
>>>> If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
>>>> the buffered read to an io-wq worker. Instead we can rely on page
>>>> unlocking callbacks to support retry based async IO. This is a lot more
>>>> efficient than doing async thread offload.
>>>>
>>>> The retry is done similarly to how we handle poll based retry. From
>>>> the unlock callback, we simply queue the retry to a task_work based
>>>> handler.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 99 insertions(+)
>>>>
>>> ...
>>>> +
>>>> +	init_task_work(&rw->task_work, io_async_buf_retry);
>>>> +	/* submit ref gets dropped, acquire a new one */
>>>> +	refcount_inc(&req->refs);
>>>> +	tsk = req->task;
>>>> +	ret = task_work_add(tsk, &rw->task_work, true);
>>>> +	if (unlikely(ret)) {
>>>> +		/* queue just for cancelation */
>>>> +		init_task_work(&rw->task_work, io_async_buf_cancel);
>>>> +		tsk = io_wq_get_task(req->ctx->io_wq);
>>>
>>> IIRC, task will be put somewhere around io_free_req(). Then shouldn't here be
>>> some juggling with reassigning req->task with task_{get,put}()?
>>
>> Not sure I follow? Yes, we'll put this task again when the request
>> is freed, but not sure what you mean with juggling?
> 
> I meant something like:
> 
> ...
> /* queue just for cancelation */
> init_task_work(&rw->task_work, io_async_buf_cancel);
> + put_task_struct(req->task);
> + req->task = get_task_struct(io_wq_task);
> 
> 
> but, thinking twice, if I got the whole idea right, it should be ok as
> is -- io-wq won't go away before the request anyway, and leaving
> req->task pinned down for a bit is not a problem.

OK good, then I thin kwe agree it's fine.

>>>> +		task_work_add(tsk, &rw->task_work, true);
>>>> +	}
>>>> +	wake_up_process(tsk);
>>>> +	return 1;
>>>> +}
>>> ...
>>>>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>>  {
>>>>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>>>> @@ -2601,6 +2696,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>>  	if (!ret) {
>>>>  		ssize_t ret2;
>>>>  
>>>> +retry:
>>>>  		if (req->file->f_op->read_iter)
>>>>  			ret2 = call_read_iter(req->file, kiocb, &iter);
>>>>  		else
>>>> @@ -2619,6 +2715,9 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>>>  			if (!(req->flags & REQ_F_NOWAIT) &&
>>>>  			    !file_can_poll(req->file))
>>>>  				req->flags |= REQ_F_MUST_PUNT;
>>>> +			if (io_rw_should_retry(req))
>>>
>>> It looks like a state machine with IOCB_WAITQ and gotos. Wouldn't it be cleaner
>>> to call call_read_iter()/loop_rw_iter() here directly instead of "goto retry" ?
>>
>> We could, probably making that part a separate helper then. How about the
>> below incremental?
> 
> IMHO, it was easy to get lost with such implicit state switching.
> Looks better now! See a small comment below.

Agree, that is cleaner.

>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index a5a4d9602915..669dccd81207 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2677,6 +2677,13 @@ static bool io_rw_should_retry(struct io_kiocb *req)
>>  	return false;
>>  }
>>  
>> +static int __io_read(struct io_kiocb *req, struct iov_iter *iter)
>> +{
>> +	if (req->file->f_op->read_iter)
>> +		return call_read_iter(req->file, &req->rw.kiocb, iter);
>> +	return loop_rw_iter(READ, req->file, &req->rw.kiocb, iter);
>> +}
>> +
>>  static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  {
>>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>> @@ -2710,11 +2717,7 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  	if (!ret) {
>>  		ssize_t ret2;
>>  
>> -retry:
>> -		if (req->file->f_op->read_iter)
>> -			ret2 = call_read_iter(req->file, kiocb, &iter);
>> -		else
>> -			ret2 = loop_rw_iter(READ, req->file, kiocb, &iter);
>> +		ret2 = __io_read(req, &iter);
>>  
>>  		/* Catch -EAGAIN return for forced non-blocking submission */
>>  		if (!force_nonblock || ret2 != -EAGAIN) {
>> @@ -2729,8 +2732,11 @@ static int io_read(struct io_kiocb *req, bool force_nonblock)
>>  			if (!(req->flags & REQ_F_NOWAIT) &&
>>  			    !file_can_poll(req->file))
>>  				req->flags |= REQ_F_MUST_PUNT;
>> -			if (io_rw_should_retry(req))
>> -				goto retry;
>> +			if (io_rw_should_retry(req)) {
>> +				ret2 = __io_read(req, &iter);
>> +				if (ret2 != -EAGAIN)
>> +					goto out_free;
> 
> "goto out_free" returns ret=0, so someone should add a cqe
> 
> if (ret2 != -EAGAIN) {
> 	kiocb_done(kiocb, ret2);
> 	goto free_out;
> }

Fixed up in the current one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 06/12] fs: add FMODE_BUF_RASYNC
  2020-05-26 19:51 [PATCHSET v5 0/12] Add support for async buffered reads Jens Axboe
@ 2020-05-26 19:51 ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-26 19:51 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, akpm, Jens Axboe

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba1fff0e7bca..5ffc6d236b01 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports async buffered reads */
+#define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 06/12] fs: add FMODE_BUF_RASYNC
  2020-05-24 19:21 [PATCHSET v4 " Jens Axboe
@ 2020-05-24 19:22 ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2020-05-24 19:22 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, linux-kernel, linux-mm, Jens Axboe

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/fs.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5a5434ff7543..f7b1eb765c6e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File supports async buffered reads */
+#define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
-- 
2.26.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, back to index

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-23 18:57 [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
2020-05-23 18:57 ` [PATCH 01/12] block: read-ahead submission should imply no-wait as well Jens Axboe
2020-05-23 18:57 ` [PATCH 02/12] mm: allow read-ahead with IOCB_NOWAIT set Jens Axboe
2020-05-23 18:57 ` [PATCH 03/12] mm: abstract out wake_page_match() from wake_page_function() Jens Axboe
2020-05-23 18:57 ` [PATCH 04/12] mm: add support for async page locking Jens Axboe
2020-05-23 18:57 ` [PATCH 05/12] mm: support async buffered reads in generic_file_buffered_read() Jens Axboe
2020-05-24 14:05   ` Trond Myklebust
2020-05-24 16:30     ` Jens Axboe
2020-05-24 16:40       ` Jens Axboe
2020-05-24 17:11         ` Trond Myklebust
2020-05-24 17:12           ` Jens Axboe
2020-05-23 18:57 ` [PATCH 06/12] fs: add FMODE_BUF_RASYNC Jens Axboe
2020-05-23 18:57 ` [PATCH 07/12] ext4: flag as supporting buffered async reads Jens Axboe
2020-05-23 18:57 ` [PATCH 08/12] block: flag block devices as supporting IOCB_WAITQ Jens Axboe
2020-05-23 18:57 ` [PATCH 09/12] xfs: flag files as supporting buffered async reads Jens Axboe
2020-05-23 18:57 ` [PATCH 10/12] btrfs: " Jens Axboe
2020-05-23 18:57 ` [PATCH 11/12] mm: add kiocb_wait_page_queue_init() helper Jens Axboe
2020-05-23 18:57 ` [PATCH 12/12] io_uring: support true async buffered reads, if file provides it Jens Axboe
2020-05-25  7:29   ` Pavel Begunkov
2020-05-25 19:59     ` Jens Axboe
2020-05-26  7:44       ` Pavel Begunkov
2020-05-26 13:50         ` Jens Axboe
2020-05-26  7:38   ` Pavel Begunkov
2020-05-26 13:47     ` Jens Axboe
2020-05-23 19:20 ` [PATCHSET v2 0/12] Add support for async buffered reads Jens Axboe
2020-05-24  9:46   ` Chris Panayis
2020-05-24 19:24     ` Jens Axboe
2020-05-24 19:21 [PATCHSET v4 " Jens Axboe
2020-05-24 19:22 ` [PATCH 06/12] fs: add FMODE_BUF_RASYNC Jens Axboe
2020-05-26 19:51 [PATCHSET v5 0/12] Add support for async buffered reads Jens Axboe
2020-05-26 19:51 ` [PATCH 06/12] fs: add FMODE_BUF_RASYNC Jens Axboe

IO-Uring Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/io-uring/0 io-uring/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 io-uring io-uring/ https://lore.kernel.org/io-uring \
		io-uring@vger.kernel.org
	public-inbox-index io-uring

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.io-uring


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git