linux-mm.kvack.org archive mirror
* [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes
@ 2022-04-26 17:43 Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks Stefan Roesch
                   ` (18 more replies)
  0 siblings, 19 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This patch series adds support for async buffered writes when using both
xfs and io-uring. Currently io-uring only supports buffered writes in the
slow path, by processing them in the io workers. With this patch series it is
now possible to support buffered writes in the fast path. To use the fast
path, the required pages must be in the page cache, the required locks in
xfs must be available immediately, and no additional blocks need to be read
from disk.

Updating the inode can take time. An optimization has been implemented for
the time update: time updates are processed in the slow path, and while a
time update is already in progress, other write requests for the same file
can skip the update of the modification time.
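
For illustration, below is a minimal userspace sketch of how this path is
driven through liburing. The file name, buffer contents and the omitted
error handling are assumptions for the example, not part of this series:

  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <liburing.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          static char buf[4096];
          int fd;

          /* Buffered (non-O_DIRECT) file on an xfs filesystem. */
          fd = open("/mnt/xfs/testfile", O_WRONLY | O_CREAT, 0644);
          memset(buf, 'a', sizeof(buf));

          io_uring_queue_init(8, &ring, 0);

          /* Queue depth 1: a single 4k buffered write per submission. */
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
          io_uring_submit(&ring);

          /*
           * With this series the write completes inline when the pages are
           * in the page cache and the xfs locks are uncontended; otherwise
           * it is punted to the io workers as before.
           */
          io_uring_wait_cqe(&ring, &cqe);
          printf("write returned %d\n", cqe->res);
          io_uring_cqe_seen(&ring, cqe);

          io_uring_queue_exit(&ring);
          return 0;
  }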
  

Performance results:
  For fio the following results have been obtained with a queue depth of
  1 and 4k block size (runtime 600 secs):

                 sequential writes:
                 without patch                 with patch
  iops:              80k                          269k


                 random writes:
                 without patch                 with patch
  iops:              76k                          249k

For an io depth of 1, the patch series improves throughput by more than three
times (compared to the existing behavior, where buffered writes are processed
by an io-worker process) and also reduces latency considerably. To achieve the
same or better performance with the existing code an io depth of 4 is required.
Increasing the io depth further does not lead to any further improvements.

Especially for mixed workloads this is a considerable improvement.
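
A fio job along the lines of the sketch below reproduces the setup described
above. Queue depth, block size and runtime are taken from the text; the
engine, file name and file size are assumptions:

  [global]
  ioengine=io_uring
  ; rw=randwrite was used for the random write numbers
  rw=write
  bs=4k
  iodepth=1
  direct=0
  time_based=1
  runtime=600

  [buffered-writes]
  filename=/mnt/xfs/fio-testfile
  size=10G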



Support for async buffered writes:

  Patch 1: block: add check for async buffered writes to generic_write_checks
    Add a new flag FMODE_BUF_WASYNC so filesystems can specify that they support
    async buffered writes and include the flag in the check of the function
    generic_write_checks().
    
  Patch 2: mm: add FGP_ATOMIC flag to __filemap_get_folio()
    This adds the FGP_ATOMIC flag, which allows specifying the gfp flags
    used for memory allocations for async buffered writes.
    
  Patch 3: iomap: add iomap_page_create_gfp to allocate iomap_pages
    Add new function to allow specifying gfp flags when allocating and
    initializing the structure iomap_page.

  Patch 4: iomap: use iomap_page_create_gfp() in __iomap_write_begin
    Use the new iomap_page_create_gfp() function in __iomap_write_begin().
  
  Patch 5: iomap: add async buffered write support
    Set the IOMAP_NOWAIT flag if IOCB_NOWAIT is set. Also use different gfp
    flags if the iomap_page structure is allocated for an async buffered
    write.

  Patch 6: xfs: add iomap async buffered write support
    Add async buffered write support to the xfs iomap layer.

Support for async buffered writes and inode time modification:


  Patch 7: fs: split off need_remove_file_privs() do_remove_file_privs()
    Splits off a check function and an action function, so they can later
    be invoked in the nowait code path.
    
  Patch 8: fs: split off need_file_update_time and do_file_update_time
    Splits off a check function and an action function, so they can later
    be invoked in the nowait code path.
    
  Patch 9: fs: add pending file update time flag.
    Add a new flag so consecutive write requests for the same inode can
    proceed without waiting for a pending inode modification time update
    to complete.

  Patch 10: xfs: enable async write file modification handling.
    Enable async write handling in xfs for the file modification time
    update. If the file modification time update requires logging, or if
    the removal of privileges would have to wait, -EAGAIN is returned.

  Patch 11: xfs: add async buffered write support
    Take the ilock in nowait mode if async buffered writes are enabled.

  Patch 12: io_uring: add support for async buffered writes
    This enables the async buffered writes optimization in io_uring.
    Buffered writes are enabled for blocks that are already in the page
    cache.

  Patch 13: io_uring: Add tracepoint for short writes

Support for write throttling of async buffered writes:

  Patch 14: sched: add new fields to task_struct
    Add two new fields to the task_struct: one stores the deadline after
    which writes are no longer throttled, the other stores the dirtied-pages
    pause value to restore once that deadline has expired.

  Patch 15: mm: support write throttling for async buffered writes
    This changes the balance_dirty_pages function to take an additional
    parameter. When nowait is specified the write throttling code no
    longer waits synchronously for the deadline to expire. Instead
    it sets the fields in task_struct. Once the deadline expires the
    fields are reset.
    
  Patch 16: iomap: User throttling for async buffered writes.
    Enable async buffered write throttling in iomap.

  Patch 17: io_uring: support write throttling for async buffered writes
    Adds support to io_uring for write throttling. When the writes
    are throttled, the write requests are added to the pending io list.
    Once the write throttling deadline expires, the writes are submitted.

Enable async buffered write support in xfs:
  Patch 18: xfs: enable async buffered write support
    This sets the flag that enables async buffered writes for xfs.
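
    The filesystem-side opt-in is just setting the new file mode flag at
    open time; for xfs this amounts to a one-line change roughly like the
    sketch below (the exact hunk is in patch 18; placing it in
    xfs_file_open() is an assumption here):

      STATIC int
      xfs_file_open(
              struct inode    *inode,
              struct file     *file)
      {
              /* ... existing open checks ... */
              /* Opt files on this filesystem into async buffered writes. */
              file->f_mode |= FMODE_BUF_WASYNC;
              return generic_file_open(inode, file);
      }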


Testing:
  This patch series has been tested with xfstests and fio.


Stefan Roesch (18):
  block: add check for async buffered writes to generic_write_checks
  mm: add FGP_ATOMIC flag to __filemap_get_folio()
  iomap: add iomap_page_create_gfp to allocate iomap_pages
  iomap: use iomap_page_create_gfp() in __iomap_write_begin
  iomap: add async buffered write support
  xfs: add iomap async buffered write support
  fs: split off need_remove_file_privs() do_remove_file_privs()
  fs: split off need_file_update_time and do_file_update_time
  fs: add pending file update time flag.
  xfs: Enable async write file modification handling.
  xfs: add async buffered write support
  io_uring: add support for async buffered writes
  io_uring: add tracepoint for short writes
  sched: add new fields to task_struct
  mm: support write throttling for async buffered writes
  iomap: User throttling for async buffered writes.
  io_uring: support write throttling for async buffered writes
  xfs: enable async buffered write support

 fs/inode.c                      | 127 +++++++++++++++++++++----------
 fs/io_uring.c                   | 131 +++++++++++++++++++++++++++++---
 fs/iomap/buffered-io.c          |  63 +++++++++++++--
 fs/read_write.c                 |   3 +-
 fs/xfs/xfs_file.c               |  51 +++++++++++--
 fs/xfs/xfs_iomap.c              |  33 +++++++-
 include/linux/fs.h              |  14 ++++
 include/linux/pagemap.h         |   1 +
 include/linux/sched.h           |   3 +
 include/linux/writeback.h       |   1 +
 include/trace/events/io_uring.h |  25 ++++++
 kernel/fork.c                   |   1 +
 mm/filemap.c                    |   3 +
 mm/page-writeback.c             |  54 +++++++++----
 14 files changed, 424 insertions(+), 86 deletions(-)


base-commit: af2d861d4cd2a4da5137f795ee3509e6f944a25b
-- 
2.30.2



^ permalink raw reply	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio() Stefan Roesch
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This introduces the flag FMODE_BUF_WASYNC. Filesystems that support async
buffered writes can set this flag. It also modifies the check in
generic_write_checks to take async buffered writes into consideration.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/read_write.c    | 3 ++-
 include/linux/fs.h | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index e643aec2b0ef..f75d75f7bc84 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1633,7 +1633,8 @@ int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
 	if (iocb->ki_flags & IOCB_APPEND)
 		iocb->ki_pos = i_size_read(inode);
 
-	if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+	if ((iocb->ki_flags & IOCB_NOWAIT) &&
+	    !((iocb->ki_flags & IOCB_DIRECT) || (file->f_mode & FMODE_BUF_WASYNC)))
 		return -EINVAL;
 
 	return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbde95387a23..3b479d02e210 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -177,6 +177,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async buffered reads */
 #define FMODE_BUF_RASYNC	((__force fmode_t)0x40000000)
 
+/* File supports async nowait buffered writes */
+#define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 19:06   ` Matthew Wilcox
  2022-04-26 17:43 ` [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

Define the FGP_ATOMIC flag and add support for it in the function
__filemap_get_folio().

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 include/linux/pagemap.h | 1 +
 mm/filemap.c            | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 993994cd943a..b8d839e10780 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -509,6 +509,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_HEAD		0x00000080
 #define FGP_ENTRY		0x00000100
 #define FGP_STABLE		0x00000200
+#define FGP_ATOMIC		0x00000400
 
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		int fgp_flags, gfp_t gfp);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9a1eef6c5d35..dd3c5682276e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1929,6 +1929,7 @@ static void *mapping_get_entry(struct address_space *mapping, pgoff_t index)
  * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
  * * %FGP_NOWAIT - Don't get blocked by page lock.
  * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
+ * * %FGP_ATOMIC - Use atomic allocations
  *
  * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
  * if the %GFP flags specified for %FGP_CREAT are atomic.
@@ -1988,6 +1989,8 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
 			gfp &= ~__GFP_FS;
+		if (fgp_flags & FGP_ATOMIC)
+			gfp |= GFP_ATOMIC;
 
 		folio = filemap_alloc_folio(gfp, 0);
 		if (!folio)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio() Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

Add the function iomap_page_create_gfp(), which makes it possible to
specify the gfp flags and to pass in the number of blocks per folio when
allocating and initializing a struct iomap_page.

No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/iomap/buffered-io.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..85aa32f50db0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -43,17 +43,27 @@ static inline struct iomap_page *to_iomap_page(struct folio *folio)
 
 static struct bio_set iomap_ioend_bioset;
 
+/**
+ * iomap_page_create_gfp : Create and initialize iomap_page for folio.
+ * @inode     : Pointer to inode
+ * @folio     : Pointer to folio
+ * @nr_blocks : Number of blocks in the folio
+ * @gfp       : gfp allocation flags
+ *
+ * This function returns a newly allocated iomap_page for the folio. The
+ * allocation is done with the gfp flags specified in the gfp parameter.
+ *
+ **/
 static struct iomap_page *
-iomap_page_create(struct inode *inode, struct folio *folio)
+iomap_page_create_gfp(struct inode *inode, struct folio *folio,
+		unsigned int nr_blocks, gfp_t gfp)
 {
-	struct iomap_page *iop = to_iomap_page(folio);
-	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+	struct iomap_page *iop;
 
-	if (iop || nr_blocks <= 1)
+	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)), gfp);
+	if (!iop)
 		return iop;
 
-	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
-			GFP_NOFS | __GFP_NOFAIL);
 	spin_lock_init(&iop->uptodate_lock);
 	if (folio_test_uptodate(folio))
 		bitmap_fill(iop->uptodate, nr_blocks);
@@ -61,6 +71,18 @@ iomap_page_create(struct inode *inode, struct folio *folio)
 	return iop;
 }
 
+static struct iomap_page *
+iomap_page_create(struct inode *inode, struct folio *folio)
+{
+	struct iomap_page *iop = to_iomap_page(folio);
+	unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
+
+	if (iop || nr_blocks <= 1)
+		return iop;
+
+	return iomap_page_create_gfp(inode, folio, nr_blocks, GFP_NOFS | __GFP_NOFAIL);
+}
+
 static void iomap_page_release(struct folio *folio)
 {
 	struct iomap_page *iop = folio_detach_private(folio);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (2 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 05/18] iomap: add async buffered write support Stefan Roesch
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This change uses the new iomap_page_create_gfp() function in the
function __iomap_write_begin().

No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/iomap/buffered-io.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 85aa32f50db0..1ffdc7078e7d 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -572,17 +572,22 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 		size_t len, struct folio *folio)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
-	struct iomap_page *iop = iomap_page_create(iter->inode, folio);
+	struct iomap_page *iop = to_iomap_page(folio);
 	loff_t block_size = i_blocksize(iter->inode);
 	loff_t block_start = round_down(pos, block_size);
 	loff_t block_end = round_up(pos + len, block_size);
+	unsigned int nr_blocks = i_blocks_per_folio(iter->inode, folio);
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
+	gfp_t  gfp = GFP_NOFS | __GFP_NOFAIL;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 	folio_clear_error(folio);
 
+	if (!iop && nr_blocks > 1)
+		iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
 				block_end - block_start, &poff, &plen);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 05/18] iomap: add async buffered write support
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (3 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 06/18] xfs: add iomap " Stefan Roesch
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This adds async buffered write support to iomap. The support is focused
on the changes necessary to support XFS with iomap.

Support for other filesystems might require additional changes.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/iomap/buffered-io.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1ffdc7078e7d..5c53a5715c85 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -580,13 +580,20 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	size_t from = offset_in_folio(folio, pos), to = from + len;
 	size_t poff, plen;
 	gfp_t  gfp = GFP_NOFS | __GFP_NOFAIL;
+	bool no_wait = (iter->flags & IOMAP_NOWAIT);
+
+	if (no_wait)
+		gfp = GFP_NOIO | GFP_ATOMIC;
 
 	if (folio_test_uptodate(folio))
 		return 0;
 	folio_clear_error(folio);
 
-	if (!iop && nr_blocks > 1)
+	if (!iop && nr_blocks > 1) {
 		iop = iomap_page_create_gfp(iter->inode, folio, nr_blocks, gfp);
+		if (no_wait && !iop)
+			return -EAGAIN;
+	}
 
 	do {
 		iomap_adjust_read_range(iter->inode, folio, &block_start,
@@ -603,6 +610,8 @@ static int __iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 			if (WARN_ON_ONCE(iter->flags & IOMAP_UNSHARE))
 				return -EIO;
 			folio_zero_segments(folio, poff, from, to, poff + plen);
+		} else if (no_wait) {
+			return -EAGAIN;
 		} else {
 			int status = iomap_read_folio_sync(block_start, folio,
 					poff, plen, srcmap);
@@ -633,6 +642,9 @@ static int iomap_write_begin(const struct iomap_iter *iter, loff_t pos,
 	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS;
 	int status = 0;
 
+	if (iter->flags & IOMAP_NOWAIT)
+		fgp |= FGP_NOWAIT | FGP_ATOMIC;
+
 	BUG_ON(pos + len > iter->iomap.offset + iter->iomap.length);
 	if (srcmap != &iter->iomap)
 		BUG_ON(pos + len > srcmap->offset + srcmap->length);
@@ -790,6 +802,10 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
+		 *
+		 * For async buffered writes the assumption is that the user
+		 * page has already been faulted in. This can be optimized by
+		 * faulting the user page in the prepare phase of io-uring.
 		 */
 		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
 			status = -EFAULT;
@@ -845,6 +861,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 	};
 	int ret;
 
+	if (iocb->ki_flags & IOCB_NOWAIT)
+		iter.flags |= IOMAP_NOWAIT;
+
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_write_iter(&iter, i);
 	if (iter.pos == iocb->ki_pos)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (4 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 05/18] iomap: add async buffered write support Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 22:54   ` Dave Chinner
  2022-04-26 17:43 ` [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs() Stefan Roesch
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This adds async buffered write support to the iomap layer of XFS. If a
lock cannot be acquired or additional reads need to be performed, and the
request is an async buffered write, -EAGAIN is returned.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index e552ce541ec2..80b6c48e88af 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
 	bool			eof = false, cow_eof = false, shared = false;
 	int			allocfork = XFS_DATA_FORK;
 	int			error = 0;
+	bool			no_wait = (flags & IOMAP_NOWAIT);
 
 	if (xfs_is_shutdown(mp))
 		return -EIO;
 
 	/* we can't use delayed allocations when using extent size hints */
-	if (xfs_get_extsz_hint(ip))
+	if (xfs_get_extsz_hint(ip)) {
+		if (no_wait)
+			return -EAGAIN;
+
 		return xfs_direct_write_iomap_begin(inode, offset, count,
 				flags, iomap, srcmap);
+	}
 
 	ASSERT(!XFS_IS_REALTIME_INODE(ip));
 
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	if (no_wait) {
+		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+	}
 
 	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
 	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
@@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
 
 	XFS_STATS_INC(mp, xs_blk_mapw);
 
+	if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
 	if (error)
 		goto out_unlock;
@@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
 	if (xfs_is_cow_inode(ip)) {
 		if (!ip->i_cowfp) {
 			ASSERT(!xfs_is_reflink_inode(ip));
+			if (no_wait) {
+				error = -EAGAIN;
+				goto out_unlock;
+			}
 			xfs_ifork_init_cow(ip);
 		}
 		cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
@@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
 			goto found_imap;
 		}
 
+		if (no_wait) {
+			error = -EAGAIN;
+			goto out_unlock;
+		}
+
 		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
 
 		/* Trim the mapping to the nearest shared extent boundary. */
@@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
 			allocfork = XFS_COW_FORK;
 	}
 
+	if (no_wait) {
+		error = -EAGAIN;
+		goto out_unlock;
+	}
+
 	error = xfs_qm_dqattach_locked(ip, false);
 	if (error)
 		goto out_unlock;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs()
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (5 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 06/18] xfs: add iomap " Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time Stefan Roesch
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This splits the function file_remove_privs() into the check function
need_file_remove_privs() and the action function do_file_remove_privs().

No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/inode.c         | 56 +++++++++++++++++++++++++++++++---------------
 include/linux/fs.h |  3 +++
 2 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d9b422504d1..79c5702ef7c3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2010,17 +2010,8 @@ static int __remove_privs(struct user_namespace *mnt_userns,
 	return notify_change(mnt_userns, dentry, &newattrs, NULL);
 }
 
-/*
- * Remove special file priviledges (suid, capabilities) when file is written
- * to or truncated.
- */
-int file_remove_privs(struct file *file)
+int need_file_remove_privs(struct inode *inode, struct dentry *dentry)
 {
-	struct dentry *dentry = file_dentry(file);
-	struct inode *inode = file_inode(file);
-	int kill;
-	int error = 0;
-
 	/*
 	 * Fast path for nothing security related.
 	 * As well for non-regular files, e.g. blkdev inodes.
@@ -2030,16 +2021,37 @@ int file_remove_privs(struct file *file)
 	if (IS_NOSEC(inode) || !S_ISREG(inode->i_mode))
 		return 0;
 
-	kill = dentry_needs_remove_privs(dentry);
-	if (kill < 0)
-		return kill;
-	if (kill)
-		error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
+	return dentry_needs_remove_privs(dentry);
+}
+
+int do_file_remove_privs(struct file *file, struct inode *inode,
+			struct dentry *dentry, int kill)
+{
+	int error = 0;
+
+	error = __remove_privs(file_mnt_user_ns(file), dentry, kill);
 	if (!error)
 		inode_has_no_xattr(inode);
 
 	return error;
 }
+
+/*
+ * Remove special file privileges (suid, capabilities) when file is written
+ * to or truncated.
+ */
+int file_remove_privs(struct file *file)
+{
+	struct dentry *dentry = file_dentry(file);
+	struct inode *inode = file_inode(file);
+	int kill;
+
+	kill = need_file_remove_privs(inode, dentry);
+	if (kill <= 0)
+		return kill;
+
+	return do_file_remove_privs(file, inode, dentry, kill);
+}
 EXPORT_SYMBOL(file_remove_privs);
 
 /**
@@ -2094,14 +2106,22 @@ EXPORT_SYMBOL(file_update_time);
 int file_modified(struct file *file)
 {
 	int err;
+	int ret;
+	struct dentry *dentry = file_dentry(file);
+	struct inode *inode = file_inode(file);
 
 	/*
 	 * Clear the security bits if the process is not being run by root.
 	 * This keeps people from modifying setuid and setgid binaries.
 	 */
-	err = file_remove_privs(file);
-	if (err)
-		return err;
+	ret = need_file_remove_privs(inode, dentry);
+	if (ret < 0) {
+		return ret;
+	} else if (ret > 0) {
+		ret = do_file_remove_privs(file, inode, dentry, ret);
+		if (ret)
+			return ret;
+	}
 
 	if (unlikely(file->f_mode & FMODE_NOCMTIME))
 		return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b479d02e210..c2992d12601f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2381,6 +2381,9 @@ static inline void file_accessed(struct file *file)
 		touch_atime(&file->f_path);
 }
 
+extern int need_file_remove_privs(struct inode *inode, struct dentry *dentry);
+extern int do_file_remove_privs(struct file *file, struct inode *inode,
+				struct dentry *dentry, int kill);
 extern int file_modified(struct file *file);
 
 int sync_inode_metadata(struct inode *inode, int wait);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (6 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs() Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 09/18] fs: add pending file update time flag Stefan Roesch
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This splits off the functions need_file_update_time() and
do_file_update_time() from the function file_update_time().

This is required to support async buffered writes.
No intended functional changes in this patch.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/inode.c         | 72 ++++++++++++++++++++++++++++++----------------
 include/linux/fs.h |  5 ++++
 2 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 79c5702ef7c3..64047bb0b9f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2054,35 +2054,22 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
-/**
- *	file_update_time	-	update mtime and ctime time
- *	@file: file accessed
- *
- *	Update the mtime and ctime members of an inode and mark the inode
- *	for writeback.  Note that this function is meant exclusively for
- *	usage in the file write path of filesystems, and filesystems may
- *	choose to explicitly ignore update via this function with the
- *	S_NOCMTIME inode flag, e.g. for network filesystem where these
- *	timestamps are handled by the server.  This can return an error for
- *	file systems who need to allocate space in order to update an inode.
- */
-
-int file_update_time(struct file *file)
+int need_file_update_time(struct inode *inode, struct file *file,
+			struct timespec64 *now)
 {
-	struct inode *inode = file_inode(file);
-	struct timespec64 now;
 	int sync_it = 0;
-	int ret;
+
+	if (unlikely(file->f_mode & FMODE_NOCMTIME))
+		return 0;
 
 	/* First try to exhaust all avenues to not sync */
 	if (IS_NOCMTIME(inode))
 		return 0;
 
-	now = current_time(inode);
-	if (!timespec64_equal(&inode->i_mtime, &now))
+	if (!timespec64_equal(&inode->i_mtime, now))
 		sync_it = S_MTIME;
 
-	if (!timespec64_equal(&inode->i_ctime, &now))
+	if (!timespec64_equal(&inode->i_ctime, now))
 		sync_it |= S_CTIME;
 
 	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
@@ -2091,24 +2078,58 @@ int file_update_time(struct file *file)
 	if (!sync_it)
 		return 0;
 
+	return sync_it;
+}
+
+int do_file_update_time(struct inode *inode, struct file *file,
+			struct timespec64 *now, int sync_mode)
+{
+	int ret;
+
 	/* Finally allowed to write? Takes lock. */
 	if (__mnt_want_write_file(file))
 		return 0;
 
-	ret = inode_update_time(inode, &now, sync_it);
+	ret = inode_update_time(inode, now, sync_mode);
 	__mnt_drop_write_file(file);
 
 	return ret;
 }
+
+/**
+ *	file_update_time	-	update mtime and ctime time
+ *	@file: file accessed
+ *
+ *	Update the mtime and ctime members of an inode and mark the inode
+ *	for writeback.  Note that this function is meant exclusively for
+ *	usage in the file write path of filesystems, and filesystems may
+ *	choose to explicitly ignore update via this function with the
+ *	S_NOCMTIME inode flag, e.g. for network filesystem where these
+ *	timestamps are handled by the server.  This can return an error for
+ *	file systems who need to allocate space in order to update an inode.
+ */
+
+int file_update_time(struct file *file)
+{
+	int err;
+	struct inode *inode = file_inode(file);
+	struct timespec64 now = current_time(inode);
+
+	err = need_file_update_time(inode, file, &now);
+	if (err < 0)
+		return err;
+
+	return do_file_update_time(inode, file, &now, err);
+}
 EXPORT_SYMBOL(file_update_time);
 
 /* Caller must hold the file's inode lock */
 int file_modified(struct file *file)
 {
-	int err;
 	int ret;
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = file_inode(file);
+	struct timespec64 now = current_time(inode);
 
 	/*
 	 * Clear the security bits if the process is not being run by root.
@@ -2123,10 +2144,11 @@ int file_modified(struct file *file)
 			return ret;
 	}
 
-	if (unlikely(file->f_mode & FMODE_NOCMTIME))
-		return 0;
+	ret = need_file_update_time(inode, file, &now);
+	if (ret <= 0)
+		return ret;
 
-	return file_update_time(file);
+	return do_file_update_time(inode, file, &now, ret);
 }
 EXPORT_SYMBOL(file_modified);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c2992d12601f..e268a1a50357 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2384,6 +2384,11 @@ static inline void file_accessed(struct file *file)
 extern int need_file_remove_privs(struct inode *inode, struct dentry *dentry);
 extern int do_file_remove_privs(struct file *file, struct inode *inode,
 				struct dentry *dentry, int kill);
+extern int need_file_update_time(struct inode *inode, struct file *file,
+				struct timespec64 *now);
+extern int do_file_update_time(struct inode *inode, struct file *file,
+			struct timespec64 *now, int sync_mode);
+
 extern int file_modified(struct file *file);
 
 int sync_inode_metadata(struct inode *inode, int wait);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 09/18] fs: add pending file update time flag.
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (7 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This introduces an optimization for the update time flag and async
buffered writes. While an update of the file modification time is
pending and is handled by the workers, concurrent writes do not need
to wait for this time update to complete.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/inode.c         | 1 +
 include/linux/fs.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 64047bb0b9f8..f6d9877c2bb8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2091,6 +2091,7 @@ int do_file_update_time(struct inode *inode, struct file *file,
 		return 0;
 
 	ret = inode_update_time(inode, now, sync_mode);
+	inode->i_flags &= ~S_PENDING_TIME;
 	__mnt_drop_write_file(file);
 
 	return ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e268a1a50357..dc9060c0d629 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2141,6 +2141,8 @@ struct super_operations {
 #define S_CASEFOLD	(1 << 15) /* Casefolded file */
 #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
 #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_PENDING_TIME (1 << 18) /* File update time is pending */
+
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2183,6 +2185,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
 #define IS_ENCRYPTED(inode)	((inode)->i_flags & S_ENCRYPTED)
 #define IS_CASEFOLDED(inode)	((inode)->i_flags & S_CASEFOLD)
 #define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)
+#define IS_PENDING_TIME(inode) ((inode)->i_flags & S_PENDING_TIME)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (8 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 09/18] fs: add pending file update time flag Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 22:55   ` Dave Chinner
  2022-04-27 12:07   ` Christian Brauner
  2022-04-26 17:43 ` [RFC PATCH v1 11/18] xfs: add async buffered write support Stefan Roesch
                   ` (8 subsequent siblings)
  18 siblings, 2 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This modifies the xfs write checks to return -EAGAIN if the request either
requires removing privileges or needs to update the file modification
time. This is required for async buffered writes, so the request gets
handled in the io worker.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5bddb1e9e0b3..6f9da1059e8b 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -299,6 +299,43 @@ xfs_file_read_iter(
 	return ret;
 }
 
+static int xfs_file_modified(struct file *file, int flags)
+{
+	int ret;
+	struct dentry *dentry = file_dentry(file);
+	struct inode *inode = file_inode(file);
+	struct timespec64 now = current_time(inode);
+
+	/*
+	 * Clear the security bits if the process is not being run by root.
+	 * This keeps people from modifying setuid and setgid binaries.
+	 */
+	ret = need_file_remove_privs(inode, dentry);
+	if (ret < 0) {
+		return ret;
+	} else if (ret > 0) {
+		if (flags & IOCB_NOWAIT)
+			return -EAGAIN;
+
+		ret = do_file_remove_privs(file, inode, dentry, ret);
+		if (ret)
+			return ret;
+	}
+
+	ret = need_file_update_time(inode, file, &now);
+	if (ret <= 0)
+		return ret;
+	if (flags & IOCB_NOWAIT) {
+		if (IS_PENDING_TIME(inode))
+			return 0;
+
+		inode->i_flags |= S_PENDING_TIME;
+		return -EAGAIN;
+	}
+
+	return do_file_update_time(inode, file, &now, ret);
+}
+
 /*
  * Common pre-write limit and setup checks.
  *
@@ -410,7 +447,7 @@ xfs_file_write_checks(
 		spin_unlock(&ip->i_flags_lock);
 
 out:
-	return file_modified(file);
+	return xfs_file_modified(file, iocb->ki_flags);
 }
 
 static int
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (9 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 22:56   ` Dave Chinner
  2022-04-26 17:43 ` [RFC PATCH v1 12/18] io_uring: add support for async buffered writes Stefan Roesch
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This adds the async buffered write support to XFS. For async buffered
write requests, the request will return -EAGAIN if the ilock cannot be
obtained immediately.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/xfs/xfs_file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 6f9da1059e8b..49d54b939502 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -739,12 +739,14 @@ xfs_file_buffered_write(
 	bool			cleared_space = false;
 	int			iolock;
 
-	if (iocb->ki_flags & IOCB_NOWAIT)
-		return -EOPNOTSUPP;
-
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
-	xfs_ilock(ip, iolock);
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!xfs_ilock_nowait(ip, iolock))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, iolock);
+	}
 
 	ret = xfs_file_write_checks(iocb, from, &iolock);
 	if (ret)
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 12/18] io_uring: add support for async buffered writes
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (10 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 11/18] xfs: add async buffered write support Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes Stefan Roesch
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This enables async buffered writes in io_uring for the filesystems that
support them. Buffered writes are enabled for blocks that are already in
the page cache or can be acquired with noio.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/io_uring.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7625b29153b9..2ba615374ba4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3746,7 +3746,7 @@ static inline int io_iter_do_read(struct io_kiocb *req, struct iov_iter *iter)
 		return -EINVAL;
 }
 
-static bool need_read_all(struct io_kiocb *req)
+static bool need_complete_io(struct io_kiocb *req)
 {
 	return req->flags & REQ_F_ISREG ||
 		S_ISBLK(file_inode(req->file)->i_mode);
@@ -3874,7 +3874,7 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	} else if (ret == -EIOCBQUEUED) {
 		goto out_free;
 	} else if (ret == req->result || ret <= 0 || !force_nonblock ||
-		   (req->flags & REQ_F_NOWAIT) || !need_read_all(req)) {
+		   (req->flags & REQ_F_NOWAIT) || !need_complete_io(req)) {
 		/* read all, failed, already did sync or don't want to retry */
 		goto done;
 	}
@@ -3970,9 +3970,10 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (unlikely(!io_file_supports_nowait(req)))
 			goto copy_iov;
 
-		/* file path doesn't support NOWAIT for non-direct_IO */
-		if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT) &&
-		    (req->flags & REQ_F_ISREG))
+		/* File path supports NOWAIT for non-direct_IO only for block devices. */
+		if (!(kiocb->ki_flags & IOCB_DIRECT) &&
+			!(kiocb->ki_filp->f_mode & FMODE_BUF_WASYNC) &&
+			(req->flags & REQ_F_ISREG))
 			goto copy_iov;
 
 		kiocb->ki_flags |= IOCB_NOWAIT;
@@ -4026,6 +4027,24 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		/* IOPOLL retry should happen for io-wq threads */
 		if (ret2 == -EAGAIN && (req->ctx->flags & IORING_SETUP_IOPOLL))
 			goto copy_iov;
+
+		if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
+			struct io_async_rw *rw;
+
+			/* This is a partial write. The file pos has already been
+			 * updated, setup the async struct to complete the request
+			 * in the worker. Also update bytes_done to account for
+			 * the bytes already written.
+			 */
+			iov_iter_save_state(&s->iter, &s->iter_state);
+			ret = io_setup_async_rw(req, iovec, s, true);
+
+			rw = req->async_data;
+			if (rw)
+				rw->bytes_done += ret2;
+
+			return ret ? ret : -EAGAIN;
+		}
 done:
 		kiocb_done(req, ret2, issue_flags);
 	} else {
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (11 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 12/18] io_uring: add support for async buffered writes Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 14/18] sched: add new fields to task_struct Stefan Roesch
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This adds the io_uring_short_write tracepoint to io_uring. A short write
is issued if not all pages that are required for a write are in the page
cache and the async buffered write has to return -EAGAIN.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/io_uring.c                   |  3 +++
 include/trace/events/io_uring.h | 25 +++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2ba615374ba4..ace3a5cdda68 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4031,6 +4031,9 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret2 != req->result && ret2 >= 0 && need_complete_io(req)) {
 			struct io_async_rw *rw;
 
+			trace_io_uring_short_write(req->ctx, kiocb->ki_pos - ret2,
+						req->result, ret2);
+
 			/* This is a partial write. The file pos has already been
 			 * updated, setup the async struct to complete the request
 			 * in the worker. Also update bytes_done to account for
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index cddf5b6fbeb4..661834361d33 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -543,6 +543,31 @@ TRACE_EVENT(io_uring_req_failed,
 		  (unsigned long long) __entry->pad2, __entry->error)
 );
 
+TRACE_EVENT(io_uring_short_write,
+
+	TP_PROTO(void *ctx, u64 fpos, u64 wanted, u64 got),
+
+	TP_ARGS(ctx, fpos, wanted, got),
+
+	TP_STRUCT__entry(
+		__field(void *,	ctx)
+		__field(u64,	fpos)
+		__field(u64,	wanted)
+		__field(u64,	got)
+	),
+
+	TP_fast_assign(
+		__entry->ctx	= ctx;
+		__entry->fpos	= fpos;
+		__entry->wanted	= wanted;
+		__entry->got	= got;
+	),
+
+	TP_printk("ring %p, fpos %lld, wanted %lld, got %lld",
+			  __entry->ctx, __entry->fpos,
+			  __entry->wanted, __entry->got)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */
-- 
2.30.2
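
Once applied, the new tracepoint can be enabled through tracefs in the
usual way (assuming tracefs is mounted at /sys/kernel/tracing); short
writes then show up in the trace output:

  # echo 1 > /sys/kernel/tracing/events/io_uring/io_uring_short_write/enable
  # cat /sys/kernel/tracing/trace_pipe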



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 14/18] sched: add new fields to task_struct
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (12 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

Add two new fields to the task_struct to support async
write throttling.

  - One field to store how long writes are throttled: bdp_pause
  - The other field to store the number of dirtied pages:
    bdp_nr_dirtied_pause

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 include/linux/sched.h | 3 +++
 kernel/fork.c         | 1 +
 2 files changed, 4 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a8911b1f35aa..98d70f2945e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1327,6 +1327,9 @@ struct task_struct {
 	/* Start of a write-and-pause period: */
 	unsigned long			dirty_paused_when;
 
+	unsigned long			bdp_pause;
+	int				bdp_nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int				latency_record_count;
 	struct latency_record		latency_record[LT_SAVECOUNT];
diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..5bc6298827fc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2316,6 +2316,7 @@ static __latent_entropy struct task_struct *copy_process(
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
 	p->dirty_paused_when = 0;
+	p->bdp_nr_dirtied_pause = -1;
 
 	p->pdeath_signal = 0;
 	INIT_LIST_HEAD(&p->thread_group);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (13 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 14/18] sched: add new fields to task_struct Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-28 17:47   ` Jan Kara
  2022-04-26 17:43 ` [RFC PATCH v1 16/18] iomap: User " Stefan Roesch
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This change adds support for async write throttling in the function
balance_dirty_pages(). So far, if throttling was required, the code waited
synchronously for as long as the writes were throttled. This change
introduces asynchronous throttling. Instead of waiting in the function
balance_dirty_pages(), the timeout is set in the task_struct field
bdp_pause. Once the timeout has expired, the writes are no longer
throttled.

- Add a new parameter to the balance_dirty_pages() function
  - This allows the caller to pass in the nowait flag
  - When the nowait flag is specified, the code does not wait in
    balance_dirty_pages(), but instead stores the wait expiration in the
    new task_struct field bdp_pause.

- The function balance_dirty_pages_ratelimited() resets the new values
  in the task_struct, once the timeout has expired

This change is required to support write throttling for the async
buffered writes. While the writes are throttled, io_uring still can make
progress with processing other requests.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 include/linux/writeback.h |  1 +
 mm/page-writeback.c       | 54 ++++++++++++++++++++++++++++-----------
 2 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index fec248ab1fec..48176a8047db 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -373,6 +373,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb);
 void balance_dirty_pages_ratelimited(struct address_space *mapping);
+void  balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async);
 bool wb_over_bg_thresh(struct bdi_writeback *wb);
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7e2da284e427..a62aa8a4c2f2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1546,7 +1546,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
  * perform some writeout.
  */
 static void balance_dirty_pages(struct bdi_writeback *wb,
-				unsigned long pages_dirtied)
+				unsigned long pages_dirtied, bool is_async)
 {
 	struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
 	struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
@@ -1780,6 +1780,14 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 					  period,
 					  pause,
 					  start_time);
+		if (is_async) {
+			if (current->bdp_nr_dirtied_pause == -1) {
+				current->bdp_pause = now + pause;
+				current->bdp_nr_dirtied_pause = nr_dirtied_pause;
+			}
+			break;
+		}
+
 		__set_current_state(TASK_KILLABLE);
 		wb->dirty_sleep = now;
 		io_schedule_timeout(pause);
@@ -1787,6 +1795,8 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
 		current->dirty_paused_when = now + pause;
 		current->nr_dirtied = 0;
 		current->nr_dirtied_pause = nr_dirtied_pause;
+		current->bdp_nr_dirtied_pause = -1;
+		current->bdp_pause = 0;
 
 		/*
 		 * This is typically equal to (dirty < thresh) and can also
@@ -1851,19 +1861,7 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
  */
 DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
 
-/**
- * balance_dirty_pages_ratelimited - balance dirty memory state
- * @mapping: address_space which was dirtied
- *
- * Processes which are dirtying memory should call in here once for each page
- * which was newly dirtied.  The function will periodically check the system's
- * dirty state and will initiate writeback if needed.
- *
- * Once we're over the dirty memory limit we decrease the ratelimiting
- * by a lot, to prevent individual processes from overshooting the limit
- * by (ratelimit_pages) each.
- */
-void balance_dirty_pages_ratelimited(struct address_space *mapping)
+void balance_dirty_pages_ratelimited_flags(struct address_space *mapping, bool is_async)
 {
 	struct inode *inode = mapping->host;
 	struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -1874,6 +1872,15 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	if (!(bdi->capabilities & BDI_CAP_WRITEBACK))
 		return;
 
+	if (current->bdp_nr_dirtied_pause != -1 && time_after(jiffies, current->bdp_pause)) {
+		current->dirty_paused_when = current->bdp_pause;
+		current->nr_dirtied = 0;
+		current->nr_dirtied_pause = current->bdp_nr_dirtied_pause;
+
+		current->bdp_nr_dirtied_pause = -1;
+		current->bdp_pause = 0;
+	}
+
 	if (inode_cgwb_enabled(inode))
 		wb = wb_get_create_current(bdi, GFP_KERNEL);
 	if (!wb)
@@ -1912,10 +1919,27 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(wb, current->nr_dirtied);
+		balance_dirty_pages(wb, current->nr_dirtied, is_async);
 
 	wb_put(wb);
 }
+
+/**
+ * balance_dirty_pages_ratelimited - balance dirty memory state
+ * @mapping: address_space which was dirtied
+ *
+ * Processes which are dirtying memory should call in here once for each page
+ * which was newly dirtied.  The function will periodically check the system's
+ * dirty state and will initiate writeback if needed.
+ *
+ * Once we're over the dirty memory limit we decrease the ratelimiting
+ * by a lot, to prevent individual processes from overshooting the limit
+ * by (ratelimit_pages) each.
+ */
+void balance_dirty_pages_ratelimited(struct address_space *mapping)
+{
+	balance_dirty_pages_ratelimited_flags(mapping, false);
+}
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
 
 /**
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 16/18] iomap: User throttling for async buffered writes.
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (14 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This enables throttling for async buffered writes.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/iomap/buffered-io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 5c53a5715c85..0f0a6fe36699 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -843,7 +843,8 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		written += status;
 		length -= status;
 
-		balance_dirty_pages_ratelimited(iter->inode->i_mapping);
+		balance_dirty_pages_ratelimited_flags(iter->inode->i_mapping,
+						(iter->flags & IOMAP_NOWAIT));
 	} while (iov_iter_count(i) && length);
 
 	return written ? written : status;
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 17/18] io_uring: support write throttling for async buffered writes
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (15 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 16/18] iomap: User " Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
  2022-04-26 22:37 ` [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Dave Chinner
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This adds process-level block-layer write throttling for async buffered
writes to io-uring. In io_write the code now checks whether the write
needs to be throttled. If so, it adds the request to the list of
pending io requests and starts a timer. After the timer expires, it
submits the list of pending writes.

- Add new list called pending_ios for delayed writes (throttled writes)
  to struct io_uring_task. The list is protected by the task_lock spin
  lock.
- Add new timer to struct io_uring_task.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/io_uring.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ace3a5cdda68..4af654a82366 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -485,6 +485,11 @@ struct io_ring_ctx {
  */
 #define IO_RINGFD_REG_MAX 16
 
+struct pending_list {
+	struct list_head list;
+	struct io_kiocb *req;
+};
+
 struct io_uring_task {
 	/* submission side */
 	int			cached_refs;
@@ -501,6 +506,9 @@ struct io_uring_task {
 	struct callback_head	task_work;
 	struct file		**registered_rings;
 	bool			task_running;
+
+	struct pending_list	pending_ios;
+	struct timer_list	timer;
 };
 
 /*
@@ -1193,7 +1201,7 @@ static void io_rsrc_put_work(struct work_struct *work);
 
 static void io_req_task_queue(struct io_kiocb *req);
 static void __io_submit_flush_completions(struct io_ring_ctx *ctx);
-static int io_req_prep_async(struct io_kiocb *req);
+static int io_req_prep_async(struct io_kiocb *req, bool force);
 
 static int io_install_fixed_file(struct io_kiocb *req, struct file *file,
 				 unsigned int issue_flags, u32 slot_index);
@@ -1201,6 +1209,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
 
 static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
 static void io_eventfd_signal(struct io_ring_ctx *ctx);
+static void delayed_write_fn(struct timer_list *tmr);
 
 static struct kmem_cache *req_cachep;
 
@@ -2599,6 +2608,32 @@ static void io_req_task_queue_reissue(struct io_kiocb *req)
 	io_req_task_work_add(req, false);
 }
 
+static int io_req_task_queue_reissue_delayed(struct io_kiocb *req)
+{
+	struct io_uring_task *tctx = req->task->io_uring;
+	struct pending_list *pending;
+	bool empty;
+
+	pending = kmalloc(sizeof(struct pending_list), GFP_KERNEL);
+	if (!pending)
+		return -ENOMEM;
+	pending->req = req;
+
+	spin_lock_irq(&tctx->task_lock);
+	empty = list_empty(&tctx->pending_ios.list);
+	list_add_tail(&pending->list, &tctx->pending_ios.list);
+
+	if (empty) {
+		timer_setup(&tctx->timer, delayed_write_fn, 0);
+
+		tctx->timer.expires = current->bdp_pause;
+		add_timer(&tctx->timer);
+	}
+	spin_unlock_irq(&tctx->task_lock);
+
+	return 0;
+}
+
 static inline void io_queue_next(struct io_kiocb *req)
 {
 	struct io_kiocb *nxt = io_req_find_next(req);
@@ -2916,7 +2951,7 @@ static bool io_resubmit_prep(struct io_kiocb *req)
 	struct io_async_rw *rw = req->async_data;
 
 	if (!req_has_async_data(req))
-		return !io_req_prep_async(req);
+		return !io_req_prep_async(req, false);
 	iov_iter_restore(&rw->s.iter, &rw->s.iter_state);
 	return true;
 }
@@ -3938,6 +3973,38 @@ static int io_read(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static inline unsigned long write_delay(void)
+{
+	if (likely(current->bdp_nr_dirtied_pause == -1 ||
+			!time_before(jiffies, current->bdp_pause)))
+		return 0;
+
+	return current->bdp_pause;
+}
+
+static void delayed_write_fn(struct timer_list *tmr)
+{
+	struct io_uring_task *tctx = from_timer(tctx, tmr, timer);
+	struct list_head *curr;
+	struct list_head *next;
+	LIST_HEAD(pending_ios);
+
+	/* Move list to temporary list. */
+	spin_lock_irq(&tctx->task_lock);
+	list_splice_init(&tctx->pending_ios.list, &pending_ios);
+	spin_unlock_irq(&tctx->task_lock);
+
+	list_for_each_safe(curr, next, &pending_ios) {
+		struct pending_list *io;
+
+		io = list_entry(curr, struct pending_list, list);
+		io_req_task_queue_reissue(io->req);
+
+		list_del(curr);
+		kfree(io);
+	}
+}
+
 static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_rw_state __s, *s = &__s;
@@ -3947,6 +4014,18 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	ssize_t ret, ret2;
 	loff_t *ppos;
 
+	/* Write throttling active? */
+	if (!(kiocb->ki_flags & IOCB_DIRECT) && unlikely(write_delay())) {
+		int ret = io_req_prep_async(req, true);
+
+		if (unlikely(ret))
+			io_req_complete_failed(req, ret);
+		else
+			ret = io_req_task_queue_reissue_delayed(req);
+
+		return ret;
+	}
+
 	if (!req_has_async_data(req)) {
 		ret = io_import_iovec(WRITE, req, &iovec, s, issue_flags);
 		if (unlikely(ret < 0))
@@ -6962,9 +7041,9 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return -EINVAL;
 }
 
-static int io_req_prep_async(struct io_kiocb *req)
+static int io_req_prep_async(struct io_kiocb *req, bool force)
 {
-	if (!io_op_defs[req->opcode].needs_async_setup)
+	if (!force && !io_op_defs[req->opcode].needs_async_setup)
 		return 0;
 	if (WARN_ON_ONCE(req_has_async_data(req)))
 		return -EFAULT;
@@ -6974,6 +7053,10 @@ static int io_req_prep_async(struct io_kiocb *req)
 	switch (req->opcode) {
 	case IORING_OP_READV:
 		return io_rw_prep_async(req, READ);
+	case IORING_OP_WRITE:
+		if (!force)
+			break;
+		fallthrough;
 	case IORING_OP_WRITEV:
 		return io_rw_prep_async(req, WRITE);
 	case IORING_OP_SENDMSG:
@@ -6983,6 +7066,7 @@ static int io_req_prep_async(struct io_kiocb *req)
 	case IORING_OP_CONNECT:
 		return io_connect_prep_async(req);
 	}
+
 	printk_once(KERN_WARNING "io_uring: prep_async() bad opcode %d\n",
 		    req->opcode);
 	return -EFAULT;
@@ -7016,7 +7100,7 @@ static __cold void io_drain_req(struct io_kiocb *req)
 	}
 	spin_unlock(&ctx->completion_lock);
 
-	ret = io_req_prep_async(req);
+	ret = io_req_prep_async(req, false);
 	if (ret) {
 fail:
 		io_req_complete_failed(req, ret);
@@ -7550,7 +7634,7 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
 	} else if (unlikely(req->ctx->drain_active)) {
 		io_drain_req(req);
 	} else {
-		int ret = io_req_prep_async(req);
+		int ret = io_req_prep_async(req, false);
 
 		if (unlikely(ret))
 			io_req_complete_failed(req, ret);
@@ -7746,7 +7830,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		struct io_kiocb *head = link->head;
 
 		if (!(req->flags & REQ_F_FAIL)) {
-			ret = io_req_prep_async(req);
+			ret = io_req_prep_async(req, false);
 			if (unlikely(ret)) {
 				req_fail_link_node(req, ret);
 				if (!(head->flags & REQ_F_FAIL))
@@ -9222,6 +9306,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	INIT_WQ_LIST(&tctx->task_list);
 	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
+	INIT_LIST_HEAD(&tctx->pending_ios.list);
 	return 0;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [RFC PATCH v1 18/18] xfs: enable async buffered write support
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (16 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
@ 2022-04-26 17:43 ` Stefan Roesch
  2022-04-26 22:37 ` [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Dave Chinner
  18 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-26 17:43 UTC (permalink / raw)
  To: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel; +Cc: shr, david

This turns on the async buffered write support for XFS.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/xfs/xfs_file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 49d54b939502..7544dffaf032 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1210,7 +1210,7 @@ xfs_file_open(
 		return -EFBIG;
 	if (xfs_is_shutdown(XFS_M(inode->i_sb)))
 		return -EIO;
-	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC;
+	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC;
 	return 0;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
  2022-04-26 17:43 ` [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio() Stefan Roesch
@ 2022-04-26 19:06   ` Matthew Wilcox
  2022-04-28 19:54     ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Matthew Wilcox @ 2022-04-26 19:06 UTC (permalink / raw)
  To: Stefan Roesch
  Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david

On Tue, Apr 26, 2022 at 10:43:19AM -0700, Stefan Roesch wrote:
> Define FGP_ATOMIC flag and add support for the new flag in the function
> __filemap_get_folio().

No.  You don't get to use the emergency memory pools for doing writes.
That completely screws up the MM.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes
  2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
                   ` (17 preceding siblings ...)
  2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
@ 2022-04-26 22:37 ` Dave Chinner
  18 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2022-04-26 22:37 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Tue, Apr 26, 2022 at 10:43:17AM -0700, Stefan Roesch wrote:
> This patch series adds support for async buffered writes when using both
> xfs and io-uring. Currently io-uring only supports buffered writes in the
> slow path, by processing them in the io workers. With this patch series it is
> now possible to support buffered writes in the fast path. To be able to use
> the fast path the required pages must be in the page cache, the required locks
> in xfs can be granted immediately and no additional blocks need to be read
> form disk.
> 
> Updating the inode can take time. An optimization has been implemented for
> the time update. Time updates will be processed in the slow path. While there
> is already a time update in process, other write requests for the same file,
> can skip the update of the modification time.
>   
> 
> Performance results:
>   For fio the following results have been obtained with a queue depth of
>   1 and 4k block size (runtime 600 secs):
> 
>                  sequential writes:
>                  without patch                 with patch
>   iops:              80k                          269k
> 
> 
>                  random writes:
>                  without patch                 with patch
>   iops:              76k                          249k

What are the results when you enable the lazytime mount option to
elide almost all the timestamp updates from buffered writes?

As it is, on my test setup:

# fio --readwrite=write --filesize=5g --filename=/mnt/scratch/foo --runtime=30s --time_based --ioengine=libaio --direct=0 --bs=4k --iodepth=1 --numjobs=1 -name=/mnt/scratch/foobar --group_reporting --fallocate=none
.....
write: IOPS=296k, BW=1157MiB/s (1213MB/s)(33.9GiB/30001msec); 0 zone resets

So buffered IO is getting ~295k IOPS on a default kernel config.
Using --ioengine=psync gets ~325k IOPS. However, using
--ioengine=io_uring I only get:

.....
write: IOPS=58.5k, BW=229MiB/s (240MB/s)(6861MiB/30001msec); 0 zone resets

~60k IOPS. IOWs, normal buffered IO is 5x faster than the io_uring
slow path.

To summarise:

		seq IOPS	rand IOPS
libaio		~295k		~220k
psync		~325k		~255k
io_uring	 ~60k		 ~45k

IOWs, there's nothing wrong with the normal blocking IO path and it
easily reaches full single CPU performance on this machine. However,
there is a massive increase in overhead in using the io_uring slow
path.

Oh, wow, an average of 5 context switches per buffered write IO with
io_uring? There are zero context switches with psync or libaio. That
blocking path overhead seems like a general candidate for io_uring
optimisation, especially as most buffered writes are going to
require blocking for one reason or another regardless of whether we
add a NOWAIT fast path or not...

> For an io depth of 1, the new patch improves throughput by over three times
> (compared to the exiting behavior, where buffered writes are processed by an
> io-worker process) and also the latency is considerably reduced. To achieve the
> same or better performance with the exisiting code an io depth of 4 is required.
> Increasing the iodepth further does not lead to any further improvements.
> 
> Especially for mixed workloads this is a considerable improvement.

Improving the slow path so it's only half as slow as the existing
psync/libaio IO paths would also be a major improvement for mixed
workloads...

> Support for async buffered writes:
> 
>   Patch 1: block: add check for async buffered writes to generic_write_checks
>     Add a new flag FMODE_BUF_WASYNC so filesystems can specify that they support
>     async buffered writes and include the flag in the check of the function
>     generic_write_checks().
>     
>   Patch 2: mm: add FGP_ATOMIC flag to __filemap_get_folio()
>     This adds the FGP_ATOMIC flag. This allows to specify the gfp flags
>     for memory allocations for async buffered writes.

I haven't even looked at the patches yet, but this gets a hard NACK.
Unbound GFP_ATOMIC allocations in a user-controlled context are a
total non-starter.

>   Patch 3: iomap: add iomap_page_create_gfp to allocate iomap_pages
>     Add new function to allow specifying gfp flags when allocating and
>     initializing the structure iomap_page.
> 
>   Patch 4: iomap: use iomap_page_create_gfp() in __iomap_write_begin
>     Add a gfp flag to the iomap_page_create function.
>   
>   Patch 5: iomap: add async buffered write support
>     Set IOMAP_NOWAIT flag if IOCB_NOWAIT is set. Also use specific gfp flags if
>     the iomap_page structure is allocated for an async buffered page.
> 
>   Patch 6: xfs: add iomap async buffered write support
>     Add async buffered write support to the xfs iomap layer.
> 
> Support for async buffered write support and inode time modification
> 
> 
>   Patch 7: fs: split off need_remove_file_privs() do_remove_file_privs()
>     Splits of a check and action function, so they can later be invoked
>     in the nowait code path.

I don't like the sound of that. Separating the stripping of
SUID/SGID bits and other important things from the check opens a
potential window between the check and action where state changes
can be made that the check/action haven't taken into account...

This seems like a dangerous architectural direction to be headed as
it introduces an uncloseable race window where things can change
between checks and actions. Especially as it's likely easily
controllable from userspace because of the locking involved in
filesystem modification operations....

>   Patch 8: fs: split off need_file_update_time and do_file_update_time
>     Splits of a check and action function, so they can later be invoked
>     in the nowait code path.
>     
>   Patch 9: fs: add pending file update time flag.
>     Add new flag so consecutive write requests for the same inode can
>     proceed without waiting for the inode modification time to complete.

Isn't this exactly what the lazytime mount option already gives us?

>   Patch 10: xfs: enable async write file modification handling.
>     Enable async write handling in xfs for the file modification time
>     update. If the file modification update requires logging or removal
>     of privileges that needs to wait, -EAGAIN is returned.
> 
>   Patch 11: xfs: add async buffered write support
>     Take the ilock in nowait mode if async buffered writes are enabled.
> 
>   Patch 12: io_uring: add support for async buffered writes
>     This enables the async buffered writes optimization in io_uring.
>     Buffered writes are enabled for blocks that are already in the page
>     cache.
> 
>   Patch 13: io_uring: Add tracepoint for short writes
> 
> Support for write throttling of async buffered writes:
> 
>   Patch 14: sched: add new fields to task_struct
>     Add two new fields to the task_struct. These fields store the
>     deadline after which writes are no longer throttled.
> 
>   Patch 15: mm: support write throttling for async buffered writes
>     This changes the balance_dirty_pages function to take an additional
>     parameter. When nowait is specified the write throttling code no
>     longer waits synchronously for the deadline to expire. Instead
>     it sets the fields in task_struct. Once the deadline expires the
>     fields are reset.

How have you tested this? How does it interact with non io-uring
writers to files on the same fs/backing device? How does it interact
with other devices when they are at full writeback speed and
throttling writes on the dirty limits, too?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
  2022-04-26 17:43 ` [RFC PATCH v1 06/18] xfs: add iomap " Stefan Roesch
@ 2022-04-26 22:54   ` Dave Chinner
  2022-04-28 20:03     ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-04-26 22:54 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Tue, Apr 26, 2022 at 10:43:23AM -0700, Stefan Roesch wrote:
> This adds the async buffered write support to the iomap layer of XFS. If
> a lock cannot be acquired or additional reads need to be performed, the
> request will return -EAGAIN in case this is an async buffered write request.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>
> ---
>  fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
>  1 file changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e552ce541ec2..80b6c48e88af 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
>  	bool			eof = false, cow_eof = false, shared = false;
>  	int			allocfork = XFS_DATA_FORK;
>  	int			error = 0;
> +	bool			no_wait = (flags & IOMAP_NOWAIT);
>  
>  	if (xfs_is_shutdown(mp))
>  		return -EIO;
>  
>  	/* we can't use delayed allocations when using extent size hints */
> -	if (xfs_get_extsz_hint(ip))
> +	if (xfs_get_extsz_hint(ip)) {
> +		if (no_wait)
> +			return -EAGAIN;
> +
>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>  				flags, iomap, srcmap);

Why can't we do IOMAP_NOWAIT allocation here?
xfs_direct_write_iomap_begin() supports IOMAP_NOWAIT semantics just
fine - that's what the direct IO path uses...

> +	}
>  
>  	ASSERT(!XFS_IS_REALTIME_INODE(ip));
>  
> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	if (no_wait) {
> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> +			return -EAGAIN;
> +	} else {
> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	}

handled by xfs_ilock_for_iomap().
>  
>  	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
>  	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
> @@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
>  
>  	XFS_STATS_INC(mp, xs_blk_mapw);
>  
> +	if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
> +		error = -EAGAIN;
> +		goto out_unlock;
> +	}
> +

Also handled by xfs_ilock_for_iomap().

>  	error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
>  	if (error)
>  		goto out_unlock;
> @@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
>  	if (xfs_is_cow_inode(ip)) {
>  		if (!ip->i_cowfp) {
>  			ASSERT(!xfs_is_reflink_inode(ip));
> +			if (no_wait) {
> +				error = -EAGAIN;
> +				goto out_unlock;
> +			}
>  			xfs_ifork_init_cow(ip);
>  		}

Also handled by xfs_ilock_for_iomap().

>  		cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
> @@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
>  			goto found_imap;
>  		}
>  
> +		if (no_wait) {
> +			error = -EAGAIN;
> +			goto out_unlock;
> +		}
> +

Won't get here because this is COW path, and xfs_ilock_for_iomap()
handles returning -EAGAIN for that case.

>  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>  
>  		/* Trim the mapping to the nearest shared extent boundary. */
> @@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
>  			allocfork = XFS_COW_FORK;
>  	}
>  
> +	if (no_wait) {
> +		error = -EAGAIN;
> +		goto out_unlock;
> +	}

Why? Delayed allocation doesn't block...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
  2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
@ 2022-04-26 22:55   ` Dave Chinner
  2022-04-27 12:07   ` Christian Brauner
  1 sibling, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2022-04-26 22:55 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Tue, Apr 26, 2022 at 10:43:27AM -0700, Stefan Roesch wrote:
> This modifies xfs write checks to return -EAGAIN if the request either
> requires to remove privileges or needs to update the file modification
> time. This is required for async buffered writes, so the request gets
> handled in the io worker.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>
> ---
>  fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5bddb1e9e0b3..6f9da1059e8b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -299,6 +299,43 @@ xfs_file_read_iter(
>  	return ret;
>  }
>  
> +static int xfs_file_modified(struct file *file, int flags)
> +{
> +	int ret;
> +	struct dentry *dentry = file_dentry(file);
> +	struct inode *inode = file_inode(file);
> +	struct timespec64 now = current_time(inode);
> +
> +	/*
> +	 * Clear the security bits if the process is not being run by root.
> +	 * This keeps people from modifying setuid and setgid binaries.
> +	 */
> +	ret = need_file_remove_privs(inode, dentry);
> +	if (ret < 0) {
> +		return ret;
> +	} else if (ret > 0) {
> +		if (flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +
> +		ret = do_file_remove_privs(file, inode, dentry, ret);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	ret = need_file_update_time(inode, file, &now);
> +	if (ret <= 0)
> +		return ret;
> +	if (flags & IOCB_NOWAIT) {
> +		if (IS_PENDING_TIME(inode))
> +			return 0;
> +
> +		inode->i_flags |= S_PENDING_TIME;
> +		return -EAGAIN;
> +	}
> +
> +	return do_file_update_time(inode, file, &now, ret);
> +}

This has nothing XFS specific in it. It belongs in the VFS code.

-Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-04-26 17:43 ` [RFC PATCH v1 11/18] xfs: add async buffered write support Stefan Roesch
@ 2022-04-26 22:56   ` Dave Chinner
  2022-04-28 19:58     ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-04-26 22:56 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
> This adds the async buffered write support to XFS. For async buffered
> write requests, the request will return -EAGAIN if the ilock cannot be
> obtained immediately.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>
> ---
>  fs/xfs/xfs_file.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 6f9da1059e8b..49d54b939502 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
>  	bool			cleared_space = false;
>  	int			iolock;
>  
> -	if (iocb->ki_flags & IOCB_NOWAIT)
> -		return -EOPNOTSUPP;
> -
>  write_retry:
>  	iolock = XFS_IOLOCK_EXCL;
> -	xfs_ilock(ip, iolock);
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!xfs_ilock_nowait(ip, iolock))
> +			return -EAGAIN;
> +	} else {
> +		xfs_ilock(ip, iolock);
> +	}

xfs_ilock_iocb().

-Dave.

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 10/18] xfs: Enable async write file modification handling.
  2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
  2022-04-26 22:55   ` Dave Chinner
@ 2022-04-27 12:07   ` Christian Brauner
  1 sibling, 0 replies; 45+ messages in thread
From: Christian Brauner @ 2022-04-27 12:07 UTC (permalink / raw)
  To: Stefan Roesch
  Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david

On Tue, Apr 26, 2022 at 10:43:27AM -0700, Stefan Roesch wrote:
> This modifies xfs write checks to return -EAGAIN if the request either
> requires to remove privileges or needs to update the file modification
> time. This is required for async buffered writes, so the request gets
> handled in the io worker.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>
> ---
>  fs/xfs/xfs_file.c | 39 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5bddb1e9e0b3..6f9da1059e8b 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -299,6 +299,43 @@ xfs_file_read_iter(
>  	return ret;
>  }
>  
> +static int xfs_file_modified(struct file *file, int flags)

This should probably be in fs/inode.c as:

int file_modified_async(struct file *file, int flags)

and then file_modified() can simply become:

int file_modified(struct file *file)
{
	return file_modified_async(file, 0);
}

or even:

int file_modified_async(struct file *file, bool async)

int file_modified(struct file *file)
{
	return file_modified_async(file, false);
}

to avoid piecing this together specifically in xfs (as Dave mentioned
this is all pretty generic).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
@ 2022-04-28 17:47   ` Jan Kara
  2022-04-28 20:16     ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2022-04-28 17:47 UTC (permalink / raw)
  To: Stefan Roesch
  Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david

On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
> This change adds support for async write throttling in the function
> balance_dirty_pages(). So far if throttling was required, the code was
> waiting synchronously as long as the writes were throttled. This change
> introduces asynchronous throttling. Instead of waiting in the function
> balance_dirty_pages(), the timeout is set in the task_struct field
> bdp_pause. Once the timeout has expired, the writes are no longer
> throttled.
> 
> - Add a new parameter to the balance_dirty_pages() function
>   - This allows the caller to pass in the nowait flag
>   - When the nowait flag is specified, the code does not wait in
>     balance_dirty_pages(), but instead stores the wait expiration in the
>     new task_struct field bdp_pause.
> 
> - The function balance_dirty_pages_ratelimited() resets the new values
>   in the task_struct, once the timeout has expired
> 
> This change is required to support write throttling for the async
> buffered writes. While the writes are throttled, io_uring still can make
> progress with processing other requests.
> 
> Signed-off-by: Stefan Roesch <shr@fb.com>

Maybe I'm missing something, but I don't think this will throttle writers enough.
For three reasons:

1) The calculated throttling pauses should accumulate for the task so that
if we compute that say it takes 0.1s to write 100 pages and the task writes
300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
be throttled as long as we expect the writeback to take.

2) We must not allow the amount of dirty pages to exceed the dirty limit.
That can easily lead to page reclaim getting into trouble reclaiming pages
and thus machine stalls, oom kills etc. So if we are coming close to dirty
limit and we cannot sleep, we must just fail the nowait write.

3) Even with above two problems fixed I suspect results will be suboptimal
because balance_dirty_pages() heuristics assume they get called reasonably
often and throttle writes so if amount of dirty pages is coming close to
dirty limit, they think we are overestimating writeback speed and update
throttling parameters accordingly. So if io_uring code does not throttle
writers often enough, I think dirty throttling parameters will be jumping
wildly resulting in poor behavior.

So what I'd probably suggest is that if balance_dirty_pages() is called in
"async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
balance_dirty_pages() decides the task needs to wait, we store the pause
and bail all the way up into the place where we can sleep (io_uring code I
assume), sleep there, and then continue doing write.
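
To make the shape of that suggestion concrete, here is a rough sketch of
the control flow (illustrative only, not code from this series; dirty,
thresh, bg_thresh, pause and dirty_freerun_ceiling() are the existing
locals/helpers in balance_dirty_pages(), current->bdp_pause is the field
added in patch 14, and the nowait/-EAGAIN convention is an assumption):

	if (nowait) {
		/* Free-running range: no throttling needed at all. */
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh))
			return 0;

		/* Close to or over the dirty limit: fail the nowait write. */
		if (dirty >= thresh)
			return -EAGAIN;

		/*
		 * A pause was computed: record it for the task and bail
		 * out, so the io_uring layer can sleep (or defer the
		 * request) and retry the write once the pause expires.
		 */
		current->bdp_pause = jiffies + pause;
		return -EAGAIN;
	}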

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio()
  2022-04-26 19:06   ` Matthew Wilcox
@ 2022-04-28 19:54     ` Stefan Roesch
  0 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-04-28 19:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david



On 4/26/22 12:06 PM, Matthew Wilcox wrote:
> On Tue, Apr 26, 2022 at 10:43:19AM -0700, Stefan Roesch wrote:
>> Define FGP_ATOMIC flag and add support for the new flag in the function
>> __filemap_get_folio().
> 
> No.  You don't get to use the emergency memory pools for doing writes.
> That completely screws up the MM.

The next version of the patch series will drop this patch and, with it,
the need for the atomic flag.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-04-26 22:56   ` Dave Chinner
@ 2022-04-28 19:58     ` Stefan Roesch
  2022-04-28 21:54       ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-28 19:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel



On 4/26/22 3:56 PM, Dave Chinner wrote:
> On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
>> This adds the async buffered write support to XFS. For async buffered
>> write requests, the request will return -EAGAIN if the ilock cannot be
>> obtained immediately.
>>
>> Signed-off-by: Stefan Roesch <shr@fb.com>
>> ---
>>  fs/xfs/xfs_file.c | 10 ++++++----
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 6f9da1059e8b..49d54b939502 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
>>  	bool			cleared_space = false;
>>  	int			iolock;
>>  
>> -	if (iocb->ki_flags & IOCB_NOWAIT)
>> -		return -EOPNOTSUPP;
>> -
>>  write_retry:
>>  	iolock = XFS_IOLOCK_EXCL;
>> -	xfs_ilock(ip, iolock);
>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>> +		if (!xfs_ilock_nowait(ip, iolock))
>> +			return -EAGAIN;
>> +	} else {
>> +		xfs_ilock(ip, iolock);
>> +	}
> 
> xfs_ilock_iocb().
> 

The helper xfs_ilock_iocb cannot be used as it is hardcoded to use iocb->ki_filp to
get a pointer to the xfs_inode. However, here we need to use iocb->ki_filp->f_mapping->host.
I'll split off a new helper for this in the next version of the patch.

> -Dave.
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
  2022-04-26 22:54   ` Dave Chinner
@ 2022-04-28 20:03     ` Stefan Roesch
  2022-04-28 21:44       ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-28 20:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel



On 4/26/22 3:54 PM, Dave Chinner wrote:
> On Tue, Apr 26, 2022 at 10:43:23AM -0700, Stefan Roesch wrote:
>> This adds the async buffered write support to the iomap layer of XFS. If
>> a lock cannot be acquired or additional reads need to be performed, the
>> request will return -EAGAIN in case this is an async buffered write request.
>>
>> Signed-off-by: Stefan Roesch <shr@fb.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
>>  1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index e552ce541ec2..80b6c48e88af 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
>>  	bool			eof = false, cow_eof = false, shared = false;
>>  	int			allocfork = XFS_DATA_FORK;
>>  	int			error = 0;
>> +	bool			no_wait = (flags & IOMAP_NOWAIT);
>>  
>>  	if (xfs_is_shutdown(mp))
>>  		return -EIO;
>>  
>>  	/* we can't use delayed allocations when using extent size hints */
>> -	if (xfs_get_extsz_hint(ip))
>> +	if (xfs_get_extsz_hint(ip)) {
>> +		if (no_wait)
>> +			return -EAGAIN;
>> +
>>  		return xfs_direct_write_iomap_begin(inode, offset, count,
>>  				flags, iomap, srcmap);
> 
> Why can't we do IOMAP_NOWAIT allocation here?
> xfs_direct_write_iomap_begin() supports IOMAP_NOWAIT semantics just
> fine - that's what the direct IO path uses...

I'll make that change in the next version.
> 
>> +	}
>>  
>>  	ASSERT(!XFS_IS_REALTIME_INODE(ip));
>>  
>> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +	if (no_wait) {
>> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
>> +			return -EAGAIN;
>> +	} else {
>> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +	}
> 
> handled by xfs_ilock_for_iomap().

The helper xfs_ilock_for_iomap cannot be used for this purpose. The function
xfs_ilock_for_iomap tries to deduce the lock mode, but for the function
xfs_file_buffered_write() the lock mode must be exclusive. That is not always the
mode xfs_ilock_for_iomap chooses; there are cases where xfs_is_cow_inode() is not
true.

I'll add a new helper that will be used here.

>>  
>>  	if (XFS_IS_CORRUPT(mp, !xfs_ifork_has_extents(&ip->i_df)) ||
>>  	    XFS_TEST_ERROR(false, mp, XFS_ERRTAG_BMAPIFORMAT)) {
>> @@ -902,6 +912,11 @@ xfs_buffered_write_iomap_begin(
>>  
>>  	XFS_STATS_INC(mp, xs_blk_mapw);
>>  
>> +	if (no_wait && xfs_need_iread_extents(XFS_IFORK_PTR(ip, XFS_DATA_FORK))) {
>> +		error = -EAGAIN;
>> +		goto out_unlock;
>> +	}
>> +
> 
> Also handled by xfs_ilock_for_iomap().

I'll remove this check with the next version.

> 
>>  	error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
>>  	if (error)
>>  		goto out_unlock;
>> @@ -933,6 +948,10 @@ xfs_buffered_write_iomap_begin(
>>  	if (xfs_is_cow_inode(ip)) {
>>  		if (!ip->i_cowfp) {
>>  			ASSERT(!xfs_is_reflink_inode(ip));
>> +			if (no_wait) {
>> +				error = -EAGAIN;
>> +				goto out_unlock;
>> +			}
>>  			xfs_ifork_init_cow(ip);
>>  		}
> 
> Also handled by xfs_ilock_for_iomap().
> 

I'll remove this check with the next version.


>>  		cow_eof = !xfs_iext_lookup_extent(ip, ip->i_cowfp, offset_fsb,
>> @@ -956,6 +975,11 @@ xfs_buffered_write_iomap_begin(
>>  			goto found_imap;
>>  		}
>>  
>> +		if (no_wait) {
>> +			error = -EAGAIN;
>> +			goto out_unlock;
>> +		}
>> +
> 
> Won't get here because this is COW path, and xfs_ilock_for_iomap()
> handles returning -EAGAIN for that case.

I'll remove this check with the next version.

> 
>>  		xfs_trim_extent(&imap, offset_fsb, end_fsb - offset_fsb);
>>  
>>  		/* Trim the mapping to the nearest shared extent boundary. */
>> @@ -993,6 +1017,11 @@ xfs_buffered_write_iomap_begin(
>>  			allocfork = XFS_COW_FORK;
>>  	}
>>  
>> +	if (no_wait) {
>> +		error = -EAGAIN;
>> +		goto out_unlock;
>> +	}
> 
> Why? Delayed allocation doesn't block...

I'll remove this check with the next version.

> 
> Cheers,
> 
> Dave.
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-04-28 17:47   ` Jan Kara
@ 2022-04-28 20:16     ` Stefan Roesch
  2022-05-10  9:50       ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-04-28 20:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david



On 4/28/22 10:47 AM, Jan Kara wrote:
> On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
>> This change adds support for async write throttling in the function
>> balance_dirty_pages(). So far if throttling was required, the code was
>> waiting synchronously as long as the writes were throttled. This change
>> introduces asynchronous throttling. Instead of waiting in the function
>> balance_dirty_pages(), the timeout is set in the task_struct field
>> bdp_pause. Once the timeout has expired, the writes are no longer
>> throttled.
>>
>> - Add a new parameter to the balance_dirty_pages() function
>>   - This allows the caller to pass in the nowait flag
>>   - When the nowait flag is specified, the code does not wait in
>>     balance_dirty_pages(), but instead stores the wait expiration in the
>>     new task_struct field bdp_pause.
>>
>> - The function balance_dirty_pages_ratelimited() resets the new values
>>   in the task_struct, once the timeout has expired
>>
>> This change is required to support write throttling for the async
>> buffered writes. While the writes are throttled, io_uring still can make
>> progress with processing other requests.
>>
>> Signed-off-by: Stefan Roesch <shr@fb.com>
> 
> Maybe I miss something but I don't think this will throttle writers enough.
> For three reasons:
> 
> 1) The calculated throttling pauses should accumulate for the task so that
> if we compute that say it takes 0.1s to write 100 pages and the task writes
> 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
> be throttled as long as we expect the writeback to take.
> 
> 2) We must not allow the amount of dirty pages to exceed the dirty limit.
> That can easily lead to page reclaim getting into trouble reclaiming pages
> and thus machine stalls, oom kills etc. So if we are coming close to dirty
> limit and we cannot sleep, we must just fail the nowait write.
> 
> 3) Even with above two problems fixed I suspect results will be suboptimal
> because balance_dirty_pages() heuristics assume they get called reasonably
> often and throttle writes so if amount of dirty pages is coming close to
> dirty limit, they think we are overestimating writeback speed and update
> throttling parameters accordingly. So if io_uring code does not throttle
> writers often enough, I think dirty throttling parameters will be jumping
> wildly resulting in poor behavior.
> 
> So what I'd probably suggest is that if balance_dirty_pages() is called in
> "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
> balance_dirty_pages() decides the task needs to wait, we store the pause
> and bail all the way up into the place where we can sleep (io_uring code I
> assume), sleep there, and then continue doing write.
> 

Jan, thanks for the feedback. Are you suggesting changing the following check
in the function balance_dirty_pages():

                /*
                 * Throttle it only when the background writeback cannot
                 * catch-up. This avoids (excessively) small writeouts
                 * when the wb limits are ramping up in case of !strictlimit.
                 *
                 * In strictlimit case make decision based on the wb counters
                 * and limits. Small writeouts when the wb limits are ramping
                 * up are the price we consciously pay for strictlimit-ing.
                 *
                 * If memcg domain is in effect, @dirty should be under
                 * both global and memcg freerun ceilings.
                 */
                if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
                    (!mdtc ||
                     m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
                        unsigned long intv;
                        unsigned long m_intv;

so that it also takes async mode into account?


There is no direct way to return that the process should sleep. Instead two new
fields are introduced in the task_struct. These two fields are then used in
io_uring to determine if the writes for a task need to be throttled.

In case the writes need to be throttled, the writes are not issued, but instead
inserted on a wait queue. We cannot sleep in the general io_uring code path as
we still want to make progress on other requests while these writes are throttled.


> 								Honza
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 06/18] xfs: add iomap async buffered write support
  2022-04-28 20:03     ` Stefan Roesch
@ 2022-04-28 21:44       ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2022-04-28 21:44 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Thu, Apr 28, 2022 at 01:03:49PM -0700, Stefan Roesch wrote:
> 
> 
> On 4/26/22 3:54 PM, Dave Chinner wrote:
> > On Tue, Apr 26, 2022 at 10:43:23AM -0700, Stefan Roesch wrote:
> >> This adds the async buffered write support to the iomap layer of XFS. If
> >> a lock cannot be acquired or additional reads need to be performed, the
> >> request will return -EAGAIN in case this is an async buffered write request.
> >>
> >> Signed-off-by: Stefan Roesch <shr@fb.com>
> >> ---
> >>  fs/xfs/xfs_iomap.c | 33 +++++++++++++++++++++++++++++++--
> >>  1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> index e552ce541ec2..80b6c48e88af 100644
> >> --- a/fs/xfs/xfs_iomap.c
> >> +++ b/fs/xfs/xfs_iomap.c
> >> @@ -881,18 +881,28 @@ xfs_buffered_write_iomap_begin(
> >>  	bool			eof = false, cow_eof = false, shared = false;
> >>  	int			allocfork = XFS_DATA_FORK;
> >>  	int			error = 0;
> >> +	bool			no_wait = (flags & IOMAP_NOWAIT);
> >>  
> >>  	if (xfs_is_shutdown(mp))
> >>  		return -EIO;
> >>  
> >>  	/* we can't use delayed allocations when using extent size hints */
> >> -	if (xfs_get_extsz_hint(ip))
> >> +	if (xfs_get_extsz_hint(ip)) {
> >> +		if (no_wait)
> >> +			return -EAGAIN;
> >> +
> >>  		return xfs_direct_write_iomap_begin(inode, offset, count,
> >>  				flags, iomap, srcmap);
> > 
> > Why can't we do IOMAP_NOWAIT allocation here?
> > xfs_direct_write_iomap_begin() supports IOMAP_NOWAIT semantics just
> > fine - that's what the direct IO path uses...
> 
> I'll make that change in the next version.
> > 
> >> +	}
> >>  
> >>  	ASSERT(!XFS_IS_REALTIME_INODE(ip));
> >>  
> >> -	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >> +	if (no_wait) {
> >> +		if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
> >> +			return -EAGAIN;
> >> +	} else {
> >> +		xfs_ilock(ip, XFS_ILOCK_EXCL);
> >> +	}
> > 
> > handled by xfs_ilock_for_iomap().
> 
> The helper xfs_ilock_for_iomap cannot be used for this purpose. The function
> xfs_ilock_for_iomap tries to deduce the lock mode. For the function xfs_file_buffered_write()
> the lock mode must be exclusive. However this is not always the case when the
> helper xfs_ilock_for_iomap is used. There are cases where xfs_is_cow_inode(() is not
> true.

That's trivially fixed by having the caller initialise the lock_mode
to what they want, rather than have the function set the initial
type to SHARED....
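
For illustration, the shape of that change might be the following (a
sketch only, not the actual patch; it assumes the existing in/out
*lockmode parameter of xfs_ilock_for_iomap()):

	/*
	 * In xfs_ilock_for_iomap(), seed the lock mode from the caller
	 * rather than hardcoding the initial value:
	 *
	 *	unsigned	mode = *lockmode;
	 *
	 * so that the buffered write path can request an exclusive
	 * ilock and still reuse the existing IOMAP_NOWAIT handling:
	 */
	unsigned int		lockmode = XFS_ILOCK_EXCL;

	error = xfs_ilock_for_iomap(ip, flags, &lockmode);
	if (error)
		return error;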

> I'll add a new helper that will be used here.

... so I don't think a new helper is needed.

--Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-04-28 19:58     ` Stefan Roesch
@ 2022-04-28 21:54       ` Dave Chinner
  2022-05-02 21:21         ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-04-28 21:54 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
> 
> 
> On 4/26/22 3:56 PM, Dave Chinner wrote:
> > On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
> >> This adds the async buffered write support to XFS. For async buffered
> >> write requests, the request will return -EAGAIN if the ilock cannot be
> >> obtained immediately.
> >>
> >> Signed-off-by: Stefan Roesch <shr@fb.com>
> >> ---
> >>  fs/xfs/xfs_file.c | 10 ++++++----
> >>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> >> index 6f9da1059e8b..49d54b939502 100644
> >> --- a/fs/xfs/xfs_file.c
> >> +++ b/fs/xfs/xfs_file.c
> >> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
> >>  	bool			cleared_space = false;
> >>  	int			iolock;
> >>  
> >> -	if (iocb->ki_flags & IOCB_NOWAIT)
> >> -		return -EOPNOTSUPP;
> >> -
> >>  write_retry:
> >>  	iolock = XFS_IOLOCK_EXCL;
> >> -	xfs_ilock(ip, iolock);
> >> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> >> +		if (!xfs_ilock_nowait(ip, iolock))
> >> +			return -EAGAIN;
> >> +	} else {
> >> +		xfs_ilock(ip, iolock);
> >> +	}
> > 
> > xfs_ilock_iocb().
> > 
> 
> The helper xfs_ilock_iocb cannot be used as it hardcoded to use iocb->ki_filp to
> get a pointer to the xfs_inode.

And the problem with that is?

I mean, look at what xfs_file_buffered_write() does to get the
xfs_inode 10 lines above that change:

xfs_file_buffered_write(
        struct kiocb            *iocb,
        struct iov_iter         *from)
{
        struct file             *file = iocb->ki_filp;
        struct address_space    *mapping = file->f_mapping;
        struct inode            *inode = mapping->host;
        struct xfs_inode        *ip = XFS_I(inode);

In what cases does file_inode(iocb->ki_filp) point to a different
inode than iocb->ki_filp->f_mapping->host? The dio write path assumes
that file_inode(iocb->ki_filp) is correct, as do both the buffered
and dio read paths.

What makes the buffered write path special in that
file_inode(iocb->ki_filp) is not correctly set whilst
iocb->ki_filp->f_mapping->host is?

Regardless, if this is a problem, then just pass the XFS inode to
xfs_ilock_iocb() and this is a moot point.
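
A minimal sketch of that option (xfs_ilock_iocb_for_inode is a
hypothetical name; the existing xfs_ilock_iocb() could equally just
gain an xfs_inode argument):

	static int
	xfs_ilock_iocb_for_inode(
		struct xfs_inode	*ip,
		struct kiocb		*iocb,
		unsigned int		lock_mode)
	{
		/*
		 * Same IOCB_NOWAIT handling as xfs_ilock_iocb(), except
		 * that the caller supplies the inode to lock.
		 */
		if (iocb->ki_flags & IOCB_NOWAIT) {
			if (!xfs_ilock_nowait(ip, lock_mode))
				return -EAGAIN;
			return 0;
		}

		xfs_ilock(ip, lock_mode);
		return 0;
	}

xfs_file_buffered_write() would then pass the ip it already derives
from iocb->ki_filp->f_mapping->host, together with XFS_IOLOCK_EXCL.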

> However here we need to use iocb->ki_filp->f_mapping->host.
> I'll split off new helper for this in the next version of the patch.

We don't need a new helper here, either.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-04-28 21:54       ` Dave Chinner
@ 2022-05-02 21:21         ` Stefan Roesch
  2022-05-06  9:29           ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-05-02 21:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel



On 4/28/22 2:54 PM, Dave Chinner wrote:
> On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
>>
>>
>> On 4/26/22 3:56 PM, Dave Chinner wrote:
>>> On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
>>>> This adds the async buffered write support to XFS. For async buffered
>>>> write requests, the request will return -EAGAIN if the ilock cannot be
>>>> obtained immediately.
>>>>
>>>> Signed-off-by: Stefan Roesch <shr@fb.com>
>>>> ---
>>>>  fs/xfs/xfs_file.c | 10 ++++++----
>>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>>>> index 6f9da1059e8b..49d54b939502 100644
>>>> --- a/fs/xfs/xfs_file.c
>>>> +++ b/fs/xfs/xfs_file.c
>>>> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
>>>>  	bool			cleared_space = false;
>>>>  	int			iolock;
>>>>  
>>>> -	if (iocb->ki_flags & IOCB_NOWAIT)
>>>> -		return -EOPNOTSUPP;
>>>> -
>>>>  write_retry:
>>>>  	iolock = XFS_IOLOCK_EXCL;
>>>> -	xfs_ilock(ip, iolock);
>>>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>>>> +		if (!xfs_ilock_nowait(ip, iolock))
>>>> +			return -EAGAIN;
>>>> +	} else {
>>>> +		xfs_ilock(ip, iolock);
>>>> +	}
>>>
>>> xfs_ilock_iocb().
>>>
>>
>> The helper xfs_ilock_iocb cannot be used as it hardcoded to use iocb->ki_filp to
>> get a pointer to the xfs_inode.
> 
> And the problem with that is?
> 
> I mean, look at what xfs_file_buffered_write() does to get the
> xfs_inode 10 lines about that change:
> 
> xfs_file_buffered_write(
>         struct kiocb            *iocb,
>         struct iov_iter         *from)
> {
>         struct file             *file = iocb->ki_filp;
>         struct address_space    *mapping = file->f_mapping;
>         struct inode            *inode = mapping->host;
>         struct xfs_inode        *ip = XFS_I(inode);
> 
> In what cases does file_inode(iocb->ki_filp) point to a different
> inode than iocb->ki_filp->f_mapping->host? The dio write path assumes
> that file_inode(iocb->ki_filp) is correct, as do both the buffered
> and dio read paths.
> 
> What makes the buffered write path special in that
> file_inode(iocb->ki_filp) is not correctly set whilst
> iocb->ki_filp->f_mapping->host is?
> 

In the function xfs_file_buffered_write() the code calls the function 
xfs_ilock(). The xfs_inode pointer that is passed in is iocb->ki_filp->f_mapping->host.
The one used in xfs_ilock_iocb is ki_filp->f_inode.

After getting the lock, the code in xfs_file_buffered_write calls the
function xfs_buffered_write_iomap_begin(). In this function the code
calls xfs_ilock() for ki_filp->f_inode in exclusive mode.

If I replace the first xfs_ilock() call with xfs_ilock_iocb(), then it looks
like I get a deadlock.


Am I missing something?

I can:
- replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
  and also pass in the flags value as a parameter.
or
- create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
  calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
  will use xfs_ilock_inode().


> Regardless, if this is a problem, then just pass the XFS inode to
> xfs_ilock_iocb() and this is a moot point.
> 
>> However here we need to use iocb->ki_filp->f_mapping->host.
>> I'll split off new helper for this in the next version of the patch.
> 
> We don't need a new helper here, either.
> 
> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-02 21:21         ` Stefan Roesch
@ 2022-05-06  9:29           ` Dave Chinner
  2022-05-09 19:32             ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-05-06  9:29 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Mon, May 02, 2022 at 02:21:17PM -0700, Stefan Roesch wrote:
> 
> 
> On 4/28/22 2:54 PM, Dave Chinner wrote:
> > On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
> >>
> >>
> >> On 4/26/22 3:56 PM, Dave Chinner wrote:
> >>> On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
> >>>> This adds the async buffered write support to XFS. For async buffered
> >>>> write requests, the request will return -EAGAIN if the ilock cannot be
> >>>> obtained immediately.
> >>>>
> >>>> Signed-off-by: Stefan Roesch <shr@fb.com>
> >>>> ---
> >>>>  fs/xfs/xfs_file.c | 10 ++++++----
> >>>>  1 file changed, 6 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> >>>> index 6f9da1059e8b..49d54b939502 100644
> >>>> --- a/fs/xfs/xfs_file.c
> >>>> +++ b/fs/xfs/xfs_file.c
> >>>> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
> >>>>  	bool			cleared_space = false;
> >>>>  	int			iolock;
> >>>>  
> >>>> -	if (iocb->ki_flags & IOCB_NOWAIT)
> >>>> -		return -EOPNOTSUPP;
> >>>> -
> >>>>  write_retry:
> >>>>  	iolock = XFS_IOLOCK_EXCL;
> >>>> -	xfs_ilock(ip, iolock);
> >>>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> >>>> +		if (!xfs_ilock_nowait(ip, iolock))
> >>>> +			return -EAGAIN;
> >>>> +	} else {
> >>>> +		xfs_ilock(ip, iolock);
> >>>> +	}
> >>>
> >>> xfs_ilock_iocb().
> >>>
> >>
> >> The helper xfs_ilock_iocb cannot be used as it hardcoded to use iocb->ki_filp to
> >> get a pointer to the xfs_inode.
> > 
> > And the problem with that is?
> > 
> > I mean, look at what xfs_file_buffered_write() does to get the
> > xfs_inode 10 lines about that change:
> > 
> > xfs_file_buffered_write(
> >         struct kiocb            *iocb,
> >         struct iov_iter         *from)
> > {
> >         struct file             *file = iocb->ki_filp;
> >         struct address_space    *mapping = file->f_mapping;
> >         struct inode            *inode = mapping->host;
> >         struct xfs_inode        *ip = XFS_I(inode);
> > 
> > In what cases does file_inode(iocb->ki_filp) point to a different
> > inode than iocb->ki_filp->f_mapping->host? The dio write path assumes
> > that file_inode(iocb->ki_filp) is correct, as do both the buffered
> > and dio read paths.
> > 
> > What makes the buffered write path special in that
> > file_inode(iocb->ki_filp) is not correctly set whilst
> > iocb->ki_filp->f_mapping->host is?
> > 
> 
> In the function xfs_file_buffered_write() the code calls the function 
> xfs_ilock(). The xfs_inode pointer that is passed in is iocb->ki_filp->f_mapping->host.
> The one used in xfs_ilock_iocb is ki_filp->f_inode.
> 
> After getting the lock, the code in xfs_file_buffered_write calls the
> function xfs_buffered_write_iomap_begin(). In this function the code
> calls xfs_ilock() for ki_filp->f_inode in exclusive mode.
> 
> If I replace the first xfs_ilock() call with xfs_ilock_iocb(), then it looks
> like I get a deadlock.
> 
> Am I missing something?

Yes. They take different locks. xfs_file_buffered_write() takes the
IOLOCK, xfs_buffered_write_iomap_begin() takes the ILOCK....

> I can:
> - replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
>   and also pass in the flags value as a parameter.
> or
> - create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
>   calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
>   will use xfs_ilock_inode().

You're making this way more complex than it needs to be. As I said:

> > Regardless, if this is a problem, then just pass the XFS inode to
> > xfs_ilock_iocb() and this is a moot point.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-06  9:29           ` Dave Chinner
@ 2022-05-09 19:32             ` Stefan Roesch
  2022-05-09 23:24               ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-05-09 19:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel



On 5/6/22 2:29 AM, Dave Chinner wrote:
> On Mon, May 02, 2022 at 02:21:17PM -0700, Stefan Roesch wrote:
>>
>>
>> On 4/28/22 2:54 PM, Dave Chinner wrote:
>>> On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
>>>>
>>>>
>>>> On 4/26/22 3:56 PM, Dave Chinner wrote:
>>>>> On Tue, Apr 26, 2022 at 10:43:28AM -0700, Stefan Roesch wrote:
>>>>>> This adds the async buffered write support to XFS. For async buffered
>>>>>> write requests, the request will return -EAGAIN if the ilock cannot be
>>>>>> obtained immediately.
>>>>>>
>>>>>> Signed-off-by: Stefan Roesch <shr@fb.com>
>>>>>> ---
>>>>>>  fs/xfs/xfs_file.c | 10 ++++++----
>>>>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>>>>>> index 6f9da1059e8b..49d54b939502 100644
>>>>>> --- a/fs/xfs/xfs_file.c
>>>>>> +++ b/fs/xfs/xfs_file.c
>>>>>> @@ -739,12 +739,14 @@ xfs_file_buffered_write(
>>>>>>  	bool			cleared_space = false;
>>>>>>  	int			iolock;
>>>>>>  
>>>>>> -	if (iocb->ki_flags & IOCB_NOWAIT)
>>>>>> -		return -EOPNOTSUPP;
>>>>>> -
>>>>>>  write_retry:
>>>>>>  	iolock = XFS_IOLOCK_EXCL;
>>>>>> -	xfs_ilock(ip, iolock);
>>>>>> +	if (iocb->ki_flags & IOCB_NOWAIT) {
>>>>>> +		if (!xfs_ilock_nowait(ip, iolock))
>>>>>> +			return -EAGAIN;
>>>>>> +	} else {
>>>>>> +		xfs_ilock(ip, iolock);
>>>>>> +	}
>>>>>
>>>>> xfs_ilock_iocb().
>>>>>
>>>>
>>>> The helper xfs_ilock_iocb cannot be used as it is hardcoded to use iocb->ki_filp to
>>>> get a pointer to the xfs_inode.
>>>
>>> And the problem with that is?
>>>
>>> I mean, look at what xfs_file_buffered_write() does to get the
>>> xfs_inode 10 lines about that change:
>>>
>>> xfs_file_buffered_write(
>>>         struct kiocb            *iocb,
>>>         struct iov_iter         *from)
>>> {
>>>         struct file             *file = iocb->ki_filp;
>>>         struct address_space    *mapping = file->f_mapping;
>>>         struct inode            *inode = mapping->host;
>>>         struct xfs_inode        *ip = XFS_I(inode);
>>>
>>> In what cases does file_inode(iocb->ki_filp) point to a different
>>> inode than iocb->ki_filp->f_mapping->host? The dio write path assumes
>>> that file_inode(iocb->ki_filp) is correct, as do both the buffered
>>> and dio read paths.
>>>
>>> What makes the buffered write path special in that
>>> file_inode(iocb->ki_filp) is not correctly set whilst
>>> iocb->ki_filp->f_mapping->host is?
>>>
>>
>> In the function xfs_file_buffered_write() the code calls the function 
>> xfs_ilock(). The xfs_inode pointer that is passed in is iocb->ki_filp->f_mapping->host.
>> The one used in xfs_ilock_iocb is ki_filp->f_inode.
>>
>> After getting the lock, the code in xfs_file_buffered_write calls the
>> function xfs_buffered_write_iomap_begin(). In this function the code
>> calls xfs_ilock() for ki_filp->f_inode in exclusive mode.
>>
>> If I replace the first xfs_ilock() call with xfs_ilock_iocb(), then it looks
>> like I get a deadlock.
>>
>> Am I missing something?
> 
> Yes. They take different locks. xfs_file_buffered_write() takes the
> IOLOCK, xfs_buffered_write_iomap_begin() takes the ILOCK....
> 

Thanks for the clarification.

>> I can:
>> - replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
>>   and also pass in the flags value as a parameter.
>> or
>> - create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
>>   calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
>>   will use xfs_ilock_inode().
> 
> You're making this way more complex than it needs to be. As I said:
> 
>>> Regardless, if this is a problem, then just pass the XFS inode to
>>> xfs_ilock_iocb() and this is a moot point.
> 

The function xfs_ilock_iocb() is expecting a pointer to the data structure kiocb, not
a pointer to xfs_inode. I don't see how that's possible without changing the signature
of xfs_ilock_iocb().

Do you want to invoke xfs_ilock_nowait() directly?


> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-09 19:32             ` Stefan Roesch
@ 2022-05-09 23:24               ` Dave Chinner
  2022-05-09 23:44                 ` Darrick J. Wong
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-05-09 23:24 UTC (permalink / raw)
  To: Stefan Roesch; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Mon, May 09, 2022 at 12:32:59PM -0700, Stefan Roesch wrote:
> On 5/6/22 2:29 AM, Dave Chinner wrote:
> > On Mon, May 02, 2022 at 02:21:17PM -0700, Stefan Roesch wrote:
> >> On 4/28/22 2:54 PM, Dave Chinner wrote:
> >>> On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
> >> - replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
> >>   and also pass in the flags value as a parameter.
> >> or
> >> - create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
> >>   calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
> >>   will use xfs_ilock_inode().
> > 
> > You're making this way more complex than it needs to be. As I said:
> > 
> >>> Regardless, if this is a problem, then just pass the XFS inode to
> >>> xfs_ilock_iocb() and this is a moot point.
> > 
> 
> The function xfs_ilock_iocb() is expecting a pointer to the data structure kiocb, not
> a pointer to xfs_inode. I don't see how that's possible without changing the signature
> of xfs_ilock_iocb().

For the *third time*: pass the xfs_inode to xfs_ilock_iocb() and
update all the callers to do the same thing.
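
i.e. something along these lines (untested sketch only - the body is
paraphrased from the existing helper, with the inode passed in
explicitly instead of being derived from iocb->ki_filp):

static int
xfs_ilock_iocb(
        struct kiocb            *iocb,
        struct xfs_inode        *ip,
        unsigned int            lock_mode)
{
        if (iocb->ki_flags & IOCB_NOWAIT) {
                if (!xfs_ilock_nowait(ip, lock_mode))
                        return -EAGAIN;
        } else {
                xfs_ilock(ip, lock_mode);
        }
        return 0;
}

Then xfs_file_buffered_write() can hand it the ip it has already looked
up, and the other callers pass XFS_I(file_inode(iocb->ki_filp)).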

-Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-09 23:24               ` Dave Chinner
@ 2022-05-09 23:44                 ` Darrick J. Wong
  2022-05-10  1:12                   ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2022-05-09 23:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Tue, May 10, 2022 at 09:24:25AM +1000, Dave Chinner wrote:
> On Mon, May 09, 2022 at 12:32:59PM -0700, Stefan Roesch wrote:
> > On 5/6/22 2:29 AM, Dave Chinner wrote:
> > > On Mon, May 02, 2022 at 02:21:17PM -0700, Stefan Roesch wrote:
> > >> On 4/28/22 2:54 PM, Dave Chinner wrote:
> > >>> On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
> > >> - replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
> > >>   and also pass in the flags value as a parameter.
> > >> or
> > >> - create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
> > >>   calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
> > >>   will use xfs_ilock_inode().
> > > 
> > > You're making this way more complex than it needs to be. As I said:
> > > 
> > >>> Regardless, if this is a problem, then just pass the XFS inode to
> > >>> xfs_ilock_iocb() and this is a moot point.
> > > 
> > 
> > The function xfs_ilock_iocb() is expecting a pointer to the data structure kiocb, not
> > a pointer to xfs_inode. I don't see how that's possible without changing the signature
> > of xfs_ilock_iocb().
> 
> For the *third time*: pass the xfs_inode to xfs_ilock_iocb() and
> update all the callers to do the same thing.

I still don't understand why /any/ of this is necessary.  When does
iocb->ki_filp->f_inode != iocb->ki_filp->f_mapping->host? 

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-09 23:44                 ` Darrick J. Wong
@ 2022-05-10  1:12                   ` Dave Chinner
  2022-05-10  6:47                     ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-05-10  1:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Stefan Roesch, io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Mon, May 09, 2022 at 04:44:24PM -0700, Darrick J. Wong wrote:
> On Tue, May 10, 2022 at 09:24:25AM +1000, Dave Chinner wrote:
> > On Mon, May 09, 2022 at 12:32:59PM -0700, Stefan Roesch wrote:
> > > On 5/6/22 2:29 AM, Dave Chinner wrote:
> > > > On Mon, May 02, 2022 at 02:21:17PM -0700, Stefan Roesch wrote:
> > > >> On 4/28/22 2:54 PM, Dave Chinner wrote:
> > > >>> On Thu, Apr 28, 2022 at 12:58:59PM -0700, Stefan Roesch wrote:
> > > >> - replace the pointer to iocb with pointer to xfs_inode in the function xfs_ilock_iocb()
> > > >>   and also pass in the flags value as a parameter.
> > > >> or
> > > >> - create function xfs_ilock_inode(), which xfs_ilock_iocb() calls. The existing
> > > >>   calls will not need to change, only the xfs_ilock in xfs_file_buffered_write()
> > > >>   will use xfs_ilock_inode().
> > > > 
> > > > You're making this way more complex than it needs to be. As I said:
> > > > 
> > > >>> Regardless, if this is a problem, then just pass the XFS inode to
> > > >>> xfs_ilock_iocb() and this is a moot point.
> > > > 
> > > 
> > > The function xfs_ilock_iocb() is expecting a pointer to the data structure kiocb, not
> > > a pointer to xfs_inode. I don't see how that's possible without changing the signature
> > > of xfs_ilock_iocb().
> > 
> > For the *third time*: pass the xfs_inode to xfs_ilock_iocb() and
> > update all the callers to do the same thing.
> 
> I still don't understand why /any/ of this is necessary.  When does
> iocb->ki_filp->f_inode != iocb->ki_filp->f_mapping->host? 

I already asked that question because I don't know the answer,
either. I suspect the answer is "block dev inodes" but that then
just raises the question of "how do we get them here?" and I don't
know the answer to that, either. I don't have the time to dig into
this and I don't expect anyone to just pop up with an answer,
either. So in the mean time, we can just ignore it for the purpose
of this patch set...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-10  1:12                   ` Dave Chinner
@ 2022-05-10  6:47                     ` Christoph Hellwig
  2022-05-16  2:24                       ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-10  6:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Stefan Roesch, io-uring, kernel-team, linux-mm,
	linux-xfs, linux-fsdevel

On Tue, May 10, 2022 at 11:12:05AM +1000, Dave Chinner wrote:
> > I still don't understand why /any/ of this is necessary.  When does
> > iocb->ki_filp->f_inode != iocb->ki_filp->f_mapping->host? 
> 
> I already asked that question because I don't know the answer,
> either. I suspect the answer is "block dev inodes" but that then
> just raises the question of "how do we get them here?" and I don't
> know the answer to that, either. I don't have the time to dig into
> this and I don't expect anyone to just pop up with an answer,
> either. So in the mean time, we can just ignore it for the purpose
> of this patch set...

Weird device nodes (including block devices) are the answer.  It never
happens for a normal file system file struct that we'd see in XFS.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-04-28 20:16     ` Stefan Roesch
@ 2022-05-10  9:50       ` Jan Kara
  2022-05-10 20:16         ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2022-05-10  9:50 UTC (permalink / raw)
  To: Stefan Roesch
  Cc: Jan Kara, io-uring, kernel-team, linux-mm, linux-xfs,
	linux-fsdevel, david

Sorry for delayed reply. This has fallen through the cracks...

On Thu 28-04-22 13:16:19, Stefan Roesch wrote:
> On 4/28/22 10:47 AM, Jan Kara wrote:
> > On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
> >> This change adds support for async write throttling in the function
> >> balance_dirty_pages(). So far if throttling was required, the code was
> >> waiting synchronously as long as the writes were throttled. This change
> >> introduces asynchronous throttling. Instead of waiting in the function
> >> balance_dirty_pages(), the timeout is set in the task_struct field
> >> bdp_pause. Once the timeout has expired, the writes are no longer
> >> throttled.
> >>
> >> - Add a new parameter to the balance_dirty_pages() function
> >>   - This allows the caller to pass in the nowait flag
> >>   - When the nowait flag is specified, the code does not wait in
> >>     balance_dirty_pages(), but instead stores the wait expiration in the
> >>     new task_struct field bdp_pause.
> >>
> >> - The function balance_dirty_pages_ratelimited() resets the new values
> >>   in the task_struct, once the timeout has expired
> >>
> >> This change is required to support write throttling for the async
> >> buffered writes. While the writes are throttled, io_uring still can make
> >> progress with processing other requests.
> >>
> >> Signed-off-by: Stefan Roesch <shr@fb.com>
> > 
> > Maybe I miss something but I don't think this will throttle writers enough.
> > For three reasons:
> > 
> > 1) The calculated throttling pauses should accumulate for the task so that
> > if we compute that say it takes 0.1s to write 100 pages and the task writes
> > 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
> > be throttled as long as we expect the writeback to take.
> > 
> > 2) We must not allow the amount of dirty pages to exceed the dirty limit.
> > That can easily lead to page reclaim getting into trouble reclaiming pages
> > and thus machine stalls, oom kills etc. So if we are coming close to dirty
> > limit and we cannot sleep, we must just fail the nowait write.
> > 
> > 3) Even with above two problems fixed I suspect results will be suboptimal
> > because balance_dirty_pages() heuristics assume they get called reasonably
> > often and throttle writes so if amount of dirty pages is coming close to
> > dirty limit, they think we are overestimating writeback speed and update
> > throttling parameters accordingly. So if io_uring code does not throttle
> > writers often enough, I think dirty throttling parameters will be jumping
> > wildly resulting in poor behavior.
> > 
> > So what I'd probably suggest is that if balance_dirty_pages() is called in
> > "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
> > balance_dirty_pages() decides the task needs to wait, we store the pause
> > and bail all the way up into the place where we can sleep (io_uring code I
> > assume), sleep there, and then continue doing write.
> > 
> 
> Jan, thanks for the feedback. Are you suggesting to change the following
> check in the function balance_dirty_pages():
> 
>                 /*
>                  * Throttle it only when the background writeback cannot
>                  * catch-up. This avoids (excessively) small writeouts
>                  * when the wb limits are ramping up in case of !strictlimit.
>                  *
>                  * In strictlimit case make decision based on the wb counters
>                  * and limits. Small writeouts when the wb limits are ramping
>                  * up are the price we consciously pay for strictlimit-ing.
>                  *
>                  * If memcg domain is in effect, @dirty should be under
>                  * both global and memcg freerun ceilings.
>                  */
>                 if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
>                     (!mdtc ||
>                      m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
>                         unsigned long intv;
>                         unsigned long m_intv;
> 
> to include if we are in async mode?

Actually no. This condition is the one that gives any task a free pass
until dirty_freerun_ceiling(). So there's no need to do any modification
for that. Sorry, I've probably formulated my suggestion in a somewhat
confusing way.

> There is no direct way to return that the process should sleep. Instead
> two new fields are introduced in the proc structure. These two fields are
> then used in io_uring to determine if the writes for a task need to be
> throttled.
> 
> In case the writes need to be throttled, the writes are not issued, but
> instead inserted on a wait queue. We cannot sleep in the general io_uring
> code path as we still want to process other requests which are affected
> by the throttling.

Probably you wanted to say "are not affected by the throttling" in the
above.

I know that you're using fields in task_struct to propagate the delay info.
But IMHO that is unnecessary (although I don't care too much). Instead we
could factor out a variant of balance_dirty_pages() that returns 'pause' to
sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this
for pause calculation, places wanting async throttling would only get the
pause to sleep. So e.g. iomap_write_iter() would then check and if returned
pause is > 0, it would abort the loop similarly as we'd abort it for any
other reason when NOWAIT write is aborted because we need to sleep. io_uring
code then detects short write / EAGAIN and offloads the write to the
workqueue where normal balance_dirty_pages() can sleep as needed.

This will make sure dirty limits are properly observed and we don't need
that much special handling for it.
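
Something like the below (completely untested; the function name and
calling convention are made up just to illustrate the split):

	/*
	 * Hypothetical non-sleeping variant: compute the pause (in
	 * jiffies) the current task should be throttled for, but do
	 * not sleep here.  Returns 0 if no throttling is needed.
	 */
	long balance_dirty_pages_ratelimited_async(struct address_space *mapping);

	/* iomap_write_iter() could then do roughly: */
	if (iter->flags & IOMAP_NOWAIT) {
		if (balance_dirty_pages_ratelimited_async(mapping) > 0) {
			status = -EAGAIN; /* short write, io_uring punts to a worker */
			break;
		}
	} else {
		balance_dirty_pages_ratelimited(mapping);
	}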

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-05-10  9:50       ` Jan Kara
@ 2022-05-10 20:16         ` Stefan Roesch
  2022-05-11 10:38           ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Stefan Roesch @ 2022-05-10 20:16 UTC (permalink / raw)
  To: Jan Kara; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david



On 5/10/22 2:50 AM, Jan Kara wrote:
> Sorry for delayed reply. This has fallen through the cracks...
> 
> On Thu 28-04-22 13:16:19, Stefan Roesch wrote:
>> On 4/28/22 10:47 AM, Jan Kara wrote:
>>> On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
>>>> This change adds support for async write throttling in the function
>>>> balance_dirty_pages(). So far if throttling was required, the code was
>>>> waiting synchronously as long as the writes were throttled. This change
>>>> introduces asynchronous throttling. Instead of waiting in the function
>>>> balance_dirty_pages(), the timeout is set in the task_struct field
>>>> bdp_pause. Once the timeout has expired, the writes are no longer
>>>> throttled.
>>>>
>>>> - Add a new parameter to the balance_dirty_pages() function
>>>>   - This allows the caller to pass in the nowait flag
>>>>   - When the nowait flag is specified, the code does not wait in
>>>>     balance_dirty_pages(), but instead stores the wait expiration in the
>>>>     new task_struct field bdp_pause.
>>>>
>>>> - The function balance_dirty_pages_ratelimited() resets the new values
>>>>   in the task_struct, once the timeout has expired
>>>>
>>>> This change is required to support write throttling for the async
>>>> buffered writes. While the writes are throttled, io_uring still can make
>>>> progress with processing other requests.
>>>>
>>>> Signed-off-by: Stefan Roesch <shr@fb.com>
>>>
>>> Maybe I miss something but I don't think this will throttle writers enough.
>>> For three reasons:
>>>
>>> 1) The calculated throttling pauses should accumulate for the task so that
>>> if we compute that say it takes 0.1s to write 100 pages and the task writes
>>> 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not
>>> be throttled as long as we expect the writeback to take.
>>>
>>> 2) We must not allow the amount of dirty pages to exceed the dirty limit.
>>> That can easily lead to page reclaim getting into trouble reclaiming pages
>>> and thus machine stalls, oom kills etc. So if we are coming close to dirty
>>> limit and we cannot sleep, we must just fail the nowait write.
>>>
>>> 3) Even with above two problems fixed I suspect results will be suboptimal
>>> because balance_dirty_pages() heuristics assume they get called reasonably
>>> often and throttle writes so if amount of dirty pages is coming close to
>>> dirty limit, they think we are overestimating writeback speed and update
>>> throttling parameters accordingly. So if io_uring code does not throttle
>>> writers often enough, I think dirty throttling parameters will be jumping
>>> wildly resulting in poor behavior.
>>>
>>> So what I'd probably suggest is that if balance_dirty_pages() is called in
>>> "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If
>>> balance_dirty_pages() decides the task needs to wait, we store the pause
>>> and bail all the way up into the place where we can sleep (io_uring code I
>>> assume), sleep there, and then continue doing write.
>>>
>>
>> Jan, thanks for the feedback. Are you suggesting to change the following
>> check in the function balance_dirty_pages():
>>
>>                 /*
>>                  * Throttle it only when the background writeback cannot
>>                  * catch-up. This avoids (excessively) small writeouts
>>                  * when the wb limits are ramping up in case of !strictlimit.
>>                  *
>>                  * In strictlimit case make decision based on the wb counters
>>                  * and limits. Small writeouts when the wb limits are ramping
>>                  * up are the price we consciously pay for strictlimit-ing.
>>                  *
>>                  * If memcg domain is in effect, @dirty should be under
>>                  * both global and memcg freerun ceilings.
>>                  */
>>                 if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
>>                     (!mdtc ||
>>                      m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
>>                         unsigned long intv;
>>                         unsigned long m_intv;
>>
>> to include if we are in async mode?
> 
> Actually no. This condition is the one that gives any task a free pass
> until dirty_freerun_ceiling(). So there's no need to do any modification
> for that. Sorry, I've probably formulated my suggestion in a somewhat
> confusing way.
> 
>> There is no direct way to return that the process should sleep. Instead
>> two new fields are introduced in the proc structure. These two fields are
>> then used in io_uring to determine if the writes for a task need to be
>> throttled.
>>
>> In case the writes need to be throttled, the writes are not issued, but
>> instead inserted on a wait queue. We cannot sleep in the general io_uring
>> code path as we still want to process other requests which are affected
>> by the throttling.
> 
> Probably you wanted to say "are not affected by the throttling" in the
> above.
> 

Yes, that's correct.

> I know that you're using fields in task_struct to propagate the delay info.
> But IMHO that is unnecessary (although I don't care too much). Instead we
> could factor out a variant of balance_dirty_pages() that returns 'pause' to
> sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this
> for pause calculation, places wanting async throttling would only get the
> pause to sleep. So e.g. iomap_write_iter() would then check and if returned
> pause is > 0, it would abort the loop similarly as we'd abort it for any
> other reason when NOWAIT write is aborted because we need to sleep. io_uring
> code then detects short write / EAGAIN and offloads the write to the
> workqueue where normal balance_dirty_pages() can sleep as needed.
> 
> This will make sure dirty limits are properly observed and we don't need
> that much special handling for it.
>

I like the idea of factoring a function out of balance_dirty_pages(), however
I see two challenges:
- the write operation has already completed at this point, so we can't
  really sleep on its completion in the io-worker in io-uring
- we don't know how long to sleep in io-uring

Currently balance_dirty_pages_ratelimited() is called at the end of the function
iomap_write_iter(). If the function balance_dirty_pages_ratelimited() were instead
called at the beginning of the function iomap_write_iter(), we could return -EAGAIN
and then complete the write in the io-worker.

I'm not sure what the implications are of moving the function call to the beginning of
the function iomap_write_iter().
 
> 								Honza


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-05-10 20:16         ` Stefan Roesch
@ 2022-05-11 10:38           ` Jan Kara
  2022-05-13 18:57             ` Stefan Roesch
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2022-05-11 10:38 UTC (permalink / raw)
  To: Stefan Roesch
  Cc: Jan Kara, io-uring, kernel-team, linux-mm, linux-xfs,
	linux-fsdevel, david

On Tue 10-05-22 13:16:30, Stefan Roesch wrote:
> On 5/10/22 2:50 AM, Jan Kara wrote:
> > I know that you're using fields in task_struct to propagate the delay info.
> > But IMHO that is unnecessary (although I don't care too much). Instead we
> > could factor out a variant of balance_dirty_pages() that returns 'pause' to
> > sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this
> > for pause calculation, places wanting async throttling would only get the
> > pause to sleep. So e.g. iomap_write_iter() would then check and if returned
> > pause is > 0, it would abort the loop similarly as we'd abort it for any
> > other reason when NOWAIT write is aborted because we need to sleep. io_uring
> > code then detects short write / EAGAIN and offloads the write to the
> > workqueue where normal balance_dirty_pages() can sleep as needed.
> > 
> > This will make sure dirty limits are properly observed and we don't need
> > that much special handling for it.
> >
> 
> I like the idea of factoring a function out of balance_dirty_pages(), however
> 
> I see two challenges:
> - the write operation has already completed at this point,
> - so we can't really sleep on its completion in the io-worker in io-uring
> - we don't know how long to sleep in io-uring
> 
> Currently balance_dirty_pages_ratelimited() is called at the end of the
> function iomap_write_iter(). If the function
> balance_dirty_pages_ratelimited() would instead be called at the
> beginning of the function iomap_write_iter() we could return -EAGAIN and
> then complete it in the io-worker.

Well, we call balance_dirty_pages_ratelimited() after each page. So it does
not really matter much if the sleep is pushed to happen one page later.
balance_dirty_pages_ratelimited() does ratelimiting of when
balance_dirty_pages() is called, so we have to make sure
current->nr_dirtied is not zeroed out before we really do wait (because
that is what determines whether we enter balance_dirty_pages() and how long
we sleep there) but looking at the code that should work out just fine.
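
For reference, the gate in balance_dirty_pages_ratelimited() is roughly
(heavily simplified):

	if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause))
		balance_dirty_pages(wb, current->nr_dirtied);

and, as far as I can see, current->nr_dirtied is only reset inside
balance_dirty_pages() itself, so as long as the deferred sleep still
goes through that path the accounting stays intact.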

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes
  2022-05-11 10:38           ` Jan Kara
@ 2022-05-13 18:57             ` Stefan Roesch
  0 siblings, 0 replies; 45+ messages in thread
From: Stefan Roesch @ 2022-05-13 18:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: io-uring, kernel-team, linux-mm, linux-xfs, linux-fsdevel, david



On 5/11/22 3:38 AM, Jan Kara wrote:
> On Tue 10-05-22 13:16:30, Stefan Roesch wrote:
>> On 5/10/22 2:50 AM, Jan Kara wrote:
>>> I know that you're using fields in task_struct to propagate the delay info.
>>> But IMHO that is unnecessary (although I don't care too much). Instead we
>>> could factor out a variant of balance_dirty_pages() that returns 'pause' to
>>> sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this
>>> for pause calculation, places wanting async throttling would only get the
>>> pause to sleep. So e.g. iomap_write_iter() would then check and if returned
>>> pause is > 0, it would abort the loop similarly as we'd abort it for any
>>> other reason when NOWAIT write is aborted because we need to sleep. io_uring
>>> code then detects short write / EAGAIN and offloads the write to the
>>> workqueue where normal balance_dirty_pages() can sleep as needed.
>>>
>>> This will make sure dirty limits are properly observed and we don't need
>>> that much special handling for it.
>>>
>>
>> I like the idea of factoring out a function out balance_dirty_pages(), however
>>
>> I see two challenges:
>> - the write operation has already completed at this point,
>> - so we can't really sleep on its completion in the io-worker in io-uring
>> - we don't know how long to sleep in io-uring
>>
>> Currently balance_dirty_pages_ratelimited() is called at the end of the
>> function iomap_write_iter(). If the function
>> balance_dirty_pages_ratelimited() would instead be called at the
>> beginning of the function iomap_write_iter() we could return -EAGAIN and
>> then complete it in the io-worker.
> 
> Well, we call balance_dirty_pages_ratelimited() after each page. So it does
> not really matter much if the sleep is pushed to happen one page later.
> balance_dirty_pages_ratelimited() does ratelimiting of when
> balance_dirty_pages() is called, so we have to make sure
> current->nr_dirtied is not zeroed out before we really do wait (because
> that is what determines whether we enter balance_dirty_pages() and how long
> we sleep there) but looking at the code that should work out just fine.
> 

I'll make the changes to balance_dirty_pages() for the next version of the
patch series.

> 								Honza


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-10  6:47                     ` Christoph Hellwig
@ 2022-05-16  2:24                       ` Dave Chinner
  2022-05-16 13:39                         ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2022-05-16  2:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Stefan Roesch, io-uring, kernel-team, linux-mm,
	linux-xfs, linux-fsdevel

On Mon, May 09, 2022 at 11:47:59PM -0700, Christoph Hellwig wrote:
> On Tue, May 10, 2022 at 11:12:05AM +1000, Dave Chinner wrote:
> > > I still don't understand why /any/ of this is necessary.  When does
> > > iocb->ki_filp->f_inode != iocb->ki_filp->f_mapping->host? 
> > 
> > I already asked that question because I don't know the answer,
> > either. I suspect the answer is "block dev inodes" but that then
> > just raises the question of "how do we get them here?" and I don't
> > know the answer to that, either. I don't have the time to dig into
> > this and I don't expect anyone to just pop up with an answer,
> > either. So in the mean time, we can just ignore it for the purpose
> > of this patch set...
> 
> Weird device nodes (including block devices) are the answer.  It never
> happens for a normal file system file struct that we'd see in XFS.

Ok, so we can just use XFS_I(file_inode(iocb->ki_filp)) then and we
don't need to pass the xfs_inode at all. We probably should convert
the rest of the io path to do this as well so we don't end up
forgetting this again...
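
i.e. (sketch only) the buffered write path can keep using the existing
iocb-based helper:

write_retry:
	iolock = XFS_IOLOCK_EXCL;
	ret = xfs_ilock_iocb(iocb, iolock);
	if (ret)
		return ret;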

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH v1 11/18] xfs: add async buffered write support
  2022-05-16  2:24                       ` Dave Chinner
@ 2022-05-16 13:39                         ` Christoph Hellwig
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2022-05-16 13:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, Stefan Roesch, io-uring,
	kernel-team, linux-mm, linux-xfs, linux-fsdevel

On Mon, May 16, 2022 at 12:24:30PM +1000, Dave Chinner wrote:
> Ok, so we can just use XFS_I(file_inode(iocb->ki_filp)) then and we
> don't need to pass the xfs_inode at all. We probably should convert
> the rest of the io path to do this as well so we don't end up
> forgetting this again...

Yes.


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-05-16 13:40 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-26 17:43 [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 01/18] block: add check for async buffered writes to generic_write_checks Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 02/18] mm: add FGP_ATOMIC flag to __filemap_get_folio() Stefan Roesch
2022-04-26 19:06   ` Matthew Wilcox
2022-04-28 19:54     ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 03/18] iomap: add iomap_page_create_gfp to allocate iomap_pages Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 04/18] iomap: use iomap_page_create_gfp() in __iomap_write_begin Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 05/18] iomap: add async buffered write support Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 06/18] xfs: add iomap " Stefan Roesch
2022-04-26 22:54   ` Dave Chinner
2022-04-28 20:03     ` Stefan Roesch
2022-04-28 21:44       ` Dave Chinner
2022-04-26 17:43 ` [RFC PATCH v1 07/18] fs: split off need_remove_file_privs() do_remove_file_privs() Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 08/18] fs: split off need_file_update_time and do_file_update_time Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 09/18] fs: add pending file update time flag Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 10/18] xfs: Enable async write file modification handling Stefan Roesch
2022-04-26 22:55   ` Dave Chinner
2022-04-27 12:07   ` Christian Brauner
2022-04-26 17:43 ` [RFC PATCH v1 11/18] xfs: add async buffered write support Stefan Roesch
2022-04-26 22:56   ` Dave Chinner
2022-04-28 19:58     ` Stefan Roesch
2022-04-28 21:54       ` Dave Chinner
2022-05-02 21:21         ` Stefan Roesch
2022-05-06  9:29           ` Dave Chinner
2022-05-09 19:32             ` Stefan Roesch
2022-05-09 23:24               ` Dave Chinner
2022-05-09 23:44                 ` Darrick J. Wong
2022-05-10  1:12                   ` Dave Chinner
2022-05-10  6:47                     ` Christoph Hellwig
2022-05-16  2:24                       ` Dave Chinner
2022-05-16 13:39                         ` Christoph Hellwig
2022-04-26 17:43 ` [RFC PATCH v1 12/18] io_uring: add support for async buffered writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 13/18] io_uring: add tracepoint for short writes Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 14/18] sched: add new fields to task_struct Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 15/18] mm: support write throttling for async buffered writes Stefan Roesch
2022-04-28 17:47   ` Jan Kara
2022-04-28 20:16     ` Stefan Roesch
2022-05-10  9:50       ` Jan Kara
2022-05-10 20:16         ` Stefan Roesch
2022-05-11 10:38           ` Jan Kara
2022-05-13 18:57             ` Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 16/18] iomap: User " Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 17/18] io_uring: support write " Stefan Roesch
2022-04-26 17:43 ` [RFC PATCH v1 18/18] xfs: enable async buffered write support Stefan Roesch
2022-04-26 22:37 ` [RFC PATCH v1 00/18] io-uring/xfs: support async buffered writes Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).