[PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()

From: Chris Mason <clm@fb.com>
To: <djwong@kernel.org>, <hch@infradead.org>,
	<linux-xfs@vger.kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<hannes@cmpxchg.org>
Subject: [PATCH RFC] iomap: invalidate pages past eof in iomap_do_writepage()
Date: Tue, 31 May 2022 18:11:17 -0700	[thread overview]
Message-ID: <20220601011116.495988-1-clm@fb.com> (raw)

iomap_do_writepage() sends pages past i_size through
folio_redirty_for_writepage(), which normally isn't a problem because
truncate and friends clean them very quickly.

When the system a variety of cgroups, we can end up in situations where
one cgroup has almost no dirty pages at all.  This is especially common
in our XFS workloads in production because they tend to use O_DIRECT for
everything.

We've hit storms where the redirty path hits millions of times in a few
seconds, on all a single file that's only ~40 pages long.  This ends up
leading to long tail latencies for file writes because the page reclaim
workers are hogging the CPU from some kworkers bound to the same CPU.

That's the theory anyway.  We know the storms exist, but the tail
latencies are so sporadic that it's hard to have any certainty about the
cause without patching a large number of boxes.

There are a few different problems here.  First is just that I don't
understand how invalidating the page instead of redirtying might upset
the rest of the iomap/xfs world.  Btrfs invalidates in this case, which
seems like the right thing to me, but we all have slightly different
sharp corners in the truncate path so I thought I'd ask for comments.

Second is the VM should take wbc->pages_skipped into account, or use
some other way to avoid looping over and over.  I think we actually want
both but I wanted to understand the page invalidation path first.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Domas Mituzas <domas@fb.com>
---
 fs/iomap/buffered-io.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..4a687a2a9ed9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1482,10 +1482,8 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		pgoff_t end_index = isize >> PAGE_SHIFT;
 
 		/*
-		 * Skip the page if it's fully outside i_size, e.g. due to a
-		 * truncate operation that's in progress. We must redirty the
-		 * page so that reclaim stops reclaiming it. Otherwise
-		 * iomap_vm_releasepage() is called on it and gets confused.
+		 * invalidate the page if it's fully outside i_size, e.g.
+		 * due to a truncate operation that's in progress.
 		 *
 		 * Note that the end_index is unsigned long.  If the given
 		 * offset is greater than 16TB on a 32-bit system then if we
@@ -1499,8 +1497,10 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 		 * offset is just equal to the EOF.
 		 */
 		if (folio->index > end_index ||
-		    (folio->index == end_index && poff == 0))
-			goto redirty;
+		    (folio->index == end_index && poff == 0)) {
+			folio_invalidate(folio, 0, folio_size(folio));
+			goto unlock;
+		}
 
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -1518,6 +1518,7 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 
 redirty:
 	folio_redirty_for_writepage(wbc, folio);
+unlock:
 	folio_unlock(folio);
 	return 0;
 }
-- 
2.30.2