* [RFC] [PATCH 0/3] xfs: use large folios for buffers
@ 2024-01-18 22:19 Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
aggregating multiple pages into a single contiguous memory region using
vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
and would be unnecessary if we could allocate large contiguous memory regions
for the buffers in the first place.

Enter multi-page folios.
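
As a rough illustration of the difference (sketch only -- the helper names are
made up for illustration, error handling is omitted, and the real code is in
the patches below):

	/* Existing path: allocate N single pages, glue them with vm_map_ram() */
	static void *vmap_buffer(struct page **pages, unsigned int nr_pages,
			gfp_t gfp)
	{
		unsigned int i;

		for (i = 0; i < nr_pages; i++)
			pages[i] = alloc_page(gfp);

		/* expensive: new kernel virtual mappings and TLB pressure */
		return vm_map_ram(pages, nr_pages, -1);
	}

	/* Large folio path: one physically contiguous allocation, no vmap */
	static void *folio_buffer(unsigned int order, gfp_t gfp)
	{
		struct folio *folio = folio_alloc(gfp, order);

		return folio ? folio_address(folio) : NULL;
	}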

This patchset converts the buffer cache to use the folio API, then enhances it
to optimistically use large folios where possible. It retains the old "vmap an
array of single page folios" functionality as a fallback when large folio
allocation fails. This means that, like page cache support for large folios, we
aren't dependent on large folio allocation succeeding all the time.

This relegates the single page array allocation mechanism to a "slow path"
whose performance we no longer need to care so much about. This might allow us
to simplify it a bit in future.

One of the issues with the folio conversion is that we use a couple of APIs that
take struct page ** (i.e. pointers to page pointer arrays) and there aren't
folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
that this array will only contain single page folios and that single page folios
and struct page are the same structure and so have the same address. This is a
bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
versions of these interfaces right now. We don't need to use the bulk page
allocator so much any more, because that's now a slow path and we could probably
just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
is a little less clear....
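
For reference, the casts in question look like this in patch 2 (valid only
because every folio in bp->b_folios[] is a single page folio and so has the
same address as its struct page):

	filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
					(struct page **)bp->b_folios);
	...
	bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
				bp->b_folio_count, -1);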

The other issue I tripped over in doing this conversion is that the
discontiguous buffer straddling code in the buf log item dirty region tracking
is broken. We don't actually exercise that code on existing configurations, and
I tripped over it when tracking down a bug in the folio conversion. I fixed it
and short-circuited the check for contiguous buffers, but that didn't fix the
failure I was seeing (which was not handling bp->b_offset and large folios
properly when building bios).

Apart from those issues, the conversion and enhancement is relatively
straightforward.  It passes fstests on both 512 and 4096 byte sector size storage (512
byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
and doesn't appear to cause any problems with large directory buffers, though I
haven't done any real testing on those yet. Large folio allocations are
definitely being exercised, though, as all the inode cluster buffers are 16kB on
a 512 byte inode V5 filesystem.

Thoughts, comments, etc?

Note: this patchset is on top of the NOFS removal patchset I sent a
few days ago. That can be pulled from this git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup



* [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-22  6:41   ` Christoph Hellwig
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

We never log large contiguous regions of unmapped buffers, so this
bug is never triggered by the current code. However, the slowpath
for formatting buffer straddling regions is broken.

That is, the size and shape of the log vector calculated across a
straddle does not match how the formatting code formats a straddle.
This results in a log vector with an uninitialised iovec and this
causes a crash when xlog_write_full() goes to copy the iovec into
the journal.

Whilst touching this code, don't bother checking mapped or single
folio buffers for discontiguous regions because they don't have
them. This significantly reduces the overhead of this check when
logging large buffers as calling xfs_buf_offset() is not free and
it occurs a *lot* in those cases.

Fixes: 929f8b0deb83 ("xfs: optimise xfs_buf_item_size/format for contiguous regions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..83a81cb52d8e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,6 +56,10 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
+/*
+ * We only have to worry about discontiguous buffer range straddling on unmapped
+ * buffers. Everything else will have a contiguous data region we can copy from.
+ */
 static inline bool
 xfs_buf_item_straddle(
 	struct xfs_buf		*bp,
@@ -65,6 +69,9 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
+	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+		return false;
+
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
 	last = xfs_buf_offset(bp,
 			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
@@ -132,11 +139,13 @@ xfs_buf_item_size_segment(
 	return;
 
 slow_scan:
-	/* Count the first bit we jumped out of the above loop from */
-	(*nvecs)++;
-	*nbytes += XFS_BLF_CHUNK;
+	ASSERT(bp->b_addr == NULL);
 	last_bit = first_bit;
+	nbits = 1;
 	while (last_bit != -1) {
+
+		*nbytes += XFS_BLF_CHUNK;
+
 		/*
 		 * This takes the bit number to start looking from and
 		 * returns the next set bit from there.  It returns -1
@@ -151,6 +160,8 @@ xfs_buf_item_size_segment(
 		 * else keep scanning the current set of bits.
 		 */
 		if (next_bit == -1) {
+			if (first_bit != last_bit)
+				(*nvecs)++;
 			break;
 		} else if (next_bit != last_bit + 1 ||
 		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
@@ -162,7 +173,6 @@ xfs_buf_item_size_segment(
 			last_bit++;
 			nbits++;
 		}
-		*nbytes += XFS_BLF_CHUNK;
 	}
 }
 
-- 
2.43.0



* [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-19  1:26   ` Darrick J. Wong
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Convert the use of struct pages to struct folio everywhere. This
is just direct API conversion; no actual logic or code changes
should result.

Note: this conversion currently assumes only single page folios are
allocated, and because some of the MM interfaces we use take
pointers to arrays of struct pages, the address of single page
folios and struct pages are the same, e.g. alloc_pages_bulk_array(),
vm_map_ram(), etc.


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c      | 127 +++++++++++++++++++++---------------------
 fs/xfs/xfs_buf.h      |  14 ++---
 fs/xfs/xfs_buf_item.c |   2 +-
 fs/xfs/xfs_linux.h    |   8 +++
 4 files changed, 80 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 08f2fbc04db5..15907e92d0d3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -60,25 +60,25 @@ xfs_buf_submit(
 	return __xfs_buf_submit(bp, !(bp->b_flags & XBF_ASYNC));
 }
 
+/*
+ * Return true if the buffer is vmapped.
+ *
+ * b_addr is null if the buffer is not mapped, but the code is clever enough to
+ * know it doesn't have to map a single folio, so the check has to be both for
+ * b_addr and bp->b_folio_count > 1.
+ */
 static inline int
 xfs_buf_is_vmapped(
 	struct xfs_buf	*bp)
 {
-	/*
-	 * Return true if the buffer is vmapped.
-	 *
-	 * b_addr is null if the buffer is not mapped, but the code is clever
-	 * enough to know it doesn't have to map a single page, so the check has
-	 * to be both for b_addr and bp->b_page_count > 1.
-	 */
-	return bp->b_addr && bp->b_page_count > 1;
+	return bp->b_addr && bp->b_folio_count > 1;
 }
 
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
 {
-	return (bp->b_page_count * PAGE_SIZE);
+	return (bp->b_folio_count * PAGE_SIZE);
 }
 
 /*
@@ -197,7 +197,7 @@ xfs_buf_get_maps(
 }
 
 /*
- *	Frees b_pages if it was allocated.
+ *	Frees b_maps if it was allocated.
  */
 static void
 xfs_buf_free_maps(
@@ -273,26 +273,26 @@ _xfs_buf_alloc(
 }
 
 static void
-xfs_buf_free_pages(
+xfs_buf_free_folios(
 	struct xfs_buf	*bp)
 {
 	uint		i;
 
-	ASSERT(bp->b_flags & _XBF_PAGES);
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
 
 	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
 
-	for (i = 0; i < bp->b_page_count; i++) {
-		if (bp->b_pages[i])
-			__free_page(bp->b_pages[i]);
+	for (i = 0; i < bp->b_folio_count; i++) {
+		if (bp->b_folios[i])
+			__folio_put(bp->b_folios[i]);
 	}
-	mm_account_reclaimed_pages(bp->b_page_count);
+	mm_account_reclaimed_pages(bp->b_folio_count);
 
-	if (bp->b_pages != bp->b_page_array)
-		kfree(bp->b_pages);
-	bp->b_pages = NULL;
-	bp->b_flags &= ~_XBF_PAGES;
+	if (bp->b_folios != bp->b_folio_array)
+		kfree(bp->b_folios);
+	bp->b_folios = NULL;
+	bp->b_flags &= ~_XBF_FOLIOS;
 }
 
 static void
@@ -313,8 +313,8 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (bp->b_flags & _XBF_PAGES)
-		xfs_buf_free_pages(bp);
+	if (bp->b_flags & _XBF_FOLIOS)
+		xfs_buf_free_folios(bp);
 	else if (bp->b_flags & _XBF_KMEM)
 		kfree(bp->b_addr);
 
@@ -345,15 +345,15 @@ xfs_buf_alloc_kmem(
 		return -ENOMEM;
 	}
 	bp->b_offset = offset_in_page(bp->b_addr);
-	bp->b_pages = bp->b_page_array;
-	bp->b_pages[0] = kmem_to_page(bp->b_addr);
-	bp->b_page_count = 1;
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
 static int
-xfs_buf_alloc_pages(
+xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
 	xfs_buf_flags_t	flags)
 {
@@ -364,16 +364,16 @@ xfs_buf_alloc_pages(
 		gfp_mask |= __GFP_NORETRY;
 
 	/* Make sure that we have a page list */
-	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
-	if (bp->b_page_count <= XB_PAGES) {
-		bp->b_pages = bp->b_page_array;
+	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	if (bp->b_folio_count <= XB_FOLIOS) {
+		bp->b_folios = bp->b_folio_array;
 	} else {
-		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
+		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
 					gfp_mask);
-		if (!bp->b_pages)
+		if (!bp->b_folios)
 			return -ENOMEM;
 	}
-	bp->b_flags |= _XBF_PAGES;
+	bp->b_flags |= _XBF_FOLIOS;
 
 	/* Assure zeroed buffer for non-read cases. */
 	if (!(flags & XBF_READ))
@@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
 	for (;;) {
 		long	last = filled;
 
-		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
-						bp->b_pages);
-		if (filled == bp->b_page_count) {
+		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
+						(struct page **)bp->b_folios);
+		if (filled == bp->b_folio_count) {
 			XFS_STATS_INC(bp->b_mount, xb_page_found);
 			break;
 		}
@@ -398,7 +398,7 @@ xfs_buf_alloc_pages(
 			continue;
 
 		if (flags & XBF_READ_AHEAD) {
-			xfs_buf_free_pages(bp);
+			xfs_buf_free_folios(bp);
 			return -ENOMEM;
 		}
 
@@ -412,14 +412,14 @@ xfs_buf_alloc_pages(
  *	Map buffer into kernel address-space if necessary.
  */
 STATIC int
-_xfs_buf_map_pages(
+_xfs_buf_map_folios(
 	struct xfs_buf		*bp,
 	xfs_buf_flags_t		flags)
 {
-	ASSERT(bp->b_flags & _XBF_PAGES);
-	if (bp->b_page_count == 1) {
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
+	if (bp->b_folio_count == 1) {
 		/* A single page buffer is always mappable */
-		bp->b_addr = page_address(bp->b_pages[0]);
+		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
@@ -443,8 +443,8 @@ _xfs_buf_map_pages(
 		 */
 		nofs_flag = memalloc_nofs_save();
 		do {
-			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-						-1);
+			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
+					bp->b_folio_count, -1);
 			if (bp->b_addr)
 				break;
 			vm_unmap_aliases();
@@ -571,7 +571,7 @@ xfs_buf_find_lock(
 			return -ENOENT;
 		}
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
+		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
 		bp->b_ops = NULL;
 	}
 	return 0;
@@ -629,14 +629,15 @@ xfs_buf_find_insert(
 		goto out_drop_pag;
 
 	/*
-	 * For buffers that fit entirely within a single page, first attempt to
-	 * allocate the memory from the heap to minimise memory usage. If we
-	 * can't get heap memory for these small buffers, we fall back to using
-	 * the page allocator.
+	 * For buffers that fit entirely within a single page folio, first
+	 * attempt to allocate the memory from the heap to minimise memory
+	 * usage. If we can't get heap memory for these small buffers, we fall
+	 * back to using the page allocator.
 	 */
+
 	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
 	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
-		error = xfs_buf_alloc_pages(new_bp, flags);
+		error = xfs_buf_alloc_folios(new_bp, flags);
 		if (error)
 			goto out_free_buf;
 	}
@@ -728,11 +729,11 @@ xfs_buf_get_map(
 
 	/* We do not hold a perag reference anymore. */
 	if (!bp->b_addr) {
-		error = _xfs_buf_map_pages(bp, flags);
+		error = _xfs_buf_map_folios(bp, flags);
 		if (unlikely(error)) {
 			xfs_warn_ratelimited(btp->bt_mount,
-				"%s: failed to map %u pages", __func__,
-				bp->b_page_count);
+				"%s: failed to map %u folios", __func__,
+				bp->b_folio_count);
 			xfs_buf_relse(bp);
 			return error;
 		}
@@ -963,14 +964,14 @@ xfs_buf_get_uncached(
 	if (error)
 		return error;
 
-	error = xfs_buf_alloc_pages(bp, flags);
+	error = xfs_buf_alloc_folios(bp, flags);
 	if (error)
 		goto fail_free_buf;
 
-	error = _xfs_buf_map_pages(bp, 0);
+	error = _xfs_buf_map_folios(bp, 0);
 	if (unlikely(error)) {
 		xfs_warn(target->bt_mount,
-			"%s: failed to map pages", __func__);
+			"%s: failed to map folios", __func__);
 		goto fail_free_buf;
 	}
 
@@ -1465,7 +1466,7 @@ xfs_buf_ioapply_map(
 	blk_opf_t	op)
 {
 	int		page_index;
-	unsigned int	total_nr_pages = bp->b_page_count;
+	unsigned int	total_nr_pages = bp->b_folio_count;
 	int		nr_pages;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
@@ -1503,7 +1504,7 @@ xfs_buf_ioapply_map(
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
@@ -1716,13 +1717,13 @@ xfs_buf_offset(
 	struct xfs_buf		*bp,
 	size_t			offset)
 {
-	struct page		*page;
+	struct folio		*folio;
 
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
-	page = bp->b_pages[offset >> PAGE_SHIFT];
-	return page_address(page) + (offset & (PAGE_SIZE-1));
+	folio = bp->b_folios[offset >> PAGE_SHIFT];
+	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
 
 void
@@ -1735,18 +1736,18 @@ xfs_buf_zero(
 
 	bend = boff + bsize;
 	while (boff < bend) {
-		struct page	*page;
+		struct folio	*folio;
 		int		page_index, page_offset, csize;
 
 		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
 		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		page = bp->b_pages[page_index];
+		folio = bp->b_folios[page_index];
 		csize = min_t(size_t, PAGE_SIZE - page_offset,
 				      BBTOB(bp->b_length) - boff);
 
 		ASSERT((csize + page_offset) <= PAGE_SIZE);
 
-		memset(page_address(page) + page_offset, 0, csize);
+		memset(folio_address(folio) + page_offset, 0, csize);
 
 		boff += csize;
 	}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b470de08a46c..1e7298ff3fa5 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -29,7 +29,7 @@ struct xfs_buf;
 #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
 #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
 #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
+#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
 #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
 
@@ -39,7 +39,7 @@ struct xfs_buf;
 #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
 
 /* flags used only internally */
-#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
+#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
 #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
 
@@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
-	{ _XBF_PAGES,		"PAGES" }, \
+	{ _XBF_FOLIOS,		"FOLIOS" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	/* The following interface flags should never be set */ \
@@ -116,7 +116,7 @@ typedef struct xfs_buftarg {
 	struct ratelimit_state	bt_ioerror_rl;
 } xfs_buftarg_t;
 
-#define XB_PAGES	2
+#define XB_FOLIOS	2
 
 struct xfs_buf_map {
 	xfs_daddr_t		bm_bn;	/* block number for I/O */
@@ -180,14 +180,14 @@ struct xfs_buf {
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
 	struct xfs_trans	*b_transp;
-	struct page		**b_pages;	/* array of page pointers */
-	struct page		*b_page_array[XB_PAGES]; /* inline pages */
+	struct folio		**b_folios;	/* array of folio pointers */
+	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
 	struct xfs_buf_map	*b_maps;	/* compound buffer map */
 	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
 	int			b_map_count;
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
-	unsigned int		b_page_count;	/* size of page array */
+	unsigned int		b_folio_count;	/* size of folio array */
 	unsigned int		b_offset;	/* page offset of b_addr,
 						   only for _XBF_KMEM buffers */
 	int			b_error;	/* error code on I/O */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 83a81cb52d8e..d1407cee48d9 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -69,7 +69,7 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
-	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
 		return false;
 
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index caccb7f76690..804389b8e802 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -279,4 +279,12 @@ kmem_to_page(void *addr)
 	return virt_to_page(addr);
 }
 
+static inline struct folio *
+kmem_to_folio(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return page_folio(vmalloc_to_page(addr));
+	return virt_to_folio(addr);
+}
+
 #endif /* __XFS_LINUX__ */
-- 
2.43.0



* [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
@ 2024-01-18 22:19 ` Dave Chinner
  2024-01-19  1:31   ` Darrick J. Wong
  2024-01-22  6:45   ` Christoph Hellwig
  2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
  2024-01-22 10:13 ` Andi Kleen
  4 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-18 22:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: willy, linux-mm

From: Dave Chinner <dchinner@redhat.com>

Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.

This creates two types of buffers: single folio buffers that can
have arbitrary order, and multi-folio buffers made up of many single
page folios that get vmapped. The latter is essentially the existing
code, so there are no logic changes to handle this case.

There are a few places where we iterate the folios on a buffer.
These need to be converted to handle the high order folio case.
Luckily, this only occurs when bp->b_folio_count == 1, and the code
for handling this case is just a simple application of the folio API
to the operations that need to be performed.

The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, and hopefully will be
largely unused in normal conditions.

This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages put on the TLBs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 119 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 15907e92d0d3..df363f17ea1a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
 	return bp->b_addr && bp->b_folio_count > 1;
 }
 
+/*
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
+ */
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
@@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_offset = offset_in_page(bp->b_addr);
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
+/*
+ * Allocating a high order folio makes the assumption that buffers are a
+ * power-of-2 size so that ilog2() returns the exact order needed to fit
+ * the contents of the buffer. Buffer lengths are mostly a power of two,
+ * so this is not an unreasonable approach to take by default.
+ *
+ * The exception here are user xattr data buffers, which can be arbitrarily
+ * sized up to 64kB plus structure metadata. In that case, round up the order.
+ */
+static bool
+xfs_buf_alloc_folio(
+	struct xfs_buf	*bp,
+	gfp_t		gfp_mask)
+{
+	int		length = BBTOB(bp->b_length);
+	int		order;
+
+	order = ilog2(length);
+	if ((1 << order) < length)
+		order = ilog2(length - 1) + 1;
+
+	if (order <= PAGE_SHIFT)
+		order = 0;
+	else
+		order -= PAGE_SHIFT;
+
+	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
+	if (!bp->b_folio_array[0])
+		return false;
+
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folio_count = 1;
+	bp->b_flags |= _XBF_FOLIOS;
+	return true;
+}
+
+/*
+ * When we allocate folios for a buffer, we end up with one of two types of
+ * buffer.
+ *
+ * The first type is a single folio buffer - this may be a high order
+ * folio or just a single page sized folio, but either way they get treated the
+ * same way by the rest of the code - the buffer memory spans a single
+ * contiguous memory region that we don't have to map and unmap to access the
+ * data directly.
+ *
+ * The second type of buffer is the multi-folio buffer. These are *always* made
+ * up of single page folios so that they can be fed to vm_map_ram() to return a
+ * contiguous memory region we can access the data through, or mark it as
+ * XBF_UNMAPPED and access the data directly through individual folio_address()
+ * calls.
+ *
+ * We don't use high order folios for this second type of buffer (yet) because
+ * having variable size folios makes offset-to-folio indexing and iteration of
+ * the data range more complex than if they are fixed size. This case should now
+ * be the slow path, though, so unless we regularly fail to allocate high order
+ * folios, there should be little need to optimise this path.
+ */
 static int
 xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
@@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
-	/* Make sure that we have a page list */
+	/* Assure zeroed buffer for non-read cases. */
+	if (!(flags & XBF_READ))
+		gfp_mask |= __GFP_ZERO;
+
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 	if (bp->b_folio_count <= XB_FOLIOS) {
 		bp->b_folios = bp->b_folio_array;
@@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
 	}
 	bp->b_flags |= _XBF_FOLIOS;
 
-	/* Assure zeroed buffer for non-read cases. */
-	if (!(flags & XBF_READ))
-		gfp_mask |= __GFP_ZERO;
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
@@ -418,7 +485,7 @@ _xfs_buf_map_folios(
 {
 	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
-		/* A single page buffer is always mappable */
+		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
@@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
 	int		*count,
 	blk_opf_t	op)
 {
-	int		page_index;
-	unsigned int	total_nr_pages = bp->b_folio_count;
-	int		nr_pages;
+	int		folio_index;
+	unsigned int	total_nr_folios = bp->b_folio_count;
+	int		nr_folios;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
 
-	/* skip the pages in the buffer before the start offset */
-	page_index = 0;
+	/*
+	 * If the start offset is larger than a single page, we need to be
+	 * careful. We might have a high order folio, in which case the indexing
+	 * is from the start of the buffer. However, if we have more than one
+	 * single page folio in the buffer, we need to skip the folios in
+	 * the buffer before the start offset.
+	 */
+	folio_index = 0;
 	offset = *buf_offset;
-	while (offset >= PAGE_SIZE) {
-		page_index++;
-		offset -= PAGE_SIZE;
+	if (bp->b_folio_count > 1) {
+		while (offset >= PAGE_SIZE) {
+			folio_index++;
+			offset -= PAGE_SIZE;
+		}
 	}
 
 	/*
@@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
 
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_pages = bio_max_segs(total_nr_pages);
+	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+	for (; size && nr_folios; nr_folios--, folio_index++) {
+		struct folio	*folio = bp->b_folios[folio_index];
+		int		nbytes = folio_size(folio) - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
-				      offset);
-		if (rbytes < nbytes)
+		if (!bio_add_folio(bio, folio, nbytes,
+				offset_in_folio(folio, offset)))
 			break;
 
 		offset = 0;
 		sector += BTOBB(nbytes);
 		size -= nbytes;
-		total_nr_pages--;
+		total_nr_folios--;
 	}
 
 	if (likely(bio->bi_iter.bi_size)) {
@@ -1722,6 +1797,13 @@ xfs_buf_offset(
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
+	/* Single folio buffers may use large folios. */
+	if (bp->b_folio_count == 1) {
+		folio = bp->b_folios[0];
+		return folio_address(folio) + offset_in_folio(folio, offset);
+	}
+
+	/* Multi-folio buffers always use PAGE_SIZE folios */
 	folio = bp->b_folios[offset >> PAGE_SHIFT];
 	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
@@ -1737,18 +1819,24 @@ xfs_buf_zero(
 	bend = boff + bsize;
 	while (boff < bend) {
 		struct folio	*folio;
-		int		page_index, page_offset, csize;
+		int		folio_index, folio_offset, csize;
 
-		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		folio = bp->b_folios[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
+		/* Single folio buffers may use large folios. */
+		if (bp->b_folio_count == 1) {
+			folio = bp->b_folios[0];
+			folio_offset = offset_in_folio(folio,
+						bp->b_offset + boff);
+		} else {
+			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
+			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
+			folio = bp->b_folios[folio_index];
+		}
+
+		csize = min_t(size_t, folio_size(folio) - folio_offset,
 				      BBTOB(bp->b_length) - boff);
+		ASSERT((csize + folio_offset) <= folio_size(folio));
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
-
-		memset(folio_address(folio) + page_offset, 0, csize);
-
+		memset(folio_address(folio) + folio_offset, 0, csize);
 		boff += csize;
 	}
 }
-- 
2.43.0
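
As a worked example of the order calculation in xfs_buf_alloc_folio() above,
assuming PAGE_SHIFT == 12 (4kB pages); the 6kB length is hypothetical:

	/*
	 * 16kB inode cluster buffer: length = 16384
	 *	order = ilog2(16384) = 14, 1 << 14 == 16384, no round up needed
	 *	order -= PAGE_SHIFT  ->  order 2, a single 16kB folio
	 *
	 * 6kB buffer (not a power of two): length = 6144
	 *	order = ilog2(6144) = 12, but 1 << 12 == 4096 < 6144, so
	 *	order = ilog2(6143) + 1 = 13
	 *	order -= PAGE_SHIFT  ->  order 1, an 8kB folio (2kB is unused)
	 */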



* Re: [RFC] [PATCH 0/3] xfs: use large folios for buffers
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
                   ` (2 preceding siblings ...)
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
@ 2024-01-19  1:05 ` Darrick J. Wong
  2024-01-22 10:13 ` Andi Kleen
  4 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:38AM +1100, Dave Chinner wrote:
> The XFS buffer cache supports metadata buffers up to 64kB, and it does so by
> aggregating multiple pages into a single contiguous memory region using
> vmapping. This is expensive (both the setup and the runtime TLB mapping cost),
> and would be unnecessary if we could allocate large contiguous memory regions
> for the buffers in the first place.
> 
> Enter multi-page folios.

LOL, hch and I just wrapped up making the xfbtree buffer cache work with
large folios coming from tmpfs.  Though the use case there is simpler
because we require blocksize==PAGE_SIZE, forbid the use of highmem, and
don't need discontig buffers.  Hence we sidestep vm_map_ram. :)

> This patchset converts the buffer cache to use the folio API, then enhances it
> to optimistically use large folios where possible. It retains the old "vmap an
> array of single page folios" functionality as a fallback when large folio
> allocation fails. This means that, like page cache support for large folios, we
> aren't dependent on large folio allocation succeeding all the time.
> 
> This relegates the single page array allocation mechanism to a "slow path"
> whose performance we no longer need to care so much about. This might allow
> us to simplify it a bit in future.
> 
> One of the issues with the folio conversion is that we use a couple of APIs that
> take struct page ** (i.e. pointers to page pointer arrays) and there aren't
> folio counterparts. These are the bulk page allocator and vm_map_ram(). In the
> cases where they are used, we cast &bp->b_folios[] to (struct page **) knowing
> that this array will only contain single page folios and that single page folios
> and struct page are the same structure and so have the same address. This is a
> bit of a hack (hence the RFC) but I'm not sure that it's worth adding folio
> versions of these interfaces right now. We don't need to use the bulk page
> allocator so much any more, because that's now a slow path and we could probably
> just call folio_alloc() in a loop like we used to. What to do about vm_map_ram()
> is a little less clear....

Yeah, that's what I suspected.

> The other issue I tripped over in doing this conversion is that the
> discontiguous buffer straddling code in the buf log item dirty region tracking
> is broken. We don't actually exercise that code on existing configurations, and
> I tripped over it when tracking down a bug in the folio conversion. I fixed it
> and short-circuited the check for contiguous buffers, but that didn't fix the
> failure I was seeing (which was not handling bp->b_offset and large folios
> properly when building bios).

Yikes.

> Apart from those issues, the conversion and enhancement is relatively
> straightforward.  It passes fstests on both 512 and 4096 byte sector size storage (512
> byte sectors exercise the XBF_KMEM path which has non-zero bp->b_offset values)
> and doesn't appear to cause any problems with large directory buffers, though I
> haven't done any real testing on those yet. Large folio allocations are
> definitely being exercised, though, as all the inode cluster buffers are 16kB on
> a 512 byte inode V5 filesystem.
> 
> Thoughts, comments, etc?

Not yet.

> Note: this patchset is on top of the NOFS removal patchset I sent a
> few days ago. That can be pulled from this git branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Oooh a branch link, thank you.  It's so much easier if I can pull a
branch while picking through commits over gitweb.

--D

> 
> 


* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
@ 2024-01-19  1:26   ` Darrick J. Wong
  2024-01-19  7:03     ` Dave Chinner
  2024-01-22  6:39     ` Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:40AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the use of struct pages to struct folio everywhere. This
> is just direct API conversion; no actual logic or code changes
> should result.
> 
> Note: this conversion currently assumes only single page folios are
> allocated, and because some of the MM interfaces we use take
> pointers to arrays of struct pages, the address of single page
> folios and struct pages are the same, e.g. alloc_pages_bulk_array(),
> vm_map_ram(), etc.
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c      | 127 +++++++++++++++++++++---------------------
>  fs/xfs/xfs_buf.h      |  14 ++---
>  fs/xfs/xfs_buf_item.c |   2 +-
>  fs/xfs/xfs_linux.h    |   8 +++
>  4 files changed, 80 insertions(+), 71 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 08f2fbc04db5..15907e92d0d3 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -60,25 +60,25 @@ xfs_buf_submit(
>  	return __xfs_buf_submit(bp, !(bp->b_flags & XBF_ASYNC));
>  }
>  
> +/*
> + * Return true if the buffer is vmapped.
> + *
> + * b_addr is null if the buffer is not mapped, but the code is clever enough to
> + * know it doesn't have to map a single folio, so the check has to be both for
> + * b_addr and bp->b_folio_count > 1.
> + */
>  static inline int
>  xfs_buf_is_vmapped(
>  	struct xfs_buf	*bp)
>  {
> -	/*
> -	 * Return true if the buffer is vmapped.
> -	 *
> -	 * b_addr is null if the buffer is not mapped, but the code is clever
> -	 * enough to know it doesn't have to map a single page, so the check has
> -	 * to be both for b_addr and bp->b_page_count > 1.
> -	 */
> -	return bp->b_addr && bp->b_page_count > 1;
> +	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
>  {
> -	return (bp->b_page_count * PAGE_SIZE);
> +	return (bp->b_folio_count * PAGE_SIZE);
>  }
>  
>  /*
> @@ -197,7 +197,7 @@ xfs_buf_get_maps(
>  }
>  
>  /*
> - *	Frees b_pages if it was allocated.
> + *	Frees b_maps if it was allocated.
>   */
>  static void
>  xfs_buf_free_maps(
> @@ -273,26 +273,26 @@ _xfs_buf_alloc(
>  }
>  
>  static void
> -xfs_buf_free_pages(
> +xfs_buf_free_folios(
>  	struct xfs_buf	*bp)
>  {
>  	uint		i;
>  
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  
>  	if (xfs_buf_is_vmapped(bp))
> -		vm_unmap_ram(bp->b_addr, bp->b_page_count);
> +		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
>  
> -	for (i = 0; i < bp->b_page_count; i++) {
> -		if (bp->b_pages[i])
> -			__free_page(bp->b_pages[i]);
> +	for (i = 0; i < bp->b_folio_count; i++) {
> +		if (bp->b_folios[i])
> +			__folio_put(bp->b_folios[i]);
>  	}
> -	mm_account_reclaimed_pages(bp->b_page_count);
> +	mm_account_reclaimed_pages(bp->b_folio_count);
>  
> -	if (bp->b_pages != bp->b_page_array)
> -		kfree(bp->b_pages);
> -	bp->b_pages = NULL;
> -	bp->b_flags &= ~_XBF_PAGES;
> +	if (bp->b_folios != bp->b_folio_array)
> +		kfree(bp->b_folios);
> +	bp->b_folios = NULL;
> +	bp->b_flags &= ~_XBF_FOLIOS;
>  }
>  
>  static void
> @@ -313,8 +313,8 @@ xfs_buf_free(
>  
>  	ASSERT(list_empty(&bp->b_lru));
>  
> -	if (bp->b_flags & _XBF_PAGES)
> -		xfs_buf_free_pages(bp);
> +	if (bp->b_flags & _XBF_FOLIOS)
> +		xfs_buf_free_folios(bp);
>  	else if (bp->b_flags & _XBF_KMEM)
>  		kfree(bp->b_addr);
>  
> @@ -345,15 +345,15 @@ xfs_buf_alloc_kmem(
>  		return -ENOMEM;
>  	}
>  	bp->b_offset = offset_in_page(bp->b_addr);
> -	bp->b_pages = bp->b_page_array;
> -	bp->b_pages[0] = kmem_to_page(bp->b_addr);
> -	bp->b_page_count = 1;
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
>  static int
> -xfs_buf_alloc_pages(
> +xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
>  	xfs_buf_flags_t	flags)
>  {
> @@ -364,16 +364,16 @@ xfs_buf_alloc_pages(
>  		gfp_mask |= __GFP_NORETRY;
>  
>  	/* Make sure that we have a page list */
> -	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> -	if (bp->b_page_count <= XB_PAGES) {
> -		bp->b_pages = bp->b_page_array;
> +	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> +	if (bp->b_folio_count <= XB_FOLIOS) {
> +		bp->b_folios = bp->b_folio_array;
>  	} else {
> -		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
> +		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
>  					gfp_mask);
> -		if (!bp->b_pages)
> +		if (!bp->b_folios)
>  			return -ENOMEM;
>  	}
> -	bp->b_flags |= _XBF_PAGES;
> +	bp->b_flags |= _XBF_FOLIOS;
>  
>  	/* Assure zeroed buffer for non-read cases. */
>  	if (!(flags & XBF_READ))
> @@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
>  	for (;;) {
>  		long	last = filled;
>  
> -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
> -						bp->b_pages);
> -		if (filled == bp->b_page_count) {
> +		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> +						(struct page **)bp->b_folios);

Ugh, pointer casting.  I suppose here is where we might want an
alloc_folio_bulk_array that might give us successively smaller
large-folios until b_page_count is satisfied?  (Maybe that's in the next
patch?)

I guess you'd also need a large-folio capable vm_map_ram.  Both of
these things sound reasonable, particularly if somebody wants to write
us a new buffer cache for ext2rs and support large block sizes.

Assuming that one of the goals here is (say) to be able to mount a 16k
blocksize filesystem and try to get 16k folios for the buffer cache?

--D

> +		if (filled == bp->b_folio_count) {
>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>  			break;
>  		}
> @@ -398,7 +398,7 @@ xfs_buf_alloc_pages(
>  			continue;
>  
>  		if (flags & XBF_READ_AHEAD) {
> -			xfs_buf_free_pages(bp);
> +			xfs_buf_free_folios(bp);
>  			return -ENOMEM;
>  		}
>  
> @@ -412,14 +412,14 @@ xfs_buf_alloc_pages(
>   *	Map buffer into kernel address-space if necessary.
>   */
>  STATIC int
> -_xfs_buf_map_pages(
> +_xfs_buf_map_folios(
>  	struct xfs_buf		*bp,
>  	xfs_buf_flags_t		flags)
>  {
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> -	if (bp->b_page_count == 1) {
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
> +	if (bp->b_folio_count == 1) {
>  		/* A single page buffer is always mappable */
> -		bp->b_addr = page_address(bp->b_pages[0]);
> +		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
>  	} else {
> @@ -443,8 +443,8 @@ _xfs_buf_map_pages(
>  		 */
>  		nofs_flag = memalloc_nofs_save();
>  		do {
> -			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> -						-1);
> +			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
> +					bp->b_folio_count, -1);
>  			if (bp->b_addr)
>  				break;
>  			vm_unmap_aliases();
> @@ -571,7 +571,7 @@ xfs_buf_find_lock(
>  			return -ENOENT;
>  		}
>  		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> -		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
> +		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
>  		bp->b_ops = NULL;
>  	}
>  	return 0;
> @@ -629,14 +629,15 @@ xfs_buf_find_insert(
>  		goto out_drop_pag;
>  
>  	/*
> -	 * For buffers that fit entirely within a single page, first attempt to
> -	 * allocate the memory from the heap to minimise memory usage. If we
> -	 * can't get heap memory for these small buffers, we fall back to using
> -	 * the page allocator.
> +	 * For buffers that fit entirely within a single page folio, first
> +	 * attempt to allocate the memory from the heap to minimise memory
> +	 * usage. If we can't get heap memory for these small buffers, we fall
> +	 * back to using the page allocator.
>  	 */
> +
>  	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
>  	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
> -		error = xfs_buf_alloc_pages(new_bp, flags);
> +		error = xfs_buf_alloc_folios(new_bp, flags);
>  		if (error)
>  			goto out_free_buf;
>  	}
> @@ -728,11 +729,11 @@ xfs_buf_get_map(
>  
>  	/* We do not hold a perag reference anymore. */
>  	if (!bp->b_addr) {
> -		error = _xfs_buf_map_pages(bp, flags);
> +		error = _xfs_buf_map_folios(bp, flags);
>  		if (unlikely(error)) {
>  			xfs_warn_ratelimited(btp->bt_mount,
> -				"%s: failed to map %u pages", __func__,
> -				bp->b_page_count);
> +				"%s: failed to map %u folios", __func__,
> +				bp->b_folio_count);
>  			xfs_buf_relse(bp);
>  			return error;
>  		}
> @@ -963,14 +964,14 @@ xfs_buf_get_uncached(
>  	if (error)
>  		return error;
>  
> -	error = xfs_buf_alloc_pages(bp, flags);
> +	error = xfs_buf_alloc_folios(bp, flags);
>  	if (error)
>  		goto fail_free_buf;
>  
> -	error = _xfs_buf_map_pages(bp, 0);
> +	error = _xfs_buf_map_folios(bp, 0);
>  	if (unlikely(error)) {
>  		xfs_warn(target->bt_mount,
> -			"%s: failed to map pages", __func__);
> +			"%s: failed to map folios", __func__);
>  		goto fail_free_buf;
>  	}
>  
> @@ -1465,7 +1466,7 @@ xfs_buf_ioapply_map(
>  	blk_opf_t	op)
>  {
>  	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_page_count;
> +	unsigned int	total_nr_pages = bp->b_folio_count;
>  	int		nr_pages;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
> @@ -1503,7 +1504,7 @@ xfs_buf_ioapply_map(
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> +		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
>  				      offset);
>  		if (rbytes < nbytes)
>  			break;
> @@ -1716,13 +1717,13 @@ xfs_buf_offset(
>  	struct xfs_buf		*bp,
>  	size_t			offset)
>  {
> -	struct page		*page;
> +	struct folio		*folio;
>  
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> -	page = bp->b_pages[offset >> PAGE_SHIFT];
> -	return page_address(page) + (offset & (PAGE_SIZE-1));
> +	folio = bp->b_folios[offset >> PAGE_SHIFT];
> +	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
>  
>  void
> @@ -1735,18 +1736,18 @@ xfs_buf_zero(
>  
>  	bend = boff + bsize;
>  	while (boff < bend) {
> -		struct page	*page;
> +		struct folio	*folio;
>  		int		page_index, page_offset, csize;
>  
>  		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
>  		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		page = bp->b_pages[page_index];
> +		folio = bp->b_folios[page_index];
>  		csize = min_t(size_t, PAGE_SIZE - page_offset,
>  				      BBTOB(bp->b_length) - boff);
>  
>  		ASSERT((csize + page_offset) <= PAGE_SIZE);
>  
> -		memset(page_address(page) + page_offset, 0, csize);
> +		memset(folio_address(folio) + page_offset, 0, csize);
>  
>  		boff += csize;
>  	}
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index b470de08a46c..1e7298ff3fa5 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -29,7 +29,7 @@ struct xfs_buf;
>  #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
>  #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
>  #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
>  #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
>  
> @@ -39,7 +39,7 @@ struct xfs_buf;
>  #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
> -#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
> +#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
>  #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
>  
> @@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> -	{ _XBF_PAGES,		"PAGES" }, \
> +	{ _XBF_FOLIOS,		"FOLIOS" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
>  	/* The following interface flags should never be set */ \
> @@ -116,7 +116,7 @@ typedef struct xfs_buftarg {
>  	struct ratelimit_state	bt_ioerror_rl;
>  } xfs_buftarg_t;
>  
> -#define XB_PAGES	2
> +#define XB_FOLIOS	2
>  
>  struct xfs_buf_map {
>  	xfs_daddr_t		bm_bn;	/* block number for I/O */
> @@ -180,14 +180,14 @@ struct xfs_buf {
>  	struct xfs_buf_log_item	*b_log_item;
>  	struct list_head	b_li_list;	/* Log items list head */
>  	struct xfs_trans	*b_transp;
> -	struct page		**b_pages;	/* array of page pointers */
> -	struct page		*b_page_array[XB_PAGES]; /* inline pages */
> +	struct folio		**b_folios;	/* array of folio pointers */
> +	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
>  	struct xfs_buf_map	*b_maps;	/* compound buffer map */
>  	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
>  	int			b_map_count;
>  	atomic_t		b_pin_count;	/* pin count */
>  	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
> -	unsigned int		b_page_count;	/* size of page array */
> +	unsigned int		b_folio_count;	/* size of folio array */
>  	unsigned int		b_offset;	/* page offset of b_addr,
>  						   only for _XBF_KMEM buffers */
>  	int			b_error;	/* error code on I/O */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 83a81cb52d8e..d1407cee48d9 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -69,7 +69,7 @@ xfs_buf_item_straddle(
>  {
>  	void			*first, *last;
>  
> -	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
> +	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
>  		return false;
>  
>  	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index caccb7f76690..804389b8e802 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -279,4 +279,12 @@ kmem_to_page(void *addr)
>  	return virt_to_page(addr);
>  }
>  
> +static inline struct folio *
> +kmem_to_folio(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_folio(vmalloc_to_page(addr));
> +	return virt_to_folio(addr);
> +}
> +
>  #endif /* __XFS_LINUX__ */
> -- 
> 2.43.0
> 
> 


* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
@ 2024-01-19  1:31   ` Darrick J. Wong
  2024-01-19  7:12     ` Dave Chinner
  2024-01-22  6:45   ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2024-01-19  1:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:41AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
> 
> This creates two types of buffers: single folio buffers that can
> have arbitrary order, and multi-folio buffers made up of many single
> page folios that get vmapped. The latter is essentially the existing
> code, so there are no logic changes to handle this case.
> 
> There are a few places where we iterate the folios on a buffer.
> These need to be converted to handle the high order folio case.
> Luckily, this only occurs when bp->b_folio_count == 1, and the code
> for handling this case is just a simple application of the folio API
> to the operations that need to be performed.
> 
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions.
> 
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages put on the TLBs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 150 +++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 119 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 15907e92d0d3..df363f17ea1a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -74,6 +74,10 @@ xfs_buf_is_vmapped(
>  	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
> +/*
> + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> + * allocating vmapped buffers.
> + */
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
> @@ -344,14 +348,72 @@ xfs_buf_alloc_kmem(
>  		bp->b_addr = NULL;
>  		return -ENOMEM;
>  	}
> -	bp->b_offset = offset_in_page(bp->b_addr);
>  	bp->b_folios = bp->b_folio_array;
>  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
> +/*
> + * Allocating a high order folio makes the assumption that buffers are a
> + * power-of-2 size so that ilog2() returns the exact order needed to fit
> + * the contents of the buffer. Buffer lengths are mostly a power of two,
> + * so this is not an unreasonable approach to take by default.
> + *
> + * The exception here are user xattr data buffers, which can be arbitrarily
> + * sized up to 64kB plus structure metadata. In that case, round up the order.
> + */
> +static bool
> +xfs_buf_alloc_folio(
> +	struct xfs_buf	*bp,
> +	gfp_t		gfp_mask)
> +{
> +	int		length = BBTOB(bp->b_length);
> +	int		order;
> +
> +	order = ilog2(length);
> +	if ((1 << order) < length)
> +		order = ilog2(length - 1) + 1;
> +
> +	if (order <= PAGE_SHIFT)
> +		order = 0;
> +	else
> +		order -= PAGE_SHIFT;
> +
> +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> +	if (!bp->b_folio_array[0])
> +		return false;
> +
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folio_count = 1;
> +	bp->b_flags |= _XBF_FOLIOS;
> +	return true;

Hmm.  So I guess with this patchset, either we get one multi-page large
folio for the whole buffer, or we fall back to the array of basepage
sized folios?

/me wonders if the extra flexibility from alloc_folio_bulk_array and a
folioized vm_map_ram would eliminate all this special casing?
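
One way a folio-aware vm_map_ram() could be prototyped is as a thin shim that
expands the folios back into pages (hypothetical helper, not something proposed
in this thread):

	static void *
	vm_map_folios(struct folio **folios, unsigned int nr_folios,
			unsigned int nr_pages)
	{
		struct page **pages;
		unsigned int i, p = 0;
		void *addr;

		pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return NULL;

		for (i = 0; i < nr_folios; i++) {
			unsigned int j;

			for (j = 0; j < folio_nr_pages(folios[i]); j++)
				pages[p++] = folio_page(folios[i], j);
		}

		addr = vm_map_ram(pages, nr_pages, -1);
		kfree(pages);
		return addr;
	}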

Uhoh, lights flickering again and ice crashing off the roof.  I better
go before the lights go out again. :(

--D

> +}
> +
> +/*
> + * When we allocate folios for a buffer, we end up with one of two types of
> + * buffer.
> + *
> + * The first type is a single folio buffer - this may be a high order
> + * folio or just a single page sized folio, but either way they get treated the
> + * same way by the rest of the code - the buffer memory spans a single
> + * contiguous memory region that we don't have to map and unmap to access the
> + * data directly.
> + *
> + * The second type of buffer is the multi-folio buffer. These are *always* made
> + * up of single page folios so that they can be fed to vm_map_ram() to return a
> + * contiguous memory region we can access the data through, or mark it as
> + * XBF_UNMAPPED and access the data directly through individual folio_address()
> + * calls.
> + *
> + * We don't use high order folios for this second type of buffer (yet) because
> + * having variable size folios makes offset-to-folio indexing and iteration of
> + * the data range more complex than if they are fixed size. This case should now
> + * be the slow path, though, so unless we regularly fail to allocate high order
> + * folios, there should be little need to optimise this path.
> + */
>  static int
>  xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
> @@ -363,7 +425,15 @@ xfs_buf_alloc_folios(
>  	if (flags & XBF_READ_AHEAD)
>  		gfp_mask |= __GFP_NORETRY;
>  
> -	/* Make sure that we have a page list */
> +	/* Assure zeroed buffer for non-read cases. */
> +	if (!(flags & XBF_READ))
> +		gfp_mask |= __GFP_ZERO;
> +
> +	/* Optimistically attempt a single high order folio allocation. */
> +	if (xfs_buf_alloc_folio(bp, gfp_mask))
> +		return 0;
> +
> +	/* Fall back to allocating an array of single page folios. */
>  	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
>  	if (bp->b_folio_count <= XB_FOLIOS) {
>  		bp->b_folios = bp->b_folio_array;
> @@ -375,9 +445,6 @@ xfs_buf_alloc_folios(
>  	}
>  	bp->b_flags |= _XBF_FOLIOS;
>  
> -	/* Assure zeroed buffer for non-read cases. */
> -	if (!(flags & XBF_READ))
> -		gfp_mask |= __GFP_ZERO;
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -418,7 +485,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1465,20 +1532,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset is larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * single page folio in the buffer, we need to skip the folios in
> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;
> +		}
>  	}
>  
>  	/*
> @@ -1491,28 +1566,28 @@ xfs_buf_ioapply_map(
>  
>  next_chunk:
>  	atomic_inc(&bp->b_io_remaining);
> -	nr_pages = bio_max_segs(total_nr_pages);
> +	nr_folios = bio_max_segs(total_nr_folios);
>  
> -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
>  	bio->bi_iter.bi_sector = sector;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
>  	bio->bi_private = bp;
>  
> -	for (; size && nr_pages; nr_pages--, page_index++) {
> -		int	rbytes, nbytes = PAGE_SIZE - offset;
> +	for (; size && nr_folios; nr_folios--, folio_index++) {
> +		struct folio	*folio = bp->b_folios[folio_index];
> +		int		nbytes = folio_size(folio) - offset;
>  
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> -				      offset);
> -		if (rbytes < nbytes)
> +		if (!bio_add_folio(bio, folio, nbytes,
> +				offset_in_folio(folio, offset)))
>  			break;
>  
>  		offset = 0;
>  		sector += BTOBB(nbytes);
>  		size -= nbytes;
> -		total_nr_pages--;
> +		total_nr_folios--;
>  	}
>  
>  	if (likely(bio->bi_iter.bi_size)) {
> @@ -1722,6 +1797,13 @@ xfs_buf_offset(
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> +	/* Single folio buffers may use large folios. */
> +	if (bp->b_folio_count == 1) {
> +		folio = bp->b_folios[0];
> +		return folio_address(folio) + offset_in_folio(folio, offset);
> +	}
> +
> +	/* Multi-folio buffers always use PAGE_SIZE folios */
>  	folio = bp->b_folios[offset >> PAGE_SHIFT];
>  	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
> @@ -1737,18 +1819,24 @@ xfs_buf_zero(
>  	bend = boff + bsize;
>  	while (boff < bend) {
>  		struct folio	*folio;
> -		int		page_index, page_offset, csize;
> +		int		folio_index, folio_offset, csize;
>  
> -		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> -		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		folio = bp->b_folios[page_index];
> -		csize = min_t(size_t, PAGE_SIZE - page_offset,
> +		/* Single folio buffers may use large folios. */
> +		if (bp->b_folio_count == 1) {
> +			folio = bp->b_folios[0];
> +			folio_offset = offset_in_folio(folio,
> +						bp->b_offset + boff);
> +		} else {
> +			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> +			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> +			folio = bp->b_folios[folio_index];
> +		}
> +
> +		csize = min_t(size_t, folio_size(folio) - folio_offset,
>  				      BBTOB(bp->b_length) - boff);
> +		ASSERT((csize + folio_offset) <= folio_size(folio));
>  
> -		ASSERT((csize + page_offset) <= PAGE_SIZE);
> -
> -		memset(folio_address(folio) + page_offset, 0, csize);
> -
> +		memset(folio_address(folio) + folio_offset, 0, csize);
>  		boff += csize;
>  	}
>  }
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-19  1:26   ` Darrick J. Wong
@ 2024-01-19  7:03     ` Dave Chinner
  2024-01-22  6:39     ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-19  7:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2024 at 09:19:40AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Convert the use of struct pages to struct folio everywhere. This
> > is just direct API conversion, no actual logic of code changes
> > should result.
> > 
> > Note: this conversion currently assumes only single page folios are
> > allocated, and because some of the MM interfaces we use take
> > pointers to arrays of struct pages, the address of single page
> > folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> > vm_map_ram(), etc.
> > 
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
.....
> > @@ -387,9 +387,9 @@ xfs_buf_alloc_pages(
> >  	for (;;) {
> >  		long	last = filled;
> >  
> > -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
> > -						bp->b_pages);
> > -		if (filled == bp->b_page_count) {
> > +		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> > +						(struct page **)bp->b_folios);
> 
> Ugh, pointer casting.  I suppose here is where we might want an
> alloc_folio_bulk_array that might give us successively smaller
> large-folios until b_page_count is satisfied?  (Maybe that's in the next
> patch?)

No, I explicitly chose not to do that because then converting buffer
offset to memory address becomes excitingly complex. With fixed size
folios, it's just simple math. With variable, unknown sized objects,
we either have to store the size of each object with the pointer,
or walk each object grabbing the size to determine what folio in the
buffer corresponds to a specific offset.

And it's now the slow path, so I don't really care to optimise it
that much.
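
For illustration, a sketch of the two indexing schemes being compared
here; the helpers buf_offset_fixed() and buf_offset_variable() are
invented for the example and are not part of the patch:

static void *
buf_offset_fixed(struct xfs_buf *bp, size_t offset)
{
	/* Fixed PAGE_SIZE folios: offset-to-address is a shift and a mask. */
	struct folio	*folio = bp->b_folios[offset >> PAGE_SHIFT];

	return folio_address(folio) + (offset & (PAGE_SIZE - 1));
}

static void *
buf_offset_variable(struct xfs_buf *bp, size_t offset)
{
	/* Variable sized folios: walk the array summing folio_size(). */
	int		i;

	for (i = 0; i < bp->b_folio_count; i++) {
		size_t	fsize = folio_size(bp->b_folios[i]);

		if (offset < fsize)
			return folio_address(bp->b_folios[i]) + offset;
		offset -= fsize;
	}
	return NULL;
}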

> I guess you'd also need a large-folio capable vm_map_ram.  Both of
> these things sound reasonable, particularly if somebody wants to write
> us a new buffer cache for ext2rs and support large block sizes.

Maybe so, but we do not require them and I don't really have the
time or desire to try to implement something like that. And, really
what benefit do small multipage folios bring us if we still have to
vmap them?

> Assuming that one of the goals here is (say) to be able to mount a 16k
> blocksize filesystem and try to get 16k folios for the buffer cache?

The goal is that we optimistically use large folios where-ever we
have metadata buffers that are larger than a single page, regardless
of the filesystem block size.

Right now on a 4kB block size filesystem that means inode cluster
buffers (16kB for 512 byte inodes), user xattr buffers larger than a
single page, and directory blocks if the filesystem is configured
with "-n size=X" and X is 8kB or larger.

On filesystems with block sizes larger than 4kB, it will try to use
large folios for everything but the sector sized AG headers.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-19  1:31   ` Darrick J. Wong
@ 2024-01-19  7:12     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-19  7:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:31:00PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2024 at 09:19:41AM +1100, Dave Chinner wrote:
> > +/*
> > + * Allocating a high order folio makes the assumption that buffers are a
> > + * power-of-2 size so that ilog2() returns the exact order needed to fit
> > + * the contents of the buffer. Buffer lengths are mostly a power of two,
> > + * so this is not an unreasonable approach to take by default.
> > + *
> > + * The exception here are user xattr data buffers, which can be arbitrarily
> > + * sized up to 64kB plus structure metadata. In that case, round up the order.
> > + */
> > +static bool
> > +xfs_buf_alloc_folio(
> > +	struct xfs_buf	*bp,
> > +	gfp_t		gfp_mask)
> > +{
> > +	int		length = BBTOB(bp->b_length);
> > +	int		order;
> > +
> > +	order = ilog2(length);
> > +	if ((1 << order) < length)
> > +		order = ilog2(length - 1) + 1;
> > +
> > +	if (order <= PAGE_SHIFT)
> > +		order = 0;
> > +	else
> > +		order -= PAGE_SHIFT;
> > +
> > +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> > +	if (!bp->b_folio_array[0])
> > +		return false;
> > +
> > +	bp->b_folios = bp->b_folio_array;
> > +	bp->b_folio_count = 1;
> > +	bp->b_flags |= _XBF_FOLIOS;
> > +	return true;
> 
> Hmm.  So I guess with this patchset, either we get one multi-page large
> folio for the whole buffer, or we fall back to the array of basepage
> sized folios?

Yes.

> /me wonders if the extra flexibility from alloc_folio_bulk_array and a
> folioized vm_map_ram would eliminate all this special casing?

IMO, no. The basic requirement for a buffer is to provide contiguous
memory space unless XBF_UNMAPPED is specified.

In the case of contiguous space, we either get a single contiguous
range allocated or we have to vmap multiple segments. The moment we
can't get a contiguous memory range, we've lost, we're in the slow
path and we should just do what we know will reliably work as
efficiently as possible.

In the case of XBF_UNMAPPED, if we get a single contiguous range we
can ignore XBF_UNMAPPED, otherwise we should just do what we know
will reliably work as efficiently as possible.

Hence I don't think trying to optimise the "we didn't get a
contiguous memory allocation for the buffer" case for smaller
multi-page folios because it just adds complexity and doesn't
provide any real advantage over the PAGE_SIZE allocation path we do
now.

> Uhoh, lights flickering again and ice crashing off the roof.  I better
> go before the lights go out again. :(

Fun fun!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-19  1:26   ` Darrick J. Wong
  2024-01-19  7:03     ` Dave Chinner
@ 2024-01-22  6:39     ` Christoph Hellwig
  2024-01-22 12:05       ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, willy, linux-mm

On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> Ugh, pointer casting.  I suppose here is where we might want an
> alloc_folio_bulk_array that might give us successively smaller
> large-folios until b_page_count is satisfied?  (Maybe that's in the next
> patch?)
> 
> I guess you'd also need a large-folio capable vm_map_ram. 

We need to just stop using vm_map_ram, there is no reason to do that
even right now.  It was needed when we used the page cache to back
pagebuf, but these days just using vmalloc is the right thing for
!unmapped buffers that can't use large folios.  And I'm seriously
wondering if we should bother with unmapped buffers in the long run
if we end up normally using larger folios or just consolidate down to:

 - kmalloc for buffers < PAGE_SIZE
 - folio for buffers >= PAGE_SIZE
 - vmalloc if allocating a larger folio is not possible (see the sketch below)
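
A rough sketch of that policy follows; buf_alloc_mem() is an invented
name, the fail-fast gfp flags are an assumption, and the freeing-side
bookkeeping (kfree vs folio_put vs vfree) is left out:

static void *
buf_alloc_mem(size_t size, gfp_t gfp_mask)
{
	struct folio	*folio;

	if (size < PAGE_SIZE)
		return kmalloc(size, gfp_mask);

	/* Try a single large folio first, failing fast under pressure. */
	folio = folio_alloc(gfp_mask | __GFP_NORETRY | __GFP_NOWARN,
			get_order(size));
	if (folio)
		return folio_address(folio);

	/* No large folio available - fall back to vmalloc. */
	return vmalloc(size);
}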


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch
  2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
@ 2024-01-22  6:41   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

On Fri, Jan 19, 2024 at 09:19:39AM +1100, Dave Chinner wrote:
> Whilst touching this code, don't bother checking mapped or single
> folio buffers for discontiguous regions because they don't have
> them. This significantly reduces the overhead of this check when
> logging large buffers as calling xfs_buf_offset() is not free and
> it occurs a *lot* in those cases.

Nice.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
  2024-01-19  1:31   ` Darrick J. Wong
@ 2024-01-22  6:45   ` Christoph Hellwig
  2024-01-22 11:57     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-22  6:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, willy, linux-mm

> +	int		length = BBTOB(bp->b_length);
> +	int		order;
> +
> +	order = ilog2(length);
> +	if ((1 << order) < length)
> +		order = ilog2(length - 1) + 1;
> +
> +	if (order <= PAGE_SHIFT)
> +		order = 0;
> +	else
> +		order -= PAGE_SHIFT;

Shouldn't this simply use get_order()?
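
For reference, a sketch of how the helper might collapse with
get_order(), reusing the names from the patch above (this is not the
actual respin, just an illustration of the suggestion):

static bool
xfs_buf_alloc_folio(
	struct xfs_buf	*bp,
	gfp_t		gfp_mask)
{
	int		order = get_order(BBTOB(bp->b_length));

	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
	if (!bp->b_folio_array[0])
		return false;

	bp->b_folios = bp->b_folio_array;
	bp->b_folio_count = 1;
	bp->b_flags |= _XBF_FOLIOS;
	return true;
}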

^ permalink raw reply	[flat|nested] 20+ messages in thread

* (no subject)
  2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
                   ` (3 preceding siblings ...)
  2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
@ 2024-01-22 10:13 ` Andi Kleen
  2024-01-22 11:53   ` Dave Chinner
  4 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2024-01-22 10:13 UTC (permalink / raw)
  To: linux-xfs, david, linux-mm

Dave Chinner <david@fromorbit.com> writes:

> Thoughts, comments, etc?

The interesting part is if it will cause additional tail latencies
allocating under fragmentation with direct reclaim, compaction
etc. being triggered before it falls back to the base page path.

In fact it is highly likely it will, the question is just how bad it is.

Unfortunately benchmarking for that isn't that easy, it needs artificial
memory fragmentation and then some high stress workload, and then
instrumenting the transactions for individual latencies. 

I would in any case add a tunable for it in case people run into this.
Tail latencies are a common concern on many IO workloads.

-Andi

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re:
  2024-01-22 10:13 ` Andi Kleen
@ 2024-01-22 11:53   ` Dave Chinner
  2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2024-01-22 11:53 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-xfs, linux-mm

On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Thoughts, comments, etc?
> 
> The interesting part is if it will cause additional tail latencies
> allocating under fragmentation with direct reclaim, compaction
> etc. being triggered before it falls back to the base page path.

It's not like I don't know these problems exist with memory
allocation. Go have a look at xlog_kvmalloc() which is an open coded
kvmalloc() that allows the high order kmalloc allocations to
fail-fast without triggering all the expensive and unnecessary
direct reclaim overhead (e.g. compaction!) because we can fall back
to vmalloc without huge concerns. When high order allocations start
to fail, then we fall back to vmalloc and then we hit the long
standing vmalloc scalability problems before anything else in XFS or
the IO path becomes a bottleneck.
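
For illustration only, the general shape of that fail-fast pattern -
failfast_kvmalloc() is an invented name and this is a sketch, not the
actual xlog_kvmalloc() code:

static void *
failfast_kvmalloc(size_t size)
{
	/* Don't let a high order kmalloc enter direct reclaim/compaction... */
	void	*p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);

	if (p)
		return p;
	/* ...because falling back to vmalloc is cheap enough. */
	return vmalloc(size);
}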

IOWs, we already know that fail-fast high-order allocation is a more
efficient and effective fast path than using vmalloc/vmap_ram() all
the time. As this is an RFC, I haven't implemented stuff like this
yet - I haven't seen anything in the profiles indicating that high
order folio allocation is failing and causing lots of reclaim
overhead, so I simply haven't added fail-fast behaviour yet...

> In fact it is highly likely it will, the question is just how bad it is.
> 
> Unfortunately benchmarking for that isn't that easy, it needs artificial
> memory fragmentation and then some high stress workload, and then
> instrumenting the transactions for individual latencies. 

I stress test and measure XFS metadata performance under sustained
memory pressure all the time. This change has not caused any
obvious regressions in the short time I've been testing it.

I still need to do perf testing on large directory block sizes. That
is where high-order allocations will get stressed - that's where
xlog_kvmalloc() starts dominating the profiles as it trips over
vmalloc scalability issues...

> I would in any case add a tunable for it in case people run into this.

No tunables. It either works or it doesn't. If we can't make
it work reliably by default, we throw it in the dumpster, light it
on fire and walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata,
not so much - there's so many far worse long tail latencies in
metadata operations (like waiting for journal space) that memory
allocation latencies in the metadata IO path are largely noise....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] xfs: convert buffer cache to use high order folios
  2024-01-22  6:45   ` Christoph Hellwig
@ 2024-01-22 11:57     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-22 11:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, willy, linux-mm

On Sun, Jan 21, 2024 at 10:45:55PM -0800, Christoph Hellwig wrote:
> > +	int		length = BBTOB(bp->b_length);
> > +	int		order;
> > +
> > +	order = ilog2(length);
> > +	if ((1 << order) < length)
> > +		order = ilog2(length - 1) + 1;
> > +
> > +	if (order <= PAGE_SHIFT)
> > +		order = 0;
> > +	else
> > +		order -= PAGE_SHIFT;
> 
> Shouldn't this simply use get_order()?

Huh. Yes, it should.

I went looking for a helper and didn't find one in the mm or folio
code. Now you point it out, I find that it is in its own asm header
(include/asm-generic/getorder.h) so it's no wonder I didn't find
it.

Why is it in include/asm-generic? There's nothing asm related
to that function or its implementation....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22  6:39     ` Christoph Hellwig
@ 2024-01-22 12:05       ` Dave Chinner
  2024-01-22 13:18         ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2024-01-22 12:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, willy, linux-mm

On Sun, Jan 21, 2024 at 10:39:12PM -0800, Christoph Hellwig wrote:
> On Thu, Jan 18, 2024 at 05:26:24PM -0800, Darrick J. Wong wrote:
> > Ugh, pointer casting.  I suppose here is where we might want an
> > alloc_folio_bulk_array that might give us successively smaller
> > large-folios until b_page_count is satisfied?  (Maybe that's in the next
> > patch?)
> > 
> > I guess you'd also need a large-folio capable vm_map_ram. 
> 
> We need to just stop using vm_map_ram, there is no reason to do that
> even right now.  It was needed when we used the page cache to back
> pagebuf, but these days just using vmalloc is the right thing for
> !unmapped buffers that can't use large folios. 

I haven't looked at what using vmalloc means for packing the buffer
into a bio - we currently use bio_add_page(), so does that mean we
have to use some variant of virt_to_page() to break the vmalloc
region up into its backing pages to feed them to the bio? Or is
there some helper that I'm unaware of that does it all for us
magically?

> And I'm seriously
> wondering if we should bother with unmapped buffers in the long run
> if we end up normally using larger folios or just consolidate down to:
> 
>  - kmalloc for buffers < PAGE_SIZE
>  - folio for buffers >= PAGE_SIZE
>  - vmalloc if allocation a larger folios is not possible

Yeah, that's kind of where I'm going with this. Large folios already
turn off unmapped buffers, and I'd really like to get rid of that
page straddling mess that unmapped buffers require in the buffer
item dirty region tracking. That means we have to get rid of
unmapped buffers....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22 12:05       ` Dave Chinner
@ 2024-01-22 13:18         ` Christoph Hellwig
  2024-01-22 21:10           ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-22 13:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, willy, linux-mm

On Mon, Jan 22, 2024 at 11:05:20PM +1100, Dave Chinner wrote:
> I haven't looked at what using vmalloc means for packing the buffer
> into a bio - we currently use bio_add_page(), so does that mean we
> have to use some variant of virt_to_page() to break the vmalloc
> region up into it's backing pages to feed them to the bio? Or is
> there some helper that I'm unaware of that does it all for us
> magically?

We have a kmem_to_page helper for chunking any kind of kernel virtual
address space into pages.  xfs_rw_bdev in fs/xfs/xfs_bio_io.c uses
that for a bio, we should probably have an async version of that
and maybe move it to the block layer instead of duplicating the
logic in various places.
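
Roughly the kind of loop involved - a simplified sketch of the
xfs_rw_bdev() approach, where bio_add_kmem_range() is an invented name,
error handling is omitted and the bio is assumed to have enough
segments:

static void
bio_add_kmem_range(struct bio *bio, void *data, unsigned int count)
{
	while (count > 0) {
		unsigned int	off = offset_in_page(data);
		unsigned int	len = min_t(unsigned int, count,
						PAGE_SIZE - off);
		struct page	*page = is_vmalloc_addr(data) ?
						vmalloc_to_page(data) :
						virt_to_page(data);

		__bio_add_page(bio, page, len, off);
		data += len;
		count -= len;
	}
}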

> Yeah, that's kind of where I'm going with this. Large folios already
> turn off unmapped buffers, and I'd really like to get rid of that
> page straddling mess that unmapped buffers require in the buffer
> item dirty region tracking. That means we have to get rid of
> unmapped buffers....

I actually have an old series to kill unmapped buffers and use
vmalloc, but decided I'd need use folios for the fast path instead
of paying the vmalloc overhead.  I can dust it off and you can decide
if you want to pick up parts of it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using Folios for XFS metadata
  2024-01-22 11:53   ` Dave Chinner
@ 2024-01-22 13:34     ` Andi Kleen
  2024-01-23  0:19       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2024-01-22 13:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-mm

[fixed the subject, not sure what happened there]

FWIW I'm not sure fail-fast is always the right strategy here,
in many cases even with some reclaim, compaction may win. Just not if you're
on a tight budget for the latencies.

> I stress test and measure XFS metadata performance under sustained
> memory pressure all the time. This change has not caused any
> obvious regressions in the short time I've been testing it.

Did you test for tail latencies?

There are some relatively simple ways to trigger memory fragmentation,
the standard way is to allocate a very large THP backed file and then
punch a lot of holes.
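
A userspace sketch of that recipe; the path, sizes and the huge page
backing are assumptions (e.g. a tmpfs mount with huge=always), and
error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 8UL << 30;		/* 8GiB test file */
	size_t off;
	int fd = open("/mnt/test/fragfile", O_RDWR | O_CREAT, 0644);
	char *p;

	ftruncate(fd, sz);
	p = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	madvise(p, sz, MADV_HUGEPAGE);
	memset(p, 0xa5, sz);		/* fault in huge page backed cache */

	/* Punch a 4kB hole in every 2MiB region to fragment free memory. */
	for (off = 0; off < sz; off += 2UL << 20)
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				off, 4096);

	munmap(p, sz);
	close(fd);
	return 0;
}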

> 
> I still need to do perf testing on large directory block sizes. That
> is where high-order allocations will get stressed - that's where
> xlog_kvmalloc() starts dominating the profiles as it trips over
> vmalloc scalability issues...

Yes that's true. vmalloc has many issues, although with the recent
patches to split the rbtrees with separate locks it may now look
quite different than before.

> 
> > I would in any case add a tunable for it in case people run into this.
> 
> No tunables. It either works or it doesn't. If we can't make
> it work reliably by default, we throw it in the dumpster, light it
> on fire and walk away.

I'm not sure there is a single definition of "reliably" here -- for
many workloads tail latencies don't matter, so it's always reliable,
as long as you have good aggregate throughput.

Others have very high expectations for them.

Forcing the high expectations on everyone is probably not a good
general strategy though, as there are general trade offs.

I could see that having lots of small tunables for every use might not be a
good idea. Perhaps there would be a case for a single general tunable
that controls higher order folios for everyone.

> 
> > Tail latencies are a common concern on many IO workloads.
> 
> Yes, for user data operations it's a common concern. For metadata,
> not so much - there's so many far worse long tail latencies in
> metadata operations (like waiting for journal space) that memory
> allocation latencies in the metadata IO path are largely noise....

I've seen pretty long stalls in the past.

The difference to the journal is also that it is local to the file system, while
the memory is normally shared with everyone on the node or system. So the
scope of noisy neighbour impact can be quite different, especially on a large
machine. 

-Andi


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/3] xfs: use folios in the buffer cache
  2024-01-22 13:18         ` Christoph Hellwig
@ 2024-01-22 21:10           ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-22 21:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, willy, linux-mm

On Mon, Jan 22, 2024 at 05:18:58AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 11:05:20PM +1100, Dave Chinner wrote:
> > I haven't looked at what using vmalloc means for packing the buffer
> > into a bio - we currently use bio_add_page(), so does that mean we
> > have to use some variant of virt_to_page() to break the vmalloc
> > region up into it's backing pages to feed them to the bio? Or is
> > there some helper that I'm unaware of that does it all for us
> > magically?
> 
> We have a kmem_to_page helper for chuking any kind of kernel virtual
> address space into pages.  xfs_rw_bdev in fs/xfs/xfs_bio_io.c uses
> that for a bio, we should probably hav an async version of that
> and maybe move it to the block layer instead of duplicating the
> logic in various places.

Yeah, OK, as I expected. I'd forgotten that we already play that
game with xfs_rw_bdev(). I think we can trivially factor out an
async version and call that from the xfs_buf.c code for vmalloc()d
ranges, so I think I'll work towards that and actually remove the
bio packing loop from xfs_buf.c altogether.

> > Yeah, that's kind of where I'm going with this. Large folios already
> > turn off unmapped buffers, and I'd really like to get rid of that
> > page straddling mess that unmapped buffers require in the buffer
> > item dirty region tracking. That means we have to get rid of
> > unmapped buffers....
> 
> I actually have an old series to kill unmapped buffers and use
> vmalloc, but decided I'd need use folios for the fast path instead
> of paying the vmalloc overhead.  I can dust it off and you can decide
> if you want to pick up parts of it.

I wouldn't worry about it too much - the rest of it is pretty
straight forward once we know inode cluster buffers are working
correctly with single large folios and we always fall back to
vmalloc().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Using Folios for XFS metadata
  2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
@ 2024-01-23  0:19       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2024-01-23  0:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-xfs, linux-mm

On Mon, Jan 22, 2024 at 05:34:12AM -0800, Andi Kleen wrote:
> [fixed the subject, not sure what happened there]
> 
> FWIW I'm not sure fail-fast is always the right strategy here,
> in many cases even with some reclaim, compaction may win. Just not if you're
> on a tight budget for the latencies.
> 
> > I stress test and measure XFS metadata performance under sustained
> > memory pressure all the time. This change has not caused any
> > obvious regressions in the short time I've been testing it.
> 
> Did you test for tail latencies?

No, it's an RFC, and I mostly don't care about memory allocation
tail latencies because it is highly unlikely to be a *new issue* we
need to care about.

The fact is that we already do so much memory allocation and high
order memory allocation (e.g. through slub, xlog_kvmalloc(), user
data IO through the page cache, etc) that if there's a long tail
latency problem with high order memory allocation then it will
already be noticeably affecting the XFS data and metadata IO
latencies.

Nobody is reporting problems about excessive long tail latencies
when using XFS, so my care factor about long tail latencies in this
specific memory allocation case is close to zero.

> There are some relatively simple ways to trigger memory fragmentation,
> the standard way is to allocate a very large THP backed file and then
> punch a lot of holes.

Or just run a filesystem with lots of variable sized high order
allocations and varying cached object life times under sustained
memory pressure for a significant period of time....

> > > I would in any case add a tunable for it in case people run into this.
> > 
> > No tunables. It either works or it doesn't. If we can't make
> > it work reliably by default, we throw it in the dumpster, light it
> > on fire and walk away.
> 
> I'm not sure there is a single definition of "reliably" here -- for
> many workloads tail latencies don't matter, so it's always reliable,
> as long as you have good aggregate throughput.
> 
> Others have very high expectations for them.
> 
> Forcing the high expectations on everyone is probably not a good
> general strategy though, as there are general trade offs.

Yup, and we have to make those tradeoffs because filesystems need to
be good at "general purpose" workloads as their primary focus.
Minimising long tail latencies is really just "best effort only"
because we are intimately aware of the fact that there are global
resource limitations that cause long tail latencies in filesystem
implementations that simply cannot be worked around.

> I could see that having lots of small tunables for every use might not be a
> good idea. Perhaps there would be a case for a single general tunable
> that controls higher order folios for everyone.

You're not listening: no tunables. Code that has tunables because
you think it *may not work* is code that is not ready to be merged.

> > > Tail latencies are a common concern on many IO workloads.
> > 
> > Yes, for user data operations it's a common concern. For metadata,
> > not so much - there's so many far worse long tail latencies in
> > metadata operations (like waiting for journal space) that memory
> > allocation latencies in the metadata IO path are largely noise....
> 
> I've seen pretty long stalls in the past.
> 
> The difference to the journal is also that it is local to the file system, while
> the memory is normally shared with everyone on the node or system. So the
> scope of noisy neighbour impact can be quite different, especially on a large
> machine. 

Most systems run everything on a single filesystem. That makes them
just as global as memory allocation.  If the journal bottlenecks,
everyone on the system suffers the performance degradation, not just
the user who caused it.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, newest message: 2024-01-23  0:19 UTC

Thread overview: 20+ messages
2024-01-18 22:19 [RFC] [PATCH 0/3] xfs: use large folios for buffers Dave Chinner
2024-01-18 22:19 ` [PATCH 1/3] xfs: unmapped buffer item size straddling mismatch Dave Chinner
2024-01-22  6:41   ` Christoph Hellwig
2024-01-18 22:19 ` [PATCH 2/3] xfs: use folios in the buffer cache Dave Chinner
2024-01-19  1:26   ` Darrick J. Wong
2024-01-19  7:03     ` Dave Chinner
2024-01-22  6:39     ` Christoph Hellwig
2024-01-22 12:05       ` Dave Chinner
2024-01-22 13:18         ` Christoph Hellwig
2024-01-22 21:10           ` Dave Chinner
2024-01-18 22:19 ` [PATCH 3/3] xfs: convert buffer cache to use high order folios Dave Chinner
2024-01-19  1:31   ` Darrick J. Wong
2024-01-19  7:12     ` Dave Chinner
2024-01-22  6:45   ` Christoph Hellwig
2024-01-22 11:57     ` Dave Chinner
2024-01-19  1:05 ` [RFC] [PATCH 0/3] xfs: use large folios for buffers Darrick J. Wong
2024-01-22 10:13 ` Andi Kleen
2024-01-22 11:53   ` Dave Chinner
2024-01-22 13:34     ` Using Folios for XFS metadata Andi Kleen
2024-01-23  0:19       ` Dave Chinner
