* [PATCH v2 0/9] xfs: use large folios for buffers
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

The XFS buffer cache supports metadata buffers up to 64kB, and it
does so by aggregating multiple pages into a single contiguous
memory region using vmapping. This is expensive (both the setup and
the runtime TLB mapping cost), and would be unnecessary if we could
allocate large contiguous memory regions for the buffers in the
first place.

Enter multi-page folios.

This patchset converts the buffer cache to use the folio API, then
enhances it to optimistically use large folios where possible. It
retains the old "vmap an array of single page folios" functionality
as a fallback when large folio allocation fails. This means that,
like page cache support for large folios, we aren't dependent on
large folio allocation succeeding all the time.
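
In outline, the allocation fast path ends up looking like this
(condensed from xfs_buf_alloc_folios() in patch 3, so a sketch rather
than the exact code):

	/* Optimistically attempt a single high order folio allocation. */
	if (xfs_buf_alloc_folio(bp, gfp_mask))
		return 0;

	/* Fall back to allocating an array of single page folios. */
	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);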

This relegates the single page array allocation mechanism to a
"slow path" whose performance we no longer need to care much about,
which might allow us to simplify it a bit in future.

One of the issues with the folio conversion is that we use a couple
of APIs that take struct page ** (i.e. pointers to page pointer
arrays) and there aren't folio counterparts. These are the bulk page
allocator and vm_map_ram(). In the cases where they are used, we
cast &bp->b_folios[] to (struct page **) knowing that this array
will only contain single page folios and that single page folios and
struct page are the same structure and so have the same address.
This is a bit of a hack, so I've ported Christoph's vmalloc()-only
fallback patchset on top of these folio changes to remove both the
bulk page allocator and the calls to vm_map_ram(). This greatly
simplifies the allocation and freeing fallback paths, so it's a win
all around.
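
For reference, the two casts in question look like this in patch 2,
and both are gone again by the end of the series:

	filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
					(struct page **)bp->b_folios);

	bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
				bp->b_folio_count, -1);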

The other issue I tripped over in doing this conversion is that the
discontiguous buffer straddling code in the buf log item dirty
region tracking is broken. We don't actually exercise that code on
existing configurations, and I tripped over it when tracking down a
bug in the folio conversion. I fixed it and short-circuited the check
for contiguous buffers, but that left the code still in place and
not executed. However, Christoph's unmapped buffer removal patch
gets rid of unmapped buffers, so we never straddle pages in buffers
anymore and that code goes away entirely by the end of the patch
set. More wins!

Apart from those small complexities that are resolved by the end of
the patchset, the conversion and enhancement is relatively
straightforward. It passes fstests on both 512 and 4096 byte sector size
storage (512 byte sectors exercise the XBF_KMEM path which has
non-zero bp->b_offset values) and doesn't appear to cause any
problems with large 64kB directory buffers on 4kB page machines.

Version 2:
- use get_order() instead of open coding
- append in Christoph's unmapped buffer removal
- rework Christoph's vmalloc-instead-of-vm_map_ram to apply to the
  large folio based code. This greatly simplifies everything.

Version 1:
https://lore.kernel.org/linux-xfs/20240118222216.4131379-1-david@fromorbit.com/


* [PATCH 1/9] xfs: unmapped buffer item size straddling mismatch
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We never log large contiguous regions of unmapped buffers, so this
bug is never triggered by the current code. However, the slowpath
for formatting buffer straddling regions is broken.

That is, the size and shape of the log vector calculated across a
straddle does not match how the formatting code formats a straddle.
This results in a log vector with an uninitialised iovec and this
causes a crash when xlog_write_full() goes to copy the iovec into
the journal.

Whilst touching this code, don't bother checking mapped or single
folio buffers for discontiguous regions because they don't have
them. This significantly reduces the overhead of this check when
logging large buffers as calling xfs_buf_offset() is not free and
it occurs a *lot* in those cases.
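
For context, the straddle test itself is just address arithmetic over
the dirty bitmap: each bit covers one XFS_BLF_CHUNK of buffer data, so
a contiguous run of nbits chunks must span exactly nbits * XFS_BLF_CHUNK
bytes. This is the existing xfs_buf_item_straddle() logic, not new code:

	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
	last = xfs_buf_offset(bp,
			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));

	if (last - first != nbits * XFS_BLF_CHUNK)
		return true;	/* the run straddles a discontiguity */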

Fixes: 929f8b0deb83 ("xfs: optimise xfs_buf_item_size/format for contiguous regions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf_item.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 43031842341a..83a81cb52d8e 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,6 +56,10 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
+/*
+ * We only have to worry about discontiguous buffer range straddling on unmapped
+ * buffers. Everything else will have a contiguous data region we can copy from.
+ */
 static inline bool
 xfs_buf_item_straddle(
 	struct xfs_buf		*bp,
@@ -65,6 +69,9 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
+	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+		return false;
+
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
 	last = xfs_buf_offset(bp,
 			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
@@ -132,11 +139,13 @@ xfs_buf_item_size_segment(
 	return;
 
 slow_scan:
-	/* Count the first bit we jumped out of the above loop from */
-	(*nvecs)++;
-	*nbytes += XFS_BLF_CHUNK;
+	ASSERT(bp->b_addr == NULL);
 	last_bit = first_bit;
+	nbits = 1;
 	while (last_bit != -1) {
+
+		*nbytes += XFS_BLF_CHUNK;
+
 		/*
 		 * This takes the bit number to start looking from and
 		 * returns the next set bit from there.  It returns -1
@@ -151,6 +160,8 @@ xfs_buf_item_size_segment(
 		 * else keep scanning the current set of bits.
 		 */
 		if (next_bit == -1) {
+			if (first_bit != last_bit)
+				(*nvecs)++;
 			break;
 		} else if (next_bit != last_bit + 1 ||
 		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
@@ -162,7 +173,6 @@ xfs_buf_item_size_segment(
 			last_bit++;
 			nbits++;
 		}
-		*nbytes += XFS_BLF_CHUNK;
 	}
 }
 
-- 
2.43.0



* [PATCH 2/9] xfs: use folios in the buffer cache
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the use of struct page to struct folio everywhere. This is
just a direct API conversion; no actual logic or code changes
should result.

Note: this conversion currently assumes only single page folios are
allocated, and because some of the MM interfaces we use take
pointers to arrays of struct pages (e.g. alloc_pages_bulk_array(),
vm_map_ram(), etc.), we rely on the address of a single page folio
being the same as the address of its struct page.
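
A minimal illustration of that equivalence (my sketch, not part of the
patch - it relies on an order-0 folio and its struct page sharing the
same address):

	struct folio	*folio = folio_alloc(GFP_KERNEL, 0);

	if (folio) {
		/* An order-0 folio and its page occupy the same memory. */
		ASSERT((void *)folio == (void *)folio_page(folio, 0));
		folio_put(folio);
	}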
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c      | 132 +++++++++++++++++++++---------------------
 fs/xfs/xfs_buf.h      |  14 ++---
 fs/xfs/xfs_buf_item.c |   2 +-
 fs/xfs/xfs_buf_mem.c  |  30 +++++-----
 fs/xfs/xfs_buf_mem.h  |   8 +--
 fs/xfs/xfs_linux.h    |   8 +++
 6 files changed, 101 insertions(+), 93 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1a18c381127e..832226385154 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -66,25 +66,25 @@ static inline bool xfs_buf_is_uncached(struct xfs_buf *bp)
 	return bp->b_rhash_key == XFS_BUF_DADDR_NULL;
 }
 
+/*
+ * Return true if the buffer is vmapped.
+ *
+ * b_addr is null if the buffer is not mapped, but the code is clever enough to
+ * know it doesn't have to map a single folio, so the check has to be both for
+ * b_addr and bp->b_folio_count > 1.
+ */
 static inline int
 xfs_buf_is_vmapped(
 	struct xfs_buf	*bp)
 {
-	/*
-	 * Return true if the buffer is vmapped.
-	 *
-	 * b_addr is null if the buffer is not mapped, but the code is clever
-	 * enough to know it doesn't have to map a single page, so the check has
-	 * to be both for b_addr and bp->b_page_count > 1.
-	 */
-	return bp->b_addr && bp->b_page_count > 1;
+	return bp->b_addr && bp->b_folio_count > 1;
 }
 
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
 {
-	return (bp->b_page_count * PAGE_SIZE);
+	return (bp->b_folio_count * PAGE_SIZE);
 }
 
 /*
@@ -203,7 +203,7 @@ xfs_buf_get_maps(
 }
 
 /*
- *	Frees b_pages if it was allocated.
+ *	Frees b_maps if it was allocated.
  */
 static void
 xfs_buf_free_maps(
@@ -279,26 +279,26 @@ _xfs_buf_alloc(
 }
 
 static void
-xfs_buf_free_pages(
+xfs_buf_free_folios(
 	struct xfs_buf	*bp)
 {
 	uint		i;
 
-	ASSERT(bp->b_flags & _XBF_PAGES);
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
 
 	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_page_count);
+		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
 
-	for (i = 0; i < bp->b_page_count; i++) {
-		if (bp->b_pages[i])
-			__free_page(bp->b_pages[i]);
+	for (i = 0; i < bp->b_folio_count; i++) {
+		if (bp->b_folios[i])
+			__folio_put(bp->b_folios[i]);
 	}
-	mm_account_reclaimed_pages(bp->b_page_count);
+	mm_account_reclaimed_pages(bp->b_folio_count);
 
-	if (bp->b_pages != bp->b_page_array)
-		kfree(bp->b_pages);
-	bp->b_pages = NULL;
-	bp->b_flags &= ~_XBF_PAGES;
+	if (bp->b_folios != bp->b_folio_array)
+		kfree(bp->b_folios);
+	bp->b_folios = NULL;
+	bp->b_flags &= ~_XBF_FOLIOS;
 }
 
 static void
@@ -320,9 +320,9 @@ xfs_buf_free(
 	ASSERT(list_empty(&bp->b_lru));
 
 	if (xfs_buftarg_is_mem(bp->b_target))
-		xmbuf_unmap_page(bp);
-	else if (bp->b_flags & _XBF_PAGES)
-		xfs_buf_free_pages(bp);
+		xmbuf_unmap_folio(bp);
+	else if (bp->b_flags & _XBF_FOLIOS)
+		xfs_buf_free_folios(bp);
 	else if (bp->b_flags & _XBF_KMEM)
 		kfree(bp->b_addr);
 
@@ -353,15 +353,15 @@ xfs_buf_alloc_kmem(
 		return -ENOMEM;
 	}
 	bp->b_offset = offset_in_page(bp->b_addr);
-	bp->b_pages = bp->b_page_array;
-	bp->b_pages[0] = kmem_to_page(bp->b_addr);
-	bp->b_page_count = 1;
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
 static int
-xfs_buf_alloc_pages(
+xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
 	xfs_buf_flags_t	flags)
 {
@@ -372,16 +372,16 @@ xfs_buf_alloc_pages(
 		gfp_mask |= __GFP_NORETRY;
 
 	/* Make sure that we have a page list */
-	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
-	if (bp->b_page_count <= XB_PAGES) {
-		bp->b_pages = bp->b_page_array;
+	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	if (bp->b_folio_count <= XB_FOLIOS) {
+		bp->b_folios = bp->b_folio_array;
 	} else {
-		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
+		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
 					gfp_mask);
-		if (!bp->b_pages)
+		if (!bp->b_folios)
 			return -ENOMEM;
 	}
-	bp->b_flags |= _XBF_PAGES;
+	bp->b_flags |= _XBF_FOLIOS;
 
 	/* Assure zeroed buffer for non-read cases. */
 	if (!(flags & XBF_READ))
@@ -395,9 +395,9 @@ xfs_buf_alloc_pages(
 	for (;;) {
 		long	last = filled;
 
-		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
-						bp->b_pages);
-		if (filled == bp->b_page_count) {
+		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
+						(struct page **)bp->b_folios);
+		if (filled == bp->b_folio_count) {
 			XFS_STATS_INC(bp->b_mount, xb_page_found);
 			break;
 		}
@@ -406,7 +406,7 @@ xfs_buf_alloc_pages(
 			continue;
 
 		if (flags & XBF_READ_AHEAD) {
-			xfs_buf_free_pages(bp);
+			xfs_buf_free_folios(bp);
 			return -ENOMEM;
 		}
 
@@ -420,14 +420,14 @@ xfs_buf_alloc_pages(
  *	Map buffer into kernel address-space if necessary.
  */
 STATIC int
-_xfs_buf_map_pages(
+_xfs_buf_map_folios(
 	struct xfs_buf		*bp,
 	xfs_buf_flags_t		flags)
 {
-	ASSERT(bp->b_flags & _XBF_PAGES);
-	if (bp->b_page_count == 1) {
+	ASSERT(bp->b_flags & _XBF_FOLIOS);
+	if (bp->b_folio_count == 1) {
 		/* A single page buffer is always mappable */
-		bp->b_addr = page_address(bp->b_pages[0]);
+		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
@@ -451,8 +451,8 @@ _xfs_buf_map_pages(
 		 */
 		nofs_flag = memalloc_nofs_save();
 		do {
-			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-						-1);
+			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
+					bp->b_folio_count, -1);
 			if (bp->b_addr)
 				break;
 			vm_unmap_aliases();
@@ -579,7 +579,7 @@ xfs_buf_find_lock(
 			return -ENOENT;
 		}
 		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
+		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
 		bp->b_ops = NULL;
 	}
 	return 0;
@@ -638,16 +638,16 @@ xfs_buf_find_insert(
 		goto out_drop_pag;
 
 	if (xfs_buftarg_is_mem(new_bp->b_target)) {
-		error = xmbuf_map_page(new_bp);
+		error = xmbuf_map_folio(new_bp);
 	} else if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
 		   xfs_buf_alloc_kmem(new_bp, flags) < 0) {
 		/*
-		 * For buffers that fit entirely within a single page, first
-		 * attempt to allocate the memory from the heap to minimise
-		 * memory usage. If we can't get heap memory for these small
-		 * buffers, we fall back to using the page allocator.
+		 * For buffers that fit entirely within a single page folio,
+		 * first attempt to allocate the memory from the heap to
+		 * minimise memory usage. If we can't get heap memory for these
+		 * small buffers, we fall back to using the page allocator.
 		 */
-		error = xfs_buf_alloc_pages(new_bp, flags);
+		error = xfs_buf_alloc_folios(new_bp, flags);
 	}
 	if (error)
 		goto out_free_buf;
@@ -764,11 +764,11 @@ xfs_buf_get_map(
 
 	/* We do not hold a perag reference anymore. */
 	if (!bp->b_addr) {
-		error = _xfs_buf_map_pages(bp, flags);
+		error = _xfs_buf_map_folios(bp, flags);
 		if (unlikely(error)) {
 			xfs_warn_ratelimited(btp->bt_mount,
-				"%s: failed to map %u pages", __func__,
-				bp->b_page_count);
+				"%s: failed to map %u folios", __func__,
+				bp->b_folio_count);
 			xfs_buf_relse(bp);
 			return error;
 		}
@@ -1008,16 +1008,16 @@ xfs_buf_get_uncached(
 		return error;
 
 	if (xfs_buftarg_is_mem(bp->b_target))
-		error = xmbuf_map_page(bp);
+		error = xmbuf_map_folio(bp);
 	else
-		error = xfs_buf_alloc_pages(bp, flags);
+		error = xfs_buf_alloc_folios(bp, flags);
 	if (error)
 		goto fail_free_buf;
 
-	error = _xfs_buf_map_pages(bp, 0);
+	error = _xfs_buf_map_folios(bp, 0);
 	if (unlikely(error)) {
 		xfs_warn(target->bt_mount,
-			"%s: failed to map pages", __func__);
+			"%s: failed to map folios", __func__);
 		goto fail_free_buf;
 	}
 
@@ -1526,7 +1526,7 @@ xfs_buf_ioapply_map(
 	blk_opf_t	op)
 {
 	int		page_index;
-	unsigned int	total_nr_pages = bp->b_page_count;
+	unsigned int	total_nr_pages = bp->b_folio_count;
 	int		nr_pages;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
@@ -1564,7 +1564,7 @@ xfs_buf_ioapply_map(
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
@@ -1783,13 +1783,13 @@ xfs_buf_offset(
 	struct xfs_buf		*bp,
 	size_t			offset)
 {
-	struct page		*page;
+	struct folio		*folio;
 
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
-	page = bp->b_pages[offset >> PAGE_SHIFT];
-	return page_address(page) + (offset & (PAGE_SIZE-1));
+	folio = bp->b_folios[offset >> PAGE_SHIFT];
+	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
 
 void
@@ -1802,18 +1802,18 @@ xfs_buf_zero(
 
 	bend = boff + bsize;
 	while (boff < bend) {
-		struct page	*page;
+		struct folio	*folio;
 		int		page_index, page_offset, csize;
 
 		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
 		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		page = bp->b_pages[page_index];
+		folio = bp->b_folios[page_index];
 		csize = min_t(size_t, PAGE_SIZE - page_offset,
 				      BBTOB(bp->b_length) - boff);
 
 		ASSERT((csize + page_offset) <= PAGE_SIZE);
 
-		memset(page_address(page) + page_offset, 0, csize);
+		memset(folio_address(folio) + page_offset, 0, csize);
 
 		boff += csize;
 	}
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index b1580644501f..f059ae3d2755 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -29,7 +29,7 @@ struct xfs_buf;
 #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
 #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
 #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
-#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
+#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
 #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
 #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
 
@@ -39,7 +39,7 @@ struct xfs_buf;
 #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
 
 /* flags used only internally */
-#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
+#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
 #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
 
@@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_INODES,		"INODES" }, \
 	{ _XBF_DQUOTS,		"DQUOTS" }, \
 	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
-	{ _XBF_PAGES,		"PAGES" }, \
+	{ _XBF_FOLIOS,		"FOLIOS" }, \
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	/* The following interface flags should never be set */ \
@@ -128,7 +128,7 @@ struct xfs_buftarg {
 	struct xfs_buf_cache	bt_cache[];
 };
 
-#define XB_PAGES	2
+#define XB_FOLIOS	2
 
 struct xfs_buf_map {
 	xfs_daddr_t		bm_bn;	/* block number for I/O */
@@ -192,14 +192,14 @@ struct xfs_buf {
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
 	struct xfs_trans	*b_transp;
-	struct page		**b_pages;	/* array of page pointers */
-	struct page		*b_page_array[XB_PAGES]; /* inline pages */
+	struct folio		**b_folios;	/* array of folio pointers */
+	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
 	struct xfs_buf_map	*b_maps;	/* compound buffer map */
 	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
 	int			b_map_count;
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
-	unsigned int		b_page_count;	/* size of page array */
+	unsigned int		b_folio_count;	/* size of folio array */
 	unsigned int		b_offset;	/* page offset of b_addr,
 						   only for _XBF_KMEM buffers */
 	int			b_error;	/* error code on I/O */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 83a81cb52d8e..d1407cee48d9 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -69,7 +69,7 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
-	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
 		return false;
 
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 8ad38c64708e..26734c64c10e 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -74,7 +74,7 @@ xmbuf_alloc(
 
 	/*
 	 * We don't want to bother with kmapping data during repair, so don't
-	 * allow highmem pages to back this mapping.
+	 * allow highmem folios to back this mapping.
 	 */
 	mapping_set_gfp_mask(inode->i_mapping, GFP_KERNEL);
 
@@ -127,9 +127,9 @@ xmbuf_free(
 	kfree(btp);
 }
 
-/* Directly map a shmem page into the buffer cache. */
+/* Directly map a shmem folio into the buffer cache. */
 int
-xmbuf_map_page(
+xmbuf_map_folio(
 	struct xfs_buf		*bp)
 {
 	struct inode		*inode = file_inode(bp->b_target->bt_file);
@@ -169,27 +169,27 @@ xmbuf_map_page(
 	unlock_page(page);
 
 	bp->b_addr = page_address(page);
-	bp->b_pages = bp->b_page_array;
-	bp->b_pages[0] = page;
-	bp->b_page_count = 1;
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folios[0] = folio;
+	bp->b_folio_count = 1;
 	return 0;
 }
 
-/* Unmap a shmem page that was mapped into the buffer cache. */
+/* Unmap a shmem folio that was mapped into the buffer cache. */
 void
-xmbuf_unmap_page(
+xmbuf_unmap_folio(
 	struct xfs_buf		*bp)
 {
-	struct page		*page = bp->b_pages[0];
+	struct folio		*folio = bp->b_folios[0];
 
 	ASSERT(xfs_buftarg_is_mem(bp->b_target));
 
-	put_page(page);
+	folio_put(folio);
 
 	bp->b_addr = NULL;
-	bp->b_pages[0] = NULL;
-	bp->b_pages = NULL;
-	bp->b_page_count = 0;
+	bp->b_folios[0] = NULL;
+	bp->b_folios = NULL;
+	bp->b_folio_count = 0;
 }
 
 /* Is this a valid daddr within the buftarg? */
@@ -205,7 +205,7 @@ xmbuf_verify_daddr(
 	return daddr < (inode->i_sb->s_maxbytes >> BBSHIFT);
 }
 
-/* Discard the page backing this buffer. */
+/* Discard the folio backing this buffer. */
 static void
 xmbuf_stale(
 	struct xfs_buf		*bp)
@@ -220,7 +220,7 @@ xmbuf_stale(
 }
 
 /*
- * Finalize a buffer -- discard the backing page if it's stale, or run the
+ * Finalize a buffer -- discard the backing folio if it's stale, or run the
  * write verifier to detect problems.
  */
 int
diff --git a/fs/xfs/xfs_buf_mem.h b/fs/xfs/xfs_buf_mem.h
index eed4a7b63232..8f4c959ff829 100644
--- a/fs/xfs/xfs_buf_mem.h
+++ b/fs/xfs/xfs_buf_mem.h
@@ -19,15 +19,15 @@ int xmbuf_alloc(struct xfs_mount *mp, const char *descr,
 		struct xfs_buftarg **btpp);
 void xmbuf_free(struct xfs_buftarg *btp);
 
-int xmbuf_map_page(struct xfs_buf *bp);
-void xmbuf_unmap_page(struct xfs_buf *bp);
+int xmbuf_map_folio(struct xfs_buf *bp);
+void xmbuf_unmap_folio(struct xfs_buf *bp);
 bool xmbuf_verify_daddr(struct xfs_buftarg *btp, xfs_daddr_t daddr);
 void xmbuf_trans_bdetach(struct xfs_trans *tp, struct xfs_buf *bp);
 int xmbuf_finalize(struct xfs_buf *bp);
 #else
 # define xfs_buftarg_is_mem(...)	(false)
-# define xmbuf_map_page(...)		(-ENOMEM)
-# define xmbuf_unmap_page(...)		((void)0)
+# define xmbuf_map_folio(...)		(-ENOMEM)
+# define xmbuf_unmap_folio(...)		((void)0)
 # define xmbuf_verify_daddr(...)	(false)
 #endif /* CONFIG_XFS_MEMORY_BUFS */
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 8f07c9f6157f..9a74ba664efb 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -280,4 +280,12 @@ kmem_to_page(void *addr)
 	return virt_to_page(addr);
 }
 
+static inline struct folio *
+kmem_to_folio(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		return page_folio(vmalloc_to_page(addr));
+	return virt_to_folio(addr);
+}
+
 #endif /* __XFS_LINUX__ */
-- 
2.43.0



* [PATCH 3/9] xfs: convert buffer cache to use high order folios
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have the buffer cache using the folio API, we can extend
the use of folios to allocate high order folios for multi-page
buffers rather than an array of single pages that are then vmapped
into a contiguous range.

This creates two types of buffers: single folio buffers that can
have arbitrary order, and multi-folio buffers made up of many single
page folios that get vmapped. The latter is essentially the existing
code, so there are no logic changes to handle this case.

There are a few places where we iterate the folios on a buffer.
These need to be converted to handle the high order folio case.
Luckily, this only occurs when bp->b_folio_count == 1, and the code
for handling this case is just a simple application of the folio API
to the operations that need to be performed.
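
For example, the offset-to-address lookup in xfs_buf_offset() ends up
handling the two cases like this (condensed from the patch below):

	/* Single folio buffers may use large folios. */
	if (bp->b_folio_count == 1) {
		folio = bp->b_folios[0];
		return folio_address(folio) + offset_in_folio(folio, offset);
	}

	/* Multi-folio buffers always use PAGE_SIZE folios. */
	folio = bp->b_folios[offset >> PAGE_SHIFT];
	return folio_address(folio) + (offset & (PAGE_SIZE - 1));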

The code that allocates buffers will optimistically attempt a high
order folio allocation as a fast path. If this high order allocation
fails, then we fall back to the existing multi-folio allocation
code. This now forms the slow allocation path, and hopefully will be
largely unused in normal conditions.
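
Condensed from xfs_buf_alloc_folio() in the patch, the fast path
attempt is simply:

	int	order = get_order(BBTOB(bp->b_length));

	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
	if (!bp->b_folio_array[0])
		return false;	/* fall back to single page folios */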

This should improve performance of large buffer operations (e.g.
large directory block sizes) as we should now mostly avoid the
expense of vmapping large buffers (and the vmap lock contention that
can occur) as well as avoid the runtime pressure that frequently
accessing kernel vmapped pages puts on the TLBs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 110 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 832226385154..7d9303497763 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
 	return bp->b_addr && bp->b_folio_count > 1;
 }
 
+/*
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
+ */
 static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
@@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_offset = offset_in_page(bp->b_addr);
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
+	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
 
+/*
+ * Allocating a high order folio makes the assumption that buffers are a
+ * power-of-2 size so that ilog2() returns the exact order needed to fit
+ * the contents of the buffer. Buffer lengths are mostly a power of two,
+ * so this is not an unreasonable approach to take by default.
+ *
+ * The exception here are user xattr data buffers, which can be arbitrarily
+ * sized up to 64kB plus structure metadata. In that case, round up the order.
+ */
+static bool
+xfs_buf_alloc_folio(
+	struct xfs_buf	*bp,
+	gfp_t		gfp_mask)
+{
+	int		length = BBTOB(bp->b_length);
+	int		order = get_order(length);
+
+	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
+	if (!bp->b_folio_array[0])
+		return false;
+
+	bp->b_folios = bp->b_folio_array;
+	bp->b_folio_count = 1;
+	bp->b_flags |= _XBF_FOLIOS;
+	return true;
+}
+
+/*
+ * When we allocate folios for a buffer, we end up with one of two types of
+ * buffer.
+ *
+ * The first type is a single folio buffer - this may be a high order
+ * folio or just a single page sized folio, but either way they get treated the
+ * same way by the rest of the code - the buffer memory spans a single
+ * contiguous memory region that we don't have to map and unmap to access the
+ * data directly.
+ *
+ * The second type of buffer is the multi-folio buffer. These are *always* made
+ * up of single page folios so that they can be fed to vmap_ram() to return a
+ * contiguous memory region we can access the data through, or mark it as
+ * XBF_UNMAPPED and access the data directly through individual folio_address()
+ * calls.
+ *
+ * We don't use high order folios for this second type of buffer (yet) because
+ * having variable size folios makes offset-to-folio indexing and iteration of
+ * the data range more complex than if they are fixed size. This case should now
+ * be the slow path, though, so unless we regularly fail to allocate high order
+ * folios, there should be little need to optimise this path.
+ */
 static int
 xfs_buf_alloc_folios(
 	struct xfs_buf	*bp,
@@ -371,7 +424,15 @@ xfs_buf_alloc_folios(
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
 
-	/* Make sure that we have a page list */
+	/* Assure zeroed buffer for non-read cases. */
+	if (!(flags & XBF_READ))
+		gfp_mask |= __GFP_ZERO;
+
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 	if (bp->b_folio_count <= XB_FOLIOS) {
 		bp->b_folios = bp->b_folio_array;
@@ -383,9 +444,6 @@ xfs_buf_alloc_folios(
 	}
 	bp->b_flags |= _XBF_FOLIOS;
 
-	/* Assure zeroed buffer for non-read cases. */
-	if (!(flags & XBF_READ))
-		gfp_mask |= __GFP_ZERO;
 
 	/*
 	 * Bulk filling of pages can take multiple calls. Not filling the entire
@@ -426,7 +484,7 @@ _xfs_buf_map_folios(
 {
 	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
-		/* A single page buffer is always mappable */
+		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
@@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
 	int		*count,
 	blk_opf_t	op)
 {
-	int		page_index;
-	unsigned int	total_nr_pages = bp->b_folio_count;
-	int		nr_pages;
+	int		folio_index;
+	unsigned int	total_nr_folios = bp->b_folio_count;
+	int		nr_folios;
 	struct bio	*bio;
 	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
 
-	/* skip the pages in the buffer before the start offset */
-	page_index = 0;
+	/*
+	 * If the start offset if larger than a single page, we need to be
+	 * careful. We might have a high order folio, in which case the indexing
+	 * is from the start of the buffer. However, if we have more than one
+	 * folio single page folio in the buffer, we need to skip the folios in
+	 * the buffer before the start offset.
+	 */
+	folio_index = 0;
 	offset = *buf_offset;
-	while (offset >= PAGE_SIZE) {
-		page_index++;
-		offset -= PAGE_SIZE;
+	if (bp->b_folio_count > 1) {
+		while (offset >= PAGE_SIZE) {
+			folio_index++;
+			offset -= PAGE_SIZE;
+		}
 	}
 
 	/*
@@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
 
 next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_pages = bio_max_segs(total_nr_pages);
+	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = sector;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+	for (; size && nr_folios; nr_folios--, folio_index++) {
+		struct folio	*folio = bp->b_folios[folio_index];
+		int		nbytes = folio_size(folio) - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
-				      offset);
-		if (rbytes < nbytes)
+		if (!bio_add_folio(bio, folio, nbytes,
+				offset_in_folio(folio, offset)))
 			break;
 
 		offset = 0;
 		sector += BTOBB(nbytes);
 		size -= nbytes;
-		total_nr_pages--;
+		total_nr_folios--;
 	}
 
 	if (likely(bio->bi_iter.bi_size)) {
@@ -1788,6 +1854,13 @@ xfs_buf_offset(
 	if (bp->b_addr)
 		return bp->b_addr + offset;
 
+	/* Single folio buffers may use large folios. */
+	if (bp->b_folio_count == 1) {
+		folio = bp->b_folios[0];
+		return folio_address(folio) + offset_in_folio(folio, offset);
+	}
+
+	/* Multi-folio buffers always use PAGE_SIZE folios */
 	folio = bp->b_folios[offset >> PAGE_SHIFT];
 	return folio_address(folio) + (offset & (PAGE_SIZE-1));
 }
@@ -1803,18 +1876,24 @@ xfs_buf_zero(
 	bend = boff + bsize;
 	while (boff < bend) {
 		struct folio	*folio;
-		int		page_index, page_offset, csize;
+		int		folio_index, folio_offset, csize;
 
-		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-		folio = bp->b_folios[page_index];
-		csize = min_t(size_t, PAGE_SIZE - page_offset,
+		/* Single folio buffers may use large folios. */
+		if (bp->b_folio_count == 1) {
+			folio = bp->b_folios[0];
+			folio_offset = offset_in_folio(folio,
+						bp->b_offset + boff);
+		} else {
+			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
+			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
+			folio = bp->b_folios[folio_index];
+		}
+
+		csize = min_t(size_t, folio_size(folio) - folio_offset,
 				      BBTOB(bp->b_length) - boff);
+		ASSERT((csize + folio_offset) <= folio_size(folio));
 
-		ASSERT((csize + page_offset) <= PAGE_SIZE);
-
-		memset(folio_address(folio) + page_offset, 0, csize);
-
+		memset(folio_address(folio) + folio_offset, 0, csize);
 		boff += csize;
 	}
 }
-- 
2.43.0



* [PATCH 4/9] xfs: kill XBF_UNMAPPED
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Christoph Hellwig <hch@lst.de>

Unmapped buffer access is a pain, so kill it. The switch to large
folios means we rarely pay a vmap penalty for large buffers,
so this functionality is largely unnecessary now.
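
With every buffer now guaranteed to have a valid mapping in b_addr,
the data access helpers collapse to trivial inlines (as done below):

	static inline void *xfs_buf_offset(struct xfs_buf *bp, size_t offset)
	{
		return bp->b_addr + offset;
	}

	static inline void xfs_buf_zero(struct xfs_buf *bp, size_t boff,
			size_t bsize)
	{
		memset(bp->b_addr + boff, 0, bsize);
	}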

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c    |  2 +-
 fs/xfs/libxfs/xfs_inode_buf.c |  2 +-
 fs/xfs/scrub/inode_repair.c   |  3 +-
 fs/xfs/xfs_buf.c              | 62 ++---------------------------------
 fs/xfs/xfs_buf.h              | 16 ++++++---
 fs/xfs/xfs_buf_item.c         |  2 +-
 fs/xfs/xfs_buf_item_recover.c |  8 +----
 fs/xfs/xfs_inode.c            |  3 +-
 8 files changed, 19 insertions(+), 79 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e5ac3e5430c4..fa27a50f96ac 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -362,7 +362,7 @@ xfs_ialloc_inode_init(
 				(j * M_IGEO(mp)->blocks_per_cluster));
 		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
 				mp->m_bsize * M_IGEO(mp)->blocks_per_cluster,
-				XBF_UNMAPPED, &fbuf);
+				0, &fbuf);
 		if (error)
 			return error;
 
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index d0dcce462bf4..68989f4bf793 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -136,7 +136,7 @@ xfs_imap_to_bp(
 	int			error;
 
 	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
-			imap->im_len, XBF_UNMAPPED, bpp, &xfs_inode_buf_ops);
+			imap->im_len, 0, bpp, &xfs_inode_buf_ops);
 	if (xfs_metadata_is_sick(error))
 		xfs_agno_mark_sick(mp, xfs_daddr_to_agno(mp, imap->im_blkno),
 				XFS_SICK_AG_INODES);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index eab380e95ef4..7b31f1ad194f 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -1309,8 +1309,7 @@ xrep_dinode_core(
 
 	/* Read the inode cluster buffer. */
 	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
-			ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp,
-			NULL);
+			ri->imap.im_blkno, ri->imap.im_len, 0, &bp, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7d9303497763..2cd3671f3ce3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -239,7 +239,7 @@ _xfs_buf_alloc(
 	 * We don't want certain flags to appear in b_flags unless they are
 	 * specifically set by later operations on the buffer.
 	 */
-	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
+	flags &= ~(XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
 
 	atomic_set(&bp->b_hold, 1);
 	atomic_set(&bp->b_lru_ref, 1);
@@ -403,9 +403,7 @@ xfs_buf_alloc_folio(
  *
  * The second type of buffer is the multi-folio buffer. These are *always* made
  * up of single page folios so that they can be fed to vmap_ram() to return a
- * contiguous memory region we can access the data through, or mark it as
- * XBF_UNMAPPED and access the data directly through individual folio_address()
- * calls.
+ * contiguous memory region we can access the data through.
  *
  * We don't use high order folios for this second type of buffer (yet) because
  * having variable size folios makes offset-to-folio indexing and iteration of
@@ -486,8 +484,6 @@ _xfs_buf_map_folios(
 	if (bp->b_folio_count == 1) {
 		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
-	} else if (flags & XBF_UNMAPPED) {
-		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
 		unsigned nofs_flag;
@@ -1844,60 +1840,6 @@ __xfs_buf_submit(
 	return error;
 }
 
-void *
-xfs_buf_offset(
-	struct xfs_buf		*bp,
-	size_t			offset)
-{
-	struct folio		*folio;
-
-	if (bp->b_addr)
-		return bp->b_addr + offset;
-
-	/* Single folio buffers may use large folios. */
-	if (bp->b_folio_count == 1) {
-		folio = bp->b_folios[0];
-		return folio_address(folio) + offset_in_folio(folio, offset);
-	}
-
-	/* Multi-folio buffers always use PAGE_SIZE folios */
-	folio = bp->b_folios[offset >> PAGE_SHIFT];
-	return folio_address(folio) + (offset & (PAGE_SIZE-1));
-}
-
-void
-xfs_buf_zero(
-	struct xfs_buf		*bp,
-	size_t			boff,
-	size_t			bsize)
-{
-	size_t			bend;
-
-	bend = boff + bsize;
-	while (boff < bend) {
-		struct folio	*folio;
-		int		folio_index, folio_offset, csize;
-
-		/* Single folio buffers may use large folios. */
-		if (bp->b_folio_count == 1) {
-			folio = bp->b_folios[0];
-			folio_offset = offset_in_folio(folio,
-						bp->b_offset + boff);
-		} else {
-			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
-			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
-			folio = bp->b_folios[folio_index];
-		}
-
-		csize = min_t(size_t, folio_size(folio) - folio_offset,
-				      BBTOB(bp->b_length) - boff);
-		ASSERT((csize + folio_offset) <= folio_size(folio));
-
-		memset(folio_address(folio) + folio_offset, 0, csize);
-		boff += csize;
-	}
-}
-
 /*
  * Log a message about and stale a buffer that a caller has decided is corrupt.
  *
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index f059ae3d2755..aef7015cf9f3 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -51,7 +51,6 @@ struct xfs_buf;
 #define XBF_LIVESCAN	 (1u << 28)
 #define XBF_INCORE	 (1u << 29)/* lookup only, return if found in cache */
 #define XBF_TRYLOCK	 (1u << 30)/* lock requested, but do not wait */
-#define XBF_UNMAPPED	 (1u << 31)/* do not map the buffer */
 
 
 typedef unsigned int xfs_buf_flags_t;
@@ -74,8 +73,7 @@ typedef unsigned int xfs_buf_flags_t;
 	/* The following interface flags should never be set */ \
 	{ XBF_LIVESCAN,		"LIVESCAN" }, \
 	{ XBF_INCORE,		"INCORE" }, \
-	{ XBF_TRYLOCK,		"TRYLOCK" }, \
-	{ XBF_UNMAPPED,		"UNMAPPED" }
+	{ XBF_TRYLOCK,		"TRYLOCK" }
 
 /*
  * Internal state flags.
@@ -320,12 +318,20 @@ extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
 #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
 extern void xfs_buf_ioerror_alert(struct xfs_buf *bp, xfs_failaddr_t fa);
 void xfs_buf_ioend_fail(struct xfs_buf *);
-void xfs_buf_zero(struct xfs_buf *bp, size_t boff, size_t bsize);
 void __xfs_buf_mark_corrupt(struct xfs_buf *bp, xfs_failaddr_t fa);
 #define xfs_buf_mark_corrupt(bp) __xfs_buf_mark_corrupt((bp), __this_address)
 
 /* Buffer Utility Routines */
-extern void *xfs_buf_offset(struct xfs_buf *, size_t);
+static inline void *xfs_buf_offset(struct xfs_buf *bp, size_t offset)
+{
+	return bp->b_addr + offset;
+}
+
+static inline void xfs_buf_zero(struct xfs_buf *bp, size_t boff, size_t bsize)
+{
+	memset(bp->b_addr + boff, 0, bsize);
+}
+
 extern void xfs_buf_stale(struct xfs_buf *bp);
 
 /* Delayed Write Buffer Routines */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d1407cee48d9..7b66d3fe4ecd 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -69,7 +69,7 @@ xfs_buf_item_straddle(
 {
 	void			*first, *last;
 
-	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
+	if (bp->b_folio_count == 1)
 		return false;
 
 	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index 09e893cf563c..d74bf7bb7794 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -891,7 +891,6 @@ xlog_recover_buf_commit_pass2(
 	struct xfs_mount		*mp = log->l_mp;
 	struct xfs_buf			*bp;
 	int				error;
-	uint				buf_flags;
 	xfs_lsn_t			lsn;
 
 	/*
@@ -910,13 +909,8 @@ xlog_recover_buf_commit_pass2(
 	}
 
 	trace_xfs_log_recover_buf_recover(log, buf_f);
-
-	buf_flags = 0;
-	if (buf_f->blf_flags & XFS_BLF_INODE_BUF)
-		buf_flags |= XBF_UNMAPPED;
-
 	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno, buf_f->blf_len,
-			  buf_flags, &bp, NULL);
+			  0, &bp, NULL);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ea48774f6b76..e7a724270423 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2405,8 +2405,7 @@ xfs_ifree_cluster(
 		 * to mark all the active inodes on the buffer stale.
 		 */
 		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
-				mp->m_bsize * igeo->blocks_per_cluster,
-				XBF_UNMAPPED, &bp);
+				mp->m_bsize * igeo->blocks_per_cluster, 0, &bp);
 		if (error)
 			return error;
 
-- 
2.43.0



* [PATCH 5/9] xfs: buffer items don't straddle pages anymore
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Unmapped buffers don't exist anymore, so the page straddling
detection and slow path code can go away now.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 124 ------------------------------------------
 1 file changed, 124 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 7b66d3fe4ecd..cbc06605d31c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,31 +56,6 @@ xfs_buf_log_format_size(
 			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
 }
 
-/*
- * We only have to worry about discontiguous buffer range straddling on unmapped
- * buffers. Everything else will have a contiguous data region we can copy from.
- */
-static inline bool
-xfs_buf_item_straddle(
-	struct xfs_buf		*bp,
-	uint			offset,
-	int			first_bit,
-	int			nbits)
-{
-	void			*first, *last;
-
-	if (bp->b_folio_count == 1)
-		return false;
-
-	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
-	last = xfs_buf_offset(bp,
-			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
-
-	if (last - first != nbits * XFS_BLF_CHUNK)
-		return true;
-	return false;
-}
-
 /*
  * Return the number of log iovecs and space needed to log the given buf log
  * item segment.
@@ -97,11 +72,8 @@ xfs_buf_item_size_segment(
 	int				*nvecs,
 	int				*nbytes)
 {
-	struct xfs_buf			*bp = bip->bli_buf;
 	int				first_bit;
 	int				nbits;
-	int				next_bit;
-	int				last_bit;
 
 	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
 	if (first_bit == -1)
@@ -114,15 +86,6 @@ xfs_buf_item_size_segment(
 		nbits = xfs_contig_bits(blfp->blf_data_map,
 					blfp->blf_map_size, first_bit);
 		ASSERT(nbits > 0);
-
-		/*
-		 * Straddling a page is rare because we don't log contiguous
-		 * chunks of unmapped buffers anywhere.
-		 */
-		if (nbits > 1 &&
-		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
-			goto slow_scan;
-
 		(*nvecs)++;
 		*nbytes += nbits * XFS_BLF_CHUNK;
 
@@ -137,43 +100,6 @@ xfs_buf_item_size_segment(
 	} while (first_bit != -1);
 
 	return;
-
-slow_scan:
-	ASSERT(bp->b_addr == NULL);
-	last_bit = first_bit;
-	nbits = 1;
-	while (last_bit != -1) {
-
-		*nbytes += XFS_BLF_CHUNK;
-
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					last_bit + 1);
-		/*
-		 * If we run out of bits, leave the loop,
-		 * else if we find a new set of bits bump the number of vecs,
-		 * else keep scanning the current set of bits.
-		 */
-		if (next_bit == -1) {
-			if (first_bit != last_bit)
-				(*nvecs)++;
-			break;
-		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
-			last_bit = next_bit;
-			first_bit = next_bit;
-			(*nvecs)++;
-			nbits = 1;
-		} else {
-			last_bit++;
-			nbits++;
-		}
-	}
 }
 
 /*
@@ -286,8 +212,6 @@ xfs_buf_item_format_segment(
 	struct xfs_buf		*bp = bip->bli_buf;
 	uint			base_size;
 	int			first_bit;
-	int			last_bit;
-	int			next_bit;
 	uint			nbits;
 
 	/* copy the flags across from the base format item */
@@ -332,15 +256,6 @@ xfs_buf_item_format_segment(
 		nbits = xfs_contig_bits(blfp->blf_data_map,
 					blfp->blf_map_size, first_bit);
 		ASSERT(nbits > 0);
-
-		/*
-		 * Straddling a page is rare because we don't log contiguous
-		 * chunks of unmapped buffers anywhere.
-		 */
-		if (nbits > 1 &&
-		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
-			goto slow_scan;
-
 		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
 					first_bit, nbits);
 		blfp->blf_size++;
@@ -356,45 +271,6 @@ xfs_buf_item_format_segment(
 	} while (first_bit != -1);
 
 	return;
-
-slow_scan:
-	ASSERT(bp->b_addr == NULL);
-	last_bit = first_bit;
-	nbits = 1;
-	for (;;) {
-		/*
-		 * This takes the bit number to start looking from and
-		 * returns the next set bit from there.  It returns -1
-		 * if there are no more bits set or the start bit is
-		 * beyond the end of the bitmap.
-		 */
-		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
-					(uint)last_bit + 1);
-		/*
-		 * If we run out of bits fill in the last iovec and get out of
-		 * the loop.  Else if we start a new set of bits then fill in
-		 * the iovec for the series we were looking at and start
-		 * counting the bits in the new one.  Else we're still in the
-		 * same set of bits so just keep counting and scanning.
-		 */
-		if (next_bit == -1) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			break;
-		} else if (next_bit != last_bit + 1 ||
-		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
-			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
-						first_bit, nbits);
-			blfp->blf_size++;
-			first_bit = next_bit;
-			last_bit = next_bit;
-			nbits = 1;
-		} else {
-			last_bit++;
-			nbits++;
-		}
-	}
 }
 
 /*
-- 
2.43.0



* [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Christoph Hellwig <hch@lst.de>

With the concept of unmapped buffers gone, there is no reason not
to vmap the buffer pages directly in xfs_buf_alloc_folios.

[dchinner: port to folio-enabled buffers.]
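
Condensed, the tail of xfs_buf_alloc_folios() now does the mapping
itself (retry and NOFS handling omitted here for brevity):

	if (bp->b_folio_count == 1) {
		/* A single folio buffer is always mappable */
		bp->b_addr = folio_address(bp->b_folios[0]);
	} else {
		bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
				bp->b_folio_count, -1);
		if (!bp->b_addr) {
			xfs_buf_free_folios(bp);
			return -ENOMEM;
		}
	}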

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 37 ++++++-------------------------------
 1 file changed, 6 insertions(+), 31 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2cd3671f3ce3..a77e2d8c8107 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -469,18 +469,7 @@ xfs_buf_alloc_folios(
 		XFS_STATS_INC(bp->b_mount, xb_page_retries);
 		memalloc_retry_wait(gfp_mask);
 	}
-	return 0;
-}
 
-/*
- *	Map buffer into kernel address-space if necessary.
- */
-STATIC int
-_xfs_buf_map_folios(
-	struct xfs_buf		*bp,
-	xfs_buf_flags_t		flags)
-{
-	ASSERT(bp->b_flags & _XBF_FOLIOS);
 	if (bp->b_folio_count == 1) {
 		/* A single folio buffer is always mappable */
 		bp->b_addr = folio_address(bp->b_folios[0]);
@@ -513,8 +502,13 @@ _xfs_buf_map_folios(
 		} while (retried++ <= 1);
 		memalloc_nofs_restore(nofs_flag);
 
-		if (!bp->b_addr)
+		if (!bp->b_addr) {
+			xfs_warn_ratelimited(bp->b_mount,
+				"%s: failed to map %u folios", __func__,
+				bp->b_folio_count);
+			xfs_buf_free_folios(bp);
 			return -ENOMEM;
+		}
 	}
 
 	return 0;
@@ -816,18 +810,6 @@ xfs_buf_get_map(
 			xfs_perag_put(pag);
 	}
 
-	/* We do not hold a perag reference anymore. */
-	if (!bp->b_addr) {
-		error = _xfs_buf_map_folios(bp, flags);
-		if (unlikely(error)) {
-			xfs_warn_ratelimited(btp->bt_mount,
-				"%s: failed to map %u folios", __func__,
-				bp->b_folio_count);
-			xfs_buf_relse(bp);
-			return error;
-		}
-	}
-
 	/*
 	 * Clear b_error if this is a lookup from a caller that doesn't expect
 	 * valid data to be found in the buffer.
@@ -1068,13 +1050,6 @@ xfs_buf_get_uncached(
 	if (error)
 		goto fail_free_buf;
 
-	error = _xfs_buf_map_folios(bp, 0);
-	if (unlikely(error)) {
-		xfs_warn(target->bt_mount,
-			"%s: failed to map folios", __func__);
-		goto fail_free_buf;
-	}
-
 	trace_xfs_buf_get_uncached(bp, _RET_IP_);
 	*bpp = bp;
 	return 0;
-- 
2.43.0



* [PATCH 7/9] xfs: walk b_addr for buffer I/O
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Christoph Hellwig <hch@lst.de>

Instead of walking the folio array, just walk the kernel virtual
address range in ->b_addr.  This prepares for using vmalloc for
buffers and removing the b_folios array.

[dchinner: ported to folios-based buffers.]
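
The per-map submission loop then becomes a straight walk over b_addr
(condensed from the patch below):

	do {
		void		*data = bp->b_addr + *buf_offset;
		struct folio	*folio = kmem_to_folio(data);
		unsigned int	off = offset_in_folio(folio, data);
		unsigned int	len = min_t(unsigned int, size,
						folio_size(folio) - off);

		bio_add_folio_nofail(bio, folio, len, off);
		size -= len;
		*buf_offset += len;
	} while (size);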

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 110 +++++++++++++----------------------------------
 fs/xfs/xfs_buf.h |   2 -
 2 files changed, 29 insertions(+), 83 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a77e2d8c8107..303945554415 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -358,7 +358,6 @@ xfs_buf_alloc_kmem(
 	}
 	bp->b_folios = bp->b_folio_array;
 	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
-	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
@@ -1549,87 +1548,44 @@ xfs_buf_bio_end_io(
 static void
 xfs_buf_ioapply_map(
 	struct xfs_buf	*bp,
-	int		map,
-	int		*buf_offset,
-	int		*count,
+	unsigned int	map,
+	unsigned int	*buf_offset,
 	blk_opf_t	op)
 {
-	int		folio_index;
-	unsigned int	total_nr_folios = bp->b_folio_count;
-	int		nr_folios;
 	struct bio	*bio;
-	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
-	int		offset;
 
-	/*
-	 * If the start offset if larger than a single page, we need to be
-	 * careful. We might have a high order folio, in which case the indexing
-	 * is from the start of the buffer. However, if we have more than one
-	 * folio single page folio in the buffer, we need to skip the folios in
-	 * the buffer before the start offset.
-	 */
-	folio_index = 0;
-	offset = *buf_offset;
-	if (bp->b_folio_count > 1) {
-		while (offset >= PAGE_SIZE) {
-			folio_index++;
-			offset -= PAGE_SIZE;
-		}
+	/* Limit the IO size to the length of the current vector. */
+	size = min_t(unsigned int, BBTOB(bp->b_maps[map].bm_len),
+			BBTOB(bp->b_length) - *buf_offset);
+
+	if (WARN_ON_ONCE(bp->b_folio_count > BIO_MAX_VECS)) {
+		xfs_buf_ioerror(bp, -EIO);
+		return;
 	}
 
-	/*
-	 * Limit the IO size to the length of the current vector, and update the
-	 * remaining IO count for the next time around.
-	 */
-	size = min_t(int, BBTOB(bp->b_maps[map].bm_len), *count);
-	*count -= size;
-	*buf_offset += size;
-
-next_chunk:
 	atomic_inc(&bp->b_io_remaining);
-	nr_folios = bio_max_segs(total_nr_folios);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
-	bio->bi_iter.bi_sector = sector;
+	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_folio_count, op, GFP_NOIO);
+	bio->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
 
-	for (; size && nr_folios; nr_folios--, folio_index++) {
-		struct folio	*folio = bp->b_folios[folio_index];
-		int		nbytes = folio_size(folio) - offset;
-
-		if (nbytes > size)
-			nbytes = size;
-
-		if (!bio_add_folio(bio, folio, nbytes,
-				offset_in_folio(folio, offset)))
-			break;
-
-		offset = 0;
-		sector += BTOBB(nbytes);
-		size -= nbytes;
-		total_nr_folios--;
-	}
-
-	if (likely(bio->bi_iter.bi_size)) {
-		if (xfs_buf_is_vmapped(bp)) {
-			flush_kernel_vmap_range(bp->b_addr,
-						xfs_buf_vmap_len(bp));
-		}
-		submit_bio(bio);
-		if (size)
-			goto next_chunk;
-	} else {
-		/*
-		 * This is guaranteed not to be the last io reference count
-		 * because the caller (xfs_buf_submit) holds a count itself.
-		 */
-		atomic_dec(&bp->b_io_remaining);
-		xfs_buf_ioerror(bp, -EIO);
-		bio_put(bio);
-	}
-
+	do {
+		void		*data = bp->b_addr + *buf_offset;
+		struct folio	*folio = kmem_to_folio(data);
+		unsigned int	off = offset_in_folio(folio, data);
+		unsigned int	len = min_t(unsigned int, size,
+						folio_size(folio) - off);
+
+		bio_add_folio_nofail(bio, folio, len, off);
+		size -= len;
+		*buf_offset += len;
+	} while (size);
+
+	if (xfs_buf_is_vmapped(bp))
+		flush_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
+	submit_bio(bio);
 }
 
 STATIC void
@@ -1638,8 +1594,7 @@ _xfs_buf_ioapply(
 {
 	struct blk_plug	plug;
 	blk_opf_t	op;
-	int		offset;
-	int		size;
+	unsigned int	offset = 0;
 	int		i;
 
 	/*
@@ -1701,16 +1656,9 @@ _xfs_buf_ioapply(
 	 * _xfs_buf_ioapply_vec() will modify them appropriately for each
 	 * subsequent call.
 	 */
-	offset = bp->b_offset;
-	size = BBTOB(bp->b_length);
 	blk_start_plug(&plug);
-	for (i = 0; i < bp->b_map_count; i++) {
-		xfs_buf_ioapply_map(bp, i, &offset, &size, op);
-		if (bp->b_error)
-			break;
-		if (size <= 0)
-			break;	/* all done */
-	}
+	for (i = 0; i < bp->b_map_count; i++)
+		xfs_buf_ioapply_map(bp, i, &offset, op);
 	blk_finish_plug(&plug);
 }
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index aef7015cf9f3..4d515407713b 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -198,8 +198,6 @@ struct xfs_buf {
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
 	unsigned int		b_folio_count;	/* size of folio array */
-	unsigned int		b_offset;	/* page offset of b_addr,
-						   only for _XBF_KMEM buffers */
 	int			b_error;	/* error code on I/O */
 
 	/*
-- 
2.43.0



* [PATCH 8/9] xfs: use vmalloc for multi-folio buffers
From: Dave Chinner @ 2024-03-18 22:45 UTC
  To: linux-xfs

From: Christoph Hellwig <hch@lst.de>

Instead of allocating the folios manually using the bulk page
allocator and then mapping them with vm_map_ram(), just use vmalloc
to allocate the entire buffer - vmalloc will use the bulk allocator
internally if it fits.

With this the b_folios array can go away as well, as nothing uses it.

[dchinner: port to folio based buffers.]

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c     | 164 ++++++++++++-------------------------------
 fs/xfs/xfs_buf.h     |   2 -
 fs/xfs/xfs_buf_mem.c |   9 +--
 3 files changed, 45 insertions(+), 130 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 303945554415..6d6bad80722e 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -282,29 +282,6 @@ _xfs_buf_alloc(
 	return 0;
 }
 
-static void
-xfs_buf_free_folios(
-	struct xfs_buf	*bp)
-{
-	uint		i;
-
-	ASSERT(bp->b_flags & _XBF_FOLIOS);
-
-	if (xfs_buf_is_vmapped(bp))
-		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
-
-	for (i = 0; i < bp->b_folio_count; i++) {
-		if (bp->b_folios[i])
-			__folio_put(bp->b_folios[i]);
-	}
-	mm_account_reclaimed_pages(bp->b_folio_count);
-
-	if (bp->b_folios != bp->b_folio_array)
-		kfree(bp->b_folios);
-	bp->b_folios = NULL;
-	bp->b_flags &= ~_XBF_FOLIOS;
-}
-
 static void
 xfs_buf_free_callback(
 	struct callback_head	*cb)
@@ -323,13 +300,22 @@ xfs_buf_free(
 
 	ASSERT(list_empty(&bp->b_lru));
 
-	if (xfs_buftarg_is_mem(bp->b_target))
+	if (xfs_buftarg_is_mem(bp->b_target)) {
 		xmbuf_unmap_folio(bp);
-	else if (bp->b_flags & _XBF_FOLIOS)
-		xfs_buf_free_folios(bp);
-	else if (bp->b_flags & _XBF_KMEM)
-		kfree(bp->b_addr);
+		goto free;
+	}
 
+	if (!(bp->b_flags & _XBF_KMEM))
+		mm_account_reclaimed_pages(bp->b_folio_count);
+
+	if (bp->b_flags & _XBF_FOLIOS)
+		__folio_put(kmem_to_folio(bp->b_addr));
+	else
+		kvfree(bp->b_addr);
+
+	bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
+
+free:
 	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
 }
 
@@ -356,8 +342,6 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_folios = bp->b_folio_array;
-	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
@@ -377,14 +361,15 @@ xfs_buf_alloc_folio(
 	struct xfs_buf	*bp,
 	gfp_t		gfp_mask)
 {
+	struct folio	*folio;
 	int		length = BBTOB(bp->b_length);
 	int		order = get_order(length);
 
-	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
-	if (!bp->b_folio_array[0])
+	folio = folio_alloc(gfp_mask, order);
+	if (!folio)
 		return false;
 
-	bp->b_folios = bp->b_folio_array;
+	bp->b_addr = folio_address(folio);
 	bp->b_folio_count = 1;
 	bp->b_flags |= _XBF_FOLIOS;
 	return true;
@@ -400,15 +385,11 @@ xfs_buf_alloc_folio(
  * contiguous memory region that we don't have to map and unmap to access the
  * data directly.
  *
- * The second type of buffer is the multi-folio buffer. These are *always* made
- * up of single page folios so that they can be fed to vmap_ram() to return a
- * contiguous memory region we can access the data through.
- *
- * We don't use high order folios for this second type of buffer (yet) because
- * having variable size folios makes offset-to-folio indexing and iteration of
- * the data range more complex than if they are fixed size. This case should now
- * be the slow path, though, so unless we regularly fail to allocate high order
- * folios, there should be little need to optimise this path.
+ * The second type of buffer is the vmalloc()d buffer. This provides the buffer
+ * with the required contiguous memory region but backed by discontiguous
+ * physical pages. vmalloc() typically doesn't fail, but it can and so we may
+ * need to wrap the allocation in a loop to prevent low memory failures and
+ * shutdowns.
  */
 static int
 xfs_buf_alloc_folios(
@@ -416,7 +397,7 @@ xfs_buf_alloc_folios(
 	xfs_buf_flags_t	flags)
 {
 	gfp_t		gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOWARN;
-	long		filled = 0;
+	unsigned	nofs_flag;
 
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
@@ -425,89 +406,32 @@ xfs_buf_alloc_folios(
 	if (!(flags & XBF_READ))
 		gfp_mask |= __GFP_ZERO;
 
-	/* Optimistically attempt a single high order folio allocation. */
-	if (xfs_buf_alloc_folio(bp, gfp_mask))
-		return 0;
-
 	/* Fall back to allocating an array of single page folios. */
 	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
-	if (bp->b_folio_count <= XB_FOLIOS) {
-		bp->b_folios = bp->b_folio_array;
-	} else {
-		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
-					gfp_mask);
-		if (!bp->b_folios)
-			return -ENOMEM;
-	}
-	bp->b_flags |= _XBF_FOLIOS;
 
+	/* Optimistically attempt a single high order folio allocation. */
+	if (xfs_buf_alloc_folio(bp, gfp_mask))
+		return 0;
+
+	/* We are done if an order-0 allocation has already failed. */
+	if (bp->b_folio_count == 1)
+		return -ENOMEM;
 
 	/*
-	 * Bulk filling of pages can take multiple calls. Not filling the entire
-	 * array is not an allocation failure, so don't back off if we get at
-	 * least one extra page.
+	 * XXX(dgc): I think dquot reclaim is the only place we can get
+	 * to this function from memory reclaim context now. If we fix
+	 * that like we've fixed inode reclaim to avoid writeback from
+	 * reclaim, this nofs wrapping can go away.
 	 */
-	for (;;) {
-		long	last = filled;
-
-		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
-						(struct page **)bp->b_folios);
-		if (filled == bp->b_folio_count) {
-			XFS_STATS_INC(bp->b_mount, xb_page_found);
-			break;
-		}
-
-		if (filled != last)
-			continue;
-
-		if (flags & XBF_READ_AHEAD) {
-			xfs_buf_free_folios(bp);
-			return -ENOMEM;
-		}
-
-		XFS_STATS_INC(bp->b_mount, xb_page_retries);
-		memalloc_retry_wait(gfp_mask);
-	}
-
-	if (bp->b_folio_count == 1) {
-		/* A single folio buffer is always mappable */
-		bp->b_addr = folio_address(bp->b_folios[0]);
-	} else {
-		int retried = 0;
-		unsigned nofs_flag;
-
-		/*
-		 * vm_map_ram() will allocate auxiliary structures (e.g.
-		 * pagetables) with GFP_KERNEL, yet we often under a scoped nofs
-		 * context here. Mixing GFP_KERNEL with GFP_NOFS allocations
-		 * from the same call site that can be run from both above and
-		 * below memory reclaim causes lockdep false positives. Hence we
-		 * always need to force this allocation to nofs context because
-		 * we can't pass __GFP_NOLOCKDEP down to auxillary structures to
-		 * prevent false positive lockdep reports.
-		 *
-		 * XXX(dgc): I think dquot reclaim is the only place we can get
-		 * to this function from memory reclaim context now. If we fix
-		 * that like we've fixed inode reclaim to avoid writeback from
-		 * reclaim, this nofs wrapping can go away.
-		 */
-		nofs_flag = memalloc_nofs_save();
-		do {
-			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
-					bp->b_folio_count, -1);
-			if (bp->b_addr)
-				break;
-			vm_unmap_aliases();
-		} while (retried++ <= 1);
-		memalloc_nofs_restore(nofs_flag);
-
-		if (!bp->b_addr) {
-			xfs_warn_ratelimited(bp->b_mount,
-				"%s: failed to map %u folios", __func__,
-				bp->b_folio_count);
-			xfs_buf_free_folios(bp);
-			return -ENOMEM;
-		}
+	nofs_flag = memalloc_nofs_save();
+	bp->b_addr = __vmalloc(BBTOB(bp->b_length), gfp_mask);
+	memalloc_nofs_restore(nofs_flag);
+
+	if (!bp->b_addr) {
+		xfs_warn_ratelimited(bp->b_mount,
+			"%s: failed to allocate %u folios", __func__,
+			bp->b_folio_count);
+		return -ENOMEM;
 	}
 
 	return 0;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 4d515407713b..68c24947ca1a 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -190,8 +190,6 @@ struct xfs_buf {
 	struct xfs_buf_log_item	*b_log_item;
 	struct list_head	b_li_list;	/* Log items list head */
 	struct xfs_trans	*b_transp;
-	struct folio		**b_folios;	/* array of folio pointers */
-	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
 	struct xfs_buf_map	*b_maps;	/* compound buffer map */
 	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
 	int			b_map_count;
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 26734c64c10e..336e7c8effb7 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -169,8 +169,6 @@ xmbuf_map_folio(
 	unlock_page(page);
 
 	bp->b_addr = page_address(page);
-	bp->b_folios = bp->b_folio_array;
-	bp->b_folios[0] = folio;
 	bp->b_folio_count = 1;
 	return 0;
 }
@@ -180,15 +178,10 @@ void
 xmbuf_unmap_folio(
 	struct xfs_buf		*bp)
 {
-	struct folio		*folio = bp->b_folios[0];
-
 	ASSERT(xfs_buftarg_is_mem(bp->b_target));
 
-	folio_put(folio);
-
+	folio_put(kmem_to_folio(bp->b_addr));
 	bp->b_addr = NULL;
-	bp->b_folios[0] = NULL;
-	bp->b_folios = NULL;
 	bp->b_folio_count = 0;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 9/9] xfs: rename bp->b_folio_count
  2024-03-18 22:45 [PATCH v2 0/9] xfs: use large folios for buffers Dave Chinner
                   ` (7 preceding siblings ...)
  2024-03-18 22:45 ` [PATCH 8/9] xfs: use vmalloc for multi-folio buffers Dave Chinner
@ 2024-03-18 22:46 ` Dave Chinner
  2024-03-19  7:37   ` Christoph Hellwig
  2024-03-19  0:24 ` [PATCH v2 0/9] xfs: use large folios for buffers Christoph Hellwig
  9 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2024-03-18 22:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The count is used purely to allocate the correct number of bvecs for
submitting IO. Rename it to b_bvec_count.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c     | 38 +++++++++++++++++++++-----------------
 fs/xfs/xfs_buf.h     |  2 +-
 fs/xfs/xfs_buf_mem.c |  4 ++--
 3 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6d6bad80722e..2a6796c48454 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -69,15 +69,14 @@ static inline bool xfs_buf_is_uncached(struct xfs_buf *bp)
 /*
  * Return true if the buffer is vmapped.
  *
- * b_addr is null if the buffer is not mapped, but the code is clever enough to
- * know it doesn't have to map a single folio, so the check has to be both for
- * b_addr and bp->b_folio_count > 1.
+ * b_addr is always set, so we have to look at bp->b_bvec_count to determine if
+ * the buffer was vmalloc()d or not.
  */
 static inline int
 xfs_buf_is_vmapped(
 	struct xfs_buf	*bp)
 {
-	return bp->b_addr && bp->b_folio_count > 1;
+	return bp->b_bvec_count > 1;
 }
 
 /*
@@ -88,7 +87,7 @@ static inline int
 xfs_buf_vmap_len(
 	struct xfs_buf	*bp)
 {
-	return (bp->b_folio_count * PAGE_SIZE);
+	return (bp->b_bvec_count * PAGE_SIZE);
 }
 
 /*
@@ -306,7 +305,7 @@ xfs_buf_free(
 	}
 
 	if (!(bp->b_flags & _XBF_KMEM))
-		mm_account_reclaimed_pages(bp->b_folio_count);
+		mm_account_reclaimed_pages(bp->b_bvec_count);
 
 	if (bp->b_flags & _XBF_FOLIOS)
 		__folio_put(kmem_to_folio(bp->b_addr));
@@ -342,7 +341,7 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_folio_count = 1;
+	bp->b_bvec_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
@@ -370,7 +369,7 @@ xfs_buf_alloc_folio(
 		return false;
 
 	bp->b_addr = folio_address(folio);
-	bp->b_folio_count = 1;
+	bp->b_bvec_count = 1;
 	bp->b_flags |= _XBF_FOLIOS;
 	return true;
 }
@@ -398,6 +397,7 @@ xfs_buf_alloc_folios(
 {
 	gfp_t		gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOWARN;
 	unsigned	nofs_flag;
+	unsigned int	count;
 
 	if (flags & XBF_READ_AHEAD)
 		gfp_mask |= __GFP_NORETRY;
@@ -407,16 +407,24 @@ xfs_buf_alloc_folios(
 		gfp_mask |= __GFP_ZERO;
 
 	/* Fall back to allocating an array of single page folios. */
-	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
+	count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 
 	/* Optimistically attempt a single high order folio allocation. */
 	if (xfs_buf_alloc_folio(bp, gfp_mask))
 		return 0;
 
 	/* We are done if an order-0 allocation has already failed. */
-	if (bp->b_folio_count == 1)
+	if (count == 1)
 		return -ENOMEM;
 
+	/*
+	 * Largest buffer we allocate should fit entirely in a single bio,
+	 * so warn and fail if somebody asks for a buffer larger than can
+	 * be supported.
+	 */
+	if (WARN_ON_ONCE(count > BIO_MAX_VECS))
+		return -EIO;
+
 	/*
 	 * XXX(dgc): I think dquot reclaim is the only place we can get
 	 * to this function from memory reclaim context now. If we fix
@@ -430,9 +438,10 @@ xfs_buf_alloc_folios(
 	if (!bp->b_addr) {
 		xfs_warn_ratelimited(bp->b_mount,
 			"%s: failed to allocate %u folios", __func__,
-			bp->b_folio_count);
+			count);
 		return -ENOMEM;
 	}
+	bp->b_bvec_count = count;
 
 	return 0;
 }
@@ -1483,14 +1492,9 @@ xfs_buf_ioapply_map(
 	size = min_t(unsigned int, BBTOB(bp->b_maps[map].bm_len),
 			BBTOB(bp->b_length) - *buf_offset);
 
-	if (WARN_ON_ONCE(bp->b_folio_count > BIO_MAX_VECS)) {
-		xfs_buf_ioerror(bp, -EIO);
-		return;
-	}
-
 	atomic_inc(&bp->b_io_remaining);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_folio_count, op, GFP_NOIO);
+	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_bvec_count, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 68c24947ca1a..32688525890b 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -195,7 +195,7 @@ struct xfs_buf {
 	int			b_map_count;
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
-	unsigned int		b_folio_count;	/* size of folio array */
+	unsigned int		b_bvec_count;	/* bvecs needed for IO */
 	int			b_error;	/* error code on I/O */
 
 	/*
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 336e7c8effb7..30d53ddd6e69 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -169,7 +169,7 @@ xmbuf_map_folio(
 	unlock_page(page);
 
 	bp->b_addr = page_address(page);
-	bp->b_folio_count = 1;
+	bp->b_bvec_count = 1;
 	return 0;
 }
 
@@ -182,7 +182,7 @@ xmbuf_unmap_folio(
 
 	folio_put(kmem_to_folio(bp->b_addr));
 	bp->b_addr = NULL;
-	bp->b_folio_count = 0;
+	bp->b_bvec_count = 0;
 }
 
 /* Is this a valid daddr within the buftarg? */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/9] xfs: use large folios for buffers
  2024-03-18 22:45 [PATCH v2 0/9] xfs: use large folios for buffers Dave Chinner
                   ` (8 preceding siblings ...)
  2024-03-18 22:46 ` [PATCH 9/9] xfs: rename bp->b_folio_count Dave Chinner
@ 2024-03-19  0:24 ` Christoph Hellwig
  2024-03-19  0:44   ` Dave Chinner
  9 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:51AM +1100, Dave Chinner wrote:
> Apart from those small complexities that are resolved by the end of
> the patchset, the conversion and enhancement is relatively straight
> forward.  It passes fstests on both 512 and 4096 byte sector size
> storage (512 byte sectors exercise the XBF_KMEM path which has
> non-zero bp->b_offset values) and doesn't appear to cause any
> problems with large 64kB directory buffers on 4kB page machines.

Just curious, do you have any benchmark numbers to see if this actually
improves performance?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v2 0/9] xfs: use large folios for buffers
  2024-03-19  0:24 ` [PATCH v2 0/9] xfs: use large folios for buffers Christoph Hellwig
@ 2024-03-19  0:44   ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19  0:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 18, 2024 at 05:24:10PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 09:45:51AM +1100, Dave Chinner wrote:
> > Apart from those small complexities that are resolved by the end of
> > the patchset, the conversion and enhancement is relatively straight
> > forward.  It passes fstests on both 512 and 4096 byte sector size
> > storage (512 byte sectors exercise the XBF_KMEM path which has
> > non-zero bp->b_offset values) and doesn't appear to cause any
> > problems with large 64kB directory buffers on 4kB page machines.
> 
> Just curious, do you have any benchmark numbers to see if this actually
> improves performance?

I have run some fsmark scalability tests on 64kb directory block
sizes to check that nothing fails and the numbers are in the
expected ballpark, but I haven't done any specific back to back
performance regression testing.

The reason for that is two-fold:

1. scalability on 64kb directory buffer workloads is limited by
buffer lock latency and journal size. i.e. even a 2GB journal is
too small for high concurrency and results in significant amounts of
tail pushing and the directory modifications getting stuck on
writeback of directory buffers from tail-pushing.

2. relogging 64kB directory blocks is -expensive-. Compared to a 4kB
block size, the large directory block sizes are relogged much more
frequently and the memcpy() in each relogging costs *much* more than
relogging a 4kB directory block. It also hits xlog_kvmalloc() really
hard, and that's now where we hit vmalloc scalability
issues on large dir block size workloads.

The result of these things is that there hasn't been any significant
change in performance one way or the other - what we gain in buffer
access efficiency, we give back in increased lock contention and
tail pushing latency issues...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/9] xfs: use folios in the buffer cache
  2024-03-18 22:45 ` [PATCH 2/9] xfs: use folios in the buffer cache Dave Chinner
@ 2024-03-19  6:38   ` Christoph Hellwig
  2024-03-19  6:52     ` Dave Chinner
  2024-03-19  6:53   ` Christoph Hellwig
  2024-03-19 17:15   ` Darrick J. Wong
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  6:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:53AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the use of struct pages to struct folio everywhere. This
> is just direct API conversion, no actual logic or code changes
> should result.
> 
> Note: this conversion currently assumes only single page folios are
> allocated, and because some of the MM interfaces we use take
> pointers to arrays of struct pages, the address of single page
> folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> vm_map_ram(), etc.

.. and this goes away by the end of the series.  Maybe that's worth
mentioning here?

Otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/9] xfs: use folios in the buffer cache
  2024-03-19  6:38   ` Christoph Hellwig
@ 2024-03-19  6:52     ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19  6:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 18, 2024 at 11:38:40PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 09:45:53AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Convert the use of struct pages to struct folio everywhere. This
> > is just direct API conversion, no actual logic or code changes
> > should result.
> > 
> > Note: this conversion currently assumes only single page folios are
> > allocated, and because some of the MM interfaces we use take
> > pointers to arrays of struct pages, the address of single page
> > folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> > vm_map_ram(), etc.
> 
> .. and this goes away by the end of the series.  Maybe that's worth
> mentioning here?

Ah, yeah. I remembered to update the cover letter with this
information, so 1 out of 2 ain't bad :)

> Otherwise this looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/9] xfs: use folios in the buffer cache
  2024-03-18 22:45 ` [PATCH 2/9] xfs: use folios in the buffer cache Dave Chinner
  2024-03-19  6:38   ` Christoph Hellwig
@ 2024-03-19  6:53   ` Christoph Hellwig
  2024-03-19 21:42     ` Dave Chinner
  2024-03-19 21:42     ` Dave Chinner
  2024-03-19 17:15   ` Darrick J. Wong
  2 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  6:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

So while this looks good to me,

> +	for (i = 0; i < bp->b_folio_count; i++) {
> +		if (bp->b_folios[i])
> +			__folio_put(bp->b_folios[i]);

The __folio_put here really needs to be folio_put or page alloc
debugging gets very unhappy.
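
i.e. presumably just this (sketch of the suggested one-liner, untested):

		if (bp->b_folios[i])
			folio_put(bp->b_folios[i]);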

But even with that fixed on top of this patch the first mount just hangs
without a useful kernel backtrace in /proc/*/stack, although running
with the entire series applied it does pass the basic sanity checking
so far.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-18 22:45 ` [PATCH 3/9] xfs: convert buffer cache to use high order folios Dave Chinner
@ 2024-03-19  6:55   ` Christoph Hellwig
  2024-03-19 17:29   ` Darrick J. Wong
  2024-03-22  8:02   ` Pankaj Raghav (Samsung)
  2 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  6:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

This looks mostly good to me.  One question:
xfs_buf_free passes the folio count to mm_account_reclaimed_pages.
The name and implementation suggest it actually wants a count of
page size units, though.
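
For the high order folio case that would presumably be something like
this (sketch only, untested - it converts the buffer length to whole
pages rather than counting folios):

	/* account all pages backing the buffer, not the folio count */
	mm_account_reclaimed_pages(
			DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE));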


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/9] xfs: buffer items don't straddle pages anymore
  2024-03-18 22:45 ` [PATCH 5/9] xfs: buffer items don't straddle pages anymore Dave Chinner
@ 2024-03-19  6:56   ` Christoph Hellwig
  2024-03-19 17:31   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  6:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 9/9] xfs: rename bp->b_folio_count
  2024-03-18 22:46 ` [PATCH 9/9] xfs: rename bp->b_folio_count Dave Chinner
@ 2024-03-19  7:37   ` Christoph Hellwig
  2024-03-19 23:59     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19  7:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:46:00AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The count is used purely to allocate the correct number of bvecs for
> submitting IO. Rename it to b_bvec_count.

Well, I think we should just kill it as it simply is the rounded
up length in PAGE_SIZE units.  The patch below passes a quick xfstests
run and is on top of this series:

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2a6796c48454f7..8ecf88b5504c18 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -67,27 +67,17 @@ static inline bool xfs_buf_is_uncached(struct xfs_buf *bp)
 }
 
 /*
- * Return true if the buffer is vmapped.
- *
- * b_addr is always set, so we have to look at bp->b_bvec_count to determine if
- * the buffer was vmalloc()d or not.
+ * See comment above xfs_buf_alloc_folios() about the constraints placed on
+ * allocating vmapped buffers.
  */
-static inline int
-xfs_buf_is_vmapped(
-	struct xfs_buf	*bp)
+static inline unsigned int xfs_buf_vmap_len(struct xfs_buf *bp)
 {
-	return bp->b_bvec_count > 1;
+	return roundup(BBTOB(bp->b_length), PAGE_SIZE);
 }
 
-/*
- * See comment above xfs_buf_alloc_folios() about the constraints placed on
- * allocating vmapped buffers.
- */
-static inline int
-xfs_buf_vmap_len(
-	struct xfs_buf	*bp)
+static inline unsigned int xfs_buf_nr_pages(struct xfs_buf *bp)
 {
-	return (bp->b_bvec_count * PAGE_SIZE);
+	return DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
 }
 
 /*
@@ -304,13 +294,15 @@ xfs_buf_free(
 		goto free;
 	}
 
-	if (!(bp->b_flags & _XBF_KMEM))
-		mm_account_reclaimed_pages(bp->b_bvec_count);
-
-	if (bp->b_flags & _XBF_FOLIOS)
-		__folio_put(kmem_to_folio(bp->b_addr));
-	else
+	if (bp->b_flags & _XBF_FOLIOS) {
+		/* XXX: should this pass xfs_buf_nr_pages()? */
+		mm_account_reclaimed_pages(1);
+		folio_put(kmem_to_folio(bp->b_addr));
+	} else {
+		if (!(bp->b_flags & _XBF_KMEM))
+			mm_account_reclaimed_pages(xfs_buf_nr_pages(bp));
 		kvfree(bp->b_addr);
+	}
 
 	bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
 
@@ -341,7 +333,6 @@ xfs_buf_alloc_kmem(
 		bp->b_addr = NULL;
 		return -ENOMEM;
 	}
-	bp->b_bvec_count = 1;
 	bp->b_flags |= _XBF_KMEM;
 	return 0;
 }
@@ -369,7 +360,6 @@ xfs_buf_alloc_folio(
 		return false;
 
 	bp->b_addr = folio_address(folio);
-	bp->b_bvec_count = 1;
 	bp->b_flags |= _XBF_FOLIOS;
 	return true;
 }
@@ -441,7 +431,6 @@ xfs_buf_alloc_folios(
 			count);
 		return -ENOMEM;
 	}
-	bp->b_bvec_count = count;
 
 	return 0;
 }
@@ -1470,7 +1459,9 @@ xfs_buf_bio_end_io(
 		cmpxchg(&bp->b_io_error, 0, error);
 	}
 
-	if (!bp->b_error && xfs_buf_is_vmapped(bp) && (bp->b_flags & XBF_READ))
+	if (!bp->b_error &&
+	    (bp->b_flags & XBF_READ) &&
+	    is_vmalloc_addr(bp->b_addr))
 		invalidate_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
 
 	if (atomic_dec_and_test(&bp->b_io_remaining) == 1)
@@ -1485,6 +1476,7 @@ xfs_buf_ioapply_map(
 	unsigned int	*buf_offset,
 	blk_opf_t	op)
 {
+	unsigned int	nr_vecs = 1;
 	struct bio	*bio;
 	int		size;
 
@@ -1494,7 +1486,9 @@ xfs_buf_ioapply_map(
 
 	atomic_inc(&bp->b_io_remaining);
 
-	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_bvec_count, op, GFP_NOIO);
+	if (is_vmalloc_addr(bp->b_addr))
+		nr_vecs = xfs_buf_nr_pages(bp);
+	bio = bio_alloc(bp->b_target->bt_bdev, nr_vecs, op, GFP_NOIO);
 	bio->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
 	bio->bi_end_io = xfs_buf_bio_end_io;
 	bio->bi_private = bp;
@@ -1511,7 +1505,7 @@ xfs_buf_ioapply_map(
 		*buf_offset += len;
 	} while (size);
 
-	if (xfs_buf_is_vmapped(bp))
+	if (is_vmalloc_addr(bp->b_addr))
 		flush_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
 	submit_bio(bio);
 }
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 32688525890bec..ad92d11f4ae173 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -195,7 +195,6 @@ struct xfs_buf {
 	int			b_map_count;
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
-	unsigned int		b_bvec_count;	/* bvecs needed for IO */
 	int			b_error;	/* error code on I/O */
 
 	/*
diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
index 30d53ddd6e6980..f082b1a64fc950 100644
--- a/fs/xfs/xfs_buf_mem.c
+++ b/fs/xfs/xfs_buf_mem.c
@@ -169,7 +169,6 @@ xmbuf_map_folio(
 	unlock_page(page);
 
 	bp->b_addr = page_address(page);
-	bp->b_bvec_count = 1;
 	return 0;
 }
 
@@ -182,7 +181,6 @@ xmbuf_unmap_folio(
 
 	folio_put(kmem_to_folio(bp->b_addr));
 	bp->b_addr = NULL;
-	bp->b_bvec_count = 0;
 }
 
 /* Is this a valid daddr within the buftarg? */

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/9] xfs: use folios in the buffer cache
  2024-03-18 22:45 ` [PATCH 2/9] xfs: use folios in the buffer cache Dave Chinner
  2024-03-19  6:38   ` Christoph Hellwig
  2024-03-19  6:53   ` Christoph Hellwig
@ 2024-03-19 17:15   ` Darrick J. Wong
  2 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:53AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Convert the use of struct pages to struct folio everywhere. This
> is just direct API conversion, no actual logic or code changes
> should result.
> 
> Note: this conversion currently assumes only single page folios are
> allocated, and because some of the MM interfaces we use take
> pointers to arrays of struct pages, the address of single page
> folios and struct pages are the same. e.g alloc_pages_bulk_array(),
> vm_map_ram(), etc.
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks pretty straightforward,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c      | 132 +++++++++++++++++++++---------------------
>  fs/xfs/xfs_buf.h      |  14 ++---
>  fs/xfs/xfs_buf_item.c |   2 +-
>  fs/xfs/xfs_buf_mem.c  |  30 +++++-----
>  fs/xfs/xfs_buf_mem.h  |   8 +--
>  fs/xfs/xfs_linux.h    |   8 +++
>  6 files changed, 101 insertions(+), 93 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1a18c381127e..832226385154 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -66,25 +66,25 @@ static inline bool xfs_buf_is_uncached(struct xfs_buf *bp)
>  	return bp->b_rhash_key == XFS_BUF_DADDR_NULL;
>  }
>  
> +/*
> + * Return true if the buffer is vmapped.
> + *
> + * b_addr is null if the buffer is not mapped, but the code is clever enough to
> + * know it doesn't have to map a single folio, so the check has to be both for
> + * b_addr and bp->b_folio_count > 1.
> + */
>  static inline int
>  xfs_buf_is_vmapped(
>  	struct xfs_buf	*bp)
>  {
> -	/*
> -	 * Return true if the buffer is vmapped.
> -	 *
> -	 * b_addr is null if the buffer is not mapped, but the code is clever
> -	 * enough to know it doesn't have to map a single page, so the check has
> -	 * to be both for b_addr and bp->b_page_count > 1.
> -	 */
> -	return bp->b_addr && bp->b_page_count > 1;
> +	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
>  {
> -	return (bp->b_page_count * PAGE_SIZE);
> +	return (bp->b_folio_count * PAGE_SIZE);
>  }
>  
>  /*
> @@ -203,7 +203,7 @@ xfs_buf_get_maps(
>  }
>  
>  /*
> - *	Frees b_pages if it was allocated.
> + *	Frees b_maps if it was allocated.
>   */
>  static void
>  xfs_buf_free_maps(
> @@ -279,26 +279,26 @@ _xfs_buf_alloc(
>  }
>  
>  static void
> -xfs_buf_free_pages(
> +xfs_buf_free_folios(
>  	struct xfs_buf	*bp)
>  {
>  	uint		i;
>  
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  
>  	if (xfs_buf_is_vmapped(bp))
> -		vm_unmap_ram(bp->b_addr, bp->b_page_count);
> +		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
>  
> -	for (i = 0; i < bp->b_page_count; i++) {
> -		if (bp->b_pages[i])
> -			__free_page(bp->b_pages[i]);
> +	for (i = 0; i < bp->b_folio_count; i++) {
> +		if (bp->b_folios[i])
> +			__folio_put(bp->b_folios[i]);
>  	}
> -	mm_account_reclaimed_pages(bp->b_page_count);
> +	mm_account_reclaimed_pages(bp->b_folio_count);
>  
> -	if (bp->b_pages != bp->b_page_array)
> -		kfree(bp->b_pages);
> -	bp->b_pages = NULL;
> -	bp->b_flags &= ~_XBF_PAGES;
> +	if (bp->b_folios != bp->b_folio_array)
> +		kfree(bp->b_folios);
> +	bp->b_folios = NULL;
> +	bp->b_flags &= ~_XBF_FOLIOS;
>  }
>  
>  static void
> @@ -320,9 +320,9 @@ xfs_buf_free(
>  	ASSERT(list_empty(&bp->b_lru));
>  
>  	if (xfs_buftarg_is_mem(bp->b_target))
> -		xmbuf_unmap_page(bp);
> -	else if (bp->b_flags & _XBF_PAGES)
> -		xfs_buf_free_pages(bp);
> +		xmbuf_unmap_folio(bp);
> +	else if (bp->b_flags & _XBF_FOLIOS)
> +		xfs_buf_free_folios(bp);
>  	else if (bp->b_flags & _XBF_KMEM)
>  		kfree(bp->b_addr);
>  
> @@ -353,15 +353,15 @@ xfs_buf_alloc_kmem(
>  		return -ENOMEM;
>  	}
>  	bp->b_offset = offset_in_page(bp->b_addr);
> -	bp->b_pages = bp->b_page_array;
> -	bp->b_pages[0] = kmem_to_page(bp->b_addr);
> -	bp->b_page_count = 1;
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
>  static int
> -xfs_buf_alloc_pages(
> +xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
>  	xfs_buf_flags_t	flags)
>  {
> @@ -372,16 +372,16 @@ xfs_buf_alloc_pages(
>  		gfp_mask |= __GFP_NORETRY;
>  
>  	/* Make sure that we have a page list */
> -	bp->b_page_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> -	if (bp->b_page_count <= XB_PAGES) {
> -		bp->b_pages = bp->b_page_array;
> +	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> +	if (bp->b_folio_count <= XB_FOLIOS) {
> +		bp->b_folios = bp->b_folio_array;
>  	} else {
> -		bp->b_pages = kzalloc(sizeof(struct page *) * bp->b_page_count,
> +		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
>  					gfp_mask);
> -		if (!bp->b_pages)
> +		if (!bp->b_folios)
>  			return -ENOMEM;
>  	}
> -	bp->b_flags |= _XBF_PAGES;
> +	bp->b_flags |= _XBF_FOLIOS;
>  
>  	/* Assure zeroed buffer for non-read cases. */
>  	if (!(flags & XBF_READ))
> @@ -395,9 +395,9 @@ xfs_buf_alloc_pages(
>  	for (;;) {
>  		long	last = filled;
>  
> -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_page_count,
> -						bp->b_pages);
> -		if (filled == bp->b_page_count) {
> +		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> +						(struct page **)bp->b_folios);
> +		if (filled == bp->b_folio_count) {
>  			XFS_STATS_INC(bp->b_mount, xb_page_found);
>  			break;
>  		}
> @@ -406,7 +406,7 @@ xfs_buf_alloc_pages(
>  			continue;
>  
>  		if (flags & XBF_READ_AHEAD) {
> -			xfs_buf_free_pages(bp);
> +			xfs_buf_free_folios(bp);
>  			return -ENOMEM;
>  		}
>  
> @@ -420,14 +420,14 @@ xfs_buf_alloc_pages(
>   *	Map buffer into kernel address-space if necessary.
>   */
>  STATIC int
> -_xfs_buf_map_pages(
> +_xfs_buf_map_folios(
>  	struct xfs_buf		*bp,
>  	xfs_buf_flags_t		flags)
>  {
> -	ASSERT(bp->b_flags & _XBF_PAGES);
> -	if (bp->b_page_count == 1) {
> +	ASSERT(bp->b_flags & _XBF_FOLIOS);
> +	if (bp->b_folio_count == 1) {
>  		/* A single page buffer is always mappable */
> -		bp->b_addr = page_address(bp->b_pages[0]);
> +		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
>  	} else {
> @@ -451,8 +451,8 @@ _xfs_buf_map_pages(
>  		 */
>  		nofs_flag = memalloc_nofs_save();
>  		do {
> -			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> -						-1);
> +			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
> +					bp->b_folio_count, -1);
>  			if (bp->b_addr)
>  				break;
>  			vm_unmap_aliases();
> @@ -579,7 +579,7 @@ xfs_buf_find_lock(
>  			return -ENOENT;
>  		}
>  		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> -		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
> +		bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
>  		bp->b_ops = NULL;
>  	}
>  	return 0;
> @@ -638,16 +638,16 @@ xfs_buf_find_insert(
>  		goto out_drop_pag;
>  
>  	if (xfs_buftarg_is_mem(new_bp->b_target)) {
> -		error = xmbuf_map_page(new_bp);
> +		error = xmbuf_map_folio(new_bp);
>  	} else if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
>  		   xfs_buf_alloc_kmem(new_bp, flags) < 0) {
>  		/*
> -		 * For buffers that fit entirely within a single page, first
> -		 * attempt to allocate the memory from the heap to minimise
> -		 * memory usage. If we can't get heap memory for these small
> -		 * buffers, we fall back to using the page allocator.
> +		 * For buffers that fit entirely within a single page folio,
> +		 * first attempt to allocate the memory from the heap to
> +		 * minimise memory usage. If we can't get heap memory for these
> +		 * small buffers, we fall back to using the page allocator.
>  		 */
> -		error = xfs_buf_alloc_pages(new_bp, flags);
> +		error = xfs_buf_alloc_folios(new_bp, flags);
>  	}
>  	if (error)
>  		goto out_free_buf;
> @@ -764,11 +764,11 @@ xfs_buf_get_map(
>  
>  	/* We do not hold a perag reference anymore. */
>  	if (!bp->b_addr) {
> -		error = _xfs_buf_map_pages(bp, flags);
> +		error = _xfs_buf_map_folios(bp, flags);
>  		if (unlikely(error)) {
>  			xfs_warn_ratelimited(btp->bt_mount,
> -				"%s: failed to map %u pages", __func__,
> -				bp->b_page_count);
> +				"%s: failed to map %u folios", __func__,
> +				bp->b_folio_count);
>  			xfs_buf_relse(bp);
>  			return error;
>  		}
> @@ -1008,16 +1008,16 @@ xfs_buf_get_uncached(
>  		return error;
>  
>  	if (xfs_buftarg_is_mem(bp->b_target))
> -		error = xmbuf_map_page(bp);
> +		error = xmbuf_map_folio(bp);
>  	else
> -		error = xfs_buf_alloc_pages(bp, flags);
> +		error = xfs_buf_alloc_folios(bp, flags);
>  	if (error)
>  		goto fail_free_buf;
>  
> -	error = _xfs_buf_map_pages(bp, 0);
> +	error = _xfs_buf_map_folios(bp, 0);
>  	if (unlikely(error)) {
>  		xfs_warn(target->bt_mount,
> -			"%s: failed to map pages", __func__);
> +			"%s: failed to map folios", __func__);
>  		goto fail_free_buf;
>  	}
>  
> @@ -1526,7 +1526,7 @@ xfs_buf_ioapply_map(
>  	blk_opf_t	op)
>  {
>  	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_page_count;
> +	unsigned int	total_nr_pages = bp->b_folio_count;
>  	int		nr_pages;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
> @@ -1564,7 +1564,7 @@ xfs_buf_ioapply_map(
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
> +		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
>  				      offset);
>  		if (rbytes < nbytes)
>  			break;
> @@ -1783,13 +1783,13 @@ xfs_buf_offset(
>  	struct xfs_buf		*bp,
>  	size_t			offset)
>  {
> -	struct page		*page;
> +	struct folio		*folio;
>  
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> -	page = bp->b_pages[offset >> PAGE_SHIFT];
> -	return page_address(page) + (offset & (PAGE_SIZE-1));
> +	folio = bp->b_folios[offset >> PAGE_SHIFT];
> +	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
>  
>  void
> @@ -1802,18 +1802,18 @@ xfs_buf_zero(
>  
>  	bend = boff + bsize;
>  	while (boff < bend) {
> -		struct page	*page;
> +		struct folio	*folio;
>  		int		page_index, page_offset, csize;
>  
>  		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
>  		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		page = bp->b_pages[page_index];
> +		folio = bp->b_folios[page_index];
>  		csize = min_t(size_t, PAGE_SIZE - page_offset,
>  				      BBTOB(bp->b_length) - boff);
>  
>  		ASSERT((csize + page_offset) <= PAGE_SIZE);
>  
> -		memset(page_address(page) + page_offset, 0, csize);
> +		memset(folio_address(folio) + page_offset, 0, csize);
>  
>  		boff += csize;
>  	}
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index b1580644501f..f059ae3d2755 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -29,7 +29,7 @@ struct xfs_buf;
>  #define XBF_READ_AHEAD	 (1u << 2) /* asynchronous read-ahead */
>  #define XBF_NO_IOACCT	 (1u << 3) /* bypass I/O accounting (non-LRU bufs) */
>  #define XBF_ASYNC	 (1u << 4) /* initiator will not wait for completion */
> -#define XBF_DONE	 (1u << 5) /* all pages in the buffer uptodate */
> +#define XBF_DONE	 (1u << 5) /* all folios in the buffer uptodate */
>  #define XBF_STALE	 (1u << 6) /* buffer has been staled, do not find it */
>  #define XBF_WRITE_FAIL	 (1u << 7) /* async writes have failed on this buffer */
>  
> @@ -39,7 +39,7 @@ struct xfs_buf;
>  #define _XBF_LOGRECOVERY (1u << 18)/* log recovery buffer */
>  
>  /* flags used only internally */
> -#define _XBF_PAGES	 (1u << 20)/* backed by refcounted pages */
> +#define _XBF_FOLIOS	 (1u << 20)/* backed by refcounted folios */
>  #define _XBF_KMEM	 (1u << 21)/* backed by heap memory */
>  #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
>  
> @@ -68,7 +68,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	{ _XBF_INODES,		"INODES" }, \
>  	{ _XBF_DQUOTS,		"DQUOTS" }, \
>  	{ _XBF_LOGRECOVERY,	"LOG_RECOVERY" }, \
> -	{ _XBF_PAGES,		"PAGES" }, \
> +	{ _XBF_FOLIOS,		"FOLIOS" }, \
>  	{ _XBF_KMEM,		"KMEM" }, \
>  	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
>  	/* The following interface flags should never be set */ \
> @@ -128,7 +128,7 @@ struct xfs_buftarg {
>  	struct xfs_buf_cache	bt_cache[];
>  };
>  
> -#define XB_PAGES	2
> +#define XB_FOLIOS	2
>  
>  struct xfs_buf_map {
>  	xfs_daddr_t		bm_bn;	/* block number for I/O */
> @@ -192,14 +192,14 @@ struct xfs_buf {
>  	struct xfs_buf_log_item	*b_log_item;
>  	struct list_head	b_li_list;	/* Log items list head */
>  	struct xfs_trans	*b_transp;
> -	struct page		**b_pages;	/* array of page pointers */
> -	struct page		*b_page_array[XB_PAGES]; /* inline pages */
> +	struct folio		**b_folios;	/* array of folio pointers */
> +	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
>  	struct xfs_buf_map	*b_maps;	/* compound buffer map */
>  	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
>  	int			b_map_count;
>  	atomic_t		b_pin_count;	/* pin count */
>  	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
> -	unsigned int		b_page_count;	/* size of page array */
> +	unsigned int		b_folio_count;	/* size of folio array */
>  	unsigned int		b_offset;	/* page offset of b_addr,
>  						   only for _XBF_KMEM buffers */
>  	int			b_error;	/* error code on I/O */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 83a81cb52d8e..d1407cee48d9 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -69,7 +69,7 @@ xfs_buf_item_straddle(
>  {
>  	void			*first, *last;
>  
> -	if (bp->b_page_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
> +	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
>  		return false;
>  
>  	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
> index 8ad38c64708e..26734c64c10e 100644
> --- a/fs/xfs/xfs_buf_mem.c
> +++ b/fs/xfs/xfs_buf_mem.c
> @@ -74,7 +74,7 @@ xmbuf_alloc(
>  
>  	/*
>  	 * We don't want to bother with kmapping data during repair, so don't
> -	 * allow highmem pages to back this mapping.
> +	 * allow highmem folios to back this mapping.
>  	 */
>  	mapping_set_gfp_mask(inode->i_mapping, GFP_KERNEL);
>  
> @@ -127,9 +127,9 @@ xmbuf_free(
>  	kfree(btp);
>  }
>  
> -/* Directly map a shmem page into the buffer cache. */
> +/* Directly map a shmem folio into the buffer cache. */
>  int
> -xmbuf_map_page(
> +xmbuf_map_folio(
>  	struct xfs_buf		*bp)
>  {
>  	struct inode		*inode = file_inode(bp->b_target->bt_file);
> @@ -169,27 +169,27 @@ xmbuf_map_page(
>  	unlock_page(page);
>  
>  	bp->b_addr = page_address(page);
> -	bp->b_pages = bp->b_page_array;
> -	bp->b_pages[0] = page;
> -	bp->b_page_count = 1;
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folios[0] = folio;
> +	bp->b_folio_count = 1;
>  	return 0;
>  }
>  
> -/* Unmap a shmem page that was mapped into the buffer cache. */
> +/* Unmap a shmem folio that was mapped into the buffer cache. */
>  void
> -xmbuf_unmap_page(
> +xmbuf_unmap_folio(
>  	struct xfs_buf		*bp)
>  {
> -	struct page		*page = bp->b_pages[0];
> +	struct folio		*folio = bp->b_folios[0];
>  
>  	ASSERT(xfs_buftarg_is_mem(bp->b_target));
>  
> -	put_page(page);
> +	folio_put(folio);
>  
>  	bp->b_addr = NULL;
> -	bp->b_pages[0] = NULL;
> -	bp->b_pages = NULL;
> -	bp->b_page_count = 0;
> +	bp->b_folios[0] = NULL;
> +	bp->b_folios = NULL;
> +	bp->b_folio_count = 0;
>  }
>  
>  /* Is this a valid daddr within the buftarg? */
> @@ -205,7 +205,7 @@ xmbuf_verify_daddr(
>  	return daddr < (inode->i_sb->s_maxbytes >> BBSHIFT);
>  }
>  
> -/* Discard the page backing this buffer. */
> +/* Discard the folio backing this buffer. */
>  static void
>  xmbuf_stale(
>  	struct xfs_buf		*bp)
> @@ -220,7 +220,7 @@ xmbuf_stale(
>  }
>  
>  /*
> - * Finalize a buffer -- discard the backing page if it's stale, or run the
> + * Finalize a buffer -- discard the backing folio if it's stale, or run the
>   * write verifier to detect problems.
>   */
>  int
> diff --git a/fs/xfs/xfs_buf_mem.h b/fs/xfs/xfs_buf_mem.h
> index eed4a7b63232..8f4c959ff829 100644
> --- a/fs/xfs/xfs_buf_mem.h
> +++ b/fs/xfs/xfs_buf_mem.h
> @@ -19,15 +19,15 @@ int xmbuf_alloc(struct xfs_mount *mp, const char *descr,
>  		struct xfs_buftarg **btpp);
>  void xmbuf_free(struct xfs_buftarg *btp);
>  
> -int xmbuf_map_page(struct xfs_buf *bp);
> -void xmbuf_unmap_page(struct xfs_buf *bp);
> +int xmbuf_map_folio(struct xfs_buf *bp);
> +void xmbuf_unmap_folio(struct xfs_buf *bp);
>  bool xmbuf_verify_daddr(struct xfs_buftarg *btp, xfs_daddr_t daddr);
>  void xmbuf_trans_bdetach(struct xfs_trans *tp, struct xfs_buf *bp);
>  int xmbuf_finalize(struct xfs_buf *bp);
>  #else
>  # define xfs_buftarg_is_mem(...)	(false)
> -# define xmbuf_map_page(...)		(-ENOMEM)
> -# define xmbuf_unmap_page(...)		((void)0)
> +# define xmbuf_map_folio(...)		(-ENOMEM)
> +# define xmbuf_unmap_folio(...)		((void)0)
>  # define xmbuf_verify_daddr(...)	(false)
>  #endif /* CONFIG_XFS_MEMORY_BUFS */
>  
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 8f07c9f6157f..9a74ba664efb 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -280,4 +280,12 @@ kmem_to_page(void *addr)
>  	return virt_to_page(addr);
>  }
>  
> +static inline struct folio *
> +kmem_to_folio(void *addr)
> +{
> +	if (is_vmalloc_addr(addr))
> +		return page_folio(vmalloc_to_page(addr));
> +	return virt_to_folio(addr);
> +}
> +
>  #endif /* __XFS_LINUX__ */
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-18 22:45 ` [PATCH 3/9] xfs: convert buffer cache to use high order folios Dave Chinner
  2024-03-19  6:55   ` Christoph Hellwig
@ 2024-03-19 17:29   ` Darrick J. Wong
  2024-03-19 21:32     ` Christoph Hellwig
  2024-03-19 21:55     ` Dave Chinner
  2024-03-22  8:02   ` Pankaj Raghav (Samsung)
  2 siblings, 2 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the buffer cache using the folio API, we can extend
> the use of folios to allocate high order folios for multi-page
> buffers rather than an array of single pages that are then vmapped
> into a contiguous range.
> 
> This creates two types of buffers: single folio buffers that can
> have arbitrary order, and multi-folio buffers made up of many single
> page folios that get vmapped. The latter is essentially the existing
> code, so there are no logic changes to handle this case.
> 
> There are a few places where we iterate the folios on a buffer.
> These need to be converted to handle the high order folio case.
> Luckily, this only occurs when bp->b_folio_count == 1, and the code
> for handling this case is just a simple application of the folio API
> to the operations that need to be performed.
> 
> The code that allocates buffers will optimistically attempt a high
> order folio allocation as a fast path. If this high order allocation
> fails, then we fall back to the existing multi-folio allocation
> code. This now forms the slow allocation path, and hopefully will be
> largely unused in normal conditions.
> 
> This should improve performance of large buffer operations (e.g.
> large directory block sizes) as we should now mostly avoid the
> expense of vmapping large buffers (and the vmap lock contention that
> can occur) as well as avoid the runtime pressure that frequently
> accessing kernel vmapped pages put on the TLBs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 110 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 832226385154..7d9303497763 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
>  	return bp->b_addr && bp->b_folio_count > 1;
>  }
>  
> +/*
> + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> + * allocating vmapped buffers.
> + */
>  static inline int
>  xfs_buf_vmap_len(
>  	struct xfs_buf	*bp)
> @@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
>  		bp->b_addr = NULL;
>  		return -ENOMEM;
>  	}
> -	bp->b_offset = offset_in_page(bp->b_addr);
>  	bp->b_folios = bp->b_folio_array;
>  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
>  }
>  
> +/*
> + * Allocating a high order folio makes the assumption that buffers are a
> + * power-of-2 size so that ilog2() returns the exact order needed to fit
> + * the contents of the buffer. Buffer lengths are mostly a power of two,
> + * so this is not an unreasonable approach to take by default.
> + *
> + * The exception here are user xattr data buffers, which can be arbitrarily
> + * sized up to 64kB plus structure metadata. In that case, round up the order.

So.... does that mean a 128K folio for a 68k xattr remote value buffer?
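
(A quick userspace mirror of get_order() suggests the answer is yes -
with 4kB pages, a 68kB allocation rounds up to order 5, i.e. a 128kB
folio. Sketch only, PAGE_SHIFT assumed to be 12:)

	#include <stdio.h>

	/* rough userspace equivalent of the kernel's get_order() */
	static int get_order_4k(unsigned long size)
	{
		int order = 0;

		size = (size - 1) >> 12;
		while (size) {
			order++;
			size >>= 1;
		}
		return order;
	}

	int main(void)
	{
		/* 68kB remote xattr value buffer -> order 5 -> 32 pages -> 128kB */
		printf("order = %d\n", get_order_4k(68 * 1024));
		return 0;
	}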

I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
tree, which isn't awesome.  I haven't figured out a good way to deal
with that, since the verity code uses a lot of blkno/byte shifting and
2k merkle tree blocks suck.

(Just rambling, don't mind me)

> + */
> +static bool
> +xfs_buf_alloc_folio(
> +	struct xfs_buf	*bp,
> +	gfp_t		gfp_mask)
> +{
> +	int		length = BBTOB(bp->b_length);
> +	int		order = get_order(length);
> +
> +	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> +	if (!bp->b_folio_array[0])
> +		return false;
> +
> +	bp->b_folios = bp->b_folio_array;
> +	bp->b_folio_count = 1;
> +	bp->b_flags |= _XBF_FOLIOS;
> +	return true;
> +}
> +
> +/*
> + * When we allocate folios for a buffer, we end up with one of two types of
> + * buffer.
> + *
> + * The first type is a single folio buffer - this may be a high order
> + * folio or just a single page sized folio, but either way they get treated the
> + * same way by the rest of the code - the buffer memory spans a single
> + * contiguous memory region that we don't have to map and unmap to access the
> + * data directly.
> + *
> + * The second type of buffer is the multi-folio buffer. These are *always* made
> + * up of single page folios so that they can be fed to vmap_ram() to return a
> + * contiguous memory region we can access the data through, or mark it as
> + * XBF_UNMAPPED and access the data directly through individual folio_address()
> + * calls.
> + *
> + * We don't use high order folios for this second type of buffer (yet) because
> + * having variable size folios makes offset-to-folio indexing and iteration of
> + * the data range more complex than if they are fixed size. This case should now
> + * be the slow path, though, so unless we regularly fail to allocate high order
> + * folios, there should be little need to optimise this path.
> + */
>  static int
>  xfs_buf_alloc_folios(
>  	struct xfs_buf	*bp,
> @@ -371,7 +424,15 @@ xfs_buf_alloc_folios(
>  	if (flags & XBF_READ_AHEAD)
>  		gfp_mask |= __GFP_NORETRY;
>  
> -	/* Make sure that we have a page list */
> +	/* Assure zeroed buffer for non-read cases. */
> +	if (!(flags & XBF_READ))
> +		gfp_mask |= __GFP_ZERO;
> +
> +	/* Optimistically attempt a single high order folio allocation. */
> +	if (xfs_buf_alloc_folio(bp, gfp_mask))
> +		return 0;
> +
> +	/* Fall back to allocating an array of single page folios. */
>  	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
>  	if (bp->b_folio_count <= XB_FOLIOS) {
>  		bp->b_folios = bp->b_folio_array;
> @@ -383,9 +444,6 @@ xfs_buf_alloc_folios(
>  	}
>  	bp->b_flags |= _XBF_FOLIOS;
>  
> -	/* Assure zeroed buffer for non-read cases. */
> -	if (!(flags & XBF_READ))
> -		gfp_mask |= __GFP_ZERO;
>  
>  	/*
>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset is larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * single page folio in the buffer, we need to skip the folios in
> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;
> +		}
>  	}
>  
>  	/*
> @@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
>  
>  next_chunk:
>  	atomic_inc(&bp->b_io_remaining);
> -	nr_pages = bio_max_segs(total_nr_pages);
> +	nr_folios = bio_max_segs(total_nr_folios);
>  
> -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
>  	bio->bi_iter.bi_sector = sector;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
>  	bio->bi_private = bp;
>  
> -	for (; size && nr_pages; nr_pages--, page_index++) {
> -		int	rbytes, nbytes = PAGE_SIZE - offset;
> +	for (; size && nr_folios; nr_folios--, folio_index++) {
> +		struct folio	*folio = bp->b_folios[folio_index];
> +		int		nbytes = folio_size(folio) - offset;
>  
>  		if (nbytes > size)
>  			nbytes = size;
>  
> -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> -				      offset);

Um, bio_add_folio returns a bool, maybe that's why hch was complaining
about hangs with only the first few patches applied?
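
(If so, the interim fix is presumably just checking the bool return
directly rather than comparing it against nbytes, i.e. the form the
series ends up with anyway - sketch only:

		if (!bio_add_folio(bio, bp->b_folios[page_index], nbytes,
				offset))
			break;

instead of the rbytes < nbytes comparison.)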

Annoying if the kbuild robots don't catch that.  Either way, with hch's
replies to this and patch 2 dealt with,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> -		if (rbytes < nbytes)
> +		if (!bio_add_folio(bio, folio, nbytes,
> +				offset_in_folio(folio, offset)))
>  			break;
>  
>  		offset = 0;
>  		sector += BTOBB(nbytes);
>  		size -= nbytes;
> -		total_nr_pages--;
> +		total_nr_folios--;
>  	}
>  
>  	if (likely(bio->bi_iter.bi_size)) {
> @@ -1788,6 +1854,13 @@ xfs_buf_offset(
>  	if (bp->b_addr)
>  		return bp->b_addr + offset;
>  
> +	/* Single folio buffers may use large folios. */
> +	if (bp->b_folio_count == 1) {
> +		folio = bp->b_folios[0];
> +		return folio_address(folio) + offset_in_folio(folio, offset);
> +	}
> +
> +	/* Multi-folio buffers always use PAGE_SIZE folios */
>  	folio = bp->b_folios[offset >> PAGE_SHIFT];
>  	return folio_address(folio) + (offset & (PAGE_SIZE-1));
>  }
> @@ -1803,18 +1876,24 @@ xfs_buf_zero(
>  	bend = boff + bsize;
>  	while (boff < bend) {
>  		struct folio	*folio;
> -		int		page_index, page_offset, csize;
> +		int		folio_index, folio_offset, csize;
>  
> -		page_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> -		page_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -		folio = bp->b_folios[page_index];
> -		csize = min_t(size_t, PAGE_SIZE - page_offset,
> +		/* Single folio buffers may use large folios. */
> +		if (bp->b_folio_count == 1) {
> +			folio = bp->b_folios[0];
> +			folio_offset = offset_in_folio(folio,
> +						bp->b_offset + boff);
> +		} else {
> +			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> +			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> +			folio = bp->b_folios[folio_index];
> +		}
> +
> +		csize = min_t(size_t, folio_size(folio) - folio_offset,
>  				      BBTOB(bp->b_length) - boff);
> +		ASSERT((csize + folio_offset) <= folio_size(folio));
>  
> -		ASSERT((csize + page_offset) <= PAGE_SIZE);
> -
> -		memset(folio_address(folio) + page_offset, 0, csize);
> -
> +		memset(folio_address(folio) + folio_offset, 0, csize);
>  		boff += csize;
>  	}
>  }
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/9] xfs: kill XBF_UNMAPPED
  2024-03-18 22:45 ` [PATCH 4/9] xfs: kill XBF_UNMAPPED Dave Chinner
@ 2024-03-19 17:30   ` Darrick J. Wong
  2024-03-19 23:36     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:55AM +1100, Dave Chinner wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Unmapped buffer access is a pain, so kill it. The switch to large
> folios means we rarely pay a vmap penalty for large buffers,
> so this functionality is largely unnecessary now.

What was the original point of unmapped buffers?  Was it optimizing for
not using vmalloc space for inode buffers on 32-bit machines?

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_ialloc.c    |  2 +-
>  fs/xfs/libxfs/xfs_inode_buf.c |  2 +-
>  fs/xfs/scrub/inode_repair.c   |  3 +-
>  fs/xfs/xfs_buf.c              | 62 ++---------------------------------
>  fs/xfs/xfs_buf.h              | 16 ++++++---
>  fs/xfs/xfs_buf_item.c         |  2 +-
>  fs/xfs/xfs_buf_item_recover.c |  8 +----
>  fs/xfs/xfs_inode.c            |  3 +-
>  8 files changed, 19 insertions(+), 79 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index e5ac3e5430c4..fa27a50f96ac 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -362,7 +362,7 @@ xfs_ialloc_inode_init(
>  				(j * M_IGEO(mp)->blocks_per_cluster));
>  		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
>  				mp->m_bsize * M_IGEO(mp)->blocks_per_cluster,
> -				XBF_UNMAPPED, &fbuf);
> +				0, &fbuf);
>  		if (error)
>  			return error;
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index d0dcce462bf4..68989f4bf793 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -136,7 +136,7 @@ xfs_imap_to_bp(
>  	int			error;
>  
>  	error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, imap->im_blkno,
> -			imap->im_len, XBF_UNMAPPED, bpp, &xfs_inode_buf_ops);
> +			imap->im_len, 0, bpp, &xfs_inode_buf_ops);
>  	if (xfs_metadata_is_sick(error))
>  		xfs_agno_mark_sick(mp, xfs_daddr_to_agno(mp, imap->im_blkno),
>  				XFS_SICK_AG_INODES);
> diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
> index eab380e95ef4..7b31f1ad194f 100644
> --- a/fs/xfs/scrub/inode_repair.c
> +++ b/fs/xfs/scrub/inode_repair.c
> @@ -1309,8 +1309,7 @@ xrep_dinode_core(
>  
>  	/* Read the inode cluster buffer. */
>  	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
> -			ri->imap.im_blkno, ri->imap.im_len, XBF_UNMAPPED, &bp,
> -			NULL);
> +			ri->imap.im_blkno, ri->imap.im_len, 0, &bp, NULL);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7d9303497763..2cd3671f3ce3 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -239,7 +239,7 @@ _xfs_buf_alloc(
>  	 * We don't want certain flags to appear in b_flags unless they are
>  	 * specifically set by later operations on the buffer.
>  	 */
> -	flags &= ~(XBF_UNMAPPED | XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
> +	flags &= ~(XBF_TRYLOCK | XBF_ASYNC | XBF_READ_AHEAD);
>  
>  	atomic_set(&bp->b_hold, 1);
>  	atomic_set(&bp->b_lru_ref, 1);
> @@ -403,9 +403,7 @@ xfs_buf_alloc_folio(
>   *
>   * The second type of buffer is the multi-folio buffer. These are *always* made
>   * up of single page folios so that they can be fed to vmap_ram() to return a
> - * contiguous memory region we can access the data through, or mark it as
> - * XBF_UNMAPPED and access the data directly through individual folio_address()
> - * calls.
> + * contiguous memory region we can access the data through.
>   *
>   * We don't use high order folios for this second type of buffer (yet) because
>   * having variable size folios makes offset-to-folio indexing and iteration of
> @@ -486,8 +484,6 @@ _xfs_buf_map_folios(
>  	if (bp->b_folio_count == 1) {
>  		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
> -	} else if (flags & XBF_UNMAPPED) {
> -		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
>  		unsigned nofs_flag;
> @@ -1844,60 +1840,6 @@ __xfs_buf_submit(
>  	return error;
>  }
>  
> -void *
> -xfs_buf_offset(
> -	struct xfs_buf		*bp,
> -	size_t			offset)
> -{
> -	struct folio		*folio;
> -
> -	if (bp->b_addr)
> -		return bp->b_addr + offset;
> -
> -	/* Single folio buffers may use large folios. */
> -	if (bp->b_folio_count == 1) {
> -		folio = bp->b_folios[0];
> -		return folio_address(folio) + offset_in_folio(folio, offset);
> -	}
> -
> -	/* Multi-folio buffers always use PAGE_SIZE folios */
> -	folio = bp->b_folios[offset >> PAGE_SHIFT];
> -	return folio_address(folio) + (offset & (PAGE_SIZE-1));
> -}
> -
> -void
> -xfs_buf_zero(
> -	struct xfs_buf		*bp,
> -	size_t			boff,
> -	size_t			bsize)
> -{
> -	size_t			bend;
> -
> -	bend = boff + bsize;
> -	while (boff < bend) {
> -		struct folio	*folio;
> -		int		folio_index, folio_offset, csize;
> -
> -		/* Single folio buffers may use large folios. */
> -		if (bp->b_folio_count == 1) {
> -			folio = bp->b_folios[0];
> -			folio_offset = offset_in_folio(folio,
> -						bp->b_offset + boff);
> -		} else {
> -			folio_index = (boff + bp->b_offset) >> PAGE_SHIFT;
> -			folio_offset = (boff + bp->b_offset) & ~PAGE_MASK;
> -			folio = bp->b_folios[folio_index];
> -		}
> -
> -		csize = min_t(size_t, folio_size(folio) - folio_offset,
> -				      BBTOB(bp->b_length) - boff);
> -		ASSERT((csize + folio_offset) <= folio_size(folio));
> -
> -		memset(folio_address(folio) + folio_offset, 0, csize);
> -		boff += csize;
> -	}
> -}
> -
>  /*
>   * Log a message about and stale a buffer that a caller has decided is corrupt.
>   *
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index f059ae3d2755..aef7015cf9f3 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -51,7 +51,6 @@ struct xfs_buf;
>  #define XBF_LIVESCAN	 (1u << 28)
>  #define XBF_INCORE	 (1u << 29)/* lookup only, return if found in cache */
>  #define XBF_TRYLOCK	 (1u << 30)/* lock requested, but do not wait */
> -#define XBF_UNMAPPED	 (1u << 31)/* do not map the buffer */
>  
>  
>  typedef unsigned int xfs_buf_flags_t;
> @@ -74,8 +73,7 @@ typedef unsigned int xfs_buf_flags_t;
>  	/* The following interface flags should never be set */ \
>  	{ XBF_LIVESCAN,		"LIVESCAN" }, \
>  	{ XBF_INCORE,		"INCORE" }, \
> -	{ XBF_TRYLOCK,		"TRYLOCK" }, \
> -	{ XBF_UNMAPPED,		"UNMAPPED" }
> +	{ XBF_TRYLOCK,		"TRYLOCK" }
>  
>  /*
>   * Internal state flags.
> @@ -320,12 +318,20 @@ extern void __xfs_buf_ioerror(struct xfs_buf *bp, int error,
>  #define xfs_buf_ioerror(bp, err) __xfs_buf_ioerror((bp), (err), __this_address)
>  extern void xfs_buf_ioerror_alert(struct xfs_buf *bp, xfs_failaddr_t fa);
>  void xfs_buf_ioend_fail(struct xfs_buf *);
> -void xfs_buf_zero(struct xfs_buf *bp, size_t boff, size_t bsize);
>  void __xfs_buf_mark_corrupt(struct xfs_buf *bp, xfs_failaddr_t fa);
>  #define xfs_buf_mark_corrupt(bp) __xfs_buf_mark_corrupt((bp), __this_address)
>  
>  /* Buffer Utility Routines */
> -extern void *xfs_buf_offset(struct xfs_buf *, size_t);
> +static inline void *xfs_buf_offset(struct xfs_buf *bp, size_t offset)
> +{
> +	return bp->b_addr + offset;
> +}
> +
> +static inline void xfs_buf_zero(struct xfs_buf *bp, size_t boff, size_t bsize)
> +{
> +	memset(bp->b_addr + boff, 0, bsize);
> +}
> +
>  extern void xfs_buf_stale(struct xfs_buf *bp);
>  
>  /* Delayed Write Buffer Routines */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index d1407cee48d9..7b66d3fe4ecd 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -69,7 +69,7 @@ xfs_buf_item_straddle(
>  {
>  	void			*first, *last;
>  
> -	if (bp->b_folio_count == 1 || !(bp->b_flags & XBF_UNMAPPED))
> +	if (bp->b_folio_count == 1)
>  		return false;
>  
>  	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> index 09e893cf563c..d74bf7bb7794 100644
> --- a/fs/xfs/xfs_buf_item_recover.c
> +++ b/fs/xfs/xfs_buf_item_recover.c
> @@ -891,7 +891,6 @@ xlog_recover_buf_commit_pass2(
>  	struct xfs_mount		*mp = log->l_mp;
>  	struct xfs_buf			*bp;
>  	int				error;
> -	uint				buf_flags;
>  	xfs_lsn_t			lsn;
>  
>  	/*
> @@ -910,13 +909,8 @@ xlog_recover_buf_commit_pass2(
>  	}
>  
>  	trace_xfs_log_recover_buf_recover(log, buf_f);
> -
> -	buf_flags = 0;
> -	if (buf_f->blf_flags & XFS_BLF_INODE_BUF)
> -		buf_flags |= XBF_UNMAPPED;
> -
>  	error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno, buf_f->blf_len,
> -			  buf_flags, &bp, NULL);
> +			  0, &bp, NULL);
>  	if (error)
>  		return error;
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ea48774f6b76..e7a724270423 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2405,8 +2405,7 @@ xfs_ifree_cluster(
>  		 * to mark all the active inodes on the buffer stale.
>  		 */
>  		error = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
> -				mp->m_bsize * igeo->blocks_per_cluster,
> -				XBF_UNMAPPED, &bp);
> +				mp->m_bsize * igeo->blocks_per_cluster, 0, &bp);
>  		if (error)
>  			return error;
>  
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/9] xfs: buffer items don't straddle pages anymore
  2024-03-18 22:45 ` [PATCH 5/9] xfs: buffer items don't straddle pages anymore Dave Chinner
  2024-03-19  6:56   ` Christoph Hellwig
@ 2024-03-19 17:31   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:56AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unmapped buffers don't exist anymore, so the page straddling
> detection and slow path code can go away now.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Oh good, I was reading through this code just yesterday for other
reasons and now I'll just stop. :)

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 124 ------------------------------------------
>  1 file changed, 124 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 7b66d3fe4ecd..cbc06605d31c 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -56,31 +56,6 @@ xfs_buf_log_format_size(
>  			(blfp->blf_map_size * sizeof(blfp->blf_data_map[0]));
>  }
>  
> -/*
> - * We only have to worry about discontiguous buffer range straddling on unmapped
> - * buffers. Everything else will have a contiguous data region we can copy from.
> - */
> -static inline bool
> -xfs_buf_item_straddle(
> -	struct xfs_buf		*bp,
> -	uint			offset,
> -	int			first_bit,
> -	int			nbits)
> -{
> -	void			*first, *last;
> -
> -	if (bp->b_folio_count == 1)
> -		return false;
> -
> -	first = xfs_buf_offset(bp, offset + (first_bit << XFS_BLF_SHIFT));
> -	last = xfs_buf_offset(bp,
> -			offset + ((first_bit + nbits) << XFS_BLF_SHIFT));
> -
> -	if (last - first != nbits * XFS_BLF_CHUNK)
> -		return true;
> -	return false;
> -}
> -
>  /*
>   * Return the number of log iovecs and space needed to log the given buf log
>   * item segment.
> @@ -97,11 +72,8 @@ xfs_buf_item_size_segment(
>  	int				*nvecs,
>  	int				*nbytes)
>  {
> -	struct xfs_buf			*bp = bip->bli_buf;
>  	int				first_bit;
>  	int				nbits;
> -	int				next_bit;
> -	int				last_bit;
>  
>  	first_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size, 0);
>  	if (first_bit == -1)
> @@ -114,15 +86,6 @@ xfs_buf_item_size_segment(
>  		nbits = xfs_contig_bits(blfp->blf_data_map,
>  					blfp->blf_map_size, first_bit);
>  		ASSERT(nbits > 0);
> -
> -		/*
> -		 * Straddling a page is rare because we don't log contiguous
> -		 * chunks of unmapped buffers anywhere.
> -		 */
> -		if (nbits > 1 &&
> -		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> -			goto slow_scan;
> -
>  		(*nvecs)++;
>  		*nbytes += nbits * XFS_BLF_CHUNK;
>  
> @@ -137,43 +100,6 @@ xfs_buf_item_size_segment(
>  	} while (first_bit != -1);
>  
>  	return;
> -
> -slow_scan:
> -	ASSERT(bp->b_addr == NULL);
> -	last_bit = first_bit;
> -	nbits = 1;
> -	while (last_bit != -1) {
> -
> -		*nbytes += XFS_BLF_CHUNK;
> -
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					last_bit + 1);
> -		/*
> -		 * If we run out of bits, leave the loop,
> -		 * else if we find a new set of bits bump the number of vecs,
> -		 * else keep scanning the current set of bits.
> -		 */
> -		if (next_bit == -1) {
> -			if (first_bit != last_bit)
> -				(*nvecs)++;
> -			break;
> -		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
> -			last_bit = next_bit;
> -			first_bit = next_bit;
> -			(*nvecs)++;
> -			nbits = 1;
> -		} else {
> -			last_bit++;
> -			nbits++;
> -		}
> -	}
>  }
>  
>  /*
> @@ -286,8 +212,6 @@ xfs_buf_item_format_segment(
>  	struct xfs_buf		*bp = bip->bli_buf;
>  	uint			base_size;
>  	int			first_bit;
> -	int			last_bit;
> -	int			next_bit;
>  	uint			nbits;
>  
>  	/* copy the flags across from the base format item */
> @@ -332,15 +256,6 @@ xfs_buf_item_format_segment(
>  		nbits = xfs_contig_bits(blfp->blf_data_map,
>  					blfp->blf_map_size, first_bit);
>  		ASSERT(nbits > 0);
> -
> -		/*
> -		 * Straddling a page is rare because we don't log contiguous
> -		 * chunks of unmapped buffers anywhere.
> -		 */
> -		if (nbits > 1 &&
> -		    xfs_buf_item_straddle(bp, offset, first_bit, nbits))
> -			goto slow_scan;
> -
>  		xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
>  					first_bit, nbits);
>  		blfp->blf_size++;
> @@ -356,45 +271,6 @@ xfs_buf_item_format_segment(
>  	} while (first_bit != -1);
>  
>  	return;
> -
> -slow_scan:
> -	ASSERT(bp->b_addr == NULL);
> -	last_bit = first_bit;
> -	nbits = 1;
> -	for (;;) {
> -		/*
> -		 * This takes the bit number to start looking from and
> -		 * returns the next set bit from there.  It returns -1
> -		 * if there are no more bits set or the start bit is
> -		 * beyond the end of the bitmap.
> -		 */
> -		next_bit = xfs_next_bit(blfp->blf_data_map, blfp->blf_map_size,
> -					(uint)last_bit + 1);
> -		/*
> -		 * If we run out of bits fill in the last iovec and get out of
> -		 * the loop.  Else if we start a new set of bits then fill in
> -		 * the iovec for the series we were looking at and start
> -		 * counting the bits in the new one.  Else we're still in the
> -		 * same set of bits so just keep counting and scanning.
> -		 */
> -		if (next_bit == -1) {
> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			break;
> -		} else if (next_bit != last_bit + 1 ||
> -		           xfs_buf_item_straddle(bp, offset, first_bit, nbits)) {
> -			xfs_buf_item_copy_iovec(lv, vecp, bp, offset,
> -						first_bit, nbits);
> -			blfp->blf_size++;
> -			first_bit = next_bit;
> -			last_bit = next_bit;
> -			nbits = 1;
> -		} else {
> -			last_bit++;
> -			nbits++;
> -		}
> -	}
>  }
>  
>  /*
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios
  2024-03-18 22:45 ` [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios Dave Chinner
@ 2024-03-19 17:34   ` Darrick J. Wong
  2024-03-19 21:32     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:57AM +1100, Dave Chinner wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> With the concept of unmapped buffer gone, there is no reason to not
> vmap the buffer pages directly in xfs_buf_alloc_folios.

"..no reason to not map the buffer pages..."?

I say that mostly because the first dumb thought I had was "wait, we're
going to vm_map_ram everything now??" which of course is not true.

> [dchinner: port to folio-enabled buffers.]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

With that changed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c | 37 ++++++-------------------------------
>  1 file changed, 6 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 2cd3671f3ce3..a77e2d8c8107 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -469,18 +469,7 @@ xfs_buf_alloc_folios(
>  		XFS_STATS_INC(bp->b_mount, xb_page_retries);
>  		memalloc_retry_wait(gfp_mask);
>  	}
> -	return 0;
> -}
>  
> -/*
> - *	Map buffer into kernel address-space if necessary.
> - */
> -STATIC int
> -_xfs_buf_map_folios(
> -	struct xfs_buf		*bp,
> -	xfs_buf_flags_t		flags)
> -{
> -	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
>  		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
> @@ -513,8 +502,13 @@ _xfs_buf_map_folios(
>  		} while (retried++ <= 1);
>  		memalloc_nofs_restore(nofs_flag);
>  
> -		if (!bp->b_addr)
> +		if (!bp->b_addr) {
> +			xfs_warn_ratelimited(bp->b_mount,
> +				"%s: failed to map %u folios", __func__,
> +				bp->b_folio_count);
> +			xfs_buf_free_folios(bp);
>  			return -ENOMEM;
> +		}
>  	}
>  
>  	return 0;
> @@ -816,18 +810,6 @@ xfs_buf_get_map(
>  			xfs_perag_put(pag);
>  	}
>  
> -	/* We do not hold a perag reference anymore. */
> -	if (!bp->b_addr) {
> -		error = _xfs_buf_map_folios(bp, flags);
> -		if (unlikely(error)) {
> -			xfs_warn_ratelimited(btp->bt_mount,
> -				"%s: failed to map %u folios", __func__,
> -				bp->b_folio_count);
> -			xfs_buf_relse(bp);
> -			return error;
> -		}
> -	}
> -
>  	/*
>  	 * Clear b_error if this is a lookup from a caller that doesn't expect
>  	 * valid data to be found in the buffer.
> @@ -1068,13 +1050,6 @@ xfs_buf_get_uncached(
>  	if (error)
>  		goto fail_free_buf;
>  
> -	error = _xfs_buf_map_folios(bp, 0);
> -	if (unlikely(error)) {
> -		xfs_warn(target->bt_mount,
> -			"%s: failed to map folios", __func__);
> -		goto fail_free_buf;
> -	}
> -
>  	trace_xfs_buf_get_uncached(bp, _RET_IP_);
>  	*bpp = bp;
>  	return 0;
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/9] xfs: walk b_addr for buffer I/O
  2024-03-18 22:45 ` [PATCH 7/9] xfs: walk b_addr for buffer I/O Dave Chinner
@ 2024-03-19 17:42   ` Darrick J. Wong
  2024-03-19 21:33     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:58AM +1100, Dave Chinner wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Instead of walking the folio array just walk the kernel virtual
> address in ->b_addr.  This prepares for using vmalloc for buffers
> and removing the b_folio array.
> 
> [dchinner: ported to folios-based buffers.]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c | 110 +++++++++++++----------------------------------
>  fs/xfs/xfs_buf.h |   2 -
>  2 files changed, 29 insertions(+), 83 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index a77e2d8c8107..303945554415 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -358,7 +358,6 @@ xfs_buf_alloc_kmem(
>  	}
>  	bp->b_folios = bp->b_folio_array;
>  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> -	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
> @@ -1549,87 +1548,44 @@ xfs_buf_bio_end_io(
>  static void
>  xfs_buf_ioapply_map(
>  	struct xfs_buf	*bp,
> -	int		map,
> -	int		*buf_offset,
> -	int		*count,
> +	unsigned int	map,

I like making these never-negative quantities unsigned.

> +	unsigned int	*buf_offset,
>  	blk_opf_t	op)
>  {
> -	int		folio_index;
> -	unsigned int	total_nr_folios = bp->b_folio_count;
> -	int		nr_folios;
>  	struct bio	*bio;
> -	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
> -	int		offset;
>  
> -	/*
> -	 * If the start offset if larger than a single page, we need to be
> -	 * careful. We might have a high order folio, in which case the indexing
> -	 * is from the start of the buffer. However, if we have more than one
> -	 * folio single page folio in the buffer, we need to skip the folios in
> -	 * the buffer before the start offset.
> -	 */
> -	folio_index = 0;
> -	offset = *buf_offset;
> -	if (bp->b_folio_count > 1) {
> -		while (offset >= PAGE_SIZE) {
> -			folio_index++;
> -			offset -= PAGE_SIZE;
> -		}
> +	/* Limit the IO size to the length of the current vector. */
> +	size = min_t(unsigned int, BBTOB(bp->b_maps[map].bm_len),
> +			BBTOB(bp->b_length) - *buf_offset);
> +
> +	if (WARN_ON_ONCE(bp->b_folio_count > BIO_MAX_VECS)) {
> +		xfs_buf_ioerror(bp, -EIO);
> +		return;
>  	}
>  
> -	/*
> -	 * Limit the IO size to the length of the current vector, and update the
> -	 * remaining IO count for the next time around.
> -	 */
> -	size = min_t(int, BBTOB(bp->b_maps[map].bm_len), *count);
> -	*count -= size;
> -	*buf_offset += size;
> -
> -next_chunk:
>  	atomic_inc(&bp->b_io_remaining);
> -	nr_folios = bio_max_segs(total_nr_folios);
>  
> -	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
> -	bio->bi_iter.bi_sector = sector;
> +	bio = bio_alloc(bp->b_target->bt_bdev, bp->b_folio_count, op, GFP_NOIO);
> +	bio->bi_iter.bi_sector = bp->b_maps[map].bm_bn;
>  	bio->bi_end_io = xfs_buf_bio_end_io;
>  	bio->bi_private = bp;
>  
> -	for (; size && nr_folios; nr_folios--, folio_index++) {
> -		struct folio	*folio = bp->b_folios[folio_index];
> -		int		nbytes = folio_size(folio) - offset;
> -
> -		if (nbytes > size)
> -			nbytes = size;
> -
> -		if (!bio_add_folio(bio, folio, nbytes,
> -				offset_in_folio(folio, offset)))
> -			break;
> -
> -		offset = 0;
> -		sector += BTOBB(nbytes);
> -		size -= nbytes;
> -		total_nr_folios--;
> -	}
> -
> -	if (likely(bio->bi_iter.bi_size)) {
> -		if (xfs_buf_is_vmapped(bp)) {
> -			flush_kernel_vmap_range(bp->b_addr,
> -						xfs_buf_vmap_len(bp));
> -		}
> -		submit_bio(bio);
> -		if (size)
> -			goto next_chunk;
> -	} else {
> -		/*
> -		 * This is guaranteed not to be the last io reference count
> -		 * because the caller (xfs_buf_submit) holds a count itself.
> -		 */
> -		atomic_dec(&bp->b_io_remaining);
> -		xfs_buf_ioerror(bp, -EIO);
> -		bio_put(bio);
> -	}
> -
> +	do {
> +		void		*data = bp->b_addr + *buf_offset;
> +		struct folio	*folio = kmem_to_folio(data);
> +		unsigned int	off = offset_in_folio(folio, data);
> +		unsigned int	len = min_t(unsigned int, size,
> +						folio_size(folio) - off);
> +
> +		bio_add_folio_nofail(bio, folio, len, off);
> +		size -= len;
> +		*buf_offset += len;
> +	} while (size);
> +
> +	if (xfs_buf_is_vmapped(bp))
> +		flush_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp));
> +	submit_bio(bio);
>  }
>  
>  STATIC void
> @@ -1638,8 +1594,7 @@ _xfs_buf_ioapply(
>  {
>  	struct blk_plug	plug;
>  	blk_opf_t	op;
> -	int		offset;
> -	int		size;
> +	unsigned int	offset = 0;
>  	int		i;
>  
>  	/*
> @@ -1701,16 +1656,9 @@ _xfs_buf_ioapply(
>  	 * _xfs_buf_ioapply_vec() will modify them appropriately for each
>  	 * subsequent call.
>  	 */
> -	offset = bp->b_offset;

Huh.  So ... where does b_offset come into play here?

OH.  Since we're starting with b_addr and working our way /back/ to
folios, we don't need b_offset anymore since we can compute that from
(b_addr - folio_address()).  So then the @offset variable in
_xfs_buf_ioapply is really a cursor into how far into the xfs_buf we've
ioapply'd.
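
A rough sketch of that computation, as I read the loop above (the
helper names are the ones the patch already uses; this is only an
illustration, not the patch itself):

	void		*data  = bp->b_addr + offset;	/* offset = bytes into the xfs_buf */
	struct folio	*folio = kmem_to_folio(data);
	unsigned int	off    = offset_in_folio(folio, data);

	/* 'off' is what b_offset used to record, so the field can go away */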

Would you mind adding a sentence to the commit message?

"Instead of walking the folio array just walk the kernel virtual address
in ->b_addr.  This prepares for using vmalloc for buffers and removing
the b_folio array.  Furthermore, b_offset goes away since we can compute
that from b_addr and the folio."

With that changed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> -	size = BBTOB(bp->b_length);
>  	blk_start_plug(&plug);
> -	for (i = 0; i < bp->b_map_count; i++) {
> -		xfs_buf_ioapply_map(bp, i, &offset, &size, op);
> -		if (bp->b_error)
> -			break;
> -		if (size <= 0)
> -			break;	/* all done */
> -	}
> +	for (i = 0; i < bp->b_map_count; i++)
> +		xfs_buf_ioapply_map(bp, i, &offset, op);
>  	blk_finish_plug(&plug);
>  }
>  
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index aef7015cf9f3..4d515407713b 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -198,8 +198,6 @@ struct xfs_buf {
>  	atomic_t		b_pin_count;	/* pin count */
>  	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
>  	unsigned int		b_folio_count;	/* size of folio array */
> -	unsigned int		b_offset;	/* page offset of b_addr,
> -						   only for _XBF_KMEM buffers */
>  	int			b_error;	/* error code on I/O */
>  
>  	/*
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/9] xfs: use vmalloc for multi-folio buffers
  2024-03-18 22:45 ` [PATCH 8/9] xfs: use vmalloc for multi-folio buffers Dave Chinner
@ 2024-03-19 17:48   ` Darrick J. Wong
  2024-03-20  0:20     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 09:45:59AM +1100, Dave Chinner wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Instead of allocating the folios manually using the bulk page
> allocator and then using vm_map_page just use vmalloc to allocate
> the entire buffer - vmalloc will use the bulk allocator internally
> if it fits.
> 
> With this the b_folios array can go away as well as nothing uses it.
> 
> [dchinner: port to folio based buffers.]
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c     | 164 ++++++++++++-------------------------------
>  fs/xfs/xfs_buf.h     |   2 -
>  fs/xfs/xfs_buf_mem.c |   9 +--
>  3 files changed, 45 insertions(+), 130 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 303945554415..6d6bad80722e 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -282,29 +282,6 @@ _xfs_buf_alloc(
>  	return 0;
>  }
>  
> -static void
> -xfs_buf_free_folios(
> -	struct xfs_buf	*bp)
> -{
> -	uint		i;
> -
> -	ASSERT(bp->b_flags & _XBF_FOLIOS);
> -
> -	if (xfs_buf_is_vmapped(bp))
> -		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
> -
> -	for (i = 0; i < bp->b_folio_count; i++) {
> -		if (bp->b_folios[i])
> -			__folio_put(bp->b_folios[i]);
> -	}
> -	mm_account_reclaimed_pages(bp->b_folio_count);
> -
> -	if (bp->b_folios != bp->b_folio_array)
> -		kfree(bp->b_folios);
> -	bp->b_folios = NULL;
> -	bp->b_flags &= ~_XBF_FOLIOS;
> -}
> -
>  static void
>  xfs_buf_free_callback(
>  	struct callback_head	*cb)
> @@ -323,13 +300,22 @@ xfs_buf_free(
>  
>  	ASSERT(list_empty(&bp->b_lru));
>  
> -	if (xfs_buftarg_is_mem(bp->b_target))
> +	if (xfs_buftarg_is_mem(bp->b_target)) {
>  		xmbuf_unmap_folio(bp);
> -	else if (bp->b_flags & _XBF_FOLIOS)
> -		xfs_buf_free_folios(bp);
> -	else if (bp->b_flags & _XBF_KMEM)
> -		kfree(bp->b_addr);
> +		goto free;
> +	}
>  
> +	if (!(bp->b_flags & _XBF_KMEM))
> +		mm_account_reclaimed_pages(bp->b_folio_count);

Echoing hch's statement about the argument being passed to
mm_account_reclaimed_pages needing to be fed units of base pages, not
folios.

> +
> +	if (bp->b_flags & _XBF_FOLIOS)
> +		__folio_put(kmem_to_folio(bp->b_addr));

Is it necessary to use folio_put instead of the __ version like hch said
earlier?

> +	else
> +		kvfree(bp->b_addr);
> +
> +	bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;

Shouldn't this be:

	bp->b_flags &= ~(_XBF_KMEM | _XBF_FOLIOS); ?

> +
> +free:
>  	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
>  }
>  
> @@ -356,8 +342,6 @@ xfs_buf_alloc_kmem(
>  		bp->b_addr = NULL;
>  		return -ENOMEM;
>  	}
> -	bp->b_folios = bp->b_folio_array;
> -	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_KMEM;
>  	return 0;
> @@ -377,14 +361,15 @@ xfs_buf_alloc_folio(
>  	struct xfs_buf	*bp,
>  	gfp_t		gfp_mask)
>  {
> +	struct folio	*folio;
>  	int		length = BBTOB(bp->b_length);
>  	int		order = get_order(length);
>  
> -	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> -	if (!bp->b_folio_array[0])
> +	folio = folio_alloc(gfp_mask, order);
> +	if (!folio)
>  		return false;
>  
> -	bp->b_folios = bp->b_folio_array;
> +	bp->b_addr = folio_address(folio);
>  	bp->b_folio_count = 1;
>  	bp->b_flags |= _XBF_FOLIOS;
>  	return true;
> @@ -400,15 +385,11 @@ xfs_buf_alloc_folio(
>   * contiguous memory region that we don't have to map and unmap to access the
>   * data directly.
>   *
> - * The second type of buffer is the multi-folio buffer. These are *always* made
> - * up of single page folios so that they can be fed to vmap_ram() to return a
> - * contiguous memory region we can access the data through.
> - *
> - * We don't use high order folios for this second type of buffer (yet) because
> - * having variable size folios makes offset-to-folio indexing and iteration of
> - * the data range more complex than if they are fixed size. This case should now
> - * be the slow path, though, so unless we regularly fail to allocate high order
> - * folios, there should be little need to optimise this path.
> + * The second type of buffer is the vmalloc()d buffer. This provides the buffer
> + * with the required contiguous memory region but backed by discontiguous
> + * physical pages. vmalloc() typically doesn't fail, but it can and so we may
> + * need to wrap the allocation in a loop to prevent low memory failures and
> + * shutdowns.

Where's the loop now?  Is that buried under __vmalloc somewhere?

--D

>   */
>  static int
>  xfs_buf_alloc_folios(
> @@ -416,7 +397,7 @@ xfs_buf_alloc_folios(
>  	xfs_buf_flags_t	flags)
>  {
>  	gfp_t		gfp_mask = GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOWARN;
> -	long		filled = 0;
> +	unsigned	nofs_flag;
>  
>  	if (flags & XBF_READ_AHEAD)
>  		gfp_mask |= __GFP_NORETRY;
> @@ -425,89 +406,32 @@ xfs_buf_alloc_folios(
>  	if (!(flags & XBF_READ))
>  		gfp_mask |= __GFP_ZERO;
>  
> -	/* Optimistically attempt a single high order folio allocation. */
> -	if (xfs_buf_alloc_folio(bp, gfp_mask))
> -		return 0;
> -
>  	/* Fall back to allocating an array of single page folios. */
>  	bp->b_folio_count = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);
> -	if (bp->b_folio_count <= XB_FOLIOS) {
> -		bp->b_folios = bp->b_folio_array;
> -	} else {
> -		bp->b_folios = kzalloc(sizeof(struct folio *) * bp->b_folio_count,
> -					gfp_mask);
> -		if (!bp->b_folios)
> -			return -ENOMEM;
> -	}
> -	bp->b_flags |= _XBF_FOLIOS;
>  
> +	/* Optimistically attempt a single high order folio allocation. */
> +	if (xfs_buf_alloc_folio(bp, gfp_mask))
> +		return 0;
> +
> +	/* We are done if an order-0 allocation has already failed. */
> +	if (bp->b_folio_count == 1)
> +		return -ENOMEM;
>  
>  	/*
> -	 * Bulk filling of pages can take multiple calls. Not filling the entire
> -	 * array is not an allocation failure, so don't back off if we get at
> -	 * least one extra page.
> +	 * XXX(dgc): I think dquot reclaim is the only place we can get
> +	 * to this function from memory reclaim context now. If we fix
> +	 * that like we've fixed inode reclaim to avoid writeback from
> +	 * reclaim, this nofs wrapping can go away.
>  	 */
> -	for (;;) {
> -		long	last = filled;
> -
> -		filled = alloc_pages_bulk_array(gfp_mask, bp->b_folio_count,
> -						(struct page **)bp->b_folios);
> -		if (filled == bp->b_folio_count) {
> -			XFS_STATS_INC(bp->b_mount, xb_page_found);
> -			break;
> -		}
> -
> -		if (filled != last)
> -			continue;
> -
> -		if (flags & XBF_READ_AHEAD) {
> -			xfs_buf_free_folios(bp);
> -			return -ENOMEM;
> -		}
> -
> -		XFS_STATS_INC(bp->b_mount, xb_page_retries);
> -		memalloc_retry_wait(gfp_mask);
> -	}
> -
> -	if (bp->b_folio_count == 1) {
> -		/* A single folio buffer is always mappable */
> -		bp->b_addr = folio_address(bp->b_folios[0]);
> -	} else {
> -		int retried = 0;
> -		unsigned nofs_flag;
> -
> -		/*
> -		 * vm_map_ram() will allocate auxiliary structures (e.g.
> -		 * pagetables) with GFP_KERNEL, yet we often under a scoped nofs
> -		 * context here. Mixing GFP_KERNEL with GFP_NOFS allocations
> -		 * from the same call site that can be run from both above and
> -		 * below memory reclaim causes lockdep false positives. Hence we
> -		 * always need to force this allocation to nofs context because
> -		 * we can't pass __GFP_NOLOCKDEP down to auxillary structures to
> -		 * prevent false positive lockdep reports.
> -		 *
> -		 * XXX(dgc): I think dquot reclaim is the only place we can get
> -		 * to this function from memory reclaim context now. If we fix
> -		 * that like we've fixed inode reclaim to avoid writeback from
> -		 * reclaim, this nofs wrapping can go away.
> -		 */
> -		nofs_flag = memalloc_nofs_save();
> -		do {
> -			bp->b_addr = vm_map_ram((struct page **)bp->b_folios,
> -					bp->b_folio_count, -1);
> -			if (bp->b_addr)
> -				break;
> -			vm_unmap_aliases();
> -		} while (retried++ <= 1);
> -		memalloc_nofs_restore(nofs_flag);
> -
> -		if (!bp->b_addr) {
> -			xfs_warn_ratelimited(bp->b_mount,
> -				"%s: failed to map %u folios", __func__,
> -				bp->b_folio_count);
> -			xfs_buf_free_folios(bp);
> -			return -ENOMEM;
> -		}
> +	nofs_flag = memalloc_nofs_save();
> +	bp->b_addr = __vmalloc(BBTOB(bp->b_length), gfp_mask);
> +	memalloc_nofs_restore(nofs_flag);
> +
> +	if (!bp->b_addr) {
> +		xfs_warn_ratelimited(bp->b_mount,
> +			"%s: failed to allocate %u folios", __func__,
> +			bp->b_folio_count);
> +		return -ENOMEM;
>  	}
>  
>  	return 0;
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 4d515407713b..68c24947ca1a 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -190,8 +190,6 @@ struct xfs_buf {
>  	struct xfs_buf_log_item	*b_log_item;
>  	struct list_head	b_li_list;	/* Log items list head */
>  	struct xfs_trans	*b_transp;
> -	struct folio		**b_folios;	/* array of folio pointers */
> -	struct folio		*b_folio_array[XB_FOLIOS]; /* inline folios */
>  	struct xfs_buf_map	*b_maps;	/* compound buffer map */
>  	struct xfs_buf_map	__b_map;	/* inline compound buffer map */
>  	int			b_map_count;
> diff --git a/fs/xfs/xfs_buf_mem.c b/fs/xfs/xfs_buf_mem.c
> index 26734c64c10e..336e7c8effb7 100644
> --- a/fs/xfs/xfs_buf_mem.c
> +++ b/fs/xfs/xfs_buf_mem.c
> @@ -169,8 +169,6 @@ xmbuf_map_folio(
>  	unlock_page(page);
>  
>  	bp->b_addr = page_address(page);
> -	bp->b_folios = bp->b_folio_array;
> -	bp->b_folios[0] = folio;
>  	bp->b_folio_count = 1;
>  	return 0;
>  }
> @@ -180,15 +178,10 @@ void
>  xmbuf_unmap_folio(
>  	struct xfs_buf		*bp)
>  {
> -	struct folio		*folio = bp->b_folios[0];
> -
>  	ASSERT(xfs_buftarg_is_mem(bp->b_target));
>  
> -	folio_put(folio);
> -
> +	folio_put(kmem_to_folio(bp->b_addr));
>  	bp->b_addr = NULL;
> -	bp->b_folios[0] = NULL;
> -	bp->b_folios = NULL;
>  	bp->b_folio_count = 0;
>  }
>  
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 17:29   ` Darrick J. Wong
@ 2024-03-19 21:32     ` Christoph Hellwig
  2024-03-19 21:38       ` Darrick J. Wong
  2024-03-19 21:55     ` Dave Chinner
  1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19 21:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> So.... does that mean a 128K folio for a 68k xattr remote value buffer?

I though 64k was the maximum xattr size?

> I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> tree, which isn't awesome.

Maybe that's a question for the fsverity thread, but how do we end
up with these weird sizes?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios
  2024-03-19 17:34   ` Darrick J. Wong
@ 2024-03-19 21:32     ` Christoph Hellwig
  2024-03-19 21:39       ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19 21:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 10:34:13AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:57AM +1100, Dave Chinner wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > With the concept of unmapped buffer gone, there is no reason to not
> > vmap the buffer pages directly in xfs_buf_alloc_folios.
> 
> "..no reason to not map the buffer pages..."?

or maybe 

"..no good reason to not map the buffer pages..."


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/9] xfs: walk b_addr for buffer I/O
  2024-03-19 17:42   ` Darrick J. Wong
@ 2024-03-19 21:33     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19 21:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

> Huh.  So ... where does b_offset come into play here?
> 
> OH.  Since we're starting with b_addr and working our way /back/ to
> folios, we don't need b_offset anymore since we can compute that from
> (b_addr - folio_address()).  So then the @offset variable in
> _xfs_buf_ioapply is really a cursor into how far into the xfs_buf we've
> ioapply'd.

Yes.

> Would you mind adding a sentence to the commit message?
> 
> "Instead of walking the folio array just walk the kernel virtual address
> in ->b_addr.  This prepares for using vmalloc for buffers and removing
> the b_folio array.  Furthermore, b_offset goes away since we can compute
> that from b_addr and the folio."

Fine with me.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 21:32     ` Christoph Hellwig
@ 2024-03-19 21:38       ` Darrick J. Wong
  2024-03-19 21:41         ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 21:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 02:32:04PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> > So.... does that mean a 128K folio for a 68k xattr remote value buffer?
> 
> I though 64k was the maximum xattr size?

64k is the maximum xattr value size, yes.  But remote xattr value blocks
now have block headers complete with owner/uuid/magic/etc.  Each block
can only store $blksz-56 bytes now.  Hence that 64k value needs
ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.

(On a 64k fsb filesystem, that bloats up to 128k.)

Luckily we bwrite remote blocks to disk directly, which is why we've
never blown out the buffer log item bitmap.

> > I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> > tree, which isn't awesome.
> 
> Maybe that's a question for the fsverity thread, but how do we end
> up with these weird sizes?

Same reason.  Merkle tree blocks are $blksz by default, but storing a
4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
that second block is wasted.
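
Spelling both cases out (56 bytes is the remote block header size noted
above; this is just the arithmetic, nothing new):

	/* usable payload per remote xattr block on a 4k fsb filesystem */
	unsigned int usable  = 4096 - 56;			/* 4040 */

	unsigned int nremote = DIV_ROUND_UP(65536, usable);	/* 17 blocks for a 64k value */
	unsigned int nmerkle = DIV_ROUND_UP(4096, usable);	/* 2 blocks for a 4k merkle blob */

	/* on a 64k fsb: usable = 65480, so a 64k value still needs
	 * 2 blocks == 128k of space */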

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios
  2024-03-19 21:32     ` Christoph Hellwig
@ 2024-03-19 21:39       ` Darrick J. Wong
  2024-03-19 21:41         ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-19 21:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 02:32:48PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 10:34:13AM -0700, Darrick J. Wong wrote:
> > On Tue, Mar 19, 2024 at 09:45:57AM +1100, Dave Chinner wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > > 
> > > With the concept of unmapped buffer gone, there is no reason to not
> > > vmap the buffer pages directly in xfs_buf_alloc_folios.
> > 
> > "..no reason to not map the buffer pages..."?
> 
> or maybe 
> 
> "..no good reason to not map the buffer pages..."

We might as well fix the split infinitive as well:

"...no good reason not to map the buffer pages..."

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 21:38       ` Darrick J. Wong
@ 2024-03-19 21:41         ` Christoph Hellwig
  2024-03-19 22:23           ` Dave Chinner
  2024-03-21  2:12           ` Darrick J. Wong
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19 21:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> now have block headers complete with owner/uuid/magic/etc.  Each block
> can only store $blksz-56 bytes now.  Hence that 64k value needs
> ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.

Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
add headers.  That almost asks for using vmalloc if the size is just
above a power of two.

> Same reason.  Merkle tree blocks are $blksz by default, but storing a
> 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> that second block is wasted.

*sad panda face*.  We should be able to come up with something
better than that..


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios
  2024-03-19 21:39       ` Darrick J. Wong
@ 2024-03-19 21:41         ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-19 21:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 02:39:19PM -0700, Darrick J. Wong wrote:
> We might as well fix the split infinitive as well:
> 
> "...no good reason not to map the buffer pages..."

fine.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/9] xfs: use folios in the buffer cache
  2024-03-19  6:53   ` Christoph Hellwig
  2024-03-19 21:42     ` Dave Chinner
@ 2024-03-19 21:42     ` Dave Chinner
  1 sibling, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19 21:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Mon, Mar 18, 2024 at 11:53:58PM -0700, Christoph Hellwig wrote:
> So while this looks good to me,
> 
> > +	for (i = 0; i < bp->b_folio_count; i++) {
> > +		if (bp->b_folios[i])
> > +			__folio_put(bp->b_folios[i]);
> 
> The __folio_put here really needs to be folio_put or page alloc
> debugging gets very unhappy.

*nod*

I can't remember why I used that in the first place...

> But even with that fixed on top of this patch the first mount just hangs
> without a useful kernel backtrace in /proc/*/stack, although running
> with the entire ѕeries applied it does pass the basic sanity checking
> so far.

Huh. It worked before I folded in your patches to clean everything
up; I don't tend to test individual patches if the whole series
works. I guess I screwed something up somewhere and then fixed it
later - I'll sort that out.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 17:29   ` Darrick J. Wong
  2024-03-19 21:32     ` Christoph Hellwig
@ 2024-03-19 21:55     ` Dave Chinner
  1 sibling, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19 21:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 10:29:09AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:54AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have the buffer cache using the folio API, we can extend
> > the use of folios to allocate high order folios for multi-page
> > buffers rather than an array of single pages that are then vmapped
> > into a contiguous range.
> > 
> > This creates two types of buffers: single folio buffers that can
> > have arbitrary order, and multi-folio buffers made up of many single
> > page folios that get vmapped. The latter is essentially the existing
> > code, so there are no logic changes to handle this case.
> > 
> > There are a few places where we iterate the folios on a buffer.
> > These need to be converted to handle the high order folio case.
> > Luckily, this only occurs when bp->b_folio_count == 1, and the code
> > for handling this case is just a simple application of the folio API
> > to the operations that need to be performed.
> > 
> > The code that allocates buffers will optimistically attempt a high
> > order folio allocation as a fast path. If this high order allocation
> > fails, then we fall back to the existing multi-folio allocation
> > code. This now forms the slow allocation path, and hopefully will be
> > largely unused in normal conditions.
> > 
> > This should improve performance of large buffer operations (e.g.
> > large directory block sizes) as we should now mostly avoid the
> > expense of vmapping large buffers (and the vmap lock contention that
> > can occur) as well as avoid the runtime pressure that frequently
> > accessing kernel vmapped pages put on the TLBs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c | 141 ++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 110 insertions(+), 31 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 832226385154..7d9303497763 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -80,6 +80,10 @@ xfs_buf_is_vmapped(
> >  	return bp->b_addr && bp->b_folio_count > 1;
> >  }
> >  
> > +/*
> > + * See comment above xfs_buf_alloc_folios() about the constraints placed on
> > + * allocating vmapped buffers.
> > + */
> >  static inline int
> >  xfs_buf_vmap_len(
> >  	struct xfs_buf	*bp)
> > @@ -352,14 +356,63 @@ xfs_buf_alloc_kmem(
> >  		bp->b_addr = NULL;
> >  		return -ENOMEM;
> >  	}
> > -	bp->b_offset = offset_in_page(bp->b_addr);
> >  	bp->b_folios = bp->b_folio_array;
> >  	bp->b_folios[0] = kmem_to_folio(bp->b_addr);
> > +	bp->b_offset = offset_in_folio(bp->b_folios[0], bp->b_addr);
> >  	bp->b_folio_count = 1;
> >  	bp->b_flags |= _XBF_KMEM;
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Allocating a high order folio makes the assumption that buffers are a
> > + * power-of-2 size so that ilog2() returns the exact order needed to fit
> > + * the contents of the buffer. Buffer lengths are mostly a power of two,
> > + * so this is not an unreasonable approach to take by default.
> > + *
> > + * The exception here are user xattr data buffers, which can be arbitrarily
> > + * sized up to 64kB plus structure metadata. In that case, round up the order.
> 
> So.... does that mean a 128K folio for a 68k xattr remote value buffer?

Yes. I figured that >64kB metadata is a rare corner case, and so
the extra memory usage isn't that big a deal. If it does ever become
a problem, we can skip straight to vmalloc for such buffers...
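
For the 68k example, the rounding looks like this (assuming 4k pages;
illustration only):

	int	length = 68 * 1024;		/* 68k remote xattr buffer */
	int	order  = get_order(length);	/* 17 pages -> order 5 */

	folio_alloc(gfp_mask, order);		/* hands back a 128k folio */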

> I've been noticing the 4k merkle tree blobs consume 2 fsb in the xattr
> tree, which isn't awesome.

Yup, that's because we have headers in remote xattr blocks for
holding identifying information and CRCs.

> I haven't figured out a good way to deal
> with that, since the verity code uses a lot of blkno/byte shifting and
> 2k merkle tree blocks suck.

I really wish that the verity code just asked us to store individual
{key,value} tuples rather than large opaque blocks of fixed size
data. Then we could store everything efficiently as individual
btree records in the attr fork dabtree, and verity could still
implement its page-sized opaque block storage for all the other
filesystems behind that....

> > @@ -1551,28 +1617,28 @@ xfs_buf_ioapply_map(
> >  
> >  next_chunk:
> >  	atomic_inc(&bp->b_io_remaining);
> > -	nr_pages = bio_max_segs(total_nr_pages);
> > +	nr_folios = bio_max_segs(total_nr_folios);
> >  
> > -	bio = bio_alloc(bp->b_target->bt_bdev, nr_pages, op, GFP_NOIO);
> > +	bio = bio_alloc(bp->b_target->bt_bdev, nr_folios, op, GFP_NOIO);
> >  	bio->bi_iter.bi_sector = sector;
> >  	bio->bi_end_io = xfs_buf_bio_end_io;
> >  	bio->bi_private = bp;
> >  
> > -	for (; size && nr_pages; nr_pages--, page_index++) {
> > -		int	rbytes, nbytes = PAGE_SIZE - offset;
> > +	for (; size && nr_folios; nr_folios--, folio_index++) {
> > +		struct folio	*folio = bp->b_folios[folio_index];
> > +		int		nbytes = folio_size(folio) - offset;
> >  
> >  		if (nbytes > size)
> >  			nbytes = size;
> >  
> > -		rbytes = bio_add_folio(bio, bp->b_folios[page_index], nbytes,
> > -				      offset);
> 
> Um, bio_add_folio returns a bool, maybe that's why hch was complaining
> about hangs with only the first few patches applied?

Good spot - entirely possible this is the problem...

> Annoying if the kbuild robots don't catch that.  Either way, with hch's
> replies to this and patch 2 dealt with,
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 21:41         ` Christoph Hellwig
@ 2024-03-19 22:23           ` Dave Chinner
  2024-03-21  2:12           ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19 22:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > now have block headers complete with owner/uuid/magic/etc.  Each block
> > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> 
> Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> add headers.

We needed CRCs for them, and they can be discontiguous so we also
needed self-identifying information so xattrs could be reconstructed
from the header information.

If I was doing the v5 stuff again, I would have put the
information in the remote xattr name structure held in the dabtree
record, not the actual xattr data extent. But hindsight is 20:20....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/9] xfs: kill XBF_UNMAPPED
  2024-03-19 17:30   ` Darrick J. Wong
@ 2024-03-19 23:36     ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19 23:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 10:30:46AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:55AM +1100, Dave Chinner wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Unmapped buffer access is a pain, so kill it. The switch to large
> > folios means we rarely pay a vmap penalty for large buffers,
> > so this functionality is largely unnecessary now.
> 
> What was the original point of unmapped buffers?  Was it optimizing for
> not using vmalloc space for inode buffers on 32-bit machines?

Largely, yes. This is original XFS-on-Irix design details from the
early 1990s.

The Irix buffer cache could vmap multi-page buffers on demand, but
it could also ask the page allocator/page cache to use use 16kB or
64kB large pages for the buffers to avoid needing vmap. The Irix mm
subsystem was multi-page size aware even on 32 bit machines
(heterogenous, simultaneous use of 4kB, 16kB, 64kB, 256kB, 2MB, and
16MB page sizes within the page cache and the buffer cache was
supported). However, the kernel had a limited vmap pool size even
on 64 bit machines. Hence buffers tended to be mapped and unmapped
at access time similar to how we use kmap_*() on Linux.

As a result, base page size allocation was still always preferred,
so in the cases where vmap or larger pages were not actually
needed they were avoided via unmapped buffers.

It's definitely a positive for us that the linux mm is nearing
functional parity with Irix from the mid 1990s. It means we can
slowly remove the compromises the Linux XFS port had to make to
work on Linux.  If we can now just get GFP_NOFAIL as the
official memory allocator policy....

> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 9/9] xfs: rename bp->b_folio_count
  2024-03-19  7:37   ` Christoph Hellwig
@ 2024-03-19 23:59     ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-19 23:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 12:37:09AM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 09:46:00AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The count is used purely to allocate the correct number of bvecs for
> > submitting IO. Rename it to b_bvec_count.
> 
> Well, I think we should just kill it as it is simply the rounded
> up length in PAGE_SIZE units.  The patch below passes a quick xfstests
> run and is on top of this series:

This seems like a reasonable approach - fixing the
mm_account_reclaimed_pages() issues earlier in the patch set meant
I'd already done the xfs_buf_free() changes you made here. :)
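
For reference, the relationship being pointed at is just this (a
sketch, not the cleanup patch itself):

	/* number of bvecs/pages needed for IO, derivable on demand */
	unsigned int npages = DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE);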

That just leaves the vmap wrappers to be fixed up - I think I'll
just replace the b_folio_count rename patch with that cleanup...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 8/9] xfs: use vmalloc for multi-folio buffers
  2024-03-19 17:48   ` Darrick J. Wong
@ 2024-03-20  0:20     ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-03-20  0:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Mar 19, 2024 at 10:48:19AM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 09:45:59AM +1100, Dave Chinner wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Instead of allocating the folios manually using the bulk page
> > allocator and then using vm_map_page just use vmalloc to allocate
> > the entire buffer - vmalloc will use the bulk allocator internally
> > if it fits.
> > 
> > With this the b_folios array can go away as well as nothing uses it.
> > 
> > [dchinner: port to folio based buffers.]
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c     | 164 ++++++++++++-------------------------------
> >  fs/xfs/xfs_buf.h     |   2 -
> >  fs/xfs/xfs_buf_mem.c |   9 +--
> >  3 files changed, 45 insertions(+), 130 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 303945554415..6d6bad80722e 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -282,29 +282,6 @@ _xfs_buf_alloc(
> >  	return 0;
> >  }
> >  
> > -static void
> > -xfs_buf_free_folios(
> > -	struct xfs_buf	*bp)
> > -{
> > -	uint		i;
> > -
> > -	ASSERT(bp->b_flags & _XBF_FOLIOS);
> > -
> > -	if (xfs_buf_is_vmapped(bp))
> > -		vm_unmap_ram(bp->b_addr, bp->b_folio_count);
> > -
> > -	for (i = 0; i < bp->b_folio_count; i++) {
> > -		if (bp->b_folios[i])
> > -			__folio_put(bp->b_folios[i]);
> > -	}
> > -	mm_account_reclaimed_pages(bp->b_folio_count);
> > -
> > -	if (bp->b_folios != bp->b_folio_array)
> > -		kfree(bp->b_folios);
> > -	bp->b_folios = NULL;
> > -	bp->b_flags &= ~_XBF_FOLIOS;
> > -}
> > -
> >  static void
> >  xfs_buf_free_callback(
> >  	struct callback_head	*cb)
> > @@ -323,13 +300,22 @@ xfs_buf_free(
> >  
> >  	ASSERT(list_empty(&bp->b_lru));
> >  
> > -	if (xfs_buftarg_is_mem(bp->b_target))
> > +	if (xfs_buftarg_is_mem(bp->b_target)) {
> >  		xmbuf_unmap_folio(bp);
> > -	else if (bp->b_flags & _XBF_FOLIOS)
> > -		xfs_buf_free_folios(bp);
> > -	else if (bp->b_flags & _XBF_KMEM)
> > -		kfree(bp->b_addr);
> > +		goto free;
> > +	}
> >  
> > +	if (!(bp->b_flags & _XBF_KMEM))
> > +		mm_account_reclaimed_pages(bp->b_folio_count);
> 
> Echoing hch's statement about the argument being passed to
> mm_account_reclaimed_pages needing to be fed units of base pages, not
> folios.
> 
> > +
> > +	if (bp->b_flags & _XBF_FOLIOS)
> > +		__folio_put(kmem_to_folio(bp->b_addr));
> 
> Is it necessary to use folio_put instead of the __ version like hch said
> earlier?

Both fixed.
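
For the record, the two fixes are roughly this (illustrative sketch,
not the updated patch itself):

	/* account in units of base pages, not folios */
	if (!(bp->b_flags & _XBF_KMEM))
		mm_account_reclaimed_pages(
			DIV_ROUND_UP(BBTOB(bp->b_length), PAGE_SIZE));

	/* plain folio_put() is sufficient here */
	if (bp->b_flags & _XBF_FOLIOS)
		folio_put(kmem_to_folio(bp->b_addr));
	else
		kvfree(bp->b_addr);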

> 
> > +	else
> > +		kvfree(bp->b_addr);
> > +
> > +	bp->b_flags &= _XBF_KMEM | _XBF_FOLIOS;
> 
> Shouldn't this be:
> 
> 	bp->b_flags &= ~(_XBF_KMEM | _XBF_FOLIOS); ?

Yes. Good catch.

> > @@ -377,14 +361,15 @@ xfs_buf_alloc_folio(
> >  	struct xfs_buf	*bp,
> >  	gfp_t		gfp_mask)
> >  {
> > +	struct folio	*folio;
> >  	int		length = BBTOB(bp->b_length);
> >  	int		order = get_order(length);
> >  
> > -	bp->b_folio_array[0] = folio_alloc(gfp_mask, order);
> > -	if (!bp->b_folio_array[0])
> > +	folio = folio_alloc(gfp_mask, order);
> > +	if (!folio)
> >  		return false;
> >  
> > -	bp->b_folios = bp->b_folio_array;
> > +	bp->b_addr = folio_address(folio);
> >  	bp->b_folio_count = 1;
> >  	bp->b_flags |= _XBF_FOLIOS;
> >  	return true;
> > @@ -400,15 +385,11 @@ xfs_buf_alloc_folio(
> >   * contiguous memory region that we don't have to map and unmap to access the
> >   * data directly.
> >   *
> > - * The second type of buffer is the multi-folio buffer. These are *always* made
> > - * up of single page folios so that they can be fed to vmap_ram() to return a
> > - * contiguous memory region we can access the data through.
> > - *
> > - * We don't use high order folios for this second type of buffer (yet) because
> > - * having variable size folios makes offset-to-folio indexing and iteration of
> > - * the data range more complex than if they are fixed size. This case should now
> > - * be the slow path, though, so unless we regularly fail to allocate high order
> > - * folios, there should be little need to optimise this path.
> > + * The second type of buffer is the vmalloc()d buffer. This provides the buffer
> > + * with the required contiguous memory region but backed by discontiguous
> > + * physical pages. vmalloc() typically doesn't fail, but it can and so we may
> > + * need to wrap the allocation in a loop to prevent low memory failures and
> > + * shutdowns.
> 
> Where's the loop now?  Is that buried under __vmalloc somewhere?

I thought I'd added __GFP_NOFAIL to the __vmalloc() gfp mask to make
it loop. I suspect I lost it at some point when rebasing either this
or the (now merged) kmem.[ch] removal patchset.

Well spotted, I'll fix that up.
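
i.e. something along these lines (a sketch of the intent, not the
final fix):

	/*
	 * __GFP_NOFAIL makes __vmalloc() retry internally, so we don't
	 * need an explicit retry loop around the allocation here.
	 */
	bp->b_addr = __vmalloc(BBTOB(bp->b_length),
			       gfp_mask | __GFP_NOFAIL);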

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-19 21:41         ` Christoph Hellwig
  2024-03-19 22:23           ` Dave Chinner
@ 2024-03-21  2:12           ` Darrick J. Wong
  2024-03-21  2:40             ` Darrick J. Wong
  1 sibling, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-21  2:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > now have block headers complete with owner/uuid/magic/etc.  Each block
> > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> 
> Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> add headers.  That almost asks for using vmalloc if the size is just
> above a power of two.
> 
> > Same reason.  Merkle tree blocks are $blksz by default, but storing a
> > 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> > that second block is wasted.
> 
> *sad panda face*.  We should be able to come up with something
> better than that..

Well xfs_verity.c could just zlib/zstd/whatever compress the contents
and see if that helps.  We only need a ~2% compression to shrink to a
single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
1k blocks.

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-21  2:12           ` Darrick J. Wong
@ 2024-03-21  2:40             ` Darrick J. Wong
  2024-03-21 21:28               ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-21  2:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Wed, Mar 20, 2024 at 07:12:36PM -0700, Darrick J. Wong wrote:
> On Tue, Mar 19, 2024 at 02:41:27PM -0700, Christoph Hellwig wrote:
> > On Tue, Mar 19, 2024 at 02:38:27PM -0700, Darrick J. Wong wrote:
> > > 64k is the maximum xattr value size, yes.  But remote xattr value blocks
> > > now have block headers complete with owner/uuid/magic/etc.  Each block
> > > can only store $blksz-56 bytes now.  Hence that 64k value needs
> > > ceil(65536 / 4040) == 17 blocks on a 4k fsb filesystem.
> > 
> > Uggg, ok.  I thought we'd just treat remote xattrs as data and don't
> > add headers.  That almost asks for using vmalloc if the size is just
> > above a power of two.
> > 
> > > Same reason.  Merkle tree blocks are $blksz by default, but storing a
> > > 4096 blob in the xattrs requires ceil(4096 / 4040) == 2 blocks.  Most of
> > > that second block is wasted.
> > 
> > *sad panda face*.  We should be able to come up with something
> > better than that..
> 
> Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> and see if that helps.  We only need a ~2% compression to shrink to a
> single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> 1k blocks.

...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
first 32 bytes.
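
Conceptually something like the sketch below - purely illustrative,
with made up structure and function names, not an actual on-disk
format proposal. It just shows that i_ino + merkle_pos + uuid
conveniently add up to 32 bytes:

	struct xfs_verity_salt {
		__be64		ino;		/*  8 bytes */
		__be64		merkle_pos;	/*  8 bytes */
		uuid_t		uuid;		/* 16 bytes */
	};

	/* fold the salt into the first 32 bytes of the merkle block */
	static void
	xfs_verity_salt_block(
		void				*block,
		const struct xfs_verity_salt	*salt)
	{
		const u8	*s = (const u8 *)salt;
		u8		*p = block;
		unsigned int	i;

		for (i = 0; i < sizeof(*salt); i++)
			p[i] ^= s[i];
	}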

--D

> --D
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-21  2:40             ` Darrick J. Wong
@ 2024-03-21 21:28               ` Christoph Hellwig
  2024-03-21 21:39                 ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-21 21:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Wed, Mar 20, 2024 at 07:40:05PM -0700, Darrick J. Wong wrote:
> > Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> > and see if that helps.  We only need a ~2% compression to shrink to a
> > single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> > 1k blocks.
> 
> ...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
> tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
> first 32 bytes.

That does sound much better than wasting the space or crazy compression
schemes.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-21 21:28               ` Christoph Hellwig
@ 2024-03-21 21:39                 ` Darrick J. Wong
  2024-03-21 22:02                   ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-03-21 21:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Thu, Mar 21, 2024 at 02:28:23PM -0700, Christoph Hellwig wrote:
> On Wed, Mar 20, 2024 at 07:40:05PM -0700, Darrick J. Wong wrote:
> > > Well xfs_verity.c could just zlib/zstd/whatever compress the contents
> > > and see if that helps.  We only need a ~2% compression to shrink to a
> > > single 4k block, a ~1% reduction for 64k blocks, or a 6% reduction for
> > > 1k blocks.
> > 
> > ...or we just turn off the attr remote header for XFS_ATTR_VERITY merkle
> > tree block values and XOR i_ino, merkle_pos, and the fs uuid into the
> > first 32 bytes.
> 
> That does sound much better than wasting the space or crazy compression
> schemes.

It turns out that merkle tree blocks are nearly "random" data, which
means that my experiments with creating fsverity files and trying to
compress merkle tree blocks actually made it *bigger*.  No zlib for us!

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-21 21:39                 ` Darrick J. Wong
@ 2024-03-21 22:02                   ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-03-21 22:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Thu, Mar 21, 2024 at 02:39:03PM -0700, Darrick J. Wong wrote:
> 
> It turns out that merkle tree blocks are nearly "random" data, which
> means that my experiments with creating fsverity files and trying to
> compress merkle tree blocks actually made it *bigger*.  No zlib for us!

For a cryptographically secure hash, not having high entropy would be a
bad thing..


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-18 22:45 ` [PATCH 3/9] xfs: convert buffer cache to use high order folios Dave Chinner
  2024-03-19  6:55   ` Christoph Hellwig
  2024-03-19 17:29   ` Darrick J. Wong
@ 2024-03-22  8:02   ` Pankaj Raghav (Samsung)
  2024-03-22 22:04     ` Dave Chinner
  2 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-22  8:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

>  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
>  {
>  	ASSERT(bp->b_flags & _XBF_FOLIOS);
>  	if (bp->b_folio_count == 1) {
> -		/* A single page buffer is always mappable */
> +		/* A single folio buffer is always mappable */
>  		bp->b_addr = folio_address(bp->b_folios[0]);
>  	} else if (flags & XBF_UNMAPPED) {
>  		bp->b_addr = NULL;
> @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
>  	int		*count,
>  	blk_opf_t	op)
>  {
> -	int		page_index;
> -	unsigned int	total_nr_pages = bp->b_folio_count;
> -	int		nr_pages;
> +	int		folio_index;
> +	unsigned int	total_nr_folios = bp->b_folio_count;
> +	int		nr_folios;
>  	struct bio	*bio;
>  	sector_t	sector =  bp->b_maps[map].bm_bn;
>  	int		size;
>  	int		offset;
>  
> -	/* skip the pages in the buffer before the start offset */
> -	page_index = 0;
> +	/*
> +	 * If the start offset if larger than a single page, we need to be
> +	 * careful. We might have a high order folio, in which case the indexing
> +	 * is from the start of the buffer. However, if we have more than one
> +	 * folio single page folio in the buffer, we need to skip the folios in
s/folio single page folio/single page folio/

> +	 * the buffer before the start offset.
> +	 */
> +	folio_index = 0;
>  	offset = *buf_offset;
> -	while (offset >= PAGE_SIZE) {
> -		page_index++;
> -		offset -= PAGE_SIZE;
> +	if (bp->b_folio_count > 1) {
> +		while (offset >= PAGE_SIZE) {
> +			folio_index++;
> +			offset -= PAGE_SIZE;

Can this be:
folio_index = offset >> PAGE_SHIFT;
offset = offset_in_page(offset);

instead of a loop?

> +		}
>  	}
>  
>  	/*

--
Pankaj

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-22  8:02   ` Pankaj Raghav (Samsung)
@ 2024-03-22 22:04     ` Dave Chinner
  2024-03-25 11:17       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2024-03-22 22:04 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung); +Cc: linux-xfs

On Fri, Mar 22, 2024 at 09:02:31AM +0100, Pankaj Raghav (Samsung) wrote:
> >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
> >  {
> >  	ASSERT(bp->b_flags & _XBF_FOLIOS);
> >  	if (bp->b_folio_count == 1) {
> > -		/* A single page buffer is always mappable */
> > +		/* A single folio buffer is always mappable */
> >  		bp->b_addr = folio_address(bp->b_folios[0]);
> >  	} else if (flags & XBF_UNMAPPED) {
> >  		bp->b_addr = NULL;
> > @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
> >  	int		*count,
> >  	blk_opf_t	op)
> >  {
> > -	int		page_index;
> > -	unsigned int	total_nr_pages = bp->b_folio_count;
> > -	int		nr_pages;
> > +	int		folio_index;
> > +	unsigned int	total_nr_folios = bp->b_folio_count;
> > +	int		nr_folios;
> >  	struct bio	*bio;
> >  	sector_t	sector =  bp->b_maps[map].bm_bn;
> >  	int		size;
> >  	int		offset;
> >  
> > -	/* skip the pages in the buffer before the start offset */
> > -	page_index = 0;
> > +	/*
> > +	 * If the start offset if larger than a single page, we need to be
> > +	 * careful. We might have a high order folio, in which case the indexing
> > +	 * is from the start of the buffer. However, if we have more than one
> > +	 * folio single page folio in the buffer, we need to skip the folios in
> s/folio single page folio/single page folio/
> 
> > +	 * the buffer before the start offset.
> > +	 */
> > +	folio_index = 0;
> >  	offset = *buf_offset;
> > -	while (offset >= PAGE_SIZE) {
> > -		page_index++;
> > -		offset -= PAGE_SIZE;
> > +	if (bp->b_folio_count > 1) {
> > +		while (offset >= PAGE_SIZE) {
> > +			folio_index++;
> > +			offset -= PAGE_SIZE;
> 
> Can this be:
> folio_index = offset >> PAGE_SHIFT;
> offset = offset_in_page(offset);
> 
> instead of a loop?

It could, but I'm not going to change it or even bother to fix the
comment as this code goes away later in the patch set.

See "[PATCH 7/9] xfs: walk b_addr for buffer I/O" later in the
series - once we can rely on bp->b_addr always being set for buffers
we convert this code to a simpler and more efficient folio based
iteration that is not reliant on pages or PAGE_SIZE at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/9] xfs: convert buffer cache to use high order folios
  2024-03-22 22:04     ` Dave Chinner
@ 2024-03-25 11:17       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-25 11:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Mar 23, 2024 at 09:04:51AM +1100, Dave Chinner wrote:
> On Fri, Mar 22, 2024 at 09:02:31AM +0100, Pankaj Raghav (Samsung) wrote:
> > >  	 * Bulk filling of pages can take multiple calls. Not filling the entire
> > > @@ -426,7 +484,7 @@ _xfs_buf_map_folios(
> > >  {
> > >  	ASSERT(bp->b_flags & _XBF_FOLIOS);
> > >  	if (bp->b_folio_count == 1) {
> > > -		/* A single page buffer is always mappable */
> > > +		/* A single folio buffer is always mappable */
> > >  		bp->b_addr = folio_address(bp->b_folios[0]);
> > >  	} else if (flags & XBF_UNMAPPED) {
> > >  		bp->b_addr = NULL;
> > > @@ -1525,20 +1583,28 @@ xfs_buf_ioapply_map(
> > >  	int		*count,
> > >  	blk_opf_t	op)
> > >  {
> > > -	int		page_index;
> > > -	unsigned int	total_nr_pages = bp->b_folio_count;
> > > -	int		nr_pages;
> > > +	int		folio_index;
> > > +	unsigned int	total_nr_folios = bp->b_folio_count;
> > > +	int		nr_folios;
> > >  	struct bio	*bio;
> > >  	sector_t	sector =  bp->b_maps[map].bm_bn;
> > >  	int		size;
> > >  	int		offset;
> > >  
> > > -	/* skip the pages in the buffer before the start offset */
> > > -	page_index = 0;
> > > +	/*
> > > +	 * If the start offset if larger than a single page, we need to be
> > > +	 * careful. We might have a high order folio, in which case the indexing
> > > +	 * is from the start of the buffer. However, if we have more than one
> > > +	 * folio single page folio in the buffer, we need to skip the folios in
> > s/folio single page folio/single page folio/
> > 
> > > +	 * the buffer before the start offset.
> > > +	 */
> > > +	folio_index = 0;
> > >  	offset = *buf_offset;
> > > -	while (offset >= PAGE_SIZE) {
> > > -		page_index++;
> > > -		offset -= PAGE_SIZE;
> > > +	if (bp->b_folio_count > 1) {
> > > +		while (offset >= PAGE_SIZE) {
> > > +			folio_index++;
> > > +			offset -= PAGE_SIZE;
> > 
> > Can this be:
> > folio_index = offset >> PAGE_SHIFT;
> > offset = offset_in_page(offset);
> > 
> > instead of a loop?
> 
> It could, but I'm not going to change it or even bother to fix the
> comment as this code goes away later in the patch set.
> 
> See "[PATCH 7/9] xfs: walk b_addr for buffer I/O" later in the
> series - once we can rely on bp->b_addr always being set for buffers
> we convert this code to a simpler and more efficient folio based
> iteration that is not reliant on pages or PAGE_SIZE at all.
> 

Ah, I missed seeing the whole series before replying to individual patches.
Nvm. 

Thanks for the pointer on patch 7, and I agree that it does not make 
sense to change this patch as it goes away later.

--
Pankaj

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-03-25 11:23 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-18 22:45 [PATCH v2 0/9] xfs: use large folios for buffers Dave Chinner
2024-03-18 22:45 ` [PATCH 1/9] xfs: unmapped buffer item size straddling mismatch Dave Chinner
2024-03-18 22:45 ` [PATCH 2/9] xfs: use folios in the buffer cache Dave Chinner
2024-03-19  6:38   ` Christoph Hellwig
2024-03-19  6:52     ` Dave Chinner
2024-03-19  6:53   ` Christoph Hellwig
2024-03-19 21:42     ` Dave Chinner
2024-03-19 21:42     ` Dave Chinner
2024-03-19 17:15   ` Darrick J. Wong
2024-03-18 22:45 ` [PATCH 3/9] xfs: convert buffer cache to use high order folios Dave Chinner
2024-03-19  6:55   ` Christoph Hellwig
2024-03-19 17:29   ` Darrick J. Wong
2024-03-19 21:32     ` Christoph Hellwig
2024-03-19 21:38       ` Darrick J. Wong
2024-03-19 21:41         ` Christoph Hellwig
2024-03-19 22:23           ` Dave Chinner
2024-03-21  2:12           ` Darrick J. Wong
2024-03-21  2:40             ` Darrick J. Wong
2024-03-21 21:28               ` Christoph Hellwig
2024-03-21 21:39                 ` Darrick J. Wong
2024-03-21 22:02                   ` Christoph Hellwig
2024-03-19 21:55     ` Dave Chinner
2024-03-22  8:02   ` Pankaj Raghav (Samsung)
2024-03-22 22:04     ` Dave Chinner
2024-03-25 11:17       ` Pankaj Raghav (Samsung)
2024-03-18 22:45 ` [PATCH 4/9] xfs: kill XBF_UNMAPPED Dave Chinner
2024-03-19 17:30   ` Darrick J. Wong
2024-03-19 23:36     ` Dave Chinner
2024-03-18 22:45 ` [PATCH 5/9] xfs: buffer items don't straddle pages anymore Dave Chinner
2024-03-19  6:56   ` Christoph Hellwig
2024-03-19 17:31   ` Darrick J. Wong
2024-03-18 22:45 ` [PATCH 6/9] xfs: map buffers in xfs_buf_alloc_folios Dave Chinner
2024-03-19 17:34   ` Darrick J. Wong
2024-03-19 21:32     ` Christoph Hellwig
2024-03-19 21:39       ` Darrick J. Wong
2024-03-19 21:41         ` Christoph Hellwig
2024-03-18 22:45 ` [PATCH 7/9] xfs: walk b_addr for buffer I/O Dave Chinner
2024-03-19 17:42   ` Darrick J. Wong
2024-03-19 21:33     ` Christoph Hellwig
2024-03-18 22:45 ` [PATCH 8/9] xfs: use vmalloc for multi-folio buffers Dave Chinner
2024-03-19 17:48   ` Darrick J. Wong
2024-03-20  0:20     ` Dave Chinner
2024-03-18 22:46 ` [PATCH 9/9] xfs: rename bp->b_folio_count Dave Chinner
2024-03-19  7:37   ` Christoph Hellwig
2024-03-19 23:59     ` Dave Chinner
2024-03-19  0:24 ` [PATCH v2 0/9] xfs: use large folios for buffers Christoph Hellwig
2024-03-19  0:44   ` Dave Chinner
