* [PATCH 0/6 v3] xfs: lockless buffer lookups
@ 2022-07-07 23:52 Dave Chinner
  2022-07-07 23:52 ` [PATCH 1/6] xfs: rework xfs_buf_incore() API Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Current work to merge the XFS inode life cycle with the VFS inode
life cycle is finding some interesting issues. If we have a path
that hits buffer trylocks fairly hard (e.g. a non-blocking
background inode freeing function), we end up hitting massive
contention on the buffer cache hash locks:

-   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
   - 92.67% xfs_inodegc_worker
      - 92.13% xfs_inode_unlink
         - 91.52% xfs_inactive_ifree
            - 85.63% xfs_read_agi
               - 85.61% xfs_trans_read_buf_map
                  - 85.59% xfs_buf_read_map
                     - xfs_buf_get_map
                        - 85.55% xfs_buf_find
                           - 72.87% _raw_spin_lock
                              - do_raw_spin_lock
                                   71.86% __pv_queued_spin_lock_slowpath
                           - 8.74% xfs_buf_rele
                              - 7.88% _raw_spin_lock
                                 - 7.88% do_raw_spin_lock
                                      7.63% __pv_queued_spin_lock_slowpath
                           - 1.70% xfs_buf_trylock
                              - 1.68% down_trylock
                                 - 1.41% _raw_spin_lock_irqsave
                                    - 1.39% do_raw_spin_lock
                                         __pv_queued_spin_lock_slowpath
                           - 0.76% _raw_spin_unlock
                                0.75% do_raw_spin_unlock

This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.

We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by creating a lookup fast
path which leverages the rhashtable's ability to do rcu protected
lookups.
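
To illustrate the idea up front, the cache hit fast path ends up
looking roughly like the sketch below. This is a simplified sketch
only - the function name here is made up and the locking of the
found buffer is omitted; the real version is xfs_buf_lookup() in
the final patch of this series:

/* Sketch of the RCU-protected cache hit path - see patch 6 for the real code. */
static int
xfs_buf_lookup_rcu(
        struct xfs_perag        *pag,
        struct xfs_buf_map      *map,
        struct xfs_buf          **bpp)
{
        struct xfs_buf          *bp;

        rcu_read_lock();
        bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
        if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
                /* Cache miss, or the buffer is being freed - fall back to the slow path. */
                rcu_read_unlock();
                return -ENOENT;
        }
        rcu_read_unlock();

        *bpp = bp;
        return 0;
}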

This is a rework of the initial lockless buffer lookup patch I sent
here:

https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/

And the alternative cleanup sent by Christoph here:

https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/

This version isn't quite as short as Christoph's, but it does roughly
the same thing in killing the two-phase _xfs_buf_find() call
mechanism. It separates the fast and slow paths a little more
cleanly and doesn't have context-dependent buffer return state from
the slow path that the caller needs to handle. It also picks up the
rhashtable insert optimisation that Christoph added.

This series passes fstests under several different configs and does
not cause any obvious regressions in scalability testing that has
been performed. Hence I'm proposing this as potential 5.20 cycle
material.

Thoughts, comments?

Version 3:
- rebased onto linux-xfs/for-next
- rearranged some of the changes to avoid repeated shuffling of code
  to different locations
- fixed typos in commits
- s/xfs_buf_find_verify/xfs_buf_map_verify/
- s/xfs_buf_find_fast/xfs_buf_lookup/

Version 2:
- https://lore.kernel.org/linux-xfs/20220627060841.244226-1-david@fromorbit.com/
- based on 5.19-rc2
- high speed collision of original proposals.

Initial versions:
- https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/
- https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/




* [PATCH 1/6] xfs: rework xfs_buf_incore() API
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-07 23:52 ` [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Make it consistent with the other buffer APIs: return an error and
place the buffer in an out parameter.
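
To illustrate the resulting calling convention (a sketch only - the
block number, length and flag values below are placeholders, and the
real conversions are in the diff that follows):

        struct xfs_buf  *bp;
        int             error;

        error = xfs_buf_incore(mp->m_ddev_targp, blkno, numblks,
                        XBF_TRYLOCK, &bp);
        if (error == -ENOENT)
                return 0;       /* nothing cached, nothing to do */
        if (error)
                return error;

        /* Use the locked, referenced buffer, then release it. */
        xfs_buf_relse(bp);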

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_attr_remote.c | 15 ++++++++++-----
 fs/xfs/scrub/repair.c           | 15 +++++++++------
 fs/xfs/xfs_buf.c                | 19 ++-----------------
 fs/xfs/xfs_buf.h                | 20 ++++++++++++++++----
 fs/xfs/xfs_qm.c                 |  9 ++++-----
 5 files changed, 41 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 7298c148f848..d440393b40eb 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -543,6 +543,7 @@ xfs_attr_rmtval_stale(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_buf		*bp;
+	int			error;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
@@ -550,14 +551,18 @@ xfs_attr_rmtval_stale(
 	    XFS_IS_CORRUPT(mp, map->br_startblock == HOLESTARTBLOCK))
 		return -EFSCORRUPTED;
 
-	bp = xfs_buf_incore(mp->m_ddev_targp,
+	error = xfs_buf_incore(mp->m_ddev_targp,
 			XFS_FSB_TO_DADDR(mp, map->br_startblock),
-			XFS_FSB_TO_BB(mp, map->br_blockcount), incore_flags);
-	if (bp) {
-		xfs_buf_stale(bp);
-		xfs_buf_relse(bp);
+			XFS_FSB_TO_BB(mp, map->br_blockcount),
+			incore_flags, &bp);
+	if (error) {
+		if (error == -ENOENT)
+			return 0;
+		return error;
 	}
 
+	xfs_buf_stale(bp);
+	xfs_buf_relse(bp);
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 1e7b6b209ee8..5e7428782571 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -457,16 +457,19 @@ xrep_invalidate_blocks(
 	 * assume it's owned by someone else.
 	 */
 	for_each_xbitmap_block(fsbno, bmr, n, bitmap) {
+		int		error;
+
 		/* Skip AG headers and post-EOFS blocks */
 		if (!xfs_verify_fsbno(sc->mp, fsbno))
 			continue;
-		bp = xfs_buf_incore(sc->mp->m_ddev_targp,
+		error = xfs_buf_incore(sc->mp->m_ddev_targp,
 				XFS_FSB_TO_DADDR(sc->mp, fsbno),
-				XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK);
-		if (bp) {
-			xfs_trans_bjoin(sc->tp, bp);
-			xfs_trans_binval(sc->tp, bp);
-		}
+				XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK, &bp);
+		if (error)
+			continue;
+
+		xfs_trans_bjoin(sc->tp, bp);
+		xfs_trans_binval(sc->tp, bp);
 	}
 
 	return 0;
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf4e60871068..143e1c70df5d 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -616,23 +616,6 @@ xfs_buf_find(
 	return 0;
 }
 
-struct xfs_buf *
-xfs_buf_incore(
-	struct xfs_buftarg	*target,
-	xfs_daddr_t		blkno,
-	size_t			numblks,
-	xfs_buf_flags_t		flags)
-{
-	struct xfs_buf		*bp;
-	int			error;
-	DEFINE_SINGLE_BUF_MAP(map, blkno, numblks);
-
-	error = xfs_buf_find(target, &map, 1, flags, NULL, &bp);
-	if (error)
-		return NULL;
-	return bp;
-}
-
 /*
  * Assembles a buffer covering the specified range. The code is optimised for
  * cache hits, as metadata intensive workloads will see 3 orders of magnitude
@@ -656,6 +639,8 @@ xfs_buf_get_map(
 		goto found;
 	if (error != -ENOENT)
 		return error;
+	if (flags & XBF_INCORE)
+		return -ENOENT;
 
 	error = _xfs_buf_alloc(target, map, nmaps, flags, &new_bp);
 	if (error)
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 1ee3056ff9cf..58e9034d51bd 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -42,9 +42,11 @@ struct xfs_buf;
 #define _XBF_DELWRI_Q	 (1u << 22)/* buffer on a delwri queue */
 
 /* flags used only as arguments to access routines */
+#define XBF_INCORE	 (1u << 29)/* lookup only, return if found in cache */
 #define XBF_TRYLOCK	 (1u << 30)/* lock requested, but do not wait */
 #define XBF_UNMAPPED	 (1u << 31)/* do not map the buffer */
 
+
 typedef unsigned int xfs_buf_flags_t;
 
 #define XFS_BUF_FLAGS \
@@ -63,6 +65,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	/* The following interface flags should never be set */ \
+	{ XBF_INCORE,		"INCORE" }, \
 	{ XBF_TRYLOCK,		"TRYLOCK" }, \
 	{ XBF_UNMAPPED,		"UNMAPPED" }
 
@@ -196,10 +199,6 @@ struct xfs_buf {
 };
 
 /* Finding and Reading Buffers */
-struct xfs_buf *xfs_buf_incore(struct xfs_buftarg *target,
-			   xfs_daddr_t blkno, size_t numblks,
-			   xfs_buf_flags_t flags);
-
 int xfs_buf_get_map(struct xfs_buftarg *target, struct xfs_buf_map *map,
 		int nmaps, xfs_buf_flags_t flags, struct xfs_buf **bpp);
 int xfs_buf_read_map(struct xfs_buftarg *target, struct xfs_buf_map *map,
@@ -209,6 +208,19 @@ void xfs_buf_readahead_map(struct xfs_buftarg *target,
 			       struct xfs_buf_map *map, int nmaps,
 			       const struct xfs_buf_ops *ops);
 
+static inline int
+xfs_buf_incore(
+	struct xfs_buftarg	*target,
+	xfs_daddr_t		blkno,
+	size_t			numblks,
+	xfs_buf_flags_t		flags,
+	struct xfs_buf		**bpp)
+{
+	DEFINE_SINGLE_BUF_MAP(map, blkno, numblks);
+
+	return xfs_buf_get_map(target, &map, 1, XBF_INCORE | flags, bpp);
+}
+
 static inline int
 xfs_buf_get(
 	struct xfs_buftarg	*target,
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index abf08bbf34a9..3517a6be8dad 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1229,12 +1229,11 @@ xfs_qm_flush_one(
 	 */
 	if (!xfs_dqflock_nowait(dqp)) {
 		/* buf is pinned in-core by delwri list */
-		bp = xfs_buf_incore(mp->m_ddev_targp, dqp->q_blkno,
-				mp->m_quotainfo->qi_dqchunklen, 0);
-		if (!bp) {
-			error = -EINVAL;
+		error = xfs_buf_incore(mp->m_ddev_targp, dqp->q_blkno,
+				mp->m_quotainfo->qi_dqchunklen, 0, &bp);
+		if (error)
 			goto out_unlock;
-		}
+
 		xfs_buf_unlock(bp);
 
 		xfs_buf_delwri_pushbuf(bp, buffer_list);
-- 
2.36.1



* [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
  2022-07-07 23:52 ` [PATCH 1/6] xfs: rework xfs_buf_incore() API Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-09 22:58   ` Darrick J. Wong
  2022-07-07 23:52 ` [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map() Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_buf_find() is made up of three main parts: lookup, insert and
locking. The interactions with xfs_buf_get_map() require it to be
called twice - once for a pure lookup, and again on lookup failure
so the insert path can be run. We want to simplify this down a lot,
so split it into a fast path lookup, a slow path insert and a "lock
the found buffer" helper. This will then let us integrate these
operations more effectively into xfs_buf_get_map() in future
patches.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 159 +++++++++++++++++++++++++++++++----------------
 1 file changed, 105 insertions(+), 54 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 143e1c70df5d..91dc691f40a8 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -503,6 +503,99 @@ xfs_buf_hash_destroy(
 	rhashtable_destroy(&pag->pag_buf_hash);
 }
 
+static int
+xfs_buf_map_verify(
+	struct xfs_buftarg	*btp,
+	struct xfs_buf_map	*map)
+{
+	xfs_daddr_t		eofs;
+
+	/* Check for IOs smaller than the sector size / not sector aligned */
+	ASSERT(!(BBTOB(map->bm_len) < btp->bt_meta_sectorsize));
+	ASSERT(!(BBTOB(map->bm_bn) & (xfs_off_t)btp->bt_meta_sectormask));
+
+	/*
+	 * Corrupted block numbers can get through to here, unfortunately, so we
+	 * have to check that the buffer falls within the filesystem bounds.
+	 */
+	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
+	if (map->bm_bn < 0 || map->bm_bn >= eofs) {
+		xfs_alert(btp->bt_mount,
+			  "%s: daddr 0x%llx out of range, EOFS 0x%llx",
+			  __func__, map->bm_bn, eofs);
+		WARN_ON(1);
+		return -EFSCORRUPTED;
+	}
+	return 0;
+}
+
+static int
+xfs_buf_find_lock(
+	struct xfs_buftarg	*btp,
+	struct xfs_buf          *bp,
+	xfs_buf_flags_t		flags)
+{
+	if (!xfs_buf_trylock(bp)) {
+		if (flags & XBF_TRYLOCK) {
+			xfs_buf_rele(bp);
+			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
+			return -EAGAIN;
+		}
+		xfs_buf_lock(bp);
+		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
+	}
+
+	/*
+	 * if the buffer is stale, clear all the external state associated with
+	 * it. We need to keep flags such as how we allocated the buffer memory
+	 * intact here.
+	 */
+	if (bp->b_flags & XBF_STALE) {
+		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
+		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
+		bp->b_ops = NULL;
+	}
+	return 0;
+}
+
+static inline struct xfs_buf *
+xfs_buf_lookup(
+	struct xfs_perag	*pag,
+	struct xfs_buf_map	*map)
+{
+	struct xfs_buf          *bp;
+
+	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
+	if (!bp)
+		return NULL;
+	atomic_inc(&bp->b_hold);
+	return bp;
+}
+
+/*
+ * Insert the new_bp into the hash table. This consumes the perag reference
+ * taken for the lookup.
+ */
+static int
+xfs_buf_find_insert(
+	struct xfs_buftarg	*btp,
+	struct xfs_perag	*pag,
+	struct xfs_buf		*new_bp)
+{
+	/* No match found */
+	if (!new_bp) {
+		xfs_perag_put(pag);
+		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
+		return -ENOENT;
+	}
+
+	/* the buffer keeps the perag reference until it is freed */
+	new_bp->b_pag = pag;
+	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
+			       xfs_buf_hash_params);
+	return 0;
+}
+
 /*
  * Look up a buffer in the buffer cache and return it referenced and locked
  * in @found_bp.
@@ -533,7 +626,7 @@ xfs_buf_find(
 	struct xfs_perag	*pag;
 	struct xfs_buf		*bp;
 	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
-	xfs_daddr_t		eofs;
+	int			error;
 	int			i;
 
 	*found_bp = NULL;
@@ -541,47 +634,22 @@ xfs_buf_find(
 	for (i = 0; i < nmaps; i++)
 		cmap.bm_len += map[i].bm_len;
 
-	/* Check for IOs smaller than the sector size / not sector aligned */
-	ASSERT(!(BBTOB(cmap.bm_len) < btp->bt_meta_sectorsize));
-	ASSERT(!(BBTOB(cmap.bm_bn) & (xfs_off_t)btp->bt_meta_sectormask));
-
-	/*
-	 * Corrupted block numbers can get through to here, unfortunately, so we
-	 * have to check that the buffer falls within the filesystem bounds.
-	 */
-	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
-	if (cmap.bm_bn < 0 || cmap.bm_bn >= eofs) {
-		xfs_alert(btp->bt_mount,
-			  "%s: daddr 0x%llx out of range, EOFS 0x%llx",
-			  __func__, cmap.bm_bn, eofs);
-		WARN_ON(1);
-		return -EFSCORRUPTED;
-	}
+	error = xfs_buf_map_verify(btp, &cmap);
+	if (error)
+		return error;
 
 	pag = xfs_perag_get(btp->bt_mount,
 			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
 
 	spin_lock(&pag->pag_buf_lock);
-	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmap,
-				    xfs_buf_hash_params);
-	if (bp) {
-		atomic_inc(&bp->b_hold);
+	bp = xfs_buf_lookup(pag, &cmap);
+	if (bp)
 		goto found;
-	}
-
-	/* No match found */
-	if (!new_bp) {
-		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
-		spin_unlock(&pag->pag_buf_lock);
-		xfs_perag_put(pag);
-		return -ENOENT;
-	}
 
-	/* the buffer keeps the perag reference until it is freed */
-	new_bp->b_pag = pag;
-	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
-			       xfs_buf_hash_params);
+	error = xfs_buf_find_insert(btp, pag, new_bp);
 	spin_unlock(&pag->pag_buf_lock);
+	if (error)
+		return error;
 	*found_bp = new_bp;
 	return 0;
 
@@ -589,26 +657,9 @@ xfs_buf_find(
 	spin_unlock(&pag->pag_buf_lock);
 	xfs_perag_put(pag);
 
-	if (!xfs_buf_trylock(bp)) {
-		if (flags & XBF_TRYLOCK) {
-			xfs_buf_rele(bp);
-			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
-			return -EAGAIN;
-		}
-		xfs_buf_lock(bp);
-		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
-	}
-
-	/*
-	 * if the buffer is stale, clear all the external state associated with
-	 * it. We need to keep flags such as how we allocated the buffer memory
-	 * intact here.
-	 */
-	if (bp->b_flags & XBF_STALE) {
-		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
-		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
-		bp->b_ops = NULL;
-	}
+	error = xfs_buf_find_lock(btp, bp, flags);
+	if (error)
+		return error;
 
 	trace_xfs_buf_find(bp, flags, _RET_IP_);
 	XFS_STATS_INC(btp->bt_mount, xb_get_locked);
-- 
2.36.1



* [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map()
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
  2022-07-07 23:52 ` [PATCH 1/6] xfs: rework xfs_buf_incore() API Dave Chinner
  2022-07-07 23:52 ` [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-10  0:15   ` Darrick J. Wong
  2022-07-11  5:14   ` Christoph Hellwig
  2022-07-07 23:52 ` [PATCH 4/6] xfs: reduce the number of atomics when locking a buffer after lookup Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have factored xfs_buf_find(), we can start separating
xfs_buf_get_map() into distinct fast and slow paths. We start by
moving the lookup map and perag setup to _get_map(), and then move
all the specifics of the fast path lookup into xfs_buf_lookup()
and call it directly from _get_map(). We then move all the slow path
code to xfs_buf_find_insert(), which is now also called directly
from _get_map(). As such, xfs_buf_find() now goes away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c | 207 ++++++++++++++++++++++-------------------------
 1 file changed, 95 insertions(+), 112 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 91dc691f40a8..81ca951b451a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -531,18 +531,16 @@ xfs_buf_map_verify(
 
 static int
 xfs_buf_find_lock(
-	struct xfs_buftarg	*btp,
 	struct xfs_buf          *bp,
 	xfs_buf_flags_t		flags)
 {
 	if (!xfs_buf_trylock(bp)) {
 		if (flags & XBF_TRYLOCK) {
-			xfs_buf_rele(bp);
-			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
+			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
 			return -EAGAIN;
 		}
 		xfs_buf_lock(bp);
-		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
+		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
 	}
 
 	/*
@@ -558,113 +556,97 @@ xfs_buf_find_lock(
 	return 0;
 }
 
-static inline struct xfs_buf *
+static inline int
 xfs_buf_lookup(
 	struct xfs_perag	*pag,
-	struct xfs_buf_map	*map)
+	struct xfs_buf_map	*map,
+	xfs_buf_flags_t		flags,
+	struct xfs_buf		**bpp)
 {
 	struct xfs_buf          *bp;
+	int			error;
 
+	spin_lock(&pag->pag_buf_lock);
 	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
-	if (!bp)
-		return NULL;
+	if (!bp) {
+		spin_unlock(&pag->pag_buf_lock);
+		return -ENOENT;
+	}
 	atomic_inc(&bp->b_hold);
-	return bp;
-}
+	spin_unlock(&pag->pag_buf_lock);
 
-/*
- * Insert the new_bp into the hash table. This consumes the perag reference
- * taken for the lookup.
- */
-static int
-xfs_buf_find_insert(
-	struct xfs_buftarg	*btp,
-	struct xfs_perag	*pag,
-	struct xfs_buf		*new_bp)
-{
-	/* No match found */
-	if (!new_bp) {
-		xfs_perag_put(pag);
-		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
-		return -ENOENT;
+	error = xfs_buf_find_lock(bp, flags);
+	if (error) {
+		xfs_buf_rele(bp);
+		return error;
 	}
 
-	/* the buffer keeps the perag reference until it is freed */
-	new_bp->b_pag = pag;
-	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
-			       xfs_buf_hash_params);
+	trace_xfs_buf_find(bp, flags, _RET_IP_);
+	*bpp = bp;
 	return 0;
 }
 
 /*
- * Look up a buffer in the buffer cache and return it referenced and locked
- * in @found_bp.
- *
- * If @new_bp is supplied and we have a lookup miss, insert @new_bp into the
- * cache.
- *
- * If XBF_TRYLOCK is set in @flags, only try to lock the buffer and return
- * -EAGAIN if we fail to lock it.
- *
- * Return values are:
- *	-EFSCORRUPTED if have been supplied with an invalid address
- *	-EAGAIN on trylock failure
- *	-ENOENT if we fail to find a match and @new_bp was NULL
- *	0, with @found_bp:
- *		- @new_bp if we inserted it into the cache
- *		- the buffer we found and locked.
+ * Insert the new_bp into the hash table. This consumes the perag reference
+ * taken for the lookup regardless of the result of the insert.
  */
 static int
-xfs_buf_find(
+xfs_buf_find_insert(
 	struct xfs_buftarg	*btp,
+	struct xfs_perag	*pag,
+	struct xfs_buf_map	*cmap,
 	struct xfs_buf_map	*map,
 	int			nmaps,
 	xfs_buf_flags_t		flags,
-	struct xfs_buf		*new_bp,
-	struct xfs_buf		**found_bp)
+	struct xfs_buf		**bpp)
 {
-	struct xfs_perag	*pag;
+	struct xfs_buf		*new_bp;
 	struct xfs_buf		*bp;
-	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
 	int			error;
-	int			i;
-
-	*found_bp = NULL;
-
-	for (i = 0; i < nmaps; i++)
-		cmap.bm_len += map[i].bm_len;
 
-	error = xfs_buf_map_verify(btp, &cmap);
+	error = _xfs_buf_alloc(btp, map, nmaps, flags, &new_bp);
 	if (error)
-		return error;
+		goto out_drop_pag;
 
-	pag = xfs_perag_get(btp->bt_mount,
-			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
+	/*
+	 * For buffers that fit entirely within a single page, first attempt to
+	 * allocate the memory from the heap to minimise memory usage. If we
+	 * can't get heap memory for these small buffers, we fall back to using
+	 * the page allocator.
+	 */
+	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
+	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
+		error = xfs_buf_alloc_pages(new_bp, flags);
+		if (error)
+			goto out_free_buf;
+	}
 
 	spin_lock(&pag->pag_buf_lock);
-	bp = xfs_buf_lookup(pag, &cmap);
-	if (bp)
-		goto found;
+	bp = rhashtable_lookup(&pag->pag_buf_hash, cmap, xfs_buf_hash_params);
+	if (bp) {
+		atomic_inc(&bp->b_hold);
+		spin_unlock(&pag->pag_buf_lock);
+		error = xfs_buf_find_lock(bp, flags);
+		if (error)
+			xfs_buf_rele(bp);
+		else
+			*bpp = bp;
+		goto out_free_buf;
+	}
 
-	error = xfs_buf_find_insert(btp, pag, new_bp);
+	/* The buffer keeps the perag reference until it is freed. */
+	new_bp->b_pag = pag;
+	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
+			       xfs_buf_hash_params);
 	spin_unlock(&pag->pag_buf_lock);
-	if (error)
-		return error;
-	*found_bp = new_bp;
+	*bpp = new_bp;
 	return 0;
 
-found:
-	spin_unlock(&pag->pag_buf_lock);
+out_free_buf:
+	xfs_buf_free(new_bp);
+out_drop_pag:
 	xfs_perag_put(pag);
-
-	error = xfs_buf_find_lock(btp, bp, flags);
-	if (error)
-		return error;
-
-	trace_xfs_buf_find(bp, flags, _RET_IP_);
-	XFS_STATS_INC(btp->bt_mount, xb_get_locked);
-	*found_bp = bp;
-	return 0;
+	return error;
 }
 
 /*
@@ -674,54 +656,54 @@ xfs_buf_find(
  */
 int
 xfs_buf_get_map(
-	struct xfs_buftarg	*target,
+	struct xfs_buftarg	*btp,
 	struct xfs_buf_map	*map,
 	int			nmaps,
 	xfs_buf_flags_t		flags,
 	struct xfs_buf		**bpp)
 {
-	struct xfs_buf		*bp;
-	struct xfs_buf		*new_bp;
+	struct xfs_perag	*pag;
+	struct xfs_buf		*bp = NULL;
+	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
 	int			error;
+	int			i;
 
-	*bpp = NULL;
-	error = xfs_buf_find(target, map, nmaps, flags, NULL, &bp);
-	if (!error)
-		goto found;
-	if (error != -ENOENT)
-		return error;
-	if (flags & XBF_INCORE)
-		return -ENOENT;
+	for (i = 0; i < nmaps; i++)
+		cmap.bm_len += map[i].bm_len;
 
-	error = _xfs_buf_alloc(target, map, nmaps, flags, &new_bp);
+	error = xfs_buf_map_verify(btp, &cmap);
 	if (error)
 		return error;
 
-	/*
-	 * For buffers that fit entirely within a single page, first attempt to
-	 * allocate the memory from the heap to minimise memory usage. If we
-	 * can't get heap memory for these small buffers, we fall back to using
-	 * the page allocator.
-	 */
-	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
-	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
-		error = xfs_buf_alloc_pages(new_bp, flags);
-		if (error)
-			goto out_free_buf;
-	}
+	pag = xfs_perag_get(btp->bt_mount,
+			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
 
-	error = xfs_buf_find(target, map, nmaps, flags, new_bp, &bp);
-	if (error)
-		goto out_free_buf;
+	error = xfs_buf_lookup(pag, &cmap, flags, &bp);
+	if (error && error != -ENOENT)
+		goto out_put_perag;
+
+	/* cache hits always outnumber misses by at least 10:1 */
+	if (unlikely(!bp)) {
+		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
 
-	if (bp != new_bp)
-		xfs_buf_free(new_bp);
+		if (flags & XBF_INCORE)
+			goto out_put_perag;
 
-found:
+		/* xfs_buf_find_insert() consumes the perag reference. */
+		error = xfs_buf_find_insert(btp, pag, &cmap, map, nmaps,
+				flags, &bp);
+		if (error)
+			return error;
+	} else {
+		XFS_STATS_INC(btp->bt_mount, xb_get_locked);
+		xfs_perag_put(pag);
+	}
+
+	/* We do not hold a perag reference anymore. */
 	if (!bp->b_addr) {
 		error = _xfs_buf_map_pages(bp, flags);
 		if (unlikely(error)) {
-			xfs_warn_ratelimited(target->bt_mount,
+			xfs_warn_ratelimited(btp->bt_mount,
 				"%s: failed to map %u pages", __func__,
 				bp->b_page_count);
 			xfs_buf_relse(bp);
@@ -736,12 +718,13 @@ xfs_buf_get_map(
 	if (!(flags & XBF_READ))
 		xfs_buf_ioerror(bp, 0);
 
-	XFS_STATS_INC(target->bt_mount, xb_get);
+	XFS_STATS_INC(btp->bt_mount, xb_get);
 	trace_xfs_buf_get(bp, flags, _RET_IP_);
 	*bpp = bp;
 	return 0;
-out_free_buf:
-	xfs_buf_free(new_bp);
+
+out_put_perag:
+	xfs_perag_put(pag);
 	return error;
 }
 
-- 
2.36.1



* [PATCH 4/6] xfs: reduce the number of atomics when locking a buffer after lookup
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
                   ` (2 preceding siblings ...)
  2022-07-07 23:52 ` [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map() Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-07 23:52 ` [PATCH 5/6] xfs: remove a superfluous hash lookup when inserting new buffers Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Avoid an extra atomic operation in the non-trylock case by only
doing a trylock if the XBF_TRYLOCK flag is set. This follows the
pattern in the IO path with NOWAIT semantics where the
"trylock-fail-lock" path showed 5-10% reduced throughput compared to
just using a single lock call when not under NOWAIT conditions. So
make that same change here, too.

See commit 942491c9e6d6 ("xfs: fix AIM7 regression") for details.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 81ca951b451a..374c4e508b12 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -534,11 +534,12 @@ xfs_buf_find_lock(
 	struct xfs_buf          *bp,
 	xfs_buf_flags_t		flags)
 {
-	if (!xfs_buf_trylock(bp)) {
-		if (flags & XBF_TRYLOCK) {
+	if (flags & XBF_TRYLOCK) {
+		if (!xfs_buf_trylock(bp)) {
 			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
 			return -EAGAIN;
 		}
+	} else {
 		xfs_buf_lock(bp);
 		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
 	}
-- 
2.36.1



* [PATCH 5/6] xfs: remove a superfluous hash lookup when inserting new buffers
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
                   ` (3 preceding siblings ...)
  2022-07-07 23:52 ` [PATCH 4/6] xfs: reduce the number of atomics when locking a buffer after lookup Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-07 23:52 ` [PATCH 6/6] xfs: lockless buffer lookup Dave Chinner
  2022-07-13 17:01 ` [PATCH 0/6 v3] xfs: lockless buffer lookups Darrick J. Wong
  6 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently on the slow path insert we repeat the initial hash table
lookup before we attempt the insert, resulting in two traversals
of the hash table to ensure the insert is valid. The rhashtable API
provides a method for an atomic lookup and insert operation, so we
can avoid one of the hash table traversals by using this method.

Adapted from a large patch containing this optimisation by Christoph
Hellwig.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 374c4e508b12..1a6542e01bec 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -623,8 +623,15 @@ xfs_buf_find_insert(
 	}
 
 	spin_lock(&pag->pag_buf_lock);
-	bp = rhashtable_lookup(&pag->pag_buf_hash, cmap, xfs_buf_hash_params);
+	bp = rhashtable_lookup_get_insert_fast(&pag->pag_buf_hash,
+			&new_bp->b_rhash_head, xfs_buf_hash_params);
+	if (IS_ERR(bp)) {
+		error = PTR_ERR(bp);
+		spin_unlock(&pag->pag_buf_lock);
+		goto out_free_buf;
+	}
 	if (bp) {
+		/* found an existing buffer */
 		atomic_inc(&bp->b_hold);
 		spin_unlock(&pag->pag_buf_lock);
 		error = xfs_buf_find_lock(bp, flags);
@@ -635,10 +642,8 @@ xfs_buf_find_insert(
 		goto out_free_buf;
 	}
 
-	/* The buffer keeps the perag reference until it is freed. */
+	/* The new buffer keeps the perag reference until it is freed. */
 	new_bp->b_pag = pag;
-	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
-			       xfs_buf_hash_params);
 	spin_unlock(&pag->pag_buf_lock);
 	*bpp = new_bp;
 	return 0;
-- 
2.36.1



* [PATCH 6/6] xfs: lockless buffer lookup
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
                   ` (4 preceding siblings ...)
  2022-07-07 23:52 ` [PATCH 5/6] xfs: remove a superfluous hash lookup when inserting new buffers Dave Chinner
@ 2022-07-07 23:52 ` Dave Chinner
  2022-07-10  0:15   ` Darrick J. Wong
  2022-07-13 17:01 ` [PATCH 0/6 v3] xfs: lockless buffer lookups Darrick J. Wong
  6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-07-07 23:52 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a standalone fast path for buffer lookup, we can
easily convert it to use rcu lookups. When we continually hammer the
buffer cache with trylock lookups, we end up with a huge amount of
lock contention on the per-ag buffer hash locks:

-   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
   - 92.67% xfs_inodegc_worker
      - 92.13% xfs_inode_unlink
         - 91.52% xfs_inactive_ifree
            - 85.63% xfs_read_agi
               - 85.61% xfs_trans_read_buf_map
                  - 85.59% xfs_buf_read_map
                     - xfs_buf_get_map
                        - 85.55% xfs_buf_find
                           - 72.87% _raw_spin_lock
                              - do_raw_spin_lock
                                   71.86% __pv_queued_spin_lock_slowpath
                           - 8.74% xfs_buf_rele
                              - 7.88% _raw_spin_lock
                                 - 7.88% do_raw_spin_lock
                                      7.63% __pv_queued_spin_lock_slowpath
                           - 1.70% xfs_buf_trylock
                              - 1.68% down_trylock
                                 - 1.41% _raw_spin_lock_irqsave
                                    - 1.39% do_raw_spin_lock
                                         __pv_queued_spin_lock_slowpath
                           - 0.76% _raw_spin_unlock
                                0.75% do_raw_spin_unlock

This is basically hammering the pag->pag_buf_lock from lots of CPUs
doing trylocks at the same time. Most of the buffer trylock
operations ultimately fail after we've done the lookup, so we're
really hammering the buf hash lock whilst making no progress.

We can also see significant spinlock traffic on the same lock just
under normal operation when lots of tasks are accessing metadata
from the same AG, so let's avoid all this by converting the lookup
fast path to leverage the rhashtable's ability to do rcu protected
lookups.

We avoid races with the buffer release path by using
atomic_inc_not_zero() on the buffer hold count. Any buffer that is
in the LRU will have a non-zero count, thereby allowing the lockless
fast path to be taken in most cache hit situations. If the buffer
hold count is zero, then it is likely going through the release path
so in that case we fall back to the existing lookup miss slow path.

The slow path will then do an atomic lookup and insert under the
buffer hash lock and hence serialise correctly against buffer
release freeing the buffer.

The use of rcu protected lookups means that buffer handles now need
to be freed by RCU callbacks (same as inodes). We still free the
buffer pages before the RCU callback - we won't be trying to access
them at all on a buffer that has zero references - but we need the
buffer handle itself to be present for the entire rcu protected read
side to detect a zero hold count correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_buf.c | 22 +++++++++++++++-------
 fs/xfs/xfs_buf.h |  1 +
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1a6542e01bec..6dac5583977f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -294,6 +294,16 @@ xfs_buf_free_pages(
 	bp->b_flags &= ~_XBF_PAGES;
 }
 
+static void
+xfs_buf_free_callback(
+	struct callback_head	*cb)
+{
+	struct xfs_buf		*bp = container_of(cb, struct xfs_buf, b_rcu);
+
+	xfs_buf_free_maps(bp);
+	kmem_cache_free(xfs_buf_cache, bp);
+}
+
 static void
 xfs_buf_free(
 	struct xfs_buf		*bp)
@@ -307,8 +317,7 @@ xfs_buf_free(
 	else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 
-	xfs_buf_free_maps(bp);
-	kmem_cache_free(xfs_buf_cache, bp);
+	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
 }
 
 static int
@@ -567,14 +576,13 @@ xfs_buf_lookup(
 	struct xfs_buf          *bp;
 	int			error;
 
-	spin_lock(&pag->pag_buf_lock);
+	rcu_read_lock();
 	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
-	if (!bp) {
-		spin_unlock(&pag->pag_buf_lock);
+	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
+		rcu_read_unlock();
 		return -ENOENT;
 	}
-	atomic_inc(&bp->b_hold);
-	spin_unlock(&pag->pag_buf_lock);
+	rcu_read_unlock();
 
 	error = xfs_buf_find_lock(bp, flags);
 	if (error) {
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 58e9034d51bd..02b3c1635ec3 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -196,6 +196,7 @@ struct xfs_buf {
 	int			b_last_error;
 
 	const struct xfs_buf_ops	*b_ops;
+	struct rcu_head		b_rcu;
 };
 
 /* Finding and Reading Buffers */
-- 
2.36.1



* Re: [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces
  2022-07-07 23:52 ` [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces Dave Chinner
@ 2022-07-09 22:58   ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-09 22:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 08, 2022 at 09:52:55AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_buf_find() is made up of three main parts: lookup, insert and
> locking. The interactions with xfs_buf_get_map() require it to be
> called twice - once for a pure lookup, and again on lookup failure
> so the insert path can be run. We want to simplify this down a lot,
> so split it into a fast path lookup, a slow path insert and a "lock
> the found buffer" helper. This will then let us integrate these
> operations more effectively into xfs_buf_get_map() in future
> patches.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Much improved, thank you.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c | 159 +++++++++++++++++++++++++++++++----------------
>  1 file changed, 105 insertions(+), 54 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 143e1c70df5d..91dc691f40a8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -503,6 +503,99 @@ xfs_buf_hash_destroy(
>  	rhashtable_destroy(&pag->pag_buf_hash);
>  }
>  
> +static int
> +xfs_buf_map_verify(
> +	struct xfs_buftarg	*btp,
> +	struct xfs_buf_map	*map)
> +{
> +	xfs_daddr_t		eofs;
> +
> +	/* Check for IOs smaller than the sector size / not sector aligned */
> +	ASSERT(!(BBTOB(map->bm_len) < btp->bt_meta_sectorsize));
> +	ASSERT(!(BBTOB(map->bm_bn) & (xfs_off_t)btp->bt_meta_sectormask));
> +
> +	/*
> +	 * Corrupted block numbers can get through to here, unfortunately, so we
> +	 * have to check that the buffer falls within the filesystem bounds.
> +	 */
> +	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
> +	if (map->bm_bn < 0 || map->bm_bn >= eofs) {
> +		xfs_alert(btp->bt_mount,
> +			  "%s: daddr 0x%llx out of range, EOFS 0x%llx",
> +			  __func__, map->bm_bn, eofs);
> +		WARN_ON(1);
> +		return -EFSCORRUPTED;
> +	}
> +	return 0;
> +}
> +
> +static int
> +xfs_buf_find_lock(
> +	struct xfs_buftarg	*btp,
> +	struct xfs_buf          *bp,
> +	xfs_buf_flags_t		flags)
> +{
> +	if (!xfs_buf_trylock(bp)) {
> +		if (flags & XBF_TRYLOCK) {
> +			xfs_buf_rele(bp);
> +			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
> +			return -EAGAIN;
> +		}
> +		xfs_buf_lock(bp);
> +		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
> +	}
> +
> +	/*
> +	 * if the buffer is stale, clear all the external state associated with
> +	 * it. We need to keep flags such as how we allocated the buffer memory
> +	 * intact here.
> +	 */
> +	if (bp->b_flags & XBF_STALE) {
> +		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> +		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
> +		bp->b_ops = NULL;
> +	}
> +	return 0;
> +}
> +
> +static inline struct xfs_buf *
> +xfs_buf_lookup(
> +	struct xfs_perag	*pag,
> +	struct xfs_buf_map	*map)
> +{
> +	struct xfs_buf          *bp;
> +
> +	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
> +	if (!bp)
> +		return NULL;
> +	atomic_inc(&bp->b_hold);
> +	return bp;
> +}
> +
> +/*
> + * Insert the new_bp into the hash table. This consumes the perag reference
> + * taken for the lookup.
> + */
> +static int
> +xfs_buf_find_insert(
> +	struct xfs_buftarg	*btp,
> +	struct xfs_perag	*pag,
> +	struct xfs_buf		*new_bp)
> +{
> +	/* No match found */
> +	if (!new_bp) {
> +		xfs_perag_put(pag);
> +		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
> +		return -ENOENT;
> +	}
> +
> +	/* the buffer keeps the perag reference until it is freed */
> +	new_bp->b_pag = pag;
> +	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
> +			       xfs_buf_hash_params);
> +	return 0;
> +}
> +
>  /*
>   * Look up a buffer in the buffer cache and return it referenced and locked
>   * in @found_bp.
> @@ -533,7 +626,7 @@ xfs_buf_find(
>  	struct xfs_perag	*pag;
>  	struct xfs_buf		*bp;
>  	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
> -	xfs_daddr_t		eofs;
> +	int			error;
>  	int			i;
>  
>  	*found_bp = NULL;
> @@ -541,47 +634,22 @@ xfs_buf_find(
>  	for (i = 0; i < nmaps; i++)
>  		cmap.bm_len += map[i].bm_len;
>  
> -	/* Check for IOs smaller than the sector size / not sector aligned */
> -	ASSERT(!(BBTOB(cmap.bm_len) < btp->bt_meta_sectorsize));
> -	ASSERT(!(BBTOB(cmap.bm_bn) & (xfs_off_t)btp->bt_meta_sectormask));
> -
> -	/*
> -	 * Corrupted block numbers can get through to here, unfortunately, so we
> -	 * have to check that the buffer falls within the filesystem bounds.
> -	 */
> -	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
> -	if (cmap.bm_bn < 0 || cmap.bm_bn >= eofs) {
> -		xfs_alert(btp->bt_mount,
> -			  "%s: daddr 0x%llx out of range, EOFS 0x%llx",
> -			  __func__, cmap.bm_bn, eofs);
> -		WARN_ON(1);
> -		return -EFSCORRUPTED;
> -	}
> +	error = xfs_buf_map_verify(btp, &cmap);
> +	if (error)
> +		return error;
>  
>  	pag = xfs_perag_get(btp->bt_mount,
>  			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
>  
>  	spin_lock(&pag->pag_buf_lock);
> -	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmap,
> -				    xfs_buf_hash_params);
> -	if (bp) {
> -		atomic_inc(&bp->b_hold);
> +	bp = xfs_buf_lookup(pag, &cmap);
> +	if (bp)
>  		goto found;
> -	}
> -
> -	/* No match found */
> -	if (!new_bp) {
> -		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
> -		spin_unlock(&pag->pag_buf_lock);
> -		xfs_perag_put(pag);
> -		return -ENOENT;
> -	}
>  
> -	/* the buffer keeps the perag reference until it is freed */
> -	new_bp->b_pag = pag;
> -	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
> -			       xfs_buf_hash_params);
> +	error = xfs_buf_find_insert(btp, pag, new_bp);
>  	spin_unlock(&pag->pag_buf_lock);
> +	if (error)
> +		return error;
>  	*found_bp = new_bp;
>  	return 0;
>  
> @@ -589,26 +657,9 @@ xfs_buf_find(
>  	spin_unlock(&pag->pag_buf_lock);
>  	xfs_perag_put(pag);
>  
> -	if (!xfs_buf_trylock(bp)) {
> -		if (flags & XBF_TRYLOCK) {
> -			xfs_buf_rele(bp);
> -			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
> -			return -EAGAIN;
> -		}
> -		xfs_buf_lock(bp);
> -		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
> -	}
> -
> -	/*
> -	 * if the buffer is stale, clear all the external state associated with
> -	 * it. We need to keep flags such as how we allocated the buffer memory
> -	 * intact here.
> -	 */
> -	if (bp->b_flags & XBF_STALE) {
> -		ASSERT((bp->b_flags & _XBF_DELWRI_Q) == 0);
> -		bp->b_flags &= _XBF_KMEM | _XBF_PAGES;
> -		bp->b_ops = NULL;
> -	}
> +	error = xfs_buf_find_lock(btp, bp, flags);
> +	if (error)
> +		return error;
>  
>  	trace_xfs_buf_find(bp, flags, _RET_IP_);
>  	XFS_STATS_INC(btp->bt_mount, xb_get_locked);
> -- 
> 2.36.1
> 


* Re: [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map()
  2022-07-07 23:52 ` [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map() Dave Chinner
@ 2022-07-10  0:15   ` Darrick J. Wong
  2022-07-11  5:14   ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-10  0:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 08, 2022 at 09:52:56AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have factored xfs_buf_find(), we can start separating
> xfs_buf_get_map() into distinct fast and slow paths. We start by
> moving the lookup map and perag setup to _get_map(), and then move
> all the specifics of the fast path lookup into xfs_buf_lookup()
> and call it directly from _get_map(). We then move all the slow path
> code to xfs_buf_find_insert(), which is now also called directly
> from _get_map(). As such, xfs_buf_find() now goes away.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Makes sense to /me...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c | 207 ++++++++++++++++++++++-------------------------
>  1 file changed, 95 insertions(+), 112 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 91dc691f40a8..81ca951b451a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -531,18 +531,16 @@ xfs_buf_map_verify(
>  
>  static int
>  xfs_buf_find_lock(
> -	struct xfs_buftarg	*btp,
>  	struct xfs_buf          *bp,
>  	xfs_buf_flags_t		flags)
>  {
>  	if (!xfs_buf_trylock(bp)) {
>  		if (flags & XBF_TRYLOCK) {
> -			xfs_buf_rele(bp);
> -			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
> +			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
>  			return -EAGAIN;
>  		}
>  		xfs_buf_lock(bp);
> -		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
> +		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
>  	}
>  
>  	/*
> @@ -558,113 +556,97 @@ xfs_buf_find_lock(
>  	return 0;
>  }
>  
> -static inline struct xfs_buf *
> +static inline int
>  xfs_buf_lookup(
>  	struct xfs_perag	*pag,
> -	struct xfs_buf_map	*map)
> +	struct xfs_buf_map	*map,
> +	xfs_buf_flags_t		flags,
> +	struct xfs_buf		**bpp)
>  {
>  	struct xfs_buf          *bp;
> +	int			error;
>  
> +	spin_lock(&pag->pag_buf_lock);
>  	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
> -	if (!bp)
> -		return NULL;
> +	if (!bp) {
> +		spin_unlock(&pag->pag_buf_lock);
> +		return -ENOENT;
> +	}
>  	atomic_inc(&bp->b_hold);
> -	return bp;
> -}
> +	spin_unlock(&pag->pag_buf_lock);
>  
> -/*
> - * Insert the new_bp into the hash table. This consumes the perag reference
> - * taken for the lookup.
> - */
> -static int
> -xfs_buf_find_insert(
> -	struct xfs_buftarg	*btp,
> -	struct xfs_perag	*pag,
> -	struct xfs_buf		*new_bp)
> -{
> -	/* No match found */
> -	if (!new_bp) {
> -		xfs_perag_put(pag);
> -		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
> -		return -ENOENT;
> +	error = xfs_buf_find_lock(bp, flags);
> +	if (error) {
> +		xfs_buf_rele(bp);
> +		return error;
>  	}
>  
> -	/* the buffer keeps the perag reference until it is freed */
> -	new_bp->b_pag = pag;
> -	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
> -			       xfs_buf_hash_params);
> +	trace_xfs_buf_find(bp, flags, _RET_IP_);
> +	*bpp = bp;
>  	return 0;
>  }
>  
>  /*
> - * Look up a buffer in the buffer cache and return it referenced and locked
> - * in @found_bp.
> - *
> - * If @new_bp is supplied and we have a lookup miss, insert @new_bp into the
> - * cache.
> - *
> - * If XBF_TRYLOCK is set in @flags, only try to lock the buffer and return
> - * -EAGAIN if we fail to lock it.
> - *
> - * Return values are:
> - *	-EFSCORRUPTED if have been supplied with an invalid address
> - *	-EAGAIN on trylock failure
> - *	-ENOENT if we fail to find a match and @new_bp was NULL
> - *	0, with @found_bp:
> - *		- @new_bp if we inserted it into the cache
> - *		- the buffer we found and locked.
> + * Insert the new_bp into the hash table. This consumes the perag reference
> + * taken for the lookup regardless of the result of the insert.
>   */
>  static int
> -xfs_buf_find(
> +xfs_buf_find_insert(
>  	struct xfs_buftarg	*btp,
> +	struct xfs_perag	*pag,
> +	struct xfs_buf_map	*cmap,
>  	struct xfs_buf_map	*map,
>  	int			nmaps,
>  	xfs_buf_flags_t		flags,
> -	struct xfs_buf		*new_bp,
> -	struct xfs_buf		**found_bp)
> +	struct xfs_buf		**bpp)
>  {
> -	struct xfs_perag	*pag;
> +	struct xfs_buf		*new_bp;
>  	struct xfs_buf		*bp;
> -	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
>  	int			error;
> -	int			i;
> -
> -	*found_bp = NULL;
> -
> -	for (i = 0; i < nmaps; i++)
> -		cmap.bm_len += map[i].bm_len;
>  
> -	error = xfs_buf_map_verify(btp, &cmap);
> +	error = _xfs_buf_alloc(btp, map, nmaps, flags, &new_bp);
>  	if (error)
> -		return error;
> +		goto out_drop_pag;
>  
> -	pag = xfs_perag_get(btp->bt_mount,
> -			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
> +	/*
> +	 * For buffers that fit entirely within a single page, first attempt to
> +	 * allocate the memory from the heap to minimise memory usage. If we
> +	 * can't get heap memory for these small buffers, we fall back to using
> +	 * the page allocator.
> +	 */
> +	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
> +	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
> +		error = xfs_buf_alloc_pages(new_bp, flags);
> +		if (error)
> +			goto out_free_buf;
> +	}
>  
>  	spin_lock(&pag->pag_buf_lock);
> -	bp = xfs_buf_lookup(pag, &cmap);
> -	if (bp)
> -		goto found;
> +	bp = rhashtable_lookup(&pag->pag_buf_hash, cmap, xfs_buf_hash_params);
> +	if (bp) {
> +		atomic_inc(&bp->b_hold);
> +		spin_unlock(&pag->pag_buf_lock);
> +		error = xfs_buf_find_lock(bp, flags);
> +		if (error)
> +			xfs_buf_rele(bp);
> +		else
> +			*bpp = bp;
> +		goto out_free_buf;
> +	}
>  
> -	error = xfs_buf_find_insert(btp, pag, new_bp);
> +	/* The buffer keeps the perag reference until it is freed. */
> +	new_bp->b_pag = pag;
> +	rhashtable_insert_fast(&pag->pag_buf_hash, &new_bp->b_rhash_head,
> +			       xfs_buf_hash_params);
>  	spin_unlock(&pag->pag_buf_lock);
> -	if (error)
> -		return error;
> -	*found_bp = new_bp;
> +	*bpp = new_bp;
>  	return 0;
>  
> -found:
> -	spin_unlock(&pag->pag_buf_lock);
> +out_free_buf:
> +	xfs_buf_free(new_bp);
> +out_drop_pag:
>  	xfs_perag_put(pag);
> -
> -	error = xfs_buf_find_lock(btp, bp, flags);
> -	if (error)
> -		return error;
> -
> -	trace_xfs_buf_find(bp, flags, _RET_IP_);
> -	XFS_STATS_INC(btp->bt_mount, xb_get_locked);
> -	*found_bp = bp;
> -	return 0;
> +	return error;
>  }
>  
>  /*
> @@ -674,54 +656,54 @@ xfs_buf_find(
>   */
>  int
>  xfs_buf_get_map(
> -	struct xfs_buftarg	*target,
> +	struct xfs_buftarg	*btp,
>  	struct xfs_buf_map	*map,
>  	int			nmaps,
>  	xfs_buf_flags_t		flags,
>  	struct xfs_buf		**bpp)
>  {
> -	struct xfs_buf		*bp;
> -	struct xfs_buf		*new_bp;
> +	struct xfs_perag	*pag;
> +	struct xfs_buf		*bp = NULL;
> +	struct xfs_buf_map	cmap = { .bm_bn = map[0].bm_bn };
>  	int			error;
> +	int			i;
>  
> -	*bpp = NULL;
> -	error = xfs_buf_find(target, map, nmaps, flags, NULL, &bp);
> -	if (!error)
> -		goto found;
> -	if (error != -ENOENT)
> -		return error;
> -	if (flags & XBF_INCORE)
> -		return -ENOENT;
> +	for (i = 0; i < nmaps; i++)
> +		cmap.bm_len += map[i].bm_len;
>  
> -	error = _xfs_buf_alloc(target, map, nmaps, flags, &new_bp);
> +	error = xfs_buf_map_verify(btp, &cmap);
>  	if (error)
>  		return error;
>  
> -	/*
> -	 * For buffers that fit entirely within a single page, first attempt to
> -	 * allocate the memory from the heap to minimise memory usage. If we
> -	 * can't get heap memory for these small buffers, we fall back to using
> -	 * the page allocator.
> -	 */
> -	if (BBTOB(new_bp->b_length) >= PAGE_SIZE ||
> -	    xfs_buf_alloc_kmem(new_bp, flags) < 0) {
> -		error = xfs_buf_alloc_pages(new_bp, flags);
> -		if (error)
> -			goto out_free_buf;
> -	}
> +	pag = xfs_perag_get(btp->bt_mount,
> +			    xfs_daddr_to_agno(btp->bt_mount, cmap.bm_bn));
>  
> -	error = xfs_buf_find(target, map, nmaps, flags, new_bp, &bp);
> -	if (error)
> -		goto out_free_buf;
> +	error = xfs_buf_lookup(pag, &cmap, flags, &bp);
> +	if (error && error != -ENOENT)
> +		goto out_put_perag;
> +
> +	/* cache hits always outnumber misses by at least 10:1 */
> +	if (unlikely(!bp)) {
> +		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
>  
> -	if (bp != new_bp)
> -		xfs_buf_free(new_bp);
> +		if (flags & XBF_INCORE)
> +			goto out_put_perag;
>  
> -found:
> +		/* xfs_buf_find_insert() consumes the perag reference. */
> +		error = xfs_buf_find_insert(btp, pag, &cmap, map, nmaps,
> +				flags, &bp);
> +		if (error)
> +			return error;
> +	} else {
> +		XFS_STATS_INC(btp->bt_mount, xb_get_locked);
> +		xfs_perag_put(pag);
> +	}
> +
> +	/* We do not hold a perag reference anymore. */
>  	if (!bp->b_addr) {
>  		error = _xfs_buf_map_pages(bp, flags);
>  		if (unlikely(error)) {
> -			xfs_warn_ratelimited(target->bt_mount,
> +			xfs_warn_ratelimited(btp->bt_mount,
>  				"%s: failed to map %u pages", __func__,
>  				bp->b_page_count);
>  			xfs_buf_relse(bp);
> @@ -736,12 +718,13 @@ xfs_buf_get_map(
>  	if (!(flags & XBF_READ))
>  		xfs_buf_ioerror(bp, 0);
>  
> -	XFS_STATS_INC(target->bt_mount, xb_get);
> +	XFS_STATS_INC(btp->bt_mount, xb_get);
>  	trace_xfs_buf_get(bp, flags, _RET_IP_);
>  	*bpp = bp;
>  	return 0;
> -out_free_buf:
> -	xfs_buf_free(new_bp);
> +
> +out_put_perag:
> +	xfs_perag_put(pag);
>  	return error;
>  }
>  
> -- 
> 2.36.1
> 


* Re: [PATCH 6/6] xfs: lockless buffer lookup
  2022-07-07 23:52 ` [PATCH 6/6] xfs: lockless buffer lookup Dave Chinner
@ 2022-07-10  0:15   ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-10  0:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 08, 2022 at 09:52:59AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have a standalone fast path for buffer lookup, we can
> easily convert it to use rcu lookups. When we continually hammer the
> buffer cache with trylock lookups, we end up with a huge amount of
> lock contention on the per-ag buffer hash locks:
> 
> -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
>    - 92.67% xfs_inodegc_worker
>       - 92.13% xfs_inode_unlink
>          - 91.52% xfs_inactive_ifree
>             - 85.63% xfs_read_agi
>                - 85.61% xfs_trans_read_buf_map
>                   - 85.59% xfs_buf_read_map
>                      - xfs_buf_get_map
>                         - 85.55% xfs_buf_find
>                            - 72.87% _raw_spin_lock
>                               - do_raw_spin_lock
>                                    71.86% __pv_queued_spin_lock_slowpath
>                            - 8.74% xfs_buf_rele
>                               - 7.88% _raw_spin_lock
>                                  - 7.88% do_raw_spin_lock
>                                       7.63% __pv_queued_spin_lock_slowpath
>                            - 1.70% xfs_buf_trylock
>                               - 1.68% down_trylock
>                                  - 1.41% _raw_spin_lock_irqsave
>                                     - 1.39% do_raw_spin_lock
>                                          __pv_queued_spin_lock_slowpath
>                            - 0.76% _raw_spin_unlock
>                                 0.75% do_raw_spin_unlock
> 
> This is basically hammering the pag->pag_buf_lock from lots of CPUs
> doing trylocks at the same time. Most of the buffer trylock
> operations ultimately fail after we've done the lookup, so we're
> really hammering the buf hash lock whilst making no progress.
> 
> We can also see significant spinlock traffic on the same lock just
> under normal operation when lots of tasks are accessing metadata
> from the same AG, so let's avoid all this by converting the lookup
> fast path to leverage the rhashtable's ability to do rcu protected
> lookups.
> 
> We avoid races with the buffer release path by using
> atomic_inc_not_zero() on the buffer hold count. Any buffer that is
> in the LRU will have a non-zero count, thereby allowing the lockless
> fast path to be taken in most cache hit situations. If the buffer
> hold count is zero, then it is likely going through the release path
> so in that case we fall back to the existing lookup miss slow path.
> 
> The slow path will then do an atomic lookup and insert under the
> buffer hash lock and hence serialise correctly against buffer
> release freeing the buffer.
> 
> The use of rcu protected lookups means that buffer handles now need
> to be freed by RCU callbacks (same as inodes). We still free the
> buffer pages before the RCU callback - we won't be trying to access
> them at all on a buffer that has zero references - but we need the
> buffer handle itself to be present for the entire rcu protected read
> side to detect a zero hold count correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Looks good, will test....
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf.c | 22 +++++++++++++++-------
>  fs/xfs/xfs_buf.h |  1 +
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1a6542e01bec..6dac5583977f 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -294,6 +294,16 @@ xfs_buf_free_pages(
>  	bp->b_flags &= ~_XBF_PAGES;
>  }
>  
> +static void
> +xfs_buf_free_callback(
> +	struct callback_head	*cb)
> +{
> +	struct xfs_buf		*bp = container_of(cb, struct xfs_buf, b_rcu);
> +
> +	xfs_buf_free_maps(bp);
> +	kmem_cache_free(xfs_buf_cache, bp);
> +}
> +
>  static void
>  xfs_buf_free(
>  	struct xfs_buf		*bp)
> @@ -307,8 +317,7 @@ xfs_buf_free(
>  	else if (bp->b_flags & _XBF_KMEM)
>  		kmem_free(bp->b_addr);
>  
> -	xfs_buf_free_maps(bp);
> -	kmem_cache_free(xfs_buf_cache, bp);
> +	call_rcu(&bp->b_rcu, xfs_buf_free_callback);
>  }
>  
>  static int
> @@ -567,14 +576,13 @@ xfs_buf_lookup(
>  	struct xfs_buf          *bp;
>  	int			error;
>  
> -	spin_lock(&pag->pag_buf_lock);
> +	rcu_read_lock();
>  	bp = rhashtable_lookup(&pag->pag_buf_hash, map, xfs_buf_hash_params);
> -	if (!bp) {
> -		spin_unlock(&pag->pag_buf_lock);
> +	if (!bp || !atomic_inc_not_zero(&bp->b_hold)) {
> +		rcu_read_unlock();
>  		return -ENOENT;
>  	}
> -	atomic_inc(&bp->b_hold);
> -	spin_unlock(&pag->pag_buf_lock);
> +	rcu_read_unlock();
>  
>  	error = xfs_buf_find_lock(bp, flags);
>  	if (error) {
> diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> index 58e9034d51bd..02b3c1635ec3 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -196,6 +196,7 @@ struct xfs_buf {
>  	int			b_last_error;
>  
>  	const struct xfs_buf_ops	*b_ops;
> +	struct rcu_head		b_rcu;
>  };
>  
>  /* Finding and Reading Buffers */
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map()
  2022-07-07 23:52 ` [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map() Dave Chinner
  2022-07-10  0:15   ` Darrick J. Wong
@ 2022-07-11  5:14   ` Christoph Hellwig
  2022-07-12  0:01     ` Dave Chinner
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-07-11  5:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 08, 2022 at 09:52:56AM +1000, Dave Chinner wrote:
> index 91dc691f40a8..81ca951b451a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -531,18 +531,16 @@ xfs_buf_map_verify(
>  
>  static int
>  xfs_buf_find_lock(
> -	struct xfs_buftarg	*btp,
>  	struct xfs_buf          *bp,
>  	xfs_buf_flags_t		flags)
>  {
>  	if (!xfs_buf_trylock(bp)) {
>  		if (flags & XBF_TRYLOCK) {
> -			xfs_buf_rele(bp);
> -			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
> +			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
>  			return -EAGAIN;
>  		}
>  		xfs_buf_lock(bp);
> -		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
> +		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
>  	}
>  
>  	/*

Not doing this to start with in the previous patch still feels
rather odd.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map()
  2022-07-11  5:14   ` Christoph Hellwig
@ 2022-07-12  0:01     ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2022-07-12  0:01 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Sun, Jul 10, 2022 at 10:14:16PM -0700, Christoph Hellwig wrote:
> On Fri, Jul 08, 2022 at 09:52:56AM +1000, Dave Chinner wrote:
> > index 91dc691f40a8..81ca951b451a 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -531,18 +531,16 @@ xfs_buf_map_verify(
> >  
> >  static int
> >  xfs_buf_find_lock(
> > -	struct xfs_buftarg	*btp,
> >  	struct xfs_buf          *bp,
> >  	xfs_buf_flags_t		flags)
> >  {
> >  	if (!xfs_buf_trylock(bp)) {
> >  		if (flags & XBF_TRYLOCK) {
> > -			xfs_buf_rele(bp);
> > -			XFS_STATS_INC(btp->bt_mount, xb_busy_locked);
> > +			XFS_STATS_INC(bp->b_mount, xb_busy_locked);
> >  			return -EAGAIN;
> >  		}
> >  		xfs_buf_lock(bp);
> > -		XFS_STATS_INC(btp->bt_mount, xb_get_locked_waited);
> > +		XFS_STATS_INC(bp->b_mount, xb_get_locked_waited);
> >  	}
> >  
> >  	/*
> 
> Not doing this to start with in the previous patch still feels
> rather odd.

Oops, missed that. Will fix.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6 v3] xfs: lockless buffer lookups
  2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
                   ` (5 preceding siblings ...)
  2022-07-07 23:52 ` [PATCH 6/6] xfs: lockless buffer lookup Dave Chinner
@ 2022-07-13 17:01 ` Darrick J. Wong
  2022-07-13 17:03   ` Darrick J. Wong
  2022-07-14  1:32   ` Dave Chinner
  6 siblings, 2 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-13 17:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 08, 2022 at 09:52:53AM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Current work to merge the XFS inode life cycle with the VFS inode
> life cycle is finding some interesting issues. If we have a path
> that hits buffer trylocks fairly hard (e.g. a non-blocking
> background inode freeing function), we end up hitting massive
> contention on the buffer cache hash locks:

Hmm.  I applied this to a test branch and this fell out of xfs/436 when
it runs rmmod xfs.  I'll see if I can reproduce it more regularly, but
thought I'd put this out there early...

XFS (sda3): Unmounting Filesystem
=============================================================================
BUG xfs_buf (Not tainted): Objects remaining in xfs_buf on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

Slab 0xffffea000443b780 objects=18 used=4 fp=0xffff888110edf340 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
CPU: 3 PID: 30378 Comm: modprobe Not tainted 5.19.0-rc5-djwx #rc5 bebda13a030d0898279476b6652ddea67c2060cc
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 slab_err+0x95/0xc9
 __kmem_cache_shutdown.cold+0x39/0x1e9
 kmem_cache_destroy+0x49/0x130
 exit_xfs_fs+0x50/0xc57 [xfs 370e1c994a59de083c05cd4df389f629878b8122]
 __do_sys_delete_module.constprop.0+0x145/0x220
 ? exit_to_user_mode_prepare+0x6c/0x100
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fe7d7877c9b
Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007fffb911cab8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000555a217adcc0 RCX: 00007fe7d7877c9b
RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000555a217add28
RBP: 0000555a217adcc0 R08: 0000000000000000 R09: 0000000000000000
R10: 00007fe7d790fac0 R11: 0000000000000206 R12: 0000555a217add28
R13: 0000000000000000 R14: 0000555a217add28 R15: 00007fffb911ede8
 </TASK>
Disabling lock debugging due to kernel taint
Object 0xffff888110ede000 @offset=0
Object 0xffff888110ede1c0 @offset=448
Object 0xffff888110edefc0 @offset=4032
Object 0xffff888110edf6c0 @offset=5824

--D

> -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
>    - 92.67% xfs_inodegc_worker
>       - 92.13% xfs_inode_unlink
>          - 91.52% xfs_inactive_ifree
>             - 85.63% xfs_read_agi
>                - 85.61% xfs_trans_read_buf_map
>                   - 85.59% xfs_buf_read_map
>                      - xfs_buf_get_map
>                         - 85.55% xfs_buf_find
>                            - 72.87% _raw_spin_lock
>                               - do_raw_spin_lock
>                                    71.86% __pv_queued_spin_lock_slowpath
>                            - 8.74% xfs_buf_rele
>                               - 7.88% _raw_spin_lock
>                                  - 7.88% do_raw_spin_lock
>                                       7.63% __pv_queued_spin_lock_slowpath
>                            - 1.70% xfs_buf_trylock
>                               - 1.68% down_trylock
>                                  - 1.41% _raw_spin_lock_irqsave
>                                     - 1.39% do_raw_spin_lock
>                                          __pv_queued_spin_lock_slowpath
>                            - 0.76% _raw_spin_unlock
>                                 0.75% do_raw_spin_unlock
> 
> This is basically hammering the pag->pag_buf_lock from lots of CPUs
> doing trylocks at the same time. Most of the buffer trylock
> operations ultimately fail after we've done the lookup, so we're
> really hammering the buf hash lock whilst making no progress.
> 
> We can also see significant spinlock traffic on the same lock just
> under normal operation when lots of tasks are accessing metadata
> from the same AG, so let's avoid all this by creating a lookup fast
> path which leverages the rhashtable's ability to do rcu protected
> lookups.
> 
> This is a rework of the initial lockless buffer lookup patch I sent
> here:
> 
> https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/
> 
> And the alternative cleanup sent by Christoph here:
> 
> https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/
> 
> This version isn't quite as short as Christoph's, but it does roughly
> the same thing in killing the two-phase _xfs_buf_find() call
> mechanism. It separates the fast and slow paths a little more
> cleanly and doesn't have context dependent buffer return state from
> the slow path that the caller needs to handle. It also picks up the
> rhashtable insert optimisation that Christoph added.
> 
> This series passes fstests under several different configs and does
> not cause any obvious regressions in scalability testing that has
> been performed. Hence I'm proposing this as potential 5.20 cycle
> material.
> 
> Thoughts, comments?
> 
> Version 3:
> - rebased onto linux-xfs/for-next
> - rearranged some of the changes to avoid repeated shuffling of code
>   to different locations
> - fixed typos in commits
> - s/xfs_buf_find_verify/xfs_buf_map_verify/
> - s/xfs_buf_find_fast/xfs_buf_lookup/
> 
> Version 2:
> - https://lore.kernel.org/linux-xfs/20220627060841.244226-1-david@fromorbit.com/
> - based on 5.19-rc2
> - high speed collision of original proposals.
> 
> Initial versions:
> - https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/
> - https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/
> 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6 v3] xfs: lockless buffer lookups
  2022-07-13 17:01 ` [PATCH 0/6 v3] xfs: lockless buffer lookups Darrick J. Wong
@ 2022-07-13 17:03   ` Darrick J. Wong
  2022-07-14  1:32   ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-13 17:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 13, 2022 at 10:01:15AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 08, 2022 at 09:52:53AM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > Current work to merge the XFS inode life cycle with the VFS inode
> > life cycle is finding some interesting issues. If we have a path
> > that hits buffer trylocks fairly hard (e.g. a non-blocking
> > background inode freeing function), we end up hitting massive
> > contention on the buffer cache hash locks:
> 
> Hmm.  I applied this to a test branch and this fell out of xfs/436 when
> it runs rmmod xfs.  I'll see if I can reproduce it more regularly, but
> thought I'd put this out there early...

...and I should have mentioned that this VM was running with
MKFS_OPTIONS='-i nrext64=1 -d rmapbt=1' and always_cow turned on.

--D

> XFS (sda3): Unmounting Filesystem
> =============================================================================
> BUG xfs_buf (Not tainted): Objects remaining in xfs_buf on __kmem_cache_shutdown()
> -----------------------------------------------------------------------------
> 
> Slab 0xffffea000443b780 objects=18 used=4 fp=0xffff888110edf340 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
> CPU: 3 PID: 30378 Comm: modprobe Not tainted 5.19.0-rc5-djwx #rc5 bebda13a030d0898279476b6652ddea67c2060cc
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x34/0x44
>  slab_err+0x95/0xc9
>  __kmem_cache_shutdown.cold+0x39/0x1e9
>  kmem_cache_destroy+0x49/0x130
>  exit_xfs_fs+0x50/0xc57 [xfs 370e1c994a59de083c05cd4df389f629878b8122]
>  __do_sys_delete_module.constprop.0+0x145/0x220
>  ? exit_to_user_mode_prepare+0x6c/0x100
>  do_syscall_64+0x35/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fe7d7877c9b
> Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
> RSP: 002b:00007fffb911cab8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> RAX: ffffffffffffffda RBX: 0000555a217adcc0 RCX: 00007fe7d7877c9b
> RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000555a217add28
> RBP: 0000555a217adcc0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00007fe7d790fac0 R11: 0000000000000206 R12: 0000555a217add28
> R13: 0000000000000000 R14: 0000555a217add28 R15: 00007fffb911ede8
>  </TASK>
> Disabling lock debugging due to kernel taint
> Object 0xffff888110ede000 @offset=0
> Object 0xffff888110ede1c0 @offset=448
> Object 0xffff888110edefc0 @offset=4032
> Object 0xffff888110edf6c0 @offset=5824
> 
> --D
> 
> > -   92.71%     0.05%  [kernel]                  [k] xfs_inodegc_worker
> >    - 92.67% xfs_inodegc_worker
> >       - 92.13% xfs_inode_unlink
> >          - 91.52% xfs_inactive_ifree
> >             - 85.63% xfs_read_agi
> >                - 85.61% xfs_trans_read_buf_map
> >                   - 85.59% xfs_buf_read_map
> >                      - xfs_buf_get_map
> >                         - 85.55% xfs_buf_find
> >                            - 72.87% _raw_spin_lock
> >                               - do_raw_spin_lock
> >                                    71.86% __pv_queued_spin_lock_slowpath
> >                            - 8.74% xfs_buf_rele
> >                               - 7.88% _raw_spin_lock
> >                                  - 7.88% do_raw_spin_lock
> >                                       7.63% __pv_queued_spin_lock_slowpath
> >                            - 1.70% xfs_buf_trylock
> >                               - 1.68% down_trylock
> >                                  - 1.41% _raw_spin_lock_irqsave
> >                                     - 1.39% do_raw_spin_lock
> >                                          __pv_queued_spin_lock_slowpath
> >                            - 0.76% _raw_spin_unlock
> >                                 0.75% do_raw_spin_unlock
> > 
> > This is basically hammering the pag->pag_buf_lock from lots of CPUs
> > doing trylocks at the same time. Most of the buffer trylock
> > operations ultimately fail after we've done the lookup, so we're
> > really hammering the buf hash lock whilst making no progress.
> > 
> > We can also see significant spinlock traffic on the same lock just
> > under normal operation when lots of tasks are accessing metadata
> > from the same AG, so let's avoid all this by creating a lookup fast
> > path which leverages the rhashtable's ability to do rcu protected
> > lookups.
> > 
> > This is a rework of the initial lockless buffer lookup patch I sent
> > here:
> > 
> > https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/
> > 
> > And the alternative cleanup sent by Christoph here:
> > 
> > https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/
> > 
> > This version isn't quite as short as Christoph's, but it does roughly
> > the same thing in killing the two-phase _xfs_buf_find() call
> > mechanism. It separates the fast and slow paths a little more
> > cleanly and doesn't have context dependent buffer return state from
> > the slow path that the caller needs to handle. It also picks up the
> > rhashtable insert optimisation that Christoph added.
> > 
> > This series passes fstests under several different configs and does
> > not cause any obvious regressions in scalability testing that has
> > been performed. Hence I'm proposing this as potential 5.20 cycle
> > material.
> > 
> > Thoughts, comments?
> > 
> > Version 3:
> > - rebased onto linux-xfs/for-next
> > - rearranged some of the changes to avoid repeated shuffling of code
> >   to different locations
> > - fixed typos in commits
> > - s/xfs_buf_find_verify/xfs_buf_map_verify/
> > - s/xfs_buf_find_fast/xfs_buf_lookup/
> > 
> > Version 2:
> > - https://lore.kernel.org/linux-xfs/20220627060841.244226-1-david@fromorbit.com/
> > - based on 5.19-rc2
> > - high speed collision of original proposals.
> > 
> > Initial versions:
> > - https://lore.kernel.org/linux-xfs/20220403120119.235457-1-hch@lst.de/
> > - https://lore.kernel.org/linux-xfs/20220328213810.1174688-1-david@fromorbit.com/
> > 
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6 v3] xfs: lockless buffer lookups
  2022-07-13 17:01 ` [PATCH 0/6 v3] xfs: lockless buffer lookups Darrick J. Wong
  2022-07-13 17:03   ` Darrick J. Wong
@ 2022-07-14  1:32   ` Dave Chinner
  2022-07-14  2:11     ` Darrick J. Wong
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2022-07-14  1:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jul 13, 2022 at 10:01:15AM -0700, Darrick J. Wong wrote:
> On Fri, Jul 08, 2022 at 09:52:53AM +1000, Dave Chinner wrote:
> > Hi folks,
> > 
> > Current work to merge the XFS inode life cycle with the VFS inode
> > life cycle is finding some interesting issues. If we have a path
> > that hits buffer trylocks fairly hard (e.g. a non-blocking
> > background inode freeing function), we end up hitting massive
> > contention on the buffer cache hash locks:
> 
> Hmm.  I applied this to a test branch and this fell out of xfs/436 when
> it runs rmmod xfs.  I'll see if I can reproduce it more regularly, but
> thought I'd put this out there early...
> 
> XFS (sda3): Unmounting Filesystem
> =============================================================================
> BUG xfs_buf (Not tainted): Objects remaining in xfs_buf on __kmem_cache_shutdown()
> -----------------------------------------------------------------------------
> 
> Slab 0xffffea000443b780 objects=18 used=4 fp=0xffff888110edf340 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
> CPU: 3 PID: 30378 Comm: modprobe Not tainted 5.19.0-rc5-djwx #rc5 bebda13a030d0898279476b6652ddea67c2060cc
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x34/0x44
>  slab_err+0x95/0xc9
>  __kmem_cache_shutdown.cold+0x39/0x1e9
>  kmem_cache_destroy+0x49/0x130
>  exit_xfs_fs+0x50/0xc57 [xfs 370e1c994a59de083c05cd4df389f629878b8122]
>  __do_sys_delete_module.constprop.0+0x145/0x220
>  ? exit_to_user_mode_prepare+0x6c/0x100
>  do_syscall_64+0x35/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fe7d7877c9b
> Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
> RSP: 002b:00007fffb911cab8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> RAX: ffffffffffffffda RBX: 0000555a217adcc0 RCX: 00007fe7d7877c9b
> RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000555a217add28
> RBP: 0000555a217adcc0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00007fe7d790fac0 R11: 0000000000000206 R12: 0000555a217add28
> R13: 0000000000000000 R14: 0000555a217add28 R15: 00007fffb911ede8
>  </TASK>
> Disabling lock debugging due to kernel taint
> Object 0xffff888110ede000 @offset=0
> Object 0xffff888110ede1c0 @offset=448
> Object 0xffff888110edefc0 @offset=4032
> Object 0xffff888110edf6c0 @offset=5824

Curious, I haven't seen anything from KASAN that would indicate a
leak is occurring, and unmount can't occur while there are still
referenced buffers in the system. So what might be leaking is a bit
of a mystery to me right now...
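
(One possibility worth keeping in mind here, offered as an assumption
rather than anything established in this thread: with buffers now
freed through call_rcu(), objects can still be sitting in not-yet-run
RCU callbacks when module unload reaches kmem_cache_destroy(), unless
the teardown path waits for them first. A hedged sketch of that
ordering, with hypothetical names:)

static void __exit example_exit(void)
{
	/* ... unregister the filesystem, drain workqueues, etc ... */

	/* Wait for all pending call_rcu() frees to complete. */
	rcu_barrier();

	/* Only now is the cache guaranteed empty (obj_cache is hypothetical). */
	kmem_cache_destroy(obj_cache);
}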

Is this a result of xfs/436 running by itself, or left over from
some other prior test? i.e. if you add a '_reload_fs_module "xfs"'
call before the test does anything, does it complain?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/6 v3] xfs: lockless buffer lookups
  2022-07-14  1:32   ` Dave Chinner
@ 2022-07-14  2:11     ` Darrick J. Wong
  0 siblings, 0 replies; 16+ messages in thread
From: Darrick J. Wong @ 2022-07-14  2:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 14, 2022 at 11:32:01AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2022 at 10:01:15AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 08, 2022 at 09:52:53AM +1000, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Current work to merge the XFS inode life cycle with the VFS inode
> > > life cycle is finding some interesting issues. If we have a path
> > > that hits buffer trylocks fairly hard (e.g. a non-blocking
> > > background inode freeing function), we end up hitting massive
> > > contention on the buffer cache hash locks:
> > 
> > Hmm.  I applied this to a test branch and this fell out of xfs/436 when
> > it runs rmmod xfs.  I'll see if I can reproduce it more regularly, but
> > thought I'd put this out there early...
> > 
> > XFS (sda3): Unmounting Filesystem
> > =============================================================================
> > BUG xfs_buf (Not tainted): Objects remaining in xfs_buf on __kmem_cache_shutdown()
> > -----------------------------------------------------------------------------
> > 
> > Slab 0xffffea000443b780 objects=18 used=4 fp=0xffff888110edf340 flags=0x17ff80000010200(slab|head|node=0|zone=2|lastcpupid=0xfff)
> > CPU: 3 PID: 30378 Comm: modprobe Not tainted 5.19.0-rc5-djwx #rc5 bebda13a030d0898279476b6652ddea67c2060cc
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20171121_152543-x86-ol7-builder-01.us.oracle.com-4.el7.1 04/01/2014
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0x34/0x44
> >  slab_err+0x95/0xc9
> >  __kmem_cache_shutdown.cold+0x39/0x1e9
> >  kmem_cache_destroy+0x49/0x130
> >  exit_xfs_fs+0x50/0xc57 [xfs 370e1c994a59de083c05cd4df389f629878b8122]
> >  __do_sys_delete_module.constprop.0+0x145/0x220
> >  ? exit_to_user_mode_prepare+0x6c/0x100
> >  do_syscall_64+0x35/0x80
> >  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > RIP: 0033:0x7fe7d7877c9b
> > Code: 73 01 c3 48 8b 0d 95 21 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 65 21 0f 00 f7 d8 64 89 01 48
> > RSP: 002b:00007fffb911cab8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
> > RAX: ffffffffffffffda RBX: 0000555a217adcc0 RCX: 00007fe7d7877c9b
> > RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000555a217add28
> > RBP: 0000555a217adcc0 R08: 0000000000000000 R09: 0000000000000000
> > R10: 00007fe7d790fac0 R11: 0000000000000206 R12: 0000555a217add28
> > R13: 0000000000000000 R14: 0000555a217add28 R15: 00007fffb911ede8
> >  </TASK>
> > Disabling lock debugging due to kernel taint
> > Object 0xffff888110ede000 @offset=0
> > Object 0xffff888110ede1c0 @offset=448
> > Object 0xffff888110edefc0 @offset=4032
> > Object 0xffff888110edf6c0 @offset=5824
> 
> Curious, I haven't seen anything from KASAN that would indicate a
> leak is occurring, and unmount can't occur while there are still
> referenced buffers in the system. So what might be leaking is a bit
> of a mystery to me right now...
> 
> Is this a result of xfs/436 running by itself, or left over from
> some other prior test? i.e. if you add a '_reload_fs_module "xfs"'
> call before the test does anything, does it complain?

Still digging into that.  I ran ./check -I 100 xfs/434 xfs/436 and
couldn't reproduce it, so I'll have to dig further.  You might as well
push the patchset along since that's the only time this has happened
despite several days and dozens of VMs testing this.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-07-14  2:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07 23:52 [PATCH 0/6 v3] xfs: lockless buffer lookups Dave Chinner
2022-07-07 23:52 ` [PATCH 1/6] xfs: rework xfs_buf_incore() API Dave Chinner
2022-07-07 23:52 ` [PATCH 2/6] xfs: break up xfs_buf_find() into individual pieces Dave Chinner
2022-07-09 22:58   ` Darrick J. Wong
2022-07-07 23:52 ` [PATCH 3/6] xfs: merge xfs_buf_find() and xfs_buf_get_map() Dave Chinner
2022-07-10  0:15   ` Darrick J. Wong
2022-07-11  5:14   ` Christoph Hellwig
2022-07-12  0:01     ` Dave Chinner
2022-07-07 23:52 ` [PATCH 4/6] xfs: reduce the number of atomic when locking a buffer after lookup Dave Chinner
2022-07-07 23:52 ` [PATCH 5/6] xfs: remove a superflous hash lookup when inserting new buffers Dave Chinner
2022-07-07 23:52 ` [PATCH 6/6] xfs: lockless buffer lookup Dave Chinner
2022-07-10  0:15   ` Darrick J. Wong
2022-07-13 17:01 ` [PATCH 0/6 v3] xfs: lockless buffer lookups Darrick J. Wong
2022-07-13 17:03   ` Darrick J. Wong
2022-07-14  1:32   ` Dave Chinner
2022-07-14  2:11     ` Darrick J. Wong
