* [PATCH 0/2] XFS buffer cache scalability improvements
From: Lucas Stach @ 2016-10-18 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi all,

this series scratches my own small itch with XFS, namely scalability of the buffer
cache in metadata-intensive workloads. With a large number of cached buffers those
workloads are CPU-bound, with a significant amount of time spent searching the cache.

The first commit replaces the rbtree used to index the cache with an rhashtable. The
rbtree is a bottleneck in scalability, as the data structure itself is pretty CPU
cache unfriendly. For larger numbers of cached buffers over 80% of the CPU time
is spent waiting on cache misses resulting from the inherent pointer chasing.

rhashtables provide a fast lookup with the ability to have lookups proceed while the
hashtable is being resized. This seems to match the read dominated workload of the
buffer cache index structure pretty well.

The second patch is a logical follow-up. The rhashtable cache index is protected by
RCU and does not need any additional locking. By switching the buffer cache entries
over to RCU freeing the buffer cache can be operated in a completely lock-free
manner. This should help scalability in the long run.

This series survives at least an xfstests auto group run (though with the scratch
device being a ramdisk) with no regressions and didn't show any problems in my
real world testing (using the patched FS with multiple large git trees) so far.

Regards,
Lucas

Lucas Stach (2):
  xfs: use rhashtable to track buffer cache
  xfs: switch buffer cache entries to RCU freeing

 fs/xfs/xfs_buf.c   | 155 ++++++++++++++++++++++++++++++++++-------------------
 fs/xfs/xfs_buf.h   |   4 +-
 fs/xfs/xfs_linux.h |   1 +
 fs/xfs/xfs_mount.c |   7 ++-
 fs/xfs/xfs_mount.h |   6 ++-
 5 files changed, 114 insertions(+), 59 deletions(-)

-- 
2.7.4



* [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Lucas Stach @ 2016-10-18 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On filesystems with a lot of metadata and in metadata-intensive workloads,
xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
the CPU time is spent on CPU cache misses while traversing the rbtree.

As the buffer cache does not need any kind of ordering, but only fast
lookups, a hashtable is the natural data structure to use. The rhashtable
infrastructure provides a self-scaling hashtable implementation and
allows lookups to proceed while the table is going through a resize
operation.

This reduces the CPU time spent on the lookups to one third, even for small
filesystems with a relatively small number of cached buffers, with
possibly much larger gains on more heavily loaded filesystems.

The minimum size of 4096 buckets was chosen as it was the size of the
xfs buffer cache hash before it was converted to an rbtree.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
---
 fs/xfs/xfs_buf.c   | 118 ++++++++++++++++++++++++++++++++++-------------------
 fs/xfs/xfs_buf.h   |   2 +-
 fs/xfs/xfs_linux.h |   1 +
 fs/xfs/xfs_mount.c |   7 +++-
 fs/xfs/xfs_mount.h |   7 +++-
 5 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b5b9bff..50c5b01 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -219,7 +219,6 @@ _xfs_buf_alloc(
 	init_completion(&bp->b_iowait);
 	INIT_LIST_HEAD(&bp->b_lru);
 	INIT_LIST_HEAD(&bp->b_list);
-	RB_CLEAR_NODE(&bp->b_rbnode);
 	sema_init(&bp->b_sema, 0); /* held, no waiters */
 	spin_lock_init(&bp->b_lock);
 	XB_SET_OWNER(bp);
@@ -473,7 +472,66 @@ _xfs_buf_map_pages(
 /*
  *	Finding and Reading Buffers
  */
+struct xfs_buf_cmp_arg {
+	xfs_daddr_t	blkno;
+	int		numblks;
+};
+int
+_xfs_buf_cmp(
+	struct rhashtable_compare_arg *arg,
+	const void *obj)
+{
+	const struct xfs_buf_cmp_arg *cmp_arg = arg->key;
+	const struct xfs_buf *bp = obj;
+
+	/*
+	 * The key hashing in the lookup path depends on the key being the
+	 * first element of the compare_arg, make sure to assert this.
+	 */
+	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) != 0);
+
+	if (bp->b_bn == cmp_arg->blkno) {
+		if (unlikely(bp->b_length != cmp_arg->numblks)) {
+			/*
+			 * found a block number match. If the range doesn't
+			 * match, the only way this is allowed is if the buffer
+			 * in the cache is stale and the transaction that made
+			 * it stale has not yet committed. i.e. we are
+			 * reallocating a busy extent. Skip this buffer and
+			 * continue searching for an exact match.
+			 */
+			ASSERT(bp->b_flags & XBF_STALE);
+			return 1;
+		}
+
+		return 0;
+	}
+
+	return 1;
+}
+static const struct rhashtable_params xfs_buf_hash_params = {
+	.min_size = 4096,
+	.nelem_hint = 3072,
+	.key_len = sizeof(xfs_daddr_t),
+	.key_offset = offsetof(struct xfs_buf, b_bn),
+	.head_offset = offsetof(struct xfs_buf, b_rhash_head),
+	.automatic_shrinking = true,
+	.obj_cmpfn = _xfs_buf_cmp,
+};
+
+int xfs_buf_hash_init(
+	struct xfs_perag *pag)
+{
+	spin_lock_init(&pag->pag_buf_lock);
+	return rhashtable_init(&pag->pag_buf_hash, &xfs_buf_hash_params);
+}
 
+void
+xfs_buf_hash_destroy(
+	struct xfs_perag *pag)
+{
+	rhashtable_destroy(&pag->pag_buf_hash);
+}
 /*
  *	Look up, and creates if absent, a lockable buffer for
  *	a given range of an inode.  The buffer is returned
@@ -488,16 +546,13 @@ _xfs_buf_find(
 	xfs_buf_t		*new_bp)
 {
 	struct xfs_perag	*pag;
-	struct rb_node		**rbp;
-	struct rb_node		*parent;
 	xfs_buf_t		*bp;
-	xfs_daddr_t		blkno = map[0].bm_bn;
+	struct xfs_buf_cmp_arg	cmp_arg = { .blkno = map[0].bm_bn };
 	xfs_daddr_t		eofs;
-	int			numblks = 0;
 	int			i;
 
 	for (i = 0; i < nmaps; i++)
-		numblks += map[i].bm_len;
+		cmp_arg.numblks += map[i].bm_len;
 
 	/* Check for IOs smaller than the sector size / not sector aligned */
 	ASSERT(!(BBTOB(numblks) < btp->bt_meta_sectorsize));
@@ -508,7 +563,7 @@ _xfs_buf_find(
 	 * have to check that the buffer falls within the filesystem bounds.
 	 */
 	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
-	if (blkno < 0 || blkno >= eofs) {
+	if (cmp_arg.blkno < 0 || cmp_arg.blkno >= eofs) {
 		/*
 		 * XXX (dgc): we should really be returning -EFSCORRUPTED here,
 		 * but none of the higher level infrastructure supports
@@ -516,53 +571,31 @@ _xfs_buf_find(
 		 */
 		xfs_alert(btp->bt_mount,
 			  "%s: Block out of range: block 0x%llx, EOFS 0x%llx ",
-			  __func__, blkno, eofs);
+			  __func__, cmp_arg.blkno, eofs);
 		WARN_ON(1);
 		return NULL;
 	}
 
-	/* get tree root */
+	/* get pag */
 	pag = xfs_perag_get(btp->bt_mount,
-				xfs_daddr_to_agno(btp->bt_mount, blkno));
+			    xfs_daddr_to_agno(btp->bt_mount, cmp_arg.blkno));
 
-	/* walk tree */
+	/* lookup buf in pag hash */
 	spin_lock(&pag->pag_buf_lock);
-	rbp = &pag->pag_buf_tree.rb_node;
-	parent = NULL;
-	bp = NULL;
-	while (*rbp) {
-		parent = *rbp;
-		bp = rb_entry(parent, struct xfs_buf, b_rbnode);
-
-		if (blkno < bp->b_bn)
-			rbp = &(*rbp)->rb_left;
-		else if (blkno > bp->b_bn)
-			rbp = &(*rbp)->rb_right;
-		else {
-			/*
-			 * found a block number match. If the range doesn't
-			 * match, the only way this is allowed is if the buffer
-			 * in the cache is stale and the transaction that made
-			 * it stale has not yet committed. i.e. we are
-			 * reallocating a busy extent. Skip this buffer and
-			 * continue searching to the right for an exact match.
-			 */
-			if (bp->b_length != numblks) {
-				ASSERT(bp->b_flags & XBF_STALE);
-				rbp = &(*rbp)->rb_right;
-				continue;
-			}
-			atomic_inc(&bp->b_hold);
-			goto found;
-		}
+	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
+				    xfs_buf_hash_params);
+	if (bp) {
+		atomic_inc(&bp->b_hold);
+		goto found;
 	}
 
 	/* No match found */
 	if (new_bp) {
-		rb_link_node(&new_bp->b_rbnode, parent, rbp);
-		rb_insert_color(&new_bp->b_rbnode, &pag->pag_buf_tree);
 		/* the buffer keeps the perag reference until it is freed */
 		new_bp->b_pag = pag;
+		rhashtable_insert_fast(&pag->pag_buf_hash,
+				       &new_bp->b_rhash_head,
+				       xfs_buf_hash_params);
 		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
@@ -983,7 +1016,8 @@ xfs_buf_rele(
 		}
 
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
-		rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
+		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
+				       xfs_buf_hash_params);
 		spin_unlock(&pag->pag_buf_lock);
 		xfs_perag_put(pag);
 		freebuf = true;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 1c2e52b..37943ac 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -150,7 +150,7 @@ typedef struct xfs_buf {
 	 * which is the only bit that is touched if we hit the semaphore
 	 * fast-path on locking.
 	 */
-	struct rb_node		b_rbnode;	/* rbtree node */
+	struct rhash_head	b_rhash_head;	/* pag buffer hash node */
 	xfs_daddr_t		b_bn;		/* block number of buffer */
 	int			b_length;	/* size of buffer in BBs */
 	atomic_t		b_hold;		/* reference count */
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 68640fb..a415f82 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -78,6 +78,7 @@ typedef __u32			xfs_nlink_t;
 #include <linux/freezer.h>
 #include <linux/list_sort.h>
 #include <linux/ratelimit.h>
+#include <linux/rhashtable.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fc78739..b9a9a58 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -157,6 +157,7 @@ xfs_free_perag(
 		spin_unlock(&mp->m_perag_lock);
 		ASSERT(pag);
 		ASSERT(atomic_read(&pag->pag_ref) == 0);
+		xfs_buf_hash_destroy(pag);
 		call_rcu(&pag->rcu_head, __xfs_free_perag);
 	}
 }
@@ -212,8 +213,8 @@ xfs_initialize_perag(
 		spin_lock_init(&pag->pag_ici_lock);
 		mutex_init(&pag->pag_ici_reclaim_lock);
 		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
-		spin_lock_init(&pag->pag_buf_lock);
-		pag->pag_buf_tree = RB_ROOT;
+		if (xfs_buf_hash_init(pag))
+			goto out_unwind;
 
 		if (radix_tree_preload(GFP_NOFS))
 			goto out_unwind;
@@ -239,9 +240,11 @@ xfs_initialize_perag(
 	return 0;
 
 out_unwind:
+	xfs_buf_hash_destroy(pag);
 	kmem_free(pag);
 	for (; index > first_initialised; index--) {
 		pag = radix_tree_delete(&mp->m_perag_tree, index);
+		xfs_buf_hash_destroy(pag);
 		kmem_free(pag);
 	}
 	return error;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 819b80b..84f7852 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -393,8 +393,8 @@ typedef struct xfs_perag {
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-	spinlock_t	pag_buf_lock;	/* lock for pag_buf_tree */
-	struct rb_root	pag_buf_tree;	/* ordered tree of active buffers */
+	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
+	struct rhashtable pag_buf_hash;
 
 	/* for rcu-safe freeing */
 	struct rcu_head	rcu_head;
@@ -424,6 +424,9 @@ xfs_perag_resv(
 	}
 }
 
+int xfs_buf_hash_init(xfs_perag_t *pag);
+void xfs_buf_hash_destroy(xfs_perag_t *pag);
+
 extern void	xfs_uuid_table_free(void);
 extern int	xfs_log_sbcount(xfs_mount_t *);
 extern __uint64_t xfs_default_resblks(xfs_mount_t *mp);
-- 
2.7.4



* [PATCH 2/2] xfs: switch buffer cache entries to RCU freeing
From: Lucas Stach @ 2016-10-18 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

The buffer cache hash, being the indexing data structure into the buffer
cache, is already protected by RCU. By also freeing the entries themselves
via RCU we can get rid of the buffer cache lock altogether.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
---
 fs/xfs/xfs_buf.c   | 41 +++++++++++++++++++++++++++--------------
 fs/xfs/xfs_buf.h   |  2 ++
 fs/xfs/xfs_mount.h |  1 -
 3 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 50c5b01..3ee0a3d1 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -326,6 +326,16 @@ xfs_buf_free(
 	kmem_zone_free(xfs_buf_zone, bp);
 }
 
+STATIC void
+__xfs_buf_rcu_free(
+	struct rcu_head	*head)
+{
+	xfs_buf_t *bp = container_of(head, struct xfs_buf, rcu_head);
+
+	ASSERT(atomic_read(&bp->b_hold) == 0);
+	xfs_buf_free(bp);
+}
+
 /*
  * Allocates all the pages for buffer in question and builds it's page list.
  */
@@ -491,6 +501,14 @@ _xfs_buf_cmp(
 	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) != 0);
 
 	if (bp->b_bn == cmp_arg->blkno) {
+		/*
+		 * Skip matches with a hold count of zero, as they are about to
+		 * be freed by RCU. Continue searching as another valid entry
+		 * might have already been inserted into the hash.
+		 */
+		if (unlikely(atomic_read(&bp->b_hold) == 0))
+			return 1;
+
 		if (unlikely(bp->b_length != cmp_arg->numblks)) {
 			/*
 			 * found a block number match. If the range doesn't
@@ -522,7 +540,6 @@ static const struct rhashtable_params xfs_buf_hash_params = {
 int xfs_buf_hash_init(
 	struct xfs_perag *pag)
 {
-	spin_lock_init(&pag->pag_buf_lock);
 	return rhashtable_init(&pag->pag_buf_hash, &xfs_buf_hash_params);
 }
 
@@ -580,14 +597,16 @@ _xfs_buf_find(
 	pag = xfs_perag_get(btp->bt_mount,
 			    xfs_daddr_to_agno(btp->bt_mount, cmp_arg.blkno));
 
+	rcu_read_lock();
 	/* lookup buf in pag hash */
-	spin_lock(&pag->pag_buf_lock);
 	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
 				    xfs_buf_hash_params);
-	if (bp) {
-		atomic_inc(&bp->b_hold);
+
+	/* if the hold count is zero the buffer is about to be freed by RCU */
+	if (bp && atomic_inc_not_zero(&bp->b_hold))
 		goto found;
-	}
+
+	rcu_read_unlock();
 
 	/* No match found */
 	if (new_bp) {
@@ -596,16 +615,14 @@ _xfs_buf_find(
 		rhashtable_insert_fast(&pag->pag_buf_hash,
 				       &new_bp->b_rhash_head,
 				       xfs_buf_hash_params);
-		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
-		spin_unlock(&pag->pag_buf_lock);
 		xfs_perag_put(pag);
 	}
 	return new_bp;
 
 found:
-	spin_unlock(&pag->pag_buf_lock);
+	rcu_read_unlock();
 	xfs_perag_put(pag);
 
 	if (!xfs_buf_trylock(bp)) {
@@ -956,7 +973,6 @@ xfs_buf_rele(
 	xfs_buf_t		*bp)
 {
 	struct xfs_perag	*pag = bp->b_pag;
-	bool			release;
 	bool			freebuf = false;
 
 	trace_xfs_buf_rele(bp, _RET_IP_);
@@ -975,9 +991,8 @@ xfs_buf_rele(
 
 	ASSERT(atomic_read(&bp->b_hold) > 0);
 
-	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
 	spin_lock(&bp->b_lock);
-	if (!release) {
+	if (!atomic_dec_and_test(&bp->b_hold)) {
 		/*
 		 * Drop the in-flight state if the buffer is already on the LRU
 		 * and it holds the only reference. This is racy because we
@@ -1001,7 +1016,6 @@ xfs_buf_rele(
 			bp->b_state &= ~XFS_BSTATE_DISPOSE;
 			atomic_inc(&bp->b_hold);
 		}
-		spin_unlock(&pag->pag_buf_lock);
 	} else {
 		/*
 		 * most of the time buffers will already be removed from the
@@ -1018,7 +1032,6 @@ xfs_buf_rele(
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
 				       xfs_buf_hash_params);
-		spin_unlock(&pag->pag_buf_lock);
 		xfs_perag_put(pag);
 		freebuf = true;
 	}
@@ -1027,7 +1040,7 @@ xfs_buf_rele(
 	spin_unlock(&bp->b_lock);
 
 	if (freebuf)
-		xfs_buf_free(bp);
+		call_rcu(&bp->rcu_head, __xfs_buf_rcu_free);
 }
 
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 37943ac..3f99ea7 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -210,6 +210,8 @@ typedef struct xfs_buf {
 
 	const struct xfs_buf_ops	*b_ops;
 
+	struct rcu_head		rcu_head;
+
 #ifdef XFS_BUF_LOCK_TRACKING
 	int			b_last_holder;
 #endif
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 84f7852..1116909 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -393,7 +393,6 @@ typedef struct xfs_perag {
 	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
 
 	/* buffer cache index */
-	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
 	struct rhashtable pag_buf_hash;
 
 	/* for rcu-safe freeing */
-- 
2.7.4



* Re: [PATCH 0/2] XFS buffer cache scalability improvements
From: Dave Chinner @ 2016-10-18 21:21 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Tue, Oct 18, 2016 at 10:14:11PM +0200, Lucas Stach wrote:
> Hi all,
> 
> this series scratches my own small itch with XFS, namely scalability of the buffer
> cache in metadata intensive workloads. With a large number of cached buffers those
> workloads are CPU bound with a significant amount of time spent searching the cache.
> 
> The first commit replaces the rbtree used to index the cache with an rhashtable. The
> rbtree is a bottleneck in scalability, as the data structure itself is pretty CPU
> cache unfriendly. For larger numbers of cached buffers over 80% of the CPU time
> is spent waiting on cache misses resulting from the inherent pointer chasing.
> 
> rhashtables provide a fast lookup with the ability to have lookups proceed while the
> hashtable is being resized. This seems to match the read dominated workload of the
> buffer cache index structure pretty well.

Yup, it's a good idea - I have considered doing this change for
these reasons, but have never found the time.

> The second patch is logical follow up. The rhashtable cache index is protected by
> RCU and does not need any additional locking. By switching the buffer cache entries
> over to RCU freeing the buffer cache can be operated in a completely lock-free
> manner. This should help scalability in the long run.

Yup, that's another reason I'd considered rhashtables :P

However, this is where it gets hairy. The buffer lifecycle is
intricate, subtle, and has a history of nasty bugs that just never
seem to go away. This change will require a lot of verification
work to ensure things like the LRU manipulations haven't been
compromised by the removal of this lock...

> This series survives at least a xfstests auto group run (though with the scratch
> device being a ramdisk) with no regressions and didn't show any problems in my
> real world testing (using the patched FS with multiple large git trees) so far.

It's a performance modification - any performance/profile numbers
that show the improvement?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Dave Chinner @ 2016-10-18 22:18 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Tue, Oct 18, 2016 at 10:14:12PM +0200, Lucas Stach wrote:
> On filesystems with a lot of metadata and in metadata intensive workloads
> xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
> the CPU time is spent on CPU cache misses while traversing the rbtree.
> 
> As the buffer cache does not need any kind of ordering, but fast lookups
> a hashtable is the natural data structure to use. The rhashtable
> infrastructure provides a self-scaling hashtable implementation and
> allows lookups to proceed while the table is going through a resize
> operation.
> 
> This reduces the CPU-time spent for the lookups to 1/3 even for small
> filesystems with a relatively small number of cached buffers, with
> possibly much larger gains on higher loaded filesystems.
> 
> The minimum size of 4096 buckets was chosen as it was the size of the
> xfs buffer cache hash before it was converted to an rbtree.

That hash table size was for the /global/ hash table. We now have a
cache per allocation group, so we most definitely don't want 4k
entries per AG as the default. Think of filesystems with hundreds or
even thousands of AGs....

I'd suggest that we want to make the default something much smaller;
maybe even as low as 32 buckets. It will grow quickly as the load
comes onto the filesystem....

> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> ---
>  fs/xfs/xfs_buf.c   | 118 ++++++++++++++++++++++++++++++++++-------------------
>  fs/xfs/xfs_buf.h   |   2 +-
>  fs/xfs/xfs_linux.h |   1 +
>  fs/xfs/xfs_mount.c |   7 +++-
>  fs/xfs/xfs_mount.h |   7 +++-
>  5 files changed, 88 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index b5b9bff..50c5b01 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -219,7 +219,6 @@ _xfs_buf_alloc(
>  	init_completion(&bp->b_iowait);
>  	INIT_LIST_HEAD(&bp->b_lru);
>  	INIT_LIST_HEAD(&bp->b_list);
> -	RB_CLEAR_NODE(&bp->b_rbnode);
>  	sema_init(&bp->b_sema, 0); /* held, no waiters */
>  	spin_lock_init(&bp->b_lock);
>  	XB_SET_OWNER(bp);
> @@ -473,7 +472,66 @@ _xfs_buf_map_pages(
>  /*
>   *	Finding and Reading Buffers
>   */
> +struct xfs_buf_cmp_arg {
> +	xfs_daddr_t	blkno;
> +	int		numblks;
> +};

That's a struct xfs_buf_map.

> +int
> +_xfs_buf_cmp(

static? Also, no leading underscore, and perhaps it should be called
xfs_buf_rhash_compare()....

> +	struct rhashtable_compare_arg *arg,
> +	const void *obj)
> +{
> +	const struct xfs_buf_cmp_arg *cmp_arg = arg->key;
> +	const struct xfs_buf *bp = obj;

Tab spacing for variables.

	struct rhashtable_compare_arg	*arg,
	const void			*obj)
{
	const struct xfs_buf_map	*map = arg->key;
	const struct xfs_buf		*bp = obj;


> +
> +	/*
> +	 * The key hashing in the lookup path depends on the key being the
> +	 * first element of the compare_arg, make sure to assert this.
> +	 */
> +	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) != 0);
> +
> +	if (bp->b_bn == cmp_arg->blkno) {
> +		if (unlikely(bp->b_length != cmp_arg->numblks)) {
> +			/*
> +			 * found a block number match. If the range doesn't
> +			 * match, the only way this is allowed is if the buffer
> +			 * in the cache is stale and the transaction that made
> +			 * it stale has not yet committed. i.e. we are
> +			 * reallocating a busy extent. Skip this buffer and
> +			 * continue searching for an exact match.
> +			 */
> +			ASSERT(bp->b_flags & XBF_STALE);
> +			return 1;
> +		}
> +
> +		return 0;
> +	}
> +
> +	return 1;

Change the logic to reduce indentation, and don't use unlikely().
gcc already hints branches that return as unlikely for code layout
purposes. However, it's been shown repeatedly that static hints like
this are wrong in the vast majority of the places they are used and
the hardware branch predictors do a far better job than humans.
So something like:

	if (bp->b_bn != map->bm_bn)
		return 1;

	/*
	 * Found a block number match. If the range doesn't match,
	 * the only way this is allowed is if the buffer in the
	 * cache is stale and the transaction that made it stale has
	 * not yet committed. i.e. we are reallocating a busy
	 * extent. Skip this buffer and continue searching for an
	 * exact match.
	 */
	if (unlikely(bp->b_length != cmp_arg->numblks)) {
		ASSERT(bp->b_flags & XBF_STALE);
		return 1;
	}

	return 0;

> +}
> +static const struct rhashtable_params xfs_buf_hash_params = {
> +	.min_size = 4096,
> +	.nelem_hint = 3072,

What does this hint do?

> +	.key_len = sizeof(xfs_daddr_t),
> +	.key_offset = offsetof(struct xfs_buf, b_bn),
> +	.head_offset = offsetof(struct xfs_buf, b_rhash_head),
> +	.automatic_shrinking = true,

Hmmm - so memory pressure is going to cause this hash to be resized
as the shrinker frees buffers. That, in turn, will cause the
rhashtable code to run GFP_KERNEL allocations, which could result in
it re-entering the shrinker and trying to free buffers which will
modify the hash table.

That doesn't seem like a smart thing to do to me - it seems to me
like it introduces a whole new avenue for memory reclaim deadlocks
(or, at minimum, lockdep false positives) to occur....

> +	.obj_cmpfn = _xfs_buf_cmp,
> +};
> +
> +int xfs_buf_hash_init(
> +	struct xfs_perag *pag)
> +{
> +	spin_lock_init(&pag->pag_buf_lock);
> +	return rhashtable_init(&pag->pag_buf_hash, &xfs_buf_hash_params);
> +}
>  
> +void
> +xfs_buf_hash_destroy(
> +	struct xfs_perag *pag)
> +{
> +	rhashtable_destroy(&pag->pag_buf_hash);
> +}
>  /*
>   *	Look up, and creates if absent, a lockable buffer for
>   *	a given range of an inode.  The buffer is returned
> @@ -488,16 +546,13 @@ _xfs_buf_find(
>  	xfs_buf_t		*new_bp)
>  {
>  	struct xfs_perag	*pag;
> -	struct rb_node		**rbp;
> -	struct rb_node		*parent;
>  	xfs_buf_t		*bp;
> -	xfs_daddr_t		blkno = map[0].bm_bn;
> +	struct xfs_buf_cmp_arg	cmp_arg = { .blkno = map[0].bm_bn };

it's a compare map, so maybe call it cmap? And I'd move the
initialisation down to where the block count is initialised,
too.

> -	/* get tree root */
> +	/* get pag */

Comment is now redundant.

>  	pag = xfs_perag_get(btp->bt_mount,
> -				xfs_daddr_to_agno(btp->bt_mount, blkno));
> +			    xfs_daddr_to_agno(btp->bt_mount, cmp_arg.blkno));
>  
> -	/* walk tree */
> +	/* lookup buf in pag hash */

Comment is also now redundant.

[...]
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index fc78739..b9a9a58 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -157,6 +157,7 @@ xfs_free_perag(
>  		spin_unlock(&mp->m_perag_lock);
>  		ASSERT(pag);
>  		ASSERT(atomic_read(&pag->pag_ref) == 0);
> +		xfs_buf_hash_destroy(pag);
>  		call_rcu(&pag->rcu_head, __xfs_free_perag);
>  	}
>  }
> @@ -212,8 +213,8 @@ xfs_initialize_perag(
>  		spin_lock_init(&pag->pag_ici_lock);
>  		mutex_init(&pag->pag_ici_reclaim_lock);
>  		INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
> -		spin_lock_init(&pag->pag_buf_lock);
> -		pag->pag_buf_tree = RB_ROOT;
> +		if (xfs_buf_hash_init(pag))
> +			goto out_unwind;
>  
>  		if (radix_tree_preload(GFP_NOFS))
>  			goto out_unwind;
> @@ -239,9 +240,11 @@ xfs_initialize_perag(
>  	return 0;
>  
>  out_unwind:
> +	xfs_buf_hash_destroy(pag);

I don't think this is correct for the case that xfs_buf_hash_init()
fails as the rhashtable_destroy() function assumes the init
completed successfully. i.e. this will oops with a null pointer.
So I think an error stack like this is needed:

 		if (radix_tree_preload(GFP_NOFS))
-			goto out_unwind;
+			goto out_destroy_hash;
....
+out_destroy_hash:
+	xfs_buf_hash_destroy(pag);
+out_unwind:
>  	kmem_free(pag);
>  	for (; index > first_initialised; index--) {
>  		pag = radix_tree_delete(&mp->m_perag_tree, index);
> +		xfs_buf_hash_destroy(pag);
>  		kmem_free(pag);
>  	}
>  	return error;

> index 819b80b..84f7852 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -393,8 +393,8 @@ typedef struct xfs_perag {
>  	unsigned long	pag_ici_reclaim_cursor;	/* reclaim restart point */
>  
>  	/* buffer cache index */
> -	spinlock_t	pag_buf_lock;	/* lock for pag_buf_tree */
> -	struct rb_root	pag_buf_tree;	/* ordered tree of active buffers */
> +	spinlock_t	pag_buf_lock;	/* lock for pag_buf_hash */
> +	struct rhashtable pag_buf_hash;
>  
>  	/* for rcu-safe freeing */
>  	struct rcu_head	rcu_head;
> @@ -424,6 +424,9 @@ xfs_perag_resv(
>  	}
>  }
>  
> +int xfs_buf_hash_init(xfs_perag_t *pag);
> +void xfs_buf_hash_destroy(xfs_perag_t *pag);

No typedefs, please. Also, shouldn't these be defined in
fs/xfs/xfs_buf.h?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: switch buffer cache entries to RCU freeing
From: Dave Chinner @ 2016-10-18 22:43 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Tue, Oct 18, 2016 at 10:14:13PM +0200, Lucas Stach wrote:
> The buffer cache hash as the indexing data structure into the buffer
> cache is already protected by RCU. By freeing the entries itself by
> RCU we can get rid of the buffer cache lock altogether.
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> ---
>  fs/xfs/xfs_buf.c   | 41 +++++++++++++++++++++++++++--------------
>  fs/xfs/xfs_buf.h   |  2 ++
>  fs/xfs/xfs_mount.h |  1 -
>  3 files changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 50c5b01..3ee0a3d1 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -326,6 +326,16 @@ xfs_buf_free(
>  	kmem_zone_free(xfs_buf_zone, bp);
>  }
>  
> +STATIC void
> +__xfs_buf_rcu_free(
> +	struct rcu_head	*head)
> +{
> +	xfs_buf_t *bp = container_of(head, struct xfs_buf, rcu_head);
> +
> +	ASSERT(atomic_read(&bp->b_hold) == 0);
> +	xfs_buf_free(bp);
> +}
> +
>  /*
>   * Allocates all the pages for buffer in question and builds it's page list.
>   */
> @@ -491,6 +501,14 @@ _xfs_buf_cmp(
>  	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) != 0);
>  
>  	if (bp->b_bn == cmp_arg->blkno) {
> +		/*
> +		 * Skip matches with a hold count of zero, as they are about to
> +		 * be freed by RCU. Continue searching as another valid entry
> +		 * might have already been inserted into the hash.
> +		 */
> +		if (unlikely(atomic_read(&bp->b_hold) == 0))
> +			return 1;

This is an invalid assumption. Buffers transition through a hold
count of zero when they get moved to the LRU list, which then adds
a reference for the LRU list so the hold count goes back above zero.
The pag_buf_lock serialised this 1->0->1 hold count change so it
wasn't ever seen by lookups or concurrent xfs_buf_rele() calls.

Removing the pag_buf_lock around freeing of buffers and lookup
exposes this transition to lookup and so b_hold cannot be used like
this to determine if a buffer is being freed or not.

Like I said, there is some really subtle stuff in the buffer
lifecycle management....

Hmmmm - because this is now all RCU-ised, we are going to need some
form of memory barrier to ensure that whatever we use to detect
buffers in the RCU grace period is race free. I suspect the easiest
thing to do is to use the bp->b_lock spinlock to set a noticeable
"buffer being freed" state in xfs_buf_rele() that we can check in
lookup under the spinlock. That way the lock provides the memory
barrier we need and it also provides serialisation against
transient hold count values in xfs_buf_rele() similar to what the
pag_buf_lock used to provide....

(FYI: the internal object spinlock pattern is what we use for the
RCU lookup vs freeing serialisation in the XFS inode cache....)

> @@ -580,14 +597,16 @@ _xfs_buf_find(
>  	pag = xfs_perag_get(btp->bt_mount,
>  			    xfs_daddr_to_agno(btp->bt_mount, cmp_arg.blkno));
>  
> +	rcu_read_lock();
>  	/* lookup buf in pag hash */
> -	spin_lock(&pag->pag_buf_lock);
>  	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
>  				    xfs_buf_hash_params);
> -	if (bp) {
> -		atomic_inc(&bp->b_hold);
> +
> +	/* if the hold count is zero the buffer is about to be freed by RCU */
> +	if (bp && atomic_inc_not_zero(&bp->b_hold))
>  		goto found;
> -	}
> +
> +	rcu_read_unlock();

This has the same problem with transient 0 hold counts in
xfs_buf_rele(). I suspect that we need to take the hold reference in
the xfs_buf_rhash_compare() function where we need to take the
bp->b_lock() to check for the buffer being in the rcu grace
period. If it's not being freed, we should take a reference right
then so that it doesn't get freed between the lookup check and the
buffer being returned...

> @@ -975,9 +991,8 @@ xfs_buf_rele(
>  
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
>  
> -	release = atomic_dec_and_lock(&bp->b_hold, &pag->pag_buf_lock);
>  	spin_lock(&bp->b_lock);
> -	if (!release) {
> +	if (!atomic_dec_and_test(&bp->b_hold)) {
>  		/*
>  		 * Drop the in-flight state if the buffer is already on the LRU
>  		 * and it holds the only reference. This is racy because we
> @@ -1001,7 +1016,6 @@ xfs_buf_rele(
>  			bp->b_state &= ~XFS_BSTATE_DISPOSE;
>  			atomic_inc(&bp->b_hold);
>  		}
> -		spin_unlock(&pag->pag_buf_lock);

This is the atomic_inc that makes the hold count go from 0->1 again
after adding the buffer to the LRU instead of freeing it. It is
inside the bp->b_lock, so we should be able to serialise lookups
using that lock.

Perhaps a new state flag that is only set when we are about to free
the buffer (XFS_BSTATE_FREE) is what we should add here, rather than
trying to play games checking and serialising bp->b_hold updates....
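
i.e. the tail of xfs_buf_rele() would end up looking something like
this (rough sketch only - XFS_BSTATE_FREE doesn't exist yet and the
flag value is purely illustrative):

	/* xfs_buf.h, next to XFS_BSTATE_DISPOSE */
	#define XFS_BSTATE_FREE		(1 << 1)	/* buffer being freed */

	/* xfs_buf_rele(), final reference drop, still under bp->b_lock */
	} else {
		/*
		 * Mark the buffer as being freed before removing it from
		 * the hash, so RCU lookups that still see it during the
		 * grace period can detect the state and skip/retry.
		 */
		bp->b_state |= XFS_BSTATE_FREE;
		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
		rhashtable_remove_fast(&pag->pag_buf_hash, &bp->b_rhash_head,
				       xfs_buf_hash_params);
		xfs_perag_put(pag);
		freebuf = true;
	}

	spin_unlock(&bp->b_lock);

	if (freebuf)
		call_rcu(&bp->rcu_head, __xfs_buf_rcu_free);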

> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -210,6 +210,8 @@ typedef struct xfs_buf {
>  
>  	const struct xfs_buf_ops	*b_ops;
>  
> +	struct rcu_head		rcu_head;

Make it a union with the b_lru field so the structure doesn't grow
just for RCU freeing support.
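
i.e. something like this in struct xfs_buf (sketch only; it is safe
because the buffer has been taken off the LRU and out of the hash
before call_rcu() is invoked, so b_lru is no longer used at that
point):

	union {
		struct list_head	b_lru;		/* lru list */
		struct rcu_head		rcu_head;	/* only used for freeing */
	};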

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Dave Chinner @ 2016-10-19  1:15 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Tue, Oct 18, 2016 at 10:14:12PM +0200, Lucas Stach wrote:
> On filesystems with a lot of metadata and in metadata intensive workloads
> xfs_buf_find() is showing up at the top of the CPU cycles trace. Most of
> the CPU time is spent on CPU cache misses while traversing the rbtree.
> 
> As the buffer cache does not need any kind of ordering, but fast lookups
> a hashtable is the natural data structure to use. The rhashtable
> infrastructure provides a self-scaling hashtable implementation and
> allows lookups to proceed while the table is going through a resize
> operation.
> 
> This reduces the CPU-time spent for the lookups to 1/3 even for small
> filesystems with a relatively small number of cached buffers, with
> possibly much larger gains on higher loaded filesystems.
> 
> The minimum size of 4096 buckets was chosen as it was the size of the
> xfs buffer cache hash before it was converted to an rbtree.
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>

This fails to compile on CONFIG_XFS_DEBUG=y kernels due to various
ASSERT statements not being updated appropriately. When making XFS
changes to core infrastructure like this, it is highly recommended
that you use CONFIG_XFS_DEBUG=y so that it catches mistakes and
assumptions that your changes violate at runtime....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/2] XFS buffer cache scalability improvements
From: Lucas Stach @ 2016-10-22 17:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wednesday, 19.10.2016, 08:21 +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2016 at 10:14:11PM +0200, Lucas Stach wrote:
> > 
> > Hi all,
> > 
> > this series scratches my own small itch with XFS, namely
> > scalability of the buffer
> > cache in metadata intensive workloads. With a large number of
> > cached buffers those
> > workloads are CPU bound with a significant amount of time spent
> > searching the cache.
> > 
> > The first commit replaces the rbtree used to index the cache with
> > an rhashtable. The
> > rbtree is a bottleneck in scalability, as the data structure itself
> > is pretty CPU
> > cache unfriendly. For larger numbers of cached buffers over 80% of
> > the CPU time
> > is spent waiting on cache misses resulting from the inherent
> > pointer chasing.
> > 
> > rhashtables provide a fast lookup with the ability to have lookups
> > proceed while the
> > hashtable is being resized. This seems to match the read dominated
> > workload of the
> > buffer cache index structure pretty well.
> 
> Yup, it's a good idea - I have considered doing this change for
> these reasons, but have never found the time.
> 
> > 
> > The second patch is logical follow up. The rhashtable cache index
> > is protected by
> > RCU and does not need any additional locking. By switching the
> > buffer cache entries
> > over to RCU freeing the buffer cache can be operated in a
> > completely lock-free
> > manner. This should help scalability in the long run.
> 
> Yup, that's another reason I'd considered rhashtables :P
> 
> However, this is where it gets hairy. The buffer lifecycle is
> intricate, subtle, and has a history of nasty bugs that just never
> seem to go away. This change will require a lot of verification
> work to ensure things like the LRU manipulations haven't been
> compromised by the removal of this lock...
> 
> > 
> > This series survives at least a xfstests auto group run (though
> > with the scratch
> > device being a ramdisk) with no regressions and didn't show any
> > problems in my
> > real world testing (using the patched FS with multiple large git
> > trees) so far.
> 
> It's a performance modification - any performance/profile numbers
> that show the improvement?
> 
In my testing with a small scale FS (some linux kernel git trees on a
FS with 4 AGs) the CPU time spent in xfs_buf_find while doing a git
checkout (going from one specific revision to another) with warm caches
goes down from 2% to 0.6% with this change.

I have a profile from a machine where xfs_buf_find is taking up the top
spot of the profile at 6% CPU time. Unfortunately I haven't been able
to re-run the test on a FS at that scale yet.

Regards,
Lucas


* Re: [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Lucas Stach @ 2016-10-22 18:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wednesday, 19.10.2016, 09:18 +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2016 at 10:14:12PM +0200, Lucas Stach wrote:
> > 

[ Snipped lots of comments that are all valid and will be taken care of
in V2]

> > +static const struct rhashtable_params xfs_buf_hash_params = {
> > +	.min_size = 4096,
> > +	.nelem_hint = 3072,
> 
> What does this hint do?
> 
It's the initial sizing of the hash. The shrinker will not resize the
hash below min_size, but if the nelem_hint isn't given the hash will
start out really small. Maybe that's just the behavior we want; I will
have to rethink this.
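
For V2 I'd then cut this down to something along the lines of your
suggestion (numbers are illustrative, not benchmarked yet):

	static const struct rhashtable_params xfs_buf_hash_params = {
		.min_size		= 32,	/* empty AGs stay small */
		.key_len		= sizeof(xfs_daddr_t),
		.key_offset		= offsetof(struct xfs_buf, b_bn),
		.head_offset		= offsetof(struct xfs_buf, b_rhash_head),
		.automatic_shrinking	= true,
		.obj_cmpfn		= _xfs_buf_cmp,
	};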

> > 
> > +	.key_len = sizeof(xfs_daddr_t),
> > +	.key_offset = offsetof(struct xfs_buf, b_bn),
> > +	.head_offset = offsetof(struct xfs_buf, b_rhash_head),
> > +	.automatic_shrinking = true,
> 
> Hmmm - so memory pressure is going to cause this hash to be resized
> as the shrinker frees buffers. That, in turn, will cause the
> rhashtable code to run GFP_KERNEL allocations, which could result in
> it re-entering the shrinker and trying to free buffers which will
> modify the hash table.
> 
> That doesn't seem like a smart thing to do to me - it seems to me
> like it introduces a whole new avenue for memory reclaim deadlocks
> (or, at minimum, lockdep false positives) to occur....
> 
Shrinking of the hash table is done in a worker, so I don't see the
direct chain you are describing above.

[more valid remarks snipped]

Regards,
Lucas


* Re: [PATCH 2/2] xfs: switch buffer cache entries to RCU freeing
From: Lucas Stach @ 2016-10-22 18:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wednesday, 19.10.2016, 09:43 +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2016 at 10:14:13PM +0200, Lucas Stach wrote:
> > 
> > The buffer cache hash as the indexing data structure into the
> > buffer
> > cache is already protected by RCU. By freeing the entries itself by
> > RCU we can get rid of the buffer cache lock altogether.
> > 
> > Signed-off-by: Lucas Stach <dev@lynxeye.de>
> > ---
> >  fs/xfs/xfs_buf.c   | 41 +++++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_buf.h   |  2 ++
> >  fs/xfs/xfs_mount.h |  1 -
> >  3 files changed, 29 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 50c5b01..3ee0a3d1 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -326,6 +326,16 @@ xfs_buf_free(
> >  	kmem_zone_free(xfs_buf_zone, bp);
> >  }
> >  
> > +STATIC void
> > +__xfs_buf_rcu_free(
> > +	struct rcu_head	*head)
> > +{
> > +	xfs_buf_t *bp = container_of(head, struct xfs_buf,
> > rcu_head);
> > +
> > +	ASSERT(atomic_read(&bp->b_hold) == 0);
> > +	xfs_buf_free(bp);
> > +}
> > +
> >  /*
> >   * Allocates all the pages for buffer in question and builds it's
> > page list.
> >   */
> > @@ -491,6 +501,14 @@ _xfs_buf_cmp(
> >  	BUILD_BUG_ON(offsetof(struct xfs_buf_cmp_arg, blkno) !=
> > 0);
> >  
> >  	if (bp->b_bn == cmp_arg->blkno) {
> > +		/*
> > +		 * Skip matches with a hold count of zero, as they
> > are about to
> > +		 * be freed by RCU. Continue searching as another
> > valid entry
> > +		 * might have already been inserted into the hash.
> > +		 */
> > +		if (unlikely(atomic_read(&bp->b_hold) == 0))
> > +			return 1;
> 
> This is an invalid assumption. Buffers transition through a hold
> count of zero when they get moved to the LRU list, which then adds
> a reference for the LRU list so the hold count goes back above zero.
> The pag_buf_lock serialised this 1->0->1 hold count change so it
> wasn't ever seen by lookups or concurrent xfs_buf_rele() calls.
> 
> Removing the pag_buf_lock around freeing of buffers and lookup
> exposes this transition to lookup and so b_hold cannot be used like
> this to determine if a buffer is being freed or not.
> 
> Like I said, there is some really subtle stuff in the buffer
> lifecycle management....
> 
> Hmmmm - because this is now all RCU-ised, we are going to need some
> form of memory barrier to ensure that whatever we use to detect
> buffers in the RCU grace period is race free. I suspect the easiest
> thing to do is to use the bp->b_lock spinlock to set a noticable
> "buffer being freed" state in xfs_buf_rele() that we can check in
> lookup under the spinlock. That way the lock provides the memory
> barrier we need and it also provides serialisation against
> transient hold count values in xfs_buf_rele() similar to what the
> pag_buf_lock used to provide....
> 
> (FYI: the internal object spinlock pattern is what we use for the
> RCU lookup vs freeing serialisation in the XFS inode cache....)
> 

Thanks. Using the internal spinlock seems like a good idea.

> > 
> > @@ -580,14 +597,16 @@ _xfs_buf_find(
> >  	pag = xfs_perag_get(btp->bt_mount,
> >  			    xfs_daddr_to_agno(btp->bt_mount,
> > cmp_arg.blkno));
> >  
> > +	rcu_read_lock();
> >  	/* lookup buf in pag hash */
> > -	spin_lock(&pag->pag_buf_lock);
> >  	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
> >  				    xfs_buf_hash_params);
> > -	if (bp) {
> > -		atomic_inc(&bp->b_hold);
> > +
> > +	/* if the hold count is zero the buffer is about to be
> > freed by RCU */
> > +	if (bp && atomic_inc_not_zero(&bp->b_hold))
> >  		goto found;
> > -	}
> > +
> > +	rcu_read_unlock();
> 
> This has the same problem with transient 0 hold counts in
> xfs_buf_rele(). I suspect that we need to take the hold reference in
> the xfs_buf_rhash_compare() function where we need to take the
> bp->b_lock() to check for the buffer being in the rcu grace
> period. If it's not being freed, we should take a reference right
> then so that it doesn't get freed between the lookup check and the
> buffer being returned...
> 
I don't think this will work with rhashtables. The internal workings of
those depend on being able to call the compare function multiple times
on a single lookup, so you can't do anything in the lookup that changes
the state of the object.

I think the best we can do is to skip buffers that are already in the
"going to be freed" state in the lookup (without taking bp->b_lock).
Then take the lock to increase the hold count and recheck whether the
object transitioned into the "about to be freed" state between our
lookup and the time we got the lock. If the object is marked for
removal at this point we need to retry the lookup, I think.

Regards,
Lucas


* Re: [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Dave Chinner @ 2016-10-24  2:15 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Sat, Oct 22, 2016 at 08:01:49PM +0200, Lucas Stach wrote:
> On Wednesday, 19.10.2016, 09:18 +1100, Dave Chinner wrote:
> > On Tue, Oct 18, 2016 at 10:14:12PM +0200, Lucas Stach wrote:
> > > +	.key_len = sizeof(xfs_daddr_t),
> > > +	.key_offset = offsetof(struct xfs_buf, b_bn),
> > > +	.head_offset = offsetof(struct xfs_buf, b_rhash_head),
> > > +	.automatic_shrinking = true,
> > 
> > Hmmm - so memory pressure is going to cause this hash to be resized
> > as the shrinker frees buffers. That, in turn, will cause the
> > rhashtable code to run GFP_KERNEL allocations, which could result in
> > it re-entering the shrinker and trying to free buffers which will
> > modify the hash table.
> > 
> > That doesn't seem like a smart thing to do to me - it seems to me
> > like it introduces a whole new avenue for memory reclaim deadlocks
> > (or, at minimum, lockdep false positives) to occur....
> > 
> Shrinking of the hash table is done in a worker, so I don't see the
> direct chain you are describing above.

We've had deadlocks where workqueue work has been stalled on a memory
allocation trying to allocate a new worker thread to run the work.
The rhashtable code appears to use unbound system work queues which
means there are no rescuer threads, and they are being called to do
work in memory reclaim context. Rescuer threads come along with the
WQ_MEM_RECLAIM initialisation flag for workqueues, but the
rhashtable code is most definitely not doing that...

i.e. if memory reclaim requires that workqueue to make progress to
continue freeing memory or resolve a blocking situation in the
shrinker (e.g. waiting for IO completion) then a) it needs to have a
rescuer thread, and b) it must avoid re-entering the shrinker that
is already blocking waiting for the work to be run. i.e. it can't
do GFP_KERNEL allocations because that will result in re-entering
the blocked shrinker...
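
For comparison, a workqueue that memory reclaim is allowed to depend
on has to be set up roughly like this (the name is just an example):

	struct workqueue_struct	*wq;

	/*
	 * WQ_MEM_RECLAIM provides a rescuer thread so that queued work
	 * can always make forward progress under memory pressure.
	 */
	wq = alloc_workqueue("xfs-rhash-resize", WQ_MEM_RECLAIM, 0);
	if (!wq)
		return -ENOMEM;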

Now, this /might/ be ok for the rhashtable code as it may not block
future operations if a grow/shrink gets held up on memory
allocation, but I'm not intimately familiar with that code. It is,
however, a red flag that needs to be checked out and verified.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: switch buffer cache entries to RCU freeing
From: Dave Chinner @ 2016-10-24  2:37 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Sat, Oct 22, 2016 at 08:52:44PM +0200, Lucas Stach wrote:
> On Wednesday, 19.10.2016, 09:43 +1100, Dave Chinner wrote:
> > On Tue, Oct 18, 2016 at 10:14:13PM +0200, Lucas Stach wrote:
> > >  			    xfs_daddr_to_agno(btp->bt_mount,
> > > cmp_arg.blkno));
> > >  
> > > +	rcu_read_lock();
> > >  	/* lookup buf in pag hash */
> > > -	spin_lock(&pag->pag_buf_lock);
> > >  	bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmp_arg,
> > >  				    xfs_buf_hash_params);
> > > -	if (bp) {
> > > -		atomic_inc(&bp->b_hold);
> > > +
> > > +	/* if the hold count is zero the buffer is about to be
> > > freed by RCU */
> > > +	if (bp && atomic_inc_not_zero(&bp->b_hold))
> > >  		goto found;
> > > -	}
> > > +
> > > +	rcu_read_unlock();
> > 
> > This has the same problem with transient 0 hold counts in
> > xfs_buf_rele(). I suspect that we need to take the hold reference in
> > the xfs_buf_rhash_compare() function where we need to take the
> > bp->b_lock() to check for the buffer being in the rcu grace
> > period. If it's not being freed, we should take a reference right
> > then so that it doesn't get freed between the lookup check and the
> > buffer being returned...
> > 
> I don't think this will work with rhashtables. The internal workings of
> those depend on being able to call the compare function multiple times
> on a single lookup, so you can't do anything in the lookup that changes
> the state of the object.

That's not documented anywhere in the code that I can find. :(
It would be nice to have important API details like that documented
somewhere obvious....

> I think the best we can do is to skip buffers that are already in
> "going to be freed" state in the lookup (without taking the bp-
> >b_lock).

Let's get our terminology straight here to avoid confusion. You're
talking about checking in the compare function here, right? As
opposed to checking the object returned by the lookup function?

> Then take the lock to increase the hold count and recheck if the object
> transitioned into the "about to be freed" state between our lookup and
> the time we got the lock.

.... between the check in the compare function and the time we got
the lock in the object returned?

If so, there is no need to increase the hold count before we check
the freeing state once we have the lock - freeing will serialise
against the lock we take on the buffer object the lookup returns
before it can drop the last reference and free it. Hence as long as
we increment the hold count before we drop the lock we are good...

> If the object is marked from removal at this
> point we need to retry the lookup, I think.

Yup, seems reasonable. Probably best to hook up the xb_miss_locked
stat in this case, too...
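
i.e. the lookup path in _xfs_buf_find() would end up roughly like
this (sketch only - it assumes the XFS_BSTATE_FREE flag discussed for
the other patch, and the retry/not_found labels are illustrative):

	retry:
		rcu_read_lock();
		bp = rhashtable_lookup_fast(&pag->pag_buf_hash, &cmap,
					    xfs_buf_hash_params);
		if (!bp) {
			rcu_read_unlock();
			goto not_found;
		}

		spin_lock(&bp->b_lock);
		if (bp->b_state & XFS_BSTATE_FREE) {
			/* buffer is in its RCU grace period - look again */
			spin_unlock(&bp->b_lock);
			rcu_read_unlock();
			XFS_STATS_INC(btp->bt_mount, xb_miss_locked);
			goto retry;
		}
		/* freeing serialises on b_lock, so taking a hold here is safe */
		atomic_inc(&bp->b_hold);
		spin_unlock(&bp->b_lock);
		rcu_read_unlock();
		goto found;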

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: use rhashtable to track buffer cache
From: Lucas Stach @ 2016-10-24 11:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Monday, 24.10.2016, 13:15 +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2016 at 08:01:49PM +0200, Lucas Stach wrote:
> > 
> > On Wednesday, 19.10.2016, 09:18 +1100, Dave Chinner wrote:
> > > 
> > > On Tue, Oct 18, 2016 at 10:14:12PM +0200, Lucas Stach wrote:
> > > > 
> > > > +	.key_len = sizeof(xfs_daddr_t),
> > > > +	.key_offset = offsetof(struct xfs_buf, b_bn),
> > > > +	.head_offset = offsetof(struct xfs_buf, b_rhash_head),
> > > > +	.automatic_shrinking = true,
> > > 
> > > Hmmm - so memory pressure is going to cause this hash to be
> > > resized
> > > as the shrinker frees buffers. That, in turn, will cause the
> > > rhashtable code to run GFP_KERNEL allocations, which could result
> > > in
> > > it re-entering the shrinker and trying to free buffers which will
> > > modify the hash table.
> > > 
> > > That doesn't seem like a smart thing to do to me - it seems to me
> > > like it introduces a whole new avenue for memory reclaim
> > > deadlocks
> > > (or, at minimum, lockdep false positives) to occur....
> > > 
> > Shrinking of the hash table is done in a worker, so I don't see the
> > direct chain you are describing above.
> 
> We've had deadlocks where workqueue work has been stalled on memory
> allocation trying to allocation a new worker thread to run the work.
> The rhashtable code appears to use unbound system work queues which
> means there are no rescuer threads, and they are being called to do
> work in memory reclaim context. Rescuer threads come along with the
> WQ_MEM_RECLAIM initialisation flag for workqueues, but the
> rhashtable code is most definitely not doing that...
> 
> i.e. if memory reclaim requires that workqueue to make progress to
> continue freeing memory or resolve a blocking situation in the
> shrinker (e.g. waiting for IO completion) then a) it needs to have a
> rescuer thread, and b) it must avoid re-entering the shrinker that
> is already blocking waiting for the work to be run. i.e. it can't
> do GFP_KERNEL allocations because that will result in re-entering
> the blocked shrinker...
> 
> Now, this /might/ be ok for the rhashtable code as it may not block
> future operations if a grow/shrink gets held up on memory
> allocation, but I'm not intimately familiar with that code. It is,
> however, a red flag that needs to be checked out and verified.

Right, this is exactly what I meant. Insertion/deletion of entries in
rhashtables will not be held up by the grow/shrink worker. The bucket
locks will only be taken once the worker succeeded in allocating the
required memory.

If the shrink worker isn't able to make progress due to memory pressure
we only end up with a sparsely populated hash table for a longer period
of time. It does not affect insertion/deletion of the objects into the
hash.

Same goes for growing the table. If the expand worker is blocked on
anything the only adverse effect is that we end up with longer hash
chains than we would like until the expand worker has a chance to do
its work.

Regards,
Lucas


* Re: [PATCH 0/2] XFS buffer cache scalability improvements
From: Dave Chinner @ 2016-11-10 23:02 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

Hi Lucas,

On Wed, Oct 19, 2016 at 08:21:16AM +1100, Dave Chinner wrote:
> On Tue, Oct 18, 2016 at 10:14:11PM +0200, Lucas Stach wrote:
> > The second patch is logical follow up. The rhashtable cache index is protected by
> > RCU and does not need any additional locking. By switching the buffer cache entries
> > over to RCU freeing the buffer cache can be operated in a completely lock-free
> > manner. This should help scalability in the long run.
> 
> Yup, that's another reason I'd considered rhashtables :P
> 
> However, this is where it gets hairy. The buffer lifecycle is
> intricate, subtle, and has a history of nasty bugs that just never
> seem to go away. This change will require a lot of verification
> work to ensure things like the LRU manipulations haven't been
> compromised by the removal of this lock...
.....
> It's a performance modification - any performance/profile numbers
> that show the improvement?

Here's why detailed performance measurements for changes like this
are important: RCU freeing and lockless lookups are not a win for
the XFS buffer cache.

I fixed the code to be RCU safe (use bp->b_lock, XFS_BSTATE_FREE and
memory barriers) and verified that it worked OK (no regressions over
a nightly xfstests cycle across all my test machines) so this
morning I've run performance tests against it. I've also modified
the rhashtable patch with all my review comments:

[dchinner: reduce minimum hash size to an acceptable size for large
	   filesystems with many AGs with no active use.]
[dchinner: remove stale rbtree asserts.]
[dchinner: use xfs_buf_map for compare function argument.]
[dchinner: make functions static.]
[dchinner: remove redundant comments.]

The results show that RCU freeing significantly slows down my fsmark
file creation benchmark (https://lkml.org/lkml/2016/3/31/1161) that
hammers the buffer cache - it is not uncommon for profiles to show
10-11% CPU usage in _xfs_buf_find() with the rbtree implementation.
These tests were run on 4.9-rc4+for-next:

			files/s		wall time	sys CPU
rbtree:		     220910+/-1.9e+04	4m21.639s	48m17.206s
rhashtable:	     227518+/-1.9e+04	4m22.179s	45m41.639s
With RCU freeing:    198181+/-3e+04	4m45.286s	51m36.842s

So we can see that rbtree->rhashtable reduces system time and
increases the create rate a little, though not significantly, and the
overall runtime is pretty much unchanged. However, adding RCU
lookup/freeing to the rhashtable shows a significant degradation - a
10% decrease in create rate, a 50% increase in create rate stddev, and
a 10% increase in system time. Not good.

The reason for this change is quite obvious from my monitoring:
there is a significant change in the memory usage footprint and memory
reclaim overhead. RCU freeing delays the freeing of the buffers,
which means the buffer cache shrinker is not actually freeing memory
when demand occurs.  Instead, freeing is delayed to the end of the
RCU grace period and hence does not relieve pressure quickly. Hence
memory reclaim transfers that pressure to other caches, increasing
reclaim scanning work and allocation latency. The result is higher
reclaim CPU usage, a very different memory usage profile over the
course of the test and, ultimately, lower performance.

Further, because we normally cycle through buffers so fast, RCU
freeing means that we are no longer finding hot buffers in the slab
cache during allocation. The idea of the slab cache is that on
heavily cycled slabs the objects being allocated are the ones that
were just freed and so are still hot in the CPU caches. When we
switch to RCU freeing, this no longer happens because freeing only
ever happens in sparse, periodic batches. Hence we end up allocating
cache cold buffers and so end up with more cache misses when first
allocating new buffers.
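
Roughly, the difference is between the two freeing paths sketched below
(made-up names, not the actual xfs_buf code):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct example_buf {
	struct rcu_head		eb_rcu;
	/* ... rest of the buffer ... */
};

static struct kmem_cache	*example_buf_cache;

/*
 * Immediate free: the object goes straight back to the slab and the
 * next allocation is likely to get it back while it is still cache hot.
 */
static void example_buf_free_now(struct example_buf *bp)
{
	kmem_cache_free(example_buf_cache, bp);
}

static void example_buf_free_rcu_cb(struct rcu_head *head)
{
	struct example_buf *bp = container_of(head, struct example_buf, eb_rcu);

	kmem_cache_free(example_buf_cache, bp);
}

/*
 * RCU free: the slab object is only returned once the grace period
 * expires, so shrinker work does not relieve memory pressure
 * immediately and freed objects come back in cold, sparse batches.
 */
static void example_buf_free_rcu(struct example_buf *bp)
{
	call_rcu(&bp->eb_rcu, example_buf_free_rcu_cb);
}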

IOWs, the loss of performance due to RCU freeing is not made up by
the anticipated reduction in the overhead of uncontended locks on
lookup. This is mainly because there is no measurable lookup locking
overhead now that rhashtables are used. rbtree CPU profile:

   - 9.39%  _xfs_buf_find
        0.92% xfs_perag_get
      - 0.90% xfs_buf_trylock
           0.71% down_trylock

rhashtable:

   - 2.62% _xfs_buf_find
      - 1.12% xfs_perag_get
         + 0.58% radix_tree_lookup
      - 0.91% xfs_buf_trylock
           0.82% down_trylock

rhashtable+RCU:

   - 2.31% _xfs_buf_find
        0.91% xfs_perag_get
      - 0.83% xfs_buf_trylock
           0.75% down_trylock

So with the rhashtable change in place, we've already removed the
cause of the pag_buf_lock contention (the rbtree pointer chasing) so
there just isn't any overhead that using RCU can optimise away.
Hence there are no gains to amortise the efficiency losses that RCU
freeing introduces, and as a result using RCU is slower than
traditional locking techniques.
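
For reference, the kind of lockless lookup tested here looks roughly
like the sketch below. The b_lock/b_hold/XFS_BSTATE_FREE handling
follows the description above, but this is an illustration with assumed
hash parameters, not the code that was benchmarked:

static struct xfs_buf *
example_buf_lookup_rcu(struct rhashtable *ht, u64 blkno,
		       const struct rhashtable_params params)
{
	struct xfs_buf	*bp;

	rcu_read_lock();
	bp = rhashtable_lookup_fast(ht, &blkno, params);
	if (bp) {
		/*
		 * The buffer may already be queued for RCU freeing, so
		 * check the free state under b_lock before taking a
		 * hold on it.
		 */
		spin_lock(&bp->b_lock);
		if (bp->b_state & XFS_BSTATE_FREE) {
			spin_unlock(&bp->b_lock);
			bp = NULL;
		} else {
			atomic_inc(&bp->b_hold);
			spin_unlock(&bp->b_lock);
		}
	}
	rcu_read_unlock();

	return bp;
}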

I'll keep testing the rhashtable code - it looks solid enough at this
point to consider it for the 4.10 cycle.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] XFS buffer cache scalability improvements
  2016-11-10 23:02   ` Dave Chinner
@ 2016-12-02 21:54     ` Lucas Stach
  2016-12-04 21:36       ` Dave Chinner
  0 siblings, 1 reply; 16+ messages in thread
From: Lucas Stach @ 2016-12-02 21:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

On Fri, Nov 11, 2016 at 10:02 +1100, Dave Chinner wrote:
> Hi Lucas,
> 
> On Wed, Oct 19, 2016 at 08:21:16AM +1100, Dave Chinner wrote:
> > On Tue, Oct 18, 2016 at 10:14:11PM +0200, Lucas Stach wrote:
> > > The second patch is logical follow up. The rhashtable cache index
> > > is protected by
> > > RCU and does not need any additional locking. By switching the
> > > buffer cache entries
> > > over to RCU freeing the buffer cache can be operated in a
> > > completely lock-free
> > > manner. This should help scalability in the long run.
> > 
> > Yup, that's another reason I'd considered rhashtables :P
> > 
> > However, this is where it gets hairy. The buffer lifecycle is
> > intricate, subtle, and has a history of nasty bugs that just never
> > seem to go away. This change will require a lot of verification
> > work to ensure things like the LRU manipulations haven't been
> > compromised by the removal of this lock...
> 
> .....
> > It's a performance modification - any performance/profile numbers
> > that show the improvement?
> 
> Here's why detailed performance measurement for changes like this
> are important: RCU freeing and lockless lookups are not a win for
> the XFS buffer cache.
> 
> I fixed the code to be RCU safe (use bp->b_lock, XFS_BSTATE_FREE and
> memory barriers) and verified that it worked OK (no regressions over
> a nightly xfstests cycle across all my test machines) so this
> morning I've run performance tests against it. I've also modified
> the rhashtable patch with all my review comments:
> 
> [dchinner: reduce minimum hash size to an acceptable size for large
> 	   filesystems with many AGs with no active use.]
> [dchinner: remove stale rbtree asserts.]
> [dchinner: use xfs_buf_map for compare function argument.]
> [dchinner: make functions static.]
> [dchinner: remove redundant comments.]
> 
> The results show that RCU freeing significantly slows down my fsmark
> file creation benchmark (https://lkml.org/lkml/2016/3/31/1161) that
> hammers the buffer cache - it is not uncommon for profiles to show
> 10-11% CPU usage in _xfs_buf_find() with the rbtree implementation.
> These tests were run on 4.9-rc4+for-next:
> 
> 			files/s		wall time	sys CPU
> rbtree:		     220910+/-1.9e+04	4m21.639s	48m17.206s
> rhashtable:	     227518+/-1.9e+04	4m22.179s	45m41.639s
> With RCU freeing:    198181+/-3e+04	4m45.286s	51m36.842s
> 
> So we can see that rbtree->rhashtable reduces system time and
> increases the create rate a little, though not significantly, and the
> overall runtime is pretty much unchanged. However, adding RCU
> lookup/freeing to the rhashtable shows a significant degradation - a
> 10% decrease in create rate, a 50% increase in create rate stddev, and
> a 10% increase in system time. Not good.
> 
> The reason for this change is quite obvious from my monitoring:
> there is a significant change in the memory usage footprint and memory
> reclaim overhead. RCU freeing delays the freeing of the buffers,
> which means the buffer cache shrinker is not actually freeing memory
> when demand occurs.  Instead, freeing is delayed to the end of the
> RCU grace period and hence does not relieve pressure quickly. Hence
> memory reclaim transfers that pressure to other caches, increasing
> reclaim scanning work and allocation latency. The result is higher
> reclaim CPU usage, a very different memory usage profile over the
> course of the test and, ultimately, lower performance.
> 
> Further, because we normally cycle through buffers so fast, RCU
> freeing means that we are no longer finding hot buffers in the slab
> cache during allocation. The idea of the slab cache is that on
> heavily cycled slabs the objects being allocated are the ones that
> were just freed and so are still hot in the CPU caches. When we
> switch to RCU freeing, this no longer happens because freeing only
> ever happens in sparse, periodic batches. Hence we end up allocating
> cache cold buffers and so end up with more cache misses when first
> allocating new buffers.
> 
> IOWs, the loss of performance due to RCU freeing is not made up by
> the anticipated reduction in the overhead of uncontended locks on
> lookup. This is mainly because there is no measurable lookup locking
> overhead now that rhashtables are used. rbtree CPU profile:
> 
>    - 9.39%  _xfs_buf_find
>         0.92% xfs_perag_get
>       - 0.90% xfs_buf_trylock
>            0.71% down_trylock
> 
> rhashtable:
> 
>    - 2.62% _xfs_buf_find
>       - 1.12% xfs_perag_get
>          + 0.58% radix_tree_lookup
>       - 0.91% xfs_buf_trylock
>            0.82% down_trylock
> 
> rhashtable+RCU:
> 
>    - 2.31% _xfs_buf_find
>         0.91% xfs_perag_get
>       - 0.83% xfs_buf_trylock
>            0.75% down_trylock
> 
> So with the rhashtable change in place, we've already removed the
> cause of the pag_buf_lock contention (the rbtree pointer chasing) so
> there just isn't any overhead that using RCU can optimise away.
> Hence there are no gains to amortise the efficiency losses that RCU
> freeing introduces, and as a result using RCU is slower than
> traditional locking techniques.
> 
> I'll keep testing the rhashtable code - it looks solid enough at this
> point to consider it for the 4.10 cycle.
> 
Thanks for running those numbers. I had started to modify the patches
according to the review, but didn't get around to fixing the RCU path.

Do you still plan for this change to go in with 4.10?

Regards,
Lucas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] XFS buffer cache scalability improvements
  2016-12-02 21:54     ` Lucas Stach
@ 2016-12-04 21:36       ` Dave Chinner
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Chinner @ 2016-12-04 21:36 UTC (permalink / raw)
  To: Lucas Stach; +Cc: linux-xfs

On Fri, Dec 02, 2016 at 10:54:52PM +0100, Lucas Stach wrote:
.....
> > So with the rhashtable change in place, we've already removed the
> > cause of the pag_buf_lock contention (the rbtree pointer chasing) so
> > there just isn't any overhead that using RCU can optimise away.
> > Hence there are no gains to amortise the efficiency losses that RCU
> > freeing introduces, and as a result using RCU is slower than
> > traditional locking techniques.
> > 
> > I'll keep testing the rhashtable code - it looks solid enough at this
> > point to consider it for the 4.10 cycle.
> > 
> Thanks for running those numbers. I had started to modify the patches
> according to the review, but didn't get around to fix the RCU path.
> 
> Do you still consider this change to go in with 4.10?

Yes, I posted an up-to-date version of it in the series I sent out
last Friday:

https://www.spinics.net/lists/linux-xfs/msg02532.html
https://www.spinics.net/lists/linux-xfs/msg02535.html

Needs review before I'll merge it for 4.10, though.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-12-04 21:37 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-18 20:14 [PATCH 0/2] XFS buffer cache scalability improvements Lucas Stach
2016-10-18 20:14 ` [PATCH 1/2] xfs: use rhashtable to track buffer cache Lucas Stach
2016-10-18 22:18   ` Dave Chinner
2016-10-22 18:01     ` Lucas Stach
2016-10-24  2:15       ` Dave Chinner
2016-10-24 11:47         ` Lucas Stach
2016-10-19  1:15   ` Dave Chinner
2016-10-18 20:14 ` [PATCH 2/2] xfs: switch buffer cache entries to RCU freeing Lucas Stach
2016-10-18 22:43   ` Dave Chinner
2016-10-22 18:52     ` Lucas Stach
2016-10-24  2:37       ` Dave Chinner
2016-10-18 21:21 ` [PATCH 0/2] XFS buffer cache scalability improvements Dave Chinner
2016-10-22 17:51   ` Lucas Stach
2016-11-10 23:02   ` Dave Chinner
2016-12-02 21:54     ` Lucas Stach
2016-12-04 21:36       ` Dave Chinner
