* [PATCH 0/5 v2] xfs: fix a couple of performance issues
@ 2020-05-12  9:28 Dave Chinner
  2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

To follow up on the interesting performance gain I found, here are
three RFC patches that build on the two I posted earlier. These get
rid of the CIL xc_cil_lock entirely by moving the entire CIL list
and accounting to percpu structures.

The result is that I'm topping out at about 1.12M transactions/s
and bottlenecking on VFS spinlocks in the dentry cache path walk
code and the superblock inode list lock. The XFS CIL commit path
mostly disappears from the profiles when creating about 600,000
inodes/s:


-   73.42%     0.12%  [kernel]               [k] path_openat
   - 11.29% path_openat
      - 7.12% xfs_vn_create
         - 7.18% xfs_vn_mknod
            - 7.30% xfs_generic_create
               - 6.73% xfs_create
                  - 2.69% xfs_dir_ialloc
                     - 2.98% xfs_ialloc
                        - 1.26% xfs_dialloc
                           - 1.04% xfs_dialloc_ag
                        - 1.02% xfs_setup_inode
                           - 0.90% inode_sb_list_add
>>>>>                         - 1.09% _raw_spin_lock
                                 - 4.47% do_raw_spin_lock
                                      4.05% __pv_queued_spin_lock_slowpath
                        - 0.75% xfs_iget
                  - 2.43% xfs_trans_commit
                     - 3.47% __xfs_trans_commit
                        - 7.47% xfs_log_commit_cil
                             1.60% memcpy_erms
                           - 1.35% xfs_buf_item_size
                                0.99% xfs_buf_item_size_segment.isra.0
                             1.30% xfs_buf_item_format
                  - 1.44% xfs_dir_createname
                     - 1.60% xfs_dir2_node_addname
                        - 1.08% xfs_dir2_leafn_add
                             0.79% xfs_dir3_leaf_check_int
      - 1.09% terminate_walk
         - 1.09% dput
>>>>>>      - 1.42% _raw_spin_lock
               - 7.75% do_raw_spin_lock
                    7.19% __pv_queued_spin_lock_slowpath
      - 0.99% xfs_vn_lookup
         - 0.96% xfs_lookup
            - 1.01% xfs_dir_lookup
               - 1.24% xfs_dir2_node_lookup
                  - 1.09% xfs_da3_node_lookup_int
      - 0.90% unlazy_walk
         - 0.87% legitimize_root
            - 0.94% __legitimize_path.isra.0
               - 0.91% lockref_get_not_dead
>>>>>>>           - 1.28% _raw_spin_lock
                     - 6.85% do_raw_spin_lock
                          6.29% __pv_queued_spin_lock_slowpath
      - 0.82% d_lookup
           __d_lookup
.....
+   39.21%     6.76%  [kernel]               [k] do_raw_spin_lock
+   35.07%     0.16%  [kernel]               [k] _raw_spin_lock
+   32.35%    32.13%  [kernel]               [k] __pv_queued_spin_lock_slowpath

So we're going 3-4x faster on this machine than without these
patches, yet we're still burning about 40% of the CPU consumed by
the workload on spinlocks.  IOWs, the XFS code is running 3-4x
faster consuming half the CPU, and we're bashing on other locks
now...

There's still more work to do to make these patches production
ready, but I figured people might want to comment on how much it
hurts their brain and whether there might be better ways to
aggregate all this percpu functionality into a neater package...

Cheers,

Dave.




* [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
@ 2020-05-12  9:28 ` Dave Chinner
  2020-05-12 12:30   ` Brian Foster
  2020-05-12  9:28 ` [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Seeing massive cpu usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, only one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

The suspect is cacheline contention on m_sectbb_log, which is used in
one of the macros in xfs_agino_range. This is a read-only variable,
but it shares a cacheline with m_active_trans, which is a global atomic
that gets bounced all around the machine.

The workload is trying to run hundreds of thousands of transactions
per second, and hence cacheline contention will be occurring on this
atomic counter. xfs_agino_range() is likely just an innocent
bystander as the cache coherency protocol fights over the
cacheline between CPU cores and sockets.
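
To illustrate what the rearrangement below is doing (sketch only,
with hypothetical struct names - the real change is to xfs_mount):
keep the read-only fields on their own cacheline so a write-hot
atomic can never invalidate the line they live on.

	#include <linux/cache.h>
	#include <linux/types.h>

	struct contended_layout {		/* before: false sharing */
		uint8_t		sectbb_log;	/* read-only after mount */
		atomic_t	active_trans;	/* dirtied on every transaction */
	};

	struct separated_layout {		/* after: no false sharing */
		atomic_t	active_trans;	/* write-hot data... */
		/* ...and read-only data starts on its own cacheline */
		uint8_t ____cacheline_aligned	sectbb_log;
	};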

On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x that of
machine B, and it still runs substantially slower than it should.

Current rm -rf of 50 million files:

		vanilla		patched
machine A	13m20s		8m30s
machine B	5m30s		5m02s

It's an improvement, indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a15792792..712b3e2583316 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -88,21 +88,12 @@ typedef struct xfs_mount {
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -117,9 +108,15 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+
+	/*
+	 * Read-only variables that are pre-calculated at mount time.
+	 */
+	int ____cacheline_aligned m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,20 +135,35 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+						/* low free space thresholds */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores; /* no per-AG finobt resv. */
+	/*
+	 * End of pre-calculated read-only variables
+	 */
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
+	struct mutex		m_growlock;	/* growfs mutex */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
@@ -160,8 +172,6 @@ typedef struct xfs_mount {
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -191,8 +201,6 @@ typedef struct xfs_mount {
 	 */
 	uint32_t		m_generation;
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
-- 
2.26.1.301.g55bc3eb7cb9



* [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
  2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-12  9:28 ` Dave Chinner
  2020-05-12 12:31   ` Brian Foster
  2020-05-12  9:28 ` [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's a global atomic counter, and we are hitting it at a rate of
half a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. Convert it to a
per-cpu counter.

And .... oh wow, that was unexpected!

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

		unpatched	patched
machine A:	3m45s		2m27s
machine B:	4m13s		4m14s

Create rates:
		unpatched	patched
machine A:	246k+/-15k	384k+/-10k
machine B:	225k+/-13k	223k+/-11k

Concurrent rm of same 50 million inodes:

		unpatched	patched
machine A:	8m30s		3m09s
machine B:	5m02s		4m51s

The transaction rate on the fast machine went from about 250k/sec to
over 600k/sec, which indicates just how much of a bottleneck this
atomic counter was.
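
For reference, the shape of the conversion, gathered from the hunks
below into one place: cheap per-cpu updates on the hot paths, and the
expensive but exact percpu_counter_sum() only on rare slow paths like
quiesce.

	/* mount time */
	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);

	/* hot paths: update the local CPU's counter, no cacheline bouncing */
	percpu_counter_inc(&mp->m_active_trans);
	percpu_counter_dec(&mp->m_active_trans);

	/* slow path (quiesce): exact sum across all CPUs */
	while (percpu_counter_sum(&mp->m_active_trans) > 0)
		delay(100);

	/* unmount time */
	percpu_counter_destroy(&mp->m_active_trans);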

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h |  2 +-
 fs/xfs/xfs_super.c | 12 +++++++++---
 fs/xfs/xfs_trans.c |  6 +++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 712b3e2583316..af3d8b71e9591 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -84,6 +84,7 @@ typedef struct xfs_mount {
 	 * extents or anything related to the rt device.
 	 */
 	struct percpu_counter	m_delalloc_blks;
+	struct percpu_counter	m_active_trans;	/* in progress xact counter */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
@@ -164,7 +165,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e80bd2c4c279e..bc4853525ce18 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -883,7 +883,7 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
+	while (percpu_counter_sum(&mp->m_active_trans) > 0)
 		delay(100);
 
 	/* force the log to unpin objects from the now complete transactions */
@@ -902,7 +902,7 @@ xfs_quiesce_attr(
 	 * Just warn here till VFS can correctly support
 	 * read-only remount without racing.
 	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
+	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
 
 	xfs_log_quiesce(mp);
 }
@@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
+	if (error)
+		goto free_delalloc_blocks;
+
 	return 0;
 
+free_delalloc_blocks:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
 	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
 	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
+	percpu_counter_destroy(&mp->m_active_trans);
 }
 
 static void
@@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 28b983ff8b113..636df5017782e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -68,7 +68,7 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	percpu_counter_dec(&tp->t_mountp->m_active_trans);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -126,7 +126,7 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	percpu_counter_inc(&tp->t_mountp->m_active_trans);
 	return ntp;
 }
 
@@ -275,7 +275,7 @@ xfs_trans_alloc(
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
-	atomic_inc(&mp->m_active_trans);
+	percpu_counter_inc(&mp->m_active_trans);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
-- 
2.26.1.301.g55bc3eb7cb9



* [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
  2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
  2020-05-12  9:28 ` [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu Dave Chinner
@ 2020-05-12  9:28 ` Dave Chinner
  2020-05-12 14:05   ` Brian Foster
  2020-05-12  9:28 ` [PATCH 4/5] [RFC] xfs: per-cpu CIL lists Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With the m_active_trans atomic bottleneck out of the way, the CIL
xc_cil_lock is the next bottleneck that causes cacheline contention.
This protects several things, the first of which is the CIL context
reservation ticket and space usage counters.

We can lift them out of the xc_cil_lock by converting them to
percpu counters. This involves two things, the first of which is
lifting calculations and samples that don't actually need protection
from races outside the xc_cil_lock.

The second is converting the counters to percpu counters and lifting
them outside the lock. This requires a couple of tricky things to
minimise initial state races and to ensure we take into account
split reservations. We do this by erring on the "take the
reservation just in case" side, which is largely lost in the noise of
the many frequent large transactions.

We use a trick with percpu_counter_add_batch() to ensure the global
sum is updated immediately on the first reservation, hence allowing us
to use fast counter reads everywhere to determine if the CIL is
empty or not, rather than using the list itself. This is important
for later patches where the CIL is moved to percpu lists and hence
cannot use list_empty() to detect an empty CIL. Hence this counter
provides a low overhead, lockless mechanism for determining if the
CIL is empty or not. All other percpu counter updates use a large
batch count so they aggregate on the local CPU and minimise global
sum updates.
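
In code form (the calls are from the hunks below and the following
patches), the batch argument controls when the local delta gets
folded into the global count:

	/*
	 * First reservation for a new context: the batch is smaller than
	 * the amount being added, so it is folded into the global count
	 * immediately and percpu_counter_read() sees a non-zero value on
	 * all CPUs straight away.
	 */
	percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);

	/*
	 * Steady state: a huge batch keeps updates aggregated on the local
	 * CPU and only occasionally folds them into the global sum.
	 */
	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);

	/* fast, lockless "is the CIL empty?" check used by later patches */
	bool cil_empty = percpu_counter_read(&cil->xc_curr_res) == 0;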

The xc_ctx_lock rwsem protects draining the percpu counters to the
context's ticket, similar to the way it allows access to the CIL
without using the xc_cil_lock. i.e. the CIL push has exclusive
access to the CIL, the context and the percpu counters while holding
the xc_ctx_lock. This ensures that we can sum and zero the counters
atomically from the perspective of the transaction commit side of
the push. i.e. they reset to zero atomically with the CIL context
swap and hence we don't need to have the percpu counters attached to
the CIL context.
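
Roughly, the serialisation ends up looking like this (a sketch
trimmed to the counter handling; the commit side already holds
xc_ctx_lock shared, the push side holds it exclusive):

	/* transaction commit side: concurrent, holds xc_ctx_lock shared */
	down_read(&cil->xc_ctx_lock);
	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
	/* ... format and insert the log items ... */
	up_read(&cil->xc_ctx_lock);

	/* CIL push side: exclusive, so all committers are locked out */
	down_write(&cil->xc_ctx_lock);
	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
	percpu_counter_set(&cil->xc_space_used, 0);
	percpu_counter_set(&cil->xc_curr_res, 0);
	/* ... switch to the new, empty context ... */
	up_write(&cil->xc_ctx_lock);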

Performance wise, this increases the transaction rate from
~620,000/s to around 750,000/second. Using a 32-way concurrent
create instead of 16-way on a 32p/16GB virtual machine:

		create time	rate		unlink time
unpatched	  2m03s      472k/s+/-9k/s	 3m06s
patched		  1m56s	     533k/s+/-28k/s	 2m34s

Notably, the system time for the create went from 44m20s down to
38m37s, whilst going faster. There is more variance, but I think
that is from the cacheline contention having inconsistent overhead.

XXX: probably should split into two patches

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
 fs/xfs/xfs_log_priv.h |  2 +
 2 files changed, 72 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b43f0e8f43f2e..746c841757ed1 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -393,7 +393,7 @@ xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			diff_iovecs = 0;
-	int			iclog_space;
+	int			iclog_space, space_used;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -403,17 +403,16 @@ xlog_cil_insert_items(
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
 	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
-
-	spin_lock(&cil->xc_cil_lock);
-
 	/* account for space used by new iovec headers  */
+
 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
 	len += iovhdr_res;
 	ctx->nvecs += diff_iovecs;
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+	/*
+	 * The ticket can't go away from us here, so we can do racy sampling
+	 * and precalculate everything.
+	 */
 
 	/*
 	 * Now transfer enough transaction reservation to the context ticket
@@ -421,27 +420,28 @@ xlog_cil_insert_items(
 	 * reservation has to grow as well as the current reservation as we
 	 * steal from tickets so we can correctly determine the space used
 	 * during the transaction commit.
+	 *
+	 * We use percpu_counter_add_batch() here to force the addition into the
+	 * global sum immediately. This will result in percpu_counter_read() now
+	 * always returning a non-zero value, and hence we'll only ever have a
+	 * very short race window on new contexts.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
+		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
 	}
 
 	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
+	if (len > 0 && !ctx_res) {
+		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
 		split_res = (len + iclog_space - 1) / iclog_space;
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
 		tp->t_ticket->t_curr_res -= split_res;
 		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
 	tp->t_ticket->t_curr_res -= len;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -458,6 +458,15 @@ xlog_cil_insert_items(
 		xlog_print_trans(tp);
 	}
 
+	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
+	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
+
+	spin_lock(&cil->xc_cil_lock);
+
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
@@ -741,6 +750,18 @@ xlog_cil_push_work(
 		num_iovecs += lv->lv_niovecs;
 	}
 
+	/*
+	 * Drain per cpu counters back to context so they can be re-initialised
+	 * to zero before we allow commits to the new context we are about to
+	 * switch to.
+	 */
+	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
+	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
+	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
+	percpu_counter_set(&cil->xc_space_used, 0);
+	percpu_counter_set(&cil->xc_curr_res, 0);
+
+
 	/*
 	 * initialise the new context and attach it to the CIL. Then attach
 	 * the current context to the CIL committing lsit so it can be found
@@ -900,6 +921,7 @@ xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	s64		space_used = percpu_counter_read(&cil->xc_space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -911,7 +933,7 @@ xlog_cil_push_background(
 	 * don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -934,9 +956,9 @@ xlog_cil_push_background(
 	 * If we are well over the space limit, throttle the work that is being
 	 * done until the push work on this context has begun.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
 		return;
 	}
@@ -1200,16 +1222,23 @@ xlog_cil_init(
 {
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
+	int		error = -ENOMEM;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
-		return -ENOMEM;
+		return error;
 
 	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
+	if (!ctx)
+		goto out_free_cil;
+
+	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_ctx;
+
+	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
+	if (error)
+		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
@@ -1230,19 +1259,31 @@ xlog_cil_init(
 	cil->xc_log = log;
 	log->l_cilp = cil;
 	return 0;
+
+out_free_space:
+	percpu_counter_destroy(&cil->xc_space_used);
+out_free_ctx:
+	kmem_free(ctx);
+out_free_cil:
+	kmem_free(cil);
+	return error;
 }
 
 void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil  *cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
+	percpu_counter_destroy(&cil->xc_space_used);
+	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ec22c7a3867f1..f5e79a7d44c8e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -262,6 +262,8 @@ struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	struct percpu_counter	xc_space_used;
+	struct percpu_counter	xc_curr_res;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
 
-- 
2.26.1.301.g55bc3eb7cb9



* [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
                   ` (2 preceding siblings ...)
  2020-05-12  9:28 ` [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters Dave Chinner
@ 2020-05-12  9:28 ` Dave Chinner
  2020-05-13 17:02   ` Brian Foster
  2020-05-12  9:28 ` [PATCH 5/5] [RFC] xfs: make CIL busy extent lists per-cpu Dave Chinner
  2020-05-12 10:25 ` [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
  5 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Next on the list for getting rid of the xc_cil_lock is making the CIL
itself per-cpu.

This requires a trade-off: we no longer move items forward in the
CIL; once they are on the CIL they remain there as we treat the
percpu lists as lockless.

XXX: preempt_disable() around the list operations to ensure they
stay local to the CPU.
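
i.e. something along these lines around the list insert (a sketch of
the intent; the preempt_disable()/preempt_enable() pair is not in the
patch below):

	preempt_disable();
	/* the list is private to this CPU while preemption is disabled */
	if (list_empty(&lip->li_cil))
		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
	preempt_enable();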

XXX: this needs CPU hotplug notifiers to clean up when cpus go
offline.
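
One possible shape for the hotplug side (very much a sketch:
xlog_cil_cpu_dead() and the global xfs_cil_for_hotplug pointer are
hypothetical, and a real version would need to find every mounted
log's CIL) would be a teardown callback that drains the dead CPU's
list while commits are locked out:

	/* hypothetical way for the callback to find the CIL, sketch only */
	static struct xfs_cil *xfs_cil_for_hotplug;

	static int
	xlog_cil_cpu_dead(
		unsigned int	cpu)
	{
		struct xfs_cil	*cil = xfs_cil_for_hotplug;

		/* lock out transaction commit and the push while we splice */
		down_write(&cil->xc_ctx_lock);
		list_splice_init(per_cpu_ptr(cil->xc_cil, cpu),
				 per_cpu_ptr(cil->xc_cil,
					     cpumask_first(cpu_online_mask)));
		up_write(&cil->xc_ctx_lock);
		return 0;
	}

	/* registered once at module init time */
	cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN, "xfs/cil:dead",
				  NULL, xlog_cil_cpu_dead);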

Performance now increases substantially - the transaction rate goes
from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
for the first time.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual
machine:

	    create time     rate            unlink time
unpatched	1m56s      533k/s+/-28k/s      2m34s
patched		1m49s	   523k/s+/-14k/s      2m00s

Notably, the system time for the create went up, while variance went
down. This indicates we're starting to hit some other contention
limit as we reduce the amount of time we spend contending on the
xc_cil_lock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 746c841757ed1..af444bc69a7cd 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -467,28 +467,28 @@ xlog_cil_insert_items(
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 
+	spin_unlock(&cil->xc_cil_lock);
+
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
+	 * Move everything to the tail of the local per-cpu CIL list.
 	 */
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
-
 		/* Skip items which aren't dirty in this transaction. */
 		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
 			continue;
 
 		/*
-		 * Only move the item if it isn't already at the tail. This is
-		 * to prevent a transient list_empty() state when reinserting
-		 * an item that is already the only item in the CIL.
+		 * If the item is already in the CIL, don't try to reposition it
+		 * because we don't know what per-cpu list it is on.
 		 */
-		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
-			list_move_tail(&lip->li_cil, &cil->xc_cil);
+		if (!list_empty(&lip->li_cil))
+			continue;
+		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
 	}
 
-	spin_unlock(&cil->xc_cil_lock);
-
 	if (tp->t_ticket->t_curr_res < 0)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 }
@@ -666,6 +666,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	LIST_HEAD		(cil_items);
+	int			cpu;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -687,7 +689,7 @@ xlog_cil_push_work(
 	 * move on to a new sequence number and so we have to be able to push
 	 * this sequence again later.
 	 */
-	if (list_empty(&cil->xc_cil)) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
 		cil->xc_push_seq = 0;
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
@@ -728,17 +730,21 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * Remove the items from the per-cpu CIL lists and then pull all the
+	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
+	 * so nothing can be adding or removing from the per-cpu lists here.
 	 */
+	/* XXX: hotplug! */
+	for_each_online_cpu(cpu) {
+		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+	}
+
 	lv = NULL;
 	num_iovecs = 0;
-	while (!list_empty(&cil->xc_cil)) {
+	while (!list_empty(&cil_items)) {
 		struct xfs_log_item	*item;
 
-		item = list_first_entry(&cil->xc_cil,
+		item = list_first_entry(&cil_items,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
 		if (!ctx->lv_chain)
@@ -927,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(!list_empty(&cil->xc_cil));
+	ASSERT(space_used != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -993,7 +999,8 @@ xlog_cil_push_now(
 	 * there's no work we need to do.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
+	if (percpu_counter_read(&cil->xc_curr_res) == 0 ||
+	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -1011,7 +1018,7 @@ xlog_cil_empty(
 	bool		empty = false;
 
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil))
+	if (percpu_counter_read(&cil->xc_curr_res) == 0)
 		empty = true;
 	spin_unlock(&cil->xc_push_lock);
 	return empty;
@@ -1163,7 +1170,7 @@ xlog_cil_force_lsn(
 	 * we would have found the context on the committing list.
 	 */
 	if (sequence == cil->xc_current_sequence &&
-	    !list_empty(&cil->xc_cil)) {
+	    percpu_counter_read(&cil->xc_curr_res) != 0) {
 		spin_unlock(&cil->xc_push_lock);
 		goto restart;
 	}
@@ -1223,6 +1230,7 @@ xlog_cil_init(
 	struct xfs_cil	*cil;
 	struct xfs_cil_ctx *ctx;
 	int		error = -ENOMEM;
+	int		cpu;
 
 	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
 	if (!cil)
@@ -1232,16 +1240,24 @@ xlog_cil_init(
 	if (!ctx)
 		goto out_free_cil;
 
+	/* XXX: CPU hotplug!!! */
+	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
+	if (!cil->xc_cil)
+		goto out_free_ctx;
+
+	for_each_possible_cpu(cpu) {
+		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+	}
+
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_ctx;
+		goto out_free_pcp_cil;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
 		goto out_free_space;
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
-	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
@@ -1262,6 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
+out_free_pcp_cil:
+	free_percpu(cil->xc_cil);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1274,6 +1292,7 @@ xlog_cil_destroy(
 	struct xlog	*log)
 {
 	struct xfs_cil  *cil = log->l_cilp;
+	int		cpu;
 
 	if (cil->xc_ctx) {
 		if (cil->xc_ctx->ticket)
@@ -1283,7 +1302,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_space_used);
 	percpu_counter_destroy(&cil->xc_curr_res);
 
-	ASSERT(list_empty(&cil->xc_cil));
+	for_each_possible_cpu(cpu) {
+		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+	}
+	free_percpu(cil->xc_cil);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index f5e79a7d44c8e..0bb982920d070 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -264,7 +264,7 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head	xc_cil;
+	struct list_head __percpu *xc_cil;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
-- 
2.26.1.301.g55bc3eb7cb9



* [PATCH 5/5] [RFC] xfs: make CIL busy extent lists per-cpu
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
                   ` (3 preceding siblings ...)
  2020-05-12  9:28 ` [PATCH 4/5] [RFC] xfs: per-cpu CIL lists Dave Chinner
@ 2020-05-12  9:28 ` Dave Chinner
  2020-05-12 10:25 ` [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-12  9:28 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We use the same percpu list trick with the busy extents as we do
with the CIL lists, and this gets rid of the last use of the
xc_cil_lock in the commit fast path.

As noted in the previous patch, it looked like we were approaching
another bottleneck, and that can be seen by the fact that
performance didn't substantially increase even though there is now
no lock contention in the commit path. The transaction rate
only increases slightly to 1.12M/sec.

Using a 32-way concurrent create/unlink on a 32p/16GB virtual
machine:

	     create time     rate            unlink time
unpatched       1m49s      523k/s+/-14k/s      2m00s
patched         1m48s      535k/s+/-24k/s      1m51s

So variance went back up, and performance improved slightly.
Profiling at this point indicates spinlock contention at the VFS
level (inode_sb_list_add() and dentry cache pathwalking) so
significant further gains will require VFS surgery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 35 ++++++++++++++++++-----------------
 fs/xfs/xfs_log_priv.h | 12 +++++++++++-
 2 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index af444bc69a7cd..d3a5f8478d64a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -461,13 +461,9 @@ xlog_cil_insert_items(
 	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
 	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
 
-	spin_lock(&cil->xc_cil_lock);
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
-
-	spin_unlock(&cil->xc_cil_lock);
+		list_splice_tail_init(&tp->t_busy, pcp_busy(cil));
 
 	/*
 	 * Now (re-)position everything modified at the tail of the CIL.
@@ -486,7 +482,7 @@ xlog_cil_insert_items(
 		 */
 		if (!list_empty(&lip->li_cil))
 			continue;
-		list_add_tail(&lip->li_cil, this_cpu_ptr(cil->xc_cil));
+		list_add_tail(&lip->li_cil, pcp_cil(cil));
 	}
 
 	if (tp->t_ticket->t_curr_res < 0)
@@ -733,10 +729,14 @@ xlog_cil_push_work(
 	 * Remove the items from the per-cpu CIL lists and then pull all the
 	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
 	 * so nothing can be adding or removing from the per-cpu lists here.
+	 *
+	 * Also splice the busy extents onto the context while we are walking
+	 * the percpu structure.
 	 */
 	/* XXX: hotplug! */
 	for_each_online_cpu(cpu) {
-		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_cil_cpu(cil, cpu), &cil_items);
+		list_splice_tail_init(pcp_busy_cpu(cil, cpu), &ctx->busy_extents);
 	}
 
 	lv = NULL;
@@ -933,7 +933,7 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there
 	 */
-	ASSERT(space_used != 0);
+	ASSERT(percpu_counter_read(&cil->xc_curr_res) != 0);
 
 	/*
 	 * don't do a background push if we haven't used up all the
@@ -1241,17 +1241,18 @@ xlog_cil_init(
 		goto out_free_cil;
 
 	/* XXX: CPU hotplug!!! */
-	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
-	if (!cil->xc_cil)
+	cil->xc_pcp = alloc_percpu_gfp(struct xfs_cil_pcpu, GFP_KERNEL);
+	if (!cil->xc_pcp)
 		goto out_free_ctx;
 
 	for_each_possible_cpu(cpu) {
-		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
+		INIT_LIST_HEAD(pcp_cil_cpu(cil, cpu));
+		INIT_LIST_HEAD(pcp_busy_cpu(cil, cpu));
 	}
 
 	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
 	if (error)
-		goto out_free_pcp_cil;
+		goto out_free_pcp;
 
 	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
 	if (error)
@@ -1259,7 +1260,6 @@ xlog_cil_init(
 
 	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_committing);
-	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
@@ -1278,8 +1278,8 @@ xlog_cil_init(
 
 out_free_space:
 	percpu_counter_destroy(&cil->xc_space_used);
-out_free_pcp_cil:
-	free_percpu(cil->xc_cil);
+out_free_pcp:
+	free_percpu(cil->xc_pcp);
 out_free_ctx:
 	kmem_free(ctx);
 out_free_cil:
@@ -1303,9 +1303,10 @@ xlog_cil_destroy(
 	percpu_counter_destroy(&cil->xc_curr_res);
 
 	for_each_possible_cpu(cpu) {
-		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
+		ASSERT(list_empty(pcp_cil_cpu(cil, cpu)));
+		ASSERT(list_empty(pcp_busy_cpu(cil, cpu)));
 	}
-	free_percpu(cil->xc_cil);
+	free_percpu(cil->xc_pcp);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 0bb982920d070..cfc22c9482ea4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -260,11 +260,16 @@ struct xfs_cil_ctx {
  * the commit LSN to be determined as well. This should make synchronous
  * operations almost as efficient as the old logging methods.
  */
+struct xfs_cil_pcpu {
+	struct list_head	p_cil;
+	struct list_head	p_busy_extents;
+};
+
 struct xfs_cil {
 	struct xlog		*xc_log;
 	struct percpu_counter	xc_space_used;
 	struct percpu_counter	xc_curr_res;
-	struct list_head __percpu *xc_cil;
+	struct xfs_cil_pcpu __percpu *xc_pcp;
 	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
@@ -278,6 +283,11 @@ struct xfs_cil {
 	struct work_struct	xc_push_work;
 } ____cacheline_aligned_in_smp;
 
+#define pcp_cil(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_cil)
+#define pcp_cil_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_cil)
+#define pcp_busy(cil)		&(this_cpu_ptr(cil->xc_pcp)->p_busy_extents)
+#define pcp_busy_cpu(cil, cpu)	&(per_cpu_ptr(cil->xc_pcp, cpu)->p_busy_extents)
+
 /*
  * The amount of log space we allow the CIL to aggregate is difficult to size.
  * Whatever we choose, we have to make sure we can get a reservation for the
-- 
2.26.1.301.g55bc3eb7cb9



* Re: [PATCH 0/5 v2] xfs: fix a couple of performance issues
  2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
                   ` (4 preceding siblings ...)
  2020-05-12  9:28 ` [PATCH 5/5] [RFC] xfs: make CIL busy extent lists per-cpu Dave Chinner
@ 2020-05-12 10:25 ` Dave Chinner
  5 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-12 10:25 UTC (permalink / raw)
  To: linux-xfs

On Tue, May 12, 2020 at 07:28:06PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> To follow up on the interesting performance gain I found, there's
> three RFC patches that follow up the two I posted earlier. These get
> rid of the CIL xc_cil_lock entirely by moving the entire CIL list
> and accounting to percpu structures.
> 
> The result is that I'm topping out at about 1.12M transactions/s
> and bottlenecking on VFS spinlocks in the dentry cache path walk
> code and the superblock inode list lock. The XFS CIL commit path
> mostly disappears from the profiles when creating about 600,000
> inodes/s:
> 
> 
> -   73.42%     0.12%  [kernel]               [k] path_openat
>    - 11.29% path_openat
>       - 7.12% xfs_vn_create
>          - 7.18% xfs_vn_mknod
>             - 7.30% xfs_generic_create
>                - 6.73% xfs_create
>                   - 2.69% xfs_dir_ialloc
>                      - 2.98% xfs_ialloc
>                         - 1.26% xfs_dialloc
>                            - 1.04% xfs_dialloc_ag
>                         - 1.02% xfs_setup_inode
>                            - 0.90% inode_sb_list_add
> >>>>>                         - 1.09% _raw_spin_lock
>                                  - 4.47% do_raw_spin_lock
>                                       4.05% __pv_queued_spin_lock_slowpath
>                         - 0.75% xfs_iget
>                   - 2.43% xfs_trans_commit
>                      - 3.47% __xfs_trans_commit
>                         - 7.47% xfs_log_commit_cil
>                              1.60% memcpy_erms
>                            - 1.35% xfs_buf_item_size
>                                 0.99% xfs_buf_item_size_segment.isra.0
>                              1.30% xfs_buf_item_format
>                   - 1.44% xfs_dir_createname
>                      - 1.60% xfs_dir2_node_addname
>                         - 1.08% xfs_dir2_leafn_add
>                              0.79% xfs_dir3_leaf_check_int
>       - 1.09% terminate_walk
>          - 1.09% dput
> >>>>>>      - 1.42% _raw_spin_lock
>                - 7.75% do_raw_spin_lock
>                     7.19% __pv_queued_spin_lock_slowpath
>       - 0.99% xfs_vn_lookup
>          - 0.96% xfs_lookup
>             - 1.01% xfs_dir_lookup
>                - 1.24% xfs_dir2_node_lookup
>                   - 1.09% xfs_da3_node_lookup_int
>       - 0.90% unlazy_walk
>          - 0.87% legitimize_root
>             - 0.94% __legitimize_path.isra.0
>                - 0.91% lockref_get_not_dead
> >>>>>>>           - 1.28% _raw_spin_lock
>                      - 6.85% do_raw_spin_lock
>                           6.29% __pv_queued_spin_lock_slowpath
>       - 0.82% d_lookup
>            __d_lookup
> .....
> +   39.21%     6.76%  [kernel]               [k] do_raw_spin_lock
> +   35.07%     0.16%  [kernel]               [k] _raw_spin_lock
> +   32.35%    32.13%  [kernel]               [k] __pv_queued_spin_lock_slowpath
> 
> So we're going 3-4x faster on this machine than without these
> patches, yet we're still burning about 40% of the CPU consumed by
> the workload on spinlocks.  IOWs, the XFS code is running 3-4x
> faster consuming half the CPU, and we're bashing on other locks
> now...

Just as a small followup, I started this with my usual 16-way
create/unlink workload, which ran at about 245k creates/s and unlinks
at about 150k/s.

With this patch set, I saw 492k creates/s (1m54s) and 420k
unlinks/s (2m18s) from just 16 threads. IOWs, I didn't need to go to
32 threads to see the perf improvement - as the above profiles
indicate, those extra 16 threads are effectively just creating heat
spinning on VFS locks...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-12 12:30   ` Brian Foster
  2020-05-12 16:09     ` Darrick J. Wong
  2020-05-12 21:53     ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Brian Foster @ 2020-05-12 12:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 07:28:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Seeing massive cpu usage from xfs_agino_range() on one machine;
> instruction level profiles look similar to another machine running
> the same workload, only one machien is consuming 10x as much CPU as
> the other and going much slower. The only real difference between
> the two machines is core count per socket. Both are running
> identical 16p/16GB virtual machine configurations
> 
...
> 
> It's an improvement, hence indicating that separation and further
> optimisation of read-only global filesystem data is worthwhile, but
> it clearly isn't the underlying issue causing this specific
> performance degradation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Pretty neat improvement. Could you share your test script that generated
the above? I have an 80 CPU box I'd be interested to give this a whirl
on...

>  fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
>  1 file changed, 29 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index aba5a15792792..712b3e2583316 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -88,21 +88,12 @@ typedef struct xfs_mount {
>  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
>  	char			*m_rtname;	/* realtime device name */
>  	char			*m_logname;	/* external log device name */
> -	int			m_bsize;	/* fs logical block size */
>  	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
>  	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
>  	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
> -	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> -	uint			m_allocsize_log;/* min write size log bytes */
> -	uint			m_allocsize_blocks; /* min write size blocks */
>  	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
>  	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
>  	struct xlog		*m_log;		/* log specific stuff */
> -	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> -	int			m_logbufs;	/* number of log buffers */
> -	int			m_logbsize;	/* size of each log buffer */
> -	uint			m_rsumlevels;	/* rt summary levels */
> -	uint			m_rsumsize;	/* size of rt summary, bytes */
>  	/*
>  	 * Optional cache of rt summary level per bitmap block with the
>  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> @@ -117,9 +108,15 @@ typedef struct xfs_mount {
>  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
>  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
>  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> +
> +	/*
> +	 * Read-only variables that are pre-calculated at mount time.
> +	 */

The intent here is to align the entire section below, right? If so, the
connection with the cache line alignment is a bit tenuous. Could we
tweak and/or add a sentence to the comment to be more explicit? I.e.:

	/*
	 * Align the following pre-calculated fields to a cache line to
	 * prevent cache line bouncing between frequently read and
	 * frequently written fields.
	 */

> +	int ____cacheline_aligned m_bsize;	/* fs logical block size */
>  	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
>  	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
>  	uint8_t			m_agno_log;	/* log #ag's */
> +	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
>  	uint			m_blockmask;	/* sb_blocksize-1 */
>  	uint			m_blockwsize;	/* sb_blocksize in words */
>  	uint			m_blockwmask;	/* blockwsize-1 */
> @@ -138,20 +135,35 @@ typedef struct xfs_mount {
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
>  	uint			m_alloc_set_aside; /* space we can't use */
>  	uint			m_ag_max_usable; /* max space per AG */
> -	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> -	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> -	struct mutex		m_growlock;	/* growfs mutex */
> +	int			m_dalign;	/* stripe unit */
> +	int			m_swidth;	/* stripe width */
> +	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> +	uint			m_allocsize_log;/* min write size log bytes */
> +	uint			m_allocsize_blocks; /* min write size blocks */
> +	int			m_logbufs;	/* number of log buffers */
> +	int			m_logbsize;	/* size of each log buffer */
> +	uint			m_rsumlevels;	/* rt summary levels */
> +	uint			m_rsumsize;	/* size of rt summary, bytes */
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
> -	uint64_t		m_flags;	/* global mount flags */
> -	bool			m_finobt_nores; /* no per-AG finobt resv. */
>  	uint			m_qflags;	/* quota status flags */
> +	uint64_t		m_flags;	/* global mount flags */
> +	int64_t			m_low_space[XFS_LOWSP_MAX];
> +	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
>  	struct xfs_trans_resv	m_resv;		/* precomputed res values */
> +						/* low free space thresholds */
> +	bool			m_always_cow;
> +	bool			m_fail_unmount;
> +	bool			m_finobt_nores; /* no per-AG finobt resv. */
> +	/*
> +	 * End of pre-calculated read-only variables
> +	 */

m_always_cow and m_fail_unmount are mutable via sysfs knobs, so they
are technically not read-only. I'm assuming that has no practical impact
on the performance optimization, but it might be worth leaving them
where they are to avoid confusion.

With those nits fixed:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +
> +	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> +	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> +	struct mutex		m_growlock;	/* growfs mutex */
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	int			m_dalign;	/* stripe unit */
> -	int			m_swidth;	/* stripe width */
> -	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
>  	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> @@ -160,8 +172,6 @@ typedef struct xfs_mount {
>  	struct delayed_work	m_cowblocks_work; /* background cow blocks
>  						     trimming */
>  	bool			m_update_sb;	/* sb needs update in mount */
> -	int64_t			m_low_space[XFS_LOWSP_MAX];
> -						/* low free space thresholds */
>  	struct xfs_kobj		m_kobj;
>  	struct xfs_kobj		m_error_kobj;
>  	struct xfs_kobj		m_error_meta_kobj;
> @@ -191,8 +201,6 @@ typedef struct xfs_mount {
>  	 */
>  	uint32_t		m_generation;
>  
> -	bool			m_always_cow;
> -	bool			m_fail_unmount;
>  #ifdef DEBUG
>  	/*
>  	 * Frequency with which errors are injected.  Replaces xfs_etest; the
> -- 
> 2.26.1.301.g55bc3eb7cb9
> 



* Re: [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  9:28 ` [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu Dave Chinner
@ 2020-05-12 12:31   ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-05-12 12:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 07:28:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a global atomic counter, and we are hitting it at a rate of
> half a million transactions a second, so it's bouncing the counter
> cacheline all over the place on large machines. Convert it to a
> per-cpu counter.
> 
> And .... oh wow, that was unexpected!
> 
> Concurrent create, 50 million inodes, identical 16p/16GB virtual
> machines on different physical hosts. Machine A has twice the CPU
> cores per socket of machine B:
> 
> 		unpatched	patched
> machine A:	3m45s		2m27s
> machine B:	4m13s		4m14s
> 
> Create rates:
> 		unpatched	patched
> machine A:	246k+/-15k	384k+/-10k
> machine B:	225k+/-13k	223k+/-11k
> 
> Concurrent rm of same 50 million inodes:
> 
> 		unpatched	patched
> machine A:	8m30s		3m09s
> machine B:	5m02s		4m51s
> 
> The transaction rate on the fast machine went from about 250k/sec to
> over 600k/sec, which indicates just how much of a bottleneck this
> atomic counter was.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks fairly straightforward. We're increasing the size of xfs_mount,
but it's already over a 4k page and there's only one per-mount:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_mount.h |  2 +-
>  fs/xfs/xfs_super.c | 12 +++++++++---
>  fs/xfs/xfs_trans.c |  6 +++---
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 712b3e2583316..af3d8b71e9591 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -84,6 +84,7 @@ typedef struct xfs_mount {
>  	 * extents or anything related to the rt device.
>  	 */
>  	struct percpu_counter	m_delalloc_blks;
> +	struct percpu_counter	m_active_trans;	/* in progress xact counter */
>  
>  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
>  	char			*m_rtname;	/* realtime device name */
> @@ -164,7 +165,6 @@ typedef struct xfs_mount {
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e80bd2c4c279e..bc4853525ce18 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -883,7 +883,7 @@ xfs_quiesce_attr(
>  	int	error = 0;
>  
>  	/* wait for all modifications to complete */
> -	while (atomic_read(&mp->m_active_trans) > 0)
> +	while (percpu_counter_sum(&mp->m_active_trans) > 0)
>  		delay(100);
>  
>  	/* force the log to unpin objects from the now complete transactions */
> @@ -902,7 +902,7 @@ xfs_quiesce_attr(
>  	 * Just warn here till VFS can correctly support
>  	 * read-only remount without racing.
>  	 */
> -	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
> +	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
>  
>  	xfs_log_quiesce(mp);
>  }
> @@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
>  	if (error)
>  		goto free_fdblocks;
>  
> +	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
> +	if (error)
> +		goto free_delalloc_blocks;
> +
>  	return 0;
>  
> +free_delalloc_blocks:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
>  	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
>  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
> +	percpu_counter_destroy(&mp->m_active_trans);
>  }
>  
>  static void
> @@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
>  	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
>  	spin_lock_init(&mp->m_perag_lock);
>  	mutex_init(&mp->m_growlock);
> -	atomic_set(&mp->m_active_trans, 0);
>  	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
>  	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
>  	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 28b983ff8b113..636df5017782e 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -68,7 +68,7 @@ xfs_trans_free(
>  	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
>  
>  	trace_xfs_trans_free(tp, _RET_IP_);
> -	atomic_dec(&tp->t_mountp->m_active_trans);
> +	percpu_counter_dec(&tp->t_mountp->m_active_trans);
>  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_end_intwrite(tp->t_mountp->m_super);
>  	xfs_trans_free_dqinfo(tp);
> @@ -126,7 +126,7 @@ xfs_trans_dup(
>  
>  	xfs_trans_dup_dqinfo(tp, ntp);
>  
> -	atomic_inc(&tp->t_mountp->m_active_trans);
> +	percpu_counter_inc(&tp->t_mountp->m_active_trans);
>  	return ntp;
>  }
>  
> @@ -275,7 +275,7 @@ xfs_trans_alloc(
>  	 */
>  	WARN_ON(resp->tr_logres > 0 &&
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
> -	atomic_inc(&mp->m_active_trans);
> +	percpu_counter_inc(&mp->m_active_trans);
>  
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
> -- 
> 2.26.1.301.g55bc3eb7cb9
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-12  9:28 ` [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters Dave Chinner
@ 2020-05-12 14:05   ` Brian Foster
  2020-05-12 23:36     ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-05-12 14:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> With the m_active_trans atomic bottleneck out of the way, the CIL
> xc_cil_lock is the next bottleneck that causes cacheline contention.
> This protects several things, the first of which is the CIL context
> reservation ticket and space usage counters.
> 
> We can lift them out of the xc_cil_lock by converting them to
> percpu counters. This involves two things, the first of which is
> lifting calculations and samples that don't actually need protecting
> from races outside the xc_cil lock.
> 
> The second is converting the counters to percpu counters and lifting
> them outside the lock. This requires a couple of tricky things to
> minimise initial state races and to ensure we take into account
> split reservations. We do this by erring on the "take the
> reservation just in case" side, which is largely lost in the noise of
> many frequent large transactions.
> 
> We use a trick with percpu_counter_add_batch() to ensure the global
> sum is updated immediately on first reservation, hence allowing us
> to use fast counter reads everywhere to determine if the CIL is
> empty or not, rather than using the list itself. This is important
> for later patches where the CIL is moved to percpu lists
> and hence cannot use list_empty() to detect an empty CIL. Hence we
> provide a low overhead, lockless mechanism for determining if the
> CIL is empty or not via this mechanism. All other percpu counter
> updates use a large batch count so they aggregate on the local CPU
> and minimise global sum updates.
> 
> The xc_ctx_lock rwsem protects draining the percpu counters to the
> context's ticket, similar to the way it allows access to the CIL
> without using the xc_cil_lock. i.e. the CIL push has exclusive
> access to the CIL, the context and the percpu counters while holding
> the xc_ctx_lock. This ensures that we can sum and zero the counters
> atomically from the perspective of the transaction commit side of
> the push. i.e. they reset to zero atomically with the CIL context
> swap and hence we don't need to have the percpu counters attached to
> the CIL context.
> 
> Performance wise, this increases the transaction rate from
> ~620,000/s to around 750,000/second. Using a 32-way concurrent
> create instead of 16-way on a 32p/16GB virtual machine:
> 
> 		create time	rate		unlink time
> unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> patched		  1m56s	     533k/s+/-28k/s	 2m34
> 
> Notably, the system time for the create went from 44m20s down to
> 38m37s, whilst going faster. There is more variance, but I think
> that is from the cacheline contention having inconsistent overhead.
> 
> XXX: probably should split into two patches
> 

Yes please. :)

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
>  fs/xfs/xfs_log_priv.h |  2 +
>  2 files changed, 72 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b43f0e8f43f2e..746c841757ed1 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -393,7 +393,7 @@ xlog_cil_insert_items(
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
>  	int			diff_iovecs = 0;
> -	int			iclog_space;
> +	int			iclog_space, space_used;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  

fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]

>  	ASSERT(tp);
> @@ -403,17 +403,16 @@ xlog_cil_insert_items(
>  	 * are done so it doesn't matter exactly how we update the CIL.
>  	 */
>  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> -
> -	spin_lock(&cil->xc_cil_lock);
> -
>  	/* account for space used by new iovec headers  */
> +
>  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
>  	len += iovhdr_res;
>  	ctx->nvecs += diff_iovecs;
>  
> -	/* attach the transaction to the CIL if it has any busy extents */
> -	if (!list_empty(&tp->t_busy))
> -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> +	/*
> +	 * The ticket can't go away from us here, so we can do racy sampling
> +	 * and precalculate everything.
> +	 */
>  
>  	/*
>  	 * Now transfer enough transaction reservation to the context ticket
> @@ -421,27 +420,28 @@ xlog_cil_insert_items(
>  	 * reservation has to grow as well as the current reservation as we
>  	 * steal from tickets so we can correctly determine the space used
>  	 * during the transaction commit.
> +	 *
> +	 * We use percpu_counter_add_batch() here to force the addition into the
> +	 * global sum immediately. This will result in percpu_counter_read() now
> +	 * always returning a non-zero value, and hence we'll only ever have a
> +	 * very short race window on new contexts.
>  	 */
> -	if (ctx->ticket->t_curr_res == 0) {
> +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
>  		ctx_res = ctx->ticket->t_unit_res;
> -		ctx->ticket->t_curr_res = ctx_res;
>  		tp->t_ticket->t_curr_res -= ctx_res;
> +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
>  	}

Ok, so we open a race here at the cost of stealing more reservation than
necessary from the transaction. Seems harmless, but I would like to see
some quantification/analysis on what a 'very short race window' is in
this context. Particularly as it relates to percpu functionality. Does
the window scale with cpu count, for example? It might not matter either
way because we expect any given transaction to accommodate the ctx res,
but it would be good to understand the behavior here so we can think
about potential side effects, if any.

>  
>  	/* do we need space for more log record headers? */
> -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> -	if (len > 0 && (ctx->space_used / iclog_space !=
> -				(ctx->space_used + len) / iclog_space)) {
> +	if (len > 0 && !ctx_res) {
> +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
>  		split_res = (len + iclog_space - 1) / iclog_space;
>  		/* need to take into account split region headers, too */
>  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> -		ctx->ticket->t_unit_res += split_res;
> -		ctx->ticket->t_curr_res += split_res;
>  		tp->t_ticket->t_curr_res -= split_res;
>  		ASSERT(tp->t_ticket->t_curr_res >= len);
>  	}

Similarly here, assume additional split reservation for every context
rather than checking each commit. Seems reasonable in principle, but
just from a cursory glance this doesn't cover the case of the context
expanding beyond more than two iclogs. IOW, the current logic adds
split_res if the size increase from the current transaction expands the
ctx into another iclog than before the transaction. The new logic only
seems to add split_res for the first transaction into the ctx. Also note
that len seems to be a factor in the calculation of split_res, but it's
not immediately clear to me what impact filtering the split_res
calculation as such has in that regard.

(BTW the comment above this hunk needs an update if we do end up with
some special logic here.)

Other than those bits this seems fairly sane to me.

Brian

>  	tp->t_ticket->t_curr_res -= len;
> -	ctx->space_used += len;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> @@ -458,6 +458,15 @@ xlog_cil_insert_items(
>  		xlog_print_trans(tp);
>  	}
>  
> +	percpu_counter_add_batch(&cil->xc_curr_res, split_res, 1000 * 1000);
> +	percpu_counter_add_batch(&cil->xc_space_used, len, 1000 * 1000);
> +
> +	spin_lock(&cil->xc_cil_lock);
> +
> +	/* attach the transaction to the CIL if it has any busy extents */
> +	if (!list_empty(&tp->t_busy))
> +		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> +
>  	/*
>  	 * Now (re-)position everything modified at the tail of the CIL.
>  	 * We do this here so we only need to take the CIL lock once during
> @@ -741,6 +750,18 @@ xlog_cil_push_work(
>  		num_iovecs += lv->lv_niovecs;
>  	}
>  
> +	/*
> +	 * Drain per cpu counters back to context so they can be re-initialised
> +	 * to zero before we allow commits to the new context we are about to
> +	 * switch to.
> +	 */
> +	ctx->space_used = percpu_counter_sum(&cil->xc_space_used);
> +	ctx->ticket->t_curr_res = percpu_counter_sum(&cil->xc_curr_res);
> +	ctx->ticket->t_unit_res = ctx->ticket->t_curr_res;
> +	percpu_counter_set(&cil->xc_space_used, 0);
> +	percpu_counter_set(&cil->xc_curr_res, 0);
> +
> +
>  	/*
>  	 * initialise the new context and attach it to the CIL. Then attach
>  	 * the current context to the CIL committing lsit so it can be found
> @@ -900,6 +921,7 @@ xlog_cil_push_background(
>  	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
> +	s64		space_used = percpu_counter_read(&cil->xc_space_used);
>  
>  	/*
>  	 * The cil won't be empty because we are called while holding the
> @@ -911,7 +933,7 @@ xlog_cil_push_background(
>  	 * don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> @@ -934,9 +956,9 @@ xlog_cil_push_background(
>  	 * If we are well over the space limit, throttle the work that is being
>  	 * done until the push work on this context has begun.
>  	 */
> -	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> -		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		ASSERT(space_used < log->l_logsize);
>  		xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock);
>  		return;
>  	}
> @@ -1200,16 +1222,23 @@ xlog_cil_init(
>  {
>  	struct xfs_cil	*cil;
>  	struct xfs_cil_ctx *ctx;
> +	int		error = -ENOMEM;
>  
>  	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
>  	if (!cil)
> -		return -ENOMEM;
> +		return error;
>  
>  	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
> -	if (!ctx) {
> -		kmem_free(cil);
> -		return -ENOMEM;
> -	}
> +	if (!ctx)
> +		goto out_free_cil;
> +
> +	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
> +	if (error)
> +		goto out_free_ctx;
> +
> +	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
> +	if (error)
> +		goto out_free_space;
>  
>  	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
>  	INIT_LIST_HEAD(&cil->xc_cil);
> @@ -1230,19 +1259,31 @@ xlog_cil_init(
>  	cil->xc_log = log;
>  	log->l_cilp = cil;
>  	return 0;
> +
> +out_free_space:
> +	percpu_counter_destroy(&cil->xc_space_used);
> +out_free_ctx:
> +	kmem_free(ctx);
> +out_free_cil:
> +	kmem_free(cil);
> +	return error;
>  }
>  
>  void
>  xlog_cil_destroy(
>  	struct xlog	*log)
>  {
> -	if (log->l_cilp->xc_ctx) {
> -		if (log->l_cilp->xc_ctx->ticket)
> -			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
> -		kmem_free(log->l_cilp->xc_ctx);
> +	struct xfs_cil  *cil = log->l_cilp;
> +
> +	if (cil->xc_ctx) {
> +		if (cil->xc_ctx->ticket)
> +			xfs_log_ticket_put(cil->xc_ctx->ticket);
> +		kmem_free(cil->xc_ctx);
>  	}
> +	percpu_counter_destroy(&cil->xc_space_used);
> +	percpu_counter_destroy(&cil->xc_curr_res);
>  
> -	ASSERT(list_empty(&log->l_cilp->xc_cil));
> -	kmem_free(log->l_cilp);
> +	ASSERT(list_empty(&cil->xc_cil));
> +	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index ec22c7a3867f1..f5e79a7d44c8e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -262,6 +262,8 @@ struct xfs_cil_ctx {
>   */
>  struct xfs_cil {
>  	struct xlog		*xc_log;
> +	struct percpu_counter	xc_space_used;
> +	struct percpu_counter	xc_curr_res;
>  	struct list_head	xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
> -- 
> 2.26.1.301.g55bc3eb7cb9
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
  2020-05-12 12:30   ` Brian Foster
@ 2020-05-12 16:09     ` Darrick J. Wong
  2020-05-12 21:43       ` Dave Chinner
  2020-05-12 21:53     ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2020-05-12 16:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, May 12, 2020 at 08:30:27AM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 07:28:07PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Seeing massive cpu usage from xfs_agino_range() on one machine;
> > instruction level profiles look similar to another machine running
> > the same workload, only one machine is consuming 10x as much CPU as
> > the other and going much slower. The only real difference between
> > the two machines is core count per socket. Both are running
> > identical 16p/16GB virtual machine configurations
> > 
> ...
> > 
> > It's an improvement, hence indicating that separation and further
> > optimisation of read-only global filesystem data is worthwhile, but
> > it clearly isn't the underlying issue causing this specific
> > performance degradation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Pretty neat improvement. Could you share your test script that generated
> the above? I have a 80 CPU box I'd be interested to give this a whirl
> on...
> 
> >  fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
> >  1 file changed, 29 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index aba5a15792792..712b3e2583316 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -88,21 +88,12 @@ typedef struct xfs_mount {
> >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> >  	char			*m_rtname;	/* realtime device name */
> >  	char			*m_logname;	/* external log device name */
> > -	int			m_bsize;	/* fs logical block size */
> >  	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
> >  	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
> >  	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
> > -	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> > -	uint			m_allocsize_log;/* min write size log bytes */
> > -	uint			m_allocsize_blocks; /* min write size blocks */
> >  	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
> >  	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
> >  	struct xlog		*m_log;		/* log specific stuff */
> > -	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> > -	int			m_logbufs;	/* number of log buffers */
> > -	int			m_logbsize;	/* size of each log buffer */
> > -	uint			m_rsumlevels;	/* rt summary levels */
> > -	uint			m_rsumsize;	/* size of rt summary, bytes */
> >  	/*
> >  	 * Optional cache of rt summary level per bitmap block with the
> >  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> > @@ -117,9 +108,15 @@ typedef struct xfs_mount {
> >  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
> >  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
> >  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> > +
> > +	/*
> > +	 * Read-only variables that are pre-calculated at mount time.
> > +	 */
> 
> The intent here is to align the entire section below, right? If so, the
> connection with the cache line alignment is a bit tenuous. Could we
> tweak and/or add a sentence to the comment to be more explicit? I.e.:
> 
> 	/*
> 	 * Align the following pre-calculated fields to a cache line to
> 	 * prevent cache line bouncing between frequently read and
> 	 * frequently written fields.
> 	 */

I kinda wish we laid out via comments each place we cross a 64b boundary
on a 64-bit CPU, but I guess seeing as some of these structures can
change size depending on config option and kernel version that's
probably just asking for confusion and madness.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D


> > +	int ____cacheline_aligned m_bsize;	/* fs logical block size */
> >  	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
> >  	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
> >  	uint8_t			m_agno_log;	/* log #ag's */
> > +	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
> >  	uint			m_blockmask;	/* sb_blocksize-1 */
> >  	uint			m_blockwsize;	/* sb_blocksize in words */
> >  	uint			m_blockwmask;	/* blockwsize-1 */
> > @@ -138,20 +135,35 @@ typedef struct xfs_mount {
> >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> >  	uint			m_alloc_set_aside; /* space we can't use */
> >  	uint			m_ag_max_usable; /* max space per AG */
> > -	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> > -	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> > -	struct mutex		m_growlock;	/* growfs mutex */
> > +	int			m_dalign;	/* stripe unit */
> > +	int			m_swidth;	/* stripe width */
> > +	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> > +	uint			m_allocsize_log;/* min write size log bytes */
> > +	uint			m_allocsize_blocks; /* min write size blocks */
> > +	int			m_logbufs;	/* number of log buffers */
> > +	int			m_logbsize;	/* size of each log buffer */
> > +	uint			m_rsumlevels;	/* rt summary levels */
> > +	uint			m_rsumsize;	/* size of rt summary, bytes */
> >  	int			m_fixedfsid[2];	/* unchanged for life of FS */
> > -	uint64_t		m_flags;	/* global mount flags */
> > -	bool			m_finobt_nores; /* no per-AG finobt resv. */
> >  	uint			m_qflags;	/* quota status flags */
> > +	uint64_t		m_flags;	/* global mount flags */
> > +	int64_t			m_low_space[XFS_LOWSP_MAX];
> > +	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> >  	struct xfs_trans_resv	m_resv;		/* precomputed res values */
> > +						/* low free space thresholds */
> > +	bool			m_always_cow;
> > +	bool			m_fail_unmount;
> > +	bool			m_finobt_nores; /* no per-AG finobt resv. */
> > +	/*
> > +	 * End of pre-calculated read-only variables
> > +	 */
> 
> m_always_cow and m_fail_unmount are mutable via sysfs knobs so
> technically not read-only. I'm assuming that has no practical impact on
> the performance optimization, but it might be worth leaving them where
> they are to avoid confusion.
> 
> With those nits fixed:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +
> > +	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> > +	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> > +	struct mutex		m_growlock;	/* growfs mutex */
> >  	uint64_t		m_resblks;	/* total reserved blocks */
> >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> > -	int			m_dalign;	/* stripe unit */
> > -	int			m_swidth;	/* stripe width */
> > -	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
> >  	atomic_t		m_active_trans;	/* number trans frozen */
> >  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> > @@ -160,8 +172,6 @@ typedef struct xfs_mount {
> >  	struct delayed_work	m_cowblocks_work; /* background cow blocks
> >  						     trimming */
> >  	bool			m_update_sb;	/* sb needs update in mount */
> > -	int64_t			m_low_space[XFS_LOWSP_MAX];
> > -						/* low free space thresholds */
> >  	struct xfs_kobj		m_kobj;
> >  	struct xfs_kobj		m_error_kobj;
> >  	struct xfs_kobj		m_error_meta_kobj;
> > @@ -191,8 +201,6 @@ typedef struct xfs_mount {
> >  	 */
> >  	uint32_t		m_generation;
> >  
> > -	bool			m_always_cow;
> > -	bool			m_fail_unmount;
> >  #ifdef DEBUG
> >  	/*
> >  	 * Frequency with which errors are injected.  Replaces xfs_etest; the
> > -- 
> > 2.26.1.301.g55bc3eb7cb9
> > 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
  2020-05-12 16:09     ` Darrick J. Wong
@ 2020-05-12 21:43       ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-12 21:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Tue, May 12, 2020 at 09:09:58AM -0700, Darrick J. Wong wrote:
> On Tue, May 12, 2020 at 08:30:27AM -0400, Brian Foster wrote:
> > On Tue, May 12, 2020 at 07:28:07PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Seeing massive cpu usage from xfs_agino_range() on one machine;
> > > instruction level profiles look similar to another machine running
> > > the same workload, only one machine is consuming 10x as much CPU as
> > > the other and going much slower. The only real difference between
> > > the two machines is core count per socket. Both are running
> > > identical 16p/16GB virtual machine configurations
> > > 
> > ...
> > > 
> > > It's an improvement, hence indicating that separation and further
> > > optimisation of read-only global filesystem data is worthwhile, but
> > > it clearly isn't the underlying issue causing this specific
> > > performance degradation.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Pretty neat improvement. Could you share your test script that generated
> > the above? I have a 80 CPU box I'd be interested to give this a whirl
> > on...
> > 
> > >  fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
> > >  1 file changed, 29 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index aba5a15792792..712b3e2583316 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -88,21 +88,12 @@ typedef struct xfs_mount {
> > >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> > >  	char			*m_rtname;	/* realtime device name */
> > >  	char			*m_logname;	/* external log device name */
> > > -	int			m_bsize;	/* fs logical block size */
> > >  	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
> > >  	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
> > >  	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
> > > -	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> > > -	uint			m_allocsize_log;/* min write size log bytes */
> > > -	uint			m_allocsize_blocks; /* min write size blocks */
> > >  	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
> > >  	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
> > >  	struct xlog		*m_log;		/* log specific stuff */
> > > -	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> > > -	int			m_logbufs;	/* number of log buffers */
> > > -	int			m_logbsize;	/* size of each log buffer */
> > > -	uint			m_rsumlevels;	/* rt summary levels */
> > > -	uint			m_rsumsize;	/* size of rt summary, bytes */
> > >  	/*
> > >  	 * Optional cache of rt summary level per bitmap block with the
> > >  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> > > @@ -117,9 +108,15 @@ typedef struct xfs_mount {
> > >  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
> > >  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
> > >  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> > > +
> > > +	/*
> > > +	 * Read-only variables that are pre-calculated at mount time.
> > > +	 */
> > 
> > The intent here is to align the entire section below, right? If so, the
> > connection with the cache line alignment is a bit tenuous. Could we
> > tweak and/or add a sentence to the comment to be more explicit? I.e.:
> > 
> > 	/*
> > 	 * Align the following pre-calculated fields to a cache line to
> > 	 * prevent cache line bouncing between frequently read and
> > 	 * frequently written fields.
> > 	 */
> 
> I kinda wish we laid out via comments each place we cross a 64b boundary
> on a 64-bit CPU, but I guess seeing as some of these structures can
> change size depending on config option and kernel version that's
> probably just asking for confusion and madness.

Yup, that's why we have tools like pahole. Stuff like lock debugging
or even different locking options change the size and layout of
structures like this, so you really have to look at pahole output to
determine what sits in the same cacheline for any given kernel
build.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount
  2020-05-12 12:30   ` Brian Foster
  2020-05-12 16:09     ` Darrick J. Wong
@ 2020-05-12 21:53     ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-12 21:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, May 12, 2020 at 08:30:27AM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 07:28:07PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Seeing massive cpu usage from xfs_agino_range() on one machine;
> > instruction level profiles look similar to another machine running
> > the same workload, only one machine is consuming 10x as much CPU as
> > the other and going much slower. The only real difference between
> > the two machines is core count per socket. Both are running
> > identical 16p/16GB virtual machine configurations
> > 
> ...
> > 
> > It's an improvement, hence indicating that separation and further
> > optimisation of read-only global filesystem data is worthwhile, but
> > it clearly isn't the underlying issue causing this specific
> > performance degradation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Pretty neat improvement. Could you share your test script that generated
> the above? I have a 80 CPU box I'd be interested to give this a whirl
> on...

I need to punch it out to a git repo somewhere...

> >  fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
> >  1 file changed, 29 insertions(+), 21 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index aba5a15792792..712b3e2583316 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -88,21 +88,12 @@ typedef struct xfs_mount {
> >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> >  	char			*m_rtname;	/* realtime device name */
> >  	char			*m_logname;	/* external log device name */
> > -	int			m_bsize;	/* fs logical block size */
> >  	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
> >  	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
> >  	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
> > -	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> > -	uint			m_allocsize_log;/* min write size log bytes */
> > -	uint			m_allocsize_blocks; /* min write size blocks */
> >  	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
> >  	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
> >  	struct xlog		*m_log;		/* log specific stuff */
> > -	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> > -	int			m_logbufs;	/* number of log buffers */
> > -	int			m_logbsize;	/* size of each log buffer */
> > -	uint			m_rsumlevels;	/* rt summary levels */
> > -	uint			m_rsumsize;	/* size of rt summary, bytes */
> >  	/*
> >  	 * Optional cache of rt summary level per bitmap block with the
> >  	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> > @@ -117,9 +108,15 @@ typedef struct xfs_mount {
> >  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
> >  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
> >  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> > +
> > +	/*
> > +	 * Read-only variables that are pre-calculated at mount time.
> > +	 */
> 
> The intent here is to align the entire section below, right? If so, the
> connection with the cache line alignment is a bit tenuous. Could we
> tweak and/or add a sentence to the comment to be more explicit? I.e.:
> 
> 	/*
> 	 * Align the following pre-calculated fields to a cache line to
> 	 * prevent cache line bouncing between frequently read and
> 	 * frequently written fields.
> 	 */

I can make it more verbose, but we don't tend to verbosely comment
variables tagged with __read_mostly as tagging them implies that
they need separation from frequently written variables. I can't use
__read_mostly here (it's for global variables and affects the data
segment they are placed in) so I just put a simple comment in. I
should have called them "Read-mostly variables" in the comment
because...

> > +	bool			m_always_cow;
> > +	bool			m_fail_unmount;
> > +	bool			m_finobt_nores; /* no per-AG finobt resv. */
> > +	/*
> > +	 * End of pre-calculated read-only variables
> > +	 */
> 
> m_always_cow and m_fail_unmount are mutable via sysfs knobs so
> technically not read-only.

... yes, these are read-mostly variables, not "read-only". The
optimisation still stands for them, as they may never be changed
after the initial user setup post mount....
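
(FWIW, the __read_mostly tag itself only helps for globals, e.g.
something like this made-up example:

	/*
	 * The annotation groups the variable with other rarely written
	 * globals in a separate data section, away from frequently
	 * dirtied data. It can't do anything for individual members of
	 * struct xfs_mount, hence the manual layout and comment here.
	 */
	static int xfs_some_tunable __read_mostly;

so manual struct layout plus a comment is all we have for this case.)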

I'll do another round on this patch to take all the comments into
account...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-12 14:05   ` Brian Foster
@ 2020-05-12 23:36     ` Dave Chinner
  2020-05-13 12:09       ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-12 23:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With the m_active_trans atomic bottleneck out of the way, the CIL
> > xc_cil_lock is the next bottleneck that causes cacheline contention.
> > This protects several things, the first of which is the CIL context
> > reservation ticket and space usage counters.
> > 
> > We can lift them out of the xc_cil_lock by converting them to
> > percpu counters. This involves two things, the first of which is
> > lifting calculations and samples that don't actually need protecting
> > from races outside the xc_cil lock.
> > 
> > The second is converting the counters to percpu counters and lifting
> > them outside the lock. This requires a couple of tricky things to
> > minimise initial state races and to ensure we take into account
> > split reservations. We do this by erring on the "take the
> > reservation just in case" side, which is largely lost in the noise of
> > many frequent large transactions.
> > 
> > We use a trick with percpu_counter_add_batch() to ensure the global
> > sum is updated immediately on first reservation, hence allowing us
> > to use fast counter reads everywhere to determine if the CIL is
> > empty or not, rather than using the list itself. This is important
> > for later patches where the CIL is moved to percpu lists
> > and hence cannot use list_empty() to detect an empty CIL. Hence we
> > provide a low overhead, lockless mechanism for determining if the
> > CIL is empty or not via this mechanism. All other percpu counter
> > updates use a large batch count so they aggregate on the local CPU
> > and minimise global sum updates.
> > 
> > The xc_ctx_lock rwsem protects draining the percpu counters to the
> > context's ticket, similar to the way it allows access to the CIL
> > without using the xc_cil_lock. i.e. the CIL push has exclusive
> > access to the CIL, the context and the percpu counters while holding
> > the xc_ctx_lock. This ensures that we can sum and zero the counters
> > atomically from the perspective of the transaction commit side of
> > the push. i.e. they reset to zero atomically with the CIL context
> > swap and hence we don't need to have the percpu counters attached to
> > the CIL context.
> > 
> > Performance wise, this increases the transaction rate from
> > ~620,000/s to around 750,000/second. Using a 32-way concurrent
> > create instead of 16-way on a 32p/16GB virtual machine:
> > 
> > 		create time	rate		unlink time
> > unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> > patched		  1m56s	     533k/s+/-28k/s	 2m34
> > 
> > Notably, the system time for the create went from 44m20s down to
> > 38m37s, whilst going faster. There is more variance, but I think
> > that is from the cacheline contention having inconsistent overhead.
> > 
> > XXX: probably should split into two patches
> > 
> 
> Yes please. :)
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
> >  fs/xfs/xfs_log_priv.h |  2 +
> >  2 files changed, 72 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b43f0e8f43f2e..746c841757ed1 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -393,7 +393,7 @@ xlog_cil_insert_items(
> >  	struct xfs_log_item	*lip;
> >  	int			len = 0;
> >  	int			diff_iovecs = 0;
> > -	int			iclog_space;
> > +	int			iclog_space, space_used;
> >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> >  
> 
> fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
> fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]
> 
> >  	ASSERT(tp);
> > @@ -403,17 +403,16 @@ xlog_cil_insert_items(
> >  	 * are done so it doesn't matter exactly how we update the CIL.
> >  	 */
> >  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> > -
> > -	spin_lock(&cil->xc_cil_lock);
> > -
> >  	/* account for space used by new iovec headers  */
> > +
> >  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> >  	len += iovhdr_res;
> >  	ctx->nvecs += diff_iovecs;
> >  
> > -	/* attach the transaction to the CIL if it has any busy extents */
> > -	if (!list_empty(&tp->t_busy))
> > -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> > +	/*
> > +	 * The ticket can't go away from us here, so we can do racy sampling
> > +	 * and precalculate everything.
> > +	 */
> >  
> >  	/*
> >  	 * Now transfer enough transaction reservation to the context ticket
> > @@ -421,27 +420,28 @@ xlog_cil_insert_items(
> >  	 * reservation has to grow as well as the current reservation as we
> >  	 * steal from tickets so we can correctly determine the space used
> >  	 * during the transaction commit.
> > +	 *
> > +	 * We use percpu_counter_add_batch() here to force the addition into the
> > +	 * global sum immediately. This will result in percpu_counter_read() now
> > +	 * always returning a non-zero value, and hence we'll only ever have a
> > +	 * very short race window on new contexts.
> >  	 */
> > -	if (ctx->ticket->t_curr_res == 0) {
> > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> >  		ctx_res = ctx->ticket->t_unit_res;
> > -		ctx->ticket->t_curr_res = ctx_res;
> >  		tp->t_ticket->t_curr_res -= ctx_res;
> > +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
> >  	}
> 
> Ok, so we open a race here at the cost of stealing more reservation than
> necessary from the transaction. Seems harmless, but I would like to see
> some quantification/analysis on what a 'very short race window' is in
> this context.

About 20 instructions when the value is zero. The number of racers
will be dependent on how many threads are blocked in commit on the
xc_ctx_lock while a CIL push is in progress. The unit reservation
stolen here for the CIL is:

	xfs_log_calc_unit_res(mp, 0)
		~= 4 * sizeof(xlog_op_header) +
		   sizeof(xlog_trans_header) +
		   sizeof(sector) +
		   log stripe unit roundoff

So for a typical 4k log sector, we are talking about ~6kB of unit
reservation per thread that races here. For a 256k log stripe unit,
then it's going to be about 520kB per racing thread.

That said, every thread that races has this reservation available,
and the amount reserved adds to the space used in the CIL. Hence the
only downside of racing here is that the CIL background pushes
earlier because it hits the threshold sooner. That isn't a huge
issue - if we can't push immediately then the CIL will
run to the hard limit and block commits there; that overrun space is
larger than the amount of space "wasted" by racing commits here.

> Particularly as it relates to percpu functionality. Does
> the window scale with cpu count, for example? It might not matter either

Not really. We need a thundering herd to cause issues, and this
occurs after formatting an item so we won't get a huge thundering
herd even when lots of threads block on the xc_ctx_lock waiting for
a push to complete.

> way because we expect any given transaction to accommodate the ctx res,
> but it would be good to understand the behavior here so we can think
> about potential side effects, if any.

I haven't been able to come up with any adverse side effects except
for "performance might drop a bit if we reserve too much and push
early", but that is tempered by the fact that performance goes up
much more than we might lose by getting rid of the xc_cil_lock
bottleneck.

> >  	/* do we need space for more log record headers? */
> > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > -				(ctx->space_used + len) / iclog_space)) {
> > +	if (len > 0 && !ctx_res) {
> > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> >  		split_res = (len + iclog_space - 1) / iclog_space;
> >  		/* need to take into account split region headers, too */
> >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > -		ctx->ticket->t_unit_res += split_res;
> > -		ctx->ticket->t_curr_res += split_res;
> >  		tp->t_ticket->t_curr_res -= split_res;
> >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> >  	}
> 
> Similarly here, assume additional split reservation for every
> context rather than checking each commit. Seems reasonable in
> principle, but just from a cursory glance this doesn't cover the
> case of the context expanding beyond more than two iclogs.  IOW,
> the current logic adds split_res if the size increase from the
> current transaction expands the ctx into another iclog than before
> the transaction. The new logic only seems to add split_res for the
> first transaction into the ctx. Also note

No, I changed it to apply to any vector length longer than a single
iclog except for transactions that have taken the unit reservation
above.

Sampling the space used isn't accurate here, and we do not want to
be doing an accurate sum across all CPUs, hence trying to detect a
reservation crossing an iclog boundary is difficult. Hence I just
took the reservation for anything that is guaranteed to cross an
iclog boundary. 

> that len seems to be a factor in the calculation of split_res, but it's
> not immediately clear to me what impact filtering the split_res
> calculation as such has in that regard.

The len used in the split_res calc guarantees that we always have at
least 1 iclog header and op header accounted for by a split, and if the
vector length is greater than a single iclog it will include a
header for every iclog that the vector may span. i.e. if len >
iclog_space, it will reserve 2 extra iclog headers and op headers as
it may split across 3 iclogs....
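
To put some made-up numbers on it: with 32kB iclogs, iclog_space is a
bit under 32kB, so a 40kB vector chain gives

	split_res = (40k + 32k - 1) / 32k = 2

i.e. two extra iclog and op headers, which covers the worst case of the
chain starting near the end of one iclog, filling a second and spilling
into a third.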

> (BTW the comment above this hunk needs an update if we do end up with
> some special logic here.)

Definitely. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-12 23:36     ` Dave Chinner
@ 2020-05-13 12:09       ` Brian Foster
  2020-05-13 21:52         ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-05-13 12:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > On Tue, May 12, 2020 at 07:28:09PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > With the m_active_trans atomic bottleneck out of the way, the CIL
> > > xc_cil_lock is the next bottleneck that causes cacheline contention.
> > > This protects several things, the first of which is the CIL context
> > > reservation ticket and space usage counters.
> > > 
> > > We can lift them out of the xc_cil_lock by converting them to
> > > percpu counters. This involves two things, the first of which is
> > > lifting calculations and samples that don't actually need protecting
> > > from races outside the xc_cil lock.
> > > 
> > > The second is converting the counters to percpu counters and lifting
> > > them outside the lock. This requires a couple of tricky things to
> > > minimise initial state races and to ensure we take into account
> > > split reservations. We do this by erring on the "take the
> > > reservation just in case" side, which is largely lost in the noise of
> > > many frequent large transactions.
> > > 
> > > We use a trick with percpu_counter_add_batch() to ensure the global
> > > sum is updated immediately on first reservation, hence allowing us
> > > to use fast counter reads everywhere to determine if the CIL is
> > > empty or not, rather than using the list itself. This is important
> > > for later patches where the CIL is moved to percpu lists
> > > and hence cannot use list_empty() to detect an empty CIL. Hence we
> > > provide a low overhead, lockless mechanism for determining if the
> > > CIL is empty or not via this mechanism. All other percpu counter
> > > updates use a large batch count so they aggregate on the local CPU
> > > and minimise global sum updates.
> > > 
> > > The xc_ctx_lock rwsem protects draining the percpu counters to the
> > > context's ticket, similar to the way it allows access to the CIL
> > > without using the xc_cil_lock. i.e. the CIL push has exclusive
> > > access to the CIL, the context and the percpu counters while holding
> > > the xc_ctx_lock. This ensures that we can sum and zero the counters
> > > atomically from the perspective of the transaction commit side of
> > > the push. i.e. they reset to zero atomically with the CIL context
> > > swap and hence we don't need to have the percpu counters attached to
> > > the CIL context.
> > > 
> > > Performance wise, this increases the transaction rate from
> > > ~620,000/s to around 750,000/second. Using a 32-way concurrent
> > > create instead of 16-way on a 32p/16GB virtual machine:
> > > 
> > > 		create time	rate		unlink time
> > > unpatched	  2m03s      472k/s+/-9k/s	 3m6s
> > > patched		  1m56s	     533k/s+/-28k/s	 2m34
> > > 
> > > Notably, the system time for the create went from 44m20s down to
> > > 38m37s, whilst going faster. There is more variance, but I think
> > > that is from the cacheline contention having inconsistent overhead.
> > > 
> > > XXX: probably should split into two patches
> > > 
> > 
> > Yes please. :)
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c  | 99 ++++++++++++++++++++++++++++++-------------
> > >  fs/xfs/xfs_log_priv.h |  2 +
> > >  2 files changed, 72 insertions(+), 29 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b43f0e8f43f2e..746c841757ed1 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -393,7 +393,7 @@ xlog_cil_insert_items(
> > >  	struct xfs_log_item	*lip;
> > >  	int			len = 0;
> > >  	int			diff_iovecs = 0;
> > > -	int			iclog_space;
> > > +	int			iclog_space, space_used;
> > >  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> > >  
> > 
> > fs/xfs//xfs_log_cil.c: In function ‘xlog_cil_insert_items’:
> > fs/xfs//xfs_log_cil.c:396:21: warning: unused variable ‘space_used’ [-Wunused-variable]
> > 
> > >  	ASSERT(tp);
> > > @@ -403,17 +403,16 @@ xlog_cil_insert_items(
> > >  	 * are done so it doesn't matter exactly how we update the CIL.
> > >  	 */
> > >  	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> > > -
> > > -	spin_lock(&cil->xc_cil_lock);
> > > -
> > >  	/* account for space used by new iovec headers  */
> > > +
> > >  	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> > >  	len += iovhdr_res;
> > >  	ctx->nvecs += diff_iovecs;
> > >  
> > > -	/* attach the transaction to the CIL if it has any busy extents */
> > > -	if (!list_empty(&tp->t_busy))
> > > -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
> > > +	/*
> > > +	 * The ticket can't go away from us here, so we can do racy sampling
> > > +	 * and precalculate everything.
> > > +	 */
> > >  
> > >  	/*
> > >  	 * Now transfer enough transaction reservation to the context ticket
> > > @@ -421,27 +420,28 @@ xlog_cil_insert_items(
> > >  	 * reservation has to grow as well as the current reservation as we
> > >  	 * steal from tickets so we can correctly determine the space used
> > >  	 * during the transaction commit.
> > > +	 *
> > > +	 * We use percpu_counter_add_batch() here to force the addition into the
> > > +	 * global sum immediately. This will result in percpu_counter_read() now
> > > +	 * always returning a non-zero value, and hence we'll only ever have a
> > > +	 * very short race window on new contexts.
> > >  	 */
> > > -	if (ctx->ticket->t_curr_res == 0) {
> > > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> > >  		ctx_res = ctx->ticket->t_unit_res;
> > > -		ctx->ticket->t_curr_res = ctx_res;
> > >  		tp->t_ticket->t_curr_res -= ctx_res;
> > > +		percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);
> > >  	}
> > 
> > Ok, so we open a race here at the cost of stealing more reservation than
> > necessary from the transaction. Seems harmless, but I would like to see
> > some quantification/analysis on what a 'very short race window' is in
> > this context.
> 
> About 20 instructions when the value is zero. The number of racers
> will be dependent on how many threads are blocked in commit on the
> xc_ctx_lock while a CIL push is in progress. The unit reservation
> stolen here for the CIL is:
> 

Well we're also in the transaction commit path, which these changes are
intended to improve. The post push situation makes sense for a worst
case scenario, though, as that can probably line up a bunch of CPUs
behind the ctx lock.

> 	xfs_log_calc_unit_res(mp, 0)
> 		~= 4 * sizeof(xlog_op_header) +
> 		   sizeof(xlog_trans_header) +
> 		   sizeof(sector) +
> 		   log stripe unit roundoff
> 
> So for a typical 4k log sector, we are talking about ~6kB of unit
> reservation per thread that races here. For a 256k log stripe unit,
> then it's going to be about 520kB per racing thread.
> 
> That said, every thread that races has this reservation available,
> and the amount reserved adds to the space used in the CIL. Hence the
> only downside of racing here is that the CIL background pushes
> earlier because it hits the threshold sooner. That isn't a huge
> issue - if we can't push immediately then the CIL will
> run to the hard limit and block commits there; that overrun space is
> larger than amount of space "wasted" by racing commits here.
> 

Right, it's more of a tradeoff of holding on to unused reservation a bit
longer for performance. Note that the background CIL push is based on
context size, not reservation size, so I don't see how that would be
affected by this particular race.

> > Particularly as it relates to percpu functionality. Does
> > the window scale with cpu count, for example? It might not matter either
> 
> Not really. We need a thundering herd to cause issues, and this
> occurs after formatting an item so we won't get a huge thundering
> herd even when lots of threads block on the xc_ctx_lock waiting for
> a push to complete.
> 

It would be nice to have some debug code somewhere that somehow or
another asserts or warns if the CIL reservation exceeds some
insane/unexpected heuristic based on the current size of the context. I
don't know what that code or heuristic looks like (i.e. some multiple
of the ctx size?) so I'm obviously handwaving. Just something to think
about if we can come up with a way to accomplish that opportunistically.
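
As a strawman (threshold and placement completely made up), I'm
thinking of something along the lines of this in xlog_cil_push_work(),
right after the percpu counters are drained back into the ctx:

	/*
	 * Debug check: the CIL ticket should only ever hold a small
	 * multiple of the space the context has actually used, plus
	 * one unit reservation for the checkpoint headers.
	 */
	ASSERT(ctx->ticket->t_curr_res <
	       4 * ctx->space_used + xfs_log_calc_unit_res(log->l_mp, 0));

i.e. if the steal-on-commit heuristics ever run away from us, this
would trip in a DEBUG build.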

> > way because we expect any given transaction to accommodate the ctx res,
> > but it would be good to understand the behavior here so we can think
> > about potential side effects, if any.
> 
> I haven't been able to come up with any adverse side effects except
> for "performance might drop a bit if we reserve too much and push
> early", but that is tempered by the fact that performance goes up
> much more than we might lose by getting rid of the xc_cil_lock
> bottleneck.
> 

FWIW, a more extreme test vector could be to steal the remainder of the
transaction reservation for the CIL ticket and see how that affects
things. That's probably more suited for a local test than something to
live in the upstream code, though.
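
For example (local hack only, not suggesting this for merge), replace
the conditional steal in xlog_cil_insert_items() with something like:

	/* XXX: stress hack - always steal all but what this commit needs */
	ctx_res = tp->t_ticket->t_curr_res - len;
	tp->t_ticket->t_curr_res -= ctx_res;
	percpu_counter_add_batch(&cil->xc_curr_res, ctx_res, ctx_res - 1);

and see whether the accounting and push thresholds still behave sanely.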

> > >  	/* do we need space for more log record headers? */
> > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > -				(ctx->space_used + len) / iclog_space)) {
> > > +	if (len > 0 && !ctx_res) {
> > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > >  		/* need to take into account split region headers, too */
> > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > -		ctx->ticket->t_unit_res += split_res;
> > > -		ctx->ticket->t_curr_res += split_res;
> > >  		tp->t_ticket->t_curr_res -= split_res;
> > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > >  	}
> > 
> > Similarly here, assume additional split reservation for every
> > context rather than checking each commit. Seems reasonable in
> > principle, but just from a cursory glance this doesn't cover the
> > case of the context expanding beyond more than two iclogs.  IOW,
> > the current logic adds split_res if the size increase from the
> > current transaction expands the ctx into another iclog than before
> > the transaction. The new logic only seems to add split_res for the
> > first transaction into the ctx. Also note
> 
> No, I changed it to apply to any vector length longer than a single
> iclog except for transactions that have taken the unit reservation
> above.
> 

Ok, I had the ctx_res logic inverted in my head. So it's not that
split_res is only added for the first transaction, but rather we treat
every transaction that didn't contribute unit res as if it crosses an
iclog boundary. That seems much more reasonable, though it does add to
the "overreservation" of the ticket so I'll reemphasize the request for
some form of debug/trace check that helps analyze runtime CIL ticket
reservation accounting. ;)

OTOH, this skips the split_res in the case where a single large
multi-iclog transaction happens to be the first in the ctx, right? That
doesn't seem that unlikely a scenario considering minimum iclog and
worst case transaction unit res sizes. It actually makes me wonder what
happens if the CIL ticket underruns.. :P

> Sampling the space used isn't accurate here, and we do not want to
> be doing an accurate sum across all CPUs, hence trying to detect a
> reservation crossing an iclog boundary is difficult. Hence I just
> took the reservation for anything that is guaranteed to cross an
> iclog boundary. 
> 

Right.. though it looks more like we take the reservation for anything
that might possibly cross an iclog boundary vs. anything that is
guaranteed to do so, because we can't really establish the latter
without batching up the current size.

Brian

> > that len seems to be a factor in the calculation of split_res, but it's
> > not immediately clear to me what impact filtering the split_res
> > calculation as such has in that regard.
> 
> The len used in the split_res calc guarantees that we always have at
> least 1 log and op header accounted for by a split, and if the
> vector length is greater than a single iclog it will include a
> header for every iclog that the vector may span. i.e. if len >
> iclog_space, it will reserve 2 extra iclog headers and op headers as
> it may split across 3 iclogs....
> 
> > (BTW the comment above this hunk needs an update if we do end up with
> > some special logic here.)
> 
> Definitely. :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-12  9:28 ` [PATCH 4/5] [RFC] xfs: per-cpu CIL lists Dave Chinner
@ 2020-05-13 17:02   ` Brian Foster
  2020-05-13 23:33     ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-05-13 17:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 07:28:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Next on the list to getting rid of the xc_cil_lock is making the CIL
> itself per-cpu.
> 
> This requires a trade-off: we no longer move items forward in the
> CIL; once they are on the CIL they remain there as we treat the
> percpu lists as lockless.
> 
> XXX: preempt_disable() around the list operations to ensure they
> stay local to the CPU.
> 
> XXX: this needs CPU hotplug notifiers to clean up when cpus go
> offline.
> 
> Performance now increases substantially - the transaction rate goes
> from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
> for the first time.
> 
> Using a 32-way concurrent create/unlink on a 32p/16GB virtual
> machine:
> 
> 	    create time     rate            unlink time
> unpatched	1m56s      533k/s+/-28k/s      2m34s
> patched		1m49s	   523k/s+/-14k/s      2m00s
> 
> Notably, the system time for the create went up, while variance went
> down. This indicates we're starting to hit some other contention
> limit as we reduce the amount of time we spend contending on the
> xc_cil_lock.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
>  fs/xfs/xfs_log_priv.h |  2 +-
>  2 files changed, 45 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 746c841757ed1..af444bc69a7cd 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
...
> @@ -687,7 +689,7 @@ xlog_cil_push_work(
>  	 * move on to a new sequence number and so we have to be able to push
>  	 * this sequence again later.
>  	 */
> -	if (list_empty(&cil->xc_cil)) {
> +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {

It seems reasonable, but I need to think a bit more about the whole
percpu list thing. In the meantime, one thing that comes to mind is the
more of these list_empty() -> percpu_counter_read() translations I see
the less I like it because we're leaking this inherent raciness to
different contexts. Whether it's ultimately safe or not, it's subject to
change and far too subtle and indirect for my taste. 

Could we replace all of the direct ->xc_cil list checks with an atomic
bitop (i.e. XLOG_CIL_EMPTY) or something similar in the xfs_cil? AFAICT,
that could be done in a separate patch and we could ultimately reuse it
to close the race with the initial ctx reservation (via
test_and_set_bit()) because it's otherwise set in the same function. Hm?

Brian

>  		cil->xc_push_seq = 0;
>  		spin_unlock(&cil->xc_push_lock);
>  		goto out_skip;
> @@ -728,17 +730,21 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * Remove the items from the per-cpu CIL lists and then pull all the
> +	 * log vectors off the items. We hold the xc_ctx_lock exclusively here,
> +	 * so nothing can be adding or removing from the per-cpu lists here.
>  	 */
> +	/* XXX: hotplug! */
> +	for_each_online_cpu(cpu) {
> +		list_splice_tail_init(per_cpu_ptr(cil->xc_cil, cpu), &cil_items);
> +	}
> +
>  	lv = NULL;
>  	num_iovecs = 0;
> -	while (!list_empty(&cil->xc_cil)) {
> +	while (!list_empty(&cil_items)) {
>  		struct xfs_log_item	*item;
>  
> -		item = list_first_entry(&cil->xc_cil,
> +		item = list_first_entry(&cil_items,
>  					struct xfs_log_item, li_cil);
>  		list_del_init(&item->li_cil);
>  		if (!ctx->lv_chain)
> @@ -927,7 +933,7 @@ xlog_cil_push_background(
>  	 * The cil won't be empty because we are called while holding the
>  	 * context lock so whatever we added to the CIL will still be there
>  	 */
> -	ASSERT(!list_empty(&cil->xc_cil));
> +	ASSERT(space_used != 0);
>  
>  	/*
>  	 * don't do a background push if we haven't used up all the
> @@ -993,7 +999,8 @@ xlog_cil_push_now(
>  	 * there's no work we need to do.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> -	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
> +	if (percpu_counter_read(&cil->xc_curr_res) == 0 ||
> +	    push_seq <= cil->xc_push_seq) {
>  		spin_unlock(&cil->xc_push_lock);
>  		return;
>  	}
> @@ -1011,7 +1018,7 @@ xlog_cil_empty(
>  	bool		empty = false;
>  
>  	spin_lock(&cil->xc_push_lock);
> -	if (list_empty(&cil->xc_cil))
> +	if (percpu_counter_read(&cil->xc_curr_res) == 0)
>  		empty = true;
>  	spin_unlock(&cil->xc_push_lock);
>  	return empty;
> @@ -1163,7 +1170,7 @@ xlog_cil_force_lsn(
>  	 * we would have found the context on the committing list.
>  	 */
>  	if (sequence == cil->xc_current_sequence &&
> -	    !list_empty(&cil->xc_cil)) {
> +	    percpu_counter_read(&cil->xc_curr_res) != 0) {
>  		spin_unlock(&cil->xc_push_lock);
>  		goto restart;
>  	}
> @@ -1223,6 +1230,7 @@ xlog_cil_init(
>  	struct xfs_cil	*cil;
>  	struct xfs_cil_ctx *ctx;
>  	int		error = -ENOMEM;
> +	int		cpu;
>  
>  	cil = kmem_zalloc(sizeof(*cil), KM_MAYFAIL);
>  	if (!cil)
> @@ -1232,16 +1240,24 @@ xlog_cil_init(
>  	if (!ctx)
>  		goto out_free_cil;
>  
> +	/* XXX: CPU hotplug!!! */
> +	cil->xc_cil = alloc_percpu_gfp(struct list_head, GFP_KERNEL);
> +	if (!cil->xc_cil)
> +		goto out_free_ctx;
> +
> +	for_each_possible_cpu(cpu) {
> +		INIT_LIST_HEAD(per_cpu_ptr(cil->xc_cil, cpu));
> +	}
> +
>  	error = percpu_counter_init(&cil->xc_space_used, 0, GFP_KERNEL);
>  	if (error)
> -		goto out_free_ctx;
> +		goto out_free_pcp_cil;
>  
>  	error = percpu_counter_init(&cil->xc_curr_res, 0, GFP_KERNEL);
>  	if (error)
>  		goto out_free_space;
>  
>  	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
> -	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
>  	spin_lock_init(&cil->xc_push_lock);
> @@ -1262,6 +1278,8 @@ xlog_cil_init(
>  
>  out_free_space:
>  	percpu_counter_destroy(&cil->xc_space_used);
> +out_free_pcp_cil:
> +	free_percpu(cil->xc_cil);
>  out_free_ctx:
>  	kmem_free(ctx);
>  out_free_cil:
> @@ -1274,6 +1292,7 @@ xlog_cil_destroy(
>  	struct xlog	*log)
>  {
>  	struct xfs_cil  *cil = log->l_cilp;
> +	int		cpu;
>  
>  	if (cil->xc_ctx) {
>  		if (cil->xc_ctx->ticket)
> @@ -1283,7 +1302,10 @@ xlog_cil_destroy(
>  	percpu_counter_destroy(&cil->xc_space_used);
>  	percpu_counter_destroy(&cil->xc_curr_res);
>  
> -	ASSERT(list_empty(&cil->xc_cil));
> +	for_each_possible_cpu(cpu) {
> +		ASSERT(list_empty(per_cpu_ptr(cil->xc_cil, cpu)));
> +	}
> +	free_percpu(cil->xc_cil);
>  	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index f5e79a7d44c8e..0bb982920d070 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -264,7 +264,7 @@ struct xfs_cil {
>  	struct xlog		*xc_log;
>  	struct percpu_counter	xc_space_used;
>  	struct percpu_counter	xc_curr_res;
> -	struct list_head	xc_cil;
> +	struct list_head __percpu *xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
>  	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
> -- 
> 2.26.1.301.g55bc3eb7cb9
> 



* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-13 12:09       ` Brian Foster
@ 2020-05-13 21:52         ` Dave Chinner
  2020-05-14  1:50           ` Dave Chinner
  2020-05-14 13:43           ` Brian Foster
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-13 21:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > Particularly as it relates to percpu functionality. Does
> > > the window scale with cpu count, for example? It might not matter either
> > 
> > Not really. We need a thundering herd to cause issues, and this
> > occurs after formatting an item so we won't get a huge thundering
> > herd even when lots of threads block on the xc_ctx_lock waiting for
> > a push to complete.
> > 
> 
> It would be nice to have some debug code somewhere that somehow or
> another asserts or warns if the CIL reservation exceeds some
> insane/unexpected heuristic based on the current size of the context. I
> don't know what that code or heuristic looks like (i.e. multiple factors
> of the ctx size?) so I'm obviously handwaving. Just something to think
> about if we can come up with a way to accomplish that opportunistically.

I don't think there is a reliable mechanism that can be used here.
At one end of the scale we have the valid case of a synchronous
inode modification on a log with a 256k stripe unit. So it's valid
to have a CIL reservation of ~550kB for a single item that consumes
~700 bytes of log space.

OTOH, we might be freeing extents on a massively fragmented file and
filesystem, so we're pushing 200kB+ transactions into the CIL for
every rolling transaction. On a filesystem with a 512 byte log
sector size and no LSU, the CIL reservations are dwarfed by the
actual metadata being logged...

I'd suggest that looking at the ungrant trace for the CIL ticket
once it has committed will tell us exactly how much the reservation
was over-estimated, as the unused portion of the reservation will be
returned to the reserve grant head at this point in time.

> > > way because we expect any given transaction to accommodate the ctx res,
> > > but it would be good to understand the behavior here so we can think
> > > about potential side effects, if any.
> > 
> > I haven't been able to come up with any adverse side effects except
> > for "performance might drop a bit if we reserve too much and push
> > early", but that is tempered by the fact that performance goes up
> > much more than we might lose by getting rid of the xc_cil_lock
> > bottleneck.
> > 
> 
> FWIW, a more extreme test vector could be to steal the remainder of the
> transaction reservation for the CIL ticket and see how that affects
> things. That's probably more suited for a local test than something to
> live in the upstream code, though.

It will only affect performance. Worst case is that it degrades the
CIL to behave like the original logging code that wrote direct to
iclogs. i.e. it defeats the in memory aggregation of delayed logging
and effectively writes directly to the iclogs.

Which, really, is what the -o wsync or -o dirsync mount options largely
do by making a large number of transactions synchronous.

In short, performance goes down if we reserve too much, but nothing
incorrect should occur.

> 
> > > >  	/* do we need space for more log record headers? */
> > > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > > -				(ctx->space_used + len) / iclog_space)) {
> > > > +	if (len > 0 && !ctx_res) {
> > > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > > >  		/* need to take into account split region headers, too */
> > > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > > -		ctx->ticket->t_unit_res += split_res;
> > > > -		ctx->ticket->t_curr_res += split_res;
> > > >  		tp->t_ticket->t_curr_res -= split_res;
> > > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > > >  	}
> > > 
> > > Similarly here, assume additional split reservation for every
> > > context rather than checking each commit. Seems reasonable in
> > > principle, but just from a cursory glance this doesn't cover the
> > > case of the context expanding beyond more than two iclogs.  IOW,
> > > the current logic adds split_res if the size increase from the
> > > current transaction expands the ctx into another iclog than before
> > > the transaction. The new logic only seems to add split_res for the
> > > first transaction into the ctx. Also note
> > 
> > No, I changed it to apply to any vector length longer than a single
> > iclog except for transactions that have taken the unit reservation
> > above.
> > 
> 
> Ok, I had the ctx_res logic inverted in my head. So it's not that
> split_res is only added for the first transaction, but rather we treat
> every transaction that didn't contribute unit res as if it crosses an
> iclog boundary. That seems much more reasonable, though it does add to
> the "overreservation" of the ticket so I'll reemphasize the request for
> some form of debug/trace check that helps analyze runtime CIL ticket
> reservation accounting. ;)

I can add some trace points, but I think we already have the
tracepoints we need to understand how much overestimation occurs.
i.e. trace_xfs_log_ticket_ungrant() from the CIL push worker
context.

> OTOH, this skips the split_res in the case where a single large
> multi-iclog transaction happens to be the first in the ctx, right?

True, that's easy enough to fix.

> That
> doesn't seem that unlikely a scenario considering minimum iclog and
> worst case transaction unit res sizes. It actually makes me wonder what
> happens if the CIL ticket underruns.. :P

xlog_write() will catch the underrun when we go to write the commit
record via xlog_commit_record(), dump the ticket and shutdown the
filesystem.
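
i.e. the check at the top of xlog_write() does roughly this (paraphrasing
from memory, so treat it as a sketch rather than the exact source):

	if (ticket->t_curr_res < 0) {
		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
			      "ctx ticket reservation ran out. Need to up reservation");
		xlog_print_tic_res(log->l_mp, ticket);
		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
	}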

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-13 17:02   ` Brian Foster
@ 2020-05-13 23:33     ` Dave Chinner
  2020-05-14 13:44       ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-13 23:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, May 13, 2020 at 01:02:37PM -0400, Brian Foster wrote:
> On Tue, May 12, 2020 at 07:28:10PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Next on the list to getting rid of the xc_cil_lock is making the CIL
> > itself per-cpu.
> > 
> > This requires a trade-off: we no longer move items forward in the
> > CIL; once they are on the CIL they remain there as we treat the
> > percpu lists as lockless.
> > 
> > XXX: preempt_disable() around the list operations to ensure they
> > stay local to the CPU.
> > 
> > XXX: this needs CPU hotplug notifiers to clean up when cpus go
> > offline.
> > 
> > Performance now increases substantially - the transaction rate goes
> > from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
> > for the first time.
> > 
> > Using a 32-way concurrent create/unlink on a 32p/16GB virtual
> > machine:
> > 
> > 	    create time     rate            unlink time
> > unpatched	1m56s      533k/s+/-28k/s      2m34s
> > patched		1m49s	   523k/s+/-14k/s      2m00s
> > 
> > Notably, the system time for the create went up, while variance went
> > down. This indicates we're starting to hit some other contention
> > limit as we reduce the amount of time we spend contending on the
> > xc_cil_lock.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
> >  fs/xfs/xfs_log_priv.h |  2 +-
> >  2 files changed, 45 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 746c841757ed1..af444bc69a7cd 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> ...
> > @@ -687,7 +689,7 @@ xlog_cil_push_work(
> >  	 * move on to a new sequence number and so we have to be able to push
> >  	 * this sequence again later.
> >  	 */
> > -	if (list_empty(&cil->xc_cil)) {
> > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> 
> It seems reasonable, but I need to think a bit more about the whole
> percpu list thing. In the meantime, one thing that comes to mind is the
> more of these list_empty() -> percpu_counter_read() translations I see
> the less I like it because we're leaking this inherent raciness to
> different contexts. Whether it's ultimately safe or not, it's subject to
> change and far too subtle and indirect for my taste. 

Well, all the critical list_empty(&cil->xc_cil) checks are done
under the xc_push_lock, so I'd suggest that if we zero the counters
under the push lock when switching contexts, and put the initial
zero->non-zero counter transition under the same lock, we'll get
exact checks without requiring a spinlock/atomic in the fast path
and have all the right memory barriers in place such that races
can't happen...
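
i.e. on the push side, roughly this (sketch only, reusing the counter
names from the RFC):

	spin_lock(&cil->xc_push_lock);
	/*
	 * Zero the context counters and switch to the new context under
	 * the push lock so the emptiness checks that run under
	 * xc_push_lock stay exact.
	 */
	percpu_counter_set(&cil->xc_curr_res, 0);
	percpu_counter_set(&cil->xc_space_used, 0);
	cil->xc_ctx = new_ctx;
	cil->xc_current_sequence = new_ctx->sequence;
	spin_unlock(&cil->xc_push_lock);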

> Could we replace all of the direct ->xc_cil list checks with an atomic
> bitop (i.e. XLOG_CIL_EMPTY) or something similar in the xfs_cil? AFAICT,
> that could be done in a separate patch and we could ultimately reuse it
> to close the race with the initial ctx reservation (via
> test_and_set_bit()) because it's otherwise set in the same function. Hm?

test_and_set_bit() still locks the memory bus and so requires
exclusive access to the cacheline. Avoiding locked bus ops
(atomics, spinlocks, etc) in the fast path is the problem
I'm trying to solve with this patchset. IOWs, this isn't a viable
solution to a scalability problem caused by many CPUs all trying to
access the same cacheline exclusively.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-13 21:52         ` Dave Chinner
@ 2020-05-14  1:50           ` Dave Chinner
  2020-05-14  2:49             ` Dave Chinner
  2020-05-14 13:43           ` Brian Foster
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-14  1:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > Particularly as it relates to percpu functionality. Does
> > > > the window scale with cpu count, for example? It might not matter either
> > > 
> > > Not really. We need a thundering herd to cause issues, and this
> > > occurs after formatting an item so we won't get a huge thundering
> > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > a push to complete.
> > > 
> > 
> > It would be nice to have some debug code somewhere that somehow or
> > another asserts or warns if the CIL reservation exceeds some
> > insane/unexpected heuristic based on the current size of the context. I
> > don't know what that code or heuristic looks like (i.e. multiple factors
> > of the ctx size?) so I'm obviously handwaving. Just something to think
> > about if we can come up with a way to accomplish that opportunistically.
> 
> I don't think there is a reliable mechanism that can be used here.
> At one end of the scale we have the valid case of a synchronous
> inode modification on a log with a 256k stripe unit. So it's valid
> to have a CIL reservation of ~550kB for a single item that consumes
> ~700 bytes of log space.
> 
> OTOH, we might be freeing extents on a massively fragmented file and
> filesystem, so we're pushing 200kB+ transactions into the CIL for
> every rolling transaction. On a filesystem with a 512 byte log
> sector size and no LSU, the CIL reservations are dwarfed by the
> actual metadata being logged...
> 
> I'd suggest that looking at the ungrant trace for the CIL ticket
> once it has committed will tell us exactly how much the reservation
> was over-estimated, as the unused portion of the reservation will be
> returned to the reserve grant head at this point in time.

Typical for this workload is a CIL ticket that looks like this at
ungrant time:

t_curr_res 13408 t_unit_res 231100
t_curr_res 9240 t_unit_res 140724
t_curr_res 46284 t_unit_res 263964
t_curr_res 29780 t_unit_res 190020
t_curr_res 38044 t_unit_res 342016
t_curr_res 21636 t_unit_res 321476
t_curr_res 21576 t_unit_res 263964
t_curr_res 42200 t_unit_res 411852
t_curr_res 21636 t_unit_res 292720
t_curr_res 62740 t_unit_res 514552
t_curr_res 17456 t_unit_res 284504
t_curr_res 29852 t_unit_res 411852
t_curr_res 13384 t_unit_res 206452
t_curr_res 70956 t_unit_res 518660
t_curr_res 70908 t_unit_res 333800
t_curr_res 50404 t_unit_res 518660
t_curr_res 17480 t_unit_res 321476
t_curr_res 33948 t_unit_res 436500
t_curr_res 17492 t_unit_res 317368
t_curr_res 50392 t_unit_res 489904
t_curr_res 13360 t_unit_res 325584
t_curr_res 66812 t_unit_res 506336
t_curr_res 33924 t_unit_res 366664
t_curr_res 70932 t_unit_res 551524
t_curr_res 29852 t_unit_res 374880
t_curr_res 25720 t_unit_res 494012
t_curr_res 42152 t_unit_res 506336
t_curr_res 21684 t_unit_res 543308
t_curr_res 29840 t_unit_res 440608
t_curr_res 46320 t_unit_res 551524
t_curr_res 21624 t_unit_res 387204
t_curr_res 29840 t_unit_res 522768

So we are looking at a reservation of up to 500KB, and typically
using all but a few 10s of KB of it.

I'll use this as the ballpark for the lockless code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-14  1:50           ` Dave Chinner
@ 2020-05-14  2:49             ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-14  2:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, May 14, 2020 at 11:50:55AM +1000, Dave Chinner wrote:
> On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> > On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > > Particularly as it relates to percpu functionality. Does
> > > > > the window scale with cpu count, for example? It might not matter either
> > > > 
> > > > Not really. We need a thundering herd to cause issues, and this
> > > > occurs after formatting an item so we won't get a huge thundering
> > > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > > a push to complete.
> > > > 
> > > 
> > > It would be nice to have some debug code somewhere that somehow or
> > > another asserts or warns if the CIL reservation exceeds some
> > > insane/unexpected heuristic based on the current size of the context. I
> > > don't know what that code or heuristic looks like (i.e. multiple factors
> > > of the ctx size?) so I'm obviously handwaving. Just something to think
> > > about if we can come up with a way to accomplish that opportunistically.
> > 
> > I don't think there is a reliable mechanism that can be used here.
> > At one end of the scale we have the valid case of a synchronous
> > inode modification on a log with a 256k stripe unit. So it's valid
> > to have a CIL reservation of ~550kB for a single item that consumes
> > ~700 bytes of log space.
> > 
> > OTOH, we might be freeing extents on a massively fragmented file and
> > filesystem, so we're pushing 200kB+ transactions into the CIL for
> > every rolling transaction. On a filesystem with a 512 byte log
> > sector size and no LSU, the CIL reservations are dwarfed by the
> > actual metadata being logged...
> > 
> > I'd suggest that looking at the ungrant trace for the CIL ticket
> > once it has committed will tell us exactly how much the reservation
> > was over-estimated, as the unused portion of the reservation will be
> > returned to the reserve grant head at this point in time.
> 
> Typical for this workload is a CIl ticket that looks like this at
> ungrant time:
> 
> t_curr_res 13408 t_unit_res 231100
> t_curr_res 9240 t_unit_res 140724
> t_curr_res 46284 t_unit_res 263964
> t_curr_res 29780 t_unit_res 190020
> t_curr_res 38044 t_unit_res 342016
> t_curr_res 21636 t_unit_res 321476
> t_curr_res 21576 t_unit_res 263964
> t_curr_res 42200 t_unit_res 411852
> t_curr_res 21636 t_unit_res 292720
> t_curr_res 62740 t_unit_res 514552
> t_curr_res 17456 t_unit_res 284504
> t_curr_res 29852 t_unit_res 411852
> t_curr_res 13384 t_unit_res 206452
> t_curr_res 70956 t_unit_res 518660
> t_curr_res 70908 t_unit_res 333800
> t_curr_res 50404 t_unit_res 518660
> t_curr_res 17480 t_unit_res 321476
> t_curr_res 33948 t_unit_res 436500
> t_curr_res 17492 t_unit_res 317368
> t_curr_res 50392 t_unit_res 489904
> t_curr_res 13360 t_unit_res 325584
> t_curr_res 66812 t_unit_res 506336
> t_curr_res 33924 t_unit_res 366664
> t_curr_res 70932 t_unit_res 551524
> t_curr_res 29852 t_unit_res 374880
> t_curr_res 25720 t_unit_res 494012
> t_curr_res 42152 t_unit_res 506336
> t_curr_res 21684 t_unit_res 543308
> t_curr_res 29840 t_unit_res 440608
> t_curr_res 46320 t_unit_res 551524
> t_curr_res 21624 t_unit_res 387204
> t_curr_res 29840 t_unit_res 522768
> 
> So we are looking at a reservation of up to 500KB, and typically
> using all but a few 10s of KB of it.
> 
> I'll use this as the ballpark for the lockless code.

And, yeah, just reserving a split res for everything blows it out by
a factor of 100.

I've worked out via experimentation and measurement that taking a
split reservation when the current transaction is near the iclog
threshold results in roughly doubling the above reservation, but I
haven't had it go anywhere near an underrun at this point....

I'll work with this heuristic for the moment, and spend some more
time thinking about stuff I've just learnt getting to this point.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters
  2020-05-13 21:52         ` Dave Chinner
  2020-05-14  1:50           ` Dave Chinner
@ 2020-05-14 13:43           ` Brian Foster
  1 sibling, 0 replies; 25+ messages in thread
From: Brian Foster @ 2020-05-14 13:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 14, 2020 at 07:52:41AM +1000, Dave Chinner wrote:
> On Wed, May 13, 2020 at 08:09:59AM -0400, Brian Foster wrote:
> > On Wed, May 13, 2020 at 09:36:27AM +1000, Dave Chinner wrote:
> > > On Tue, May 12, 2020 at 10:05:44AM -0400, Brian Foster wrote:
> > > > Particularly as it relates to percpu functionality. Does
> > > > the window scale with cpu count, for example? It might not matter either
> > > 
> > > Not really. We need a thundering herd to cause issues, and this
> > > occurs after formatting an item so we won't get a huge thundering
> > > herd even when lots of threads block on the xc_ctx_lock waiting for
> > > a push to complete.
> > > 
> > 
> > It would be nice to have some debug code somewhere that somehow or
> > another asserts or warns if the CIL reservation exceeds some
> > insane/unexpected heuristic based on the current size of the context. I
> > don't know what that code or heuristic looks like (i.e. multiple factors
> > of the ctx size?) so I'm obviously handwaving. Just something to think
> > about if we can come up with a way to accomplish that opportunistically.
> 
> I don't think there is a reliable mechanism that can be used here.
> At one end of the scale we have the valid case of a synchronous
> inode modification on a log with a 256k stripe unit. So it's valid
> to have a CIL reservation of ~550kB for a single item that consumes
> ~700 bytes of log space.
> 

Perhaps ctx size isn't a sufficient baseline by itself. That said, log
stripe unit is still fixed and afaict centralized to the base unit res
calculation of the CIL ctx and regular transaction tickets. So I'm not
convinced we couldn't come up with something useful on the push side
that factors lsunit into the metric. It doesn't have to be perfect, just
something conservative enough to catch consuming reservation beyond
expected worst case conditions without causing false positives.
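
Just to make the handwaving slightly more concrete, I'm thinking of
something along these lines on the push side (the field names are from
the RFC, the factor of 4 is made up, purely illustrative):

#ifdef DEBUG
	{
		s64	used = percpu_counter_sum(&cil->xc_space_used);
		int	hdrs = howmany(used,
				       log->l_iclog_size - log->l_iclog_hsize);
		int	worst = hdrs * (log->l_iclog_hsize +
					sizeof(struct xlog_op_header)) +
				log->l_mp->m_sb.sb_logsunit;

		/* arbitrary factor, doesn't handle the empty ctx case */
		WARN_ON_ONCE(ctx->ticket->t_unit_res > 4 * worst);
	}
#endif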

Re: your followups, I'll put more thought into it when the percpu
algorithm is more settled..

> OTOH, we might be freeing extents on a massively fragmented file and
> filesystem, so we're pushing 200kB+ transactions into the CIL for
> every rolling transaction. On a filesystem with a 512 byte log
> sector size and no LSU, the CIL reservations are dwarfed by the
> actual metadata being logged...
> 
> I'd suggest that looking at the ungrant trace for the CIL ticket
> once it has committed will tell us exactly how much the reservation
> was over-estimated, as the unused portion of the reservation will be
> returned to the reserve grant head at this point in time.
> 

Yeah, that works for me.

> > > > way because we expect any given transaction to accommodate the ctx res,
> > > > but it would be good to understand the behavior here so we can think
> > > > about potential side effects, if any.
> > > 
> > > I haven't been able to come up with any adverse side effects except
> > > for "performance might drop a bit if we reserve too much and push
> > > early", but that is tempered by the fact that performance goes up
> > > much more than we might lose by getting rid of the xc_cil_lock
> > > bottleneck.
> > > 
> > 
> > FWIW, a more extreme test vector could be to steal the remainder of the
> > transaction reservation for the CIL ticket and see how that affects
> > things. That's probably more suited for a local test than something to
> > live in the upstream code, though.
> 
> It will only affect performance. Worst case is that it degrades the
> CIL to behave like the original logging code that wrote direct to
> iclogs. i.e. it defeats the in memory aggregation of delayed logging
> and effectively writes directly to the iclogs.
> 
> Which, really, is what using the -o wsync or -o dirsync largely do
> by making a large number of transactions synchrnonous.
> 
> In short, performance goes down if we reserve too much, but nothing
> incorrect should occur.
> 

Ok.

> > 
> > > > >  	/* do we need space for more log record headers? */
> > > > > -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > > -	if (len > 0 && (ctx->space_used / iclog_space !=
> > > > > -				(ctx->space_used + len) / iclog_space)) {
> > > > > +	if (len > 0 && !ctx_res) {
> > > > > +		iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> > > > >  		split_res = (len + iclog_space - 1) / iclog_space;
> > > > >  		/* need to take into account split region headers, too */
> > > > >  		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> > > > > -		ctx->ticket->t_unit_res += split_res;
> > > > > -		ctx->ticket->t_curr_res += split_res;
> > > > >  		tp->t_ticket->t_curr_res -= split_res;
> > > > >  		ASSERT(tp->t_ticket->t_curr_res >= len);
> > > > >  	}
> > > > 
> > > > Similarly here, assume additional split reservation for every
> > > > context rather than checking each commit. Seems reasonable in
> > > > principle, but just from a cursory glance this doesn't cover the
> > > > case of the context expanding beyond more than two iclogs.  IOW,
> > > > the current logic adds split_res if the size increase from the
> > > > current transaction expands the ctx into another iclog than before
> > > > the transaction. The new logic only seems to add split_res for the
> > > > first transaction into the ctx. Also note
> > > 
> > > No, I changed it to apply to any vector length longer than a single
> > > iclog except for transactions that have taken the unit reservation
> > > above.
> > > 
> > 
> > Ok, I had the ctx_res logic inverted in my head. So it's not that
> > split_res is only added for the first transaction, but rather we treat
> > every transaction that didn't contribute unit res as if it crosses an
> > iclog boundary. That seems much more reasonable, though it does add to
> > the "overreservation" of the ticket so I'll reemphasize the request for
> > some form of debug/trace check that helps analyze runtime CIL ticket
> > reservation accounting. ;)
> 
> I can add some trace points, but I think we already have the
> tracepoints we need to understand how much overestimation occurs.
> i.e. trace_xfs_log_ticket_ungrant() from the CIL push worker
> context.
> 

Sure..

> > OTOH, this skips the split_res in the case where a single large
> > multi-iclog transaction happens to be the first in the ctx, right?
> 
> True, that's easy enough to fix.
> 
> > That
> > doesn't seem that unlikely a scenario considering minimum iclog and
> > worst case transaction unit res sizes. It actually makes me wonder what
> > happens if the CIL ticket underruns.. :P
> 
> xlog_write() will catch the underrun when we go to write the commit
> record via xlog_commit_record(), dump the ticket and shutdown the
> filesystem.
> 

Ah, right. Note that's not the last place we take from ->t_curr_res in
the xlog_write() path. xlog_state_get_iclog_space() steals a bit as well
if we're at the top of an iclog so it might be worth bolstering that
check.
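
E.g. something like this where xlog_state_get_iclog_space() charges the
record header to the ticket (hand-waved, not even compile tested):

	ticket->t_curr_res -= log->l_iclog_hsize;
	if (ticket->t_curr_res < 0) {
		/* catch a CIL ticket underrun on the stolen header space */
		xlog_print_tic_res(log->l_mp, ticket);
		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
	}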

Brian

> CHeers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-13 23:33     ` Dave Chinner
@ 2020-05-14 13:44       ` Brian Foster
  2020-05-14 22:46         ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-05-14 13:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 14, 2020 at 09:33:58AM +1000, Dave Chinner wrote:
> On Wed, May 13, 2020 at 01:02:37PM -0400, Brian Foster wrote:
> > On Tue, May 12, 2020 at 07:28:10PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Next on the list to getting rid of the xc_cil_lock is making the CIL
> > > itself per-cpu.
> > > 
> > > This requires a trade-off: we no longer move items forward in the
> > > CIL; once they are on the CIL they remain there as we treat the
> > > percpu lists as lockless.
> > > 
> > > XXX: preempt_disable() around the list operations to ensure they
> > > stay local to the CPU.
> > > 
> > > XXX: this needs CPU hotplug notifiers to clean up when cpus go
> > > offline.
> > > 
> > > Performance now increases substantially - the transaction rate goes
> > > from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
> > > for the first time.
> > > 
> > > Using a 32-way concurrent create/unlink on a 32p/16GB virtual
> > > machine:
> > > 
> > > 	    create time     rate            unlink time
> > > unpatched	1m56s      533k/s+/-28k/s      2m34s
> > > patched		1m49s	   523k/s+/-14k/s      2m00s
> > > 
> > > Notably, the system time for the create went up, while variance went
> > > down. This indicates we're starting to hit some other contention
> > > limit as we reduce the amount of time we spend contending on the
> > > xc_cil_lock.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
> > >  fs/xfs/xfs_log_priv.h |  2 +-
> > >  2 files changed, 45 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index 746c841757ed1..af444bc69a7cd 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > ...
> > > @@ -687,7 +689,7 @@ xlog_cil_push_work(
> > >  	 * move on to a new sequence number and so we have to be able to push
> > >  	 * this sequence again later.
> > >  	 */
> > > -	if (list_empty(&cil->xc_cil)) {
> > > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> > 
> > It seems reasonable, but I need to think a bit more about the whole
> > percpu list thing. In the meantime, one thing that comes to mind is the
> > more of these list_empty() -> percpu_counter_read() translations I see
> > the less I like it because we're leaking this inherent raciness to
> > different contexts. Whether it's ultimately safe or not, it's subject to
> > change and far too subtle and indirect for my taste. 
> 
> Well, all the critical list_empty(&cil->xc_cil) checks are done
> under the xc_push_lock, so I'd suggest that if we zero the counters
> under the push lock when switching contexts, and put the initial
> zero->non-zero counter transition to under the same lock we'll get
> exact checks without requiring a spinlock/atomic in the fast
> path and have all the right memory barriers in place such that races
> can't happen...
> 

That might work. We'd just have to audit the external checks and provide
clear comments on the purpose of the lock in those cases.

> > Could we replace all of the direct ->xc_cil list checks with an atomic
> > bitop (i.e. XLOG_CIL_EMPTY) or something similar in the xfs_cil? AFAICT,
> > that could be done in a separate patch and we could ultimately reuse it
> > to close the race with the initial ctx reservation (via
> > test_and_set_bit()) because it's otherwise set in the same function. Hm?
> 
> test_and_set_bit() still locks the memory bus and so requires
> exclusive access to the cacheline. Avoiding locked bus ops
> (atomics, spinlocks, etc) in the fast path is the problem
> I'm trying to solve with this patchset. IOWs, this isn't a viable
> solution to a scalability problem caused by many CPUs all trying to
> access the same cacheline exclusively.
> 

Of course I'd expect some hit from the added serialization, but it's not
clear to me it would be as noticeable as an explicit lock in practice.
For example, if we had an XLOG_CIL_EMPTY bit that was set at push time
and had a test_and_clear_bit() in the commit/insert path, would we take
that hit every time through the commit path or only until the cpu clears
it or sees that it's been cleared?

I'm not familiar enough with the bitops implementation to have
expectations one way or the other, but I'd be happy to test it out if
you can share the tests used to produce the documented results. :)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-14 13:44       ` Brian Foster
@ 2020-05-14 22:46         ` Dave Chinner
  2020-05-15 17:26           ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2020-05-14 22:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, May 14, 2020 at 09:44:46AM -0400, Brian Foster wrote:
> On Thu, May 14, 2020 at 09:33:58AM +1000, Dave Chinner wrote:
> > On Wed, May 13, 2020 at 01:02:37PM -0400, Brian Foster wrote:
> > > On Tue, May 12, 2020 at 07:28:10PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Next on the list to getting rid of the xc_cil_lock is making the CIL
> > > > itself per-cpu.
> > > > 
> > > > This requires a trade-off: we no longer move items forward in the
> > > > CIL; once they are on the CIL they remain there as we treat the
> > > > percpu lists as lockless.
> > > > 
> > > > XXX: preempt_disable() around the list operations to ensure they
> > > > stay local to the CPU.
> > > > 
> > > > XXX: this needs CPU hotplug notifiers to clean up when cpus go
> > > > offline.
> > > > 
> > > > Performance now increases substantially - the transaction rate goes
> > > > from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
> > > > for the first time.
> > > > 
> > > > Using a 32-way concurrent create/unlink on a 32p/16GB virtual
> > > > machine:
> > > > 
> > > > 	    create time     rate            unlink time
> > > > unpatched	1m56s      533k/s+/-28k/s      2m34s
> > > > patched		1m49s	   523k/s+/-14k/s      2m00s
> > > > 
> > > > Notably, the system time for the create went up, while variance went
> > > > down. This indicates we're starting to hit some other contention
> > > > limit as we reduce the amount of time we spend contending on the
> > > > xc_cil_lock.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
> > > >  fs/xfs/xfs_log_priv.h |  2 +-
> > > >  2 files changed, 45 insertions(+), 23 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index 746c841757ed1..af444bc69a7cd 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > ...
> > > > @@ -687,7 +689,7 @@ xlog_cil_push_work(
> > > >  	 * move on to a new sequence number and so we have to be able to push
> > > >  	 * this sequence again later.
> > > >  	 */
> > > > -	if (list_empty(&cil->xc_cil)) {
> > > > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> > > 
> > > It seems reasonable, but I need to think a bit more about the whole
> > > percpu list thing. In the meantime, one thing that comes to mind is the
> > > more of these list_empty() -> percpu_counter_read() translations I see
> > > the less I like it because we're leaking this inherent raciness to
> > > different contexts. Whether it's ultimately safe or not, it's subject to
> > > change and far too subtle and indirect for my taste. 
> > 
> > Well, all the critical list_empty(&cil->xc_cil) checks are done
> > under the xc_push_lock, so I'd suggest that if we zero the counters
> > under the push lock when switching contexts, and put the initial
> > zero->non-zero counter transition to under the same lock we'll get
> > exact checks without requiring a spinlock/atomic in the fast
> > path and have all the right memory barriers in place such that races
> > can't happen...
> > 
> 
> That might work. We'd just have to audit the external checks and provide
> clear comments on the purpose of the lock in those cases.

It does work. And there's only one external check and that's an
ASSERT that falls into the case of "we just added to the CIL, so the
counter must be non-zero because we haven't dropped the xc_ctx_lock
yet" so we don't need to hold the push lock here.

> > > Could we replace all of the direct ->xc_cil list checks with an atomic
> > > bitop (i.e. XLOG_CIL_EMPTY) or something similar in the xfs_cil? AFAICT,
> > > that could be done in a separate patch and we could ultimately reuse it
> > > to close the race with the initial ctx reservation (via
> > > test_and_set_bit()) because it's otherwise set in the same function. Hm?
> > 
> > test_and_set_bit() still locks the memory bus and so requires
> > exclusive access to the cacheline. Avoiding locked bus ops
> > (atomics, spinlocks, etc) in the fast path is the problem
> > I'm trying to solve with this patchset. IOWs, this isn't a viable
> > solution to a scalability problem caused by many CPUs all trying to
> > access the same cacheline exclusively.
> > 
> 
> Of course I'd expect some hit from the added serialization, but it's not
> clear to me it would be as noticeable as an explicit lock in practice.

It's an atomic. They have less overhead than a spin lock, but that
means it just requires a few more CPUs banging on it before it goes
into exponential breakdown like a spinlock does. And contention on
atomics is a lot harder to see in profiles.

> For example, if we had an XLOG_CIL_EMPTY bit that was set at push time
> and had a test_and_clear_bit() in the commit/insert path, would we take
> that hit every time through the commit path or only until the cpu clears
> it or sees that it's been cleared?

Every time it goes through the commit path. The x86 implementation:

static __always_inline bool
arch_test_and_set_bit(long nr, volatile unsigned long *addr)
{
        return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
}

It is a single instruction that always locks the bus to perform the
operation atomically.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-14 22:46         ` Dave Chinner
@ 2020-05-15 17:26           ` Brian Foster
  2020-05-18  0:30             ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2020-05-15 17:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, May 15, 2020 at 08:46:38AM +1000, Dave Chinner wrote:
> On Thu, May 14, 2020 at 09:44:46AM -0400, Brian Foster wrote:
> > On Thu, May 14, 2020 at 09:33:58AM +1000, Dave Chinner wrote:
> > > On Wed, May 13, 2020 at 01:02:37PM -0400, Brian Foster wrote:
> > > > On Tue, May 12, 2020 at 07:28:10PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Next on the list to getting rid of the xc_cil_lock is making the CIL
> > > > > itself per-cpu.
> > > > > 
> > > > > This requires a trade-off: we no longer move items forward in the
> > > > > CIL; once they are on the CIL they remain there as we treat the
> > > > > percpu lists as lockless.
> > > > > 
> > > > > XXX: preempt_disable() around the list operations to ensure they
> > > > > stay local to the CPU.
> > > > > 
> > > > > XXX: this needs CPU hotplug notifiers to clean up when cpus go
> > > > > offline.
> > > > > 
> > > > > Performance now increases substantially - the transaction rate goes
> > > > > from 750,000/s to 1.05M/sec, and the unlink rate is over 500,000/s
> > > > > for the first time.
> > > > > 
> > > > > Using a 32-way concurrent create/unlink on a 32p/16GB virtual
> > > > > machine:
> > > > > 
> > > > > 	    create time     rate            unlink time
> > > > > unpatched	1m56s      533k/s+/-28k/s      2m34s
> > > > > patched		1m49s	   523k/s+/-14k/s      2m00s
> > > > > 
> > > > > Notably, the system time for the create went up, while variance went
> > > > > down. This indicates we're starting to hit some other contention
> > > > > limit as we reduce the amount of time we spend contending on the
> > > > > xc_cil_lock.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c  | 66 ++++++++++++++++++++++++++++---------------
> > > > >  fs/xfs/xfs_log_priv.h |  2 +-
> > > > >  2 files changed, 45 insertions(+), 23 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index 746c841757ed1..af444bc69a7cd 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > ...
> > > > > @@ -687,7 +689,7 @@ xlog_cil_push_work(
> > > > >  	 * move on to a new sequence number and so we have to be able to push
> > > > >  	 * this sequence again later.
> > > > >  	 */
> > > > > -	if (list_empty(&cil->xc_cil)) {
> > > > > +	if (percpu_counter_read(&cil->xc_curr_res) == 0) {
> > > > 
> > > > It seems reasonable, but I need to think a bit more about the whole
> > > > percpu list thing. In the meantime, one thing that comes to mind is the
> > > > more of these list_empty() -> percpu_counter_read() translations I see
> > > > the less I like it because we're leaking this inherent raciness to
> > > > different contexts. Whether it's ultimately safe or not, it's subject to
> > > > change and far too subtle and indirect for my taste. 
> > > 
> > > Well, all the critical list_empty(&cil->xc_cil) checks are done
> > > under the xc_push_lock, so I'd suggest that if we zero the counters
> > > under the push lock when switching contexts, and put the initial
> > > zero->non-zero counter transition to under the same lock we'll get
> > > exact checks without requiring a spinlock/atomic in the fast
> > > path and have all the right memory barriers in place such that races
> > > can't happen...
> > > 
> > 
> > That might work. We'd just have to audit the external checks and provide
> > clear comments on the purpose of the lock in those cases.
> 
> It does work. And there's only one external check and that's an
> ASSERT that falls into the case of "we just added to the CIL, so the
> counter must be non-zero because we haven't dropped the xc_ctx_lock
> yet" so we don't need to hold the push lock here.
> 

Thinking about it again, that still seems way overengineered. All the
percpu batching stuff does is perform a global counter update behind a
spinlock and invoke a lockless read of that counter. AFAICT we can do
the same kind of lockless check on a flag or boolean in the CIL and use
an existing spinlock (as opposed to burying the percpu locks under our
own). Then we don't have to abuse special percpu interfaces at all to
obscurely provide serialization and indirect state.

> > > > Could we replace all of the direct ->xc_cil list checks with an atomic
> > > > bitop (i.e. XLOG_CIL_EMPTY) or something similar in the xfs_cil? AFAICT,
> > > > that could be done in a separate patch and we could ultimately reuse it
> > > > to close the race with the initial ctx reservation (via
> > > > test_and_set_bit()) because it's otherwise set in the same function. Hm?
> > > 
> > > test_and_set_bit() still locks the memory bus and so requires
> > > exclusive access to the cacheline. Avoiding locked bus ops
> > > (atomics, spinlocks, etc) in the fast path is the problem
> > > I'm trying to solve with this patchset. IOWs, this isn't a viable
> > > solution to a scalability problem caused by many CPUs all trying to
> > > access the same cacheline exclusively.
> > > 
> > 
> > Of course I'd expect some hit from the added serialization, but it's not
> > clear to me it would be as noticeable as an explicit lock in practice.
> 
> It's an atomic. They have less overhead than a spin lock, but that
> means it just requires a few more CPUs banging on it before it goes
> into exponential breakdown like a spinlock does. And contention on
> atomics is a lot harder to see in profiles.
> 
> > For example, if we had an XLOG_CIL_EMPTY bit that was set at push time
> > and had a test_and_clear_bit() in the commit/insert path, would we take
> > that hit every time through the commit path or only until the cpu clears
> > it or sees that it's been cleared?
> 
> Every time it goes through the commit path. The x86 implementation:
> 
> static __always_inline bool
> arch_test_and_set_bit(long nr, volatile unsigned long *addr)
> {
>         return GEN_BINARY_RMWcc(LOCK_PREFIX __ASM_SIZE(bts), *addr, c, "Ir", nr);
> }
> 
> It is a single instruction that always locks the bus to perform the
> operation atomically.
> 

Ok, that might prohibit using a bitop in the commit path. I'd still like
to see actual numbers on that, though, just to see where on the spectrum
it lands. I'm also wondering if the fast path logic mentioned above
could be implemented like the following (using bitops instead of the
spinlock):

	if (test_bit(XLOG_CIL_EMPTY, ...) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, ...)) {
		<steal CIL res>
	}

That type of pattern seems to be used in at least a few other places in
the kernel (e.g. filemap_check_errors(), wb_start_writeback(),
__blk_mq_tag_busy()), presumably for similar reasons.
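
In CIL terms, something like the following in the insert path, where
XLOG_CIL_EMPTY and the xc_flags field are made up names:

	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
		/*
		 * First insert into this context: steal the unit
		 * reservation for the CIL ticket from this transaction.
		 */
		ctx_res = ctx->ticket->t_unit_res;
		tp->t_ticket->t_curr_res -= ctx_res;
	}

The common case then only pays for a plain read of the flag word rather
than a locked op on every commit.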

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 4/5] [RFC] xfs: per-cpu CIL lists
  2020-05-15 17:26           ` Brian Foster
@ 2020-05-18  0:30             ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2020-05-18  0:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, May 15, 2020 at 01:26:49PM -0400, Brian Foster wrote:
> Ok, that might prohibit using a bitop in the commit path. I'd still like
> to see actual numbers on that, though, just to see where on the spectrum
> it lands. I'm also wondering if the fast path logic mentioned above
> could be implemented like the following (using bitops instead of the
> spinlock):
> 
> 	if (test_bit(XLOG_CIL_EMPTY, ...) &&
> 	    test_and_clear_bit(XLOG_CIL_EMPTY, ...)) {
> 		<steal CIL res>
> 	}
> 
> That type of pattern seems to be used in at least a few other places in
> the kernel (e.g. filemap_check_errors(), wb_start_writeback(),
> __blk_mq_tag_busy()), presumably for similar reasons.

Ok, that seems reasonable given that there is other code using the
same pattern to avoid atomic ops. Overhead will be no different to
the test/lock/retest pattern I've been using...
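
For comparison, the test/lock/retest version looks roughly like this
(sketch only, not the exact code in my tree):

	ctx_res = 0;
	if (!percpu_counter_read(&cil->xc_curr_res)) {
		spin_lock(&cil->xc_push_lock);
		if (!percpu_counter_read(&cil->xc_curr_res)) {
			/*
			 * First transaction into this context: steal the
			 * CIL ticket unit reservation and make the
			 * zero->non-zero transition under the push lock
			 * so the racy read above gets rechecked exactly.
			 * A batch of 1 folds the add into the global
			 * count immediately.
			 */
			ctx_res = ctx->ticket->t_unit_res;
			percpu_counter_add_batch(&cil->xc_curr_res,
						 ctx_res, 1);
		}
		spin_unlock(&cil->xc_push_lock);
		if (ctx_res)
			tp->t_ticket->t_curr_res -= ctx_res;
	}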

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2020-05-18  0:30 UTC | newest]

Thread overview: 25+ messages
2020-05-12  9:28 [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
2020-05-12  9:28 ` [PATCH 1/5] xfs: separate read-only variables in struct xfs_mount Dave Chinner
2020-05-12 12:30   ` Brian Foster
2020-05-12 16:09     ` Darrick J. Wong
2020-05-12 21:43       ` Dave Chinner
2020-05-12 21:53     ` Dave Chinner
2020-05-12  9:28 ` [PATCH 2/5] xfs: convert m_active_trans counter to per-cpu Dave Chinner
2020-05-12 12:31   ` Brian Foster
2020-05-12  9:28 ` [PATCH 3/5] [RFC] xfs: use percpu counters for CIL context counters Dave Chinner
2020-05-12 14:05   ` Brian Foster
2020-05-12 23:36     ` Dave Chinner
2020-05-13 12:09       ` Brian Foster
2020-05-13 21:52         ` Dave Chinner
2020-05-14  1:50           ` Dave Chinner
2020-05-14  2:49             ` Dave Chinner
2020-05-14 13:43           ` Brian Foster
2020-05-12  9:28 ` [PATCH 4/5] [RFC] xfs: per-cpu CIL lists Dave Chinner
2020-05-13 17:02   ` Brian Foster
2020-05-13 23:33     ` Dave Chinner
2020-05-14 13:44       ` Brian Foster
2020-05-14 22:46         ` Dave Chinner
2020-05-15 17:26           ` Brian Foster
2020-05-18  0:30             ` Dave Chinner
2020-05-12  9:28 ` [PATCH 5/5] [RFC] xfs: make CIl busy extent lists per-cpu Dave Chinner
2020-05-12 10:25 ` [PATCH 0/5 v2] xfs: fix a couple of performance issues Dave Chinner
