* [PATCH 0/2] xfs: fix a couple of performance issues
@ 2020-05-12  2:59 Dave Chinner
  2020-05-12  2:59 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
  2020-05-12  2:59 ` [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu Dave Chinner
  0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-05-12  2:59 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

I was comparing profiles between two machines and realised there was
a big discrepancy between them on an unlink workload, which was kinda
weird. I pulled on that string and realised the problem was cacheline
bouncing interfering with the cache residency of read-only variables.
Hence the first patch.

The second patch came about from working out which variable was
causing the cacheline bouncing, since it wasn't showing up as
overhead in the CPU usage profiles of the code paths that were
contending on it. On larger machines, converting that atomic
variable to a per-cpu counter provides a major performance win.
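
At its core, the conversion is the mechanical substitution sketched
below (sketch only - the real diff, including the percpu counter
init/teardown, is in the second patch):

	/* before: every transaction start/finish bounces one global
	 * cacheline between all CPUs running transactions */
	atomic_inc(&mp->m_active_trans);
	atomic_dec(&mp->m_active_trans);

	/* after: updates stay CPU-local; the exact (slow) sum is only
	 * taken in the rare quiesce path waiting for transactions
	 * to drain */
	percpu_counter_inc(&mp->m_active_trans);
	percpu_counter_dec(&mp->m_active_trans);
	while (percpu_counter_sum(&mp->m_active_trans) > 0)
		delay(100);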

Thoughts, comments, etc all welcome.

-Dave.




* [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  2:59 [PATCH 0/2] xfs: fix a couple of performance issues Dave Chinner
@ 2020-05-12  2:59 ` Dave Chinner
  2020-05-12  8:14   ` Christoph Hellwig
  2020-05-12  2:59 ` [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-05-12  2:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Seeing massive CPU usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, but one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is the core count per socket. Both are running
identical 16p/16GB virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

The suspect is cacheline contention on m_sectbb_log, which is used
in one of the macros in xfs_agino_range(). This is a read-only
variable, but it shares a cacheline with m_active_trans, a global
atomic that gets bounced all around the machine.

The workload is trying to run hundreds of thousands of transactions
per second, and hence cacheline contention will be occurring on this
atomic counter. xfs_agino_range() is likely just an innocent
bystander as the cache coherency protocol fights over the cacheline
between CPU cores and sockets.
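
As a rough illustration of the problem (not the actual xfs_mount
layout), a read-mostly field that lands in the same cacheline as a
hot atomic gets invalidated every time the atomic is modified on
another CPU. Annotating the start of the read-only block with
____cacheline_aligned, as the diff below does for m_bsize, pushes
that data onto its own cacheline:

	struct example {
		atomic_t	hot_counter;	/* written constantly */
		uint8_t		ro_shift;	/* read-only, but evicted
						 * along with hot_counter */
	};

	struct example_fixed {
		atomic_t	hot_counter;
		/* read-only data starts on a fresh cacheline, so it
		 * stays shared/clean in every CPU's cache */
		uint8_t		ro_shift ____cacheline_aligned;
	};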

On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x that of
machine B, and the workload still runs substantially slower than it
should.

Current rm -rf of 50 million files:

		vanilla		patched
machine A	13m20s		8m30s
machine B	5m30s		5m02s

It's an improvement, indicating that separating and further
optimising read-only global filesystem data is worthwhile, but it
clearly isn't the underlying issue causing this specific performance
degradation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a15792792..712b3e2583316 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -88,21 +88,12 @@ typedef struct xfs_mount {
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -117,9 +108,15 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+
+	/*
+	 * Read-only variables that are pre-calculated at mount time.
+	 */
+	int ____cacheline_aligned m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,20 +135,35 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+						/* low free space thresholds */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores; /* no per-AG finobt resv. */
+	/*
+	 * End of pre-calculated read-only variables
+	 */
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
+	struct mutex		m_growlock;	/* growfs mutex */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
@@ -160,8 +172,6 @@ typedef struct xfs_mount {
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -191,8 +201,6 @@ typedef struct xfs_mount {
 	 */
 	uint32_t		m_generation;
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
-- 
2.26.1.301.g55bc3eb7cb9



* [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  2:59 [PATCH 0/2] xfs: fix a couple of performance issues Dave Chinner
  2020-05-12  2:59 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-12  2:59 ` Dave Chinner
  2020-05-12  8:16   ` Christoph Hellwig
  2020-05-12 16:03   ` Darrick J. Wong
  1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2020-05-12  2:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's a global atomic counter, and we are hitting it at a rate of
half a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. Convert it to a
per-cpu counter.

And .... oh wow, that was unexpected!

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

		unpatched	patched
machine A:	3m45s		2m27s
machine B:	4m13s		4m14s

Create rates:
		unpatched	patched
machine A:	246k+/-15k	384k+/-10k
machine B:	225k+/-13k	223k+/-11k

Concurrent rm of same 50 million inodes:

		unpatched	patched
machine A:	8m30s		3m09s
machine B:	4m02s		4m51s

The transaction rate on the fast machine went from about 250k/sec to
over 600k/sec, which indicates just how much of a bottleneck this
atomic counter was.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h |  2 +-
 fs/xfs/xfs_super.c | 12 +++++++++---
 fs/xfs/xfs_trans.c |  6 +++---
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 712b3e2583316..af3d8b71e9591 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -84,6 +84,7 @@ typedef struct xfs_mount {
 	 * extents or anything related to the rt device.
 	 */
 	struct percpu_counter	m_delalloc_blks;
+	struct percpu_counter	m_active_trans;	/* in progress xact counter */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
@@ -164,7 +165,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e80bd2c4c279e..bc4853525ce18 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -883,7 +883,7 @@ xfs_quiesce_attr(
 	int	error = 0;
 
 	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
+	while (percpu_counter_sum(&mp->m_active_trans) > 0)
 		delay(100);
 
 	/* force the log to unpin objects from the now complete transactions */
@@ -902,7 +902,7 @@ xfs_quiesce_attr(
 	 * Just warn here till VFS can correctly support
 	 * read-only remount without racing.
 	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
+	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
 
 	xfs_log_quiesce(mp);
 }
@@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
 	if (error)
 		goto free_fdblocks;
 
+	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
+	if (error)
+		goto free_delalloc_blocks;
+
 	return 0;
 
+free_delalloc_blocks:
+	percpu_counter_destroy(&mp->m_delalloc_blks);
 free_fdblocks:
 	percpu_counter_destroy(&mp->m_fdblocks);
 free_ifree:
@@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
 	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
 	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
 	percpu_counter_destroy(&mp->m_delalloc_blks);
+	percpu_counter_destroy(&mp->m_active_trans);
 }
 
 static void
@@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 28b983ff8b113..636df5017782e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -68,7 +68,7 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
-	atomic_dec(&tp->t_mountp->m_active_trans);
+	percpu_counter_dec(&tp->t_mountp->m_active_trans);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -126,7 +126,7 @@ xfs_trans_dup(
 
 	xfs_trans_dup_dqinfo(tp, ntp);
 
-	atomic_inc(&tp->t_mountp->m_active_trans);
+	percpu_counter_inc(&tp->t_mountp->m_active_trans);
 	return ntp;
 }
 
@@ -275,7 +275,7 @@ xfs_trans_alloc(
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
-	atomic_inc(&mp->m_active_trans);
+	percpu_counter_inc(&mp->m_active_trans);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
-- 
2.26.1.301.g55bc3eb7cb9



* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  2:59 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-12  8:14   ` Christoph Hellwig
  2020-05-12  9:11     ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2020-05-12  8:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

I'm surprised the difference is that big.  The moves look obviously
fine, but why do you put the precalculated fields in the middle
of struct xfs_mount instead of at the end?


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  2:59 ` [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu Dave Chinner
@ 2020-05-12  8:16   ` Christoph Hellwig
  2020-05-12  9:06     ` Dave Chinner
  2020-05-12 16:03   ` Darrick J. Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2020-05-12  8:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> Concurrent rm of same 50 million inodes:
> 
> 		unpatched	patched
> machine A:	8m30s		3m09s
> machine B:	4m02s		4m51s

This actually is a significant slowdown on machine B, which
might be the more common one.  Can you find out how that happens
and if we can mitigate it?


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  8:16   ` Christoph Hellwig
@ 2020-05-12  9:06     ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2020-05-12  9:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, May 12, 2020 at 01:16:08AM -0700, Christoph Hellwig wrote:
> On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> > Concurrent rm of same 50 million inodes:
> > 
> > 		unpatched	patched
> > machine A:	8m30s		3m09s
> > machine B:	4m02s		4m51s
> 
> This actually is a significant slowdown on machine B, which

Oops, that's supposed to read "5m02s", not "4m02s". It was
marginally faster on machine B, as the commentary indicated. See the
log below for raw numbers. First set is unpatched, second set is the
patched kernel.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


$ ~/tests/fsmark-50-test-xfs.sh
QUOTA=
MKFSOPTS=
DEV=/dev/vdc
THREADS=16
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
4722

#  ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  32  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15  -d  /mnt/scratch/16 
#       Version 3.3, 16 thread(s) starting at Tue May 12 11:48:45 2020
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 1800 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     262602.4         11654020
     0      3200000            0     259664.3         11996380
     0      4800000            0     257304.3         11735862
     0      6400000            0     184439.9         12639349
     0      8000000            0     248908.7         12534005
     0      9600000            0     242772.1         12594078
     0     11200000            0     219974.6         13209264
     0     12800000            0     226903.2         13578529
     0     14400000            0     221077.0         13711077
     0     16000000            0     228973.0         13422679
     0     17600000            0     228344.3         13520123
     0     19200000            0     223846.4         13032929
     0     20800000            0     222562.0         13473815
     0     22400000            0     222068.3         13147580
     0     24000000            0     227009.1         13993071
     0     25600000            0     222685.8         13279342
     0     27200000            0     222493.4         13427861
     0     28800000            0     225331.6         13368843
     0     30400000            0     223663.7         13485135
     0     32000000            0     227392.7         13616403
     0     33600000            0     223416.0         14259976
     0     35200000            0     223949.8         13770566
     0     36800000            0     223848.5         14109223
     0     38400000            0     226992.1         13699116
     0     40000000            0     224701.9         13912164
     0     41600000            0     226491.7         14211451
     0     43200000            0     226421.6         13734525
     0     44800000            0     192666.9         14355559
     0     46400000            0     215824.1         16048153
     0     48000000            0     219833.9         13915186
     0     49600000            0     200815.3         13911419
     0     51200000            0     208422.5         13289185
     0 1600000-51200000(2.64e+07+/-1.4e+07)            0 184439.900000-262602.400000(225479+/-1.3e+04) 11654020-16048153(1.34312e+07+/-6.2e+05)

real    4m13.728s
user    4m54.879s
sys     45m35.908s
6287952 6287828  99%    0.20K 161508       39   1292064K xfs_ili                
6287545 6287545 100%    1.19K 273980       26   8767360K xfs_inode              
385038 369135  95%    0.44K  10697       36    171152K xfs_buf                
 35464  15741  44%    0.26K   1144       31      9152K xfs_buf_item           
  1840   1104  60%    0.17K     40       46       320K xfs_icr                
   704    704 100%    0.18K     16       44       128K xfs_log_ticket         
   576    576 100%    0.22K     16       36       128K xfs_btree_cur          
   544    544 100%    0.47K     16       34       256K xfs_da_state           
     0      0   0%    0.06K      0       64         0K xfs_bmap_free_item     
     0      0   0%    0.04K      0      102         0K xfs_ifork              
     0      0   0%    0.42K      0       37         0K xfs_efd_item           
     0      0   0%    0.42K      0       37         0K xfs_efi_item           
     0      0   0%    0.16K      0       24         0K xfs_rud_item           
     0      0   0%    0.67K      0       47         0K xfs_rui_item           
     0      0   0%    0.16K      0       24         0K xfs_cud_item           
     0      0   0%    0.42K      0       37         0K xfs_cui_item           
     0      0   0%    0.16K      0       24         0K xfs_bud_item           
     0      0   0%    0.20K      0       39         0K xfs_bui_item           
     0      0   0%    0.54K      0       29         0K xfs_dquot              
     0      0   0%    0.52K      0       31         0K xfs_dqtrx              
Repair

real    0m0.337s
user    0m0.004s
sys     0m0.059s
removing files

real    4m30.062s
user    0m4.154s
sys     3m14.984s

real    4m30.818s
user    0m3.912s
sys     3m16.197s

real    4m31.320s
user    0m4.047s
sys     3m15.194s

real    4m32.028s
user    0m4.028s
sys     3m16.690s

real    4m32.973s
user    0m3.974s
sys     3m16.360s

real    4m33.592s
user    0m3.943s
sys     3m13.878s

real    4m34.329s
user    0m4.017s
sys     3m16.072s

real    4m34.703s
user    0m4.000s
sys     3m15.959s

real    4m35.050s
user    0m3.977s
sys     3m16.347s

real    4m35.608s
user    0m3.938s
sys     3m16.133s

real    4m38.287s
user    0m4.049s
sys     3m16.415s

real    4m52.474s
user    0m4.036s
sys     3m16.174s

real    4m57.587s
user    0m4.131s
sys     3m17.122s

real    5m1.172s
user    0m4.074s
sys     3m17.258s

real    5m1.418s
user    0m3.930s
sys     3m17.229s

real    5m2.636s
user    0m4.153s
sys     3m18.217s

-----

$ ~/tests/fsmark-50-test-xfs.sh
QUOTA=
MKFSOPTS=
DEV=/dev/vdc
THREADS=16
meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
4735

#  ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  32  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d  /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  -d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15  -d  /mnt/scratch/16 
#       Version 3.3, 16 thread(s) starting at Tue May 12 12:14:32 2020
#       Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#       Directories:  Time based hash between directories across 10000 subdirectories with 1800 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 0 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     261219.3         12276821
     0      3200000            0     258794.5         12649641
     0      4800000            0     239854.1         13457093
     0      6400000            0     224239.7         15311092
     0      8000000            0     247496.4         13289793
     0      9600000            0     237615.3         13263139
     0     11200000            0     223894.5         14654331
     0     12800000            0     224140.0         14361572
     0     14400000            0     215027.4         14415680
     0     16000000            0     223168.6         14288061
     0     17600000            0     224865.4         13978708
     0     19200000            0     220644.3         14040732
     0     20800000            0     219239.1         14174390
     0     22400000            0     228602.0         13519203
     0     24000000            0     224225.2         14069045
     0     25600000            0     221120.1         14516810
     0     27200000            0     221523.0         14665428
     0     28800000            0     217718.8         14485142
     0     30400000            0     220233.9         14628782
     0     32000000            0     220244.3         13882637
     0     33600000            0     223942.6         14548384
     0     35200000            0     222644.3         14862085
     0     36800000            0     219387.9         14354504
     0     38400000            0     222048.7         14658001
     0     40000000            0     221213.9         14664261
     0     41600000            0     220770.8         14583222
     0     43200000            0     220374.6         15453719
     0     44800000            0     205423.2         14096683
     0     46400000            0     208618.2         16111760
     0     48000000            0     200860.0         16714313
     0     49600000            0     207497.0         15617340
     0     51200000            0     206502.8         15035764

real    4m14.483s
user    5m8.450s
sys     46m5.712s
     0 1600000-51200000(2.64e+07+/-1.4e+07)            0 200860.000000-261219.300000(223036+/-1.1e+04) 12276821-16714313(1.43879e+07+/-7.2e+05)
6009008 5941393  98%    0.20K 154292       39   1234336K xfs_ili                
5962213 5941503  99%    1.19K 299994       26   9599808K xfs_inode              
382752 337541  88%    0.44K  10632       36    170112K xfs_buf                
 32085  14207  44%    0.26K   1035       31      8280K xfs_buf_item           
  2990   2089  69%    0.17K     65       46       520K xfs_icr                
   704    704 100%    0.18K     16       44       128K xfs_log_ticket         
   576    576 100%    0.22K     16       36       128K xfs_btree_cur          
   544    544 100%    0.47K     16       34       256K xfs_da_state           
     0      0   0%    0.06K      0       64         0K xfs_bmap_free_item     
     0      0   0%    0.04K      0      102         0K xfs_ifork              
     0      0   0%    0.42K      0       37         0K xfs_efd_item           
     0      0   0%    0.42K      0       37         0K xfs_efi_item           
     0      0   0%    0.16K      0       24         0K xfs_rud_item           
     0      0   0%    0.67K      0       47         0K xfs_rui_item           
     0      0   0%    0.16K      0       24         0K xfs_cud_item           
     0      0   0%    0.42K      0       37         0K xfs_cui_item           
     0      0   0%    0.16K      0       24         0K xfs_bud_item           
     0      0   0%    0.20K      0       39         0K xfs_bui_item           
     0      0   0%    0.54K      0       29         0K xfs_dquot              
     0      0   0%    0.52K      0       31         0K xfs_dqtrx              
Repair

real    0m0.579s
user    0m0.007s
sys     0m0.076s
removing files

real    4m26.929s
user    0m4.010s
sys     3m9.884s

real    4m27.298s
user    0m4.113s
sys     3m9.052s

real    4m29.477s
user    0m4.007s
sys     3m10.205s

real    4m29.562s
user    0m4.004s
sys     3m9.534s

real    4m29.582s
user    0m4.001s
sys     3m8.189s

real    4m31.160s
user    0m4.038s
sys     3m10.027s

real    4m31.646s
user    0m4.232s
sys     3m8.585s

real    4m31.671s
user    0m4.246s
sys     3m9.954s

real    4m31.824s
user    0m3.966s
sys     3m9.712s

real    4m33.730s
user    0m4.189s
sys     3m8.743s

real    4m34.045s
user    0m4.007s
sys     3m10.145s

real    4m36.011s
user    0m4.006s
sys     3m10.385s

real    4m36.110s
user    0m3.929s
sys     3m11.254s

real    4m43.693s
user    0m4.149s
sys     3m9.350s

real    4m50.344s
user    0m4.159s
sys     3m9.867s

real    4m51.145s
user    0m4.165s
sys     3m10.713s



* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  8:14   ` Christoph Hellwig
@ 2020-05-12  9:11     ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2020-05-12  9:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, May 12, 2020 at 01:14:23AM -0700, Christoph Hellwig wrote:
> I'm surprised the difference is that big.  The moves look obviously
> fine, but why do you put the precalculated fields in the middle
> of struct xfs_mount instead of at the end?

Just because it was the minimum amount to move around, and it
clearly demonstrated the damage the contended cacheline was doing
to the performance of code accessing read-only variables.

Really, there's a lot in the xfs_mount that might well be read only
that I didn't consider. I'm thinking that most of the pointers to
structures are read only (e.g. the log, ailp, buftargs, etc) as they
do not change for the life of the structure. I don't really have the
time to dig into this real deep (this is a quick, interesting
diversion!) so I did the 99-percenter and moved on...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12  2:59 ` [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu Dave Chinner
  2020-05-12  8:16   ` Christoph Hellwig
@ 2020-05-12 16:03   ` Darrick J. Wong
  2020-05-12 21:39     ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2020-05-12 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a global atomic counter, and we are hitting it at a rate of
> half a million transactions a second, so it's bouncing the counter
> cacheline all over the place on large machines. Convert it to a
> per-cpu counter.
> 
> And .... oh wow, that was unexpected!
> 
> Concurrent create, 50 million inodes, identical 16p/16GB virtual
> machines on different physical hosts. Machine A has twice the CPU
> cores per socket of machine B:
> 
> 		unpatched	patched
> machine A:	3m45s		2m27s
> machine B:	4m13s		4m14s
> 
> Create rates:
> 		unpatched	patched
> machine A:	246k+/-15k	384k+/-10k
> machine B:	225k+/-13k	223k+/-11k
> 
> Concurrent rm of same 50 million inodes:
> 
> 		unpatched	patched
> machine A:	8m30s		3m09s
> machine B:	4m02s		4m51s
> 
> The transaction rate on the fast machine went from about 250k/sec to
> over 600k/sec, which indicates just how much of a bottleneck this
> atomic counter was.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.h |  2 +-
>  fs/xfs/xfs_super.c | 12 +++++++++---
>  fs/xfs/xfs_trans.c |  6 +++---
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 712b3e2583316..af3d8b71e9591 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -84,6 +84,7 @@ typedef struct xfs_mount {
>  	 * extents or anything related to the rt device.
>  	 */
>  	struct percpu_counter	m_delalloc_blks;
> +	struct percpu_counter	m_active_trans;	/* in progress xact counter */
>  
>  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
>  	char			*m_rtname;	/* realtime device name */
> @@ -164,7 +165,6 @@ typedef struct xfs_mount {
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e80bd2c4c279e..bc4853525ce18 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -883,7 +883,7 @@ xfs_quiesce_attr(
>  	int	error = 0;
>  
>  	/* wait for all modifications to complete */
> -	while (atomic_read(&mp->m_active_trans) > 0)
> +	while (percpu_counter_sum(&mp->m_active_trans) > 0)
>  		delay(100);

Hmm.  AFAICT, this counter stops us from quiescing the log while
transactions are still running.  We only quiesce the log for unmount,
remount-ro, and fs freeze.  Given that we now start_sb_write for
xfs_getfsmap and the background freeing threads, I wonder, do we still
need this at all?

--D

>  
>  	/* force the log to unpin objects from the now complete transactions */
> @@ -902,7 +902,7 @@ xfs_quiesce_attr(
>  	 * Just warn here till VFS can correctly support
>  	 * read-only remount without racing.
>  	 */
> -	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
> +	WARN_ON(percpu_counter_sum(&mp->m_active_trans) != 0);
>  
>  	xfs_log_quiesce(mp);
>  }
> @@ -1027,8 +1027,14 @@ xfs_init_percpu_counters(
>  	if (error)
>  		goto free_fdblocks;
>  
> +	error = percpu_counter_init(&mp->m_active_trans, 0, GFP_KERNEL);
> +	if (error)
> +		goto free_delalloc_blocks;
> +
>  	return 0;
>  
> +free_delalloc_blocks:
> +	percpu_counter_destroy(&mp->m_delalloc_blks);
>  free_fdblocks:
>  	percpu_counter_destroy(&mp->m_fdblocks);
>  free_ifree:
> @@ -1057,6 +1063,7 @@ xfs_destroy_percpu_counters(
>  	ASSERT(XFS_FORCED_SHUTDOWN(mp) ||
>  	       percpu_counter_sum(&mp->m_delalloc_blks) == 0);
>  	percpu_counter_destroy(&mp->m_delalloc_blks);
> +	percpu_counter_destroy(&mp->m_active_trans);
>  }
>  
>  static void
> @@ -1792,7 +1799,6 @@ static int xfs_init_fs_context(
>  	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
>  	spin_lock_init(&mp->m_perag_lock);
>  	mutex_init(&mp->m_growlock);
> -	atomic_set(&mp->m_active_trans, 0);
>  	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
>  	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
>  	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 28b983ff8b113..636df5017782e 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -68,7 +68,7 @@ xfs_trans_free(
>  	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
>  
>  	trace_xfs_trans_free(tp, _RET_IP_);
> -	atomic_dec(&tp->t_mountp->m_active_trans);
> +	percpu_counter_dec(&tp->t_mountp->m_active_trans);
>  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_end_intwrite(tp->t_mountp->m_super);
>  	xfs_trans_free_dqinfo(tp);
> @@ -126,7 +126,7 @@ xfs_trans_dup(
>  
>  	xfs_trans_dup_dqinfo(tp, ntp);
>  
> -	atomic_inc(&tp->t_mountp->m_active_trans);
> +	percpu_counter_inc(&tp->t_mountp->m_active_trans);
>  	return ntp;
>  }
>  
> @@ -275,7 +275,7 @@ xfs_trans_alloc(
>  	 */
>  	WARN_ON(resp->tr_logres > 0 &&
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
> -	atomic_inc(&mp->m_active_trans);
> +	percpu_counter_inc(&mp->m_active_trans);
>  
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
> -- 
> 2.26.1.301.g55bc3eb7cb9
> 


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12 16:03   ` Darrick J. Wong
@ 2020-05-12 21:39     ` Dave Chinner
  2020-05-12 22:59       ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2020-05-12 21:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, May 12, 2020 at 09:03:52AM -0700, Darrick J. Wong wrote:
> On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It's a global atomic counter, and we are hitting it at a rate of
> > half a million transactions a second, so it's bouncing the counter
> > cacheline all over the place on large machines. Convert it to a
> > per-cpu counter.
> > 
> > And .... oh wow, that was unexpected!
> > 
> > Concurrent create, 50 million inodes, identical 16p/16GB virtual
> > machines on different physical hosts. Machine A has twice the CPU
> > cores per socket of machine B:
> > 
> > 		unpatched	patched
> > machine A:	3m45s		2m27s
> > machine B:	4m13s		4m14s
> > 
> > Create rates:
> > 		unpatched	patched
> > machine A:	246k+/-15k	384k+/-10k
> > machine B:	225k+/-13k	223k+/-11k
> > 
> > Concurrent rm of same 50 million inodes:
> > 
> > 		unpatched	patched
> > machine A:	8m30s		3m09s
> > machine B:	4m02s		4m51s
> > 
> > The transaction rate on the fast machine went from about 250k/sec to
> > over 600k/sec, which indicates just how much of a bottleneck this
> > atomic counter was.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_mount.h |  2 +-
> >  fs/xfs/xfs_super.c | 12 +++++++++---
> >  fs/xfs/xfs_trans.c |  6 +++---
> >  3 files changed, 13 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 712b3e2583316..af3d8b71e9591 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -84,6 +84,7 @@ typedef struct xfs_mount {
> >  	 * extents or anything related to the rt device.
> >  	 */
> >  	struct percpu_counter	m_delalloc_blks;
> > +	struct percpu_counter	m_active_trans;	/* in progress xact counter */
> >  
> >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> >  	char			*m_rtname;	/* realtime device name */
> > @@ -164,7 +165,6 @@ typedef struct xfs_mount {
> >  	uint64_t		m_resblks;	/* total reserved blocks */
> >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> > -	atomic_t		m_active_trans;	/* number trans frozen */
> >  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> >  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index e80bd2c4c279e..bc4853525ce18 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -883,7 +883,7 @@ xfs_quiesce_attr(
> >  	int	error = 0;
> >  
> >  	/* wait for all modifications to complete */
> > -	while (atomic_read(&mp->m_active_trans) > 0)
> > +	while (percpu_counter_sum(&mp->m_active_trans) > 0)
> >  		delay(100);
> 
> Hmm.  AFAICT, this counter stops us from quiescing the log while
> transactions are still running.  We only quiesce the log for unmount,
> remount-ro, and fs freeze.  Given that we now start_sb_write for
> xfs_getfsmap and the background freeing threads, I wonder, do we still
> need this at all?

Perhaps not - I didn't look that far. It's basically only needed to
protect against XFS_TRANS_NO_WRITECOUNT transactions, which is
really just xfs_sync_sb() these days. This can come from several
places, but the only one outside of mount/freeze/unmount is the log
worker.  Perhaps the log worker can be cancelled before calling
xfs_quiesce_attr() rather than after?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12 21:39     ` Dave Chinner
@ 2020-05-12 22:59       ` Darrick J. Wong
  2020-05-13 17:17         ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2020-05-12 22:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 13, 2020 at 07:39:19AM +1000, Dave Chinner wrote:
> On Tue, May 12, 2020 at 09:03:52AM -0700, Darrick J. Wong wrote:
> > On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > It's a global atomic counter, and we are hitting it at a rate of
> > > half a million transactions a second, so it's bouncing the counter
> > > cacheline all over the place on large machines. Convert it to a
> > > per-cpu counter.
> > > 
> > > And .... oh wow, that was unexpected!
> > > 
> > > Concurrent create, 50 million inodes, identical 16p/16GB virtual
> > > machines on different physical hosts. Machine A has twice the CPU
> > > cores per socket of machine B:
> > > 
> > > 		unpatched	patched
> > > machine A:	3m45s		2m27s
> > > machine B:	4m13s		4m14s
> > > 
> > > Create rates:
> > > 		unpatched	patched
> > > machine A:	246k+/-15k	384k+/-10k
> > > machine B:	225k+/-13k	223k+/-11k
> > > 
> > > Concurrent rm of same 50 million inodes:
> > > 
> > > 		unpatched	patched
> > > machine A:	8m30s		3m09s
> > > machine B:	4m02s		4m51s
> > > 
> > > The transaction rate on the fast machine went from about 250k/sec to
> > > over 600k/sec, which indicates just how much of a bottleneck this
> > > atomic counter was.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_mount.h |  2 +-
> > >  fs/xfs/xfs_super.c | 12 +++++++++---
> > >  fs/xfs/xfs_trans.c |  6 +++---
> > >  3 files changed, 13 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index 712b3e2583316..af3d8b71e9591 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -84,6 +84,7 @@ typedef struct xfs_mount {
> > >  	 * extents or anything related to the rt device.
> > >  	 */
> > >  	struct percpu_counter	m_delalloc_blks;
> > > +	struct percpu_counter	m_active_trans;	/* in progress xact counter */
> > >  
> > >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> > >  	char			*m_rtname;	/* realtime device name */
> > > @@ -164,7 +165,6 @@ typedef struct xfs_mount {
> > >  	uint64_t		m_resblks;	/* total reserved blocks */
> > >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> > >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> > > -	atomic_t		m_active_trans;	/* number trans frozen */
> > >  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> > >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> > >  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index e80bd2c4c279e..bc4853525ce18 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -883,7 +883,7 @@ xfs_quiesce_attr(
> > >  	int	error = 0;
> > >  
> > >  	/* wait for all modifications to complete */
> > > -	while (atomic_read(&mp->m_active_trans) > 0)
> > > +	while (percpu_counter_sum(&mp->m_active_trans) > 0)
> > >  		delay(100);
> > 
> > Hmm.  AFAICT, this counter stops us from quiescing the log while
> > transactions are still running.  We only quiesce the log for unmount,
> > remount-ro, and fs freeze.  Given that we now start_sb_write for
> > xfs_getfsmap and the background freeing threads, I wonder, do we still
> > need this at all?
> 
> Perhaps not - I didn't look that far. It's basically only needed to
> protect against XFS_TRANS_NO_WRITECOUNT transactions, which is
> really just xfs_sync_sb() these days. This can come from several
> places, but the only one outside of mount/freeze/unmount is the log
> worker.  Perhaps the log worker can be cancelled before calling
> xfs_quiesce_attr() rather than after?

What if we skip bumping m_active_trans for NO_WRITECOUNT transactions?
There aren't that many of them, and it'd probably be better for memory
consumption on 1000-core systems. ;)
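
Something like this, perhaps (untested sketch, just to illustrate the
idea; it mirrors the NO_WRITECOUNT check xfs_trans_free() already does
before sb_end_intwrite()):

	/* in xfs_trans_alloc(): only count transactions that take a
	 * write reference, as those are what freeze/quiesce waits for */
	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
		percpu_counter_inc(&mp->m_active_trans);

	/* in xfs_trans_free(): matching skip on the way out */
	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
		percpu_counter_dec(&tp->t_mountp->m_active_trans);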

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 2/2] xfs: convert m_active_trans counter to per-cpu
  2020-05-12 22:59       ` Darrick J. Wong
@ 2020-05-13 17:17         ` Darrick J. Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2020-05-13 17:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, May 12, 2020 at 03:59:18PM -0700, Darrick J. Wong wrote:
> On Wed, May 13, 2020 at 07:39:19AM +1000, Dave Chinner wrote:
> > On Tue, May 12, 2020 at 09:03:52AM -0700, Darrick J. Wong wrote:
> > > On Tue, May 12, 2020 at 12:59:49PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > It's a global atomic counter, and we are hitting it at a rate of
> > > > half a million transactions a second, so it's bouncing the counter
> > > > cacheline all over the place on large machines. Convert it to a
> > > > per-cpu counter.
> > > > 
> > > > And .... oh wow, that was unexpected!
> > > > 
> > > > Concurrent create, 50 million inodes, identical 16p/16GB virtual
> > > > machines on different physical hosts. Machine A has twice the CPU
> > > > cores per socket of machine B:
> > > > 
> > > > 		unpatched	patched
> > > > machine A:	3m45s		2m27s
> > > > machine B:	4m13s		4m14s
> > > > 
> > > > Create rates:
> > > > 		unpatched	patched
> > > > machine A:	246k+/-15k	384k+/-10k
> > > > machine B:	225k+/-13k	223k+/-11k
> > > > 
> > > > Concurrent rm of same 50 million inodes:
> > > > 
> > > > 		unpatched	patched
> > > > machine A:	8m30s		3m09s
> > > > machine B:	4m02s		4m51s
> > > > 
> > > > The transaction rate on the fast machine went from about 250k/sec to
> > > > over 600k/sec, which indicates just how much of a bottleneck this
> > > > atomic counter was.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_mount.h |  2 +-
> > > >  fs/xfs/xfs_super.c | 12 +++++++++---
> > > >  fs/xfs/xfs_trans.c |  6 +++---
> > > >  3 files changed, 13 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > > index 712b3e2583316..af3d8b71e9591 100644
> > > > --- a/fs/xfs/xfs_mount.h
> > > > +++ b/fs/xfs/xfs_mount.h
> > > > @@ -84,6 +84,7 @@ typedef struct xfs_mount {
> > > >  	 * extents or anything related to the rt device.
> > > >  	 */
> > > >  	struct percpu_counter	m_delalloc_blks;
> > > > +	struct percpu_counter	m_active_trans;	/* in progress xact counter */
> > > >  
> > > >  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
> > > >  	char			*m_rtname;	/* realtime device name */
> > > > @@ -164,7 +165,6 @@ typedef struct xfs_mount {
> > > >  	uint64_t		m_resblks;	/* total reserved blocks */
> > > >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> > > >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> > > > -	atomic_t		m_active_trans;	/* number trans frozen */
> > > >  	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> > > >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> > > >  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index e80bd2c4c279e..bc4853525ce18 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -883,7 +883,7 @@ xfs_quiesce_attr(
> > > >  	int	error = 0;
> > > >  
> > > >  	/* wait for all modifications to complete */
> > > > -	while (atomic_read(&mp->m_active_trans) > 0)
> > > > +	while (percpu_counter_sum(&mp->m_active_trans) > 0)
> > > >  		delay(100);
> > > 
> > > Hmm.  AFAICT, this counter stops us from quiescing the log while
> > > transactions are still running.  We only quiesce the log for unmount,
> > > remount-ro, and fs freeze.  Given that we now start_sb_write for
> > > xfs_getfsmap and the background freeing threads, I wonder, do we still
> > > need this at all?
> > 
> > Perhaps not - I didn't look that far. It's basically only needed to
> > protect against XFS_TRANS_NO_WRITECOUNT transactions, which is
> > really just xfs_sync_sb() these days. This can come from several
> > places, but the only one outside of mount/freeze/unmount is the log
> > worker.  Perhaps the log worker can be cancelled before calling
> > xfs_quiesce_attr() rather than after?
> 
> What if we skip bumping m_active_trans for NO_WRITECOUNT transactions?
> There aren't that many of them, and it'd probably be better for memory
> consumption on 1000-core systems. ;)

I think the log worker is the only piece remaining that uses
NO_WRITECOUNT transactions without fsfreeze protection, and AFAICT we
should be able to cancel it at the beginning of xfs_quiesce_attr since
the function forces the log out and pushes the AIL manually.  That seems
like it should enable us to remove m_active_trans...

--D

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
