* [PATCH 0/2 v3] xfs: improve transaction rate scalability
@ 2020-05-19 22:23 Dave Chinner
  2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
  2020-05-19 22:23 ` [PATCH 2/2] xfs: remove the m_active_trans counter Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-19 22:23 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is the third version of the patchset to improve the transaction
rate on higher CPU count machines. The previous versions can be found
here:

https://lore.kernel.org/linux-xfs/20200512025949.1807131-1-david@fromorbit.com/
https://lore.kernel.org/linux-xfs/20200512092811.1846252-1-david@fromorbit.com/

Changes for v3 are:
- completely reorganise the struct xfs_mount, not just a subset of
  the variables. This seems to improve performance even more than
  the original version. (Christoph requested this.)
- remove the m_active_trans counter rather than convert it to percpu
  as Darrick suggested.

Results are just as good as or better than the previous versions, and
the series passes fstests without regressions on both low CPU count and
high CPU count test VMs.

Cheers,

Dave.



* [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-19 22:23 [PATCH 0/2 v3] xfs: improve transaction rate scalability Dave Chinner
@ 2020-05-19 22:23 ` Dave Chinner
  2020-05-20  6:57   ` Christoph Hellwig
                     ` (2 more replies)
  2020-05-19 22:23 ` [PATCH 2/2] xfs: remove the m_active_trans counter Dave Chinner
  1 sibling, 3 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-19 22:23 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Seeing massive cpu usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, only one machien is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

Suspect is cacheline contention on m_sectbb_log which is used in one
of the macros in xfs_agino_range. This is a read-only variable but
shares a cacheline with m_active_trans which is a global atomic that
gets bounced all around the machine.

The workload is trying to run hundreds of thousands of transactions
per second and hence cacheline contention will be occuring on this
atomic counter. Hence xfs_agino_range() is likely just an
innocent bystander as the cache coherency protocol fights over the
cacheline between CPU cores and sockets.
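
To make the false sharing concrete, here is a minimal illustrative
userspace sketch (not code from this patch; field names are borrowed
from the structures above and a 64-byte cache line is assumed). With
both fields on one line, every write to the counter steals the line
from CPUs that only read the read-only field; aligning the hot counter
to its own line confines the bouncing:

#include <stdalign.h>
#include <stdatomic.h>

/* Bad: a read-mostly field shares a cache line with a hot counter,
 * so reads of sectbb_log miss whenever the counter is written. */
struct mount_bad {
	unsigned char	sectbb_log;	/* read-only after mount */
	atomic_int	active_trans;	/* written on every transaction */
};

/* Good: the hot counter starts on its own cache line, so reads of
 * sectbb_log stay cached cleanly on every CPU. */
struct mount_good {
	unsigned char	sectbb_log;	/* read-only after mount */
	alignas(64) atomic_int	active_trans;
};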

On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
seen on machine B, and it still runs substantially slower than it should.

Current rm -rf of 50 million files:

		vanilla		patched
machine A	13m20s		6m42s
machine B	5m30s		5m02s

It's an improvement, hence indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.
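
A quick way to verify where the hot and cold fields land relative to
cacheline boundaries (a suggestion, not a tool used in this series) is
pahole from the dwarves package, run against a kernel built with debug
info:

  $ pahole -C xfs_mount vmlinux

pahole prints each member's offset and size and marks the cacheline
boundaries, which makes accidental sharing like the above easy to spot.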

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h | 148 +++++++++++++++++++++++++--------------------
 1 file changed, 82 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 4835581f3eb00..c1f92c1847bb2 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -55,61 +55,25 @@ struct xfs_error_cfg {
 	long		retry_timeout;	/* in jiffies, -1 = infinite */
 };
 
+/*
+ * The struct xfs_mount layout is optimised to separate read-mostly variables
+ * from variables that are frequently modified. We put the read-mostly variables
+ * first, then place all the other variables at the end.
+ *
+ * Typically, read-mostly variables are those that are set at mount time and
+ * never changed again, or only change rarely as a result of things like sysfs
+ * knobs being tweaked.
+ */
 typedef struct xfs_mount {
+	struct xfs_sb		m_sb;		/* copy of fs superblock */
 	struct super_block	*m_super;
-
-	/*
-	 * Bitsets of per-fs metadata that have been checked and/or are sick.
-	 * Callers must hold m_sb_lock to access these two fields.
-	 */
-	uint8_t			m_fs_checked;
-	uint8_t			m_fs_sick;
-	/*
-	 * Bitsets of rt metadata that have been checked and/or are sick.
-	 * Callers must hold m_sb_lock to access this field.
-	 */
-	uint8_t			m_rt_checked;
-	uint8_t			m_rt_sick;
-
 	struct xfs_ail		*m_ail;		/* fs active log item list */
-
-	struct xfs_sb		m_sb;		/* copy of fs superblock */
-	spinlock_t		m_sb_lock;	/* sb counter lock */
-	struct percpu_counter	m_icount;	/* allocated inodes counter */
-	struct percpu_counter	m_ifree;	/* free inodes counter */
-	struct percpu_counter	m_fdblocks;	/* free block counter */
-	/*
-	 * Count of data device blocks reserved for delayed allocations,
-	 * including indlen blocks.  Does not include allocated CoW staging
-	 * extents or anything related to the rt device.
-	 */
-	struct percpu_counter	m_delalloc_blks;
-
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
-	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
-	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
-	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
-	/*
-	 * Optional cache of rt summary level per bitmap block with the
-	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
-	 * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip
-	 * inode lock.
-	 */
-	uint8_t			*m_rsum_cache;
 	struct xfs_inode	*m_rbmip;	/* pointer to bitmap inode */
 	struct xfs_inode	*m_rsumip;	/* pointer to summary inode */
 	struct xfs_inode	*m_rootip;	/* pointer to root directory */
@@ -117,9 +81,26 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+	/*
+	 * Optional cache of rt summary level per bitmap block with the
+	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
+	 * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip
+	 * inode lock.
+	 */
+	uint8_t			*m_rsum_cache;
+	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
+	struct workqueue_struct *m_buf_workqueue;
+	struct workqueue_struct	*m_unwritten_workqueue;
+	struct workqueue_struct	*m_cil_workqueue;
+	struct workqueue_struct	*m_reclaim_workqueue;
+	struct workqueue_struct *m_eofblocks_workqueue;
+	struct workqueue_struct	*m_sync_workqueue;
+
+	int			m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,47 +119,83 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+						/* low free space thresholds */
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores; /* no per-AG finobt resv. */
+	bool			m_update_sb;	/* sb needs update in mount */
+
+	/*
+	 * Bitsets of per-fs metadata that have been checked and/or are sick.
+	 * Callers must hold m_sb_lock to access these two fields.
+	 */
+	uint8_t			m_fs_checked;
+	uint8_t			m_fs_sick;
+	/*
+	 * Bitsets of rt metadata that have been checked and/or are sick.
+	 * Callers must hold m_sb_lock to access this field.
+	 */
+	uint8_t			m_rt_checked;
+	uint8_t			m_rt_sick;
+
+	/*
+	 * End of read-mostly variables. Frequently written variables and locks
+	 * should be placed below this comment from now on. The first variable
+	 * here is marked as cacheline aligned so that it is separated from
+	 * the read-mostly variables.
+	 */
+
+	spinlock_t ____cacheline_aligned m_sb_lock; /* sb counter lock */
+	struct percpu_counter	m_icount;	/* allocated inodes counter */
+	struct percpu_counter	m_ifree;	/* free inodes counter */
+	struct percpu_counter	m_fdblocks;	/* free block counter */
+	/*
+	 * Count of data device blocks reserved for delayed allocations,
+	 * including indlen blocks.  Does not include allocated CoW staging
+	 * extents or anything related to the rt device.
+	 */
+	struct percpu_counter	m_delalloc_blks;
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
-	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
-	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
 	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
 	struct xstats		m_stats;	/* per-fs stats */
+	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
+	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
+	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
 
 	/*
 	 * Workqueue item so that we can coalesce multiple inode flush attempts
 	 * into a single flush.
 	 */
 	struct work_struct	m_flush_inodes_work;
-	struct workqueue_struct *m_buf_workqueue;
-	struct workqueue_struct	*m_unwritten_workqueue;
-	struct workqueue_struct	*m_cil_workqueue;
-	struct workqueue_struct	*m_reclaim_workqueue;
-	struct workqueue_struct *m_eofblocks_workqueue;
-	struct workqueue_struct	*m_sync_workqueue;
 
 	/*
 	 * Generation of the filesystem layout.  This is incremented by each
@@ -190,9 +207,8 @@ typedef struct xfs_mount {
 	 * to various other kinds of pain inflicted on the pNFS server.
 	 */
 	uint32_t		m_generation;
+	struct mutex		m_growlock;	/* growfs mutex */
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
-- 
2.26.2.761.g0e0b3e54be



* [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-19 22:23 [PATCH 0/2 v3] xfs: improve transaction rate scalability Dave Chinner
  2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-19 22:23 ` Dave Chinner
  2020-05-20  7:01   ` Christoph Hellwig
  2020-05-20 20:47   ` Darrick J. Wong
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-19 22:23 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's a global atomic counter, and we are hitting it at a rate of
half a million transactions a second, so it's bouncing the counter
cacheline all over the place on large machines. We don't actually
need it anymore - it used to be required because the VFS freeze code
could not track/prevent filesystem transactions that were running,
but that problem no longer exists.

Hence to remove the counter, we simply have to ensure that nothing
calls xfs_sync_sb() while we are trying to quiesce the filesystem.
That only happens if the log worker is still running when we call
xfs_quiesce_attr(). The log worker is cancelled at the end of
xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
early here and then we can remove the counter altogether.
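
For background, the VFS already tracks transaction lifetimes via the
per-cpu freeze counters, so the global atomic merely duplicated that
tracking at much higher cost. A condensed sketch of the two call sites
(simplified from the code below, not a verbatim excerpt):

	/* transaction start (xfs_trans_alloc, condensed) */
	sb_start_intwrite(mp->m_super);		/* per-cpu, scales */
	atomic_inc(&mp->m_active_trans);	/* global, bounces: removed */

	/* transaction end (xfs_trans_free, condensed) */
	atomic_dec(&tp->t_mountp->m_active_trans);	/* removed */
	sb_end_intwrite(tp->t_mountp->m_super);		/* per-cpu */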

Concurrent create, 50 million inodes, identical 16p/16GB virtual
machines on different physical hosts. Machine A has twice the CPU
cores per socket of machine B:

		unpatched	patched
machine A:	3m16s		2m00s
machine B:	4m04s		4m05s

Create rates:
		unpatched	patched
machine A:	282k+/-31k	468k+/-21k
machine B:	231k+/-8k	233k+/-11k

Concurrent rm of same 50 million inodes:

		unpatched	patched
machine A:	6m42s		2m33s
machine B:	4m47s		4m47s

The transaction rate on the fast machine went from just under
300k/sec to 700k/sec, which indicates just how much of a bottleneck
this atomic counter was.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h |  1 -
 fs/xfs/xfs_super.c | 17 +++++------------
 fs/xfs/xfs_trans.c | 27 +++++++++++----------------
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c1f92c1847bb2..3725d25ad97e8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -176,7 +176,6 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	atomic_t		m_active_trans;	/* number trans frozen */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct delayed_work	m_eofblocks_work; /* background eof blocks
 						     trimming */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index aae469f73efeb..fa58cb07c8fdf 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp)
  * there is no log replay required to write the inodes to disk - this is the
  * primary difference between a sync and a quiesce.
  *
- * Note: xfs_log_quiesce() stops background log work - the callers must ensure
- * it is started again when appropriate.
+ * We cancel log work early here to ensure all transactions the log worker may
+ * run have finished before we clean up and log the superblock and write an
+ * unmount record. The unfreeze process is responsible for restarting the log
+ * worker correctly.
  */
 void
 xfs_quiesce_attr(
@@ -883,9 +885,7 @@ xfs_quiesce_attr(
 {
 	int	error = 0;
 
-	/* wait for all modifications to complete */
-	while (atomic_read(&mp->m_active_trans) > 0)
-		delay(100);
+	cancel_delayed_work_sync(&mp->m_log->l_work);
 
 	/* force the log to unpin objects from the now complete transactions */
 	xfs_log_force(mp, XFS_LOG_SYNC);
@@ -899,12 +899,6 @@ xfs_quiesce_attr(
 	if (error)
 		xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. "
 				"Frozen image may not be consistent.");
-	/*
-	 * Just warn here till VFS can correctly support
-	 * read-only remount without racing.
-	 */
-	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
-
 	xfs_log_quiesce(mp);
 }
 
@@ -1793,7 +1787,6 @@ static int xfs_init_fs_context(
 	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
 	spin_lock_init(&mp->m_perag_lock);
 	mutex_init(&mp->m_growlock);
-	atomic_set(&mp->m_active_trans, 0);
 	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
 	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
 	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index b055a5ab53465..217937d743dbb 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -68,7 +68,6 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
-	atomic_dec(&tp->t_mountp->m_active_trans);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -125,8 +124,6 @@ xfs_trans_dup(
 	xfs_defer_move(ntp, tp);
 
 	xfs_trans_dup_dqinfo(tp, ntp);
-
-	atomic_inc(&tp->t_mountp->m_active_trans);
 	return ntp;
 }
 
@@ -275,7 +272,6 @@ xfs_trans_alloc(
 	 */
 	WARN_ON(resp->tr_logres > 0 &&
 		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
-	atomic_inc(&mp->m_active_trans);
 
 	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
 	tp->t_flags = flags;
@@ -299,20 +295,19 @@ xfs_trans_alloc(
 
 /*
  * Create an empty transaction with no reservation.  This is a defensive
- * mechanism for routines that query metadata without actually modifying
- * them -- if the metadata being queried is somehow cross-linked (think a
- * btree block pointer that points higher in the tree), we risk deadlock.
- * However, blocks grabbed as part of a transaction can be re-grabbed.
- * The verifiers will notice the corrupt block and the operation will fail
- * back to userspace without deadlocking.
+ * mechanism for routines that query metadata without actually modifying them --
+ * if the metadata being queried is somehow cross-linked (think a btree block
+ * pointer that points higher in the tree), we risk deadlock.  However, blocks
+ * grabbed as part of a transaction can be re-grabbed.  The verifiers will
+ * notice the corrupt block and the operation will fail back to userspace
+ * without deadlocking.
  *
- * Note the zero-length reservation; this transaction MUST be cancelled
- * without any dirty data.
+ * Note the zero-length reservation; this transaction MUST be cancelled without
+ * any dirty data.
  *
- * Callers should obtain freeze protection to avoid two conflicts with fs
- * freezing: (1) having active transactions trip the m_active_trans ASSERTs;
- * and (2) grabbing buffers at the same time that freeze is trying to drain
- * the buffer LRU list.
+ * Callers should obtain freeze protection to avoid a conflict with fs freezing
+ * where we can be grabbing buffers at the same time that freeze is trying to
+ * drain the buffer LRU list.
  */
 int
 xfs_trans_alloc_empty(
-- 
2.26.2.761.g0e0b3e54be



* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-20  6:57   ` Christoph Hellwig
  2020-05-20  7:12     ` Dave Chinner
  2020-05-20  9:43   ` Chaitanya Kulkarni
  2020-05-20 20:44   ` Darrick J. Wong
  2 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-20  6:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Shouldn't m_errortag and m_errortag_kobj also go into the read-mostly
section?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-19 22:23 ` [PATCH 2/2] xfs: remove the m_active_trans counter Dave Chinner
@ 2020-05-20  7:01   ` Christoph Hellwig
  2020-05-20  7:13     ` Dave Chinner
  2020-05-20 20:47   ` Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-20  7:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a global atomic counter, and we are hitting it at a rate of
> half a million transactions a second, so it's bouncing the counter
> cacheline all over the place on large machines. We don't actually
> need it anymore - it used to be required because the VFS freeze code
> could not track/prevent filesystem transactions that were running,
> but that problem no longer exists.
> 
> Hence to remove the counter, we simply have to ensure that nothing
> calls xfs_sync_sb() while we are trying to quiesce the filesystem.
> That only happens if the log worker is still running when we call
> xfs_quiesce_attr(). The log worker is cancelled at the end of
> xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
> early here and then we can remove the counter altogether.
> 
> Concurrent create, 50 million inodes, identical 16p/16GB virtual
> machines on different physical hosts. Machine A has twice the CPU
> cores per socket of machine B:
> 
> 		unpatched	patched
> machine A:	3m16s		2m00s
> machine B:	4m04s		4m05s
> 
> Create rates:
> 		unpatched	patched
> machine A:	282k+/-31k	468k+/-21k
> machine B:	231k+/-8k	233k+/-11k
> 
> Concurrent rm of same 50 million inodes:
> 
> 		unpatched	patched
> machine A:	6m42s		2m33s
> machine B:	4m47s		4m47s
> 
> The transaction rate on the fast machine went from just under
> 300k/sec to 700k/sec, which indicates just how much of a bottleneck
> this atomic counter was.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_mount.h |  1 -
>  fs/xfs/xfs_super.c | 17 +++++------------
>  fs/xfs/xfs_trans.c | 27 +++++++++++----------------
>  3 files changed, 16 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c1f92c1847bb2..3725d25ad97e8 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -176,7 +176,6 @@ typedef struct xfs_mount {
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
>  						     trimming */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index aae469f73efeb..fa58cb07c8fdf 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp)
>   * there is no log replay required to write the inodes to disk - this is the
>   * primary difference between a sync and a quiesce.
>   *
> + * We cancel log work early here to ensure all transactions the log worker may
> + * run have finished before we clean up and log the superblock and write an
> + * unmount record. The unfreeze process is responsible for restarting the log
> + * worker correctly.
>   */
>  void
>  xfs_quiesce_attr(
> @@ -883,9 +885,7 @@ xfs_quiesce_attr(
>  {
>  	int	error = 0;
>  
> -	/* wait for all modifications to complete */
> -	while (atomic_read(&mp->m_active_trans) > 0)
> -		delay(100);
> +	cancel_delayed_work_sync(&mp->m_log->l_work);

Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce
be removed now given that we've already cancelled it here?


* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-20  6:57   ` Christoph Hellwig
@ 2020-05-20  7:12     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-20  7:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, May 19, 2020 at 11:57:43PM -0700, Christoph Hellwig wrote:
> Shouldn't m_errortag and m_errortag_kobj also go into the read-mostly
> section?
> 
> Otherwise looks good:

kobjs are reference counted and full of random stuff like lists,
sysfs references, etc, so they aren't obviously read-mostly
variables. I left all the kobjs in the write section for
this reason. The errortag stuff is also debug-only code, so I didn't
bother touching it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-20  7:01   ` Christoph Hellwig
@ 2020-05-20  7:13     ` Dave Chinner
  2020-05-20  7:17       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-05-20  7:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Wed, May 20, 2020 at 12:01:52AM -0700, Christoph Hellwig wrote:
> On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It's a global atomic counter, and we are hitting it at a rate of
> > half a million transactions a second, so it's bouncing the counter
> > cacheline all over the place on large machines. We don't actually
> > need it anymore - it used to be required because the VFS freeze code
> > could not track/prevent filesystem transactions that were running,
> > but that problem no longer exists.
> > 
> > Hence to remove the counter, we simply have to ensure that nothing
> > calls xfs_sync_sb() while we are trying to quiesce the filesystem.
> > That only happens if the log worker is still running when we call
> > xfs_quiesce_attr(). The log worker is cancelled at the end of
> > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
> > early here and then we can remove the counter altogether.
> > 
> > Concurrent create, 50 million inodes, identical 16p/16GB virtual
> > machines on different physical hosts. Machine A has twice the CPU
> > cores per socket of machine B:
> > 
> > 		unpatched	patched
> > machine A:	3m16s		2m00s
> > machine B:	4m04s		4m05s
> > 
> > Create rates:
> > 		unpatched	patched
> > machine A:	282k+/-31k	468k+/-21k
> > machine B:	231k+/-8k	233k+/-11k
> > 
> > Concurrent rm of same 50 million inodes:
> > 
> > 		unpatched	patched
> > machine A:	6m42s		2m33s
> > machine B:	4m47s		4m47s
> > 
> > The transaction rate on the fast machine went from just under
> > 300k/sec to 700k/sec, which indicates just how much of a bottleneck
> > this atomic counter was.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_mount.h |  1 -
> >  fs/xfs/xfs_super.c | 17 +++++------------
> >  fs/xfs/xfs_trans.c | 27 +++++++++++----------------
> >  3 files changed, 16 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index c1f92c1847bb2..3725d25ad97e8 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -176,7 +176,6 @@ typedef struct xfs_mount {
> >  	uint64_t		m_resblks;	/* total reserved blocks */
> >  	uint64_t		m_resblks_avail;/* available reserved blocks */
> >  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> > -	atomic_t		m_active_trans;	/* number trans frozen */
> >  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
> >  	struct delayed_work	m_eofblocks_work; /* background eof blocks
> >  						     trimming */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index aae469f73efeb..fa58cb07c8fdf 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp)
> >   * there is no log replay required to write the inodes to disk - this is the
> >   * primary difference between a sync and a quiesce.
> >   *
> > + * We cancel log work early here to ensure all transactions the log worker may
> > + * run have finished before we clean up and log the superblock and write an
> > + * unmount record. The unfreeze process is responsible for restarting the log
> > + * worker correctly.
> >   */
> >  void
> >  xfs_quiesce_attr(
> > @@ -883,9 +885,7 @@ xfs_quiesce_attr(
> >  {
> >  	int	error = 0;
> >  
> > -	/* wait for all modifications to complete */
> > -	while (atomic_read(&mp->m_active_trans) > 0)
> > -		delay(100);
> > +	cancel_delayed_work_sync(&mp->m_log->l_work);
> 
> Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce
> be removed now given that we've already cancelled it here?

No, because every other caller of xfs_log_quiesce() requires the
work to be cancelled before the log is quiesced.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-20  7:13     ` Dave Chinner
@ 2020-05-20  7:17       ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-20  7:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Wed, May 20, 2020 at 05:13:39PM +1000, Dave Chinner wrote:
> > > @@ -883,9 +885,7 @@ xfs_quiesce_attr(
> > >  {
> > >  	int	error = 0;
> > >  
> > > -	/* wait for all modifications to complete */
> > > -	while (atomic_read(&mp->m_active_trans) > 0)
> > > -		delay(100);
> > > +	cancel_delayed_work_sync(&mp->m_log->l_work);
> > 
> > Shouldn't the cancel_delayed_work_sync for l_work in xfs_log_quiesce
> > be removed now given that we've already cancelled it here?
> 
> No, because every other caller of xfs_log_quiesce() requires the
> work to be cancelled before the log is quiesced.

The only other caller is xfs_log_unmount, which could grow it.  But
I guess an extra cancel isn't harmful in the end.
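
To spell out the resulting ordering, here is a simplified sketch of the
patched function (condensed from the hunk above; inode reclaim and
superblock logging elided):

	void
	xfs_quiesce_attr(
		struct xfs_mount	*mp)
	{
		/* stop the log worker so nothing calls xfs_sync_sb() */
		cancel_delayed_work_sync(&mp->m_log->l_work);

		/* force the log to unpin completed transactions */
		xfs_log_force(mp, XFS_LOG_SYNC);

		/* ... reclaim and superblock logging elided ... */

		/*
		 * xfs_log_quiesce() cancels l_work again; the second
		 * cancel_delayed_work_sync() is a harmless no-op.
		 */
		xfs_log_quiesce(mp);
	}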


* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
  2020-05-20  6:57   ` Christoph Hellwig
@ 2020-05-20  9:43   ` Chaitanya Kulkarni
  2020-05-20 20:44   ` Darrick J. Wong
  2 siblings, 0 replies; 15+ messages in thread
From: Chaitanya Kulkarni @ 2020-05-20  9:43 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 5/19/20 3:23 PM, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
> 
> Seeing massive cpu usage from xfs_agino_range() on one machine;
> instruction level profiles look similar to another machine running
> the same workload, only one machien is consuming 10x as much CPU as
's/machien/machine/', can be done at the time of applying patch.
> the other and going much slower. The only real difference between
> the two machines is core count per socket. Both are running
> identical 16p/16GB virtual machine configurations.
> 
> Machine A:
> 
>    25.83%  [k] xfs_agino_range
>    12.68%  [k] __xfs_dir3_data_check
>     6.95%  [k] xfs_verify_ino
>     6.78%  [k] xfs_dir2_data_entry_tag_p
>     3.56%  [k] xfs_buf_find
>     2.31%  [k] xfs_verify_dir_ino
>     2.02%  [k] xfs_dabuf_map.constprop.0
>     1.65%  [k] xfs_ag_block_count
> 
> And takes around 13 minutes to remove 50 million inodes.
> 
> Machine B:
> 
>    13.90%  [k] __pv_queued_spin_lock_slowpath
>     3.76%  [k] do_raw_spin_lock
>     2.83%  [k] xfs_dir3_leaf_check_int
>     2.75%  [k] xfs_agino_range
>     2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
>     2.18%  [k] __xfs_dir3_data_check
>     2.02%  [k] xfs_log_commit_cil
> 
> And takes around 5m30s to remove 50 million inodes.
> 
> Suspect is cacheline contention on m_sectbb_log which is used in one
> of the macros in xfs_agino_range. This is a read-only variable but
> shares a cacheline with m_active_trans which is a global atomic that
> gets bounced all around the machine.
> 
> The workload is trying to run hundreds of thousands of transactions
> per second and hence cacheline contention will be occuring on this
's/occuring/occurring/', can be done at the time of applying patch.
> atomic counter. Hence xfs_agino_range() is likely just an
> innocent bystander as the cache coherency protocol fights over the
> cacheline between CPU cores and sockets.
> 
> On machine A, this rearrangement of the struct xfs_mount
> results in the profile changing to:
> 
>     9.77%  [kernel]  [k] xfs_agino_range
>     6.27%  [kernel]  [k] __xfs_dir3_data_check
>     5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>     4.54%  [kernel]  [k] xfs_buf_find
>     3.79%  [kernel]  [k] do_raw_spin_lock
>     3.39%  [kernel]  [k] xfs_verify_ino
>     2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> 
> Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
> seen on machine B, and it still runs substantially slower than it should.
> 
> Current rm -rf of 50 million files:
> 
> 		vanilla		patched
> machine A	13m20s		6m42s
> machine B	5m30s		5m02s
> 
> It's an improvement, hence indicating that separation and further
> optimisation of read-only global filesystem data is worthwhile, but
> it clearly isn't the underlying issue causing this specific
> performance degradation.
> 
> Signed-off-by: Dave Chinner<dchinner@redhat.com>
> ---



* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
  2020-05-20  6:57   ` Christoph Hellwig
  2020-05-20  9:43   ` Chaitanya Kulkarni
@ 2020-05-20 20:44   ` Darrick J. Wong
  2 siblings, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-20 20:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 20, 2020 at 08:23:09AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Seeing massive cpu usage from xfs_agino_range() on one machine;
> instruction level profiles look similar to another machine running
> the same workload, only one machien is consuming 10x as much CPU as
> the other and going much slower. The only real difference between
> the two machines is core count per socket. Both are running
> identical 16p/16GB virtual machine configurations.
> 
> Machine A:
> 
>   25.83%  [k] xfs_agino_range
>   12.68%  [k] __xfs_dir3_data_check
>    6.95%  [k] xfs_verify_ino
>    6.78%  [k] xfs_dir2_data_entry_tag_p
>    3.56%  [k] xfs_buf_find
>    2.31%  [k] xfs_verify_dir_ino
>    2.02%  [k] xfs_dabuf_map.constprop.0
>    1.65%  [k] xfs_ag_block_count
> 
> And takes around 13 minutes to remove 50 million inodes.
> 
> Machine B:
> 
>   13.90%  [k] __pv_queued_spin_lock_slowpath
>    3.76%  [k] do_raw_spin_lock
>    2.83%  [k] xfs_dir3_leaf_check_int
>    2.75%  [k] xfs_agino_range
>    2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
>    2.18%  [k] __xfs_dir3_data_check
>    2.02%  [k] xfs_log_commit_cil
> 
> And takes around 5m30s to remove 50 million inodes.
> 
> Suspect is cacheline contention on m_sectbb_log which is used in one
> of the macros in xfs_agino_range. This is a read-only variable but
> shares a cacheline with m_active_trans which is a global atomic that
> gets bounced all around the machine.
> 
> The workload is trying to run hundreds of thousands of transactions
> per second and hence cacheline contention will be occuring on this
> atomic counter. Hence xfs_agino_range() is likely just an
> innocent bystander as the cache coherency protocol fights over the
> cacheline between CPU cores and sockets.
> 
> On machine A, this rearrangement of the struct xfs_mount
> results in the profile changing to:
> 
>    9.77%  [kernel]  [k] xfs_agino_range
>    6.27%  [kernel]  [k] __xfs_dir3_data_check
>    5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    4.54%  [kernel]  [k] xfs_buf_find
>    3.79%  [kernel]  [k] do_raw_spin_lock
>    3.39%  [kernel]  [k] xfs_verify_ino
>    2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> 
> Vastly less CPU usage in xfs_agino_range(), but still 3x the amount
> seen on machine B, and it still runs substantially slower than it should.
> 
> Current rm -rf of 50 million files:
> 
> 		vanilla		patched
> machine A	13m20s		6m42s
> machine B	5m30s		5m02s
> 
> It's an improvement, hence indicating that separation and further
> optimisation of read-only global filesystem data is worthwhile, but
> it clearly isn't the underlying issue causing this specific
> performance degradation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_mount.h | 148 +++++++++++++++++++++++++--------------------
>  1 file changed, 82 insertions(+), 66 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 4835581f3eb00..c1f92c1847bb2 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -55,61 +55,25 @@ struct xfs_error_cfg {
>  	long		retry_timeout;	/* in jiffies, -1 = infinite */
>  };
>  
> +/*
> + * The struct xfs_mount layout is optimised to separate read-mostly variables
> + * from variables that are frequently modified. We put the read-mostly variables
> + * first, then place all the other variables at the end.
> + *
> + * Typically, read-mostly variables are those that are set at mount time and
> + * never changed again, or only change rarely as a result of things like sysfs
> + * knobs being tweaked.
> + */
>  typedef struct xfs_mount {
> +	struct xfs_sb		m_sb;		/* copy of fs superblock */
>  	struct super_block	*m_super;
> -
> -	/*
> -	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> -	 * Callers must hold m_sb_lock to access these two fields.
> -	 */
> -	uint8_t			m_fs_checked;
> -	uint8_t			m_fs_sick;
> -	/*
> -	 * Bitsets of rt metadata that have been checked and/or are sick.
> -	 * Callers must hold m_sb_lock to access this field.
> -	 */
> -	uint8_t			m_rt_checked;
> -	uint8_t			m_rt_sick;
> -
>  	struct xfs_ail		*m_ail;		/* fs active log item list */
> -
> -	struct xfs_sb		m_sb;		/* copy of fs superblock */
> -	spinlock_t		m_sb_lock;	/* sb counter lock */
> -	struct percpu_counter	m_icount;	/* allocated inodes counter */
> -	struct percpu_counter	m_ifree;	/* free inodes counter */
> -	struct percpu_counter	m_fdblocks;	/* free block counter */
> -	/*
> -	 * Count of data device blocks reserved for delayed allocations,
> -	 * including indlen blocks.  Does not include allocated CoW staging
> -	 * extents or anything related to the rt device.
> -	 */
> -	struct percpu_counter	m_delalloc_blks;
> -
>  	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
>  	char			*m_rtname;	/* realtime device name */
>  	char			*m_logname;	/* external log device name */
> -	int			m_bsize;	/* fs logical block size */
> -	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
> -	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
> -	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
> -	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> -	uint			m_allocsize_log;/* min write size log bytes */
> -	uint			m_allocsize_blocks; /* min write size blocks */
>  	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
>  	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
>  	struct xlog		*m_log;		/* log specific stuff */
> -	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
> -	int			m_logbufs;	/* number of log buffers */
> -	int			m_logbsize;	/* size of each log buffer */
> -	uint			m_rsumlevels;	/* rt summary levels */
> -	uint			m_rsumsize;	/* size of rt summary, bytes */
> -	/*
> -	 * Optional cache of rt summary level per bitmap block with the
> -	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> -	 * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip
> -	 * inode lock.
> -	 */
> -	uint8_t			*m_rsum_cache;
>  	struct xfs_inode	*m_rbmip;	/* pointer to bitmap inode */
>  	struct xfs_inode	*m_rsumip;	/* pointer to summary inode */
>  	struct xfs_inode	*m_rootip;	/* pointer to root directory */
> @@ -117,9 +81,26 @@ typedef struct xfs_mount {
>  	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
>  	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
>  	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
> +	/*
> +	 * Optional cache of rt summary level per bitmap block with the
> +	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
> +	 * rsum[i][bbno] != 0. Reads and writes are serialized by the rsumip
> +	 * inode lock.
> +	 */
> +	uint8_t			*m_rsum_cache;
> +	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
> +	struct workqueue_struct *m_buf_workqueue;
> +	struct workqueue_struct	*m_unwritten_workqueue;
> +	struct workqueue_struct	*m_cil_workqueue;
> +	struct workqueue_struct	*m_reclaim_workqueue;
> +	struct workqueue_struct *m_eofblocks_workqueue;
> +	struct workqueue_struct	*m_sync_workqueue;
> +
> +	int			m_bsize;	/* fs logical block size */
>  	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
>  	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
>  	uint8_t			m_agno_log;	/* log #ag's */
> +	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
>  	uint			m_blockmask;	/* sb_blocksize-1 */
>  	uint			m_blockwsize;	/* sb_blocksize in words */
>  	uint			m_blockwmask;	/* blockwsize-1 */
> @@ -138,47 +119,83 @@ typedef struct xfs_mount {
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
>  	uint			m_alloc_set_aside; /* space we can't use */
>  	uint			m_ag_max_usable; /* max space per AG */
> -	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> -	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
> -	struct mutex		m_growlock;	/* growfs mutex */
> +	int			m_dalign;	/* stripe unit */
> +	int			m_swidth;	/* stripe width */
> +	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
> +	uint			m_allocsize_log;/* min write size log bytes */
> +	uint			m_allocsize_blocks; /* min write size blocks */
> +	int			m_logbufs;	/* number of log buffers */
> +	int			m_logbsize;	/* size of each log buffer */
> +	uint			m_rsumlevels;	/* rt summary levels */
> +	uint			m_rsumsize;	/* size of rt summary, bytes */
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
> -	uint64_t		m_flags;	/* global mount flags */
> -	bool			m_finobt_nores; /* no per-AG finobt resv. */
>  	uint			m_qflags;	/* quota status flags */
> +	uint64_t		m_flags;	/* global mount flags */
> +	int64_t			m_low_space[XFS_LOWSP_MAX];
> +						/* low free space thresholds */
> +	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
>  	struct xfs_trans_resv	m_resv;		/* precomputed res values */
> +	bool			m_always_cow;
> +	bool			m_fail_unmount;
> +	bool			m_finobt_nores; /* no per-AG finobt resv. */
> +	bool			m_update_sb;	/* sb needs update in mount */
> +
> +	/*
> +	 * Bitsets of per-fs metadata that have been checked and/or are sick.
> +	 * Callers must hold m_sb_lock to access these two fields.
> +	 */
> +	uint8_t			m_fs_checked;
> +	uint8_t			m_fs_sick;
> +	/*
> +	 * Bitsets of rt metadata that have been checked and/or are sick.
> +	 * Callers must hold m_sb_lock to access this field.
> +	 */
> +	uint8_t			m_rt_checked;
> +	uint8_t			m_rt_sick;
> +
> +	/*
> +	 * End of read-mostly variables. Frequently written variables and locks
> +	 * should be placed below this comment from now on. The first variable
> +	 * here is marked as cacheline aligned so that it is separated from
> +	 * the read-mostly variables.
> +	 */
> +
> +	spinlock_t ____cacheline_aligned m_sb_lock; /* sb counter lock */
> +	struct percpu_counter	m_icount;	/* allocated inodes counter */
> +	struct percpu_counter	m_ifree;	/* free inodes counter */
> +	struct percpu_counter	m_fdblocks;	/* free block counter */
> +	/*
> +	 * Count of data device blocks reserved for delayed allocations,
> +	 * including indlen blocks.  Does not include allocated CoW staging
> +	 * extents or anything related to the rt device.
> +	 */
> +	struct percpu_counter	m_delalloc_blks;
> +
> +	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
> +	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	int			m_dalign;	/* stripe unit */
> -	int			m_swidth;	/* stripe width */
> -	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
>  	atomic_t		m_active_trans;	/* number trans frozen */
> -	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
>  						     trimming */
>  	struct delayed_work	m_cowblocks_work; /* background cow blocks
>  						     trimming */
> -	bool			m_update_sb;	/* sb needs update in mount */
> -	int64_t			m_low_space[XFS_LOWSP_MAX];
> -						/* low free space thresholds */
>  	struct xfs_kobj		m_kobj;
>  	struct xfs_kobj		m_error_kobj;
>  	struct xfs_kobj		m_error_meta_kobj;
>  	struct xfs_error_cfg	m_error_cfg[XFS_ERR_CLASS_MAX][XFS_ERR_ERRNO_MAX];
>  	struct xstats		m_stats;	/* per-fs stats */
> +	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
> +	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
> +	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
>  
>  	/*
>  	 * Workqueue item so that we can coalesce multiple inode flush attempts
>  	 * into a single flush.
>  	 */
>  	struct work_struct	m_flush_inodes_work;
> -	struct workqueue_struct *m_buf_workqueue;
> -	struct workqueue_struct	*m_unwritten_workqueue;
> -	struct workqueue_struct	*m_cil_workqueue;
> -	struct workqueue_struct	*m_reclaim_workqueue;
> -	struct workqueue_struct *m_eofblocks_workqueue;
> -	struct workqueue_struct	*m_sync_workqueue;
>  
>  	/*
>  	 * Generation of the filesystem layout.  This is incremented by each
> @@ -190,9 +207,8 @@ typedef struct xfs_mount {
>  	 * to various other kinds of pain inflicted on the pNFS server.
>  	 */
>  	uint32_t		m_generation;
> +	struct mutex		m_growlock;	/* growfs mutex */
>  
> -	bool			m_always_cow;
> -	bool			m_fail_unmount;
>  #ifdef DEBUG
>  	/*
>  	 * Frequency with which errors are injected.  Replaces xfs_etest; the
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-19 22:23 ` [PATCH 2/2] xfs: remove the m_active_trans counter Dave Chinner
  2020-05-20  7:01   ` Christoph Hellwig
@ 2020-05-20 20:47   ` Darrick J. Wong
  2020-05-20 21:51     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2020-05-20 20:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a global atomic counter, and we are hitting it at a rate of
> half a million transactions a second, so it's bouncing the counter
> cacheline all over the place on large machines. We don't actually
> need it anymore - it used to be required because the VFS freeze code
> could not track/prevent filesystem transactions that were running,
> but that problem no longer exists.
> 
> Hence to remove the counter, we simply have to ensure that nothing
> calls xfs_sync_sb() while we are trying to quiesce the filesystem.
> That only happens if the log worker is still running when we call
> xfs_quiesce_attr(). The log worker is cancelled at the end of
> xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
> early here and then we can remove the counter altogether.
> 
> Concurrent create, 50 million inodes, identical 16p/16GB virtual
> machines on different physical hosts. Machine A has twice the CPU
> cores per socket of machine B:
> 
> 		unpatched	patched
> machine A:	3m16s		2m00s
> machine B:	4m04s		4m05s
> 
> Create rates:
> 		unpatched	patched
> machine A:	282k+/-31k	468k+/-21k
> machine B:	231k+/-8k	233k+/-11k
> 
> Concurrent rm of same 50 million inodes:
> 
> 		unpatched	patched
> machine A:	6m42s		2m33s
> machine B:	4m47s		4m47s
> 
> The transaction rate on the fast machine went from just under
> 300k/sec to 700k/sec, which indicates just how much of a bottleneck
> this atomic counter was.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

/me kinda wonders why removing the counter entirely has so little effect
on machine B, but seeing as I've been pondering killing this counter
myself for years,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_mount.h |  1 -
>  fs/xfs/xfs_super.c | 17 +++++------------
>  fs/xfs/xfs_trans.c | 27 +++++++++++----------------
>  3 files changed, 16 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index c1f92c1847bb2..3725d25ad97e8 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -176,7 +176,6 @@ typedef struct xfs_mount {
>  	uint64_t		m_resblks;	/* total reserved blocks */
>  	uint64_t		m_resblks_avail;/* available reserved blocks */
>  	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
> -	atomic_t		m_active_trans;	/* number trans frozen */
>  	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
>  	struct delayed_work	m_eofblocks_work; /* background eof blocks
>  						     trimming */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index aae469f73efeb..fa58cb07c8fdf 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -874,8 +874,10 @@ xfs_restore_resvblks(struct xfs_mount *mp)
>   * there is no log replay required to write the inodes to disk - this is the
>   * primary difference between a sync and a quiesce.
>   *
> - * Note: xfs_log_quiesce() stops background log work - the callers must ensure
> - * it is started again when appropriate.
> + * We cancel log work early here to ensure all transactions the log worker may
> + * run have finished before we clean up and log the superblock and write an
> + * unmount record. The unfreeze process is responsible for restarting the log
> + * worker correctly.
>   */
>  void
>  xfs_quiesce_attr(
> @@ -883,9 +885,7 @@ xfs_quiesce_attr(
>  {
>  	int	error = 0;
>  
> -	/* wait for all modifications to complete */
> -	while (atomic_read(&mp->m_active_trans) > 0)
> -		delay(100);
> +	cancel_delayed_work_sync(&mp->m_log->l_work);
>  
>  	/* force the log to unpin objects from the now complete transactions */
>  	xfs_log_force(mp, XFS_LOG_SYNC);
> @@ -899,12 +899,6 @@ xfs_quiesce_attr(
>  	if (error)
>  		xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. "
>  				"Frozen image may not be consistent.");
> -	/*
> -	 * Just warn here till VFS can correctly support
> -	 * read-only remount without racing.
> -	 */
> -	WARN_ON(atomic_read(&mp->m_active_trans) != 0);
> -
>  	xfs_log_quiesce(mp);
>  }
>  
> @@ -1793,7 +1787,6 @@ static int xfs_init_fs_context(
>  	INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
>  	spin_lock_init(&mp->m_perag_lock);
>  	mutex_init(&mp->m_growlock);
> -	atomic_set(&mp->m_active_trans, 0);
>  	INIT_WORK(&mp->m_flush_inodes_work, xfs_flush_inodes_worker);
>  	INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
>  	INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index b055a5ab53465..217937d743dbb 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -68,7 +68,6 @@ xfs_trans_free(
>  	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
>  
>  	trace_xfs_trans_free(tp, _RET_IP_);
> -	atomic_dec(&tp->t_mountp->m_active_trans);
>  	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
>  		sb_end_intwrite(tp->t_mountp->m_super);
>  	xfs_trans_free_dqinfo(tp);
> @@ -125,8 +124,6 @@ xfs_trans_dup(
>  	xfs_defer_move(ntp, tp);
>  
>  	xfs_trans_dup_dqinfo(tp, ntp);
> -
> -	atomic_inc(&tp->t_mountp->m_active_trans);
>  	return ntp;
>  }
>  
> @@ -275,7 +272,6 @@ xfs_trans_alloc(
>  	 */
>  	WARN_ON(resp->tr_logres > 0 &&
>  		mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
> -	atomic_inc(&mp->m_active_trans);
>  
>  	tp->t_magic = XFS_TRANS_HEADER_MAGIC;
>  	tp->t_flags = flags;
> @@ -299,20 +295,19 @@ xfs_trans_alloc(
>  
>  /*
>   * Create an empty transaction with no reservation.  This is a defensive
> - * mechanism for routines that query metadata without actually modifying
> - * them -- if the metadata being queried is somehow cross-linked (think a
> - * btree block pointer that points higher in the tree), we risk deadlock.
> - * However, blocks grabbed as part of a transaction can be re-grabbed.
> - * The verifiers will notice the corrupt block and the operation will fail
> - * back to userspace without deadlocking.
> + * mechanism for routines that query metadata without actually modifying them --
> + * if the metadata being queried is somehow cross-linked (think a btree block
> + * pointer that points higher in the tree), we risk deadlock.  However, blocks
> + * grabbed as part of a transaction can be re-grabbed.  The verifiers will
> + * notice the corrupt block and the operation will fail back to userspace
> + * without deadlocking.
>   *
> - * Note the zero-length reservation; this transaction MUST be cancelled
> - * without any dirty data.
> + * Note the zero-length reservation; this transaction MUST be cancelled without
> + * any dirty data.
>   *
> - * Callers should obtain freeze protection to avoid two conflicts with fs
> - * freezing: (1) having active transactions trip the m_active_trans ASSERTs;
> - * and (2) grabbing buffers at the same time that freeze is trying to drain
> - * the buffer LRU list.
> + * Callers should obtain freeze protection to avoid a conflict with fs freezing
> + * where we can be grabbing buffers at the same time that freeze is trying to
> + * drain the buffer LRU list.
>   */
>  int
>  xfs_trans_alloc_empty(
> -- 
> 2.26.2.761.g0e0b3e54be
> 


* Re: [PATCH 2/2] xfs: remove the m_active_trans counter
  2020-05-20 20:47   ` Darrick J. Wong
@ 2020-05-20 21:51     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-20 21:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 20, 2020 at 01:47:54PM -0700, Darrick J. Wong wrote:
> On Wed, May 20, 2020 at 08:23:10AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > It's a global atomic counter, and we are hitting it at a rate of
> > half a million transactions a second, so it's bouncing the counter
> > cacheline all over the place on large machines. We don't actually
> > need it anymore - it used to be required because the VFS freeze code
> > could not track/prevent filesystem transactions that were running,
> > but that problem no longer exists.
> > 
> > Hence to remove the counter, we simply have to ensure that nothing
> > calls xfs_sync_sb() while we are trying to quiesce the filesystem.
> > That only happens if the log worker is still running when we call
> > xfs_quiesce_attr(). The log worker is cancelled at the end of
> > xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it
> > early here and then we can remove the counter altogether.
> > 
> > Concurrent create, 50 million inodes, identical 16p/16GB virtual
> > machines on different physical hosts. Machine A has twice the CPU
> > cores per socket of machine B:
> > 
> > 		unpatched	patched
> > machine A:	3m16s		2m00s
> > machine B:	4m04s		4m05s
> > 
> > Create rates:
> > 		unpatched	patched
> > machine A:	282k+/-31k	468k+/-21k
> > machine B:	231k+/-8k	233k+/-11k
> > 
> > Concurrent rm of same 50 million inodes:
> > 
> > 		unpatched	patched
> > machine A:	6m42s		2m33s
> > machine B:	4m47s		4m47s
> > 
> > The transaction rate on the fast machine went from just under
> > 300k/sec to 700k/sec, which indicates just how much of a bottleneck
> > this atomic counter was.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> /me kinda wonders why removing the counter entirely has so little effect
> on machine B, but seeing as I've been pondering killing this counter
> myself for years,

Because the transaction rate on machine B isn't high enough for the
cacheline bouncing to become the limiting factor.

Don't forget that the impact of cacheline bouncing is strongly
non-linear - there is a very small window where it goes from "none at
all" to "all over the machine", and these two machines sit on either
side of that threshold. i.e. the older machine is too slow to hit that
threshold, the newer machine hits it easily.
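
Not from the thread, but the threshold effect described above can be
reproduced with a minimal userspace sketch (all names invented;
assumes 64-byte cachelines; compile with -pthread). Each thread does
10M increments, either on one shared atomic - so every increment
drags the cacheline across cores - or on its own cacheline-aligned
counter, which is the percpu-counter idea and generates no cross-CPU
traffic at all:

/* sketch.c - illustration only, not kernel code */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS	8
#define ITERS		(10 * 1000 * 1000L)

static atomic_long shared_counter;	/* one contended cacheline */

static struct {
	_Alignas(64) long val;		/* one cacheline per thread */
} counters[NTHREADS];

static void *bounce(void *arg)		/* shared atomic: bounces */
{
	for (long i = 0; i < ITERS; i++)
		atomic_fetch_add(&shared_counter, 1);
	return NULL;
}

static void *local(void *arg)		/* per-thread: no bouncing */
{
	long *val = arg;

	for (long i = 0; i < ITERS; i++)
		(*val)++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t[NTHREADS];
	long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, argc > 1 ? local : bounce,
			       &counters[i].val);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NTHREADS; i++)
		sum += counters[i].val;
	printf("total %ld\n", sum + atomic_load(&shared_counter));
	return 0;
}

Comparing "time ./sketch" against "time ./sketch local" while varying
NTHREADS should show the same knee: the two stay close at low thread
counts, then the shared-atomic version falls off a cliff.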

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  8:14   ` Christoph Hellwig
@ 2020-05-12  9:11     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2020-05-12  9:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, May 12, 2020 at 01:14:23AM -0700, Christoph Hellwig wrote:
> I'm surprised the difference is that big.  The moves look obviously
> fine, but why do you put the precalculated fields in the middle
> of struct xfs_mount instead of at the end?

Just because it was the minimum to move around, and it clearly
demonstrated the damage the contended cacheline was doing to the
performance of code accessing read-only variables.

Really, there's a lot in the xfs_mount that might well be read only
that I didn't consider. I'm thinking that most of the pointers to
structures are read only (e.g. the log, ailp, buftargs, etc) as they
do not change for the life of the structure. I don't really have the
time to dig into this real deep (this is a quick, interesting
diversion!) so I did the 99-percenter and moved on...
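
For illustration (invented names, not the XFS code; the
____cacheline_aligned annotation comes from <linux/cache.h>), the
shape of the split looks like this: the first pre-calculated field is
annotated so the read-only group starts on a fresh cacheline, and the
first write-hot field is annotated so the contended state cannot
share a line with the tail of that group:

struct example_mount {
	/* pointers set up at mount; effectively read-only after that */
	struct super_block	*sb;
	struct xlog		*log;

	/* pre-calculated read-only constants on their own cachelines */
	int ____cacheline_aligned blocksize_log;
	uint8_t			sectbb_log;
	uint			blockmask;

	/* frequently written state starts a new cacheline */
	atomic_t ____cacheline_aligned active;
	spinlock_t		lock;
};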

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  2:59 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
@ 2020-05-12  8:14   ` Christoph Hellwig
  2020-05-12  9:11     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2020-05-12  8:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

I'm surprised the difference is that big.  The moves look obviously
fine, but why do you put the precalculated fields in the middle
of struct xfs_mount instead of at the end?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount
  2020-05-12  2:59 [PATCH 0/2] xfs: fix a couple of performance issues Dave Chinner
@ 2020-05-12  2:59 ` Dave Chinner
  2020-05-12  8:14   ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2020-05-12  2:59 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Seeing massive CPU usage from xfs_agino_range() on one machine;
instruction level profiles look similar to another machine running
the same workload, but one machine is consuming 10x as much CPU as
the other and going much slower. The only real difference between
the two machines is core count per socket. Both are running
identical 16p/16GB virtual machine configurations.

Machine A:

  25.83%  [k] xfs_agino_range
  12.68%  [k] __xfs_dir3_data_check
   6.95%  [k] xfs_verify_ino
   6.78%  [k] xfs_dir2_data_entry_tag_p
   3.56%  [k] xfs_buf_find
   2.31%  [k] xfs_verify_dir_ino
   2.02%  [k] xfs_dabuf_map.constprop.0
   1.65%  [k] xfs_ag_block_count

And takes around 13 minutes to remove 50 million inodes.

Machine B:

  13.90%  [k] __pv_queued_spin_lock_slowpath
   3.76%  [k] do_raw_spin_lock
   2.83%  [k] xfs_dir3_leaf_check_int
   2.75%  [k] xfs_agino_range
   2.51%  [k] __raw_callee_save___pv_queued_spin_unlock
   2.18%  [k] __xfs_dir3_data_check
   2.02%  [k] xfs_log_commit_cil

And takes around 5m30s to remove 50 million inodes.

The suspect is cacheline contention on m_sectbb_log, which is used in
one of the macros in xfs_agino_range(). This is a read-only variable,
but it shares a cacheline with m_active_trans, a global atomic that
gets bounced all around the machine.

The workload is trying to run hundreds of thousands of transactions
per second, so cacheline contention will be occurring on this atomic
counter. xfs_agino_range() is likely just an innocent bystander as
the cache coherency protocol fights over the cacheline between CPU
cores and sockets.

On machine A, this rearrangement of the struct xfs_mount
results in the profile changing to:

   9.77%  [kernel]  [k] xfs_agino_range
   6.27%  [kernel]  [k] __xfs_dir3_data_check
   5.31%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   4.54%  [kernel]  [k] xfs_buf_find
   3.79%  [kernel]  [k] do_raw_spin_lock
   3.39%  [kernel]  [k] xfs_verify_ino
   2.73%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock

Vastly less CPU usage in xfs_agino_range(), but still 3x that of
machine B, and the workload still runs substantially slower than it
should.

Current rm -rf of 50 million files:

		vanilla		patched
machine A	13m20s		8m30s
machine B	5m30s		5m02s

It's an improvement, indicating that separation and further
optimisation of read-only global filesystem data is worthwhile, but
it clearly isn't the underlying issue causing this specific
performance degradation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.h | 50 +++++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a15792792..712b3e2583316 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -88,21 +88,12 @@ typedef struct xfs_mount {
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_rtname;	/* realtime device name */
 	char			*m_logname;	/* external log device name */
-	int			m_bsize;	/* fs logical block size */
 	xfs_agnumber_t		m_agfrotor;	/* last ag where space found */
 	xfs_agnumber_t		m_agirotor;	/* last ag dir inode alloced */
 	spinlock_t		m_agirotor_lock;/* .. and lock protecting it */
-	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
-	uint			m_allocsize_log;/* min write size log bytes */
-	uint			m_allocsize_blocks; /* min write size blocks */
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
-	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
-	int			m_logbufs;	/* number of log buffers */
-	int			m_logbsize;	/* size of each log buffer */
-	uint			m_rsumlevels;	/* rt summary levels */
-	uint			m_rsumsize;	/* size of rt summary, bytes */
 	/*
 	 * Optional cache of rt summary level per bitmap block with the
 	 * invariant that m_rsum_cache[bbno] <= the minimum i for which
@@ -117,9 +108,15 @@ typedef struct xfs_mount {
 	xfs_buftarg_t		*m_ddev_targp;	/* saves taking the address */
 	xfs_buftarg_t		*m_logdev_targp;/* ptr to log device */
 	xfs_buftarg_t		*m_rtdev_targp;	/* ptr to rt device */
+
+	/*
+	 * Read-only variables that are pre-calculated at mount time.
+	 */
+	int ____cacheline_aligned m_bsize;	/* fs logical block size */
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
+	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -138,20 +135,35 @@ typedef struct xfs_mount {
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */
-	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
-	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
-	struct mutex		m_growlock;	/* growfs mutex */
+	int			m_dalign;	/* stripe unit */
+	int			m_swidth;	/* stripe width */
+	xfs_agnumber_t		m_maxagi;	/* highest inode alloc group */
+	uint			m_allocsize_log;/* min write size log bytes */
+	uint			m_allocsize_blocks; /* min write size blocks */
+	int			m_logbufs;	/* number of log buffers */
+	int			m_logbsize;	/* size of each log buffer */
+	uint			m_rsumlevels;	/* rt summary levels */
+	uint			m_rsumsize;	/* size of rt summary, bytes */
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
-	uint64_t		m_flags;	/* global mount flags */
-	bool			m_finobt_nores; /* no per-AG finobt resv. */
 	uint			m_qflags;	/* quota status flags */
+	uint64_t		m_flags;	/* global mount flags */
+	int64_t			m_low_space[XFS_LOWSP_MAX];
+						/* low free space thresholds */
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
+	bool			m_always_cow;
+	bool			m_fail_unmount;
+	bool			m_finobt_nores; /* no per-AG finobt resv. */
+	/*
+	 * End of pre-calculated read-only variables
+	 */
+
+	struct radix_tree_root	m_perag_tree;	/* per-ag accounting info */
+	spinlock_t		m_perag_lock;	/* lock for m_perag_tree */
+	struct mutex		m_growlock;	/* growfs mutex */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
-	int			m_dalign;	/* stripe unit */
-	int			m_swidth;	/* stripe width */
-	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	atomic_t		m_active_trans;	/* number trans frozen */
 	struct xfs_mru_cache	*m_filestream;  /* per-mount filestream data */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
@@ -160,8 +172,6 @@ typedef struct xfs_mount {
 	struct delayed_work	m_cowblocks_work; /* background cow blocks
 						     trimming */
 	bool			m_update_sb;	/* sb needs update in mount */
-	int64_t			m_low_space[XFS_LOWSP_MAX];
-						/* low free space thresholds */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
@@ -191,8 +201,6 @@ typedef struct xfs_mount {
 	 */
 	uint32_t		m_generation;
 
-	bool			m_always_cow;
-	bool			m_fail_unmount;
 #ifdef DEBUG
 	/*
 	 * Frequency with which errors are injected.  Replaces xfs_etest; the
-- 
2.26.1.301.g55bc3eb7cb9
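
Two quick ways to sanity-check a layout like this, neither part of
the patch: "pahole -C xfs_mount" on the built object prints every
field offset with cacheline boundaries marked, or the expectation can
be written as a build-time assertion. A sketch, assuming 64-byte
lines and a kernel recent enough to have static_assert in
<linux/build_bug.h>; it only makes sense between patch 1 and patch 2,
since the latter removes m_active_trans entirely:

#define CACHELINE	64
static_assert(offsetof(struct xfs_mount, m_sectbb_log) / CACHELINE !=
	      offsetof(struct xfs_mount, m_active_trans) / CACHELINE,
	      "m_sectbb_log shares a cacheline with m_active_trans");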


^ permalink raw reply related	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-05-20 21:51 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-19 22:23 [PATCH 0/2 v3] xfs: improve transaction rate scalability Dave Chinner
2020-05-19 22:23 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
2020-05-20  6:57   ` Christoph Hellwig
2020-05-20  7:12     ` Dave Chinner
2020-05-20  9:43   ` Chaitanya Kulkarni
2020-05-20 20:44   ` Darrick J. Wong
2020-05-19 22:23 ` [PATCH 2/2] xfs: remove the m_active_trans counter Dave Chinner
2020-05-20  7:01   ` Christoph Hellwig
2020-05-20  7:13     ` Dave Chinner
2020-05-20  7:17       ` Christoph Hellwig
2020-05-20 20:47   ` Darrick J. Wong
2020-05-20 21:51     ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2020-05-12  2:59 [PATCH 0/2] xfs: fix a couple of performance issues Dave Chinner
2020-05-12  2:59 ` [PATCH 1/2] xfs: separate read-only variables in struct xfs_mount Dave Chinner
2020-05-12  8:14   ` Christoph Hellwig
2020-05-12  9:11     ` Dave Chinner
