* [PATCH 00/39 v4] xfs: CIL and log optimisations
@ 2021-05-19 12:12 Dave Chinner
  2021-05-19 12:12 ` [PATCH 01/39] xfs: log stripe roundoff is a property of the log Dave Chinner
                   ` (38 more replies)
  0 siblings, 39 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs


Hi folks,

This is an update of the consolidated log scalability patchset I've been working
on. Version 3 was posted here:

https://lore.kernel.org/linux-xfs/20210305051143.182133-1-david@fromorbit.com/

This version addresses many of the review comments, fixes a couple of bugs in
the CIL scalability series found by shutdown/log recovery tests, and has a heap
more testing time run on it.

Performance improvements are largely documented in the change logs of the
individual patches. Headline numbers are an increase in transaction rate from
700k commits/s to 1.7M commits/s, and a reduction in fua/flush operations by
2-3 orders of magnitude on metadata heavy workloads that don't use fsync.

Summary of series:

Patches		Modifications
-------		-------------
1-7:		log write FUA/FLUSH optimisations
8:		bug fix
9-11:		Async CIL pushes
12-25:		xlog_write() rework
26-39:		CIL commit scalability

Change log is below.

Cheers,

Dave.


Version 4:
- rebase on 5.13-rc2+
- fixed completion logic for async cache flush
- trimmed superfluous comments about not requiring REQ_PREFLUSH for iclog IO
  anymore.
- ensure that setting/clearing XLOG_ICL_NEED_FLUSH is atomic (i.e. only modified
  while holding the icloglock)
- ensure callers only add iclog flush/fua flags appropriately before releasing
  the iclog so that multiple independent writes to an iclog don't clear flags
  other writes into the iclog depend on.
- buffer log item dirty tracking patches merged so removed from series
- replaced XFS_LSN_CMP() checks with direct lsn1 == lsn2 comparisons to simplify
  the code
- changed "push_async" to "push_commit_stable" to indicate that the push caller
  wants the entire CIL checkpoint and commit record to be on stable storage when
  it completes.
- updated comment to indicate that iclog sync state is set according to the
  caller's desire for a stable checkpoint to be performed.
- Added comment explaining why the CIL workqueue is limited to 4 concurrent
  works per filesystem.
- debug overhead reduction patches merged so removed from series
- cleaned up a couple of typedef uses.
- updated pahole output for checkpoint header in commit message
- Added BUILD_BUG_ON() to check the size of unmount records.
- got rid of XFS_VOLUME define.
- got rid of XLOG_TIC_LEN_MAX define.
- cleaned up extra blank lines.
- fixed double initialisation of lv in xlog_write_single().
- removed the unnecessary change for reserved iclog space in
  xlog_state_get_iclog_space().
- no need to check for XLOG_CONTINUE_TRANS in xlog_write_partial() as it will
  always be set.
- added a patch for removing the optype parameter from xlog_write()
- removed unused nvecs from struct xfs_cil_ctx
- fixed whitespace damage in xlog_cil_pcp_dead()
- added missing cpu dead accounting transfer in xlog_cil_pcp_dead().
- factored out CIL push percpu structure aggregation into
  xlog_cil_pcp_aggregate()
- added the ctx->ticket->t_unit_res update back into the code even though it is
  largely unnecessary.
- cleaned up the pcp, cilpcp, pcptr mess in xlog_cil_pcp_alloc() and elsewhere
  to use variable names consistently.
- simplified the CIL sort comparison functions to a single comparison operation
- fixed percpu CIL item list sort order where items in the same transaction
  (order id) were reversed, leading to intents being replayed in the wrong
  order.
- split out log vector chain conversion to list_head into separate patch
- Updated documentation with all the fixes and suggestions made.


Version 3:
- rebase onto 5.12-rc1+
- aggregate many small dependent patchsets in one large one.
- simplify xlog_wait_on_iclog_lsn() back to just a call to xlog_wait_on_iclog()
- remove xfs_blkdev_issue_flush() instead of moving and renaming it.
- pass bio to xfs_flush_bdev_async() so it doesn't need allocation.
- skip cache flush in xfs_flush_bdev_async() if the underlying queue does not
  require it.
- fixed whitespace in xfs_flush_bdev_async()
- remove the implicit external log's data device cache flush code and replace it
  with an explicit flush in the unmount record write so that it works the same
  as the new CIL checkpoint cache pre-flush mechanism. This mechanism now
  guarantees metadata vs journal ordering for both internal and external logs.
- updated various commit messages
- fixed incorrect/unintended changes to xfs_log_force() behaviour
- typedef uint64_t xfs_csn_t; and conversion.
- removed stray trace_printk()s that were used for debugging.
- fixed minor formatting details.
- uninlined xlog_prepare_iovec()
- fixed up "lv chain vector and size calculation" commit message to reflect we
  are only calculating and passing in the vector byte count.
- reworked the loop in xlog_write_single() based on Christoph's suggestion. Much
  cleaner!
- added patch to pass log ticket down to xlog_sync() so that it accounts the
  roundoff to the log ticket rather than directly modifying grant heads. Grant
  heads are hot, so every little bit helps.
- added patch to update delayed logging design doc with background material on
  how transactions and log space accounting works in XFS.

Version 2:
- fix ticket reservation roundoff to include 2 roundoffs
- removed stale copied comment from roundoff initialisation.
- clarified "separation" to mean "separation for ordering purposes" in commit
  message.
- added comment that newly activated, clean, empty iclogs have a LSN of 0 so are
  captured by the "iclog lsn < start_lsn" case that avoids needing to wait
  before releasing the commit iclog to be written.
- added async cache flush infrastructure
- converted the CIL checkpoint push work to issue an unconditional metadata device
  cache flush rather than asking the first iclog write to issue it via
  REQ_PREFLUSH.
- cleaned up xlog_write() to remove a redundant parameter and prepare the logic
  for setting flags on the iclog based on the type of operational data being
  written to the log.
- added XLOG_ICL_NEED_FUA flag to complement the NEED_FLUSH flag, allowing
  callers to issue explicit flushes and clear the NEED_FLUSH flag before the
  iclog is written without dropping the REQ_FUA requirement.
- added CIL commit-in-start-iclog optimisation that clears the NEED_FLUSH flag
  to avoid an unnecessary cache flush when issuing the iclog.
- fixed typo in CIL throttle bugfix comment.
- fixed trailing whitespace in commit message.




* [PATCH 01/39] xfs: log stripe roundoff is a property of the log
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-28  0:54   ` Allison Henderson
  2021-05-19 12:12 ` [PATCH 02/39] xfs: separate CIL commit record IO Dave Chinner
                   ` (37 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We don't need to look at the xfs_mount and superblock every time we
need to do an iclog roundoff calculation. The property is fixed for
the life of the log, so store the roundoff in the log at mount time
and use that everywhere.

On a debug build:

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
  27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched
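
For illustration only, a minimal userspace sketch of the idea (the struct,
helper and values below are assumptions, not the kernel code): compute the
padding unit once at init time, after which every roundoff is a single
roundup against the cached value:

#include <stdio.h>
#include <stdint.h>

#define BBSIZE	512

/* round x up to the next multiple of unit (unit need not be a power of 2) */
static uint32_t roundup_u32(uint32_t x, uint32_t unit)
{
	return ((x + unit - 1) / unit) * unit;
}

struct fake_log {
	uint32_t	iclog_roundoff;	/* fixed for the life of the log */
};

static void fake_log_init(struct fake_log *log, int logv2, uint32_t sunit)
{
	log->iclog_roundoff = (logv2 && sunit > 1) ? sunit : BBSIZE;
}

int main(void)
{
	struct fake_log	log;
	uint32_t	count_init = 40000;	/* LR header + iclog payload */

	fake_log_init(&log, 1, 32768);		/* v2 log, 32k stripe unit */

	uint32_t count = roundup_u32(count_init, log.iclog_roundoff);
	printf("write %u bytes, %u bytes of padding\n",
			count, count - count_init);
	return 0;
}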

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_log_format.h |  3 --
 fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h          |  2 ++
 3 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 3e15ea29fb8d..d548ea4b6aab 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
 #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
 #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
 #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
-#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
-                                 (log)->l_mp->m_sb.sb_logsunit)
-#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
 
 #define XLOG_HEADER_SIZE	512
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c19a82adea1e..0e563ff8cd3b 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1401,6 +1401,11 @@ xlog_alloc_log(
 	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
 	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
 
+	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
+		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
+	else
+		log->l_iclog_roundoff = BBSIZE;
+
 	xlog_grant_head_init(&log->l_reserve_head);
 	xlog_grant_head_init(&log->l_write_head);
 
@@ -1854,29 +1859,15 @@ xlog_calc_iclog_size(
 	uint32_t		*roundoff)
 {
 	uint32_t		count_init, count;
-	bool			use_lsunit;
-
-	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-			log->l_mp->m_sb.sb_logsunit > 1;
 
 	/* Add for LR header */
 	count_init = log->l_iclog_hsize + iclog->ic_offset;
+	count = roundup(count_init, log->l_iclog_roundoff);
 
-	/* Round out the log write size */
-	if (use_lsunit) {
-		/* we have a v2 stripe unit to use */
-		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
-	} else {
-		count = BBTOB(BTOBB(count_init));
-	}
-
-	ASSERT(count >= count_init);
 	*roundoff = count - count_init;
 
-	if (use_lsunit)
-		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
-	else
-		ASSERT(*roundoff < BBTOB(1));
+	ASSERT(count >= count_init);
+	ASSERT(*roundoff < log->l_iclog_roundoff);
 	return count;
 }
 
@@ -3151,10 +3142,9 @@ xlog_state_switch_iclogs(
 	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
 
 	/* Round up to next log-sunit */
-	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-	    log->l_mp->m_sb.sb_logsunit > 1) {
-		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
-		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
+	if (log->l_iclog_roundoff > BBSIZE) {
+		log->l_curr_block = roundup(log->l_curr_block,
+						BTOBB(log->l_iclog_roundoff));
 	}
 
 	if (log->l_curr_block >= log->l_logBBsize) {
@@ -3406,12 +3396,11 @@ xfs_log_ticket_get(
  * Figure out the total log space unit (in bytes) that would be
  * required for a log ticket.
  */
-int
-xfs_log_calc_unit_res(
-	struct xfs_mount	*mp,
+static int
+xlog_calc_unit_res(
+	struct xlog		*log,
 	int			unit_bytes)
 {
-	struct xlog		*log = mp->m_log;
 	int			iclog_space;
 	uint			num_headers;
 
@@ -3487,18 +3476,20 @@ xfs_log_calc_unit_res(
 	/* for commit-rec LR header - note: padding will subsume the ophdr */
 	unit_bytes += log->l_iclog_hsize;
 
-	/* for roundoff padding for transaction data and one for commit record */
-	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
-		/* log su roundoff */
-		unit_bytes += 2 * mp->m_sb.sb_logsunit;
-	} else {
-		/* BB roundoff */
-		unit_bytes += 2 * BBSIZE;
-        }
+	/* roundoff padding for transaction data and one for commit record */
+	unit_bytes += 2 * log->l_iclog_roundoff;
 
 	return unit_bytes;
 }
 
+int
+xfs_log_calc_unit_res(
+	struct xfs_mount	*mp,
+	int			unit_bytes)
+{
+	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+}
+
 /*
  * Allocate and initialise a new log ticket.
  */
@@ -3515,7 +3506,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1c6fdbf3d506..037950cf1061 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -436,6 +436,8 @@ struct xlog {
 #endif
 	/* log recovery lsn tracking (for buffer submission */
 	xfs_lsn_t		l_recovery_lsn;
+
+	uint32_t		l_iclog_roundoff;/* padding roundoff */
 };
 
 #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
-- 
2.31.1



* [PATCH 02/39] xfs: separate CIL commit record IO
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
  2021-05-19 12:12 ` [PATCH 01/39] xfs: log stripe roundoff is a property of the log Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-28  0:54   ` Allison Henderson
  2021-05-19 12:12 ` [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush Dave Chinner
                   ` (36 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To allow for iclog IO device cache flush behaviour to be optimised,
we first need to separate out the commit record iclog IO from the
rest of the checkpoint so we can wait for the checkpoint IO to
complete before we issue the commit record.

This separation is only necessary if the commit record is being
written into a different iclog to the start of the checkpoint as the
upcoming cache flushing changes require completion ordering against
the other iclogs submitted by the checkpoint.

If the entire checkpoint and commit is in the one iclog, then they
are both covered by the one set of cache flush primitives on the
iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete
so they are ordered correctly and made stable by the REQ_PREFLUSH
that the commit record iclog IO issues. This guarantees that if a
reader sees the commit record in the journal, they will also see the
entire checkpoint that commit record closes off.

This also provides the guarantee that when the commit record IO
completes, we can safely unpin all the log items in the checkpoint
so they can be written back because the entire checkpoint is stable
in the journal.
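
As a side note, the "same iclog" test reduces to comparing the checkpoint
start and commit LSNs, since each iclog carries a distinct header LSN. A
trivial userspace sketch of that condition (names are assumptions):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long lsn_t;

/* the commit record shares an iclog with the checkpoint start iff the
 * two LSNs are equal; otherwise prior iclogs must be waited on */
static bool must_wait_for_prior_iclogs(lsn_t start_lsn, lsn_t commit_lsn)
{
	return start_lsn != commit_lsn;
}

int main(void)
{
	printf("%d\n", must_wait_for_prior_iclogs(100, 100)); /* 0: same iclog */
	printf("%d\n", must_wait_for_prior_iclogs(100, 104)); /* 1: must wait */
	return 0;
}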

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log.c      | 8 +++++---
 fs/xfs/xfs_log_cil.c  | 9 +++++++++
 fs/xfs/xfs_log_priv.h | 2 ++
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 0e563ff8cd3b..4cd5840e953a 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -786,10 +786,12 @@ xfs_log_mount_cancel(
 }
 
 /*
- * Wait for the iclog to be written disk, or return an error if the log has been
- * shut down.
+ * Wait for the iclog and all prior iclogs to be written to disk as required by the
+ * log force state machine. Waiting on ic_force_wait ensures iclog completions
+ * have been ordered and callbacks run before we are woken here, hence
+ * guaranteeing that all the iclogs up to this one are on stable storage.
  */
-static int
+int
 xlog_wait_on_iclog(
 	struct xlog_in_core	*iclog)
 		__releases(iclog->ic_log->l_icloglock)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..1e5fd6f268c2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -870,6 +870,15 @@ xlog_cil_push_work(
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
+	/*
+	 * If the checkpoint spans multiple iclogs, wait for all previous
+	 * iclogs to complete before we submit the commit_iclog.
+	 */
+	if (ctx->start_lsn != commit_lsn) {
+		spin_lock(&log->l_icloglock);
+		xlog_wait_on_iclog(commit_iclog->ic_prev);
+	}
+
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
 	return;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 037950cf1061..ee7786b33da9 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -584,6 +584,8 @@ xlog_wait(
 	remove_wait_queue(wq, &wait);
 }
 
+int xlog_wait_on_iclog(struct xlog_in_core *iclog);
+
 /*
  * The LSN is valid so long as it is behind the current LSN. If it isn't, this
  * means that the next log record that includes this metadata could have a
-- 
2.31.1



* [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
  2021-05-19 12:12 ` [PATCH 01/39] xfs: log stripe roundoff is a property of the log Dave Chinner
  2021-05-19 12:12 ` [PATCH 02/39] xfs: separate CIL commit record IO Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-28  0:54   ` Allison Henderson
  2021-05-19 12:12 ` [PATCH 04/39] xfs: async blkdev cache flush Dave Chinner
                   ` (35 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It's a one line wrapper around blkdev_issue_flush(). Just replace it
with direct calls to blkdev_issue_flush().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_buf.c   | 2 +-
 fs/xfs/xfs_file.c  | 6 +++---
 fs/xfs/xfs_log.c   | 2 +-
 fs/xfs/xfs_super.c | 7 -------
 fs/xfs/xfs_super.h | 1 -
 5 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a10d49facadf..ebfcba2e8a77 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1945,7 +1945,7 @@ xfs_free_buftarg(
 	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
-	xfs_blkdev_issue_flush(btp);
+	blkdev_issue_flush(btp->bt_bdev);
 
 	kmem_free(btp);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c068dcd414f4..e7e9af57e788 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -197,9 +197,9 @@ xfs_file_fsync(
 	 * inode size in case of an extending write.
 	 */
 	if (XFS_IS_REALTIME_INODE(ip))
-		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
+		blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
 
 	/*
 	 * Any inode that has dirty modifications in the log is pinned.  The
@@ -219,7 +219,7 @@ xfs_file_fsync(
 	 */
 	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
 	    mp->m_logdev_targp == mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
 
 	return error;
 }
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 4cd5840e953a..969eebbf3f64 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1964,7 +1964,7 @@ xlog_sync(
 	 * layer state machine for preflushes.
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
-		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
+		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 		need_flush = false;
 	}
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 688309dbe18b..e339d1de2419 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -340,13 +340,6 @@ xfs_blkdev_put(
 		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 }
 
-void
-xfs_blkdev_issue_flush(
-	xfs_buftarg_t		*buftarg)
-{
-	blkdev_issue_flush(buftarg->bt_bdev);
-}
-
 STATIC void
 xfs_close_devices(
 	struct xfs_mount	*mp)
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index d2b40dc60dfc..167d23f92ffe 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -87,7 +87,6 @@ struct xfs_buftarg;
 struct block_device;
 
 extern void xfs_flush_inodes(struct xfs_mount *mp);
-extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
 extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
 					   xfs_agnumber_t agcount);
 
-- 
2.31.1



* [PATCH 04/39] xfs: async blkdev cache flush
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (2 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-20 23:53   ` Darrick J. Wong
  2021-05-28  0:54   ` Allison Henderson
  2021-05-19 12:12 ` [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
                   ` (34 subsequent siblings)
  38 siblings, 2 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The new checkpoint cache flush mechanism requires us to issue an
unconditional cache flush before we start a new checkpoint. We don't
want to block for this if we can help it, and we have a fair chunk
of CPU work to do between starting the checkpoint and issuing the
first journal IO.

Hence it makes sense to amortise the latency cost of the cache flush
by issuing it asynchronously and then waiting for it only when we
need to issue the first IO in the transaction.

To do this, we need async cache flush primitives to submit the cache
flush bio and to wait on it. The block layer has no such primitives
for filesystems, so roll our own for the moment.
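
To make the submit-early/wait-late pattern concrete, here is a
self-contained userspace analogue using pthreads in place of the kernel's
completion and bio machinery (everything below is an illustrative
stand-in, not the kernel API):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* poor man's struct completion */
struct completion {
	pthread_mutex_t	lock;
	pthread_cond_t	cond;
	int		done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

/* stands in for the flush bio completing in interrupt context */
static void *cache_flush(void *arg)
{
	usleep(1000);			/* device flushing its cache */
	complete(arg);
	return NULL;
}

int main(void)
{
	struct completion	flush_done;
	pthread_t		t;

	init_completion(&flush_done);
	pthread_create(&t, NULL, cache_flush, &flush_done);	/* submit early */

	puts("formatting checkpoint...");	/* CPU work overlaps the flush */

	wait_for_completion(&flush_done);	/* wait only before first IO */
	puts("issuing first iclog write");
	pthread_join(t, NULL);
	return 0;
}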

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 35 +++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_linux.h  |  2 ++
 2 files changed, 37 insertions(+)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index 17f36db2f792..de727532e137 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,6 +9,41 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return bio_max_segs(howmany(count, PAGE_SIZE));
 }
 
+static void
+xfs_flush_bdev_async_endio(
+	struct bio	*bio)
+{
+	complete(bio->bi_private);
+}
+
+/*
+ * Submit a request for an async cache flush to run. If the request queue does
+ * not require flush operations, just skip it altogether. If the caller needsi
+ * to wait for the flush completion at a later point in time, they must supply a
+ * valid completion. This will be signalled when the flush completes.  The
+ * caller never sees the bio that is issued here.
+ */
+void
+xfs_flush_bdev_async(
+	struct bio		*bio,
+	struct block_device	*bdev,
+	struct completion	*done)
+{
+	struct request_queue	*q = bdev->bd_disk->queue;
+
+	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
+		complete(done);
+		return;
+	}
+
+	bio_init(bio, NULL, 0);
+	bio_set_dev(bio, bdev);
+	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
+	bio->bi_private = done;
+	bio->bi_end_io = xfs_flush_bdev_async_endio;
+
+	submit_bio(bio);
+}
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 7688663b9773..c174262a074e 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
+void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
+		struct completion *done);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
-- 
2.31.1



* [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (3 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 04/39] xfs: async blkdev cache flush Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-28  0:54   ` Allison Henderson
  2021-05-19 12:12 ` [PATCH 06/39] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
                   ` (33 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

These rules apply to the atomic transactions recorded in the
journal, not to the journal IO itself. Hence we need to ensure
metadata is stable before we start writing a new transaction to the
journal (guarantee #1), and we need to ensure the entire transaction
is stable in the journal before we start metadata writeback
(guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
being added to every iclog IO. This causes the journal IO to issue a
cache flush and wait for it to complete before issuing the write IO
to the journal. Hence all completed metadata IO is guaranteed to be
stable before the journal overwrites the old metadata.

However, for long running CIL checkpoints that might do a thousand
journal IOs, we don't need every single one of these iclog IOs to
issue a cache flush - the cache flush done before the first iclog is
submitted is sufficient to cover the entire range in the log that
the checkpoint will overwrite because the CIL space reservation
guarantees the tail of the log (completed metadata) is already
beyond the range of the checkpoint write.

Hence we only need a full cache flush between closing off the CIL
checkpoint context (i.e. when the push switches it out) and issuing
the first journal IO. Rather than plumbing this through to the
journal IO, we can start this cache flush the moment the CIL context
is owned exclusively by the push worker. The cache flush can be in
progress while we process the CIL ready for writing, hence
reducing the latency of the initial iclog write. This is especially
true for large checkpoints, where we might have to process hundreds
of thousands of log vectors before we issue the first iclog write.
In these cases, it is likely the cache flush has already been
completed by the time we have built the CIL log vector chain.
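
As a back-of-envelope illustration of why this matters (pure arithmetic,
assuming the end state of this series where only the initial pre-flush
and the commit record flush remain):

#include <stdio.h>

int main(void)
{
	/* old scheme: REQ_PREFLUSH on every iclog write in a checkpoint;
	 * new scheme: one flush up front plus one on the commit record,
	 * no matter how many iclogs the checkpoint spans. */
	for (int iclogs = 1; iclogs <= 1000; iclogs *= 10)
		printf("%4d iclog writes: %4d flushes -> 2 flushes\n",
				iclogs, iclogs);
	return 0;
}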

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 1e5fd6f268c2..7b8b7ac85ea9 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -656,6 +656,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	struct bio		bio;
+	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -719,10 +721,19 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * The CIL is stable at this point - nothing new will be added to it
+	 * because we hold the flush lock exclusively. Hence we can now issue
+	 * a cache flush to ensure all the completed metadata in the journal we
+	 * are about to overwrite is on stable storage.
+	 */
+	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
+				&bdev_flush);
+
+	/*
+	 * Pull all the log vectors off the items in the CIL, and remove the
+	 * items from the CIL. We don't need the CIL lock here because it's only
+	 * needed on the transaction commit side which is currently locked out
+	 * by the flush lock.
 	 */
 	lv = NULL;
 	num_iovecs = 0;
@@ -806,6 +817,12 @@ xlog_cil_push_work(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
+	/*
+	 * Before we format and submit the first iclog, we have to ensure that
+	 * the metadata writeback ordering cache flush is complete.
+	 */
+	wait_for_completion(&bdev_flush);
+
 	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
 	if (error)
 		goto out_abort_free_ticket;
-- 
2.31.1



* [PATCH 06/39] xfs: remove need_start_rec parameter from xlog_write()
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (4 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 07/39] xfs: journal IO cache flush reductions Dave Chinner
                   ` (32 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The CIL push is the only call to xlog_write that sets this variable
to true. The other callers don't need a start rec, and they tell
xlog_write what to do by passing the type of ophdr they need written
in the flags field. The need_start_rec parameter essentially tells
xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
so get rid of the variable to do this and pass XLOG_START_TRANS as
the flag value into xlog_write() from the CIL push.

$ size fs/xfs/xfs_log.o*
  text	   data	    bss	    dec	    hex	filename
 27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
 27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
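
A simplified userspace sketch of the resulting control flow (the XLOG_*
names mirror the patch, everything else is an assumption, not
xlog_write() itself): the caller passes XLOG_START_TRANS and the writer
clears it after emitting the single extra start ophdr:

#include <stdio.h>

#define XLOG_START_TRANS	(1 << 0)
#define XLOG_COMMIT_TRANS	(1 << 1)
#define XLOG_UNMOUNT_TRANS	(1 << 2)

static void fake_xlog_write(unsigned optype)
{
	for (int region = 0; region < 3; region++) {
		if (optype & XLOG_START_TRANS) {
			puts("  write start-record ophdr");
			optype &= ~XLOG_START_TRANS;	/* first region only */
		}
		printf("  write region %d ophdr + data\n", region);
	}
}

int main(void)
{
	puts("CIL push:");
	fake_xlog_write(XLOG_START_TRANS);
	puts("commit record:");
	fake_xlog_write(XLOG_COMMIT_TRANS);
	return 0;
}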

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
 fs/xfs/xfs_log_cil.c  |  3 ++-
 fs/xfs/xfs_log_priv.h |  3 +--
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 969eebbf3f64..87870867d9fb 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -820,9 +820,7 @@ xlog_wait_on_iclog(
 static int
 xlog_write_unmount_record(
 	struct xlog		*log,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*lsn,
-	uint			flags)
+	struct xlog_ticket	*ticket)
 {
 	struct xfs_unmount_log_format ulf = {
 		.magic = XLOG_UNMOUNT_TYPE,
@@ -839,7 +837,7 @@ xlog_write_unmount_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
-	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -853,15 +851,13 @@ xlog_unmount_write(
 	struct xfs_mount	*mp = log->l_mp;
 	struct xlog_in_core	*iclog;
 	struct xlog_ticket	*tic = NULL;
-	xfs_lsn_t		lsn;
-	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
 
 	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
 	if (error)
 		goto out_err;
 
-	error = xlog_write_unmount_record(log, tic, &lsn, flags);
+	error = xlog_write_unmount_record(log, tic);
 	/*
 	 * At this point, we're umounting anyway, so there's no point in
 	 * transitioning log state to IOERROR. Just continue...
@@ -1553,8 +1549,7 @@ xlog_commit_record(
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
-			   false);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2151,13 +2146,16 @@ static int
 xlog_write_calc_vec_length(
 	struct xlog_ticket	*ticket,
 	struct xfs_log_vec	*log_vector,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = need_start_rec ? 1 : 0;
+	int			headers = 0;
 	int			len = 0;
 	int			i;
 
+	if (optype & XLOG_START_TRANS)
+		headers++;
+
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2377,8 +2375,7 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			flags,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
@@ -2406,8 +2403,9 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec);
-	*start_lsn = 0;
+	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
+	if (start_lsn)
+		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 		void		*ptr;
 		int		log_offset;
@@ -2421,7 +2419,7 @@ xlog_write(
 		ptr = iclog->ic_datap + log_offset;
 
 		/* start_lsn is the first lsn written to. That's all we need. */
-		if (!*start_lsn)
+		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
 		/*
@@ -2434,6 +2432,7 @@ xlog_write(
 			int			copy_len;
 			int			copy_off;
 			bool			ordered = false;
+			bool			wrote_start_rec = false;
 
 			/* ordered log vectors have no regions to write */
 			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
@@ -2451,13 +2450,15 @@ xlog_write(
 			 * write a start record. Only do this for the first
 			 * iclog we write to.
 			 */
-			if (need_start_rec) {
+			if (optype & XLOG_START_TRANS) {
 				xlog_write_start_rec(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 						sizeof(struct xlog_op_header));
+				optype &= ~XLOG_START_TRANS;
+				wrote_start_rec = true;
 			}
 
-			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
+			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
 			if (!ophdr)
 				return -EIO;
 
@@ -2488,14 +2489,13 @@ xlog_write(
 			}
 			copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
-			if (need_start_rec) {
+			if (wrote_start_rec) {
 				copy_len += sizeof(struct xlog_op_header);
 				record_cnt++;
-				need_start_rec = false;
 			}
 			data_cnt += contwr ? copy_len : 0;
 
-			error = xlog_write_copy_finish(log, iclog, flags,
+			error = xlog_write_copy_finish(log, iclog, optype,
 						       &record_cnt, &data_cnt,
 						       &partial_copy,
 						       &partial_copy_len,
@@ -2539,7 +2539,7 @@ xlog_write(
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
 	if (commit_iclog) {
-		ASSERT(flags & XLOG_COMMIT_TRANS);
+		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
 		error = xlog_state_release_iclog(log, iclog);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7b8b7ac85ea9..172bb3551d6b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -823,7 +823,8 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
+	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
+				XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index ee7786b33da9..56e1942c47df 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint flags,
-		bool need_start_rec);
+		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
-- 
2.31.1



* [PATCH 07/39] xfs: journal IO cache flush reductions
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (5 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 06/39] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:16   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 08/39] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
                   ` (31 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
causes the journal IO to issue a cache flush and wait for it to
complete before issuing the write IO to the journal. Hence all
completed metadata IO is guaranteed to be stable before the journal
overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which
ensures the journal writes do not complete until they are on stable
storage. Hence by the time the last journal IO in a checkpoint
completes, we know that the entire checkpoint is on stable storage
and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS
way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
("Add support for drive write cache flushing") in the xfs-archive
tree.

A lot has changed since then, most notably we now use delayed
logging to checkpoint the filesystem to the journal rather than
write each individual transaction to the journal. Cache flushes on
journal IO are necessary when individual transactions are wholly
contained within a single iclog. However, CIL checkpoints are single
transactions that typically span hundreds to thousands of individual
journal writes, and so the requirements for device cache flushing
have changed.

That is, the ordering rules I state above apply to ordering of
atomic transactions recorded in the journal, not to the journal IO
itself. Hence we need to ensure metadata is stable before we start
writing a new transaction to the journal (guarantee #1), and we need
to ensure the entire transaction is stable in the journal before we
start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a
new journal transaction to provide #1, and it is not on any other
journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts
writing to the log, so we no longer need the iclog IO to issue a
REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
to xlog_write(), we no longer need to mark the first iclog in
the log write with REQ_PREFLUSH for this case. As an added bonus,
this ordering mechanism works for both internal and external logs,
meaning we can remove the explicit data device cache flushes from
the iclog write code when using external logs.

Given the new ordering semantics of commit records for the CIL, we
need iclogs containing commit records to issue a REQ_PREFLUSH. We
also require unmount records to do this. Hence for both
XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
to mark the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them
immediately on stable storage, so we want to also mark the iclogs
that contain these records to be marked REQ_FUA. That means if a
record is split across multiple iclogs, they are all marked REQ_FUA
and not just the last one so that when the transaction is completed
all the parts of the record are on stable storage.

And for external logs, unmount records need a pre-write data device
cache flush similar to the CIL checkpoint cache pre-flush as the
internal iclog write code does not do this implicitly anymore.

As an optimisation, when the commit record lands in the same iclog
as the journal transaction starts, we don't need to wait for
anything and can simply use REQ_FUA to provide guarantee #2.  This
means that for fsync() heavy workloads, the cache flush behaviour is
completely unchanged and there is no degradation in performance as a
result of optimising the multi-IO transaction case.
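
Condensing the flag policy above into one decision (a simplified
userspace sketch; the REQ_* values here are illustrative, and the real
code sets XLOG_ICL_NEED_FLUSH/XLOG_ICL_NEED_FUA on the iclog, not the
bio directly):

#include <stdbool.h>
#include <stdio.h>

#define REQ_PREFLUSH	(1u << 0)
#define REQ_FUA		(1u << 1)

/* Simplified: unmount records always get both flags; mid-checkpoint
 * iclogs get neither, because the pre-flush at the start of the push
 * already ordered all prior metadata writeback. */
static unsigned commit_iclog_io_flags(bool same_iclog_as_start)
{
	unsigned opf = REQ_FUA;		/* record stable on completion (#2) */

	if (!same_iclog_as_start)
		opf |= REQ_PREFLUSH;	/* order the checkpoint's iclogs (#1) */
	return opf;
}

int main(void)
{
	printf("multi-iclog checkpoint commit: %#x\n",
			commit_iclog_io_flags(false));
	printf("single-iclog checkpoint:       %#x\n",
			commit_iclog_io_flags(true));
	return 0;
}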

The most notable sign that there is less IO latency on my test
machine (nvme SSDs) is that the "noiclogs" rate has dropped
substantially. This metric indicates that the CIL push is blocking
in xlog_get_iclog_space() waiting for iclog IO completion to occur.
With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
is blocking waiting for log IO. With the changes in this patch, this
drops to 1 noiclog event for every 100 iclog writes. Hence it is
clear that log IO is completing much faster than it was previously,
but it is also clear that for large iclog sizes, this isn't the
performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial
difference. With the cache flush modifications, the journal is now
running at over 4000 write IOPS, and the journal throughput is
largely identical to the 256kB iclogs and the noiclog event rate
stays low at about 1:50 iclog writes. The existing code tops out at
about 2500 IOPS as the number of cache flushes dominate performance
and latency. The noiclog event rate is about 1:4, and the
performance variance is quite large as the journal throughput can
fall to less than half the peak sustained rate when the cache flush
rate prevents metadata writeback from keeping up and the log runs
out of space and throttles reservations.

As a result:

	logbsize	fsmark create rate	rm -rf
before	32kb		152851+/-5.3e+04	5m28s
patched	32kb		221533+/-1.1e+04	5m24s

before	256kb		220239+/-6.2e+03	4m58s
patched	256kb		228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the
differences are largely noise. This workload is largely metadata
read IO latency bound and the changes to the journal cache flushing
doesn't really make any noticeable difference to behaviour apart from
a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/xfs_log.c      | 66 +++++++++++++++----------------------------
 fs/xfs/xfs_log.h      |  1 -
 fs/xfs/xfs_log_cil.c  | 18 +++++++++---
 fs/xfs/xfs_log_priv.h |  6 ++++
 4 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 87870867d9fb..b6145e4cb7bc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -513,7 +513,7 @@ __xlog_state_release_iclog(
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
  */
-static int
+int
 xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
@@ -533,23 +533,6 @@ xlog_state_release_iclog(
 	return 0;
 }
 
-void
-xfs_log_release_iclog(
-	struct xlog_in_core	*iclog)
-{
-	struct xlog		*log = iclog->ic_log;
-	bool			sync = false;
-
-	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
-		if (iclog->ic_state != XLOG_STATE_IOERROR)
-			sync = __xlog_state_release_iclog(log, iclog);
-		spin_unlock(&log->l_icloglock);
-	}
-
-	if (sync)
-		xlog_sync(log, iclog);
-}
-
 /*
  * Mount a log filesystem
  *
@@ -837,6 +820,14 @@ xlog_write_unmount_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
+
+	/*
+	 * For external log devices, we need to flush the data device cache
+	 * first to ensure all metadata writeback is on stable storage before we
+	 * stamp the tail LSN into the unmount record.
+	 */
+	if (log->l_targ != log->l_mp->m_ddev_targp)
+		blkdev_issue_flush(log->l_targ->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
@@ -874,6 +865,11 @@ xlog_unmount_write(
 	else
 		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 		       iclog->ic_state == XLOG_STATE_IOERROR);
+	/*
+	 * Ensure the journal is fully flushed and on stable storage once the
+	 * iclog containing the unmount record is written.
+	 */
+	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 	error = xlog_state_release_iclog(log, iclog);
 	xlog_wait_on_iclog(iclog);
 
@@ -1755,8 +1751,7 @@ xlog_write_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
 	uint64_t		bno,
-	unsigned int		count,
-	bool			need_flush)
+	unsigned int		count)
 {
 	ASSERT(bno < log->l_logBBsize);
 
@@ -1794,10 +1789,12 @@ xlog_write_iclog(
 	 * writeback throttle from throttling log writes behind background
 	 * metadata writeback and causing priority inversions.
 	 */
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
-				REQ_IDLE | REQ_FUA;
-	if (need_flush)
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
+		iclog->ic_bio.bi_opf |= REQ_FUA;
+	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -1900,7 +1897,6 @@ xlog_sync(
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
 	uint64_t		bno;
 	unsigned int		size;
-	bool			need_flush = true, split = false;
 
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
 
@@ -1925,10 +1921,8 @@ xlog_sync(
 	bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn));
 
 	/* Do we need to split this write into 2 parts? */
-	if (bno + BTOBB(count) > log->l_logBBsize) {
+	if (bno + BTOBB(count) > log->l_logBBsize)
 		xlog_split_iclog(log, &iclog->ic_header, bno, count);
-		split = true;
-	}
 
 	/* calculcate the checksum */
 	iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header,
@@ -1949,22 +1943,8 @@ xlog_sync(
 			 be64_to_cpu(iclog->ic_header.h_lsn));
 	}
 #endif
-
-	/*
-	 * Flush the data device before flushing the log to make sure all meta
-	 * data written back from the AIL actually made it to disk before
-	 * stamping the new log tail LSN into the log buffer.  For an external
-	 * log we need to issue the flush explicitly, and unfortunately
-	 * synchronously here; for an internal log we can simply use the block
-	 * layer state machine for preflushes.
-	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
-		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
-		need_flush = false;
-	}
-
 	xlog_verify_iclog(log, iclog, count);
-	xlog_write_iclog(log, iclog, bno, count, need_flush);
+	xlog_write_iclog(log, iclog, bno, count);
 }
 
 /*
@@ -2418,7 +2398,7 @@ xlog_write(
 		ASSERT(log_offset <= iclog->ic_size - 1);
 		ptr = iclog->ic_datap + log_offset;
 
-		/* start_lsn is the first lsn written to. That's all we need. */
+		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..99f9d6ed9598 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -117,7 +117,6 @@ void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
 void	  xfs_log_space_wake(struct xfs_mount *mp);
-void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
 int	  xfs_log_reserve(struct xfs_mount *mp,
 			  int		   length,
 			  int		   count,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 172bb3551d6b..9d2fa8464289 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -890,15 +890,25 @@ xlog_cil_push_work(
 
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog.
+	 * iclogs to complete before we submit the commit_iclog. In this case,
+	 * the commit_iclog write needs to issue a pre-flush so that the
+	 * ordering is correctly preserved down to stable storage.
 	 */
+	spin_lock(&log->l_icloglock);
 	if (ctx->start_lsn != commit_lsn) {
-		spin_lock(&log->l_icloglock);
 		xlog_wait_on_iclog(commit_iclog->ic_prev);
+		spin_lock(&log->l_icloglock);
+		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
-	/* release the hounds! */
-	xfs_log_release_iclog(commit_iclog);
+	/*
+	 * The commit iclog must be written to stable storage to guarantee
+	 * journal IO vs metadata writeback IO is correctly ordered on stable
+	 * storage.
+	 */
+	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
+	xlog_state_release_iclog(log, commit_iclog);
+	spin_unlock(&log->l_icloglock);
 	return;
 
 out_skip:
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 56e1942c47df..2203ccecafb6 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -133,6 +133,9 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
@@ -201,6 +204,7 @@ typedef struct xlog_in_core {
 	u32			ic_size;
 	u32			ic_offset;
 	enum xlog_iclog_state	ic_state;
+	unsigned int		ic_flags;
 	char			*ic_datap;	/* pointer to iclog data */
 
 	/* Callback structures need their own cacheline */
@@ -486,6 +490,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
+int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
+
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
  * change while we are cracking it into the component values. This means we
-- 
2.31.1



* [PATCH 08/39] xfs: Fix CIL throttle hang when CIL space used going backwards
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (6 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 07/39] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
                   ` (30 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A hang with tasks stuck on the CIL hard throttle was reported and
largely diagnosed by Donald Buczek, who discovered that it was a
result of the CIL context space usage decrementing in committed
transactions once the hard throttle limit had been hit and processes
were already blocked.  This resulted in the CIL push not waking up
those waiters because the CIL context was no longer over the hard
throttle limit.

The surprising aspect of this was the CIL space usage going
backwards regularly enough to trigger this situation. Assumptions
had been made in design that the relogging process would only
increase the size of the objects in the CIL, and so that space would
only increase.

This change and commit message fixes the issue and documents the
result of an audit of the triggers that can cause the CIL space to
go backwards, how large the backwards steps tend to be, the
frequency in which they occur, and what the impact on the CIL
accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only
do so if the log item is already logged to the CIL and contains a
space reservation for its entire logged state. This is tracked by
the shadow buffer state on the log item. If the item is not
previously logged in the CIL it has no shadow buffer nor log vector,
and hence the entire size of the logged item copied to the log
vector is accounted to the CIL space usage. i.e.  it will always go
up in this case.

If the item has a log vector (i.e. already in the CIL) and the size
decreases, then the existing log vector will be overwritten and the
space usage will go down. This is the only condition where the space
usage reduces, and it can only occur when an item is already tracked
in the CIL. Hence we are safe from CIL space usage underruns as a
result of log items decreasing in size when they are relogged.
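
In accounting terms, the argument above reduces to the following (a
simplified userspace model with an invented helper name, not the CIL
code):

#include <stdio.h>

/* delta applied to ctx->space_used when an item is (re)logged */
static long cil_space_delta(long existing_lv_bytes, long new_lv_bytes)
{
	if (existing_lv_bytes == 0)	/* not yet in the CIL */
		return new_lv_bytes;	/* usage can only go up */

	/* already in the CIL: the existing log vector carries the old
	 * reservation, so relogging a smaller item gives a negative
	 * delta - but never more than was previously accounted */
	return new_lv_bytes - existing_lv_bytes;
}

int main(void)
{
	printf("%ld\n", cil_space_delta(0, 4096));	/*  4096 */
	printf("%ld\n", cil_space_delta(8192, 4096));	/* -4096 */
	return 0;
}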

Typically this reduction in CIL usage occurs from metadata blocks
being freed, such as when a btree block merge occurs or a directory
entry/xattr entry is removed and the da-tree is reduced in size.
This generally results in a reduction in size of around a single
block in the CIL, but also tends to increase the number of log
vectors because the parent and sibling nodes in the tree need to be
updated when a btree block is removed. If a multi-level merge
occurs, then we see reduction in size of 2+ blocks, but again the
log vector count goes up.

The other vector is inode fork size changes, which only log the
current size of the fork and ignore the previously logged size when
the fork is relogged. Hence if we are removing items from the inode
fork (dir/xattr removal in shortform, extent record removal in
extent form, etc) the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a
fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
an intent actually creates a new intent log item and doesn't relog
the old item at all.) Hence the only two vectors for CIL context
size reduction are relogging inode forks and marking buffers active
in the CIL as stale.

Long story short: the majority of the code does the right thing and
handles the reduction in log item size correctly, and only the CIL
hard throttle implementation is problematic and needs fixing. This
patch makes that fix, as well as adds comments in the log item code
that result in items shrinking in size when they are relogged as a
clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it
goes further to ensure that once the throttle is activated, it
captures all tasks until the CIL push issues a wakeup, regardless of
whether the CIL space used has gone back under the throttle
threshold.

This ensures that we prevent tasks reducing the CIL slightly under
the throttle threshold and then making more changes that push it
well over the throttle limit. This is achieved by checking if the
throttle wait queue is already active as a condition of throttling.
Hence once we start throttling, we continue to apply the throttle
until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and
checks as they are all done under the ctx->xc_push_lock. Hence the
waitqueue has external serialisation and we can safely peek inside
the wait queue without holding the internal waitqueue locks.
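
The resulting throttle condition can be sketched like this (a userspace
model; the names and the threshold constant are assumptions):

#include <stdbool.h>
#include <stdio.h>

struct cil_ctx {
	long	space_used;
	bool	waiters_queued;		/* stands in for waitqueue_active() */
};

#define CIL_HARD_THROTTLE	1000

/* Throttle when over the limit, or when anyone is already queued, so a
 * transient dip below the limit cannot let new commits sneak past the
 * blocked ones before the push wakes the whole queue. */
static bool should_throttle(struct cil_ctx *ctx)
{
	return ctx->space_used >= CIL_HARD_THROTTLE || ctx->waiters_queued;
}

int main(void)
{
	struct cil_ctx ctx = { .space_used = 990, .waiters_queued = true };

	printf("throttle: %d\n", should_throttle(&ctx));  /* 1: stay throttled */
	return 0;
}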

Many thanks to Donald for his diagnostic and analysis work to
isolate the cause of this hang.

Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item.c   | 37 ++++++++++++++++++-------------------
 fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
 fs/xfs/xfs_log_cil.c    | 22 +++++++++++++++++-----
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index fb69879e4b2b..14d1fefcbf4c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -74,14 +74,12 @@ xfs_buf_item_straddle(
 }
 
 /*
- * This returns the number of log iovecs needed to log the
- * given buf log item.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item segment.
  *
- * It calculates this as 1 iovec for the buf log format structure
- * and 1 for each stretch of non-contiguous chunks to be logged.
- * Contiguous chunks are logged in a single iovec.
- *
- * If the XFS_BLI_STALE flag has been set, then log nothing.
+ * It calculates this as 1 iovec for the buf log format structure and 1 for each
+ * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
+ * in a single iovec.
  */
 STATIC void
 xfs_buf_item_size_segment(
@@ -168,11 +166,8 @@ xfs_buf_item_size_segment(
 }
 
 /*
- * This returns the number of log iovecs needed to log the given buf log item.
- *
- * It calculates this as 1 iovec for the buf log format structure and 1 for each
- * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
- * in a single iovec.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item.
  *
  * Discontiguous buffers need a format structure per region that is being
  * logged. This makes the changes in the buffer appear to log recovery as though
@@ -182,7 +177,11 @@ xfs_buf_item_size_segment(
  * what ends up on disk.
  *
  * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
- * format structures.
+ * format structures. If the item has previously been logged and has dirty
+ * regions, we do not relog them in stale buffers. This has the effect of
+ * reducing the size of the relogged item by the amount of dirty data tracked
+ * by the log item. This can result in the committing transaction reducing the
+ * amount of space being consumed by the CIL.
  */
 STATIC void
 xfs_buf_item_size(
@@ -199,9 +198,9 @@ xfs_buf_item_size(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
-		 * The buffer is stale, so all we need to log
-		 * is the buf log format structure with the
-		 * cancel flag in it.
+		 * The buffer is stale, so all we need to log is the buf log
+		 * format structure with the cancel flag in it as we are never
+		 * going to replay the changes tracked in the log item.
 		 */
 		trace_xfs_buf_item_size_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
@@ -216,9 +215,9 @@ xfs_buf_item_size(
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
-		 * The buffer has been logged just to order it.
-		 * It is not being included in the transaction
-		 * commit, so no vectors are used at all.
+		 * The buffer has been logged just to order it. It is not being
+		 * included in the transaction commit, so no vectors are used at
+		 * all.
 		 */
 		trace_xfs_buf_item_size_ordered(bip);
 		*nvecs = XFS_LOG_VEC_ORDERED;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6764d12342da..5a2dd33020e2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+/*
+ * The logged size of an inode fork is always the current size of the inode
+ * fork. This means that when an inode fork is relogged, the size of the logged
+ * region is determined by the current state, not the combination of the
+ * previously logged state + the current state. This is different relogging
+ * behaviour to most other log items which will retain the size of the
+ * previously logged changes when smaller regions are relogged.
+ *
+ * Hence for operations that remove data from the inode fork (e.g. shortform
+ * dir/attr remove, extent form extent removal, etc.), the size of the relogged
+ * inode gets -smaller- rather than staying the same as the previously logged
+ * size, and this can result in the committing transaction reducing the amount of
+ * space being consumed by the CIL.
+ */
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 9d2fa8464289..903617e6d054 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,9 +670,14 @@ xlog_cil_push_work(
 	ASSERT(push_seq <= ctx->sequence);
 
 	/*
-	 * Wake up any background push waiters now this context is being pushed.
+	 * As we are about to switch to a new, empty CIL context, we no longer
+	 * need to throttle tasks on CIL space overruns. Wake any waiters that
+	 * the hard push throttle may have caught so they can start committing
+	 * to the new context. The ctx->xc_push_lock provides the serialisation
+	 * necessary for safely using the lockless waitqueue_active() check in
+	 * this context.
 	 */
-	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+	if (waitqueue_active(&cil->xc_push_wait))
 		wake_up_all(&cil->xc_push_wait);
 
 	/*
@@ -944,7 +949,7 @@ xlog_cil_push_background(
 	ASSERT(!list_empty(&cil->xc_cil));
 
 	/*
-	 * don't do a background push if we haven't used up all the
+	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
 	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
@@ -968,9 +973,16 @@ xlog_cil_push_background(
 
 	/*
 	 * If we are well over the space limit, throttle the work that is being
-	 * done until the push work on this context has begun.
+	 * done until the push work on this context has begun. Enforce the hard
+	 * throttle on all transaction commits once it has been activated, even
+	 * if the committing transactions have resulted in the space usage
+	 * dipping back down under the hard limit.
+	 *
+	 * The ctx->xc_push_lock provides the serialisation necessary for safely
+	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
 		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
-- 
2.31.1



* [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (7 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 08/39] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:20   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing Dave Chinner
                   ` (29 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This led me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and is then used by xfs_log_force_lsn() to flush the iclogs
to the journal.

The problem is that ip->i_itemp->ili_last_lsn does not store a
log sequence number. What it stores is passed to it from the
->iop_committing method, which is called by xfs_log_commit_cil().
The value this passes to the iop_committing method is the CIL
context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by
a couple of log item types, and only inode items use the sequence
number it is passed.

Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.
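
Condensed from the diff below, the fsync path after the rename does:

	xfs_csn_t	seq = ip->i_itemp->ili_commit_seq; /* CIL sequence */
	xfs_lsn_t	lsn;

	/* convert the sequence to the checkpoint's commit record LSN */
	lsn = xlog_cil_force_seq(log, seq);
	if (lsn != NULLCOMMITLSN)
		error = xlog_force_lsn(log, lsn, XFS_LOG_SYNC, NULL, false);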

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_types.h |  1 +
 fs/xfs/xfs_buf_item.c     |  2 +-
 fs/xfs/xfs_dquot_item.c   |  2 +-
 fs/xfs/xfs_file.c         | 14 +++++++-------
 fs/xfs/xfs_inode.c        | 10 +++++-----
 fs/xfs/xfs_inode_item.c   |  4 ++--
 fs/xfs/xfs_inode_item.h   |  2 +-
 fs/xfs/xfs_log.c          | 27 ++++++++++++++-------------
 fs/xfs/xfs_log.h          |  4 +---
 fs/xfs/xfs_log_cil.c      | 30 +++++++++++-------------------
 fs/xfs/xfs_log_priv.h     | 15 +++++++--------
 fs/xfs/xfs_trans.c        |  6 +++---
 fs/xfs/xfs_trans.h        |  4 ++--
 13 files changed, 56 insertions(+), 65 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 064bd6e8c922..0870ef6f933d 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -21,6 +21,7 @@ typedef int32_t		xfs_suminfo_t;	/* type of bitmap summary info */
 typedef uint32_t	xfs_rtword_t;	/* word type for bitmap manipulations */
 
 typedef int64_t		xfs_lsn_t;	/* log sequence number */
+typedef int64_t		xfs_csn_t;	/* CIL sequence number */
 
 typedef uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
 typedef uint32_t	xfs_dahash_t;	/* dir/attr hash value */
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 14d1fefcbf4c..1cb087b320b1 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -713,7 +713,7 @@ xfs_buf_item_release(
 STATIC void
 xfs_buf_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
 	return xfs_buf_item_release(lip);
 }
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 8c1fdf37ee8f..8ed47b739b6c 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
 STATIC void
 xfs_qm_dquot_logitem_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
 	return xfs_qm_dquot_logitem_release(lip);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e7e9af57e788..277d0f3921cc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -119,8 +119,8 @@ xfs_dir_fsync(
 	return xfs_log_force_inode(ip);
 }
 
-static xfs_lsn_t
-xfs_fsync_lsn(
+static xfs_csn_t
+xfs_fsync_seq(
 	struct xfs_inode	*ip,
 	bool			datasync)
 {
@@ -128,7 +128,7 @@ xfs_fsync_lsn(
 		return 0;
 	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		return 0;
-	return ip->i_itemp->ili_last_lsn;
+	return ip->i_itemp->ili_commit_seq;
 }
 
 /*
@@ -151,12 +151,12 @@ xfs_fsync_flush_log(
 	int			*log_flushed)
 {
 	int			error = 0;
-	xfs_lsn_t		lsn;
+	xfs_csn_t		seq;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	lsn = xfs_fsync_lsn(ip, datasync);
-	if (lsn) {
-		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
+	seq = xfs_fsync_seq(ip, datasync);
+	if (seq) {
+		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
 					  log_flushed);
 
 		spin_lock(&ip->i_itemp->ili_lock);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 336c350206a8..1c7e0d4e0013 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2604,7 +2604,7 @@ xfs_iunpin(
 	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
 
 	/* Give the log a push to start the unpinning I/O */
-	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
+	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
 
 }
 
@@ -3618,16 +3618,16 @@ int
 xfs_log_force_inode(
 	struct xfs_inode	*ip)
 {
-	xfs_lsn_t		lsn = 0;
+	xfs_csn_t		seq = 0;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip))
-		lsn = ip->i_itemp->ili_last_lsn;
+		seq = ip->i_itemp->ili_commit_seq;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
-	if (!lsn)
+	if (!seq)
 		return 0;
-	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
+	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 5a2dd33020e2..35de30849fcc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -643,9 +643,9 @@ xfs_inode_item_committed(
 STATIC void
 xfs_inode_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	xfs_csn_t		seq)
 {
-	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
+	INODE_ITEM(lip)->ili_commit_seq = seq;
 	return xfs_inode_item_release(lip);
 }
 
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4b926e32831c..403b45ab9aa2 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -33,7 +33,7 @@ struct xfs_inode_log_item {
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
+	xfs_csn_t		ili_commit_seq;	   /* last transaction commit */
 };
 
 static inline int xfs_inode_clean(struct xfs_inode *ip)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b6145e4cb7bc..aa37f4319052 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3252,14 +3252,13 @@ xfs_log_force(
 }
 
 static int
-__xfs_log_force_lsn(
-	struct xfs_mount	*mp,
+xlog_force_lsn(
+	struct xlog		*log,
 	xfs_lsn_t		lsn,
 	uint			flags,
 	int			*log_flushed,
 	bool			already_slept)
 {
-	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
 
 	spin_lock(&log->l_icloglock);
@@ -3292,8 +3291,6 @@ __xfs_log_force_lsn(
 		if (!already_slept &&
 		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
 		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
-			XFS_STATS_INC(mp, xs_log_force_sleep);
-
 			xlog_wait(&iclog->ic_prev->ic_write_wait,
 					&log->l_icloglock);
 			return -EAGAIN;
@@ -3331,25 +3328,29 @@ __xfs_log_force_lsn(
  * to disk, that thread will wake up all threads waiting on the queue.
  */
 int
-xfs_log_force_lsn(
+xfs_log_force_seq(
 	struct xfs_mount	*mp,
-	xfs_lsn_t		lsn,
+	xfs_csn_t		seq,
 	uint			flags,
 	int			*log_flushed)
 {
+	struct xlog		*log = mp->m_log;
+	xfs_lsn_t		lsn;
 	int			ret;
-	ASSERT(lsn != 0);
+	ASSERT(seq != 0);
 
 	XFS_STATS_INC(mp, xs_log_force);
-	trace_xfs_log_force(mp, lsn, _RET_IP_);
+	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
+	lsn = xlog_cil_force_seq(log, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
-	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
-	if (ret == -EAGAIN)
-		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
+	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
+	if (ret == -EAGAIN) {
+		XFS_STATS_INC(mp, xs_log_force_sleep);
+		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
+	}
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 99f9d6ed9598..813b972e9788 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -106,7 +106,7 @@ struct xfs_item_ops;
 struct xfs_trans;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
-int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
+int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
 		int *log_forced);
 int	  xfs_log_mount(struct xfs_mount	*mp,
 			struct xfs_buftarg	*log_target,
@@ -131,8 +131,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
-				xfs_lsn_t *commit_lsn, bool regrant);
 void	xlog_cil_process_committed(struct list_head *list);
 bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 903617e6d054..3c2b1205944d 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -788,7 +788,7 @@ xlog_cil_push_work(
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
 	 *
-	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
 	 * structure atomically with the addition of this sequence to the
 	 * committing list. This also ensures that we can do unlocked checks
 	 * against the current sequence in log forces without risking
@@ -1057,16 +1057,14 @@ xlog_cil_empty(
  * allowed again.
  */
 void
-xfs_log_commit_cil(
-	struct xfs_mount	*mp,
+xlog_cil_commit(
+	struct xlog		*log,
 	struct xfs_trans	*tp,
-	xfs_lsn_t		*commit_lsn,
+	xfs_csn_t		*commit_seq,
 	bool			regrant)
 {
-	struct xlog		*log = mp->m_log;
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
-	xfs_lsn_t		xc_commit_lsn;
 
 	/*
 	 * Do all necessary memory allocation before we lock the CIL.
@@ -1080,10 +1078,6 @@ xfs_log_commit_cil(
 
 	xlog_cil_insert_items(log, tp);
 
-	xc_commit_lsn = cil->xc_ctx->sequence;
-	if (commit_lsn)
-		*commit_lsn = xc_commit_lsn;
-
 	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
 	else
@@ -1106,8 +1100,10 @@ xfs_log_commit_cil(
 	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
 		xfs_trans_del_item(lip);
 		if (lip->li_ops->iop_committing)
-			lip->li_ops->iop_committing(lip, xc_commit_lsn);
+			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
 	}
+	if (commit_seq)
+		*commit_seq = cil->xc_ctx->sequence;
 
 	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
 	xlog_cil_push_background(log);
@@ -1124,9 +1120,9 @@ xfs_log_commit_cil(
  * iclog flush is necessary following this call.
  */
 xfs_lsn_t
-xlog_cil_force_lsn(
+xlog_cil_force_seq(
 	struct xlog	*log,
-	xfs_lsn_t	sequence)
+	xfs_csn_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx;
@@ -1222,21 +1218,17 @@ bool
 xfs_log_item_in_current_chkpt(
 	struct xfs_log_item *lip)
 {
-	struct xfs_cil_ctx *ctx;
+	struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
 
 	if (list_empty(&lip->li_cil))
 		return false;
 
-	ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
-
 	/*
 	 * li_seq is written on the first commit of a log item to record the
 	 * first checkpoint it is written to. Hence if it is different to the
 	 * current sequence, we're in a new checkpoint.
 	 */
-	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
-		return false;
-	return true;
+	return lip->li_seq == ctx->sequence;
 }
 
 /*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2203ccecafb6..2d7e7cbee8b7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -234,7 +234,7 @@ struct xfs_cil;
 
 struct xfs_cil_ctx {
 	struct xfs_cil		*cil;
-	xfs_lsn_t		sequence;	/* chkpt sequence # */
+	xfs_csn_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
@@ -272,10 +272,10 @@ struct xfs_cil {
 	struct xfs_cil_ctx	*xc_ctx;
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
-	xfs_lsn_t		xc_push_seq;
+	xfs_csn_t		xc_push_seq;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
-	xfs_lsn_t		xc_current_sequence;
+	xfs_csn_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
@@ -554,19 +554,18 @@ int	xlog_cil_init(struct xlog *log);
 void	xlog_cil_init_post_recovery(struct xlog *log);
 void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
+void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
+			xfs_csn_t *commit_seq, bool regrant);
 
 /*
  * CIL force routines
  */
-xfs_lsn_t
-xlog_cil_force_lsn(
-	struct xlog *log,
-	xfs_lsn_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 586f2992b789..87bffd12c20c 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -839,7 +839,7 @@ __xfs_trans_commit(
 	bool			regrant)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_lsn_t		commit_lsn = -1;
+	xfs_csn_t		commit_seq = 0;
 	int			error = 0;
 	int			sync = tp->t_flags & XFS_TRANS_SYNC;
 
@@ -881,7 +881,7 @@ __xfs_trans_commit(
 		xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
-	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
+	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
 
 	xfs_trans_free(tp);
 
@@ -890,7 +890,7 @@ __xfs_trans_commit(
 	 * log out now and wait for it.
 	 */
 	if (sync) {
-		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
+		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
 		XFS_STATS_INC(mp, xs_trans_sync);
 	} else {
 		XFS_STATS_INC(mp, xs_trans_async);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ee42d98d9011..50da47f23a07 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -43,7 +43,7 @@ struct xfs_log_item {
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
-	xfs_lsn_t			li_seq;		/* CIL commit seq */
+	xfs_csn_t			li_seq;		/* CIL commit seq */
 };
 
 /*
@@ -69,7 +69,7 @@ struct xfs_item_ops {
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
 	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
-	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
+	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	int (*iop_recover)(struct xfs_log_item *lip,
-- 
2.31.1



* [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (8 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:33   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 11/39] xfs: CIL work is serialised, not pipelined Dave Chinner
                   ` (28 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The AIL pushing is stalling on log forces when it comes across
pinned items. This is happening on removal workloads where the AIL
is dominated by stale items that are removed from the AIL when the
checkpoint that marks the items stale is committed to the journal.
This results in relatively few items in the AIL, but those that
remain are often pinned, as the directories the items are being
removed from are still being logged.

As a result, many push cycles through the CIL will first issue a
blocking log force to unpin the items. This can take some time to
complete, with tracing regularly showing push delays of half a
second and sometimes up into the range of several seconds. Sequences
like this aren't uncommon:

....
 399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 270ms delay>
 400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
 400.099623:  xfsaild: first lsn 0x11002f3600
 400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
<wanted 50ms, got 500ms delay>
 400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
 400.589349:  xfsaild: first lsn 0x1100305000
 400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
<wanted 50ms, got 460ms delay>
 400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
 400.950343:  xfsaild: first lsn 0x1100317c00
 400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
<wanted 20ms, got 200ms delay>
 401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
 401.142334:  xfsaild: first lsn 0x110032e600
 401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
<wanted 10ms, got 10ms delay>
 401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
 401.154328:  xfsaild: first lsn 0x1100353000
 401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
<wanted 20ms, got 300ms delay>
 401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
 401.451526:  xfsaild: first lsn 0x1100353000
 401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
<wanted 50ms, got 500ms delay>
 401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
....

In each of these cases, every AIL pass saw 101 log items stuck on
the AIL (pinned) with very few other items being found. On each
pass, a log force was issued, and the delay between last/first is
the sleep time plus the sync log force time.

Some of these 101 items pinned the tail of the log. The tail of the
log does slowly creep forward (first lsn), but the problem is that
the log is actually out of reservation space because it's been
running so many transactions that create stale items that never
reach the AIL but still consume log space. Hence we have a largely
empty AIL, with long term pins on items that pin the tail of the
log and that don't get pushed frequently enough to keep log space
available.

The problem is the hundreds of milliseconds that we block in the log
force pushing the CIL out to disk. The AIL should not be stalled
like this - it needs to run and flush items that are at the tail of
the log with minimal latency. What we really need to do is trigger a
log flush, but then not wait for it at all - we've already done our
waiting for stuff to complete when we backed off prior to the log
force being issued.

Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
still do a blocking flush of the CIL and that is what is causing the
issue. Hence we need a new interface for the CIL to trigger an
immediate background push of the CIL to get it moving faster but not
to wait on that to occur. While the CIL is pushing, the AIL can also
be pushing.

We already have an internal interface to do this -
xlog_cil_push_now() - but we need a wrapper for it to be used
externally. xlog_cil_force_seq() can easily be extended to do what
we need as it already implements the synchronous CIL push via
xlog_cil_push_now(). Add the necessary flags and "push current
sequence" semantics to xlog_cil_force_seq() and convert the AIL
pushing to use it.

One of the complexities here is that the CIL push does not guarantee
that the commit record for the CIL checkpoint is written to disk.
The current log force ensures this by submitting the current ACTIVE
iclog that the commit record was written to. We need the CIL to
actually write this commit record to disk for an async push to
ensure that the checkpoint actually makes it to disk and unpins the
pinned items in the checkpoint on completion. Hence we need to pass
down to the CIL push that we are doing an async flush so that it can
switch out the commit_iclog if necessary to get written to disk when
the commit iclog is finally released.
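
Condensed from the diff below, the async force boils down to:

	/* AIL: push the current CIL sequence but do not wait on it */
	void
	xlog_cil_flush(
		struct xlog	*log)
	{
		xfs_csn_t	seq = log->l_cilp->xc_current_sequence;

		xlog_cil_push_now(log, seq, true);	/* async */
	}

	/* push work: an async push must write the commit record itself */
	if (push_commit_stable && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
		xlog_state_switch_iclogs(log, commit_iclog, 0);
	xlog_state_release_iclog(log, commit_iclog);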

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c       | 38 ++++++++++++++------------
 fs/xfs/xfs_log.h       |  1 +
 fs/xfs/xfs_log_cil.c   | 61 ++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_log_priv.h  |  5 ++++
 fs/xfs/xfs_sysfs.c     |  1 +
 fs/xfs/xfs_trace.c     |  1 +
 fs/xfs/xfs_trans.c     |  2 +-
 fs/xfs/xfs_trans_ail.c | 11 +++++---
 8 files changed, 91 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index aa37f4319052..c53644d19dd3 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -50,11 +50,6 @@ xlog_state_get_iclog_space(
 	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
-xlog_state_switch_iclogs(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	int			eventual_size);
-STATIC void
 xlog_grant_push_ail(
 	struct xlog		*log,
 	int			need_bytes);
@@ -3104,7 +3099,7 @@ xfs_log_ticket_ungrant(
  * This routine will mark the current iclog in the ring as WANT_SYNC and move
  * the current iclog pointer to the next iclog in the ring.
  */
-STATIC void
+void
 xlog_state_switch_iclogs(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
@@ -3251,6 +3246,20 @@ xfs_log_force(
 	return -EIO;
 }
 
+/*
+ * Force the log to a specific LSN.
+ *
+ * If an iclog with that lsn can be found:
+ *	If it is in the DIRTY state, just return.
+ *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
+ *		state and go to sleep or return.
+ *	If it is in any other state, go to sleep or return.
+ *
+ * Synchronous forces are implemented with a wait queue.  All callers trying
+ * to force a given lsn to disk must wait on the queue attached to the
+ * specific in-core log.  When given in-core log finally completes its write
+ * to disk, that thread will wake up all threads waiting on the queue.
+ */
 static int
 xlog_force_lsn(
 	struct xlog		*log,
@@ -3314,18 +3323,13 @@ xlog_force_lsn(
 }
 
 /*
- * Force the in-core log to disk for a specific LSN.
+ * Force the log to a specific checkpoint sequence.
  *
- * Find in-core log with lsn.
- *	If it is in the DIRTY state, just return.
- *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
- *		state and go to sleep or return.
- *	If it is in any other state, go to sleep or return.
- *
- * Synchronous forces are implemented with a wait queue.  All callers trying
- * to force a given lsn to disk must wait on the queue attached to the
- * specific in-core log.  When given in-core log finally completes its write
- * to disk, that thread will wake up all threads waiting on the queue.
+ * First force the CIL so that all the required changes have been flushed to the
+ * iclogs. If the CIL force completed it will return a commit LSN that indicates
+ * the iclog that needs to be flushed to stable storage. If the caller needs
+ * a synchronous log force, we will wait on the iclog with the LSN returned by
+ * xlog_cil_force_seq() to be completed.
  */
 int
 xfs_log_force_seq(
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 813b972e9788..1bd080ce3a95 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -104,6 +104,7 @@ struct xlog_ticket;
 struct xfs_log_item;
 struct xfs_item_ops;
 struct xfs_trans;
+struct xlog;
 
 int	  xfs_log_force(struct xfs_mount *mp, uint flags);
 int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 3c2b1205944d..cb849e67b1c4 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -658,6 +658,7 @@ xlog_cil_push_work(
 	xfs_lsn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
+	bool			push_commit_stable;
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -668,6 +669,8 @@ xlog_cil_push_work(
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
 	ASSERT(push_seq <= ctx->sequence);
+	push_commit_stable = cil->xc_push_commit_stable;
+	cil->xc_push_commit_stable = false;
 
 	/*
 	 * As we are about to switch to a new, empty CIL context, we no longer
@@ -910,8 +913,15 @@ xlog_cil_push_work(
 	 * The commit iclog must be written to stable storage to guarantee
 	 * journal IO vs metadata writeback IO is correctly ordered on stable
 	 * storage.
+	 *
+	 * If the push caller needs the commit to be immediately stable and the
+	 * commit_iclog is not yet marked as XLOG_STATE_WANT_SYNC to indicate it
+	 * will be written when released, switch its state to WANT_SYNC right
+	 * now.
 	 */
 	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
+	if (push_commit_stable && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(log, commit_iclog, 0);
 	xlog_state_release_iclog(log, commit_iclog);
 	spin_unlock(&log->l_icloglock);
 	return;
@@ -996,13 +1006,26 @@ xlog_cil_push_background(
 /*
  * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
  * number that is passed. When it returns, the work will be queued for
- * @push_seq, but it won't be completed. The caller is expected to do any
- * waiting for push_seq to complete if it is required.
+ * @push_seq, but it won't be completed.
+ *
+ * If the caller is performing a synchronous force, we will flush the workqueue
+ * to get previously queued work moving to minimise the wait time they will
+ * undergo waiting for all outstanding pushes to complete. The caller is
+ * expected to do the required waiting for push_seq to complete.
+ *
+ * If the caller is performing an async push, we need to ensure that the
+ * checkpoint is fully flushed out of the iclogs when we finish the push. If we
+ * don't do this, then the commit record may remain sitting in memory in an
+ * ACTIVE iclog. This then requires another full log force to push to disk,
+ * which defeats the purpose of having an async, non-blocking CIL force
+ * mechanism. Hence in this case we need to pass a flag to the push work to
+ * indicate it needs to flush the commit record itself.
  */
 static void
 xlog_cil_push_now(
 	struct xlog	*log,
-	xfs_lsn_t	push_seq)
+	xfs_lsn_t	push_seq,
+	bool		async)
 {
 	struct xfs_cil	*cil = log->l_cilp;
 
@@ -1012,7 +1035,8 @@ xlog_cil_push_now(
 	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
 
 	/* start on any pending background push to minimise wait time on it */
-	flush_work(&cil->xc_push_work);
+	if (!async)
+		flush_work(&cil->xc_push_work);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1025,6 +1049,7 @@ xlog_cil_push_now(
 	}
 
 	cil->xc_push_seq = push_seq;
+	cil->xc_push_commit_stable = async;
 	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1109,12 +1134,27 @@ xlog_cil_commit(
 	xlog_cil_push_background(log);
 }
 
+/*
+ * Flush the CIL to stable storage but don't wait for it to complete. This
+ * requires the CIL push to ensure the commit record for the push hits the disk,
+ * but otherwise is no different to a push done from a log force.
+ */
+void
+xlog_cil_flush(
+	struct xlog	*log)
+{
+	xfs_csn_t	seq = log->l_cilp->xc_current_sequence;
+
+	trace_xfs_log_force(log->l_mp, seq, _RET_IP_);
+	xlog_cil_push_now(log, seq, true);
+}
+
 /*
  * Conditionally push the CIL based on the sequence passed in.
  *
- * We only need to push if we haven't already pushed the sequence
- * number given. Hence the only time we will trigger a push here is
- * if the push sequence is the same as the current context.
+ * We only need to push if we haven't already pushed the sequence number given.
+ * Hence the only time we will trigger a push here is if the push sequence is
+ * the same as the current context.
  *
  * We return the current commit lsn to allow the callers to determine if a
  * iclog flush is necessary following this call.
@@ -1130,13 +1170,17 @@ xlog_cil_force_seq(
 
 	ASSERT(sequence <= cil->xc_current_sequence);
 
+	if (!sequence)
+		sequence = cil->xc_current_sequence;
+	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
+
 	/*
 	 * check to see if we need to force out the current context.
 	 * xlog_cil_push() handles racing pushes for the same sequence,
 	 * so no need to deal with it here.
 	 */
 restart:
-	xlog_cil_push_now(log, sequence);
+	xlog_cil_push_now(log, sequence, false);
 
 	/*
 	 * See if we can find a previous sequence still committing.
@@ -1160,6 +1204,7 @@ xlog_cil_force_seq(
 			 * It is still being pushed! Wait for the push to
 			 * complete, then start again from the beginning.
 			 */
+			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
 			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
 			goto restart;
 		}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 2d7e7cbee8b7..a863ccb5ece6 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -273,6 +273,7 @@ struct xfs_cil {
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
 	xfs_csn_t		xc_push_seq;
+	bool			xc_push_commit_stable;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
@@ -487,9 +488,12 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
+
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
+void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
+		int eventual_size);
 int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
 
 /*
@@ -560,6 +564,7 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
 /*
  * CIL force routines
  */
+void xlog_cil_flush(struct xlog *log);
 xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
 
 static inline void
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index f1bc88f4367c..18dc5eca6c04 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -10,6 +10,7 @@
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_sysfs.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_mount.h"
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 7e01e00550ac..4c86afad1617 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -20,6 +20,7 @@
 #include "xfs_bmap.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_log.h"
 #include "xfs_log_priv.h"
 #include "xfs_buf_item.h"
 #include "xfs_quota.h"
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 87bffd12c20c..c214a69b573d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -9,7 +9,6 @@
 #include "xfs_shared.h"
 #include "xfs_format.h"
 #include "xfs_log_format.h"
-#include "xfs_log_priv.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
 #include "xfs_extent_busy.h"
@@ -17,6 +16,7 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 #include "xfs_trace.h"
 #include "xfs_error.h"
 #include "xfs_defer.h"
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dbb69b4bf3ed..69aac416e2ce 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -17,6 +17,7 @@
 #include "xfs_errortag.h"
 #include "xfs_error.h"
 #include "xfs_log.h"
+#include "xfs_log_priv.h"
 
 #ifdef DEBUG
 /*
@@ -429,8 +430,12 @@ xfsaild_push(
 
 	/*
 	 * If we encountered pinned items or did not finish writing out all
-	 * buffers the last time we ran, force the log first and wait for it
-	 * before pushing again.
+	 * buffers the last time we ran, force a background CIL push to get the
+	 * items unpinned in the near future. We do not wait on the CIL push as
+	 * that could stall us for seconds if there is enough background IO
+	 * load. Stalling for that long when the tail of the log is pinned and
+	 * needs flushing will hard stop the transaction subsystem when log
+	 * space runs out.
 	 */
 	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
 	    (!list_empty_careful(&ailp->ail_buf_list) ||
@@ -438,7 +443,7 @@ xfsaild_push(
 		ailp->ail_log_flush = 0;
 
 		XFS_STATS_INC(mp, xs_push_ail_flush);
-		xfs_log_force(mp, XFS_LOG_SYNC);
+		xlog_cil_flush(mp->m_log);
 	}
 
 	spin_lock(&ailp->ail_lock);
-- 
2.31.1



* [PATCH 11/39] xfs: CIL work is serialised, not pipelined
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (9 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:32   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 12/39] xfs: factor out the CIL transaction header building Dave Chinner
                   ` (27 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because we use a single work structure attached to the CIL rather
than the CIL context, we can only queue a single work item at a
time. This results in the CIL being single threaded and limits
performance when it becomes CPU bound.

The design of the CIL is that it is pipelined and multiple commits
can be running concurrently, but the way the work is currently
implemented means that it is not pipelining as it was intended. The
critical work to switch the CIL context can take a few milliseconds
to run, but the rest of the CIL context flush can take hundreds of
milliseconds to complete. The context switching is the serialisation
point of the CIL; once the context has been switched, the rest of the
context push can run asynchronously with all other context pushes.

Hence we can move the work to the CIL context so that we can run
multiple CIL pushes at the same time and spread the majority of
the work out over multiple CPUs. We can keep the per-cpu CIL commit
state on the CIL rather than the context, because the context is
pinned to the CIL until the switch is done and we aggregate and
drain the per-cpu state held on the CIL during the context switch.

However, because we no longer serialise the CIL work, we can have
effectively unlimited CIL pushes in progress. We don't want to do
this - not only does it create contention on the iclogs and the
state machine locks, we can run the log right out of space with
outstanding pushes. Instead, limit the work concurrency to 4
concurrent works being processed at a time. This is enough
concurrency to remove the CIL from being a CPU-bound bottleneck but
not enough to create new contention points or unbounded concurrency
issues.
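
Condensed from the diff below, the mechanism is simply a work item
per context plus a bounded workqueue:

	/* one push work per CIL context, not per CIL */
	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);

	/* max_active = 4 bounds the number of concurrent pushes */
	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
			4, mp->m_super->s_id);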

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h |  2 +-
 fs/xfs/xfs_super.c    |  6 +++-
 3 files changed, 48 insertions(+), 40 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index cb849e67b1c4..713ea66d4c0c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
 	return tic;
 }
 
+/*
+ * Unavoidable forward declaration - xlog_cil_push_work() calls
+ * xlog_cil_ctx_alloc() itself.
+ */
+static void xlog_cil_push_work(struct work_struct *work);
+
+static struct xfs_cil_ctx *
+xlog_cil_ctx_alloc(void)
+{
+	struct xfs_cil_ctx	*ctx;
+
+	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
+	INIT_LIST_HEAD(&ctx->committing);
+	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
+	return ctx;
+}
+
+static void
+xlog_cil_ctx_switch(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx)
+{
+	ctx->sequence = ++cil->xc_current_sequence;
+	ctx->cil = cil;
+	cil->xc_ctx = ctx;
+}
+
 /*
  * After the first stage of log recovery is done, we know where the head and
  * tail of the log are. We need this log initialisation done before we can
@@ -641,11 +669,11 @@ static void
 xlog_cil_push_work(
 	struct work_struct	*work)
 {
-	struct xfs_cil		*cil =
-		container_of(work, struct xfs_cil, xc_push_work);
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, push_work);
+	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
 	struct xfs_log_vec	*lv;
-	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
@@ -660,11 +688,10 @@ xlog_cil_push_work(
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
 
-	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
+	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
 
 	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;
 
 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
@@ -696,7 +723,7 @@ xlog_cil_push_work(
 
 
 	/* check for a previously pushed sequence */
-	if (push_seq < cil->xc_ctx->sequence) {
+	if (push_seq < ctx->sequence) {
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
@@ -761,19 +788,7 @@ xlog_cil_push_work(
 	}
 
 	/*
-	 * initialise the new context and attach it to the CIL. Then attach
-	 * the current context to the CIL committing list so it can be found
-	 * during log forces to extract the commit lsn of the sequence that
-	 * needs to be forced.
-	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * The switch is now done, so we can drop the context lock and move out
+	 * Switch the contexts so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
 	 * that the commit records are correctly ordered in the log to ensure
@@ -798,7 +813,7 @@ xlog_cil_push_work(
 	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_switch(cil, new_ctx);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
@@ -970,7 +985,7 @@ xlog_cil_push_background(
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
-		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	}
 
 	/*
@@ -1036,7 +1051,7 @@ xlog_cil_push_now(
 
 	/* start on any pending background push to minimise wait time on it */
 	if (!async)
-		flush_work(&cil->xc_push_work);
+		flush_workqueue(log->l_mp->m_cil_workqueue);
 
 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1050,7 +1065,7 @@ xlog_cil_push_now(
 
 	cil->xc_push_seq = push_seq;
 	cil->xc_push_commit_stable = async;
-	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
 
@@ -1290,13 +1305,6 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
-	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
-
-	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1304,16 +1312,12 @@ xlog_cil_init(
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
-
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
-
 	cil->xc_log = log;
 	log->l_cilp = cil;
+
+	ctx = xlog_cil_ctx_alloc();
+	xlog_cil_ctx_switch(cil, ctx);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a863ccb5ece6..87447fa34c43 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@ struct xfs_cil_ctx {
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
+	struct work_struct	push_work;
 };
 
 /*
@@ -277,7 +278,6 @@ struct xfs_cil {
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
-	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index e339d1de2419..0608091f13a6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -501,9 +501,13 @@ xfs_init_mount_workqueues(
 	if (!mp->m_unwritten_workqueue)
 		goto out_destroy_buf;
 
+	/*
+	 * Limit the CIL pipeline depth to 4 concurrent works to bound the
+	 * concurrency the log spinlocks will be exposed to.
+	 */
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
-			0, mp->m_super->s_id);
+			4, mp->m_super->s_id);
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
 
-- 
2.31.1



* [PATCH 12/39] xfs: factor out the CIL transaction header building
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (10 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 11/39] xfs: CIL work is serialised, not pipelined Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 13/39] xfs: only CIL pushes require a start record Dave Chinner
                   ` (26 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

It is static code deep in the middle of the CIL push logic. Factor
it out into a helper so that it is clear and easy to modify
separately.
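
After this change the push path reduces to (condensed from the diff
below):

	struct xlog_cil_trans_hdr	thdr;
	struct xfs_log_vec		lvhdr = { NULL };

	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
				XLOG_START_TRANS);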

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 71 +++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 713ea66d4c0c..c7e79800b6f5 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -651,6 +651,41 @@ xlog_cil_process_committed(
 	}
 }
 
+struct xlog_cil_trans_hdr {
+	struct xfs_trans_header	thdr;
+	struct xfs_log_iovec	lhdr;
+};
+
+/*
+ * Build a checkpoint transaction header to begin the journal transaction.  We
+ * need to account for the space used by the transaction header here as it is
+ * not accounted for in xlog_write().
+ */
+static void
+xlog_cil_build_trans_hdr(
+	struct xfs_cil_ctx	*ctx,
+	struct xlog_cil_trans_hdr *hdr,
+	struct xfs_log_vec	*lvhdr,
+	int			num_iovecs)
+{
+	struct xlog_ticket	*tic = ctx->ticket;
+
+	memset(hdr, 0, sizeof(*hdr));
+
+	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
+	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
+	hdr->thdr.th_tid = tic->t_tid;
+	hdr->thdr.th_num_items = num_iovecs;
+	hdr->lhdr.i_addr = &hdr->thdr;
+	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
+	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
+	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(struct xlog_op_header);
+
+	lvhdr->lv_niovecs = 1;
+	lvhdr->lv_iovecp = &hdr->lhdr;
+	lvhdr->lv_next = ctx->lv_chain;
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -676,11 +711,9 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
-	struct xlog_ticket	*tic;
 	int			num_iovecs;
 	int			error = 0;
-	struct xfs_trans_header thdr;
-	struct xfs_log_iovec	lhdr;
+	struct xlog_cil_trans_hdr thdr;
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
@@ -821,24 +854,8 @@ xlog_cil_push_work(
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
 	 * transaction header here as it is not accounted for in xlog_write().
-	 *
-	 * The LSN we need to pass to the log items on transaction commit is
-	 * the LSN reported by the first log vector write. If we use the commit
-	 * record lsn then we can move the tail beyond the grant write head.
 	 */
-	tic = ctx->ticket;
-	thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
-	thdr.th_type = XFS_TRANS_CHECKPOINT;
-	thdr.th_tid = tic->t_tid;
-	thdr.th_num_items = num_iovecs;
-	lhdr.i_addr = &thdr;
-	lhdr.i_len = sizeof(xfs_trans_header_t);
-	lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
-	tic->t_curr_res -= lhdr.i_len + sizeof(xlog_op_header_t);
-
-	lvhdr.lv_niovecs = 1;
-	lvhdr.lv_iovecp = &lhdr;
-	lvhdr.lv_next = ctx->lv_chain;
+	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -846,7 +863,13 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
+	/*
+	 * The LSN we need to pass to the log items on transaction commit is the
+	 * LSN reported by the first log vector write, not the commit lsn. If we
+	 * use the commit record lsn then we can move the tail beyond the grant
+	 * write head.
+	 */
+	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
 				XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
@@ -885,11 +908,11 @@ xlog_cil_push_work(
 	}
 	spin_unlock(&cil->xc_push_lock);
 
-	error = xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);
+	error = xlog_commit_record(log, ctx->ticket, &commit_iclog, &commit_lsn);
 	if (error)
 		goto out_abort_free_ticket;
 
-	xfs_log_ticket_ungrant(log, tic);
+	xfs_log_ticket_ungrant(log, ctx->ticket);
 
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
@@ -948,7 +971,7 @@ xlog_cil_push_work(
 	return;
 
 out_abort_free_ticket:
-	xfs_log_ticket_ungrant(log, tic);
+	xfs_log_ticket_ungrant(log, ctx->ticket);
 out_abort:
 	ASSERT(XLOG_FORCED_SHUTDOWN(log));
 	xlog_cil_committed(ctx);
-- 
2.31.1



* [PATCH 13/39] xfs: only CIL pushes require a start record
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (11 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 12/39] xfs: factor out the CIL transaction header building Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record Dave Chinner
                   ` (25 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

So move the one-off start record writing in xlog_write() out into
the static header that the CIL push builds to write into the log
initially. This simplifies the xlog_write() logic a lot.

pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly
32 bit aligned and packed for copying the log op and transaction
headers directly into the log as a single log region copy.

struct xlog_cil_trans_hdr {
        struct xlog_op_header      oph[2];               /*     0    24 */
        struct xfs_trans_header    thdr;                 /*    24    16 */
        struct xfs_log_iovec       lhdr[2];              /*    40    32 */

        /* size: 72, cachelines: 2, members: 3 */
        /* last cacheline: 8 bytes */
};

A wart is needed to handle the fact that the length of the region
the opheader points to doesn't include the opheader length. Hence if
we embed the opheader, we have to subtract the opheader length from
the length written into the opheader by the generic copying code.
This will eventually go away when everything is converted to
embedded opheaders.
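
Condensed from the diff below, the wart in the copy loop is:

	/*
	 * An embedded opheader's oh_len must not count the opheader
	 * itself, unlike one added by xlog_write() here.
	 */
	if (!added_ophdr)
		ophdr->oh_len = cpu_to_be32(copy_len -
				sizeof(struct xlog_op_header));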

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c     | 90 ++++++++++++++++++++++----------------------
 fs/xfs/xfs_log_cil.c | 44 ++++++++++++++++++----
 2 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c53644d19dd3..981cd6f8f0ff 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2113,9 +2113,9 @@ xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector.  We may need a start
- * record, and each region gets its own struct xlog_op_header and may need to be
- * double word aligned.
+ * Calculate the potential space needed by the log vector. If this is a start
+ * transaction, the caller has already accounted for both opheaders in the start
+ * transaction, so we don't need to account for them here.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2128,9 +2128,6 @@ xlog_write_calc_vec_length(
 	int			len = 0;
 	int			i;
 
-	if (optype & XLOG_START_TRANS)
-		headers++;
-
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2146,24 +2143,20 @@ xlog_write_calc_vec_length(
 		}
 	}
 
+	/* Don't account for regions with embedded ophdrs */
+	if (optype && headers > 0) {
+		if (optype & XLOG_START_TRANS) {
+			ASSERT(headers >= 2);
+			headers -= 2;
+		}
+	}
+
 	ticket->t_res_num_ophdrs += headers;
 	len += headers * sizeof(struct xlog_op_header);
 
 	return len;
 }
 
-static void
-xlog_write_start_rec(
-	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket)
-{
-	ophdr->oh_tid	= cpu_to_be32(ticket->t_tid);
-	ophdr->oh_clientid = ticket->t_clientid;
-	ophdr->oh_len = 0;
-	ophdr->oh_flags = XLOG_START_TRANS;
-	ophdr->oh_res2 = 0;
-}
-
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
 	struct xlog		*log,
@@ -2368,9 +2361,11 @@ xlog_write(
 	 * If this is a commit or unmount transaction, we don't need a start
 	 * record to be written.  We do, however, have to account for the
 	 * commit or unmount header that gets written. Hence we always have
-	 * to account for an extra xlog_op_header here.
+	 * to account for an extra xlog_op_header here for commit and unmount
+	 * records.
 	 */
-	ticket->t_curr_res -= sizeof(struct xlog_op_header);
+	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
+		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
 		     "ctx ticket reservation ran out. Need to up reservation");
@@ -2407,7 +2402,7 @@ xlog_write(
 			int			copy_len;
 			int			copy_off;
 			bool			ordered = false;
-			bool			wrote_start_rec = false;
+			bool			added_ophdr = false;
 
 			/* ordered log vectors have no regions to write */
 			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
@@ -2421,25 +2416,24 @@ xlog_write(
 			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
 			/*
-			 * Before we start formatting log vectors, we need to
-			 * write a start record. Only do this for the first
-			 * iclog we write to.
+			 * The XLOG_START_TRANS has embedded ophdrs for the
+			 * start record and transaction header. They will always
+			 * be the first two regions in the lv chain.
 			 */
 			if (optype & XLOG_START_TRANS) {
-				xlog_write_start_rec(ptr, ticket);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-						sizeof(struct xlog_op_header));
-				optype &= ~XLOG_START_TRANS;
-				wrote_start_rec = true;
-			}
-
-			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
-			if (!ophdr)
-				return -EIO;
+				ophdr = reg->i_addr;
+				if (index)
+					optype &= ~XLOG_START_TRANS;
+			} else {
+				ophdr = xlog_write_setup_ophdr(log, ptr,
+							ticket, optype);
+				if (!ophdr)
+					return -EIO;
 
-			xlog_write_adv_cnt(&ptr, &len, &log_offset,
+				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
-
+				added_ophdr = true;
+			}
 			len += xlog_write_setup_copy(ticket, ophdr,
 						     iclog->ic_size-log_offset,
 						     reg->i_len,
@@ -2448,13 +2442,22 @@ xlog_write(
 						     &partial_copy_len);
 			xlog_verify_dest_ptr(log, ptr);
 
+
+			/*
+			 * Wart: need to update the length in the embedded ophdr
+			 * not to include its own length.
+			 */
+			if (!added_ophdr) {
+				ophdr->oh_len = cpu_to_be32(copy_len -
+						sizeof(struct xlog_op_header));
+			}
 			/*
 			 * Copy region.
 			 *
-			 * Unmount records just log an opheader, so can have
-			 * empty payloads with no data region to copy. Hence we
-			 * only copy the payload if the vector says it has data
-			 * to copy.
+			 * Commit and unmount records just log an opheader, so
+			 * we can have empty payloads with no data region to
+			 * copy.  Hence we only copy the payload if the vector
+			 * says it has data to copy.
 			 */
 			ASSERT(copy_len >= 0);
 			if (copy_len > 0) {
@@ -2462,12 +2465,9 @@ xlog_write(
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 						   copy_len);
 			}
-			copy_len += sizeof(struct xlog_op_header);
-			record_cnt++;
-			if (wrote_start_rec) {
+			if (added_ophdr)
 				copy_len += sizeof(struct xlog_op_header);
-				record_cnt++;
-			}
+			record_cnt++;
 			data_cnt += contwr ? copy_len : 0;
 
 			error = xlog_write_copy_finish(log, iclog, optype,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index c7e79800b6f5..2983adaed675 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -652,14 +652,22 @@ xlog_cil_process_committed(
 }
 
 struct xlog_cil_trans_hdr {
+	struct xlog_op_header	oph[2];
 	struct xfs_trans_header	thdr;
-	struct xfs_log_iovec	lhdr;
+	struct xfs_log_iovec	lhdr[2];
 };
 
 /*
  * Build a checkpoint transaction header to begin the journal transaction.  We
  * need to account for the space used by the transaction header here as it is
  * not accounted for in xlog_write().
+ *
+ * This is the only place we write a transaction header, so we also build the
+ * log opheaders that indicate the start of a log transaction and wrap the
+ * transaction header. We keep the start record in its own log vector rather
+ * than compacting them into a single region as this ends up making the logic
+ * in xlog_write() for handling empty opheaders for start, commit and unmount
+ * records much simpler.
  */
 static void
 xlog_cil_build_trans_hdr(
@@ -669,20 +677,40 @@ xlog_cil_build_trans_hdr(
 	int			num_iovecs)
 {
 	struct xlog_ticket	*tic = ctx->ticket;
+	uint32_t		tid = cpu_to_be32(tic->t_tid);
 
 	memset(hdr, 0, sizeof(*hdr));
 
+	/* Log start record */
+	hdr->oph[0].oh_tid = tid;
+	hdr->oph[0].oh_clientid = XFS_TRANSACTION;
+	hdr->oph[0].oh_flags = XLOG_START_TRANS;
+
+	/* log iovec region pointer */
+	hdr->lhdr[0].i_addr = &hdr->oph[0];
+	hdr->lhdr[0].i_len = sizeof(struct xlog_op_header);
+	hdr->lhdr[0].i_type = XLOG_REG_TYPE_LRHEADER;
+
+	/* log opheader */
+	hdr->oph[1].oh_tid = tid;
+	hdr->oph[1].oh_clientid = XFS_TRANSACTION;
+
+	/* transaction header */
 	hdr->thdr.th_magic = XFS_TRANS_HEADER_MAGIC;
 	hdr->thdr.th_type = XFS_TRANS_CHECKPOINT;
-	hdr->thdr.th_tid = tic->t_tid;
+	hdr->thdr.th_tid = tid;
 	hdr->thdr.th_num_items = num_iovecs;
-	hdr->lhdr.i_addr = &hdr->thdr;
-	hdr->lhdr.i_len = sizeof(xfs_trans_header_t);
-	hdr->lhdr.i_type = XLOG_REG_TYPE_TRANSHDR;
-	tic->t_curr_res -= hdr->lhdr.i_len + sizeof(struct xlog_op_header);
 
-	lvhdr->lv_niovecs = 1;
-	lvhdr->lv_iovecp = &hdr->lhdr;
+	/* log iovec region pointer */
+	hdr->lhdr[1].i_addr = &hdr->oph[1];
+	hdr->lhdr[1].i_len = sizeof(struct xlog_op_header) +
+				sizeof(struct xfs_trans_header);
+	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
+
+	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
+
+	lvhdr->lv_niovecs = 2;
+	lvhdr->lv_iovecp = &hdr->lhdr[0];
 	lvhdr->lv_next = ctx->lv_chain;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (12 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 13/39] xfs: only CIL pushes require a start record Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:35   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 15/39] xfs: embed the xlog_op_header in the commit record Dave Chinner
                   ` (24 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Remove another case where xlog_write() has to prepend an opheader to
a log transaction. The unmount record + ophdr is smaller than the
minimum amount of space guaranteed to be free in an iclog (2 *
sizeof(ophdr)) and so we don't have to care about an unmount record
being split across 2 iclogs.
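
To sanity check that claim (structure sizes assumed from the on-disk
log format: a 12 byte opheader and an 8 byte unmount record), a
hypothetical compile-time assertion of the invariant would be:

	/* 12 + 8 = 20 bytes always fits in the 2 * 12 = 24 bytes free */
	BUILD_BUG_ON(sizeof(struct xlog_op_header) +
		     sizeof(struct xfs_unmount_log_format) >
		     2 * sizeof(struct xlog_op_header));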

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c | 39 ++++++++++++++++++++++++++++-----------
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 981cd6f8f0ff..e7a135ffa66f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -800,12 +800,22 @@ xlog_write_unmount_record(
 	struct xlog		*log,
 	struct xlog_ticket	*ticket)
 {
-	struct xfs_unmount_log_format ulf = {
-		.magic = XLOG_UNMOUNT_TYPE,
+	struct  {
+		struct xlog_op_header ophdr;
+		struct xfs_unmount_log_format ulf;
+	} unmount_rec = {
+		.ophdr = {
+			.oh_clientid = XFS_LOG,
+			.oh_tid = cpu_to_be32(ticket->t_tid),
+			.oh_flags = XLOG_UNMOUNT_TRANS,
+		},
+		.ulf = {
+			.magic = XLOG_UNMOUNT_TYPE,
+		},
 	};
 	struct xfs_log_iovec reg = {
-		.i_addr = &ulf,
-		.i_len = sizeof(ulf),
+		.i_addr = &unmount_rec,
+		.i_len = sizeof(unmount_rec),
 		.i_type = XLOG_REG_TYPE_UNMOUNT,
 	};
 	struct xfs_log_vec vec = {
@@ -813,8 +823,12 @@ xlog_write_unmount_record(
 		.lv_iovecp = &reg,
 	};
 
+	BUILD_BUG_ON((sizeof(struct xlog_op_header) +
+		      sizeof(struct xfs_unmount_log_format)) !=
+							sizeof(unmount_rec));
+
 	/* account for space used by record data */
-	ticket->t_curr_res -= sizeof(ulf);
+	ticket->t_curr_res -= sizeof(unmount_rec);
 
 	/*
 	 * For external log devices, we need to flush the data device cache
@@ -2145,6 +2159,8 @@ xlog_write_calc_vec_length(
 
 	/* Don't account for regions with embedded ophdrs */
 	if (optype && headers > 0) {
+		if (optype & XLOG_UNMOUNT_TRANS)
+			headers--;
 		if (optype & XLOG_START_TRANS) {
 			ASSERT(headers >= 2);
 			headers -= 2;
@@ -2359,12 +2375,11 @@ xlog_write(
 
 	/*
 	 * If this is a commit or unmount transaction, we don't need a start
-	 * record to be written.  We do, however, have to account for the
-	 * commit or unmount header that gets written. Hence we always have
-	 * to account for an extra xlog_op_header here for commit and unmount
-	 * records.
+	 * record to be written.  We do, however, have to account for the commit
+	 * header that gets written. Hence we always have to account for an
+	 * extra xlog_op_header here for commit records.
 	 */
-	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
+	if (optype & XLOG_COMMIT_TRANS)
 		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
@@ -2424,6 +2439,8 @@ xlog_write(
 				ophdr = reg->i_addr;
 				if (index)
 					optype &= ~XLOG_START_TRANS;
+			} else if (optype & XLOG_UNMOUNT_TRANS) {
+				ophdr = reg->i_addr;
 			} else {
 				ophdr = xlog_write_setup_ophdr(log, ptr,
 							ticket, optype);
@@ -2454,7 +2471,7 @@ xlog_write(
 			/*
 			 * Copy region.
 			 *
-			 * Commit and unmount records just log an opheader, so
+			 * Commit records just log an opheader, so
 			 * we can have empty payloads with no data region to
 			 * copy.  Hence we only copy the payload if the vector
 			 * says it has data to copy.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 15/39] xfs: embed the xlog_op_header in the commit record
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (13 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 16/39] xfs: log tickets don't need log client id Dave Chinner
                   ` (23 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

Remove the final case where xlog_write() has to prepend an opheader
to a log transaction. Similar to the start record, the commit record
is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can
just make this the payload for the region being passed to
xlog_write() and remove the special handling in xlog_write() for
the commit record.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e7a135ffa66f..76a73f4b0f30 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1540,9 +1540,14 @@ xlog_commit_record(
 	struct xlog_in_core	**iclog,
 	xfs_lsn_t		*lsn)
 {
+	struct xlog_op_header	ophdr = {
+		.oh_clientid = XFS_TRANSACTION,
+		.oh_tid = cpu_to_be32(ticket->t_tid),
+		.oh_flags = XLOG_COMMIT_TRANS,
+	};
 	struct xfs_log_iovec reg = {
-		.i_addr = NULL,
-		.i_len = 0,
+		.i_addr = &ophdr,
+		.i_len = sizeof(struct xlog_op_header),
 		.i_type = XLOG_REG_TYPE_COMMIT,
 	};
 	struct xfs_log_vec vec = {
@@ -1554,6 +1559,8 @@ xlog_commit_record(
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
+	/* account for space used by record data */
+	ticket->t_curr_res -= reg.i_len;
 	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -2159,11 +2166,10 @@ xlog_write_calc_vec_length(
 
 	/* Don't account for regions with embedded ophdrs */
 	if (optype && headers > 0) {
-		if (optype & XLOG_UNMOUNT_TRANS)
-			headers--;
+		headers--;
 		if (optype & XLOG_START_TRANS) {
-			ASSERT(headers >= 2);
-			headers -= 2;
+			ASSERT(headers >= 1);
+			headers--;
 		}
 	}
 
@@ -2373,14 +2379,6 @@ xlog_write(
 	int			data_cnt = 0;
 	int			error = 0;
 
-	/*
-	 * If this is a commit or unmount transaction, we don't need a start
-	 * record to be written.  We do, however, have to account for the commit
-	 * header that gets written. Hence we always have to account for an
-	 * extra xlog_op_header here for commit records.
-	 */
-	if (optype & XLOG_COMMIT_TRANS)
-		ticket->t_curr_res -= sizeof(struct xlog_op_header);
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
 		     "ctx ticket reservation ran out. Need to up reservation");
@@ -2433,14 +2431,13 @@ xlog_write(
 			/*
 			 * The XLOG_START_TRANS has embedded ophdrs for the
 			 * start record and transaction header. They will always
-			 * be the first two regions in the lv chain.
+			 * be the first two regions in the lv chain. Commit and
+			 * unmount records also have embedded ophdrs.
 			 */
-			if (optype & XLOG_START_TRANS) {
+			if (optype) {
 				ophdr = reg->i_addr;
 				if (index)
 					optype &= ~XLOG_START_TRANS;
-			} else if (optype & XLOG_UNMOUNT_TRANS) {
-				ophdr = reg->i_addr;
 			} else {
 				ophdr = xlog_write_setup_ophdr(log, ptr,
 							ticket, optype);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 16/39] xfs: log tickets don't need log client id
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (14 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 15/39] xfs: embed the xlog_op_header in the commit record Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-21  0:38   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 17/39] xfs: move log iovec alignment to preparation function Dave Chinner
                   ` (22 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently set the log ticket client ID when we reserve a
transaction. This client ID is only ever written to the log by
CIL checkpoints or unmount records, and so anything using a high
level transaction allocated through xfs_trans_alloc() does not need
a log ticket client ID to be set.

For the CIL checkpoint, the client ID written to the journal is
always XFS_TRANSACTION, and for the unmount record it is always
XFS_LOG, and nothing else writes to the log. All of these operations
tell xlog_write() exactly what they need to write to the log (the
optype) and build their own opheaders for start, commit and unmount
records. Hence we no longer need to set the client id in either the
log ticket or the xfs_trans.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h |  1 -
 fs/xfs/xfs_log.c               | 47 ++++++----------------------------
 fs/xfs/xfs_log.h               | 16 +++++-------
 fs/xfs/xfs_log_cil.c           |  2 +-
 fs/xfs/xfs_log_priv.h          | 10 ++------
 fs/xfs/xfs_trans.c             |  6 ++---
 6 files changed, 19 insertions(+), 63 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index d548ea4b6aab..78d5368a7caa 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -69,7 +69,6 @@ static inline uint xlog_get_cycle(char *ptr)
 
 /* Log Clients */
 #define XFS_TRANSACTION		0x69
-#define XFS_VOLUME		0x2
 #define XFS_LOG			0xaa
 
 #define XLOG_UNMOUNT_TYPE	0x556e	/* Un for Unmount */
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 76a73f4b0f30..ccf584914b6a 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -433,10 +433,9 @@ xfs_log_regrant(
 int
 xfs_log_reserve(
 	struct xfs_mount	*mp,
-	int		 	unit_bytes,
-	int		 	cnt,
+	int			unit_bytes,
+	int			cnt,
 	struct xlog_ticket	**ticp,
-	uint8_t		 	client,
 	bool			permanent)
 {
 	struct xlog		*log = mp->m_log;
@@ -444,15 +443,13 @@ xfs_log_reserve(
 	int			need_bytes;
 	int			error = 0;
 
-	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
-
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
 	XFS_STATS_INC(mp, xs_try_logspace);
 
 	ASSERT(*ticp == NULL);
-	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent);
+	tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
 	*ticp = tic;
 
 	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
@@ -853,7 +850,7 @@ xlog_unmount_write(
 	struct xlog_ticket	*tic = NULL;
 	int			error;
 
-	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
+	error = xfs_log_reserve(mp, 600, 1, &tic, 0);
 	if (error)
 		goto out_err;
 
@@ -2181,35 +2178,13 @@ xlog_write_calc_vec_length(
 
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
-	struct xlog		*log,
 	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket,
-	uint			flags)
+	struct xlog_ticket	*ticket)
 {
 	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-	ophdr->oh_clientid = ticket->t_clientid;
+	ophdr->oh_clientid = XFS_TRANSACTION;
 	ophdr->oh_res2 = 0;
-
-	/* are we copying a commit or unmount record? */
-	ophdr->oh_flags = flags;
-
-	/*
-	 * We've seen logs corrupted with bad transaction client ids.  This
-	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
-	 * and shut down the filesystem.
-	 */
-	switch (ophdr->oh_clientid)  {
-	case XFS_TRANSACTION:
-	case XFS_VOLUME:
-	case XFS_LOG:
-		break;
-	default:
-		xfs_warn(log->l_mp,
-			"Bad XFS transaction clientid 0x%x in ticket "PTR_FMT,
-			ophdr->oh_clientid, ticket);
-		return NULL;
-	}
-
+	ophdr->oh_flags = 0;
 	return ophdr;
 }
 
@@ -2439,11 +2414,7 @@ xlog_write(
 				if (index)
 					optype &= ~XLOG_START_TRANS;
 			} else {
-				ophdr = xlog_write_setup_ophdr(log, ptr,
-							ticket, optype);
-				if (!ophdr)
-					return -EIO;
-
+				ophdr = xlog_write_setup_ophdr(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
 				added_ophdr = true;
@@ -3499,7 +3470,6 @@ xlog_ticket_alloc(
 	struct xlog		*log,
 	int			unit_bytes,
 	int			cnt,
-	char			client,
 	bool			permanent)
 {
 	struct xlog_ticket	*tic;
@@ -3517,7 +3487,6 @@ xlog_ticket_alloc(
 	tic->t_cnt		= cnt;
 	tic->t_ocnt		= cnt;
 	tic->t_tid		= prandom_u32();
-	tic->t_clientid		= client;
 	if (permanent)
 		tic->t_flags |= XLOG_TIC_PERM_RESERV;
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1bd080ce3a95..c0c3141944ea 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -117,16 +117,12 @@ int	  xfs_log_mount_finish(struct xfs_mount *mp);
 void	xfs_log_mount_cancel(struct xfs_mount *);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
-void	  xfs_log_space_wake(struct xfs_mount *mp);
-int	  xfs_log_reserve(struct xfs_mount *mp,
-			  int		   length,
-			  int		   count,
-			  struct xlog_ticket **ticket,
-			  uint8_t		   clientid,
-			  bool		   permanent);
-int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
-void      xfs_log_unmount(struct xfs_mount *mp);
-int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
+void	xfs_log_space_wake(struct xfs_mount *mp);
+int	xfs_log_reserve(struct xfs_mount *mp, int length, int count,
+			struct xlog_ticket **ticket, bool permanent);
+int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
+void	xfs_log_unmount(struct xfs_mount *mp);
+int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
 bool	xfs_log_writable(struct xfs_mount *mp);
 
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 2983adaed675..9d3a495f1c78 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
 {
 	struct xlog_ticket *tic;
 
-	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0);
+	tic = xlog_ticket_alloc(log, 0, 1, 0);
 
 	/*
 	 * set the current reservation to zero so we know to steal the basic
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 87447fa34c43..e4e3e71b2b1b 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -158,7 +158,6 @@ typedef struct xlog_ticket {
 	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
 	char		   t_ocnt;	 /* original count		 : 1  */
 	char		   t_cnt;	 /* current count		 : 1  */
-	char		   t_clientid;	 /* who does this belong to;	 : 1  */
 	char		   t_flags;	 /* properties of reservation	 : 1  */
 
         /* reservation array fields */
@@ -465,13 +464,8 @@ extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
 			    char *dp, int size);
 
 extern kmem_zone_t *xfs_log_ticket_zone;
-struct xlog_ticket *
-xlog_ticket_alloc(
-	struct xlog	*log,
-	int		unit_bytes,
-	int		count,
-	char		client,
-	bool		permanent);
+struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes,
+		int count, bool permanent);
 
 static inline void
 xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index c214a69b573d..bc72826d1f97 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -194,11 +194,9 @@ xfs_trans_reserve(
 			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
 			error = xfs_log_regrant(mp, tp->t_ticket);
 		} else {
-			error = xfs_log_reserve(mp,
-						resp->tr_logres,
+			error = xfs_log_reserve(mp, resp->tr_logres,
 						resp->tr_logcount,
-						&tp->t_ticket, XFS_TRANSACTION,
-						permanent);
+						&tp->t_ticket, permanent);
 		}
 
 		if (error)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 17/39] xfs: move log iovec alignment to preparation function
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (15 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 16/39] xfs: log tickets don't need log client id Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 18/39] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
                   ` (21 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To include log op headers directly into the log iovec regions that
the ophdrs wrap, we need to move the buffer alignment code from
xlog_finish_iovec() to xlog_prepare_iovec(). This is because the
xlog_op_header is only 12 bytes long, and we need the buffer that
the caller formats their data into to be 8 byte aligned.

Hence once we start prepending the ophdr in xlog_prepare_iovec(), we
are going to need to manage the padding directly to ensure that the
buffer pointer returned is correctly aligned.
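
Worked through with the sizes above (12 byte ophdr, 8 byte alignment),
the padding calculation the next patch introduces behaves like this
sketch: an empty buffer places the ophdr at offset 4 so the payload
starts 8 byte aligned at offset 16, packed against the ophdr tail.

	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
	if (!IS_ALIGNED(len, sizeof(uint64_t)))
		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
					sizeof(struct xlog_op_header);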

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log.h | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index c0c3141944ea..1ca4f2edbdaf 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,6 +21,16 @@ struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
+/*
+ * We need to make sure the buffer pointer returned is naturally aligned for the
+ * biggest basic data type we put into it. We have already accounted for this
+ * padding when sizing the buffer.
+ *
+ * However, this padding does not get written into the log, and hence we have to
+ * track the space used by the log vectors separately to prevent log space hangs
+ * due to inaccurate accounting (i.e. a leak) of the used log space through the
+ * CIL context ticket.
+ */
 static inline void *
 xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		uint type)
@@ -34,6 +44,9 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 		vec = &lv->lv_iovecp[0];
 	}
 
+	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
+		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
+
 	vec->i_type = type;
 	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
 
@@ -43,20 +56,10 @@ xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
 	return vec->i_addr;
 }
 
-/*
- * We need to make sure the next buffer is naturally aligned for the biggest
- * basic data type we put into it.  We already accounted for this padding when
- * sizing the buffer.
- *
- * However, this padding does not get written into the log, and hence we have to
- * track the space used by the log vectors separately to prevent log space hangs
- * due to inaccurate accounting (i.e. a leak) of the used log space through the
- * CIL context ticket.
- */
 static inline void
 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
 {
-	lv->lv_buf_len += round_up(len, sizeof(uint64_t));
+	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 18/39] xfs: reserve space and initialise xlog_op_header in item formatting
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (16 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 17/39] xfs: move log iovec alignment to preparation function Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 19/39] xfs: log ticket region debug is largely useless Dave Chinner
                   ` (20 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently xlog_write() adds op headers to the log manually for every
log item region that is in the vector passed to it. While
xlog_write() needs to stamp the transaction ID into the ophdr, we
already know its length, flags, clientid, etc. at CIL commit time.

This means the only time that xlog_write() really needs to format and
reserve space for a new ophdr is when a region is split across two
iclogs. Adding the opheader and accounting for it as part of the
normal formatted item region means we simplify the accounting
of space used by a transaction and we don't have to special case
reserving space for the ophdrs in xlog_write(). It also means
we can largely initialise the ophdr in transaction commit instead
of xlog_write(), making the xlog_write() formatting inner loop much
tighter.

xlog_prepare_iovec() is now too large to stay as an inline function,
so we move it out of line and into xfs_log.c.

Object sizes:
text	   data	    bss	    dec	    hex	filename
1125934	 305951	    484	1432369	 15db31 fs/xfs/built-in.a.before
1123360	 305951	    484	1429795	 15d123 fs/xfs/built-in.a.after

So the code is roughly 2.5kB smaller with xlog_prepare_iovec() now
out of line, even though it grew in size itself.
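
For context, a hypothetical item formatting caller of the pair now
looks something like this sketch (item_buf and item_len are
illustrative, not from this patch):

	struct xfs_log_iovec	*vecp = NULL;
	void			*buf;

	/* returns an 8 byte aligned payload buffer after the ophdr */
	buf = xlog_prepare_iovec(lv, &vecp, XLOG_REG_TYPE_ICORE);
	memcpy(buf, item_buf, item_len);
	/* stamps oh_len with the payload length and accounts the region */
	xlog_finish_iovec(lv, vecp, item_len);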

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c     | 115 +++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_log.h     |  42 +++-------------
 fs/xfs/xfs_log_cil.c |  25 +++++-----
 3 files changed, 99 insertions(+), 83 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index ccf584914b6a..48890fb6122e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -89,6 +89,62 @@ xlog_iclogs_empty(
 static int
 xfs_log_cover(struct xfs_mount *);
 
+/*
+ * We need to make sure the buffer pointer returned is naturally aligned for the
+ * biggest basic data type we put into it. We have already accounted for this
+ * padding when sizing the buffer.
+ *
+ * However, this padding does not get written into the log, and hence we have to
+ * track the space used by the log vectors separately to prevent log space hangs
+ * due to inaccurate accounting (i.e. a leak) of the used log space through the
+ * CIL context ticket.
+ *
+ * We also add space for the xlog_op_header that describes this region in the
+ * log. This is prepended to the data region we return to the caller to copy
+ * their data into, so do all the static initialisation of the ophdr now. Because
+ * the ophdr is not 8 byte aligned, we have to be careful to ensure that we align
+ * the start of the buffer such that the region we return to the caller is 8 byte
+ * aligned and packed against the tail of the ophdr.
+ */
+void *
+xlog_prepare_iovec(
+	struct xfs_log_vec	*lv,
+	struct xfs_log_iovec	**vecp,
+	uint			type)
+{
+	struct xfs_log_iovec	*vec = *vecp;
+	struct xlog_op_header	*oph;
+	uint32_t		len;
+	void			*buf;
+
+	if (vec) {
+		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
+		vec++;
+	} else {
+		vec = &lv->lv_iovecp[0];
+	}
+
+	len = lv->lv_buf_len + sizeof(struct xlog_op_header);
+	if (!IS_ALIGNED(len, sizeof(uint64_t))) {
+		lv->lv_buf_len = round_up(len, sizeof(uint64_t)) -
+					sizeof(struct xlog_op_header);
+	}
+
+	vec->i_type = type;
+	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
+
+	oph = vec->i_addr;
+	oph->oh_clientid = XFS_TRANSACTION;
+	oph->oh_res2 = 0;
+	oph->oh_flags = 0;
+
+	buf = vec->i_addr + sizeof(struct xlog_op_header);
+	ASSERT(IS_ALIGNED((unsigned long)buf, sizeof(uint64_t)));
+
+	*vecp = vec;
+	return buf;
+}
+
 static void
 xlog_grant_sub_space(
 	struct xlog		*log,
@@ -2131,9 +2187,9 @@ xlog_print_trans(
 }
 
 /*
- * Calculate the potential space needed by the log vector. If this is a start
- * transaction, the caller has already accounted for both opheaders in the start
- * transaction, so we don't need to account for them here.
+ * Calculate the potential space needed by the log vector. All regions contain
+ * their own opheaders and they are accounted for in region space so we don't
+ * need to add them to the vector length here.
  */
 static int
 xlog_write_calc_vec_length(
@@ -2160,18 +2216,7 @@ xlog_write_calc_vec_length(
 			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
 		}
 	}
-
-	/* Don't account for regions with embedded ophdrs */
-	if (optype && headers > 0) {
-		headers--;
-		if (optype & XLOG_START_TRANS) {
-			ASSERT(headers >= 1);
-			headers--;
-		}
-	}
-
 	ticket->t_res_num_ophdrs += headers;
-	len += headers * sizeof(struct xlog_op_header);
 
 	return len;
 }
@@ -2181,7 +2226,6 @@ xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
 	struct xlog_ticket	*ticket)
 {
-	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
 	ophdr->oh_clientid = XFS_TRANSACTION;
 	ophdr->oh_res2 = 0;
 	ophdr->oh_flags = 0;
@@ -2404,21 +2448,25 @@ xlog_write(
 			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
 			/*
-			 * The XLOG_START_TRANS has embedded ophdrs for the
-			 * start record and transaction header. They will always
-			 * be the first two regions in the lv chain. Commit and
-			 * unmount records also have embedded ophdrs.
+			 * Regions always have their ophdr at the start of the
+			 * region, except for:
+			 * - a transaction start which has a start record ophdr
+			 *   before the first region ophdr; and
+			 * - the previous region didn't fully fit into an iclog
+			 *   so needs a continuation ophdr to prepend the region
+			 *   in this new iclog.
 			 */
-			if (optype) {
-				ophdr = reg->i_addr;
-				if (index)
-					optype &= ~XLOG_START_TRANS;
-			} else {
+			ophdr = reg->i_addr;
+			if (optype && index) {
+				optype &= ~XLOG_START_TRANS;
+			} else if (partial_copy) {
 				ophdr = xlog_write_setup_ophdr(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 					   sizeof(struct xlog_op_header));
 				added_ophdr = true;
 			}
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+
 			len += xlog_write_setup_copy(ticket, ophdr,
 						     iclog->ic_size-log_offset,
 						     reg->i_len,
@@ -2436,20 +2484,11 @@ xlog_write(
 				ophdr->oh_len = cpu_to_be32(copy_len -
 						sizeof(struct xlog_op_header));
 			}
-			/*
-			 * Copy region.
-			 *
-			 * Commit records just log an opheader, so
-			 * we can have empty payloads with no data region to
-			 * copy.  Hence we only copy the payload if the vector
-			 * says it has data to copy.
-			 */
-			ASSERT(copy_len >= 0);
-			if (copy_len > 0) {
-				memcpy(ptr, reg->i_addr + copy_off, copy_len);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-						   copy_len);
-			}
+
+			ASSERT(copy_len > 0);
+			memcpy(ptr, reg->i_addr + copy_off, copy_len);
+			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
+
 			if (added_ophdr)
 				copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 1ca4f2edbdaf..af54ea3f8c90 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -21,44 +21,18 @@ struct xfs_log_vec {
 
 #define XFS_LOG_VEC_ORDERED	(-1)
 
-/*
- * We need to make sure the buffer pointer returned is naturally aligned for the
- * biggest basic data type we put into it. We have already accounted for this
- * padding when sizing the buffer.
- *
- * However, this padding does not get written into the log, and hence we have to
- * track the space used by the log vectors separately to prevent log space hangs
- * due to inaccurate accounting (i.e. a leak) of the used log space through the
- * CIL context ticket.
- */
-static inline void *
-xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
-		uint type)
-{
-	struct xfs_log_iovec *vec = *vecp;
-
-	if (vec) {
-		ASSERT(vec - lv->lv_iovecp < lv->lv_niovecs);
-		vec++;
-	} else {
-		vec = &lv->lv_iovecp[0];
-	}
-
-	if (!IS_ALIGNED(lv->lv_buf_len, sizeof(uint64_t)))
-		lv->lv_buf_len = round_up(lv->lv_buf_len, sizeof(uint64_t));
-
-	vec->i_type = type;
-	vec->i_addr = lv->lv_buf + lv->lv_buf_len;
-
-	ASSERT(IS_ALIGNED((unsigned long)vec->i_addr, sizeof(uint64_t)));
-
-	*vecp = vec;
-	return vec->i_addr;
-}
+void *xlog_prepare_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec **vecp,
+		uint type);
 
 static inline void
 xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec, int len)
 {
+	struct xlog_op_header	*oph = vec->i_addr;
+
+	/* opheader tracks payload length, logvec tracks region length */
+	oph->oh_len = cpu_to_be32(len);
+
+	len += sizeof(struct xlog_op_header);
 	lv->lv_buf_len += len;
 	lv->lv_bytes += len;
 	vec->i_len = len;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 9d3a495f1c78..58900171de09 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -181,13 +181,20 @@ xlog_cil_alloc_shadow_bufs(
 		}
 
 		/*
-		 * We 64-bit align the length of each iovec so that the start
-		 * of the next one is naturally aligned.  We'll need to
-		 * account for that slack space here. Then round nbytes up
-		 * to 64-bit alignment so that the initial buffer alignment is
-		 * easy to calculate and verify.
+		 * We 64-bit align the length of each iovec so that the start of
+		 * the next one is naturally aligned.  We'll need to account for
+		 * that slack space here.
+		 *
+		 * We also add the xlog_op_header to each region when
+		 * formatting, but that's not accounted to the size of the item
+		 * at this point. Hence we'll need an additional number of bytes
+		 * for each vector to hold an opheader.
+		 *
+		 * Then round nbytes up to 64-bit alignment so that the initial
+		 * buffer alignment is easy to calculate and verify.
 		 */
-		nbytes += niovecs * sizeof(uint64_t);
+		nbytes += niovecs *
+			(sizeof(uint64_t) + sizeof(struct xlog_op_header));
 		nbytes = round_up(nbytes, sizeof(uint64_t));
 
 		/*
@@ -433,11 +440,6 @@ xlog_cil_insert_items(
 
 	spin_lock(&cil->xc_cil_lock);
 
-	/* account for space used by new iovec headers  */
-	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
-	len += iovhdr_res;
-	ctx->nvecs += diff_iovecs;
-
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
@@ -469,6 +471,7 @@ xlog_cil_insert_items(
 	}
 	tp->t_ticket->t_curr_res -= len;
 	ctx->space_used += len;
+	ctx->nvecs += diff_iovecs;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 19/39] xfs: log ticket region debug is largely useless
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (17 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 18/39] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-19 12:12 ` [PATCH 20/39] xfs: pass lv chain length into xlog_write() Dave Chinner
                   ` (19 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

xlog_tic_add_region() is used to trace the regions being added to a
log ticket to provide information in the situation where a ticket
reservation overrun occurs. The information gathered is stored in
the ticket, and dumped if xlog_print_tic_res() is called.

For a front end struct xfs_trans overrun, the ticket only contains
reservation tracking information - the ticket is never handed to the
log so has no regions attached to it. The overrun debug information in this
case comes from xlog_print_trans(), which walks the items attached
to the transaction and dumps their attached formatted log vectors
directly. It also dumps the ticket state, but that only contains
reservation accounting and nothing else. Hence xlog_print_tic_res()
never dumps region or overrun information from this path.

xlog_tic_add_region() is actually called from xlog_write(), which
means it is being used to track the regions seen in a
CIL checkpoint log vector chain. In looking at CIL behaviour
recently, I've seen 32MB checkpoints regularly exceed 250,000
regions in the LV chain. The log ticket debug code can track *15*
regions. IOWs, if there is a ticket overrun in the CIL code, the
ticket region tracking code is going to be completely useless for
determining what went wrong. The only thing it can tell us is how
much of an overrun occurred, and we really don't need extra debug
information in the log ticket to tell us that.

Indeed, the main place we call xlog_tic_add_region() is also adding
up the number of regions and the space used so that xlog_write()
knows how much will be written to the log. This is exactly the same
information that the log ticket is storing once we take away the useless
region tracking array. Hence xlog_tic_add_region() is not useful,
but can be called 250,000 times per CIL push...

Just strip all that debug "information" out of the log ticket
and only have it report reservation space information when an
overrun occurs. This also reduces the size of a log ticket down by
about 150 bytes...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 107 +++---------------------------------------
 fs/xfs/xfs_log_priv.h |  20 --------
 2 files changed, 6 insertions(+), 121 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 48890fb6122e..e849f15e9e04 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -377,30 +377,6 @@ xlog_grant_head_check(
 	return error;
 }
 
-static void
-xlog_tic_reset_res(xlog_ticket_t *tic)
-{
-	tic->t_res_num = 0;
-	tic->t_res_arr_sum = 0;
-	tic->t_res_num_ophdrs = 0;
-}
-
-static void
-xlog_tic_add_region(xlog_ticket_t *tic, uint len, uint type)
-{
-	if (tic->t_res_num == XLOG_TIC_LEN_MAX) {
-		/* add to overflow and start again */
-		tic->t_res_o_flow += tic->t_res_arr_sum;
-		tic->t_res_num = 0;
-		tic->t_res_arr_sum = 0;
-	}
-
-	tic->t_res_arr[tic->t_res_num].r_len = len;
-	tic->t_res_arr[tic->t_res_num].r_type = type;
-	tic->t_res_arr_sum += len;
-	tic->t_res_num++;
-}
-
 bool
 xfs_log_writable(
 	struct xfs_mount	*mp)
@@ -450,8 +426,6 @@ xfs_log_regrant(
 	xlog_grant_push_ail(log, tic->t_unit_res);
 
 	tic->t_curr_res = tic->t_unit_res;
-	xlog_tic_reset_res(tic);
-
 	if (tic->t_cnt > 0)
 		return 0;
 
@@ -2077,63 +2051,11 @@ xlog_print_tic_res(
 	struct xfs_mount	*mp,
 	struct xlog_ticket	*ticket)
 {
-	uint i;
-	uint ophdr_spc = ticket->t_res_num_ophdrs * (uint)sizeof(xlog_op_header_t);
-
-	/* match with XLOG_REG_TYPE_* in xfs_log.h */
-#define REG_TYPE_STR(type, str)	[XLOG_REG_TYPE_##type] = str
-	static char *res_type_str[] = {
-	    REG_TYPE_STR(BFORMAT, "bformat"),
-	    REG_TYPE_STR(BCHUNK, "bchunk"),
-	    REG_TYPE_STR(EFI_FORMAT, "efi_format"),
-	    REG_TYPE_STR(EFD_FORMAT, "efd_format"),
-	    REG_TYPE_STR(IFORMAT, "iformat"),
-	    REG_TYPE_STR(ICORE, "icore"),
-	    REG_TYPE_STR(IEXT, "iext"),
-	    REG_TYPE_STR(IBROOT, "ibroot"),
-	    REG_TYPE_STR(ILOCAL, "ilocal"),
-	    REG_TYPE_STR(IATTR_EXT, "iattr_ext"),
-	    REG_TYPE_STR(IATTR_BROOT, "iattr_broot"),
-	    REG_TYPE_STR(IATTR_LOCAL, "iattr_local"),
-	    REG_TYPE_STR(QFORMAT, "qformat"),
-	    REG_TYPE_STR(DQUOT, "dquot"),
-	    REG_TYPE_STR(QUOTAOFF, "quotaoff"),
-	    REG_TYPE_STR(LRHEADER, "LR header"),
-	    REG_TYPE_STR(UNMOUNT, "unmount"),
-	    REG_TYPE_STR(COMMIT, "commit"),
-	    REG_TYPE_STR(TRANSHDR, "trans header"),
-	    REG_TYPE_STR(ICREATE, "inode create"),
-	    REG_TYPE_STR(RUI_FORMAT, "rui_format"),
-	    REG_TYPE_STR(RUD_FORMAT, "rud_format"),
-	    REG_TYPE_STR(CUI_FORMAT, "cui_format"),
-	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
-	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
-	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
-	};
-	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
-#undef REG_TYPE_STR
-
 	xfs_warn(mp, "ticket reservation summary:");
-	xfs_warn(mp, "  unit res    = %d bytes",
-		 ticket->t_unit_res);
-	xfs_warn(mp, "  current res = %d bytes",
-		 ticket->t_curr_res);
-	xfs_warn(mp, "  total reg   = %u bytes (o/flow = %u bytes)",
-		 ticket->t_res_arr_sum, ticket->t_res_o_flow);
-	xfs_warn(mp, "  ophdrs      = %u (ophdr space = %u bytes)",
-		 ticket->t_res_num_ophdrs, ophdr_spc);
-	xfs_warn(mp, "  ophdr + reg = %u bytes",
-		 ticket->t_res_arr_sum + ticket->t_res_o_flow + ophdr_spc);
-	xfs_warn(mp, "  num regions = %u",
-		 ticket->t_res_num);
-
-	for (i = 0; i < ticket->t_res_num; i++) {
-		uint r_type = ticket->t_res_arr[i].r_type;
-		xfs_warn(mp, "region[%u]: %s - %u bytes", i,
-			    ((r_type <= 0 || r_type > XLOG_REG_TYPE_MAX) ?
-			    "bad-rtype" : res_type_str[r_type]),
-			    ticket->t_res_arr[i].r_len);
-	}
+	xfs_warn(mp, "  unit res    = %d bytes", ticket->t_unit_res);
+	xfs_warn(mp, "  current res = %d bytes", ticket->t_curr_res);
+	xfs_warn(mp, "  original count  = %d", ticket->t_ocnt);
+	xfs_warn(mp, "  remaining count = %d", ticket->t_cnt);
 }
 
 /*
@@ -2198,7 +2120,6 @@ xlog_write_calc_vec_length(
 	uint			optype)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = 0;
 	int			len = 0;
 	int			i;
 
@@ -2207,17 +2128,9 @@ xlog_write_calc_vec_length(
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
 			continue;
 
-		headers += lv->lv_niovecs;
-
-		for (i = 0; i < lv->lv_niovecs; i++) {
-			struct xfs_log_iovec	*vecp = &lv->lv_iovecp[i];
-
-			len += vecp->i_len;
-			xlog_tic_add_region(ticket, vecp->i_len, vecp->i_type);
-		}
+		for (i = 0; i < lv->lv_niovecs; i++)
+			len += lv->lv_iovecp[i].i_len;
 	}
-	ticket->t_res_num_ophdrs += headers;
-
 	return len;
 }
 
@@ -2276,7 +2189,6 @@ xlog_write_setup_copy(
 
 	/* account for new log op header */
 	ticket->t_curr_res -= sizeof(struct xlog_op_header);
-	ticket->t_res_num_ophdrs++;
 
 	return sizeof(struct xlog_op_header);
 }
@@ -2973,9 +2885,6 @@ xlog_state_get_iclog_space(
 	 */
 	if (log_offset == 0) {
 		ticket->t_curr_res -= log->l_iclog_hsize;
-		xlog_tic_add_region(ticket,
-				    log->l_iclog_hsize,
-				    XLOG_REG_TYPE_LRHEADER);
 		head->h_cycle = cpu_to_be32(log->l_curr_cycle);
 		head->h_lsn = cpu_to_be64(
 			xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block));
@@ -3055,7 +2964,6 @@ xfs_log_ticket_regrant(
 	xlog_grant_sub_space(log, &log->l_write_head.grant,
 					ticket->t_curr_res);
 	ticket->t_curr_res = ticket->t_unit_res;
-	xlog_tic_reset_res(ticket);
 
 	trace_xfs_log_ticket_regrant_sub(log, ticket);
 
@@ -3066,7 +2974,6 @@ xfs_log_ticket_regrant(
 		trace_xfs_log_ticket_regrant_exit(log, ticket);
 
 		ticket->t_curr_res = ticket->t_unit_res;
-		xlog_tic_reset_res(ticket);
 	}
 
 	xfs_log_ticket_put(ticket);
@@ -3529,8 +3436,6 @@ xlog_ticket_alloc(
 	if (permanent)
 		tic->t_flags |= XLOG_TIC_PERM_RESERV;
 
-	xlog_tic_reset_res(tic);
-
 	return tic;
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index e4e3e71b2b1b..301c36165974 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -136,19 +136,6 @@ enum xlog_iclog_state {
 #define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
 #define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
 
-/* Ticket reservation region accounting */ 
-#define XLOG_TIC_LEN_MAX	15
-
-/*
- * Reservation region
- * As would be stored in xfs_log_iovec but without the i_addr which
- * we don't care about.
- */
-typedef struct xlog_res {
-	uint	r_len;	/* region length		:4 */
-	uint	r_type;	/* region's transaction type	:4 */
-} xlog_res_t;
-
 typedef struct xlog_ticket {
 	struct list_head   t_queue;	 /* reserve/write queue */
 	struct task_struct *t_task;	 /* task that owns this ticket */
@@ -159,13 +146,6 @@ typedef struct xlog_ticket {
 	char		   t_ocnt;	 /* original count		 : 1  */
 	char		   t_cnt;	 /* current count		 : 1  */
 	char		   t_flags;	 /* properties of reservation	 : 1  */
-
-        /* reservation array fields */
-	uint		   t_res_num;                    /* num in array : 4 */
-	uint		   t_res_num_ophdrs;		 /* num op hdrs  : 4 */
-	uint		   t_res_arr_sum;		 /* array sum    : 4 */
-	uint		   t_res_o_flow;		 /* sum overflow : 4 */
-	xlog_res_t	   t_res_arr[XLOG_TIC_LEN_MAX];  /* array of res : 8 * 15 */ 
 } xlog_ticket_t;
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 20/39] xfs: pass lv chain length into xlog_write()
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (18 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 19/39] xfs: log ticket region debug is largely useless Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-27 17:20   ` Darrick J. Wong
  2021-05-19 12:12 ` [PATCH 21/39] xfs: introduce xlog_write_single() Dave Chinner
                   ` (18 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The caller of xlog_write() usually has a close accounting of the
aggregated vector length contained in the log vector chain passed to
xlog_write(). There is no need to iterate the chain to calculate the
length of the data in xlog_write_calc_vec_length() if the caller is
already iterating that chain to build it.

Passing in the vector length avoids doing an extra chain iteration,
which can be a significant amount of work given that large CIL
commits can have hundreds of thousands of vectors attached to the
chain.
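
A condensed sketch of the caller side accounting this enables, taken
from the CIL push changes below:

	/* while popping items off the CIL and building the lv chain */
	num_iovecs += lv->lv_niovecs;
	/* we don't write ordered log vectors */
	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
		num_bytes += lv->lv_bytes;

	/* then hand the precomputed length straight to xlog_write() */
	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
				XLOG_START_TRANS, num_bytes);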

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
 fs/xfs/xfs_log_cil.c  | 17 ++++++++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 3 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e849f15e9e04..58f9aafce29e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -864,7 +864,8 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
+				reg.i_len);
 }
 
 /*
@@ -1588,7 +1589,8 @@ xlog_commit_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
+				reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2108,32 +2110,6 @@ xlog_print_trans(
 	}
 }
 
-/*
- * Calculate the potential space needed by the log vector. All regions contain
- * their own opheaders and they are accounted for in region space so we don't
- * need to add them to the vector length here.
- */
-static int
-xlog_write_calc_vec_length(
-	struct xlog_ticket	*ticket,
-	struct xfs_log_vec	*log_vector,
-	uint			optype)
-{
-	struct xfs_log_vec	*lv;
-	int			len = 0;
-	int			i;
-
-	for (lv = log_vector; lv; lv = lv->lv_next) {
-		/* we don't write ordered log vectors */
-		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
-			continue;
-
-		for (i = 0; i < lv->lv_niovecs; i++)
-			len += lv->lv_iovecp[i].i_len;
-	}
-	return len;
-}
-
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
@@ -2296,13 +2272,13 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			optype)
+	uint			optype,
+	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
 	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
 	int			index = 0;
-	int			len;
 	int			partial_copy = 0;
 	int			partial_copy_len = 0;
 	int			contwr = 0;
@@ -2317,7 +2293,6 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
 	if (start_lsn)
 		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 58900171de09..7a6b80666f98 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
 				sizeof(struct xfs_trans_header);
 	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
 
-	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
-
 	lvhdr->lv_niovecs = 2;
 	lvhdr->lv_iovecp = &hdr->lhdr[0];
+	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
 	lvhdr->lv_next = ctx->lv_chain;
+
+	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
 /*
@@ -742,7 +743,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
-	int			num_iovecs;
+	int			num_iovecs = 0;
+	int			num_bytes = 0;
 	int			error = 0;
 	struct xlog_cil_trans_hdr thdr;
 	struct xfs_log_vec	lvhdr = { NULL };
@@ -835,7 +837,6 @@ xlog_cil_push_work(
 	 * by the flush lock.
 	 */
 	lv = NULL;
-	num_iovecs = 0;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
 
@@ -849,6 +850,10 @@ xlog_cil_push_work(
 		lv = item->li_lv;
 		item->li_lv = NULL;
 		num_iovecs += lv->lv_niovecs;
+
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
+			num_bytes += lv->lv_bytes;
 	}
 
 	/*
@@ -887,6 +892,8 @@ xlog_cil_push_work(
 	 * transaction header here as it is not accounted for in xlog_write().
 	 */
 	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
+	num_iovecs += lvhdr.lv_niovecs;
+	num_bytes += lvhdr.lv_bytes;
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -901,7 +908,7 @@ xlog_cil_push_work(
 	 * write head.
 	 */
 	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS);
+				XLOG_START_TRANS, num_bytes);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 301c36165974..eba905c273b0 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint optype);
+		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 21/39] xfs: introduce xlog_write_single()
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (19 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 20/39] xfs: pass lv chain length into xlog_write() Dave Chinner
@ 2021-05-19 12:12 ` Dave Chinner
  2021-05-27 17:27   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 22/39] xfs: introduce xlog_write_partial() Dave Chinner
                   ` (17 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:12 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Introduce an optimised version of xlog_write() that is used when the
entire write will fit in a single iclog. This greatly simplifies the
implementation of writing a log vector chain into an iclog, and lays
the groundwork for a much more understandable xlog_write()
implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 58f9aafce29e..3b74d21e3786 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2225,6 +2225,52 @@ xlog_write_copy_finish(
 	return error;
 }
 
+/*
+ * Write log vectors into a single iclog which is guaranteed by the caller
+ * to have enough space to write the entire log vector into. Return the number
+ * of log vectors written into the iclog.
+ */
+static int
+xlog_write_single(
+	struct xfs_log_vec	*log_vector,
+	struct xlog_ticket	*ticket,
+	struct xlog_in_core	*iclog,
+	uint32_t		log_offset,
+	uint32_t		len)
+{
+	struct xfs_log_vec	*lv;
+	void			*ptr;
+	int			index = 0;
+	int			record_cnt = 0;
+
+	ASSERT(log_offset + len <= iclog->ic_size);
+
+	ptr = iclog->ic_datap + log_offset;
+	for (lv = log_vector; lv; lv = lv->lv_next) {
+		/*
+		 * Ordered log vectors have no regions to write so this
+		 * loop will naturally skip them.
+		 */
+		for (index = 0; index < lv->lv_niovecs; index++) {
+			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
+			struct xlog_op_header	*ophdr = reg->i_addr;
+
+			ASSERT(reg->i_len % sizeof(int32_t) == 0);
+			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+			ophdr->oh_len = cpu_to_be32(reg->i_len -
+						sizeof(struct xlog_op_header));
+			memcpy(ptr, reg->i_addr, reg->i_len);
+			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
+			record_cnt++;
+		}
+	}
+	ASSERT(len == 0);
+	return record_cnt;
+}
+
+
 /*
  * Write some region out to in-core log
  *
@@ -2305,16 +2351,25 @@ xlog_write(
 			return error;
 
 		ASSERT(log_offset <= iclog->ic_size - 1);
-		ptr = iclog->ic_datap + log_offset;
 
 		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
+		/* If this is a single iclog write, go fast... */
+		if (!contwr && lv == log_vector) {
+			record_cnt = xlog_write_single(lv, ticket, iclog,
+						log_offset, len);
+			len = 0;
+			data_cnt = len;
+			break;
+		}
+
 		/*
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
 		 */
+		ptr = iclog->ic_datap + log_offset;
 		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 			struct xfs_log_iovec	*reg;
 			struct xlog_op_header	*ophdr;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 22/39] xfs: introduce xlog_write_partial()
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (20 preceding siblings ...)
  2021-05-19 12:12 ` [PATCH 21/39] xfs: introduce xlog_write_single() Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:06   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 23/39] xfs: xlog_write() no longer needs contwr state Dave Chinner
                   ` (16 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Handle writing of a logvec chain into an iclog that doesn't have
enough space to fit it all. The iclog has already been changed to
WANT_SYNC by xlog_state_get_iclog_space(), so the entire remaining
space in the iclog is exclusively owned by this logvec chain.

The difference between the single and partial cases is that the
partial case ends up with partial iovec writes in the iclog and has
to split a log vector region across two iclogs. The state handling
for this is currently awful, so we're building up the pieces needed
to handle it more cleanly one at a time.
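
The core of the split is bounding each copy by the space left in the
current iclog. A rough sketch of that arithmetic, assuming
illustrative names rather than the ones used in the patch
(new_buffer() is a hypothetical stand-in for releasing the full
iclog and getting space in the next one):

	#include <stdint.h>
	#include <string.h>

	extern char *new_buffer(void);

	/* Copy one region that may span several fixed-size buffers. */
	static void copy_split(const char *reg, uint32_t reg_len,
			       char *buf, uint32_t buf_size,
			       uint32_t buf_off)
	{
		uint32_t reg_off = 0;

		while (reg_off < reg_len) {
			uint32_t space = buf_size - buf_off;
			uint32_t rlen = reg_len - reg_off < space ?
					reg_len - reg_off : space;

			memcpy(buf + buf_off, reg + reg_off, rlen);
			reg_off += rlen;
			buf_off += rlen;
			if (reg_off < reg_len) {
				/* Region continues in the next buffer.
				 * The real code emits a continuation
				 * opheader here, and must reserve the
				 * space that opheader consumes. */
				buf = new_buffer();
				buf_off = 0;
			}
		}
	}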

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 504 ++++++++++++++++++++++-------------------------
 1 file changed, 240 insertions(+), 264 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 3b74d21e3786..98a3e2e4f1e0 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2110,166 +2110,250 @@ xlog_print_trans(
 	}
 }
 
-static xlog_op_header_t *
-xlog_write_setup_ophdr(
-	struct xlog_op_header	*ophdr,
-	struct xlog_ticket	*ticket)
-{
-	ophdr->oh_clientid = XFS_TRANSACTION;
-	ophdr->oh_res2 = 0;
-	ophdr->oh_flags = 0;
-	return ophdr;
-}
-
 /*
- * Set up the parameters of the region copy into the log. This has
- * to handle region write split across multiple log buffers - this
- * state is kept external to this function so that this code can
- * be written in an obvious, self documenting manner.
+ * Write whole log vectors into a single iclog which is guaranteed to have
+ * either sufficient space for the entire log vector chain to be written or
+ * exclusive access to the remaining space in the iclog.
+ *
+ * Return the number of iovecs and data written into the iclog, as well as
+ * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
+ * end of the chain).
  */
-static int
-xlog_write_setup_copy(
+static struct xfs_log_vec *
+xlog_write_single(
+	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_op_header	*ophdr,
-	int			space_available,
-	int			space_required,
-	int			*copy_off,
-	int			*copy_len,
-	int			*last_was_partial_copy,
-	int			*bytes_consumed)
-{
-	int			still_to_copy;
-
-	still_to_copy = space_required - *bytes_consumed;
-	*copy_off = *bytes_consumed;
-
-	if (still_to_copy <= space_available) {
-		/* write of region completes here */
-		*copy_len = still_to_copy;
-		ophdr->oh_len = cpu_to_be32(*copy_len);
-		if (*last_was_partial_copy)
-			ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS);
-		*last_was_partial_copy = 0;
-		*bytes_consumed = 0;
-		return 0;
-	}
-
-	/* partial write of region, needs extra log op header reservation */
-	*copy_len = space_available;
-	ophdr->oh_len = cpu_to_be32(*copy_len);
-	ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
-	if (*last_was_partial_copy)
-		ophdr->oh_flags |= XLOG_WAS_CONT_TRANS;
-	*bytes_consumed += *copy_len;
-	(*last_was_partial_copy)++;
-
-	/* account for new log op header */
-	ticket->t_curr_res -= sizeof(struct xlog_op_header);
-
-	return sizeof(struct xlog_op_header);
-}
-
-static int
-xlog_write_copy_finish(
-	struct xlog		*log,
 	struct xlog_in_core	*iclog,
-	uint			flags,
-	int			*record_cnt,
-	int			*data_cnt,
-	int			*partial_copy,
-	int			*partial_copy_len,
-	int			log_offset,
-	struct xlog_in_core	**commit_iclog)
+	uint32_t		*log_offset,
+	uint32_t		*len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt)
 {
-	int			error;
+	struct xfs_log_vec	*lv;
+	void			*ptr;
+	int			index;
+
+	ASSERT(*log_offset + *len <= iclog->ic_size ||
+		iclog->ic_state == XLOG_STATE_WANT_SYNC);
 
-	if (*partial_copy) {
+	ptr = iclog->ic_datap + *log_offset;
+	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/*
-		 * This iclog has already been marked WANT_SYNC by
-		 * xlog_state_get_iclog_space.
+		 * If the entire log vec does not fit in the iclog, punt it to
+		 * the partial copy loop which can handle this case.
 		 */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
-		goto release_iclog;
-	}
+		if (lv->lv_niovecs &&
+		    lv->lv_bytes > iclog->ic_size - *log_offset)
+			break;
 
-	*partial_copy = 0;
-	*partial_copy_len = 0;
+		/*
+		 * Ordered log vectors have no regions to write so this
+		 * loop will naturally skip them.
+		 */
+		for (index = 0; index < lv->lv_niovecs; index++) {
+			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
+			struct xlog_op_header	*ophdr = reg->i_addr;
 
-	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
-		/* no more space in this iclog - push it. */
-		spin_lock(&log->l_icloglock);
-		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
-		*record_cnt = 0;
-		*data_cnt = 0;
+			ASSERT(reg->i_len % sizeof(int32_t) == 0);
+			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
 
-		if (iclog->ic_state == XLOG_STATE_ACTIVE)
-			xlog_state_switch_iclogs(log, iclog, 0);
-		else
-			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-			       iclog->ic_state == XLOG_STATE_IOERROR);
-		if (!commit_iclog)
-			goto release_iclog;
-		spin_unlock(&log->l_icloglock);
-		ASSERT(flags & XLOG_COMMIT_TRANS);
-		*commit_iclog = iclog;
+			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+			ophdr->oh_len = cpu_to_be32(reg->i_len -
+						sizeof(struct xlog_op_header));
+			memcpy(ptr, reg->i_addr, reg->i_len);
+			xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len);
+			(*record_cnt)++;
+			*data_cnt += reg->i_len;
+		}
 	}
+	ASSERT(*len == 0 || lv);
+	return lv;
+}
 
-	return 0;
+static int
+xlog_write_get_more_iclog_space(
+	struct xlog		*log,
+	struct xlog_ticket	*ticket,
+	struct xlog_in_core	**iclogp,
+	uint32_t		*log_offset,
+	uint32_t		len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt,
+	int			*contwr)
+{
+	struct xlog_in_core	*iclog = *iclogp;
+	int			error;
 
-release_iclog:
+	spin_lock(&log->l_icloglock);
+	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
+	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
+	       iclog->ic_state == XLOG_STATE_IOERROR);
 	error = xlog_state_release_iclog(log, iclog);
 	spin_unlock(&log->l_icloglock);
-	return error;
+	if (error)
+		return error;
+
+	error = xlog_state_get_iclog_space(log, len, &iclog,
+				ticket, contwr, log_offset);
+	if (error)
+		return error;
+	*record_cnt = 0;
+	*data_cnt = 0;
+	*iclogp = iclog;
+	return 0;
 }
 
 /*
- * Write log vectors into a single iclog which is guaranteed by the caller
- * to have enough space to write the entire log vector into. Return the number
- * of log vectors written into the iclog.
+ * Write log vectors into a single iclog which is smaller than the current chain
+ * length. We write until we cannot fit a full record into the remaining space
+ * and then stop. We return the log vector that is to be written but could not
+ * wholly fit in the iclog.
  */
-static int
-xlog_write_single(
+static struct xfs_log_vec *
+xlog_write_partial(
+	struct xlog		*log,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
-	struct xlog_in_core	*iclog,
-	uint32_t		log_offset,
-	uint32_t		len)
+	struct xlog_in_core	**iclogp,
+	uint32_t		*log_offset,
+	uint32_t		*len,
+	uint32_t		*record_cnt,
+	uint32_t		*data_cnt,
+	int			*contwr)
 {
-	struct xfs_log_vec	*lv;
+	struct xlog_in_core	*iclog = *iclogp;
+	struct xfs_log_vec	*lv = log_vector;
+	struct xfs_log_iovec	*reg;
+	struct xlog_op_header	*ophdr;
 	void			*ptr;
 	int			index = 0;
-	int			record_cnt = 0;
+	uint32_t		rlen;
+	int			error;
 
-	ASSERT(log_offset + len <= iclog->ic_size);
+	/* walk the logvec, copying until we run out of space in the iclog */
+	ptr = iclog->ic_datap + *log_offset;
+	for (index = 0; index < lv->lv_niovecs; index++) {
+		uint32_t	reg_offset = 0;
+
+		reg = &lv->lv_iovecp[index];
+		ASSERT(reg->i_len % sizeof(int32_t) == 0);
 
-	ptr = iclog->ic_datap + log_offset;
-	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/*
-		 * Ordered log vectors have no regions to write so this
-		 * loop will naturally skip them.
+		 * The first region of a continuation must have a non-zero
+		 * length otherwise log recovery will just skip over it and
+		 * start recovering from the next opheader it finds. Because we
+		 * mark the next opheader as a continuation, recovery will then
+		 * incorrectly add the continuation to the previous region and
+		 * that breaks stuff.
+		 *
+		 * Hence if there isn't space for region data after the
+		 * opheader, then we need to start afresh with a new iclog.
 		 */
-		for (index = 0; index < lv->lv_niovecs; index++) {
-			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
-			struct xlog_op_header	*ophdr = reg->i_addr;
+		if (iclog->ic_size - *log_offset <=
+					sizeof(struct xlog_op_header)) {
+			error = xlog_write_get_more_iclog_space(log, ticket,
+					&iclog, log_offset, *len, record_cnt,
+					data_cnt, contwr);
+			if (error)
+				return ERR_PTR(error);
+			ptr = iclog->ic_datap + *log_offset;
+		}
 
-			ASSERT(reg->i_len % sizeof(int32_t) == 0);
-			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+		ophdr = reg->i_addr;
+		rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset);
+
+		ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
+		ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header));
+		if (rlen != reg->i_len)
+			ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
+
+		ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
+		xlog_verify_dest_ptr(log, ptr);
+		memcpy(ptr, reg->i_addr, rlen);
+		xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
+		(*record_cnt)++;
+		*data_cnt += rlen;
+
+		/* If we wrote the whole region, move to the next. */
+		if (rlen == reg->i_len)
+			continue;
+
+		/*
+		 * We now have a partially written iovec, but it can span
+		 * multiple iclogs so we loop here. First we release the iclog
+		 * we currently have, then we get a new iclog and add a new
+		 * opheader. Then we continue copying from where we were until
+		 * we either complete the iovec or fill the iclog. If we
+		 * complete the iovec, then we increment the index and go right
+		 * back to the top of the outer loop. If we fill the iclog, we
+		 * run the inner loop again.
+		 *
+		 * This is complicated by the tail of a region using all the
+		 * space in an iclog and hence requiring us to release the iclog
+		 * and get a new one before returning to the outer loop. We must
+		 * always guarantee that we exit this inner loop with at least
+		 * space for log transaction opheaders left in the current
+		 * iclog, hence we cannot just terminate the loop at the end
+		 * of the continuation. So we loop while there is no
+		 * space left in the current iclog, and check for the end of the
+		 * continuation after getting a new iclog.
+		 */
+		do {
+			/*
+			 * Account for the continuation opheader before we get
+			 * a new iclog. This is necessary so that we reserve
+			 * space in the iclog for it.
+			 */
+			*len += sizeof(struct xlog_op_header);
+			ticket->t_curr_res -= sizeof(struct xlog_op_header);
+
+			error = xlog_write_get_more_iclog_space(log, ticket,
+					&iclog, log_offset, *len, record_cnt,
+					data_cnt, contwr);
+			if (error)
+				return ERR_PTR(error);
+			ptr = iclog->ic_datap + *log_offset;
 
+			ophdr = ptr;
 			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-			ophdr->oh_len = cpu_to_be32(reg->i_len -
+			ophdr->oh_clientid = XFS_TRANSACTION;
+			ophdr->oh_res2 = 0;
+			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
+
+			xlog_write_adv_cnt(&ptr, len, log_offset,
 						sizeof(struct xlog_op_header));
-			memcpy(ptr, reg->i_addr, reg->i_len);
-			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
-			record_cnt++;
-		}
+			*data_cnt += sizeof(struct xlog_op_header);
+
+			/*
+			 * If rlen fits in the iclog, then end the region
+			 * continuation. Otherwise we're going around again.
+			 */
+			reg_offset += rlen;
+			rlen = reg->i_len - reg_offset;
+			if (rlen <= iclog->ic_size - *log_offset)
+				ophdr->oh_flags |= XLOG_END_TRANS;
+			else
+				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
+
+			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
+			ophdr->oh_len = cpu_to_be32(rlen);
+
+			xlog_verify_dest_ptr(log, ptr);
+			memcpy(ptr, reg->i_addr + reg_offset, rlen);
+			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
+			(*record_cnt)++;
+			*data_cnt += rlen;
+
+		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
 	}
-	ASSERT(len == 0);
-	return record_cnt;
-}
 
+	/*
+	 * No more iovecs remain in this logvec so return the next log vec to
+	 * the caller so it can go back to fast path copying.
+	 */
+	*iclogp = iclog;
+	return lv->lv_next;
+}
 
 /*
  * Write some region out to in-core log
@@ -2323,14 +2407,11 @@ xlog_write(
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
-	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
-	int			index = 0;
-	int			partial_copy = 0;
-	int			partial_copy_len = 0;
 	int			contwr = 0;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
+	int			log_offset;
 
 	if (ticket->t_curr_res < 0) {
 		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
@@ -2339,146 +2420,40 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	if (start_lsn)
-		*start_lsn = 0;
-	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
-		void		*ptr;
-		int		log_offset;
-
-		error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
-						   &contwr, &log_offset);
-		if (error)
-			return error;
-
-		ASSERT(log_offset <= iclog->ic_size - 1);
+	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
+					   &contwr, &log_offset);
+	if (error)
+		return error;
 
-		/* Start_lsn is the first lsn written to. */
-		if (start_lsn && !*start_lsn)
-			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+	/* start_lsn is the LSN of the first iclog written to. */
+	if (start_lsn)
+		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
-		/* If this is a single iclog write, go fast... */
-		if (!contwr && lv == log_vector) {
-			record_cnt = xlog_write_single(lv, ticket, iclog,
-						log_offset, len);
-			len = 0;
-			data_cnt = len;
+	while (lv) {
+		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
+					&len, &record_cnt, &data_cnt);
+		if (!lv)
 			break;
-		}
-
-		/*
-		 * This loop writes out as many regions as can fit in the amount
-		 * of space which was allocated by xlog_state_get_iclog_space().
-		 */
-		ptr = iclog->ic_datap + log_offset;
-		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
-			struct xfs_log_iovec	*reg;
-			struct xlog_op_header	*ophdr;
-			int			copy_len;
-			int			copy_off;
-			bool			ordered = false;
-			bool			added_ophdr = false;
-
-			/* ordered log vectors have no regions to write */
-			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
-				ASSERT(lv->lv_niovecs == 0);
-				ordered = true;
-				goto next_lv;
-			}
-
-			reg = &vecp[index];
-			ASSERT(reg->i_len % sizeof(int32_t) == 0);
-			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
-
-			/*
-			 * Regions always have their ophdr at the start of the
-			 * region, except for:
-			 * - a transaction start which has a start record ophdr
-			 *   before the first region ophdr; and
-			 * - the previous region didn't fully fit into an iclog
-			 *   so needs a continuation ophdr to prepend the region
-			 *   in this new iclog.
-			 */
-			ophdr = reg->i_addr;
-			if (optype && index) {
-				optype &= ~XLOG_START_TRANS;
-			} else if (partial_copy) {
-                                ophdr = xlog_write_setup_ophdr(ptr, ticket);
-				xlog_write_adv_cnt(&ptr, &len, &log_offset,
-					   sizeof(struct xlog_op_header));
-				added_ophdr = true;
-			}
-			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
-
-			len += xlog_write_setup_copy(ticket, ophdr,
-						     iclog->ic_size-log_offset,
-						     reg->i_len,
-						     &copy_off, &copy_len,
-						     &partial_copy,
-						     &partial_copy_len);
-			xlog_verify_dest_ptr(log, ptr);
 
-
-			/*
-			 * Wart: need to update length in embedded ophdr not
-			 * to include it's own length.
-			 */
-			if (!added_ophdr) {
-				ophdr->oh_len = cpu_to_be32(copy_len -
-						sizeof(struct xlog_op_header));
-			}
-
-			ASSERT(copy_len > 0);
-			memcpy(ptr, reg->i_addr + copy_off, copy_len);
-			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
-
-			if (added_ophdr)
-				copy_len += sizeof(struct xlog_op_header);
-			record_cnt++;
-			data_cnt += contwr ? copy_len : 0;
-
-			error = xlog_write_copy_finish(log, iclog, optype,
-						       &record_cnt, &data_cnt,
-						       &partial_copy,
-						       &partial_copy_len,
-						       log_offset,
-						       commit_iclog);
-			if (error)
-				return error;
-
-			/*
-			 * if we had a partial copy, we need to get more iclog
-			 * space but we don't want to increment the region
-			 * index because there is still more is this region to
-			 * write.
-			 *
-			 * If we completed writing this region, and we flushed
-			 * the iclog (indicated by resetting of the record
-			 * count), then we also need to get more log space. If
-			 * this was the last record, though, we are done and
-			 * can just return.
-			 */
-			if (partial_copy)
-				break;
-
-			if (++index == lv->lv_niovecs) {
-next_lv:
-				lv = lv->lv_next;
-				index = 0;
-				if (lv)
-					vecp = lv->lv_iovecp;
-			}
-			if (record_cnt == 0 && !ordered) {
-				if (!lv)
-					return 0;
-				break;
-			}
+		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
+		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
+					&len, &record_cnt, &data_cnt, &contwr);
+		if (IS_ERR_OR_NULL(lv)) {
+			error = PTR_ERR_OR_ZERO(lv);
+			break;
 		}
 	}
+	ASSERT((len == 0 && !lv) || error);
 
-	ASSERT(len == 0);
-
+	/*
+	 * We've already been guaranteed that the last writes will fit inside
+	 * the current iclog, and hence it will already have the space used by
+	 * those writes accounted to it. Hence we do not need to update the
+	 * iclog with the number of bytes written here.
+	 */
+	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
 	spin_lock(&log->l_icloglock);
-	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
+	xlog_state_finish_copy(log, iclog, record_cnt, 0);
 	if (commit_iclog) {
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
@@ -3633,11 +3608,12 @@ xlog_verify_iclog(
 					iclog->ic_header.h_cycle_data[idx]);
 			}
 		}
-		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG)
+		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) {
 			xfs_warn(log->l_mp,
-				"%s: invalid clientid %d op "PTR_FMT" offset 0x%lx",
-				__func__, clientid, ophead,
+				"%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx",
+				__func__, i, clientid, ophead,
 				(unsigned long)field_offset);
+		}
 
 		/* check length */
 		p = &ophead->oh_len;
-- 
2.31.1



* [PATCH 23/39] xfs: xlog_write() no longer needs contwr state
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (21 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 22/39] xfs: introduce xlog_write_partial() Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-19 12:13 ` [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore Dave Chinner
                   ` (15 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The rework of xlog_write() means it no longer needs
xlog_state_get_iclog_space() to tell it about internal iclog space
reservation state to direct what it does. Remove the continued_write
parameter.

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  26520	    560	      8	  27088	   69d0	fs/xfs/xfs_log.o.orig
  26384	    560	      8	  26952	   6948	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 98a3e2e4f1e0..574078985f0a 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -47,7 +47,6 @@ xlog_state_get_iclog_space(
 	int			len,
 	struct xlog_in_core	**iclog,
 	struct xlog_ticket	*ticket,
-	int			*continued_write,
 	int			*logoffsetp);
 STATIC void
 xlog_grant_push_ail(
@@ -2178,8 +2177,7 @@ xlog_write_get_more_iclog_space(
 	uint32_t		*log_offset,
 	uint32_t		len,
 	uint32_t		*record_cnt,
-	uint32_t		*data_cnt,
-	int			*contwr)
+	uint32_t		*data_cnt)
 {
 	struct xlog_in_core	*iclog = *iclogp;
 	int			error;
@@ -2193,8 +2191,8 @@ xlog_write_get_more_iclog_space(
 	if (error)
 		return error;
 
-	error = xlog_state_get_iclog_space(log, len, &iclog,
-				ticket, contwr, log_offset);
+	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
+					log_offset);
 	if (error)
 		return error;
 	*record_cnt = 0;
@@ -2218,8 +2216,7 @@ xlog_write_partial(
 	uint32_t		*log_offset,
 	uint32_t		*len,
 	uint32_t		*record_cnt,
-	uint32_t		*data_cnt,
-	int			*contwr)
+	uint32_t		*data_cnt)
 {
 	struct xlog_in_core	*iclog = *iclogp;
 	struct xfs_log_vec	*lv = log_vector;
@@ -2253,7 +2250,7 @@ xlog_write_partial(
 					sizeof(struct xlog_op_header)) {
 			error = xlog_write_get_more_iclog_space(log, ticket,
 					&iclog, log_offset, *len, record_cnt,
-					data_cnt, contwr);
+					data_cnt);
 			if (error)
 				return ERR_PTR(error);
 			ptr = iclog->ic_datap + *log_offset;
@@ -2309,7 +2306,7 @@ xlog_write_partial(
 
 			error = xlog_write_get_more_iclog_space(log, ticket,
 					&iclog, log_offset, *len, record_cnt,
-					data_cnt, contwr);
+					data_cnt);
 			if (error)
 				return ERR_PTR(error);
 			ptr = iclog->ic_datap + *log_offset;
@@ -2407,7 +2404,6 @@ xlog_write(
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
-	int			contwr = 0;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
@@ -2421,7 +2417,7 @@ xlog_write(
 	}
 
 	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
-					   &contwr, &log_offset);
+					   &log_offset);
 	if (error)
 		return error;
 
@@ -2437,7 +2433,7 @@ xlog_write(
 
 		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
 		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
-					&len, &record_cnt, &data_cnt, &contwr);
+					&len, &record_cnt, &data_cnt);
 		if (IS_ERR_OR_NULL(lv)) {
 			error = PTR_ERR_OR_ZERO(lv);
 			break;
@@ -2451,7 +2447,6 @@ xlog_write(
 	 * those writes accounted to it. Hence we do not need to update the
 	 * iclog with the number of bytes written here.
 	 */
-	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, 0);
 	if (commit_iclog) {
@@ -2855,7 +2850,6 @@ xlog_state_get_iclog_space(
 	int			len,
 	struct xlog_in_core	**iclogp,
 	struct xlog_ticket	*ticket,
-	int			*continued_write,
 	int			*logoffsetp)
 {
 	int		  log_offset;
@@ -2931,13 +2925,10 @@ xlog_state_get_iclog_space(
 	 * iclogs (to mark it taken), this particular iclog will release/sync
 	 * to disk in xlog_write().
 	 */
-	if (len <= iclog->ic_size - iclog->ic_offset) {
-		*continued_write = 0;
+	if (len <= iclog->ic_size - iclog->ic_offset)
 		iclog->ic_offset += len;
-	} else {
-		*continued_write = 1;
+	else
 		xlog_state_switch_iclogs(log, iclog, iclog->ic_size);
-	}
 	*iclogp = iclog;
 
 	ASSERT(iclog->ic_offset <= iclog->ic_size);
-- 
2.31.1



* [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (22 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 23/39] xfs: xlog_write() no longer needs contwr state Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:07   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 25/39] xfs: CIL context doesn't need to count iovecs Dave Chinner
                   ` (14 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The optype parameter is no longer used by xlog_write() to make any
decisions, so remove it from the interface and callers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 14 ++++----------
 fs/xfs/xfs_log_cil.c  |  2 +-
 fs/xfs/xfs_log_priv.h |  2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 574078985f0a..65b28fce4db4 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -863,8 +863,7 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
-				reg.i_len);
+	return xlog_write(log, &vec, ticket, NULL, NULL, reg.i_len);
 }
 
 /*
@@ -1588,8 +1587,7 @@ xlog_commit_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
-				reg.i_len);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2399,7 +2397,6 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			optype,
 	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
@@ -2431,7 +2428,6 @@ xlog_write(
 		if (!lv)
 			break;
 
-		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
 		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
 					&len, &record_cnt, &data_cnt);
 		if (IS_ERR_OR_NULL(lv)) {
@@ -2449,12 +2445,10 @@ xlog_write(
 	 */
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, 0);
-	if (commit_iclog) {
-		ASSERT(optype & XLOG_COMMIT_TRANS);
+	if (commit_iclog)
 		*commit_iclog = iclog;
-	} else {
+	else
 		error = xlog_state_release_iclog(log, iclog);
-	}
 	spin_unlock(&log->l_icloglock);
 
 	return error;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 7a6b80666f98..dbe3a8267e2f 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -908,7 +908,7 @@ xlog_cil_push_work(
 	 * write head.
 	 */
 	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS, num_bytes);
+				num_bytes);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index eba905c273b0..a16ffdc8ae97 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
+		struct xlog_in_core **commit_iclog, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 
-- 
2.31.1



* [PATCH 25/39] xfs: CIL context doesn't need to count iovecs
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (23 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:08   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 26/39] xfs: use the CIL space used counter for emptiness checks Dave Chinner
                   ` (13 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we account for log opheaders in the log item formatting
code, we don't actually use the aggregated count of log iovecs in
the CIL for anything. Remove it and the tracking code that
calculates it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 22 ++++++----------------
 fs/xfs/xfs_log_priv.h |  1 -
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index dbe3a8267e2f..eca5c82c0d60 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -252,22 +252,18 @@ xlog_cil_alloc_shadow_bufs(
 
 /*
  * Prepare the log item for insertion into the CIL. Calculate the difference in
- * log space and vectors it will consume, and if it is a new item pin it as
- * well.
+ * log space it will consume, and if it is a new item pin it as well.
  */
 STATIC void
 xfs_cil_prepare_item(
 	struct xlog		*log,
 	struct xfs_log_vec	*lv,
 	struct xfs_log_vec	*old_lv,
-	int			*diff_len,
-	int			*diff_iovecs)
+	int			*diff_len)
 {
 	/* Account for the new LV being passed in */
-	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
+	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
 		*diff_len += lv->lv_bytes;
-		*diff_iovecs += lv->lv_niovecs;
-	}
 
 	/*
 	 * If there is no old LV, this is the first time we've seen the item in
@@ -284,7 +280,6 @@ xfs_cil_prepare_item(
 		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
 
 		*diff_len -= old_lv->lv_bytes;
-		*diff_iovecs -= old_lv->lv_niovecs;
 		lv->lv_item->li_lv_shadow = old_lv;
 	}
 
@@ -333,12 +328,10 @@ static void
 xlog_cil_insert_format_items(
 	struct xlog		*log,
 	struct xfs_trans	*tp,
-	int			*diff_len,
-	int			*diff_iovecs)
+	int			*diff_len)
 {
 	struct xfs_log_item	*lip;
 
-
 	/* Bail out if we didn't find a log item.  */
 	if (list_empty(&tp->t_items)) {
 		ASSERT(0);
@@ -381,7 +374,6 @@ xlog_cil_insert_format_items(
 			 * set the item up as though it is a new insertion so
 			 * that the space reservation accounting is correct.
 			 */
-			*diff_iovecs -= lv->lv_niovecs;
 			*diff_len -= lv->lv_bytes;
 
 			/* Ensure the lv is set up according to ->iop_size */
@@ -406,7 +398,7 @@ xlog_cil_insert_format_items(
 		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
 		lip->li_ops->iop_format(lip, lv);
 insert:
-		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
+		xfs_cil_prepare_item(log, lv, old_lv, diff_len);
 	}
 }
 
@@ -426,7 +418,6 @@ xlog_cil_insert_items(
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
 	struct xfs_log_item	*lip;
 	int			len = 0;
-	int			diff_iovecs = 0;
 	int			iclog_space;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
@@ -436,7 +427,7 @@ xlog_cil_insert_items(
 	 * We can do this safely because the context can't checkpoint until we
 	 * are done so it doesn't matter exactly how we update the CIL.
 	 */
-	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
+	xlog_cil_insert_format_items(log, tp, &len);
 
 	spin_lock(&cil->xc_cil_lock);
 
@@ -471,7 +462,6 @@ xlog_cil_insert_items(
 	}
 	tp->t_ticket->t_curr_res -= len;
 	ctx->space_used += len;
-	ctx->nvecs += diff_iovecs;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a16ffdc8ae97..02c94b6d0642 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -217,7 +217,6 @@ struct xfs_cil_ctx {
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
-	int			nvecs;		/* number of regions */
 	int			space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
-- 
2.31.1



* [PATCH 26/39] xfs: use the CIL space used counter for emptiness checks
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (24 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 25/39] xfs: CIL context doesn't need to count iovecs Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-19 12:13 ` [PATCH 27/39] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
                   ` (12 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In the next patches we are going to make the CIL list itself
per-cpu, and so we cannot use list_empty() to check if the list is
empty. Replace the list_empty() checks with a flag in the CIL to
indicate we have committed at least one transaction to the CIL and
hence the CIL is not empty.

We need this flag to be an atomic so that we can clear it without
holding any locks in the commit fast path, but we also need to be
careful to avoid atomic operations in the fast path. Hence we use
the fact that test_bit() is not an atomic op to first check if the
flag is set and then run the atomic test_and_clear_bit() operation
to clear it and steal the initial unit reservation for the CIL
context checkpoint.

When we are switching to a new context in a push, we place the
setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. This
allows all the other places that need to check whether the CIL is
empty to use test_bit() and still be serialised correctly with the
CIL context swaps that set the bit.
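
A userspace analogue of the check-then-clear pattern, using C11
atomics purely for illustration (the kernel code uses bitops on
cil->xc_flags, as in the patch below):

	#include <stdatomic.h>
	#include <stdbool.h>

	static atomic_bool cil_empty = true;

	static bool steal_unit_reservation(void)
	{
		/* Cheap read first: keeps the serialising RMW off
		 * the common path where the flag is already clear. */
		if (!atomic_load_explicit(&cil_empty,
					  memory_order_relaxed))
			return false;
		/* Atomic RMW only when the flag looked set. */
		return atomic_exchange(&cil_empty, false);
	}

Only the first committer into an empty CIL sees the exchange return
true, so exactly one transaction pays the context's unit reservation.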

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log_cil.c  | 45 ++++++++++++++++++++++++-------------------
 fs/xfs/xfs_log_priv.h |  4 ++++
 2 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index eca5c82c0d60..f119d8baf504 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -70,6 +70,7 @@ xlog_cil_ctx_switch(
 	struct xfs_cil		*cil,
 	struct xfs_cil_ctx	*ctx)
 {
+	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
 	ctx->sequence = ++cil->xc_current_sequence;
 	ctx->cil = cil;
 	cil->xc_ctx = ctx;
@@ -436,13 +437,12 @@ xlog_cil_insert_items(
 		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 
 	/*
-	 * Now transfer enough transaction reservation to the context ticket
-	 * for the checkpoint. The context ticket is special - the unit
-	 * reservation has to grow as well as the current reservation as we
-	 * steal from tickets so we can correctly determine the space used
-	 * during the transaction commit.
+	 * We need to take the CIL checkpoint unit reservation on the first
+	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
+	 * unnecessarily do an atomic op in the fast path here.
 	 */
-	if (ctx->ticket->t_curr_res == 0) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
+	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		ctx_res = ctx->ticket->t_unit_res;
 		ctx->ticket->t_curr_res = ctx_res;
 		tp->t_ticket->t_curr_res -= ctx_res;
@@ -771,7 +771,7 @@ xlog_cil_push_work(
 	 * move on to a new sequence number and so we have to be able to push
 	 * this sequence again later.
 	 */
-	if (list_empty(&cil->xc_cil)) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		cil->xc_push_seq = 0;
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
@@ -1020,9 +1020,10 @@ xlog_cil_push_background(
 
 	/*
 	 * The cil won't be empty because we are called while holding the
-	 * context lock so whatever we added to the CIL will still be there
+	 * context lock so whatever we added to the CIL will still be there.
 	 */
 	ASSERT(!list_empty(&cil->xc_cil));
+	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
 	 * Don't do a background push if we haven't used up all the
@@ -1109,7 +1110,8 @@ xlog_cil_push_now(
 	 * there's no work we need to do.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) {
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) ||
+	    push_seq <= cil->xc_push_seq) {
 		spin_unlock(&cil->xc_push_lock);
 		return;
 	}
@@ -1128,7 +1130,7 @@ xlog_cil_empty(
 	bool		empty = false;
 
 	spin_lock(&cil->xc_push_lock);
-	if (list_empty(&cil->xc_cil))
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		empty = true;
 	spin_unlock(&cil->xc_push_lock);
 	return empty;
@@ -1296,7 +1298,7 @@ xlog_cil_force_seq(
 	 * we would have found the context on the committing list.
 	 */
 	if (sequence == cil->xc_current_sequence &&
-	    !list_empty(&cil->xc_cil)) {
+	    !test_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
 		spin_unlock(&cil->xc_push_lock);
 		goto restart;
 	}
@@ -1329,9 +1331,9 @@ bool
 xfs_log_item_in_current_chkpt(
 	struct xfs_log_item *lip)
 {
-	struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
+	struct xfs_cil		*cil = lip->li_mountp->m_log->l_cilp;
 
-	if (list_empty(&lip->li_cil))
+	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		return false;
 
 	/*
@@ -1339,7 +1341,7 @@ xfs_log_item_in_current_chkpt(
 	 * first checkpoint it is written to. Hence if it is different to the
 	 * current sequence, we're in a new checkpoint.
 	 */
-	return lip->li_seq == ctx->sequence;
+	return lip->li_seq == cil->xc_ctx->sequence;
 }
 
 /*
@@ -1376,13 +1378,16 @@ void
 xlog_cil_destroy(
 	struct xlog	*log)
 {
-	if (log->l_cilp->xc_ctx) {
-		if (log->l_cilp->xc_ctx->ticket)
-			xfs_log_ticket_put(log->l_cilp->xc_ctx->ticket);
-		kmem_free(log->l_cilp->xc_ctx);
+	struct xfs_cil	*cil = log->l_cilp;
+
+	if (cil->xc_ctx) {
+		if (cil->xc_ctx->ticket)
+			xfs_log_ticket_put(cil->xc_ctx->ticket);
+		kmem_free(cil->xc_ctx);
 	}
 
-	ASSERT(list_empty(&log->l_cilp->xc_cil));
-	kmem_free(log->l_cilp);
+	ASSERT(list_empty(&cil->xc_cil));
+	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
+	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 02c94b6d0642..11606c378b7f 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -244,6 +244,7 @@ struct xfs_cil_ctx {
  */
 struct xfs_cil {
 	struct xlog		*xc_log;
+	unsigned long		xc_flags;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
 
@@ -259,6 +260,9 @@ struct xfs_cil {
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
 
+/* xc_flags bit values */
+#define	XLOG_CIL_EMPTY		1
+
 /*
  * The amount of log space we allow the CIL to aggregate is difficult to size.
  * Whatever we choose, we have to make sure we can get a reservation for the
-- 
2.31.1



* [PATCH 27/39] xfs: lift init CIL reservation out of xc_cil_lock
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (25 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 26/39] xfs: use the CIL space used counter for emptiness checks Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-19 12:13 ` [PATCH 28/39] xfs: rework per-iclog header CIL reservation Dave Chinner
                   ` (11 subsequent siblings)
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The xc_cil_lock is the most highly contended lock in XFS now. To
start the process of getting rid of it, lift the initial reservation
of the CIL log space out from under the xc_cil_lock.
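
The shape of the change is straightforward lock narrowing. The atomic
test_and_clear_bit() of XLOG_CIL_EMPTY is already safe outside the
spinlock, so only the ticket accounting and list manipulation remain
under it. In sketch form (an outline of the patch below, not new
code):

	/* safe before taking the spinlock */
	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
		ctx_res = ctx->ticket->t_unit_res;

	spin_lock(&cil->xc_cil_lock);
	/* ticket accounting and CIL list manipulation only */
	spin_unlock(&cil->xc_cil_lock);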

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log_cil.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f119d8baf504..4637f8711ada 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -430,23 +430,19 @@ xlog_cil_insert_items(
 	 */
 	xlog_cil_insert_format_items(log, tp, &len);
 
-	spin_lock(&cil->xc_cil_lock);
-
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
-
 	/*
 	 * We need to take the CIL checkpoint unit reservation on the first
 	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
-	 * unnecessarily do an atomic op in the fast path here.
+	 * unnecessarily do an atomic op in the fast path here. We don't need to
+	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
+	 * under the xc_ctx_lock here and that needs to be held exclusively to
+	 * reset the XLOG_CIL_EMPTY bit.
 	 */
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
-	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags)) {
+	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		ctx_res = ctx->ticket->t_unit_res;
-		ctx->ticket->t_curr_res = ctx_res;
-		tp->t_ticket->t_curr_res -= ctx_res;
-	}
+
+	spin_lock(&cil->xc_cil_lock);
 
 	/* do we need space for more log record headers? */
 	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
@@ -456,11 +452,9 @@ xlog_cil_insert_items(
 		/* need to take into account split region headers, too */
 		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
 		ctx->ticket->t_unit_res += split_res;
-		ctx->ticket->t_curr_res += split_res;
-		tp->t_ticket->t_curr_res -= split_res;
-		ASSERT(tp->t_ticket->t_curr_res >= len);
 	}
-	tp->t_ticket->t_curr_res -= len;
+	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
+	ctx->ticket->t_curr_res += split_res + ctx_res;
 	ctx->space_used += len;
 
 	/*
@@ -498,6 +492,9 @@ xlog_cil_insert_items(
 			list_move_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
-- 
2.31.1



* [PATCH 28/39] xfs: rework per-iclog header CIL reservation
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (26 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 27/39] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:17   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure Dave Chinner
                   ` (10 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For every iclog that a CIL push will use up, we need to ensure we
have space reserved for the iclog header in each iclog. It is
extremely difficult to do this accurately with a per-cpu counter
without expensive summing of the counter in every commit. However,
we know what the maximum CIL size is going to be because of the
hard space limit we have, and hence we know exactly how many iclogs
we are going to need to write out the CIL.

We are constrained by the requirement that small transactions only
have reservation space for a single iclog header built into them.
At commit time we don't know how much of the current transaction
reservation is made up of iclog header reservations as calculated by
xfs_log_calc_unit_res() when the ticket was reserved. As larger
reservations have multiple header spaces reserved, we can steal
more than one iclog header reservation at a time, but we only steal
the exact number needed for the given log vector size delta.

As a result, we don't know exactly when we are going to steal iclog
header reservations, nor do we know exactly how many we are going to
need for a given CIL.

To make things simple, start by calculating the worst case number of
iclog headers a full CIL push will require. Record this into an
atomic variable in the CIL. Then add a byte counter to the log
ticket that records exactly how much iclog header space has been
reserved in this ticket by xfs_log_calc_unit_res(). This tells us
exactly how much space we can steal from the ticket at transaction
commit time.

Now, at transaction commit time, we can check if the CIL has a full
iclog header reservation and, if not, steal the entire reservation
the current ticket holds for iclog headers. This minimises the
number of times we need to do atomic operations in the fast path,
but still guarantees we get all the reservations we need.
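
A worked example of the worst-case calculation, using assumed log
geometry (the numbers are illustrative, not taken from the patch):

	/* 32k iclogs, 512 byte headers, 8M CIL blocking limit */
	int iclog_size = 32768;
	int iclog_hsize = 512;
	int blocking_limit = 8 * 1024 * 1024;

	/* usable payload per iclog is its size minus the header */
	int max_hdrs = blocking_limit / (iclog_size - iclog_hsize);
	/* max_hdrs == 260 */

For that geometry the CIL push needs at most 260 iclog header
reservations, and commits stop stealing header space from their
tickets once the atomic counter drops to zero (unless the CIL has
grown past the hard limit).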

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      |  9 ++++---
 fs/xfs/xfs_log_cil.c  | 55 +++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_log_priv.h | 20 +++++++++-------
 3 files changed, 59 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 65b28fce4db4..77d9ea7daf26 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3307,7 +3307,8 @@ xfs_log_ticket_get(
 static int
 xlog_calc_unit_res(
 	struct xlog		*log,
-	int			unit_bytes)
+	int			unit_bytes,
+	int			*niclogs)
 {
 	int			iclog_space;
 	uint			num_headers;
@@ -3387,6 +3388,8 @@ xlog_calc_unit_res(
 	/* roundoff padding for transaction data and one for commit record */
 	unit_bytes += 2 * log->l_iclog_roundoff;
 
+	if (niclogs)
+		*niclogs = num_headers;
 	return unit_bytes;
 }
 
@@ -3395,7 +3398,7 @@ xfs_log_calc_unit_res(
 	struct xfs_mount	*mp,
 	int			unit_bytes)
 {
-	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+	return xlog_calc_unit_res(mp->m_log, unit_bytes, NULL);
 }
 
 /*
@@ -3413,7 +3416,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xlog_calc_unit_res(log, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 4637f8711ada..87d4eb321fdc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -44,9 +44,20 @@ xlog_cil_ticket_alloc(
 	 * transaction overhead reservation from the first transaction commit.
 	 */
 	tic->t_curr_res = 0;
+	tic->t_iclog_hdrs = 0;
 	return tic;
 }
 
+static inline void
+xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
+{
+	struct xlog	*log = cil->xc_log;
+
+	atomic_set(&cil->xc_iclog_hdrs,
+		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
+			(log->l_iclog_size - log->l_iclog_hsize)));
+}
+
 /*
  * Unavoidable forward declaration - xlog_cil_push_work() calls
  * xlog_cil_ctx_alloc() itself.
@@ -70,6 +81,7 @@ xlog_cil_ctx_switch(
 	struct xfs_cil		*cil,
 	struct xfs_cil_ctx	*ctx)
 {
+	xlog_cil_set_iclog_hdr_count(cil);
 	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
 	ctx->sequence = ++cil->xc_current_sequence;
 	ctx->cil = cil;
@@ -92,6 +104,7 @@ xlog_cil_init_post_recovery(
 {
 	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
 	log->l_cilp->xc_ctx->sequence = 1;
+	xlog_cil_set_iclog_hdr_count(log->l_cilp);
 }
 
 static inline int
@@ -419,7 +432,6 @@ xlog_cil_insert_items(
 	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
 	struct xfs_log_item	*lip;
 	int			len = 0;
-	int			iclog_space;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 
 	ASSERT(tp);
@@ -442,19 +454,36 @@ xlog_cil_insert_items(
 	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
 		ctx_res = ctx->ticket->t_unit_res;
 
-	spin_lock(&cil->xc_cil_lock);
-
-	/* do we need space for more log record headers? */
-	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
-	if (len > 0 && (ctx->space_used / iclog_space !=
-				(ctx->space_used + len) / iclog_space)) {
-		split_res = (len + iclog_space - 1) / iclog_space;
-		/* need to take into account split region headers, too */
-		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
-		ctx->ticket->t_unit_res += split_res;
+	/*
+	 * Check if we need to steal iclog headers. atomic_read() is not a
+	 * locked atomic operation, so we can check the value before we do any
+	 * real atomic ops in the fast path. If we've already taken the CIL unit
+	 * reservation from this commit, we've already got one iclog header
+	 * space reserved so we have to account for that otherwise we risk
+	 * overrunning the reservation on this ticket.
+	 *
+	 * If the CIL is already at the hard limit, we might need more header
+	 * space than originally reserved. So steal more header space from every
+	 * commit that occurs once we are over the hard limit to ensure the CIL
+	 * push won't run out of reservation space.
+	 *
+	 * This can steal more than we need, but that's OK.
+	 */
+	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
+	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+		int	split_res = log->l_iclog_hsize +
+					sizeof(struct xlog_op_header);
+		if (ctx_res)
+			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
+		else
+			ctx_res = split_res * tp->t_ticket->t_iclog_hdrs;
+		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
 	}
-	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
-	ctx->ticket->t_curr_res += split_res + ctx_res;
+
+	spin_lock(&cil->xc_cil_lock);
+	tp->t_ticket->t_curr_res -= ctx_res + len;
+	ctx->ticket->t_unit_res += ctx_res;
+	ctx->ticket->t_curr_res += ctx_res;
 	ctx->space_used += len;
 
 	/*
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 11606c378b7f..85a85ab569fe 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -137,15 +137,16 @@ enum xlog_iclog_state {
 #define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
 
 typedef struct xlog_ticket {
-	struct list_head   t_queue;	 /* reserve/write queue */
-	struct task_struct *t_task;	 /* task that owns this ticket */
-	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
-	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
-	int		   t_curr_res;	 /* current reservation in bytes : 4  */
-	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
-	char		   t_ocnt;	 /* original count		 : 1  */
-	char		   t_cnt;	 /* current count		 : 1  */
-	char		   t_flags;	 /* properties of reservation	 : 1  */
+	struct list_head	t_queue;	/* reserve/write queue */
+	struct task_struct	*t_task;	/* task that owns this ticket */
+	xlog_tid_t		t_tid;		/* transaction identifier */
+	atomic_t		t_ref;		/* ticket reference count */
+	int			t_curr_res;	/* current reservation */
+	int			t_unit_res;	/* unit reservation */
+	char			t_ocnt;		/* original count */
+	char			t_cnt;		/* current count */
+	char			t_flags;	/* properties of reservation */
+	int			t_iclog_hdrs;	/* iclog hdrs in t_curr_res */
 } xlog_ticket_t;
 
 /*
@@ -245,6 +246,7 @@ struct xfs_cil_ctx {
 struct xfs_cil {
 	struct xlog		*xc_log;
 	unsigned long		xc_flags;
+	atomic_t		xc_iclog_hdrs;
 	struct list_head	xc_cil;
 	spinlock_t		xc_cil_lock;
 
-- 
2.31.1



* [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (27 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 28/39] xfs: rework per-iclog header CIL reservation Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:31   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 30/39] xfs: implement percpu cil space used calculation Dave Chinner
                   ` (9 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The CIL push lock is highly contended on larger machines, becoming a
hard bottleneck that limits transaction commit rates to about
700,000 commits/s on >16p machines. To address this, start moving
the CIL tracking
infrastructure to utilise per-CPU structures.

We need to track the space used, the amount of log reservation space
reserved to write the CIL, the log items in the CIL and the busy
extents that need to be completed by the CIL commit.  This requires
a couple of per-cpu counters, an unordered per-cpu list and a
globally ordered per-cpu list.

Create a per-cpu structure to hold these and all the management
interfaces needed, as well as the hooks to handle hotplug CPUs.
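
The access pattern on the commit side is the usual preemption-safe
percpu sequence; a minimal sketch of how a later patch in the series
updates the local structure (get_cpu_ptr()/put_cpu_ptr() are the
standard kernel accessors, the field name is from this patch):

	struct xlog_cil_pcp	*cilpcp;

	cilpcp = get_cpu_ptr(cil->xc_pcp);	/* disables preemption */
	cilpcp->space_used += len;		/* CPU-local, no lock */
	put_cpu_ptr(cilpcp);			/* re-enables preemption */

The hotplug callback is needed because a dying CPU's contribution
would otherwise be invisible to later aggregation, so its state must
be drained into the current CIL context first.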

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c       | 106 +++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_priv.h      |  15 ++++++
 include/linux/cpuhotplug.h |   1 +
 3 files changed, 122 insertions(+)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 87d4eb321fdc..ba1c6979a4c7 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1370,6 +1370,105 @@ xfs_log_item_in_current_chkpt(
 	return lip->li_seq == cil->xc_ctx->sequence;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(xlog_cil_pcp_list);
+static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
+static bool xlog_cil_pcp_init;
+
+/*
+ * Move dead percpu state to the relevant CIL context structures.
+ *
+ * We have to lock the CIL context here to ensure that nothing is modifying
+ * the percpu state, either addition or removal. Both of these are done under
+ * the CIL context lock, so grabbing that exclusively here will ensure we can
+ * safely drain the cilpcp for the CPU that is dying.
+ */
+static int
+xlog_cil_pcp_dead(
+	unsigned int		cpu)
+{
+	struct xfs_cil		*cil, *n;
+
+	spin_lock(&xlog_cil_pcp_lock);
+	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
+		spin_unlock(&xlog_cil_pcp_lock);
+		down_write(&cil->xc_ctx_lock);
+		/* move stuff on dead CPU to context */
+		up_write(&cil->xc_ctx_lock);
+		spin_lock(&xlog_cil_pcp_lock);
+	}
+	spin_unlock(&xlog_cil_pcp_lock);
+	return 0;
+}
+
+static int
+xlog_cil_pcp_hpadd(
+	struct xfs_cil		*cil)
+{
+	if (!xlog_cil_pcp_init) {
+		int	ret;
+		ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD,
+						"xfs/cil_pcp:dead", NULL,
+						xlog_cil_pcp_dead);
+		if (ret < 0) {
+			xfs_warn(cil->xc_log->l_mp,
+	"Failed to initialise CIL hotplug, error %d. XFS is non-functional.",
+				ret);
+			ASSERT(0);
+			return -ENOMEM;
+		}
+		xlog_cil_pcp_init = true;
+	}
+
+	INIT_LIST_HEAD(&cil->xc_pcp_list);
+	spin_lock(&xlog_cil_pcp_lock);
+	list_add(&cil->xc_pcp_list, &xlog_cil_pcp_list);
+	spin_unlock(&xlog_cil_pcp_lock);
+	return 0;
+}
+
+static void
+xlog_cil_pcp_hpremove(
+	struct xfs_cil		*cil)
+{
+	spin_lock(&xlog_cil_pcp_lock);
+	list_del(&cil->xc_pcp_list);
+	spin_unlock(&xlog_cil_pcp_lock);
+}
+
+#else /* !CONFIG_HOTPLUG_CPU */
+static inline void xlog_cil_pcp_hpadd(struct xfs_cil *cil) {}
+static inline void xlog_cil_pcp_hpremove(struct xfs_cil *cil) {}
+#endif
+
+static void __percpu *
+xlog_cil_pcp_alloc(
+	struct xfs_cil		*cil)
+{
+	void __percpu		*pcp;
+
+	pcp = alloc_percpu(struct xlog_cil_pcp);
+	if (!pcp)
+		return NULL;
+
+	if (xlog_cil_pcp_hpadd(cil) < 0) {
+		free_percpu(pcp);
+		return NULL;
+	}
+	return pcp;
+}
+
+static void
+xlog_cil_pcp_free(
+	struct xfs_cil		*cil,
+	void __percpu		*pcp)
+{
+	if (!pcp)
+		return;
+	xlog_cil_pcp_hpremove(cil);
+	free_percpu(pcp);
+}
+
 /*
  * Perform initial CIL structure initialisation.
  */
@@ -1384,6 +1483,12 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;
 
+	cil->xc_pcp = xlog_cil_pcp_alloc(cil);
+	if (!cil->xc_pcp) {
+		kmem_free(cil);
+		return -ENOMEM;
+	}
+
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1414,6 +1519,7 @@ xlog_cil_destroy(
 
 	ASSERT(list_empty(&cil->xc_cil));
 	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
+	xlog_cil_pcp_free(cil, cil->xc_pcp);
 	kmem_free(cil);
 }
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 85a85ab569fe..aaa1e7f7fb66 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -227,6 +227,16 @@ struct xfs_cil_ctx {
 	struct work_struct	push_work;
 };
 
+/*
+ * Per-cpu CIL tracking items
+ */
+struct xlog_cil_pcp {
+	uint32_t		space_used;
+	uint32_t		curr_res;
+	struct list_head	busy_extents;
+	struct list_head	log_items;
+};
+
 /*
  * Committed Item List structure
  *
@@ -260,6 +270,11 @@ struct xfs_cil {
 	wait_queue_head_t	xc_commit_wait;
 	xfs_csn_t		xc_current_sequence;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
+
+	void __percpu		*xc_pcp;	/* percpu CIL structures */
+#ifdef CONFIG_HOTPLUG_CPU
+	struct list_head	xc_pcp_list;
+#endif
 } ____cacheline_aligned_in_smp;
 
 /* xc_flags bit values */
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..3d3ccde9e9c8 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -52,6 +52,7 @@ enum cpuhp_state {
 	CPUHP_FS_BUFF_DEAD,
 	CPUHP_PRINTK_DEAD,
 	CPUHP_MM_MEMCQ_DEAD,
+	CPUHP_XFS_CIL_DEAD,
 	CPUHP_PERCPU_CNT_DEAD,
 	CPUHP_RADIX_DEAD,
 	CPUHP_PAGE_ALLOC_DEAD,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (28 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:41   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure Dave Chinner
                   ` (8 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have the CIL percpu structures in place, implement the
space used counter with a fast sum check similar to the
percpu_counter infrastructure.
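
As an aside for readers, a minimal userspace sketch of this scheme is
below. It is illustrative only - NCPUS, the two limits and all the
names are invented stand-ins for the real per-cpu and log structures,
not kernel code. Each CPU batches its delta locally and folds it into
the global counter once the batch exceeds that CPU's share of the
space remaining below the hard limit:

#include <stdatomic.h>
#include <stdio.h>

#define NCPUS			4
#define SPACE_LIMIT		1024	/* background push threshold */
#define BLOCKING_LIMIT		4096	/* hard throttle limit */

static atomic_int global_used;		/* plays the role of ctx->space_used */
static int pcp_used[NCPUS];		/* plays the role of cilpcp->space_used */

static void account(int cpu, int len)
{
	int global = atomic_load(&global_used);

	pcp_used[cpu] += len;

	/*
	 * Fold the local batch into the global counter when we are over
	 * the push threshold, or when the batch exceeds this CPU's share
	 * of the space remaining below the hard limit.
	 */
	if (global >= SPACE_LIMIT ||
	    pcp_used[cpu] > (BLOCKING_LIMIT - global) / NCPUS) {
		atomic_fetch_add(&global_used, pcp_used[cpu]);
		pcp_used[cpu] = 0;
	}
}

int main(void)
{
	for (int i = 0; i < 100; i++)
		account(i % NCPUS, 64);
	printf("global space used: %d\n", atomic_load(&global_used));
	return 0;
}

The closer the global count gets to the hard limit, the smaller each
CPU's fold threshold becomes, so accounting gets more accurate exactly
when accuracy matters.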

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index ba1c6979a4c7..72693fba929b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
 	return ctx;
 }
 
+/*
+ * Aggregate the CIL per cpu structures into global counts, lists, etc and
+ * clear the percpu state ready for the next context to use.
+ */
+static void
+xlog_cil_pcp_aggregate(
+	struct xfs_cil		*cil,
+	struct xfs_cil_ctx	*ctx)
+{
+	struct xlog_cil_pcp	*cilpcp;
+	int			cpu;
+
+	for_each_online_cpu(cpu) {
+		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+		cilpcp->space_used = 0;
+	}
+}
+
 static void
 xlog_cil_ctx_switch(
 	struct xfs_cil		*cil,
@@ -433,6 +451,8 @@ xlog_cil_insert_items(
 	struct xfs_log_item	*lip;
 	int			len = 0;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
+	int			space_used;
+	struct xlog_cil_pcp	*cilpcp;
 
 	ASSERT(tp);
 
@@ -469,8 +489,9 @@ xlog_cil_insert_items(
 	 *
 	 * This can steal more than we need, but that's OK.
 	 */
+	space_used = atomic_read(&ctx->space_used);
 	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
-	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
 		int	split_res = log->l_iclog_hsize +
 					sizeof(struct xlog_op_header);
 		if (ctx_res)
@@ -480,16 +501,34 @@ xlog_cil_insert_items(
 		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
 	}
 
+	/*
+	 * Update the CIL percpu pointer. This updates the global counter when
+	 * over the percpu batch size or when the CIL is over the space limit.
+	 * This means low lock overhead for normal updates, and when over the
+	 * limit the space used is immediately accounted. This makes enforcing
+	 * the hard limit much more accurate. The per cpu fold threshold is
+	 * based on how close we are to the hard limit.
+	 */
+	cilpcp = get_cpu_ptr(cil->xc_pcp);
+	cilpcp->space_used += len;
+	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
+	    cilpcp->space_used >
+			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
+					num_online_cpus())) {
+		atomic_add(cilpcp->space_used, &ctx->space_used);
+		cilpcp->space_used = 0;
+	}
+	put_cpu_ptr(cilpcp);
+
 	spin_lock(&cil->xc_cil_lock);
-	tp->t_ticket->t_curr_res -= ctx_res + len;
 	ctx->ticket->t_unit_res += ctx_res;
 	ctx->ticket->t_curr_res += ctx_res;
-	ctx->space_used += len;
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
 	 * the log items. Shutdown is imminent...
 	 */
+	tp->t_ticket->t_curr_res -= ctx_res + len;
 	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
 		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
 		xfs_warn(log->l_mp,
@@ -846,6 +885,8 @@ xlog_cil_push_work(
 	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
 				&bdev_flush);
 
+	xlog_cil_pcp_aggregate(cil, ctx);
+
 	/*
 	 * Pull all the log vectors off the items in the CIL, and remove the
 	 * items from the CIL. We don't need the CIL lock here because it's only
@@ -1043,6 +1084,7 @@ xlog_cil_push_background(
 	struct xlog	*log) __releases(cil->xc_ctx_lock)
 {
 	struct xfs_cil	*cil = log->l_cilp;
+	int		space_used = atomic_read(&cil->xc_ctx->space_used);
 
 	/*
 	 * The cil won't be empty because we are called while holding the
@@ -1055,7 +1097,7 @@ xlog_cil_push_background(
 	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
-	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
@@ -1084,10 +1126,10 @@ xlog_cil_push_background(
 	 * The ctx->xc_push_lock provides the serialisation necessary for safely
 	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
 	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
-		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+		ASSERT(space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
 		return;
 	}
@@ -1391,9 +1433,14 @@ xlog_cil_pcp_dead(
 
 	spin_lock(&xlog_cil_pcp_lock);
 	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
+		struct xlog_cil_pcp	*cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+
 		spin_unlock(&xlog_cil_pcp_lock);
 		down_write(&cil->xc_ctx_lock);
-		/* move stuff on dead CPU to context */
+
+		atomic_add(cilpcp->space_used, &cil->xc_ctx->space_used);
+		cilpcp->space_used = 0;
+
 		up_write(&cil->xc_ctx_lock);
 		spin_lock(&xlog_cil_pcp_lock);
 	}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index aaa1e7f7fb66..7dc6275818de 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -218,7 +218,7 @@ struct xfs_cil_ctx {
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
-	int			space_used;	/* aggregate size of regions */
+	atomic_t		space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (29 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 30/39] xfs: implement percpu cil space used calculation Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:48   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 32/39] xfs: convert CIL busy extents to per-cpu Dave Chinner
                   ` (7 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To get it out from under the cil spinlock.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 20 +++++++++++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 72693fba929b..4ddc302a766b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -90,6 +90,10 @@ xlog_cil_pcp_aggregate(
 
 	for_each_online_cpu(cpu) {
 		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+
+		ctx->ticket->t_curr_res += cilpcp->space_reserved;
+		ctx->ticket->t_unit_res += cilpcp->space_reserved;
+		cilpcp->space_reserved = 0;
 		cilpcp->space_used = 0;
 	}
 }
@@ -510,6 +514,7 @@ xlog_cil_insert_items(
 	 * based on how close we are to the hard limit.
 	 */
 	cilpcp = get_cpu_ptr(cil->xc_pcp);
+	cilpcp->space_reserved += ctx_res;
 	cilpcp->space_used += len;
 	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
 	    cilpcp->space_used >
@@ -520,10 +525,6 @@ xlog_cil_insert_items(
 	}
 	put_cpu_ptr(cilpcp);
 
-	spin_lock(&cil->xc_cil_lock);
-	ctx->ticket->t_unit_res += ctx_res;
-	ctx->ticket->t_curr_res += ctx_res;
-
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
 	 * the log items. Shutdown is imminent...
@@ -545,6 +546,7 @@ xlog_cil_insert_items(
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
 	 */
+	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
 		/* Skip items which aren't dirty in this transaction. */
@@ -1434,12 +1436,20 @@ xlog_cil_pcp_dead(
 	spin_lock(&xlog_cil_pcp_lock);
 	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
 		struct xlog_cil_pcp	*cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
+		struct xfs_cil_ctx	*ctx;
 
 		spin_unlock(&xlog_cil_pcp_lock);
 		down_write(&cil->xc_ctx_lock);
+		ctx = cil->xc_ctx;
+
+		atomic_add(cilpcp->space_used, &ctx->space_used);
+		if (ctx->ticket) {
+			ctx->ticket->t_curr_res += cilpcp->space_reserved;
+			ctx->ticket->t_unit_res += cilpcp->space_reserved;
+		}
 
-		atomic_add(cilpcp->space_used, &cil->xc_ctx->space_used);
 		cilpcp->space_used = 0;
+		cilpcp->space_reserved = 0;
 
 		up_write(&cil->xc_ctx_lock);
 		spin_lock(&xlog_cil_pcp_lock);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 7dc6275818de..b80cb3a0edb7 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -232,7 +232,7 @@ struct xfs_cil_ctx {
  */
 struct xlog_cil_pcp {
 	uint32_t		space_used;
-	uint32_t		curr_res;
+	uint32_t		space_reserved;
 	struct list_head	busy_extents;
 	struct list_head	log_items;
 };
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 32/39] xfs: convert CIL busy extents to per-cpu
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (30 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 18:49   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 33/39] xfs: Add order IDs to log items in CIL Dave Chinner
                   ` (6 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To get them out from under the CIL lock.

This is an unordered list, so we can simply punt it to per-cpu lists
during transaction commits and reaggregate it back into a single
list during the CIL push work.
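
As a rough illustration of that punt-and-reaggregate pattern, here is
a self-contained userspace sketch. Hand-rolled singly linked lists
stand in for the kernel's list_head and busy extent structures; none
of the names below are the real implementation:

#include <stdio.h>

struct node { struct node *next; int id; };

/* splice the whole of *src onto the head of *dst and empty *src */
static void splice(struct node **dst, struct node **src)
{
	struct node *n = *src;

	if (!n)
		return;
	while (n->next)
		n = n->next;
	n->next = *dst;
	*dst = *src;
	*src = NULL;
}

int main(void)
{
	struct node a = { NULL, 1 }, b = { NULL, 2 }, c = { NULL, 3 };
	struct node *pcp[2] = { NULL, NULL };	/* per-cpu lists */
	struct node *ctx = NULL;		/* aggregated list */

	/* commits punt busy extents onto the local cpu list... */
	a.next = pcp[0]; pcp[0] = &a;
	b.next = pcp[1]; pcp[1] = &b;
	c.next = pcp[1]; pcp[1] = &c;

	/* ...and the push work reaggregates them in any order */
	for (int cpu = 0; cpu < 2; cpu++)
		splice(&ctx, &pcp[cpu]);

	for (struct node *n = ctx; n; n = n->next)
		printf("%d\n", n->id);
	return 0;
}

The later conversion of the CIL item list itself follows the same
shape, with an ordering pass added on top because that list is not
order-independent.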

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 4ddc302a766b..b12a2f9ba23a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -93,6 +93,11 @@ xlog_cil_pcp_aggregate(
 
 		ctx->ticket->t_curr_res += cilpcp->space_reserved;
 		ctx->ticket->t_unit_res += cilpcp->space_reserved;
+		if (!list_empty(&cilpcp->busy_extents)) {
+			list_splice_init(&cilpcp->busy_extents,
+					&ctx->busy_extents);
+		}
+
 		cilpcp->space_reserved = 0;
 		cilpcp->space_used = 0;
 	}
@@ -523,6 +528,9 @@ xlog_cil_insert_items(
 		atomic_add(cilpcp->space_used, &ctx->space_used);
 		cilpcp->space_used = 0;
 	}
+	/* attach the transaction to the CIL if it has any busy extents */
+	if (!list_empty(&tp->t_busy))
+		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
 	put_cpu_ptr(cilpcp);
 
 	/*
@@ -562,9 +570,6 @@ xlog_cil_insert_items(
 			list_move_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
-	/* attach the transaction to the CIL if it has any busy extents */
-	if (!list_empty(&tp->t_busy))
-		list_splice_init(&tp->t_busy, &ctx->busy_extents);
 	spin_unlock(&cil->xc_cil_lock);
 
 	if (tp->t_ticket->t_curr_res < 0)
@@ -1447,6 +1452,10 @@ xlog_cil_pcp_dead(
 			ctx->ticket->t_curr_res += cilpcp->space_reserved;
 			ctx->ticket->t_unit_res += cilpcp->space_reserved;
 		}
+		if (!list_empty(&cilpcp->busy_extents)) {
+			list_splice_init(&cilpcp->busy_extents,
+					&ctx->busy_extents);
+		}
 
 		cilpcp->space_used = 0;
 		cilpcp->space_reserved = 0;
@@ -1502,7 +1511,9 @@ static void __percpu *
 xlog_cil_pcp_alloc(
 	struct xfs_cil		*cil)
 {
+	struct xlog_cil_pcp	*cilpcp;
 	void __percpu		*pcp;
+	int			cpu;
 
 	pcp = alloc_percpu(struct xlog_cil_pcp);
 	if (!pcp)
@@ -1512,6 +1523,11 @@ xlog_cil_pcp_alloc(
 		free_percpu(pcp);
 		return NULL;
 	}
+
+	for_each_possible_cpu(cpu) {
+		cilpcp = per_cpu_ptr(pcp, cpu);
+		INIT_LIST_HEAD(&cilpcp->busy_extents);
+	}
 	return pcp;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (31 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 32/39] xfs: convert CIL busy extents to per-cpu Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 19:00   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 34/39] xfs: convert CIL to unordered per cpu lists Dave Chinner
                   ` (5 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Before we split the ordered CIL up into per cpu lists, we need a
mechanism to track the order of the items in the CIL. We need to do
this because there are rules around the order in which related items
must physically appear in the log even inside a single checkpoint
transaction.

An example of this is intents - an intent must appear in the log
before its intent done record so that log recovery can cancel the
intent correctly. If we have these two records misordered in the
CIL, then they will not be recovered correctly by journal replay.

We will also no longer be able to move items to the tail of
the CIL list when they are relogged, hence the log items will need
some mechanism to allow the correct log item order to be recreated
before we write the log items to the journal.

Hence we need a mechanism for recording the global order of
transactions in the log items so that we can recover that order
from unordered per-cpu lists.

Do this with a simple monotonic increasing commit counter in the CIL
context. Each log item in the transaction gets stamped with the
current commit order ID before it is added to the CIL. If the item
is already in the CIL, leave it where it is instead of moving it to
the tail of the list and instead sort the list before we start the
push work.
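
To make the ordering rule concrete, here is a hedged userspace sketch
using plain arrays and a stable insertion sort in place of the CIL
lists and list_sort() (which is also stable). The item names are
illustrative only:

#include <stdio.h>
#include <stdint.h>

struct item {
	uint32_t	order_id;	/* stamped at commit time */
	const char	*name;
};

/*
 * Stable sort: items with equal order IDs (i.e. from the same
 * transaction) keep their insertion order, which is exactly what the
 * CIL comparator must preserve.
 */
static void sort_items(struct item *it, int n)
{
	for (int i = 1; i < n; i++) {
		struct item key = it[i];
		int j = i - 1;

		while (j >= 0 && it[j].order_id > key.order_id) {
			it[j + 1] = it[j];
			j--;
		}
		it[j + 1] = key;
	}
}

int main(void)
{
	/* per-cpu insertion can interleave transactions arbitrarily */
	struct item cil[] = {
		{ 2, "CUI" }, { 1, "inode" }, { 2, "BUI" }, { 1, "buf" },
	};

	sort_items(cil, 4);
	for (int i = 0; i < 4; i++)
		printf("%u %s\n", cil[i].order_id, cil[i].name);
	return 0;
}

Items stamped with the same order ID come from the same transaction,
so a stable sort preserves their relative order while restoring the
global transaction order across the per-cpu lists.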

XXX: list_sort() under the cil_ctx_lock held exclusive starts
hurting at >16 threads. Front end commits end up waiting much longer
on the push to switch contexts. The item order ID should likely be
moved into the logvecs when they are detached from the items; then
the sort can be done on the logvec chain after the cil_ctx_lock has
been released. Logvecs will need to use a list_head for this rather
than a singly linked list like they do now....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 38 ++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_log_priv.h |  1 +
 fs/xfs/xfs_trans.h    |  1 +
 3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b12a2f9ba23a..ca6e411e388e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -461,6 +461,7 @@ xlog_cil_insert_items(
 	int			len = 0;
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 	int			space_used;
+	int			order;
 	struct xlog_cil_pcp	*cilpcp;
 
 	ASSERT(tp);
@@ -550,10 +551,12 @@ xlog_cil_insert_items(
 	}
 
 	/*
-	 * Now (re-)position everything modified at the tail of the CIL.
+	 * Now update the order of everything modified in the transaction
+	 * and insert items into the CIL if they aren't already there.
 	 * We do this here so we only need to take the CIL lock once during
 	 * the transaction commit.
 	 */
+	order = atomic_inc_return(&ctx->order_id);
 	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
@@ -561,13 +564,10 @@ xlog_cil_insert_items(
 		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
 			continue;
 
-		/*
-		 * Only move the item if it isn't already at the tail. This is
-		 * to prevent a transient list_empty() state when reinserting
-		 * an item that is already the only item in the CIL.
-		 */
-		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
-			list_move_tail(&lip->li_cil, &cil->xc_cil);
+		lip->li_order_id = order;
+		if (!list_empty(&lip->li_cil))
+			continue;
+		list_add_tail(&lip->li_cil, &cil->xc_cil);
 	}
 
 	spin_unlock(&cil->xc_cil_lock);
@@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
 	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
+/*
+ * CIL item reordering compare function. We want to order in ascending ID order,
+ * but we want to leave items with the same ID in the order they were added to
+ * the list. This is important for operations like reflink where we log 4 order
+ * dependent intents in a single transaction when we overwrite an existing
+ * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
+ * CUI(inc), BUI(remap)...
+ */
+static int
+xlog_cil_order_cmp(
+	void			*priv,
+	const struct list_head	*a,
+	const struct list_head	*b)
+{
+	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
+	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
+
+	return l1->li_order_id > l2->li_order_id;
+}
+
 /*
  * Push the Committed Item List to the log.
  *
@@ -900,6 +920,7 @@ xlog_cil_push_work(
 	 * needed on the transaction commit side which is currently locked out
 	 * by the flush lock.
 	 */
+	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
 	lv = NULL;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
@@ -907,6 +928,7 @@ xlog_cil_push_work(
 		item = list_first_entry(&cil->xc_cil,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
+		item->li_order_id = 0;
 		if (!ctx->lv_chain)
 			ctx->lv_chain = item->li_lv;
 		else
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b80cb3a0edb7..466862a943ba 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -225,6 +225,7 @@ struct xfs_cil_ctx {
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
 	struct work_struct	push_work;
+	atomic_t		order_id;
 };
 
 /*
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 50da47f23a07..2d1cc1ff93c7 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -44,6 +44,7 @@ struct xfs_log_item {
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
 	xfs_csn_t			li_seq;		/* CIL commit seq */
+	uint32_t			li_order_id;	/* CIL commit order */
 };
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 34/39] xfs: convert CIL to unordered per cpu lists
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (32 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 33/39] xfs: Add order IDs to log items in CIL Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 19:03   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 35/39] xfs: convert log vector chain to use list heads Dave Chinner
                   ` (4 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

So that we can remove the cil_lock which is a global serialisation
point. We've already got ordering sorted, so all we need to do is
treat the CIL list like the busy extent list and reconstruct it
before the push starts.

This is what we're trying to avoid:

 -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
    - 46.35% xfs_log_commit_cil
       - 41.54% _raw_spin_lock
          - 67.30% do_raw_spin_lock
               66.96% __pv_queued_spin_lock_slowpath

Which happens on a 32p system when running a 32-way 'rm -rf'
workload. After this patch:

-   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
   - 17.67% xfs_log_commit_cil
      - 6.51% xfs_log_ticket_ungrant
           1.40% xfs_log_space_wake
        2.32% memcpy_erms
      - 2.18% xfs_buf_item_committing
         - 2.12% xfs_buf_item_release
            - 1.03% xfs_buf_unlock
                 0.96% up
              0.72% xfs_buf_rele
        1.33% xfs_inode_item_format
        1.19% down_read
        0.91% up_read
        0.76% xfs_buf_item_format
      - 0.68% kmem_alloc_large
         - 0.67% kmem_alloc
              0.64% __kmalloc
        0.50% xfs_buf_item_size

It kinda looks like the workload is running out of log space all
the time. But all the spinlock contention is gone and the
transaction commit rate has gone from 800k/s to 1.3M/s so the amount
of real work being done has gone up a *lot*.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c  | 69 +++++++++++++++++++------------------------
 fs/xfs/xfs_log_priv.h |  3 +-
 2 files changed, 31 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index ca6e411e388e..287dc7d0d508 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
 	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
+	INIT_LIST_HEAD(&ctx->log_items);
 	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
 	return ctx;
 }
@@ -97,6 +98,8 @@ xlog_cil_pcp_aggregate(
 			list_splice_init(&cilpcp->busy_extents,
 					&ctx->busy_extents);
 		}
+		if (!list_empty(&cilpcp->log_items))
+			list_splice_init(&cilpcp->log_items, &ctx->log_items);
 
 		cilpcp->space_reserved = 0;
 		cilpcp->space_used = 0;
@@ -475,10 +478,9 @@ xlog_cil_insert_items(
 	/*
 	 * We need to take the CIL checkpoint unit reservation on the first
 	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
-	 * unnecessarily do an atomic op in the fast path here. We don't need to
-	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
-	 * under the xc_ctx_lock here and that needs to be held exclusively to
-	 * reset the XLOG_CIL_EMPTY bit.
+	 * unnecessarily do an atomic op in the fast path here. We can clear the
+	 * XLOG_CIL_EMPTY bit as we are under the xc_ctx_lock here and that
+	 * needs to be held exclusively to reset the XLOG_CIL_EMPTY bit.
 	 */
 	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
 	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
@@ -532,24 +534,6 @@ xlog_cil_insert_items(
 	/* attach the transaction to the CIL if it has any busy extents */
 	if (!list_empty(&tp->t_busy))
 		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
-	put_cpu_ptr(cilpcp);
-
-	/*
-	 * If we've overrun the reservation, dump the tx details before we move
-	 * the log items. Shutdown is imminent...
-	 */
-	tp->t_ticket->t_curr_res -= ctx_res + len;
-	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
-		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
-		xfs_warn(log->l_mp,
-			 "  log items: %d bytes (iov hdrs: %d bytes)",
-			 len, iovhdr_res);
-		xfs_warn(log->l_mp, "  split region headers: %d bytes",
-			 split_res);
-		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
-		xlog_print_trans(tp);
-	}
-
 	/*
 	 * Now update the order of everything modified in the transaction
 	 * and insert items into the CIL if they aren't already there.
@@ -557,7 +541,6 @@ xlog_cil_insert_items(
 	 * the transaction commit.
 	 */
 	order = atomic_inc_return(&ctx->order_id);
-	spin_lock(&cil->xc_cil_lock);
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 
 		/* Skip items which aren't dirty in this transaction. */
@@ -567,10 +550,25 @@ xlog_cil_insert_items(
 		lip->li_order_id = order;
 		if (!list_empty(&lip->li_cil))
 			continue;
-		list_add_tail(&lip->li_cil, &cil->xc_cil);
+		list_add_tail(&lip->li_cil, &cilpcp->log_items);
 	}
+	put_cpu_ptr(cilpcp);
 
-	spin_unlock(&cil->xc_cil_lock);
+	/*
+	 * If we've overrun the reservation, dump the tx details before we move
+	 * the log items. Shutdown is imminent...
+	 */
+	tp->t_ticket->t_curr_res -= ctx_res + len;
+	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
+		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
+		xfs_warn(log->l_mp,
+			 "  log items: %d bytes (iov hdrs: %d bytes)",
+			 len, iovhdr_res);
+		xfs_warn(log->l_mp, "  split region headers: %d bytes",
+			 split_res);
+		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
+		xlog_print_trans(tp);
+	}
 
 	if (tp->t_ticket->t_curr_res < 0)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -914,18 +912,12 @@ xlog_cil_push_work(
 
 	xlog_cil_pcp_aggregate(cil, ctx);
 
-	/*
-	 * Pull all the log vectors off the items in the CIL, and remove the
-	 * items from the CIL. We don't need the CIL lock here because it's only
-	 * needed on the transaction commit side which is currently locked out
-	 * by the flush lock.
-	 */
-	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
-	lv = NULL;
-	while (!list_empty(&cil->xc_cil)) {
+	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
+
+	while (!list_empty(&ctx->log_items)) {
 		struct xfs_log_item	*item;
 
-		item = list_first_entry(&cil->xc_cil,
+		item = list_first_entry(&ctx->log_items,
 					struct xfs_log_item, li_cil);
 		list_del_init(&item->li_cil);
 		item->li_order_id = 0;
@@ -1119,7 +1111,6 @@ xlog_cil_push_background(
 	 * The cil won't be empty because we are called while holding the
 	 * context lock so whatever we added to the CIL will still be there.
 	 */
-	ASSERT(!list_empty(&cil->xc_cil));
 	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
@@ -1478,6 +1469,8 @@ xlog_cil_pcp_dead(
 			list_splice_init(&cilpcp->busy_extents,
 					&ctx->busy_extents);
 		}
+		if (!list_empty(&cilpcp->log_items))
+			list_splice_init(&cilpcp->log_items, &ctx->log_items);
 
 		cilpcp->space_used = 0;
 		cilpcp->space_reserved = 0;
@@ -1549,6 +1542,7 @@ xlog_cil_pcp_alloc(
 	for_each_possible_cpu(cpu) {
 		cilpcp = per_cpu_ptr(pcp, cpu);
 		INIT_LIST_HEAD(&cilpcp->busy_extents);
+		INIT_LIST_HEAD(&cilpcp->log_items);
 	}
 	return pcp;
 }
@@ -1584,9 +1578,7 @@ xlog_cil_init(
 		return -ENOMEM;
 	}
 
-	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
-	spin_lock_init(&cil->xc_cil_lock);
 	spin_lock_init(&cil->xc_push_lock);
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
@@ -1612,7 +1604,6 @@ xlog_cil_destroy(
 		kmem_free(cil->xc_ctx);
 	}
 
-	ASSERT(list_empty(&cil->xc_cil));
 	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 	xlog_cil_pcp_free(cil, cil->xc_pcp);
 	kmem_free(cil);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 466862a943ba..d3bf3b367370 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -220,6 +220,7 @@ struct xfs_cil_ctx {
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
 	atomic_t		space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
+	struct list_head	log_items;	/* log items in chkpt */
 	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
@@ -258,8 +259,6 @@ struct xfs_cil {
 	struct xlog		*xc_log;
 	unsigned long		xc_flags;
 	atomic_t		xc_iclog_hdrs;
-	struct list_head	xc_cil;
-	spinlock_t		xc_cil_lock;
 
 	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
 	struct xfs_cil_ctx	*xc_ctx;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 35/39] xfs: convert log vector chain to use list heads
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (33 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 34/39] xfs: convert CIL to unordered per cpu lists Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 19:13   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 36/39] xfs: move CIL ordering to the logvec chain Dave Chinner
                   ` (3 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because the next change is going to require sorting log vectors, and
that requires arbitrary rearrangement of the list which cannot be
done easily with a singly linked list.
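
For illustration, a toy intrusive doubly linked list shows the
property we need here - O(1) removal and reinsertion anywhere in the
list. This is analogous to, but not, the kernel's list_head API:

#include <stdio.h>

struct lnode {
	struct lnode *prev, *next;
};

static void list_init(struct lnode *head)
{
	head->prev = head->next = head;
}

static void list_add_tail(struct lnode *n, struct lnode *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* O(1) unlink, no matter where the node sits in the list */
static void list_del(struct lnode *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

int main(void)
{
	struct lnode head, a, b, c;

	list_init(&head);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);
	list_add_tail(&c, &head);

	list_del(&b);			/* move b to the tail... */
	list_add_tail(&b, &head);	/* ...without walking the list */

	for (struct lnode *n = head.next; n != &head; n = n->next)
		printf("%s\n", n == &a ? "a" : n == &b ? "b" : "c");
	return 0;
}

With only a forward pointer, unlinking a vector from the middle of
the chain means walking from the head to find its predecessor, which
sorting would do over and over again.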

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c        | 35 +++++++++++++++++++++++++---------
 fs/xfs/xfs_log.h        |  2 +-
 fs/xfs/xfs_log_cil.c    | 42 +++++++++++++++++++++++------------------
 fs/xfs/xfs_log_priv.h   |  4 ++--
 fs/xfs/xfs_trans.c      |  4 ++--
 fs/xfs/xfs_trans_priv.h |  3 ++-
 6 files changed, 57 insertions(+), 33 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 77d9ea7daf26..5511c5de6b78 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -848,6 +848,9 @@ xlog_write_unmount_record(
 		.lv_niovecs = 1,
 		.lv_iovecp = &reg,
 	};
+	LIST_HEAD(lv_chain);
+	INIT_LIST_HEAD(&vec.lv_list);
+	list_add(&vec.lv_list, &lv_chain);
 
 	BUILD_BUG_ON((sizeof(struct xlog_op_header) +
 		      sizeof(struct xfs_unmount_log_format)) !=
@@ -863,7 +866,7 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, reg.i_len);
+	return xlog_write(log, &lv_chain, ticket, NULL, NULL, reg.i_len);
 }
 
 /*
@@ -1581,13 +1584,16 @@ xlog_commit_record(
 		.lv_iovecp = &reg,
 	};
 	int	error;
+	LIST_HEAD(lv_chain);
+	INIT_LIST_HEAD(&vec.lv_list);
+	list_add(&vec.lv_list, &lv_chain);
 
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, reg.i_len);
+	error = xlog_write(log, &lv_chain, ticket, lsn, iclog, reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2118,6 +2124,7 @@ xlog_print_trans(
  */
 static struct xfs_log_vec *
 xlog_write_single(
+	struct list_head	*lv_chain,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
 	struct xlog_in_core	*iclog,
@@ -2134,7 +2141,9 @@ xlog_write_single(
 		iclog->ic_state == XLOG_STATE_WANT_SYNC);
 
 	ptr = iclog->ic_datap + *log_offset;
-	for (lv = log_vector; lv; lv = lv->lv_next) {
+	for (lv = log_vector;
+	     !list_entry_is_head(lv, lv_chain, lv_list);
+	     lv = list_next_entry(lv, lv_list)) {
 		/*
 		 * If the entire log vec does not fit in the iclog, punt it to
 		 * the partial copy loop which can handle this case.
@@ -2163,6 +2172,8 @@ xlog_write_single(
 			*data_cnt += reg->i_len;
 		}
 	}
+	if (list_entry_is_head(lv, lv_chain, lv_list))
+		lv = NULL;
 	ASSERT(*len == 0 || lv);
 	return lv;
 }
@@ -2208,6 +2219,7 @@ xlog_write_get_more_iclog_space(
 static struct xfs_log_vec *
 xlog_write_partial(
 	struct xlog		*log,
+	struct list_head	*lv_chain,
 	struct xfs_log_vec	*log_vector,
 	struct xlog_ticket	*ticket,
 	struct xlog_in_core	**iclogp,
@@ -2347,7 +2359,10 @@ xlog_write_partial(
 	 * the caller so it can go back to fast path copying.
 	 */
 	*iclogp = iclog;
-	return lv->lv_next;
+	lv = list_next_entry(lv, lv_list);
+	if (list_entry_is_head(lv, lv_chain, lv_list))
+		return NULL;
+	return lv;
 }
 
 /*
@@ -2393,14 +2408,14 @@ xlog_write_partial(
 int
 xlog_write(
 	struct xlog		*log,
-	struct xfs_log_vec	*log_vector,
+	struct list_head	*lv_chain,
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
 	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
-	struct xfs_log_vec	*lv = log_vector;
+	struct xfs_log_vec	*lv;
 	int			record_cnt = 0;
 	int			data_cnt = 0;
 	int			error = 0;
@@ -2422,14 +2437,16 @@ xlog_write(
 	if (start_lsn)
 		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
+	lv = list_first_entry_or_null(lv_chain, struct xfs_log_vec, lv_list);
 	while (lv) {
-		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
+		lv = xlog_write_single(lv_chain, lv, ticket, iclog, &log_offset,
 					&len, &record_cnt, &data_cnt);
 		if (!lv)
 			break;
 
-		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
-					&len, &record_cnt, &data_cnt);
+		lv = xlog_write_partial(log, lv_chain, lv, ticket, &iclog,
+					&log_offset, &len, &record_cnt,
+					&data_cnt);
 		if (IS_ERR_OR_NULL(lv)) {
 			error = PTR_ERR_OR_ZERO(lv);
 			break;
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index af54ea3f8c90..b4ad0e37a0c5 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -9,7 +9,7 @@
 struct xfs_cil_ctx;
 
 struct xfs_log_vec {
-	struct xfs_log_vec	*lv_next;	/* next lv in build list */
+	struct list_head	lv_list;	/* CIL lv chain ptrs */
 	int			lv_niovecs;	/* number of iovecs in lv */
 	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
 	struct xfs_log_item	*lv_item;	/* owner */
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 287dc7d0d508..035f0a60040a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -73,6 +73,7 @@ xlog_cil_ctx_alloc(void)
 	INIT_LIST_HEAD(&ctx->committing);
 	INIT_LIST_HEAD(&ctx->busy_extents);
 	INIT_LIST_HEAD(&ctx->log_items);
+	INIT_LIST_HEAD(&ctx->lv_chain);
 	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
 	return ctx;
 }
@@ -267,6 +268,7 @@ xlog_cil_alloc_shadow_bufs(
 			lv = kmem_alloc_large(buf_size, KM_NOFS);
 			memset(lv, 0, xlog_cil_iovec_space(niovecs));
 
+			INIT_LIST_HEAD(&lv->lv_list);
 			lv->lv_item = lip;
 			lv->lv_size = buf_size;
 			if (ordered)
@@ -282,7 +284,6 @@ xlog_cil_alloc_shadow_bufs(
 			else
 				lv->lv_buf_len = 0;
 			lv->lv_bytes = 0;
-			lv->lv_next = NULL;
 		}
 
 		/* Ensure the lv is set up according to ->iop_size */
@@ -409,7 +410,6 @@ xlog_cil_insert_format_items(
 		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
 			/* same or smaller, optimise common overwrite case */
 			lv = lip->li_lv;
-			lv->lv_next = NULL;
 
 			if (ordered)
 				goto insert;
@@ -576,14 +576,14 @@ xlog_cil_insert_items(
 
 static void
 xlog_cil_free_logvec(
-	struct xfs_log_vec	*log_vector)
+	struct list_head	*lv_chain)
 {
 	struct xfs_log_vec	*lv;
 
-	for (lv = log_vector; lv; ) {
-		struct xfs_log_vec *next = lv->lv_next;
+	while (!list_empty(lv_chain)) {
+		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_list);
+		list_del_init(&lv->lv_list);
 		kmem_free(lv);
-		lv = next;
 	}
 }
 
@@ -682,7 +682,7 @@ xlog_cil_committed(
 		spin_unlock(&ctx->cil->xc_push_lock);
 	}
 
-	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
+	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain,
 					ctx->start_lsn, abort);
 
 	xfs_extent_busy_sort(&ctx->busy_extents);
@@ -693,7 +693,7 @@ xlog_cil_committed(
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_push_lock);
 
-	xlog_cil_free_logvec(ctx->lv_chain);
+	xlog_cil_free_logvec(&ctx->lv_chain);
 
 	if (!list_empty(&ctx->busy_extents))
 		xlog_discard_busy_extents(mp, ctx);
@@ -773,7 +773,6 @@ xlog_cil_build_trans_hdr(
 	lvhdr->lv_niovecs = 2;
 	lvhdr->lv_iovecp = &hdr->lhdr[0];
 	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
-	lvhdr->lv_next = ctx->lv_chain;
 
 	tic->t_curr_res -= lvhdr->lv_bytes;
 }
@@ -913,25 +912,23 @@ xlog_cil_push_work(
 	xlog_cil_pcp_aggregate(cil, ctx);
 
 	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
-
 	while (!list_empty(&ctx->log_items)) {
 		struct xfs_log_item	*item;
 
 		item = list_first_entry(&ctx->log_items,
 					struct xfs_log_item, li_cil);
+		lv = item->li_lv;
 		list_del_init(&item->li_cil);
 		item->li_order_id = 0;
-		if (!ctx->lv_chain)
-			ctx->lv_chain = item->li_lv;
-		else
-			lv->lv_next = item->li_lv;
-		lv = item->li_lv;
 		item->li_lv = NULL;
-		num_iovecs += lv->lv_niovecs;
 
+		num_iovecs += lv->lv_niovecs;
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
 			num_bytes += lv->lv_bytes;
+
+		list_add_tail(&lv->lv_list, &ctx->lv_chain);
+
 	}
 
 	/*
@@ -968,10 +965,13 @@ xlog_cil_push_work(
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
 	 * transaction header here as it is not accounted for in xlog_write().
+	 * Add the lvhdr to the head of the lv chain we pass to xlog_write() so
+	 * it gets written into the iclog first.
 	 */
 	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
 	num_iovecs += lvhdr.lv_niovecs;
 	num_bytes += lvhdr.lv_bytes;
+	list_add(&lvhdr.lv_list, &ctx->lv_chain);
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -985,8 +985,14 @@ xlog_cil_push_work(
 	 * use the commit record lsn then we can move the tail beyond the grant
 	 * write head.
 	 */
-	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				num_bytes);
+	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
+				NULL, num_bytes);
+
+	/*
+	 * Take the lvhdr back off the lv_chain as it should not be passed
+	 * to log IO completion.
+	 */
+	list_del(&lvhdr.lv_list);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index d3bf3b367370..071367a96d8d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -221,7 +221,7 @@ struct xfs_cil_ctx {
 	atomic_t		space_used;	/* aggregate size of regions */
 	struct list_head	busy_extents;	/* busy extents in chkpt */
 	struct list_head	log_items;	/* log items in chkpt */
-	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
+	struct list_head	lv_chain;	/* logvecs being pushed */
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
@@ -477,7 +477,7 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
 
 void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
-int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
+int	xlog_write(struct xlog *log, struct list_head *lv_chain,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
 		struct xlog_in_core **commit_iclog, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bc72826d1f97..0f8300adb12d 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -735,7 +735,7 @@ xfs_log_item_batch_insert(
 void
 xfs_trans_committed_bulk(
 	struct xfs_ail		*ailp,
-	struct xfs_log_vec	*log_vector,
+	struct list_head	*lv_chain,
 	xfs_lsn_t		commit_lsn,
 	bool			aborted)
 {
@@ -750,7 +750,7 @@ xfs_trans_committed_bulk(
 	spin_unlock(&ailp->ail_lock);
 
 	/* unpin all the log items */
-	for (lv = log_vector; lv; lv = lv->lv_next ) {
+	list_for_each_entry(lv, lv_chain, lv_list) {
 		struct xfs_log_item	*lip = lv->lv_item;
 		xfs_lsn_t		item_lsn;
 
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 3004aeac9110..fc8667c728e3 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -18,7 +18,8 @@ void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
 void	xfs_trans_del_item(struct xfs_log_item *);
 void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
 
-void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
+void	xfs_trans_committed_bulk(struct xfs_ail *ailp,
+				struct list_head *lv_chain,
 				xfs_lsn_t commit_lsn, bool aborted);
 /*
  * AIL traversal cursor.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 36/39] xfs: move CIL ordering to the logvec chain
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (34 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 35/39] xfs: convert log vector chain to use list heads Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 19:14   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 37/39] xfs: avoid cil push lock if possible Dave Chinner
                   ` (2 subsequent siblings)
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Adding a list_sort() call to the CIL push work while the xc_ctx_lock
is held exclusively has resulted in fairly long lock hold times and
that stops all front end transaction commits from making progress.

We can move the sorting out of the xc_ctx_lock if we can transfer
the ordering information to the log vectors as they are detached
from the log items and then we can sort the log vectors.  With these
changes, we can move the list_sort() call to just before we call
xlog_write() when we aren't holding any locks at all.
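
A minimal sketch of that lock scoping, with pthreads and qsort()
standing in for the kernel primitives. The point is where the sort
runs relative to the lock; note that list_sort() in the kernel is
stable while qsort() is not, so this models only the lock scope, not
the ordering semantics:

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

static pthread_rwlock_t ctx_lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_ids[4] = { 3, 1, 4, 2 };

static int cmp(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}

static void push_work(void)
{
	int chain[4];

	/* exclusive section: just detach the items into a private chain */
	pthread_rwlock_wrlock(&ctx_lock);
	for (int i = 0; i < 4; i++)
		chain[i] = shared_ids[i];
	pthread_rwlock_unlock(&ctx_lock);

	/* the expensive sort now runs with no locks held */
	qsort(chain, 4, sizeof(int), cmp);
	for (int i = 0; i < 4; i++)
		printf("%d\n", chain[i]);
}

int main(void)
{
	push_work();
	return 0;
}

Front end commits only ever contend with the cheap detach loop, not
with the O(n log n) sort.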

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.h     |  1 +
 fs/xfs/xfs_log_cil.c | 23 ++++++++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index b4ad0e37a0c5..93aaee7c276e 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -10,6 +10,7 @@ struct xfs_cil_ctx;
 
 struct xfs_log_vec {
 	struct list_head	lv_list;	/* CIL lv chain ptrs */
+	uint32_t		lv_order_id;	/* chain ordering info */
 	int			lv_niovecs;	/* number of iovecs in lv */
 	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
 	struct xfs_log_item	*lv_item;	/* owner */
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 035f0a60040a..cfd3128399f6 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -791,10 +791,10 @@ xlog_cil_order_cmp(
 	const struct list_head	*a,
 	const struct list_head	*b)
 {
-	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
-	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
+	struct xfs_log_vec	*l1 = container_of(a, struct xfs_log_vec, lv_list);
+	struct xfs_log_vec	*l2 = container_of(b, struct xfs_log_vec, lv_list);
 
-	return l1->li_order_id > l2->li_order_id;
+	return l1->lv_order_id > l2->lv_order_id;
 }
 
 /*
@@ -911,24 +911,22 @@ xlog_cil_push_work(
 
 	xlog_cil_pcp_aggregate(cil, ctx);
 
-	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
 	while (!list_empty(&ctx->log_items)) {
 		struct xfs_log_item	*item;
 
 		item = list_first_entry(&ctx->log_items,
 					struct xfs_log_item, li_cil);
 		lv = item->li_lv;
-		list_del_init(&item->li_cil);
-		item->li_order_id = 0;
-		item->li_lv = NULL;
-
+		lv->lv_order_id = item->li_order_id;
 		num_iovecs += lv->lv_niovecs;
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
 			num_bytes += lv->lv_bytes;
 
 		list_add_tail(&lv->lv_list, &ctx->lv_chain);
-
+		list_del_init(&item->li_cil);
+		item->li_order_id = 0;
+		item->li_lv = NULL;
 	}
 
 	/*
@@ -961,6 +959,13 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
 
+	/*
+	 * Sort the log vector chain before we add the transaction headers.
+	 * This ensures we always have the transaction headers at the start
+	 * of the chain.
+	 */
+	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
+
 	/*
 	 * Build a checkpoint transaction header and write it to the log to
 	 * begin the transaction. We need to account for the space used by the
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 37/39] xfs: avoid cil push lock if possible
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (35 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 36/39] xfs: move CIL ordering to the logvec chain Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 19:18   ` Darrick J. Wong
  2021-05-19 12:13 ` [PATCH 38/39] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
  2021-05-19 12:13 ` [PATCH 39/39] xfs: expanding delayed logging design with background material Dave Chinner
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because taking the CIL push lock on every commit now hurts when the
CIL fills up.

  - 37.20% __xfs_trans_commit
      - 35.84% xfs_log_commit_cil
         - 19.34% _raw_spin_lock
            - do_raw_spin_lock
                 19.01% __pv_queued_spin_lock_slowpath
         - 4.20% xfs_log_ticket_ungrant
              0.90% xfs_log_space_wake


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index cfd3128399f6..672cbaa4606c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1125,10 +1125,18 @@ xlog_cil_push_background(
 	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
 
 	/*
-	 * Don't do a background push if we haven't used up all the
-	 * space available yet.
+	 * We are done if:
+	 * - we haven't used up all the space available yet; or
+	 * - we've already queued up a push; and
+	 * - we're not over the hard limit; and
+	 * - nothing has been over the hard limit.
+	 *
+	 * If so, we don't need to take the push lock as there's nothing to do.
 	 */
-	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
+	if (space_used < XLOG_CIL_SPACE_LIMIT(log) ||
+	    (cil->xc_push_seq == cil->xc_current_sequence &&
+	     space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) &&
+	     !waitqueue_active(&cil->xc_push_wait))) {
 		up_read(&cil->xc_ctx_lock);
 		return;
 	}
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 38/39] xfs: xlog_sync() manually adjusts grant head space
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (36 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 37/39] xfs: avoid cil push lock if possible Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-19 12:13 ` [PATCH 39/39] xfs: expanding delayed logging design with background material Dave Chinner
  38 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When xlog_sync() rounds off the tail of the iclog that is being
flushed, it manually subtracts that space from the grant heads. This
space is actually reserved by the transaction ticket that covers
the xlog_sync() call from xlog_write(), but we don't plumb the
ticket down far enough for xlog_sync() to account the space
consumed to the current log ticket.

The grant heads are hot, so we really should be accounting this to
the ticket if we can, rather than adding thousands of extra grant
head updates every CIL commit.

Interestingly, this actually indicates a potential log space overrun
can occur when we force the log. By the time that xfs_log_force()
pushes out an active iclog and consumes the roundoff space, the
reservation for that roundoff space has been returned to the grant
heads and is no longer covered by a reservation. In theory the
roundoff added to log force on an already full log could push the
write head past the tail. In practice, the CIL commit that writes to
the log and needs the iclog pushed will have reserved space for
roundoff, so when it releases the ticket there will still be
physical space for the roundoff to be committed to the log, even
though it is no longer reserved. This roundoff won't be enough space
to allow a transaction to be woken if the log is full, so overruns
should not actually occur in practice.

That said, it indicates that we should not release the CIL context
log ticket until after we've released the commit iclog. It also
means that xlog_sync() still needs the direct grant head
manipulation if we don't provide it with a ticket. Log forces are
rare when we are in fast paths running 1.5 million transactions/s
that make the grant heads hot, so let's optimise the hot case and
pass CIL log tickets down to the xlog_sync() code.
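
A small userspace sketch of the accounting choice follows. The stripe
unit value and the ticket/grant types are invented stand-ins for the
real log structures:

#include <stdio.h>

#define STRIPE_UNIT	32768	/* assumed log stripe unit, in bytes */

struct ticket { int t_curr_res; };
struct grant { long reserve, write; };

static void sync_roundoff(struct ticket *tic, struct grant *g, int offset)
{
	/* round the iclog data up to the next stripe unit boundary */
	int count = ((offset + STRIPE_UNIT - 1) / STRIPE_UNIT) * STRIPE_UNIT;
	int roundoff = count - offset;

	if (tic) {
		/* CIL path: charge the roundoff to the ticket we hold */
		tic->t_curr_res -= roundoff;
	} else {
		/* log force path: no ticket, move the grant heads directly */
		g->reserve += roundoff;
		g->write += roundoff;
	}
}

int main(void)
{
	struct ticket tic = { .t_curr_res = 16384 };
	struct grant g = { 0, 0 };

	sync_roundoff(&tic, &g, 20000);	/* ticket absorbs the roundoff */
	sync_roundoff(NULL, &g, 20000);	/* force path moves the heads */
	printf("ticket res %d, grant heads %ld/%ld\n",
	       tic.t_curr_res, g.reserve, g.write);
	return 0;
}

Only the rare ticketless log force path ever touches the hot shared
grant heads; the common CIL commit path stays on private ticket state.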

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c      | 39 +++++++++++++++++++++++++--------------
 fs/xfs/xfs_log_cil.c  | 19 ++++++++++++++-----
 fs/xfs/xfs_log_priv.h |  3 ++-
 3 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 5511c5de6b78..63e2358f160a 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -55,7 +55,8 @@ xlog_grant_push_ail(
 STATIC void
 xlog_sync(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog);
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket);
 #if defined(DEBUG)
 STATIC void
 xlog_verify_dest_ptr(
@@ -537,7 +538,8 @@ __xlog_state_release_iclog(
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket)
 {
 	lockdep_assert_held(&log->l_icloglock);
 
@@ -547,7 +549,7 @@ xlog_state_release_iclog(
 	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
 	    __xlog_state_release_iclog(log, iclog)) {
 		spin_unlock(&log->l_icloglock);
-		xlog_sync(log, iclog);
+		xlog_sync(log, iclog, ticket);
 		spin_lock(&log->l_icloglock);
 	}
 
@@ -908,7 +910,7 @@ xlog_unmount_write(
 	 * iclog containing the unmount record is written.
 	 */
 	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, tic);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -1939,7 +1941,8 @@ xlog_calc_iclog_size(
 STATIC void
 xlog_sync(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	struct xlog_ticket	*ticket)
 {
 	unsigned int		count;		/* byte count of bwrite */
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
@@ -1950,12 +1953,20 @@ xlog_sync(
 
 	count = xlog_calc_iclog_size(log, iclog, &roundoff);
 
-	/* move grant heads by roundoff in sync */
-	xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
-	xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	/*
+	 * If we have a ticket, account for the roundoff via the ticket
+	 * reservation to avoid touching the hot grant heads needlessly.
+	 * Otherwise, we have to move grant heads directly.
+	 */
+	if (ticket) {
+		ticket->t_curr_res -= roundoff;
+	} else {
+		xlog_grant_add_space(log, &log->l_reserve_head.grant, roundoff);
+		xlog_grant_add_space(log, &log->l_write_head.grant, roundoff);
+	}
 
 	/* put cycle number in every block */
-	xlog_pack_data(log, iclog, roundoff); 
+	xlog_pack_data(log, iclog, roundoff);
 
 	/* real byte length */
 	size = iclog->ic_offset;
@@ -2195,7 +2206,7 @@ xlog_write_get_more_iclog_space(
 	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
 	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
 	       iclog->ic_state == XLOG_STATE_IOERROR);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 	if (error)
 		return error;
@@ -2465,7 +2476,7 @@ xlog_write(
 	if (commit_iclog)
 		*commit_iclog = iclog;
 	else
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, ticket);
 	spin_unlock(&log->l_icloglock);
 
 	return error;
@@ -2923,7 +2934,7 @@ xlog_state_get_iclog_space(
 		 * reference to the iclog.
 		 */
 		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-			error = xlog_state_release_iclog(log, iclog);
+			error = xlog_state_release_iclog(log, iclog, ticket);
 		spin_unlock(&log->l_icloglock);
 		if (error)
 			return error;
@@ -3151,7 +3162,7 @@ xfs_log_force(
 			atomic_inc(&iclog->ic_refcnt);
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog))
+			if (xlog_state_release_iclog(log, iclog, NULL))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3244,7 +3255,7 @@ xlog_force_lsn(
 		}
 		atomic_inc(&iclog->ic_refcnt);
 		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog))
+		if (xlog_state_release_iclog(log, iclog, NULL))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 672cbaa4606c..64e247cadb33 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -832,6 +832,7 @@ xlog_cil_push_work(
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
+	struct xlog_ticket	*ticket;
 
 	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -1039,12 +1040,10 @@ xlog_cil_push_work(
 	if (error)
 		goto out_abort_free_ticket;
 
-	xfs_log_ticket_ungrant(log, ctx->ticket);
-
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
 		spin_unlock(&commit_iclog->ic_callback_lock);
-		goto out_abort;
+		goto out_abort_free_ticket;
 	}
 	ASSERT_ALWAYS(commit_iclog->ic_state == XLOG_STATE_ACTIVE ||
 		      commit_iclog->ic_state == XLOG_STATE_WANT_SYNC);
@@ -1061,6 +1060,15 @@ xlog_cil_push_work(
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
+	/*
+	 * Pull the ticket off the ctx so we can ungrant it after releasing the
+	 * commit_iclog. The ctx may be freed by the time we return from
+	 * releasing the commit_iclog (i.e. checkpoint has been completed and
+	 * callback run) so we can't reference the ctx after the call to
+	 * xlog_state_release_iclog().
+	 */
+	ticket = ctx->ticket;
+
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous
 	 * iclogs to complete before we submit the commit_iclog. In this case,
@@ -1087,8 +1095,10 @@ xlog_cil_push_work(
 	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
 	if (push_commit_stable && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(log, commit_iclog, 0);
-	xlog_state_release_iclog(log, commit_iclog);
+	xlog_state_release_iclog(log, commit_iclog, ticket);
 	spin_unlock(&log->l_icloglock);
+
+	xfs_log_ticket_ungrant(log, ticket);
 	return;
 
 out_skip:
@@ -1099,7 +1109,6 @@ xlog_cil_push_work(
 
 out_abort_free_ticket:
 	xfs_log_ticket_ungrant(log, ctx->ticket);
-out_abort:
 	ASSERT(XLOG_FORCED_SHUTDOWN(log));
 	xlog_cil_committed(ctx);
 }
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 071367a96d8d..615beb6781dd 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -488,7 +488,8 @@ void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
 void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
 		int eventual_size);
-int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
+int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog,
+		struct xlog_ticket *ticket);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH 39/39] xfs: expanding delayed logging design with background material
  2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
                   ` (37 preceding siblings ...)
  2021-05-19 12:13 ` [PATCH 38/39] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
@ 2021-05-19 12:13 ` Dave Chinner
  2021-05-27 20:38   ` Darrick J. Wong
  38 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-05-19 12:13 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

I wrote up a description of how transactions, space reservations and
relogging work together in response to a question for background
material on the delayed logging design. Add this to the existing
document for ease of future reference.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 .../xfs-delayed-logging-design.rst            | 361 ++++++++++++++++--
 1 file changed, 322 insertions(+), 39 deletions(-)

diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
index 464405d2801e..395c63ca5b27 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
@@ -1,29 +1,314 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==========================
-XFS Delayed Logging Design
-==========================
-
-Introduction to Re-logging in XFS
-=================================
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on, so that readers may familiarise themselves with the
+general concepts of how transaction processing in XFS works.
+
+We begin with an overview of transactions in XFS, followed by describing how
+transaction reservations are structured and accounted, and then move into how we
+guarantee forwards progress for long running transactions with finite initial
+reservation bounds. At this point we need to explain how relogging works. With
+the basic concepts covered, the design of the delayed logging mechanism is
+documented.
+
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. Long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so keeping
+the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document. It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together differ and depend on the object and/or modification being
+performed. The logging subsystem only cares that certain specific rules are
+followed to guarantee forwards progress and prevent deadlocks.
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transactions can take reservations that span
+commit boundaries, whilst "one shot" transactions are for a single atomic
+modification.
+
+The type and size of reservation must be matched to the modification taking
+place.  This means that permanent transactions can be used for one-shot
+modifications, but one-shot reservations cannot be used for permanent
+transactions.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+	tp = xfs_trans_alloc(<reservation>)
+	<lock items>
+	<join item to transaction>
+	<do modification>
+	xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle.  Once the transaction is committed, all
+resources joined to it are released, along with the remaining unused reservation
+space that was taken at the transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+	tp = xfs_trans_alloc(<reservation>)
+	xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+	loop {
+		xfs_trans_ijoin(tp, 0);
+		<do modification>
+		xfs_trans_log_inode(tp, ip);
+		xfs_trans_roll(&tp);
+	}
+
+	xfs_trans_commit(tp);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+	ntp = xfs_trans_dup(tp);
+	xfs_trans_commit(tp);
+	xfs_log_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions.  Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stable storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
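+
+Conceptually, the intent-based pattern extends the rolling transaction loop
+shown above (a pseudo-code sketch only - the real intent and intent-done log
+item APIs vary by operation type)::
+
+	tp = xfs_trans_alloc(<reservation>)
+	<log intent item describing the whole operation>
+	loop {
+		<do next part of the modification>
+		<log intent-done item for the completed part>
+		<log new intent item for the remaining work>
+		xfs_trans_roll(&tp);
+	}
+	xfs_trans_commit(tp);
+
+If the system crashes mid-chain, recovery finds an intent item without a
+matching intent-done item and uses it to restart and complete the remaining
+work.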
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or
+to ensure that a long running permanent transaction is fully committed once it is
+complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
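+
+In code, tagging a transaction as synchronous is a one-liner before the
+commit (a sketch; xfs_trans_set_sync() sets the XFS_TRANS_SYNC flag so that
+xfs_trans_commit() issues the log force and waits for it)::
+
+	tp = xfs_trans_alloc(<reservation>)
+	<do modification>
+	xfs_trans_set_sync(tp);
+	xfs_trans_commit(tp);	/* returns once the log force completes */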
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees.  Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again.  Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on.  Hence the amount of
+metadata that a "simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it.  Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space when we roll the transaction. This ensures that the common
+modifications we make only need to reserve log space once.
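+
+In code, the unit reservation and log count map onto the transaction
+reservation structure like so (a sketch based on struct xfs_trans_res; the
+values are illustrative, not a real reservation)::
+
+	struct xfs_trans_res resv = {
+		.tr_logres	= <worst case single commit size>,
+		.tr_logcount	= 2,	/* expected fast path rolls */
+		.tr_logflags	= XFS_TRANS_PERM_LOG_RES,
+	};
+
+The log space taken at allocation time is effectively tr_logres *
+tr_logcount, and each transaction roll consumes one tr_logres unit of it.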
+
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log.  A LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do relatively simple LSN based math to keep track of
+available space in the log.
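+
+A sketch of the LSN packing, mirroring the style of the kernel's
+CYCLE_LSN()/BLOCK_LSN() helpers (shown as plain C for clarity)::
+
+	typedef int64_t	xfs_lsn_t;
+
+	#define CYCLE_LSN(lsn)	((uint32_t)((lsn) >> 32))	/* cycle */
+	#define BLOCK_LSN(lsn)	((uint32_t)(lsn))		/* BB offset */
+
+	/* assemble an LSN from a cycle number and a basic block offset */
+	static inline xfs_lsn_t
+	xlog_assign_lsn(uint32_t cycle, uint32_t block)
+	{
+		return ((xfs_lsn_t)cycle << 32) | block;
+	}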
+
+Log space accounting is done via a pair of constructs called "grant heads".  The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head. This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient space is available. This FIFO
+mechanism ensures no transaction is starved of resources when log space
+shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences come into play when a permanent transaction is rolled, the
+internal "log count" reaches zero and the initial set of unit reservations
+have been exhausted. At this point, we still require a log space reservation
+to continue the next transaction in the sequence, but we have none remaining.
+We cannot
+sleep during the transaction commit process waiting for new log space to become
+available, as we may end up at the end of the FIFO queue and the items we
+hold locked while we sleep could pin the tail of the log. That would prevent
+the log from ever freeing enough space to fulfil all of the pending
+reservations and wake the sleeping transaction commit - a deadlock.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
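+
+Expressed as simple arithmetic (a sketch of the space calculation, ignoring
+cycle wrapping for clarity)::
+
+	free_space = log_size - (grant_head - log_tail);
+
+For the reserve grant head this value is allowed to go negative - that is
+the overcommit - whereas the write grant head must never pass the tail.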
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. At that point, we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+	Code using permanent reservations must always log the items they hold
+	locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+the transaction chain is rolling are always relocated to the physical head of
+the log and so do not pin the tail of the log. If a locked item pins the tail of
+the log when we sleep on the write reservation, then we will deadlock the log as
+we cannot take the locks needed to write back that item and move the tail of the
+log forwards to free up write grant space. Re-logging the locked items avoids
+this deadlock and guarantees that the log reservation we are making cannot
+self-deadlock.
+
+If all rolling transactions obey this rule, then they can all make forwards
+progress independently because nothing will block the progress of the log
+tail moving forwards. This ensures that write grant space is always
+(eventually) made available to permanent transactions no matter how many times
+they roll.
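+
+Putting the two grant heads together, the reservation flow across a single
+roll after the log count is exhausted looks like this (pseudo-code sketch)::
+
+	xfs_trans_commit(tp)
+		<regrant a unit on the reserve head - may overcommit>
+	xfs_log_reserve(ntp)
+		<sleep for physical space against the write head>
+
+The re-logging rule above guarantees that the sleep in xfs_log_reserve() can
+always be woken by the log tail moving forwards.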
+
+
+Re-logging Explained
+====================
+
+XFS allows multiple separate modifications to a single object to be carried in
+the log at any given time.  This allows the log to avoid needing to flush each
+change to disk before recording a new change to the object. XFS does this via a
+method called "re-logging". Conceptually, this is quite simple - all it requires
+is that any new change to the object is recorded with a *new copy* of all the
+existing changes in the new transaction that is written to the log.
 
 That is, if we have a sequence of changes A through to F, and the object was
 written to disk after change D, we would see in the log the following series
@@ -42,16 +327,13 @@ transaction::
 In other words, each time an object is relogged, the new transaction contains
 the aggregation of all the previous changes currently held only in the log.
 
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward.  This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
+This relogging technique allows objects to be moved forward in the log so that
+an object being relogged does not prevent the tail of the log from ever moving
+forward.  This can be seen in the table above by the changing (increasing) LSN
+of each subsequent transaction, and it's the technique that allows us to
+implement long-running, multiple-commit permanent transactions.
 
-This relogging is also used to implement long-running, multiple-commit
-transactions.  These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
+A typical example of a rolling transaction is the removal of extents from an
 inode which can only be done at a rate of two extents per transaction because
 of reservation size limitations. Hence a rolling extent removal transaction
 keeps relogging the inode and btree buffers as they get modified in each
@@ -67,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get
 dirtier as they get relogged, so each subsequent transaction is writing more
 metadata into the log.
 
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
+It should now also be obvious how relogging and asynchronous transactions go
+hand in hand. That is, transactions don't get written to the physical journal
+until either a log buffer is filled (a log buffer can hold multiple
+transactions) or a synchronous operation forces the log buffers holding the
+transactions to disk. This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
 
 The limitation on asynchronous transaction throughput is the number and size of
 log buffers made available by the log manager. By default there are 8 log
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 04/39] xfs: async blkdev cache flush
  2021-05-19 12:12 ` [PATCH 04/39] xfs: async blkdev cache flush Dave Chinner
@ 2021-05-20 23:53   ` Darrick J. Wong
  2021-05-28  0:54   ` Allison Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-20 23:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:42PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
> 
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
> 
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_bio_io.c | 35 +++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  2 ++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 17f36db2f792..de727532e137 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,41 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>  	return bio_max_segs(howmany(count, PAGE_SIZE));
>  }
>  
> +static void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	complete(bio->bi_private);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the request queue does
> + * not require flush operations, just skip it altogether. If the caller needs
> + * to wait for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.  The
> + * caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct bio		*bio,
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct request_queue	*q = bdev->bd_disk->queue;
> +
> +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> +		complete(done);
> +		return;
> +	}
> +
> +	bio_init(bio, NULL, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +	bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
> +	submit_bio(bio);
> +}
>  int
>  xfs_rw_bdev(
>  	struct block_device	*bdev,
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 7688663b9773..c174262a074e 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
> +void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
> +		struct completion *done);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 07/39] xfs: journal IO cache flush reductions
  2021-05-19 12:12 ` [PATCH 07/39] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-05-21  0:16   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:45PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
> 
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
> 
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
> 
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
> 
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
> 
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> ("Add support for drive write cache flushing") in the xfs-archive
> tree.
> 
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
> 
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
> 
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not on any other
> journal IO done within the context of that journal transaction.
> 
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case. As an added bonus,
> this ordering mechanism works for both internal and external logs,
> meaning we can remove the explicit data device cache flushes from
> the iclog write code when using external logs.
> 
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We
> also require unmount records to do this. Hence for both
> XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
> to mark the first iclog being written with REQ_PREFLUSH.
> 
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we want to also mark the iclogs
> that contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
> 
> And for external logs, unmount records need a pre-write data device
> cache flush similar to the CIL checkpoint cache pre-flush as the
> internal iclog write code does not do this implicitly anymore.
> 
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
> 
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
> 
> With smaller iclogs (32kB), however, there is a sustantial

s/sustantial/substantial/


> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
> 
> As a result:
> 
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
> 
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
> 
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

With that fixed,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      | 66 +++++++++++++++----------------------------
>  fs/xfs/xfs_log.h      |  1 -
>  fs/xfs/xfs_log_cil.c  | 18 +++++++++---
>  fs/xfs/xfs_log_priv.h |  6 ++++
>  4 files changed, 43 insertions(+), 48 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 87870867d9fb..b6145e4cb7bc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -513,7 +513,7 @@ __xlog_state_release_iclog(
>   * Flush iclog to disk if this is the last reference to the given iclog and
>   * it is in the WANT_SYNC state.
>   */
> -static int
> +int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
> @@ -533,23 +533,6 @@ xlog_state_release_iclog(
>  	return 0;
>  }
>  
> -void
> -xfs_log_release_iclog(
> -	struct xlog_in_core	*iclog)
> -{
> -	struct xlog		*log = iclog->ic_log;
> -	bool			sync = false;
> -
> -	if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) {
> -		if (iclog->ic_state != XLOG_STATE_IOERROR)
> -			sync = __xlog_state_release_iclog(log, iclog);
> -		spin_unlock(&log->l_icloglock);
> -	}
> -
> -	if (sync)
> -		xlog_sync(log, iclog);
> -}
> -
>  /*
>   * Mount a log filesystem
>   *
> @@ -837,6 +820,14 @@ xlog_write_unmount_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
> +
> +	/*
> +	 * For external log devices, we need to flush the data device cache
> +	 * first to ensure all metadata writeback is on stable storage before we
> +	 * stamp the tail LSN into the unmount record.
> +	 */
> +	if (log->l_targ != log->l_mp->m_ddev_targp)
> +		blkdev_issue_flush(log->l_targ->bt_bdev);
>  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
> @@ -874,6 +865,11 @@ xlog_unmount_write(
>  	else
>  		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
>  		       iclog->ic_state == XLOG_STATE_IOERROR);
> +	/*
> +	 * Ensure the journal is fully flushed and on stable storage once the
> +	 * iclog containing the unmount record is written.
> +	 */
> +	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  	error = xlog_state_release_iclog(log, iclog);
>  	xlog_wait_on_iclog(iclog);
>  
> @@ -1755,8 +1751,7 @@ xlog_write_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
>  	uint64_t		bno,
> -	unsigned int		count,
> -	bool			need_flush)
> +	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
>  
> @@ -1794,10 +1789,12 @@ xlog_write_iclog(
>  	 * writeback throttle from throttling log writes behind background
>  	 * metadata writeback and causing priority inversions.
>  	 */
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> -				REQ_IDLE | REQ_FUA;
> -	if (need_flush)
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
> +		iclog->ic_bio.bi_opf |= REQ_FUA;
> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -1900,7 +1897,6 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>  
> @@ -1925,10 +1921,8 @@ xlog_sync(
>  	bno = BLOCK_LSN(be64_to_cpu(iclog->ic_header.h_lsn));
>  
>  	/* Do we need to split this write into 2 parts? */
> -	if (bno + BTOBB(count) > log->l_logBBsize) {
> +	if (bno + BTOBB(count) > log->l_logBBsize)
>  		xlog_split_iclog(log, &iclog->ic_header, bno, count);
> -		split = true;
> -	}
>  
>  	/* calculcate the checksum */
>  	iclog->ic_header.h_crc = xlog_cksum(log, &iclog->ic_header,
> @@ -1949,22 +1943,8 @@ xlog_sync(
>  			 be64_to_cpu(iclog->ic_header.h_lsn));
>  	}
>  #endif
> -
> -	/*
> -	 * Flush the data device before flushing the log to make sure all meta
> -	 * data written back from the AIL actually made it to disk before
> -	 * stamping the new log tail LSN into the log buffer.  For an external
> -	 * log we need to issue the flush explicitly, and unfortunately
> -	 * synchronously here; for an internal log we can simply use the block
> -	 * layer state machine for preflushes.
> -	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> -	}
> -
>  	xlog_verify_iclog(log, iclog, count);
> -	xlog_write_iclog(log, iclog, bno, count, need_flush);
> +	xlog_write_iclog(log, iclog, bno, count);
>  }
>  
>  /*
> @@ -2418,7 +2398,7 @@ xlog_write(
>  		ASSERT(log_offset <= iclog->ic_size - 1);
>  		ptr = iclog->ic_datap + log_offset;
>  
> -		/* start_lsn is the first lsn written to. That's all we need. */
> +		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 044e02cb8921..99f9d6ed9598 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -117,7 +117,6 @@ void	xfs_log_mount_cancel(struct xfs_mount *);
>  xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
>  xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
>  void	  xfs_log_space_wake(struct xfs_mount *mp);
> -void	  xfs_log_release_iclog(struct xlog_in_core *iclog);
>  int	  xfs_log_reserve(struct xfs_mount *mp,
>  			  int		   length,
>  			  int		   count,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 172bb3551d6b..9d2fa8464289 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -890,15 +890,25 @@ xlog_cil_push_work(
>  
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. In this case,
> +	 * the commit_iclog write needs to issue a pre-flush so that the
> +	 * ordering is correctly preserved down to stable storage.
>  	 */
> +	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
> -		spin_lock(&log->l_icloglock);
>  		xlog_wait_on_iclog(commit_iclog->ic_prev);
> +		spin_lock(&log->l_icloglock);
> +		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -	/* release the hounds! */
> -	xfs_log_release_iclog(commit_iclog);
> +	/*
> +	 * The commit iclog must be written to stable storage to guarantee
> +	 * journal IO vs metadata writeback IO is correctly ordered on stable
> +	 * storage.
> +	 */
> +	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> +	xlog_state_release_iclog(log, commit_iclog);
> +	spin_unlock(&log->l_icloglock);
>  	return;
>  
>  out_skip:
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 56e1942c47df..2203ccecafb6 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -133,6 +133,9 @@ enum xlog_iclog_state {
>  
>  #define XLOG_COVER_OPS		5
>  
> +#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
> +#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
> +
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  
> @@ -201,6 +204,7 @@ typedef struct xlog_in_core {
>  	u32			ic_size;
>  	u32			ic_offset;
>  	enum xlog_iclog_state	ic_state;
> +	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
>  
>  	/* Callback structures need their own cacheline */
> @@ -486,6 +490,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
> +
>  /*
>   * When we crack an atomic LSN, we sample it first so that the value will not
>   * change while we are cracking it into the component values. This means we
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN
  2021-05-19 12:12 ` [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
@ 2021-05-21  0:20   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:47PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In doing an investigation into AIL push stalls, I was looking at the
> log force code to see if an async CIL push could be done instead.
> This lead me to xfs_log_force_lsn() and looking at how it works.
> 
> xfs_log_force_lsn() is only called from inode synchronisation
> contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
> value as the LSN to sync the log to. This gets passed to
> xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
> journal, and then used by xfs_log_force_lsn() to flush the iclogs to
> the journal.
> 
> The problem is that ip->i_itemp->ili_last_lsn does not store a
> log sequence number. What it stores is passed to it from the
> ->iop_committing method, which is called by xfs_log_commit_cil().
> The value this passes to the iop_committing method is the CIL
> context sequence number that the item was committed to.
> 
> As it turns out, xlog_cil_force_lsn() converts the sequence to an
> actual commit LSN for the related context and returns that to
> xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
> variable that contained a sequence with an actual LSN and then uses
> that to sync the iclogs.
> 
> This caused me some confusion for a while, even though I originally
> wrote all this code a decade ago. ->iop_committing is only used by
> a couple of log item types, and only inode items use the sequence
> number it is passed.
> 
> Let's clean up the API, CIL structures and inode log item to call it
> a sequence number, and make it clear that the high level code is
> using CIL sequence numbers and not on-disk LSNs for integrity
> synchronisation purposes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks good this time,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_types.h |  1 +
>  fs/xfs/xfs_buf_item.c     |  2 +-
>  fs/xfs/xfs_dquot_item.c   |  2 +-
>  fs/xfs/xfs_file.c         | 14 +++++++-------
>  fs/xfs/xfs_inode.c        | 10 +++++-----
>  fs/xfs/xfs_inode_item.c   |  4 ++--
>  fs/xfs/xfs_inode_item.h   |  2 +-
>  fs/xfs/xfs_log.c          | 27 ++++++++++++++-------------
>  fs/xfs/xfs_log.h          |  4 +---
>  fs/xfs/xfs_log_cil.c      | 30 +++++++++++-------------------
>  fs/xfs/xfs_log_priv.h     | 15 +++++++--------
>  fs/xfs/xfs_trans.c        |  6 +++---
>  fs/xfs/xfs_trans.h        |  4 ++--
>  13 files changed, 56 insertions(+), 65 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 064bd6e8c922..0870ef6f933d 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -21,6 +21,7 @@ typedef int32_t		xfs_suminfo_t;	/* type of bitmap summary info */
>  typedef uint32_t	xfs_rtword_t;	/* word type for bitmap manipulations */
>  
>  typedef int64_t		xfs_lsn_t;	/* log sequence number */
> +typedef int64_t		xfs_csn_t;	/* CIL sequence number */
>  
>  typedef uint32_t	xfs_dablk_t;	/* dir/attr block number (in file) */
>  typedef uint32_t	xfs_dahash_t;	/* dir/attr hash value */
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 14d1fefcbf4c..1cb087b320b1 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -713,7 +713,7 @@ xfs_buf_item_release(
>  STATIC void
>  xfs_buf_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
>  	return xfs_buf_item_release(lip);
>  }
> diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
> index 8c1fdf37ee8f..8ed47b739b6c 100644
> --- a/fs/xfs/xfs_dquot_item.c
> +++ b/fs/xfs/xfs_dquot_item.c
> @@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
>  STATIC void
>  xfs_qm_dquot_logitem_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
>  	return xfs_qm_dquot_logitem_release(lip);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e7e9af57e788..277d0f3921cc 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -119,8 +119,8 @@ xfs_dir_fsync(
>  	return xfs_log_force_inode(ip);
>  }
>  
> -static xfs_lsn_t
> -xfs_fsync_lsn(
> +static xfs_csn_t
> +xfs_fsync_seq(
>  	struct xfs_inode	*ip,
>  	bool			datasync)
>  {
> @@ -128,7 +128,7 @@ xfs_fsync_lsn(
>  		return 0;
>  	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>  		return 0;
> -	return ip->i_itemp->ili_last_lsn;
> +	return ip->i_itemp->ili_commit_seq;
>  }
>  
>  /*
> @@ -151,12 +151,12 @@ xfs_fsync_flush_log(
>  	int			*log_flushed)
>  {
>  	int			error = 0;
> -	xfs_lsn_t		lsn;
> +	xfs_csn_t		seq;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
> -	lsn = xfs_fsync_lsn(ip, datasync);
> -	if (lsn) {
> -		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
> +	seq = xfs_fsync_seq(ip, datasync);
> +	if (seq) {
> +		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
>  					  log_flushed);
>  
>  		spin_lock(&ip->i_itemp->ili_lock);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 336c350206a8..1c7e0d4e0013 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2604,7 +2604,7 @@ xfs_iunpin(
>  	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
>  
>  	/* Give the log a push to start the unpinning I/O */
> -	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
> +	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
>  
>  }
>  
> @@ -3618,16 +3618,16 @@ int
>  xfs_log_force_inode(
>  	struct xfs_inode	*ip)
>  {
> -	xfs_lsn_t		lsn = 0;
> +	xfs_csn_t		seq = 0;
>  
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip))
> -		lsn = ip->i_itemp->ili_last_lsn;
> +		seq = ip->i_itemp->ili_commit_seq;
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> -	if (!lsn)
> +	if (!seq)
>  		return 0;
> -	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
> +	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 5a2dd33020e2..35de30849fcc 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -643,9 +643,9 @@ xfs_inode_item_committed(
>  STATIC void
>  xfs_inode_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	xfs_csn_t		seq)
>  {
> -	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
> +	INODE_ITEM(lip)->ili_commit_seq = seq;
>  	return xfs_inode_item_release(lip);
>  }
>  
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4b926e32831c..403b45ab9aa2 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -33,7 +33,7 @@ struct xfs_inode_log_item {
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
> -	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> +	xfs_csn_t		ili_commit_seq;	   /* last transaction commit */
>  };
>  
>  static inline int xfs_inode_clean(struct xfs_inode *ip)
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index b6145e4cb7bc..aa37f4319052 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3252,14 +3252,13 @@ xfs_log_force(
>  }
>  
>  static int
> -__xfs_log_force_lsn(
> -	struct xfs_mount	*mp,
> +xlog_force_lsn(
> +	struct xlog		*log,
>  	xfs_lsn_t		lsn,
>  	uint			flags,
>  	int			*log_flushed,
>  	bool			already_slept)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
>  
>  	spin_lock(&log->l_icloglock);
> @@ -3292,8 +3291,6 @@ __xfs_log_force_lsn(
>  		if (!already_slept &&
>  		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
>  		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
> -			XFS_STATS_INC(mp, xs_log_force_sleep);
> -
>  			xlog_wait(&iclog->ic_prev->ic_write_wait,
>  					&log->l_icloglock);
>  			return -EAGAIN;
> @@ -3331,25 +3328,29 @@ __xfs_log_force_lsn(
>   * to disk, that thread will wake up all threads waiting on the queue.
>   */
>  int
> -xfs_log_force_lsn(
> +xfs_log_force_seq(
>  	struct xfs_mount	*mp,
> -	xfs_lsn_t		lsn,
> +	xfs_csn_t		seq,
>  	uint			flags,
>  	int			*log_flushed)
>  {
> +	struct xlog		*log = mp->m_log;
> +	xfs_lsn_t		lsn;
>  	int			ret;
> -	ASSERT(lsn != 0);
> +	ASSERT(seq != 0);
>  
>  	XFS_STATS_INC(mp, xs_log_force);
> -	trace_xfs_log_force(mp, lsn, _RET_IP_);
> +	trace_xfs_log_force(mp, seq, _RET_IP_);
>  
> -	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
> +	lsn = xlog_cil_force_seq(log, seq);
>  	if (lsn == NULLCOMMITLSN)
>  		return 0;
>  
> -	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
> -	if (ret == -EAGAIN)
> -		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
> +	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
> +	if (ret == -EAGAIN) {
> +		XFS_STATS_INC(mp, xs_log_force_sleep);
> +		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
> +	}
>  	return ret;
>  }
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 99f9d6ed9598..813b972e9788 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -106,7 +106,7 @@ struct xfs_item_ops;
>  struct xfs_trans;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
> -int	  xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
> +int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
>  		int *log_forced);
>  int	  xfs_log_mount(struct xfs_mount	*mp,
>  			struct xfs_buftarg	*log_target,
> @@ -131,8 +131,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
>  void	  xfs_log_ticket_put(struct xlog_ticket *ticket);
>  
> -void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
> -				xfs_lsn_t *commit_lsn, bool regrant);
>  void	xlog_cil_process_committed(struct list_head *list);
>  bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
>  
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 903617e6d054..3c2b1205944d 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -788,7 +788,7 @@ xlog_cil_push_work(
>  	 * that higher sequences will wait for us to write out a commit record
>  	 * before they do.
>  	 *
> -	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
> +	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
>  	 * structure atomically with the addition of this sequence to the
>  	 * committing list. This also ensures that we can do unlocked checks
>  	 * against the current sequence in log forces without risking
> @@ -1057,16 +1057,14 @@ xlog_cil_empty(
>   * allowed again.
>   */
>  void
> -xfs_log_commit_cil(
> -	struct xfs_mount	*mp,
> +xlog_cil_commit(
> +	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	xfs_lsn_t		*commit_lsn,
> +	xfs_csn_t		*commit_seq,
>  	bool			regrant)
>  {
> -	struct xlog		*log = mp->m_log;
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_log_item	*lip, *next;
> -	xfs_lsn_t		xc_commit_lsn;
>  
>  	/*
>  	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1080,10 +1078,6 @@ xfs_log_commit_cil(
>  
>  	xlog_cil_insert_items(log, tp);
>  
> -	xc_commit_lsn = cil->xc_ctx->sequence;
> -	if (commit_lsn)
> -		*commit_lsn = xc_commit_lsn;
> -
>  	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
>  		xfs_log_ticket_regrant(log, tp->t_ticket);
>  	else
> @@ -1106,8 +1100,10 @@ xfs_log_commit_cil(
>  	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
>  		xfs_trans_del_item(lip);
>  		if (lip->li_ops->iop_committing)
> -			lip->li_ops->iop_committing(lip, xc_commit_lsn);
> +			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
>  	}
> +	if (commit_seq)
> +		*commit_seq = cil->xc_ctx->sequence;
>  
>  	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
>  	xlog_cil_push_background(log);
> @@ -1124,9 +1120,9 @@ xfs_log_commit_cil(
>   * iclog flush is necessary following this call.
>   */
>  xfs_lsn_t
> -xlog_cil_force_lsn(
> +xlog_cil_force_seq(
>  	struct xlog	*log,
> -	xfs_lsn_t	sequence)
> +	xfs_csn_t	sequence)
>  {
>  	struct xfs_cil		*cil = log->l_cilp;
>  	struct xfs_cil_ctx	*ctx;
> @@ -1222,21 +1218,17 @@ bool
>  xfs_log_item_in_current_chkpt(
>  	struct xfs_log_item *lip)
>  {
> -	struct xfs_cil_ctx *ctx;
> +	struct xfs_cil_ctx *ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
>  
>  	if (list_empty(&lip->li_cil))
>  		return false;
>  
> -	ctx = lip->li_mountp->m_log->l_cilp->xc_ctx;
> -
>  	/*
>  	 * li_seq is written on the first commit of a log item to record the
>  	 * first checkpoint it is written to. Hence if it is different to the
>  	 * current sequence, we're in a new checkpoint.
>  	 */
> -	if (XFS_LSN_CMP(lip->li_seq, ctx->sequence) != 0)
> -		return false;
> -	return true;
> +	return lip->li_seq == ctx->sequence;
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 2203ccecafb6..2d7e7cbee8b7 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -234,7 +234,7 @@ struct xfs_cil;
>  
>  struct xfs_cil_ctx {
>  	struct xfs_cil		*cil;
> -	xfs_lsn_t		sequence;	/* chkpt sequence # */
> +	xfs_csn_t		sequence;	/* chkpt sequence # */
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> @@ -272,10 +272,10 @@ struct xfs_cil {
>  	struct xfs_cil_ctx	*xc_ctx;
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
> -	xfs_lsn_t		xc_push_seq;
> +	xfs_csn_t		xc_push_seq;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
> -	xfs_lsn_t		xc_current_sequence;
> +	xfs_csn_t		xc_current_sequence;
>  	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
> @@ -554,19 +554,18 @@ int	xlog_cil_init(struct xlog *log);
>  void	xlog_cil_init_post_recovery(struct xlog *log);
>  void	xlog_cil_destroy(struct xlog *log);
>  bool	xlog_cil_empty(struct xlog *log);
> +void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
> +			xfs_csn_t *commit_seq, bool regrant);
>  
>  /*
>   * CIL force routines
>   */
> -xfs_lsn_t
> -xlog_cil_force_lsn(
> -	struct xlog *log,
> -	xfs_lsn_t sequence);
> +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
>  
>  static inline void
>  xlog_cil_force(struct xlog *log)
>  {
> -	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
> +	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 586f2992b789..87bffd12c20c 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -839,7 +839,7 @@ __xfs_trans_commit(
>  	bool			regrant)
>  {
>  	struct xfs_mount	*mp = tp->t_mountp;
> -	xfs_lsn_t		commit_lsn = -1;
> +	xfs_csn_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
>  
> @@ -881,7 +881,7 @@ __xfs_trans_commit(
>  		xfs_trans_apply_sb_deltas(tp);
>  	xfs_trans_apply_dquot_deltas(tp);
>  
> -	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
> +	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
>  
>  	xfs_trans_free(tp);
>  
> @@ -890,7 +890,7 @@ __xfs_trans_commit(
>  	 * log out now and wait for it.
>  	 */
>  	if (sync) {
> -		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
> +		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
>  	} else {
>  		XFS_STATS_INC(mp, xs_trans_async);
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ee42d98d9011..50da47f23a07 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -43,7 +43,7 @@ struct xfs_log_item {
>  	struct list_head		li_cil;		/* CIL pointers */
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
> -	xfs_lsn_t			li_seq;		/* CIL commit seq */
> +	xfs_csn_t			li_seq;		/* CIL commit seq */
>  };
>  
>  /*
> @@ -69,7 +69,7 @@ struct xfs_item_ops {
>  	void (*iop_pin)(struct xfs_log_item *);
>  	void (*iop_unpin)(struct xfs_log_item *, int remove);
>  	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
> -	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
> +	void (*iop_committing)(struct xfs_log_item *lip, xfs_csn_t seq);
>  	void (*iop_release)(struct xfs_log_item *);
>  	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
>  	int (*iop_recover)(struct xfs_log_item *lip,
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 11/39] xfs: CIL work is serialised, not pipelined
  2021-05-19 12:12 ` [PATCH 11/39] xfs: CIL work is serialised, not pipelined Dave Chinner
@ 2021-05-21  0:32   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:49PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because we use a single work structure attached to the CIL rather
> than the CIL context, we can only queue a single work item at a
> time. This results in the CIL being single threaded and limits
> performance when it becomes CPU bound.
> 
> The design of the CIL is that it is pipelined and multiple commits
> can be running concurrently, but the way the work is currently
> implemented means that it is not pipelining as it was intended. The
> critical work to switch the CIL context can take a few milliseconds
> to run, but the rest of the CIL context flush can take hundreds of
> milliseconds to complete. The context switching is the serialisation
> point of the CIL; once the context has been switched, the rest of the
> context push can run asynchronously with all other context pushes.
> 
> Hence we can move the work to the CIL context so that we can run
> multiple CIL pushes at the same time and spread the majority of
> the work out over multiple CPUs. We can keep the per-cpu CIL commit
> state on the CIL rather than the context, because the context is
> pinned to the CIL until the switch is done and we aggregate and
> drain the per-cpu state held on the CIL during the context switch.
> 
> However, because we no longer serialise the CIL work, we can have
> effectively unlimited CIL pushes in progress. We don't want to do
> this - not only does it create contention on the iclogs and the
> state machine locks, we can run the log right out of space with
> outstanding pushes. Instead, limit the work concurrency to 4
> concurrent works being processed at a time. This is enough
> concurrency to remove the CIL from being a CPU bound bottleneck but
> not enough to create new contention points or unbound concurrency
> issues.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
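
To illustrate the work-per-context pattern described above, here is a
minimal sketch using the generic Linux workqueue API (the struct and
function names are hypothetical; this is not the XFS code itself):

	/* requires <linux/workqueue.h> */

	/* one work item per checkpoint context, not one per CIL */
	struct push_ctx {
		struct work_struct	work;
		/* ... per-checkpoint state ... */
	};

	static void push_worker(struct work_struct *work)
	{
		struct push_ctx *ctx =
			container_of(work, struct push_ctx, work);
		/* flush this context; other contexts may run concurrently */
	}

	/* max_active = 4 bounds the number of concurrent pushes */
	struct workqueue_struct *wq =
		alloc_workqueue("push-wq", WQ_UNBOUND, 4);

	/* each new context gets its own queueable work item */
	INIT_WORK(&ctx->work, push_worker);
	queue_work(wq, &ctx->work);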

Heh woo :)

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c  | 80 +++++++++++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h |  2 +-
>  fs/xfs/xfs_super.c    |  6 +++-
>  3 files changed, 48 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index cb849e67b1c4..713ea66d4c0c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -47,6 +47,34 @@ xlog_cil_ticket_alloc(
>  	return tic;
>  }
>  
> +/*
> + * Unavoidable forward declaration - xlog_cil_push_work() calls
> + * xlog_cil_ctx_alloc() itself.
> + */
> +static void xlog_cil_push_work(struct work_struct *work);
> +
> +static struct xfs_cil_ctx *
> +xlog_cil_ctx_alloc(void)
> +{
> +	struct xfs_cil_ctx	*ctx;
> +
> +	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
> +	INIT_LIST_HEAD(&ctx->committing);
> +	INIT_LIST_HEAD(&ctx->busy_extents);
> +	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
> +	return ctx;
> +}
> +
> +static void
> +xlog_cil_ctx_switch(
> +	struct xfs_cil		*cil,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	ctx->sequence = ++cil->xc_current_sequence;
> +	ctx->cil = cil;
> +	cil->xc_ctx = ctx;
> +}
> +
>  /*
>   * After the first stage of log recovery is done, we know where the head and
>   * tail of the log are. We need this log initialisation done before we can
> @@ -641,11 +669,11 @@ static void
>  xlog_cil_push_work(
>  	struct work_struct	*work)
>  {
> -	struct xfs_cil		*cil =
> -		container_of(work, struct xfs_cil, xc_push_work);
> +	struct xfs_cil_ctx	*ctx =
> +		container_of(work, struct xfs_cil_ctx, push_work);
> +	struct xfs_cil		*cil = ctx->cil;
>  	struct xlog		*log = cil->xc_log;
>  	struct xfs_log_vec	*lv;
> -	struct xfs_cil_ctx	*ctx;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
>  	struct xlog_ticket	*tic;
> @@ -660,11 +688,10 @@ xlog_cil_push_work(
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  	bool			push_commit_stable;
>  
> -	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
> +	new_ctx = xlog_cil_ctx_alloc();
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
>  
>  	down_write(&cil->xc_ctx_lock);
> -	ctx = cil->xc_ctx;
>  
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
> @@ -696,7 +723,7 @@ xlog_cil_push_work(
>  
>  
>  	/* check for a previously pushed sequence */
> -	if (push_seq < cil->xc_ctx->sequence) {
> +	if (push_seq < ctx->sequence) {
>  		spin_unlock(&cil->xc_push_lock);
>  		goto out_skip;
>  	}
> @@ -761,19 +788,7 @@ xlog_cil_push_work(
>  	}
>  
>  	/*
> -	 * initialise the new context and attach it to the CIL. Then attach
> -	 * the current context to the CIL committing list so it can be found
> -	 * during log forces to extract the commit lsn of the sequence that
> -	 * needs to be forced.
> -	 */
> -	INIT_LIST_HEAD(&new_ctx->committing);
> -	INIT_LIST_HEAD(&new_ctx->busy_extents);
> -	new_ctx->sequence = ctx->sequence + 1;
> -	new_ctx->cil = cil;
> -	cil->xc_ctx = new_ctx;
> -
> -	/*
> -	 * The switch is now done, so we can drop the context lock and move out
> +	 * Switch the contexts so we can drop the context lock and move out
>  	 * of a shared context. We can't just go straight to the commit record,
>  	 * though - we need to synchronise with previous and future commits so
>  	 * that the commit records are correctly ordered in the log to ensure
> @@ -798,7 +813,7 @@ xlog_cil_push_work(
>  	 * dereferencing a freed context pointer.
>  	 */
>  	spin_lock(&cil->xc_push_lock);
> -	cil->xc_current_sequence = new_ctx->sequence;
> +	xlog_cil_ctx_switch(cil, new_ctx);
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
>  
> @@ -970,7 +985,7 @@ xlog_cil_push_background(
>  	spin_lock(&cil->xc_push_lock);
>  	if (cil->xc_push_seq < cil->xc_current_sequence) {
>  		cil->xc_push_seq = cil->xc_current_sequence;
> -		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	}
>  
>  	/*
> @@ -1036,7 +1051,7 @@ xlog_cil_push_now(
>  
>  	/* start on any pending background push to minimise wait time on it */
>  	if (!async)
> -		flush_work(&cil->xc_push_work);
> +		flush_workqueue(log->l_mp->m_cil_workqueue);
>  
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> @@ -1050,7 +1065,7 @@ xlog_cil_push_now(
>  
>  	cil->xc_push_seq = push_seq;
>  	cil->xc_push_commit_stable = async;
> -	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
> +	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
>  	spin_unlock(&cil->xc_push_lock);
>  }
>  
> @@ -1290,13 +1305,6 @@ xlog_cil_init(
>  	if (!cil)
>  		return -ENOMEM;
>  
> -	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
> -	if (!ctx) {
> -		kmem_free(cil);
> -		return -ENOMEM;
> -	}
> -
> -	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
>  	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
> @@ -1304,16 +1312,12 @@ xlog_cil_init(
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
>  	init_waitqueue_head(&cil->xc_commit_wait);
> -
> -	INIT_LIST_HEAD(&ctx->committing);
> -	INIT_LIST_HEAD(&ctx->busy_extents);
> -	ctx->sequence = 1;
> -	ctx->cil = cil;
> -	cil->xc_ctx = ctx;
> -	cil->xc_current_sequence = ctx->sequence;
> -
>  	cil->xc_log = log;
>  	log->l_cilp = cil;
> +
> +	ctx = xlog_cil_ctx_alloc();
> +	xlog_cil_ctx_switch(cil, ctx);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a863ccb5ece6..87447fa34c43 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -245,6 +245,7 @@ struct xfs_cil_ctx {
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
> +	struct work_struct	push_work;
>  };
>  
>  /*
> @@ -277,7 +278,6 @@ struct xfs_cil {
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
> -	struct work_struct	xc_push_work;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
>  } ____cacheline_aligned_in_smp;
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e339d1de2419..0608091f13a6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -501,9 +501,13 @@ xfs_init_mount_workqueues(
>  	if (!mp->m_unwritten_workqueue)
>  		goto out_destroy_buf;
>  
> +	/*
> +	 * Limit the CIL pipeline depth to 4 concurrent works to bound the
> +	 * concurrency the log spinlocks will be exposed to.
> +	 */
>  	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
>  			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
> -			0, mp->m_super->s_id);
> +			4, mp->m_super->s_id);
>  	if (!mp->m_cil_workqueue)
>  		goto out_destroy_unwritten;
>  
> -- 
> 2.31.1
> 

* Re: [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing
  2021-05-19 12:12 ` [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing Dave Chinner
@ 2021-05-21  0:33   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:48PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL pushing is stalling on log forces when it comes across
> pinned items. This is happening on removal workloads where the AIL
> is dominated by stale items that are removed from the AIL when the
> checkpoint that marks the items stale is committed to the journal.
> This results in relatively few items in the AIL, but those that remain
> are often pinned, as the directories the items are being removed from
> are still being logged.
> 
> As a result, many push cycles through the CIL will first issue a
> blocking log force to unpin the items. This can take some time to
> complete, with tracing regularly showing push delays of half a
> second and sometimes up into the range of several seconds. Sequences
> like this aren't uncommon:
> 
> ....
>  399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 270ms delay>
>  400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
>  400.099623:  xfsaild: first lsn 0x11002f3600
>  400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
> <wanted 50ms, got 500ms delay>
>  400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
>  400.589349:  xfsaild: first lsn 0x1100305000
>  400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
> <wanted 50ms, got 460ms delay>
>  400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
>  400.950343:  xfsaild: first lsn 0x1100317c00
>  400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 200ms delay>
>  401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
>  401.142334:  xfsaild: first lsn 0x110032e600
>  401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
> <wanted 10ms, got 10ms delay>
>  401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
>  401.154328:  xfsaild: first lsn 0x1100353000
>  401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
> <wanted 20ms, got 300ms delay>
>  401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
>  401.451526:  xfsaild: first lsn 0x1100353000
>  401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
> <wanted 50ms, got 500ms delay>
>  401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
> ....
> 
> In each of these cases, every AIL pass saw 101 log items stuck on
> the AIL (pinned) with very few other items being found. Each pass, a
> log force was issued, and delay between last/first is the sleep time
> + the sync log force time.
> 
> Some of these 101 items pinned the tail of the log. The tail of the
> log does slowly creep forward (first lsn), but the problem is that
> the log is actually out of reservation space because it's been
> running so many transactions that generate stale items that never
> reach the AIL but still consume log space. Hence we have a largely
> empty AIL, with
> long term pins on items that pin the tail of the log that don't get
> pushed frequently enough to keep log space available.
> 
> The problem is the hundreds of milliseconds that we block in the log
> force pushing the CIL out to disk. The AIL should not be stalled
> like this - it needs to run and flush items that are at the tail of
> the log with minimal latency. What we really need to do is trigger a
> log flush, but then not wait for it at all - we've already done our
> waiting for stuff to complete when we backed off prior to the log
> force being issued.
> 
> Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
> still do a blocking flush of the CIL and that is what is causing the
> issue. Hence we need a new interface for the CIL to trigger an
> immediate background push of the CIL to get it moving faster but not
> to wait on that to occur. While the CIL is pushing, the AIL can also
> be pushing.
> 
> We already have an internal interface to do this -
> xlog_cil_push_now() - but we need a wrapper for it to be used
> externally. xlog_cil_force_seq() can easily be extended to do what
> we need as it already implements the synchronous CIL push via
> xlog_cil_push_now(). Add the necessary flags and "push current
> sequence" semantics to xlog_cil_force_seq() and convert the AIL
> pushing to use it.
> 
> One of the complexities here is that the CIL push does not guarantee
> that the commit record for the CIL checkpoint is written to disk.
> The current log force ensures this by submitting the current ACTIVE
> iclog that the commit record was written to. We need the CIL to
> actually write this commit record to disk for an async push to
> ensure that the checkpoint actually makes it to disk and unpins the
> pinned items in the checkpoint on completion. Hence we need to pass
> down to the CIL push that we are doing an async flush so that it can
> switch out the commit_iclog if necessary to get written to disk when
> the commit iclog is finally released.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
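
For concreteness, the caller-visible difference (both calls are from
this series; a sketch of usage, not complete code):

	/* synchronous force: blocks until the checkpoint, including the
	 * commit record, is on stable storage */
	error = xfs_log_force_seq(mp, seq, XFS_LOG_SYNC, NULL);

	/* asynchronous CIL flush: kicks off the push, ensures the commit
	 * record iclog will be submitted when it is released, and returns
	 * without waiting - the AIL keeps pushing while the IO runs */
	xlog_cil_flush(mp->m_log);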

Not sure if I like this new push_commit_stable name, but at least it
doesn't have foo_sync = bar_async like last time.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c       | 38 ++++++++++++++------------
>  fs/xfs/xfs_log.h       |  1 +
>  fs/xfs/xfs_log_cil.c   | 61 ++++++++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_log_priv.h  |  5 ++++
>  fs/xfs/xfs_sysfs.c     |  1 +
>  fs/xfs/xfs_trace.c     |  1 +
>  fs/xfs/xfs_trans.c     |  2 +-
>  fs/xfs/xfs_trans_ail.c | 11 +++++---
>  8 files changed, 91 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index aa37f4319052..c53644d19dd3 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -50,11 +50,6 @@ xlog_state_get_iclog_space(
>  	int			*continued_write,
>  	int			*logoffsetp);
>  STATIC void
> -xlog_state_switch_iclogs(
> -	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	int			eventual_size);
> -STATIC void
>  xlog_grant_push_ail(
>  	struct xlog		*log,
>  	int			need_bytes);
> @@ -3104,7 +3099,7 @@ xfs_log_ticket_ungrant(
>   * This routine will mark the current iclog in the ring as WANT_SYNC and move
>   * the current iclog pointer to the next iclog in the ring.
>   */
> -STATIC void
> +void
>  xlog_state_switch_iclogs(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
> @@ -3251,6 +3246,20 @@ xfs_log_force(
>  	return -EIO;
>  }
>  
> +/*
> + * Force the log to a specific LSN.
> + *
> + * If an iclog with that lsn can be found:
> + *	If it is in the DIRTY state, just return.
> + *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
> + *		state and go to sleep or return.
> + *	If it is in any other state, go to sleep or return.
> + *
> + * Synchronous forces are implemented with a wait queue.  All callers trying
> + * to force a given lsn to disk must wait on the queue attached to the
> + * specific in-core log.  When the given in-core log finally completes its
> + * write to disk, that thread will wake up all threads waiting on the queue.
> + */
>  static int
>  xlog_force_lsn(
>  	struct xlog		*log,
> @@ -3314,18 +3323,13 @@ xlog_force_lsn(
>  }
>  
>  /*
> - * Force the in-core log to disk for a specific LSN.
> + * Force the log to a specific checkpoint sequence.
>   *
> - * Find in-core log with lsn.
> - *	If it is in the DIRTY state, just return.
> - *	If it is in the ACTIVE state, move the in-core log into the WANT_SYNC
> - *		state and go to sleep or return.
> - *	If it is in any other state, go to sleep or return.
> - *
> - * Synchronous forces are implemented with a wait queue.  All callers trying
> - * to force a given lsn to disk must wait on the queue attached to the
> - * specific in-core log.  When given in-core log finally completes its write
> - * to disk, that thread will wake up all threads waiting on the queue.
> + * First force the CIL so that all the required changes have been flushed to the
> + * iclogs. If the CIL force completed it will return a commit LSN that indicates
> + * the iclog that needs to be flushed to stable storage. If the caller needs
> + * a synchronous log force, we will wait on the iclog with the LSN returned by
> + * xlog_cil_force_seq() to be completed.
>   */
>  int
>  xfs_log_force_seq(
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 813b972e9788..1bd080ce3a95 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -104,6 +104,7 @@ struct xlog_ticket;
>  struct xfs_log_item;
>  struct xfs_item_ops;
>  struct xfs_trans;
> +struct xlog;
>  
>  int	  xfs_log_force(struct xfs_mount *mp, uint flags);
>  int	  xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq, uint flags,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 3c2b1205944d..cb849e67b1c4 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -658,6 +658,7 @@ xlog_cil_push_work(
>  	xfs_lsn_t		push_seq;
>  	struct bio		bio;
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> +	bool			push_commit_stable;
>  
>  	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -668,6 +669,8 @@ xlog_cil_push_work(
>  	spin_lock(&cil->xc_push_lock);
>  	push_seq = cil->xc_push_seq;
>  	ASSERT(push_seq <= ctx->sequence);
> +	push_commit_stable = cil->xc_push_commit_stable;
> +	cil->xc_push_commit_stable = false;
>  
>  	/*
>  	 * As we are about to switch to a new, empty CIL context, we no longer
> @@ -910,8 +913,15 @@ xlog_cil_push_work(
>  	 * The commit iclog must be written to stable storage to guarantee
>  	 * journal IO vs metadata writeback IO is correctly ordered on stable
>  	 * storage.
> +	 *
> +	 * If the push caller needs the commit to be immediately stable and the
> +	 * commit_iclog is not yet marked as XLOG_STATE_WANT_SYNC to indicate it
> +	 * will be written when released, switch its state to WANT_SYNC right
> +	 * now.
>  	 */
>  	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> +	if (push_commit_stable && commit_iclog->ic_state == XLOG_STATE_ACTIVE)
> +		xlog_state_switch_iclogs(log, commit_iclog, 0);
>  	xlog_state_release_iclog(log, commit_iclog);
>  	spin_unlock(&log->l_icloglock);
>  	return;
> @@ -996,13 +1006,26 @@ xlog_cil_push_background(
>  /*
>   * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence
>   * number that is passed. When it returns, the work will be queued for
> - * @push_seq, but it won't be completed. The caller is expected to do any
> - * waiting for push_seq to complete if it is required.
> + * @push_seq, but it won't be completed.
> + *
> + * If the caller is performing a synchronous force, we will flush the workqueue
> + * to get previously queued work moving to minimise the wait time they will
> + * undergo waiting for all outstanding pushes to complete. The caller is
> + * expected to do the required waiting for push_seq to complete.
> + *
> + * If the caller is performing an async push, we need to ensure that the
> + * checkpoint is fully flushed out of the iclogs when we finish the push. If we
> + * don't do this, then the commit record may remain sitting in memory in an
> + * ACTIVE iclog. This then requires another full log force to push to disk,
> + * which defeats the purpose of having an async, non-blocking CIL force
> + * mechanism. Hence in this case we need to pass a flag to the push work to
> + * indicate it needs to flush the commit record itself.
>   */
>  static void
>  xlog_cil_push_now(
>  	struct xlog	*log,
> -	xfs_lsn_t	push_seq)
> +	xfs_lsn_t	push_seq,
> +	bool		async)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
>  
> @@ -1012,7 +1035,8 @@ xlog_cil_push_now(
>  	ASSERT(push_seq && push_seq <= cil->xc_current_sequence);
>  
>  	/* start on any pending background push to minimise wait time on it */
> -	flush_work(&cil->xc_push_work);
> +	if (!async)
> +		flush_work(&cil->xc_push_work);
>  
>  	/*
>  	 * If the CIL is empty or we've already pushed the sequence then
> @@ -1025,6 +1049,7 @@ xlog_cil_push_now(
>  	}
>  
>  	cil->xc_push_seq = push_seq;
> +	cil->xc_push_commit_stable = async;
>  	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
>  	spin_unlock(&cil->xc_push_lock);
>  }
> @@ -1109,12 +1134,27 @@ xlog_cil_commit(
>  	xlog_cil_push_background(log);
>  }
>  
> +/*
> + * Flush the CIL to stable storage but don't wait for it to complete. This
> + * requires the CIL push to ensure the commit record for the push hits the disk,
> + * but otherwise is no different to a push done from a log force.
> + */
> +void
> +xlog_cil_flush(
> +	struct xlog	*log)
> +{
> +	xfs_csn_t	seq = log->l_cilp->xc_current_sequence;
> +
> +	trace_xfs_log_force(log->l_mp, seq, _RET_IP_);
> +	xlog_cil_push_now(log, seq, true);
> +}
> +
>  /*
>   * Conditionally push the CIL based on the sequence passed in.
>   *
> - * We only need to push if we haven't already pushed the sequence
> - * number given. Hence the only time we will trigger a push here is
> - * if the push sequence is the same as the current context.
> + * We only need to push if we haven't already pushed the sequence number given.
> + * Hence the only time we will trigger a push here is if the push sequence is
> + * the same as the current context.
>   *
>   * We return the current commit lsn to allow the callers to determine if an
>   * iclog flush is necessary following this call.
> @@ -1130,13 +1170,17 @@ xlog_cil_force_seq(
>  
>  	ASSERT(sequence <= cil->xc_current_sequence);
>  
> +	if (!sequence)
> +		sequence = cil->xc_current_sequence;
> +	trace_xfs_log_force(log->l_mp, sequence, _RET_IP_);
> +
>  	/*
>  	 * check to see if we need to force out the current context.
>  	 * xlog_cil_push() handles racing pushes for the same sequence,
>  	 * so no need to deal with it here.
>  	 */
>  restart:
> -	xlog_cil_push_now(log, sequence);
> +	xlog_cil_push_now(log, sequence, false);
>  
>  	/*
>  	 * See if we can find a previous sequence still committing.
> @@ -1160,6 +1204,7 @@ xlog_cil_force_seq(
>  			 * It is still being pushed! Wait for the push to
>  			 * complete, then start again from the beginning.
>  			 */
> +			XFS_STATS_INC(log->l_mp, xs_log_force_sleep);
>  			xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock);
>  			goto restart;
>  		}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 2d7e7cbee8b7..a863ccb5ece6 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -273,6 +273,7 @@ struct xfs_cil {
>  
>  	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
>  	xfs_csn_t		xc_push_seq;
> +	bool			xc_push_commit_stable;
>  	struct list_head	xc_committing;
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
> @@ -487,9 +488,12 @@ int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
> +
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> +void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog,
> +		int eventual_size);
>  int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
>  
>  /*
> @@ -560,6 +564,7 @@ void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
>  /*
>   * CIL force routines
>   */
> +void xlog_cil_flush(struct xlog *log);
>  xfs_lsn_t xlog_cil_force_seq(struct xlog *log, xfs_csn_t sequence);
>  
>  static inline void
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index f1bc88f4367c..18dc5eca6c04 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -10,6 +10,7 @@
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_sysfs.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_mount.h"
>  
> diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
> index 7e01e00550ac..4c86afad1617 100644
> --- a/fs/xfs/xfs_trace.c
> +++ b/fs/xfs/xfs_trace.c
> @@ -20,6 +20,7 @@
>  #include "xfs_bmap.h"
>  #include "xfs_attr.h"
>  #include "xfs_trans.h"
> +#include "xfs_log.h"
>  #include "xfs_log_priv.h"
>  #include "xfs_buf_item.h"
>  #include "xfs_quota.h"
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 87bffd12c20c..c214a69b573d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -9,7 +9,6 @@
>  #include "xfs_shared.h"
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
> -#include "xfs_log_priv.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
>  #include "xfs_extent_busy.h"
> @@ -17,6 +16,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_trans_priv.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  #include "xfs_trace.h"
>  #include "xfs_error.h"
>  #include "xfs_defer.h"
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index dbb69b4bf3ed..69aac416e2ce 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -17,6 +17,7 @@
>  #include "xfs_errortag.h"
>  #include "xfs_error.h"
>  #include "xfs_log.h"
> +#include "xfs_log_priv.h"
>  
>  #ifdef DEBUG
>  /*
> @@ -429,8 +430,12 @@ xfsaild_push(
>  
>  	/*
>  	 * If we encountered pinned items or did not finish writing out all
> -	 * buffers the last time we ran, force the log first and wait for it
> -	 * before pushing again.
> +	 * buffers the last time we ran, force a background CIL push to get the
> +	 * items unpinned in the near future. We do not wait on the CIL push as
> +	 * that could stall us for seconds if there is enough background IO
> +	 * load. Stalling for that long when the tail of the log is pinned and
> +	 * needs flushing will hard stop the transaction subsystem when log
> +	 * space runs out.
>  	 */
>  	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
>  	    (!list_empty_careful(&ailp->ail_buf_list) ||
> @@ -438,7 +443,7 @@ xfsaild_push(
>  		ailp->ail_log_flush = 0;
>  
>  		XFS_STATS_INC(mp, xs_push_ail_flush);
> -		xfs_log_force(mp, XFS_LOG_SYNC);
> +		xlog_cil_flush(mp->m_log);
>  	}
>  
>  	spin_lock(&ailp->ail_lock);
> -- 
> 2.31.1
> 

* Re: [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record
  2021-05-19 12:12 ` [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record Dave Chinner
@ 2021-05-21  0:35   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:52PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Remove another case where xlog_write() has to prepend an opheader to
> a log transaction. The unmount record + ophdr is smaller than the
> minimum amount of space guaranteed to be free in an iclog (2 *
> sizeof(ophdr)) and so we don't have to care about an unmount record
> being split across 2 iclogs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
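
The layout guarantee the patch relies on can be verified at compile
time; a minimal sketch of the pattern (the local struct name here is
hypothetical, the check mirrors the one added below):

	struct unmount_rec {
		struct xlog_op_header		ophdr;
		struct xfs_unmount_log_format	ulf;
	};

	/* usable in function scope: fail the build if the compiler
	 * inserted padding between the embedded structures, since that
	 * would change the on-disk record layout */
	BUILD_BUG_ON(sizeof(struct xlog_op_header) +
		     sizeof(struct xfs_unmount_log_format) !=
		     sizeof(struct unmount_rec));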

I love me some BUILD_BUG_ON, so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 39 ++++++++++++++++++++++++++++-----------
>  1 file changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 981cd6f8f0ff..e7a135ffa66f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -800,12 +800,22 @@ xlog_write_unmount_record(
>  	struct xlog		*log,
>  	struct xlog_ticket	*ticket)
>  {
> -	struct xfs_unmount_log_format ulf = {
> -		.magic = XLOG_UNMOUNT_TYPE,
> +	struct  {
> +		struct xlog_op_header ophdr;
> +		struct xfs_unmount_log_format ulf;
> +	} unmount_rec = {
> +		.ophdr = {
> +			.oh_clientid = XFS_LOG,
> +			.oh_tid = cpu_to_be32(ticket->t_tid),
> +			.oh_flags = XLOG_UNMOUNT_TRANS,
> +		},
> +		.ulf = {
> +			.magic = XLOG_UNMOUNT_TYPE,
> +		},
>  	};
>  	struct xfs_log_iovec reg = {
> -		.i_addr = &ulf,
> -		.i_len = sizeof(ulf),
> +		.i_addr = &unmount_rec,
> +		.i_len = sizeof(unmount_rec),
>  		.i_type = XLOG_REG_TYPE_UNMOUNT,
>  	};
>  	struct xfs_log_vec vec = {
> @@ -813,8 +823,12 @@ xlog_write_unmount_record(
>  		.lv_iovecp = &reg,
>  	};
>  
> +	BUILD_BUG_ON((sizeof(struct xlog_op_header) +
> +		      sizeof(struct xfs_unmount_log_format)) !=
> +							sizeof(unmount_rec));
> +
>  	/* account for space used by record data */
> -	ticket->t_curr_res -= sizeof(ulf);
> +	ticket->t_curr_res -= sizeof(unmount_rec);
>  
>  	/*
>  	 * For external log devices, we need to flush the data device cache
> @@ -2145,6 +2159,8 @@ xlog_write_calc_vec_length(
>  
>  	/* Don't account for regions with embedded ophdrs */
>  	if (optype && headers > 0) {
> +		if (optype & XLOG_UNMOUNT_TRANS)
> +			headers--;
>  		if (optype & XLOG_START_TRANS) {
>  			ASSERT(headers >= 2);
>  			headers -= 2;
> @@ -2359,12 +2375,11 @@ xlog_write(
>  
>  	/*
>  	 * If this is a commit or unmount transaction, we don't need a start
> -	 * record to be written.  We do, however, have to account for the
> -	 * commit or unmount header that gets written. Hence we always have
> -	 * to account for an extra xlog_op_header here for commit and unmount
> -	 * records.
> +	 * record to be written.  We do, however, have to account for the commit
> +	 * header that gets written. Hence we always have to account for an
> +	 * extra xlog_op_header here for commit records.
>  	 */
> -	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
> +	if (optype & XLOG_COMMIT_TRANS)
>  		ticket->t_curr_res -= sizeof(struct xlog_op_header);
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
> @@ -2424,6 +2439,8 @@ xlog_write(
>  				ophdr = reg->i_addr;
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
> +			} else if (optype & XLOG_UNMOUNT_TRANS) {
> +				ophdr = reg->i_addr;
>  			} else {
>  				ophdr = xlog_write_setup_ophdr(log, ptr,
>  							ticket, optype);
> @@ -2454,7 +2471,7 @@ xlog_write(
>  			/*
>  			 * Copy region.
>  			 *
> -			 * Commit and unmount records just log an opheader, so
> +			 * Commit records just log an opheader, so
>  			 * we can have empty payloads with no data region to
>  			 * copy.  Hence we only copy the payload if the vector
>  			 * says it has data to copy.
> -- 
> 2.31.1
> 

* Re: [PATCH 16/39] xfs: log tickets don't need log client id
  2021-05-19 12:12 ` [PATCH 16/39] xfs: log tickets don't need log client id Dave Chinner
@ 2021-05-21  0:38   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-21  0:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:54PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently set the log ticket client ID when we reserve a
> transaction. This client ID is only ever written to the log by
> a CIL checkpoint or unmount records, and so anything using a high
> level transaction allocated through xfs_trans_alloc() does not need
> a log ticket client ID to be set.
> 
> For the CIL checkpoint, the client ID written to the journal is
> always XFS_TRANSACTION, and for the unmount record it is always
> XFS_LOG, and nothing else writes to the log. All of these operations
> tell xlog_write() exactly what they need to write to the log (the
> optype) and build their own opheaders for start, commit and unmount
> records. Hence we no longer need to set the client id in either the
> log ticket or the xfs_trans.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Yay for removing never-used macros,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_log_format.h |  1 -
>  fs/xfs/xfs_log.c               | 47 ++++++----------------------------
>  fs/xfs/xfs_log.h               | 16 +++++-------
>  fs/xfs/xfs_log_cil.c           |  2 +-
>  fs/xfs/xfs_log_priv.h          | 10 ++------
>  fs/xfs/xfs_trans.c             |  6 ++---
>  6 files changed, 19 insertions(+), 63 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..78d5368a7caa 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -69,7 +69,6 @@ static inline uint xlog_get_cycle(char *ptr)
>  
>  /* Log Clients */
>  #define XFS_TRANSACTION		0x69
> -#define XFS_VOLUME		0x2
>  #define XFS_LOG			0xaa
>  
>  #define XLOG_UNMOUNT_TYPE	0x556e	/* Un for Unmount */
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 76a73f4b0f30..ccf584914b6a 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -433,10 +433,9 @@ xfs_log_regrant(
>  int
>  xfs_log_reserve(
>  	struct xfs_mount	*mp,
> -	int		 	unit_bytes,
> -	int		 	cnt,
> +	int			unit_bytes,
> +	int			cnt,
>  	struct xlog_ticket	**ticp,
> -	uint8_t		 	client,
>  	bool			permanent)
>  {
>  	struct xlog		*log = mp->m_log;
> @@ -444,15 +443,13 @@ xfs_log_reserve(
>  	int			need_bytes;
>  	int			error = 0;
>  
> -	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
> -
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
>  	XFS_STATS_INC(mp, xs_try_logspace);
>  
>  	ASSERT(*ticp == NULL);
> -	tic = xlog_ticket_alloc(log, unit_bytes, cnt, client, permanent);
> +	tic = xlog_ticket_alloc(log, unit_bytes, cnt, permanent);
>  	*ticp = tic;
>  
>  	xlog_grant_push_ail(log, tic->t_cnt ? tic->t_unit_res * tic->t_cnt
> @@ -853,7 +850,7 @@ xlog_unmount_write(
>  	struct xlog_ticket	*tic = NULL;
>  	int			error;
>  
> -	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
> +	error = xfs_log_reserve(mp, 600, 1, &tic, 0);
>  	if (error)
>  		goto out_err;
>  
> @@ -2181,35 +2178,13 @@ xlog_write_calc_vec_length(
>  
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
> -	struct xlog		*log,
>  	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -	ophdr->oh_clientid = ticket->t_clientid;
> +	ophdr->oh_clientid = XFS_TRANSACTION;
>  	ophdr->oh_res2 = 0;
> -
> -	/* are we copying a commit or unmount record? */
> -	ophdr->oh_flags = flags;
> -
> -	/*
> -	 * We've seen logs corrupted with bad transaction client ids.  This
> -	 * makes sure that XFS doesn't generate them on.  Turn this into an EIO
> -	 * and shut down the filesystem.
> -	 */
> -	switch (ophdr->oh_clientid)  {
> -	case XFS_TRANSACTION:
> -	case XFS_VOLUME:
> -	case XFS_LOG:
> -		break;
> -	default:
> -		xfs_warn(log->l_mp,
> -			"Bad XFS transaction clientid 0x%x in ticket "PTR_FMT,
> -			ophdr->oh_clientid, ticket);
> -		return NULL;
> -	}
> -
> +	ophdr->oh_flags = 0;
>  	return ophdr;
>  }
>  
> @@ -2439,11 +2414,7 @@ xlog_write(
>  				if (index)
>  					optype &= ~XLOG_START_TRANS;
>  			} else {
> -				ophdr = xlog_write_setup_ophdr(log, ptr,
> -							ticket, optype);
> -				if (!ophdr)
> -					return -EIO;
> -
> +				ophdr = xlog_write_setup_ophdr(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  					   sizeof(struct xlog_op_header));
>  				added_ophdr = true;
> @@ -3499,7 +3470,6 @@ xlog_ticket_alloc(
>  	struct xlog		*log,
>  	int			unit_bytes,
>  	int			cnt,
> -	char			client,
>  	bool			permanent)
>  {
>  	struct xlog_ticket	*tic;
> @@ -3517,7 +3487,6 @@ xlog_ticket_alloc(
>  	tic->t_cnt		= cnt;
>  	tic->t_ocnt		= cnt;
>  	tic->t_tid		= prandom_u32();
> -	tic->t_clientid		= client;
>  	if (permanent)
>  		tic->t_flags |= XLOG_TIC_PERM_RESERV;
>  
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 1bd080ce3a95..c0c3141944ea 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -117,16 +117,12 @@ int	  xfs_log_mount_finish(struct xfs_mount *mp);
>  void	xfs_log_mount_cancel(struct xfs_mount *);
>  xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
>  xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
> -void	  xfs_log_space_wake(struct xfs_mount *mp);
> -int	  xfs_log_reserve(struct xfs_mount *mp,
> -			  int		   length,
> -			  int		   count,
> -			  struct xlog_ticket **ticket,
> -			  uint8_t		   clientid,
> -			  bool		   permanent);
> -int	  xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> -void      xfs_log_unmount(struct xfs_mount *mp);
> -int	  xfs_log_force_umount(struct xfs_mount *mp, int logerror);
> +void	xfs_log_space_wake(struct xfs_mount *mp);
> +int	xfs_log_reserve(struct xfs_mount *mp, int length, int count,
> +			struct xlog_ticket **ticket, bool permanent);
> +int	xfs_log_regrant(struct xfs_mount *mp, struct xlog_ticket *tic);
> +void	xfs_log_unmount(struct xfs_mount *mp);
> +int	xfs_log_force_umount(struct xfs_mount *mp, int logerror);
>  bool	xfs_log_writable(struct xfs_mount *mp);
>  
>  struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 2983adaed675..9d3a495f1c78 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -37,7 +37,7 @@ xlog_cil_ticket_alloc(
>  {
>  	struct xlog_ticket *tic;
>  
> -	tic = xlog_ticket_alloc(log, 0, 1, XFS_TRANSACTION, 0);
> +	tic = xlog_ticket_alloc(log, 0, 1, 0);
>  
>  	/*
>  	 * set the current reservation to zero so we know to steal the basic
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 87447fa34c43..e4e3e71b2b1b 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -158,7 +158,6 @@ typedef struct xlog_ticket {
>  	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
>  	char		   t_ocnt;	 /* original count		 : 1  */
>  	char		   t_cnt;	 /* current count		 : 1  */
> -	char		   t_clientid;	 /* who does this belong to;	 : 1  */
>  	char		   t_flags;	 /* properties of reservation	 : 1  */
>  
>          /* reservation array fields */
> @@ -465,13 +464,8 @@ extern __le32	 xlog_cksum(struct xlog *log, struct xlog_rec_header *rhead,
>  			    char *dp, int size);
>  
>  extern kmem_zone_t *xfs_log_ticket_zone;
> -struct xlog_ticket *
> -xlog_ticket_alloc(
> -	struct xlog	*log,
> -	int		unit_bytes,
> -	int		count,
> -	char		client,
> -	bool		permanent);
> +struct xlog_ticket *xlog_ticket_alloc(struct xlog *log, int unit_bytes,
> +		int count, bool permanent);
>  
>  static inline void
>  xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index c214a69b573d..bc72826d1f97 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -194,11 +194,9 @@ xfs_trans_reserve(
>  			ASSERT(resp->tr_logflags & XFS_TRANS_PERM_LOG_RES);
>  			error = xfs_log_regrant(mp, tp->t_ticket);
>  		} else {
> -			error = xfs_log_reserve(mp,
> -						resp->tr_logres,
> +			error = xfs_log_reserve(mp, resp->tr_logres,
>  						resp->tr_logcount,
> -						&tp->t_ticket, XFS_TRANSACTION,
> -						permanent);
> +						&tp->t_ticket, permanent);
>  		}
>  
>  		if (error)
> -- 
> 2.31.1
> 

* Re: [PATCH 20/39] xfs: pass lv chain length into xlog_write()
  2021-05-19 12:12 ` [PATCH 20/39] xfs: pass lv chain length into xlog_write() Dave Chinner
@ 2021-05-27 17:20   ` Darrick J. Wong
  2021-06-02 22:18     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 17:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The caller of xlog_write() usually has a close accounting of the
> aggregated vector length contained in the log vector chain passed to
> xlog_write(). There is no need to iterate the chain to calculate he
> length of the data in xlog_write_calculate_len() if the caller is
> already iterating that chain to build it.
> 
> Passing in the vector length avoids doing an extra chain iteration,
> which can be a significant amount of work given that large CIL
> commits can have hundreds of thousands of vectors attached to the
> chain.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
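
The shape of the change, condensed from the xlog_cil_push_work() hunks
below (not complete code): the push already walks every log vector
while stealing it from the CIL, so the byte count falls out of that
walk for free and can be handed straight to xlog_write().

	num_bytes = 0;
	while (!list_empty(&cil->xc_cil)) {
		/* ... detach the item's log vector onto the chain ... */

		/* we don't write ordered log vectors, so don't count them */
		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
			num_bytes += lv->lv_bytes;
	}
	...
	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
				XLOG_START_TRANS, num_bytes);
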
> ---
>  fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
>  fs/xfs/xfs_log_cil.c  | 17 ++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  3 files changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e849f15e9e04..58f9aafce29e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -864,7 +864,8 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
> +				reg.i_len);
>  }
>  
>  /*
> @@ -1588,7 +1589,8 @@ xlog_commit_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> +				reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2108,32 +2110,6 @@ xlog_print_trans(
>  	}
>  }
>  
> -/*
> - * Calculate the potential space needed by the log vector. All regions contain
> - * their own opheaders and they are accounted for in region space so we don't
> - * need to add them to the vector length here.
> - */
> -static int
> -xlog_write_calc_vec_length(
> -	struct xlog_ticket	*ticket,
> -	struct xfs_log_vec	*log_vector,
> -	uint			optype)
> -{
> -	struct xfs_log_vec	*lv;
> -	int			len = 0;
> -	int			i;
> -
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
> -		/* we don't write ordered log vectors */
> -		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> -			continue;
> -
> -		for (i = 0; i < lv->lv_niovecs; i++)
> -			len += lv->lv_iovecp[i].i_len;
> -	}
> -	return len;
> -}
> -
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
>  	struct xlog_op_header	*ophdr,
> @@ -2296,13 +2272,13 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			optype)
> +	uint			optype,
> +	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
>  	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
>  	int			index = 0;
> -	int			len;
>  	int			partial_copy = 0;
>  	int			partial_copy_len = 0;
>  	int			contwr = 0;
> @@ -2317,7 +2293,6 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
>  	if (start_lsn)
>  		*start_lsn = 0;
>  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 58900171de09..7a6b80666f98 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
>  				sizeof(struct xfs_trans_header);
>  	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
>  
> -	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> -
>  	lvhdr->lv_niovecs = 2;
>  	lvhdr->lv_iovecp = &hdr->lhdr[0];
> +	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
>  	lvhdr->lv_next = ctx->lv_chain;
> +
> +	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
>  
>  /*
> @@ -742,7 +743,8 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
> -	int			num_iovecs;
> +	int			num_iovecs = 0;
> +	int			num_bytes = 0;
>  	int			error = 0;
>  	struct xlog_cil_trans_hdr thdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> @@ -835,7 +837,6 @@ xlog_cil_push_work(
>  	 * by the flush lock.
>  	 */
>  	lv = NULL;
> -	num_iovecs = 0;
>  	while (!list_empty(&cil->xc_cil)) {
>  		struct xfs_log_item	*item;
>  
> @@ -849,6 +850,10 @@ xlog_cil_push_work(
>  		lv = item->li_lv;
>  		item->li_lv = NULL;
>  		num_iovecs += lv->lv_niovecs;
> +
> +		/* we don't write ordered log vectors */
> +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> +			num_bytes += lv->lv_bytes;
>  	}
>  
>  	/*
> @@ -887,6 +892,8 @@ xlog_cil_push_work(
>  	 * transaction header here as it is not accounted for in xlog_write().
>  	 */
>  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> +	num_iovecs += lvhdr.lv_niovecs;

I have the same question that Brian had last time, which is: What's the
point of updating num_iovecs here?  It's not used after
xlog_cil_build_trans_hdr, either here or at the end of the patchset.

Is the idea that num_{iovecs,bytes} will always reflect everything
in the cil context chain that's about to be passed to xlog_write?

--D

> +	num_bytes += lvhdr.lv_bytes;
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -901,7 +908,7 @@ xlog_cil_push_work(
>  	 * write head.
>  	 */
>  	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS);
> +				XLOG_START_TRANS, num_bytes);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 301c36165974..eba905c273b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint optype);
> +		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  
> -- 
> 2.31.1
> 

* Re: [PATCH 21/39] xfs: introduce xlog_write_single()
  2021-05-19 12:12 ` [PATCH 21/39] xfs: introduce xlog_write_single() Dave Chinner
@ 2021-05-27 17:27   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 17:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:12:59PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Introduce an optimised version of xlog_write() that is used when the
> entire write will fit in a single iclog. This greatly simplifies the
> implementation of writing a log vector chain into an iclog, and sets
> the groundwork for a much more understandable xlog_write()
> implementation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
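
Condensed from the hunk below, the dispatch added to xlog_write() (not
complete code; the surrounding loop provides contwr, lv, iclog and the
running len):

	/* If this is a single iclog write, go fast... */
	if (!contwr && lv == log_vector) {
		record_cnt = xlog_write_single(lv, ticket, iclog,
					log_offset, len);
		len = 0;	/* the whole chain was consumed... */
		data_cnt = len;	/* ...so there is no continuation data */
		break;
	}
	/* otherwise fall through to the general multi-iclog path */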

Looks like a fairly simple hoist.  I might have more comments once I
wade through the next patch.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 58f9aafce29e..3b74d21e3786 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2225,6 +2225,52 @@ xlog_write_copy_finish(
>  	return error;
>  }
>  
> +/*
> + * Write log vectors into a single iclog which is guaranteed by the caller
> + * to have enough space to write the entire log vector into. Return the number
> + * of log vectors written into the iclog.
> + */
> +static int
> +xlog_write_single(
> +	struct xfs_log_vec	*log_vector,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	*iclog,
> +	uint32_t		log_offset,
> +	uint32_t		len)
> +{
> +	struct xfs_log_vec	*lv;
> +	void			*ptr;
> +	int			index = 0;
> +	int			record_cnt = 0;
> +
> +	ASSERT(log_offset + len <= iclog->ic_size);
> +
> +	ptr = iclog->ic_datap + log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
> +
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));
> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> +			record_cnt++;
> +		}
> +	}
> +	ASSERT(len == 0);
> +	return record_cnt;
> +}
> +
> +
>  /*
>   * Write some region out to in-core log
>   *
> @@ -2305,16 +2351,25 @@ xlog_write(
>  			return error;
>  
>  		ASSERT(log_offset <= iclog->ic_size - 1);
> -		ptr = iclog->ic_datap + log_offset;
>  
>  		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> +		/* If this is a single iclog write, go fast... */
> +		if (!contwr && lv == log_vector) {
> +			record_cnt = xlog_write_single(lv, ticket, iclog,
> +						log_offset, len);
> +			len = 0;
> +			data_cnt = len;
> +			break;
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
>  		 */
> +		ptr = iclog->ic_datap + log_offset;
>  		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
>  			struct xfs_log_iovec	*reg;
>  			struct xlog_op_header	*ophdr;
> -- 
> 2.31.1
> 

* Re: [PATCH 22/39] xfs: introduce xlog_write_partial()
  2021-05-19 12:13 ` [PATCH 22/39] xfs: introduce xlog_write_partial() Dave Chinner
@ 2021-05-27 18:06   ` Darrick J. Wong
  2021-06-02 22:21     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:00PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Handle writing of a logvec chain into an iclog that doesn't have
> enough space to fit it all. The iclog has already been changed to
> WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> in the iclog is exclusively owned by this logvec chain.
> 
> The difference between the single and partial cases is that
> we end up with partial iovec writes in the iclog and have to split
> log vec regions across two iclogs. The state handling for this is
> currently awful and so we're building up the pieces needed to
> handle this more cleanly one at a time.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
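
To make the split case concrete, an illustrative on-disk layout for a
region that straddles two iclogs (a diagram, not code; the flag names
are from the existing opheader format):

	iclog N:   [ophdr: XLOG_CONTINUE_TRANS][first part of region]
	iclog N+1: [ophdr: XLOG_WAS_CONT_TRANS][remainder of region]

The first fragment must carry a non-zero data length, otherwise log
recovery skips its opheader and wrongly attaches the continuation to
the previous region - exactly the constraint the new comment in
xlog_write_partial() below spells out.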

Egad this diff is hard to read.  Brian's right, the patience diff is
easier to understand and shorter to boot.

That said, I actually understand what the new code does now, so:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Might be nice to hoist:

	memcpy(ptr, reg->i_addr + reg_offset, rlen);
	xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
	(*record_cnt)++;
	*data_cnt += rlen;

into a helper but it's only four lines so I'm not gonna fuss any
further.
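
For reference, such a helper might look like this (hypothetical name
and signature, integer types simplified):

	static inline void
	xlog_write_iovec(
		void		**pp,
		uint32_t	*len,
		uint32_t	*log_offset,
		uint32_t	*record_cnt,
		uint32_t	*data_cnt,
		void		*data,
		uint32_t	write_len)
	{
		/* copy the region payload and advance all the accounting */
		memcpy(*pp, data, write_len);
		xlog_write_adv_cnt(pp, len, log_offset, write_len);
		(*record_cnt)++;
		*data_cnt += write_len;
	}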

--D

> ---
>  fs/xfs/xfs_log.c | 504 ++++++++++++++++++++++-------------------------
>  1 file changed, 240 insertions(+), 264 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 3b74d21e3786..98a3e2e4f1e0 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2110,166 +2110,250 @@ xlog_print_trans(
>  	}
>  }
>  
> -static xlog_op_header_t *
> -xlog_write_setup_ophdr(
> -	struct xlog_op_header	*ophdr,
> -	struct xlog_ticket	*ticket)
> -{
> -	ophdr->oh_clientid = XFS_TRANSACTION;
> -	ophdr->oh_res2 = 0;
> -	ophdr->oh_flags = 0;
> -	return ophdr;
> -}
> -
>  /*
> - * Set up the parameters of the region copy into the log. This has
> - * to handle region write split across multiple log buffers - this
> - * state is kept external to this function so that this code can
> - * be written in an obvious, self documenting manner.
> + * Write whole log vectors into a single iclog which is guaranteed to have
> + * either sufficient space for the entire log vector chain to be written or
> + * exclusive access to the remaining space in the iclog.
> + *
> + * Return the number of iovecs and data written into the iclog, as well as
> + * a pointer to the logvec that doesn't fit in the log (or NULL if we hit the
> + * end of the chain).
>   */
> -static int
> -xlog_write_setup_copy(
> +static struct xfs_log_vec *
> +xlog_write_single(
> +	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_op_header	*ophdr,
> -	int			space_available,
> -	int			space_required,
> -	int			*copy_off,
> -	int			*copy_len,
> -	int			*last_was_partial_copy,
> -	int			*bytes_consumed)
> -{
> -	int			still_to_copy;
> -
> -	still_to_copy = space_required - *bytes_consumed;
> -	*copy_off = *bytes_consumed;
> -
> -	if (still_to_copy <= space_available) {
> -		/* write of region completes here */
> -		*copy_len = still_to_copy;
> -		ophdr->oh_len = cpu_to_be32(*copy_len);
> -		if (*last_was_partial_copy)
> -			ophdr->oh_flags |= (XLOG_END_TRANS|XLOG_WAS_CONT_TRANS);
> -		*last_was_partial_copy = 0;
> -		*bytes_consumed = 0;
> -		return 0;
> -	}
> -
> -	/* partial write of region, needs extra log op header reservation */
> -	*copy_len = space_available;
> -	ophdr->oh_len = cpu_to_be32(*copy_len);
> -	ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> -	if (*last_was_partial_copy)
> -		ophdr->oh_flags |= XLOG_WAS_CONT_TRANS;
> -	*bytes_consumed += *copy_len;
> -	(*last_was_partial_copy)++;
> -
> -	/* account for new log op header */
> -	ticket->t_curr_res -= sizeof(struct xlog_op_header);
> -
> -	return sizeof(struct xlog_op_header);
> -}
> -
> -static int
> -xlog_write_copy_finish(
> -	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
> -	uint			flags,
> -	int			*record_cnt,
> -	int			*data_cnt,
> -	int			*partial_copy,
> -	int			*partial_copy_len,
> -	int			log_offset,
> -	struct xlog_in_core	**commit_iclog)
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt)
>  {
> -	int			error;
> +	struct xfs_log_vec	*lv;
> +	void			*ptr;
> +	int			index;
> +
> +	ASSERT(*log_offset + *len <= iclog->ic_size ||
> +		iclog->ic_state == XLOG_STATE_WANT_SYNC);
>  
> -	if (*partial_copy) {
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * This iclog has already been marked WANT_SYNC by
> -		 * xlog_state_get_iclog_space.
> +		 * If the entire log vec does not fit in the iclog, punt it to
> +		 * the partial copy loop which can handle this case.
>  		 */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> -		goto release_iclog;
> -	}
> +		if (lv->lv_niovecs &&
> +		    lv->lv_bytes > iclog->ic_size - *log_offset)
> +			break;
>  
> -	*partial_copy = 0;
> -	*partial_copy_len = 0;
> +		/*
> +		 * Ordered log vectors have no regions to write so this
> +		 * loop will naturally skip them.
> +		 */
> +		for (index = 0; index < lv->lv_niovecs; index++) {
> +			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> +			struct xlog_op_header	*ophdr = reg->i_addr;
>  
> -	if (iclog->ic_size - log_offset <= sizeof(xlog_op_header_t)) {
> -		/* no more space in this iclog - push it. */
> -		spin_lock(&log->l_icloglock);
> -		xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> -		*record_cnt = 0;
> -		*data_cnt = 0;
> +			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> +			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
>  
> -		if (iclog->ic_state == XLOG_STATE_ACTIVE)
> -			xlog_state_switch_iclogs(log, iclog, 0);
> -		else
> -			ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> -			       iclog->ic_state == XLOG_STATE_IOERROR);
> -		if (!commit_iclog)
> -			goto release_iclog;
> -		spin_unlock(&log->l_icloglock);
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> -		*commit_iclog = iclog;
> +			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +						sizeof(struct xlog_op_header));
> +			memcpy(ptr, reg->i_addr, reg->i_len);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, reg->i_len);
> +			(*record_cnt)++;
> +			*data_cnt += reg->i_len;
> +		}
>  	}
> +	ASSERT(*len == 0 || lv);
> +	return lv;
> +}
>  
> -	return 0;
> +static int
> +xlog_write_get_more_iclog_space(
> +	struct xlog		*log,
> +	struct xlog_ticket	*ticket,
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
> +{
> +	struct xlog_in_core	*iclog = *iclogp;
> +	int			error;
>  
> -release_iclog:
> +	spin_lock(&log->l_icloglock);
> +	xlog_state_finish_copy(log, iclog, *record_cnt, *data_cnt);
> +	ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> +	       iclog->ic_state == XLOG_STATE_IOERROR);
>  	error = xlog_state_release_iclog(log, iclog);
>  	spin_unlock(&log->l_icloglock);
> -	return error;
> +	if (error)
> +		return error;
> +
> +	error = xlog_state_get_iclog_space(log, len, &iclog,
> +				ticket, contwr, log_offset);
> +	if (error)
> +		return error;
> +	*record_cnt = 0;
> +	*data_cnt = 0;
> +	*iclogp = iclog;
> +	return 0;
>  }
>  
>  /*
> - * Write log vectors into a single iclog which is guaranteed by the caller
> - * to have enough space to write the entire log vector into. Return the number
> - * of log vectors written into the iclog.
> + * Write log vectors into a single iclog which is smaller than the current chain
> + * length. We write until we cannot fit a full record into the remaining space
> + * and then stop. We return the log vector that is to be written that cannot
> + * wholly fit in the iclog.
>   */
> -static int
> -xlog_write_single(
> +static struct xfs_log_vec *
> +xlog_write_partial(
> +	struct xlog		*log,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
> -	struct xlog_in_core	*iclog,
> -	uint32_t		log_offset,
> -	uint32_t		len)
> +	struct xlog_in_core	**iclogp,
> +	uint32_t		*log_offset,
> +	uint32_t		*len,
> +	uint32_t		*record_cnt,
> +	uint32_t		*data_cnt,
> +	int			*contwr)
>  {
> -	struct xfs_log_vec	*lv;
> +	struct xlog_in_core	*iclog = *iclogp;
> +	struct xfs_log_vec	*lv = log_vector;
> +	struct xfs_log_iovec	*reg;
> +	struct xlog_op_header	*ophdr;
>  	void			*ptr;
>  	int			index = 0;
> -	int			record_cnt = 0;
> +	uint32_t		rlen;
> +	int			error;
>  
> -	ASSERT(log_offset + len <= iclog->ic_size);
> +	/* walk the logvec, copying until we run out of space in the iclog */
> +	ptr = iclog->ic_datap + *log_offset;
> +	for (index = 0; index < lv->lv_niovecs; index++) {
> +		uint32_t	reg_offset = 0;
> +
> +		reg = &lv->lv_iovecp[index];
> +		ASSERT(reg->i_len % sizeof(int32_t) == 0);
>  
> -	ptr = iclog->ic_datap + log_offset;
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/*
> -		 * Ordered log vectors have no regions to write so this
> -		 * loop will naturally skip them.
> +		 * The first region of a continuation must have a non-zero
> +		 * length otherwise log recovery will just skip over it and
> +		 * start recovering from the next opheader it finds. Because we
> +		 * mark the next opheader as a continuation, recovery will then
> +		 * incorrectly add the continuation to the previous region and
> +		 * that breaks stuff.
> +		 *
> +		 * Hence if there isn't space for region data after the
> +		 * opheader, then we need to start afresh with a new iclog.
>  		 */
> -		for (index = 0; index < lv->lv_niovecs; index++) {
> -			struct xfs_log_iovec	*reg = &lv->lv_iovecp[index];
> -			struct xlog_op_header	*ophdr = reg->i_addr;
> +		if (iclog->ic_size - *log_offset <=
> +					sizeof(struct xlog_op_header)) {
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
> +		}
>  
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		ophdr = reg->i_addr;
> +		rlen = min_t(uint32_t, reg->i_len, iclog->ic_size - *log_offset);
> +
> +		ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> +		ophdr->oh_len = cpu_to_be32(rlen - sizeof(struct xlog_op_header));
> +		if (rlen != reg->i_len)
> +			ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> +
> +		ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> +		xlog_verify_dest_ptr(log, ptr);
> +		memcpy(ptr, reg->i_addr, rlen);
> +		xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +		(*record_cnt)++;
> +		*data_cnt += rlen;
> +
> +		/* If we wrote the whole region, move to the next. */
> +		if (rlen == reg->i_len)
> +			continue;
> +
> +		/*
> +		 * We now have a partially written iovec, but it can span
> +		 * multiple iclogs so we loop here. First we release the iclog
> +		 * we currently have, then we get a new iclog and add a new
> +		 * opheader. Then we continue copying from where we were until
> +		 * we either complete the iovec or fill the iclog. If we
> +		 * complete the iovec, then we increment the index and go right
> +		 * back to the top of the outer loop. If we fill the iclog, we
> +		 * run the inner loop again.
> +		 *
> +		 * This is complicated by the tail of a region using all the
> +		 * space in an iclog and hence requiring us to release the iclog
> +		 * and get a new one before returning to the outer loop. We must
> +		 * always guarantee that we exit this inner loop with at least
> +		 * space for log transaction opheaders left in the current
> +		 * iclog, hence we cannot just terminate the loop at the end
> +		 * of the continuation. So we loop while there is no
> +		 * space left in the current iclog, and check for the end of the
> +		 * continuation after getting a new iclog.
> +		 */
> +		do {
> +			/*
> +			 * Account for the continuation opheader before we get
> +			 * a new iclog. This is necessary so that we reserve
> +			 * space in the iclog for it.
> +			 */
> +			*len += sizeof(struct xlog_op_header);
> +			ticket->t_curr_res -= sizeof(struct xlog_op_header);
> +
> +			error = xlog_write_get_more_iclog_space(log, ticket,
> +					&iclog, log_offset, *len, record_cnt,
> +					data_cnt, contwr);
> +			if (error)
> +				return ERR_PTR(error);
> +			ptr = iclog->ic_datap + *log_offset;
>  
> +			ophdr = ptr;
>  			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -			ophdr->oh_len = cpu_to_be32(reg->i_len -
> +			ophdr->oh_clientid = XFS_TRANSACTION;
> +			ophdr->oh_res2 = 0;
> +			ophdr->oh_flags = XLOG_WAS_CONT_TRANS;
> +
> +			xlog_write_adv_cnt(&ptr, len, log_offset,
>  						sizeof(struct xlog_op_header));
> -			memcpy(ptr, reg->i_addr, reg->i_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, reg->i_len);
> -			record_cnt++;
> -		}
> +			*data_cnt += sizeof(struct xlog_op_header);
> +
> +			/*
> +			 * If rlen fits in the iclog, then end the region
> +			 * continuation. Otherwise we're going around again.
> +			 */
> +			reg_offset += rlen;
> +			rlen = reg->i_len - reg_offset;
> +			if (rlen <= iclog->ic_size - *log_offset)
> +				ophdr->oh_flags |= XLOG_END_TRANS;
> +			else
> +				ophdr->oh_flags |= XLOG_CONTINUE_TRANS;
> +
> +			rlen = min_t(uint32_t, rlen, iclog->ic_size - *log_offset);
> +			ophdr->oh_len = cpu_to_be32(rlen);
> +
> +			xlog_verify_dest_ptr(log, ptr);
> +			memcpy(ptr, reg->i_addr + reg_offset, rlen);
> +			xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> +			(*record_cnt)++;
> +			*data_cnt += rlen;
> +
> +		} while (ophdr->oh_flags & XLOG_CONTINUE_TRANS);
>  	}
> -	ASSERT(len == 0);
> -	return record_cnt;
> -}
>  
> +	/*
> +	 * No more iovecs remain in this logvec so return the next log vec to
> +	 * the caller so it can go back to fast path copying.
> +	 */
> +	*iclogp = iclog;
> +	return lv->lv_next;
> +}
>  
>  /*
>   * Write some region out to in-core log
> @@ -2323,14 +2407,11 @@ xlog_write(
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> -	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
> -	int			index = 0;
> -	int			partial_copy = 0;
> -	int			partial_copy_len = 0;
>  	int			contwr = 0;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> +	int			log_offset;
>  
>  	if (ticket->t_curr_res < 0) {
>  		xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
> @@ -2339,146 +2420,40 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	if (start_lsn)
> -		*start_lsn = 0;
> -	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -		void		*ptr;
> -		int		log_offset;
> -
> -		error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> -						   &contwr, &log_offset);
> -		if (error)
> -			return error;
> -
> -		ASSERT(log_offset <= iclog->ic_size - 1);
> +	error = xlog_state_get_iclog_space(log, len, &iclog, ticket,
> +					   &contwr, &log_offset);
> +	if (error)
> +		return error;
>  
> -		/* Start_lsn is the first lsn written to. */
> -		if (start_lsn && !*start_lsn)
> -			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +	/* start_lsn is the LSN of the first iclog written to. */
> +	if (start_lsn)
> +		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> -		/* If this is a single iclog write, go fast... */
> -		if (!contwr && lv == log_vector) {
> -			record_cnt = xlog_write_single(lv, ticket, iclog,
> -						log_offset, len);
> -			len = 0;
> -			data_cnt = len;
> +	while (lv) {
> +		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt);
> +		if (!lv)
>  			break;
> -		}
> -
> -		/*
> -		 * This loop writes out as many regions as can fit in the amount
> -		 * of space which was allocated by xlog_state_get_iclog_space().
> -		 */
> -		ptr = iclog->ic_datap + log_offset;
> -		while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> -			struct xfs_log_iovec	*reg;
> -			struct xlog_op_header	*ophdr;
> -			int			copy_len;
> -			int			copy_off;
> -			bool			ordered = false;
> -			bool			added_ophdr = false;
> -
> -			/* ordered log vectors have no regions to write */
> -			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> -				ASSERT(lv->lv_niovecs == 0);
> -				ordered = true;
> -				goto next_lv;
> -			}
> -
> -			reg = &vecp[index];
> -			ASSERT(reg->i_len % sizeof(int32_t) == 0);
> -			ASSERT((unsigned long)ptr % sizeof(int32_t) == 0);
> -
> -			/*
> -			 * Regions always have their ophdr at the start of the
> -			 * region, except for:
> -			 * - a transaction start which has a start record ophdr
> -			 *   before the first region ophdr; and
> -			 * - the previous region didn't fully fit into an iclog
> -			 *   so needs a continuation ophdr to prepend the region
> -			 *   in this new iclog.
> -			 */
> -			ophdr = reg->i_addr;
> -			if (optype && index) {
> -				optype &= ~XLOG_START_TRANS;
> -			} else if (partial_copy) {
> -                                ophdr = xlog_write_setup_ophdr(ptr, ticket);
> -				xlog_write_adv_cnt(&ptr, &len, &log_offset,
> -					   sizeof(struct xlog_op_header));
> -				added_ophdr = true;
> -			}
> -			ophdr->oh_tid = cpu_to_be32(ticket->t_tid);
> -
> -			len += xlog_write_setup_copy(ticket, ophdr,
> -						     iclog->ic_size-log_offset,
> -						     reg->i_len,
> -						     &copy_off, &copy_len,
> -						     &partial_copy,
> -						     &partial_copy_len);
> -			xlog_verify_dest_ptr(log, ptr);
>  
> -
> -			/*
> -			 * Wart: need to update length in embedded ophdr not
> -			 * to include it's own length.
> -			 */
> -			if (!added_ophdr) {
> -				ophdr->oh_len = cpu_to_be32(copy_len -
> -						sizeof(struct xlog_op_header));
> -			}
> -
> -			ASSERT(copy_len > 0);
> -			memcpy(ptr, reg->i_addr + copy_off, copy_len);
> -			xlog_write_adv_cnt(&ptr, &len, &log_offset, copy_len);
> -
> -			if (added_ophdr)
> -				copy_len += sizeof(struct xlog_op_header);
> -			record_cnt++;
> -			data_cnt += contwr ? copy_len : 0;
> -
> -			error = xlog_write_copy_finish(log, iclog, optype,
> -						       &record_cnt, &data_cnt,
> -						       &partial_copy,
> -						       &partial_copy_len,
> -						       log_offset,
> -						       commit_iclog);
> -			if (error)
> -				return error;
> -
> -			/*
> -			 * if we had a partial copy, we need to get more iclog
> -			 * space but we don't want to increment the region
> -			 * index because there is still more is this region to
> -			 * write.
> -			 *
> -			 * If we completed writing this region, and we flushed
> -			 * the iclog (indicated by resetting of the record
> -			 * count), then we also need to get more log space. If
> -			 * this was the last record, though, we are done and
> -			 * can just return.
> -			 */
> -			if (partial_copy)
> -				break;
> -
> -			if (++index == lv->lv_niovecs) {
> -next_lv:
> -				lv = lv->lv_next;
> -				index = 0;
> -				if (lv)
> -					vecp = lv->lv_iovecp;
> -			}
> -			if (record_cnt == 0 && !ordered) {
> -				if (!lv)
> -					return 0;
> -				break;
> -			}
> +		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
> +		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> +					&len, &record_cnt, &data_cnt, &contwr);
> +		if (IS_ERR_OR_NULL(lv)) {
> +			error = PTR_ERR_OR_ZERO(lv);
> +			break;
>  		}
>  	}
> +	ASSERT((len == 0 && !lv) || error);
>  
> -	ASSERT(len == 0);
> -
> +	/*
> +	 * We've already been guaranteed that the last writes will fit inside
> +	 * the current iclog, and hence it will already have the space used by
> +	 * those writes accounted to it. Hence we do not need to update the
> +	 * iclog with the number of bytes written here.
> +	 */
> +	ASSERT(!contwr || XLOG_FORCED_SHUTDOWN(log));
>  	spin_lock(&log->l_icloglock);
> -	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
> +	xlog_state_finish_copy(log, iclog, record_cnt, 0);
>  	if (commit_iclog) {
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
> @@ -3633,11 +3608,12 @@ xlog_verify_iclog(
>  					iclog->ic_header.h_cycle_data[idx]);
>  			}
>  		}
> -		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG)
> +		if (clientid != XFS_TRANSACTION && clientid != XFS_LOG) {
>  			xfs_warn(log->l_mp,
> -				"%s: invalid clientid %d op "PTR_FMT" offset 0x%lx",
> -				__func__, clientid, ophead,
> +				"%s: op %d invalid clientid %d op "PTR_FMT" offset 0x%lx",
> +				__func__, i, clientid, ophead,
>  				(unsigned long)field_offset);
> +		}
>  
>  		/* check length */
>  		p = &ophead->oh_len;
> -- 
> 2.31.1
> 
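
The xlog_write_adv_cnt() helper used throughout the hunks above is not
part of this patch; it advances the write pointer and adjusts the
remaining length and iclog offset in one step. For reference, a minimal
sketch assuming the long-standing definition in fs/xfs/xfs_log_priv.h
(exact types may differ slightly after this series):

	static inline void
	xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
	{
		*ptr += bytes;	/* advance destination in the iclog */
		*len -= bytes;	/* data remaining to be written */
		*off += bytes;	/* offset into the iclog data area */
	}

This is why every copy site passes &ptr, len and log_offset together:
one call keeps all three consistent after each region copy.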

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore
  2021-05-19 12:13 ` [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore Dave Chinner
@ 2021-05-27 18:07   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:02PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So remove it from the interface and callers.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks pretty straightforward.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      | 14 ++++----------
>  fs/xfs/xfs_log_cil.c  |  2 +-
>  fs/xfs/xfs_log_priv.h |  2 +-
>  3 files changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 574078985f0a..65b28fce4db4 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -863,8 +863,7 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
> -				reg.i_len);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, reg.i_len);
>  }
>  
>  /*
> @@ -1588,8 +1587,7 @@ xlog_commit_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> -				reg.i_len);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2399,7 +2397,6 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			optype,
>  	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
> @@ -2431,7 +2428,6 @@ xlog_write(
>  		if (!lv)
>  			break;
>  
> -		ASSERT(!(optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)));
>  		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
>  					&len, &record_cnt, &data_cnt);
>  		if (IS_ERR_OR_NULL(lv)) {
> @@ -2449,12 +2445,10 @@ xlog_write(
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, 0);
> -	if (commit_iclog) {
> -		ASSERT(optype & XLOG_COMMIT_TRANS);
> +	if (commit_iclog)
>  		*commit_iclog = iclog;
> -	} else {
> +	else
>  		error = xlog_state_release_iclog(log, iclog);
> -	}
>  	spin_unlock(&log->l_icloglock);
>  
>  	return error;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 7a6b80666f98..dbe3a8267e2f 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -908,7 +908,7 @@ xlog_cil_push_work(
>  	 * write head.
>  	 */
>  	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS, num_bytes);
> +				num_bytes);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index eba905c273b0..a16ffdc8ae97 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
> +		struct xlog_in_core **commit_iclog, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 25/39] xfs: CIL context doesn't need to count iovecs
  2021-05-19 12:13 ` [PATCH 25/39] xfs: CIL context doesn't need to count iovecs Dave Chinner
@ 2021-05-27 18:08   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:03PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we account for log opheaders in the log item formatting
> code, we don't actually use the aggregated count of log iovecs in
> the CIL for anything. Remove it and the tracking code that
> calculates it.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good this time around,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c  | 22 ++++++----------------
>  fs/xfs/xfs_log_priv.h |  1 -
>  2 files changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index dbe3a8267e2f..eca5c82c0d60 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -252,22 +252,18 @@ xlog_cil_alloc_shadow_bufs(
>  
>  /*
>   * Prepare the log item for insertion into the CIL. Calculate the difference in
> - * log space and vectors it will consume, and if it is a new item pin it as
> - * well.
> + * log space it will consume, and if it is a new item pin it as well.
>   */
>  STATIC void
>  xfs_cil_prepare_item(
>  	struct xlog		*log,
>  	struct xfs_log_vec	*lv,
>  	struct xfs_log_vec	*old_lv,
> -	int			*diff_len,
> -	int			*diff_iovecs)
> +	int			*diff_len)
>  {
>  	/* Account for the new LV being passed in */
> -	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED) {
> +	if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
>  		*diff_len += lv->lv_bytes;
> -		*diff_iovecs += lv->lv_niovecs;
> -	}
>  
>  	/*
>  	 * If there is no old LV, this is the first time we've seen the item in
> @@ -284,7 +280,6 @@ xfs_cil_prepare_item(
>  		ASSERT(lv->lv_buf_len != XFS_LOG_VEC_ORDERED);
>  
>  		*diff_len -= old_lv->lv_bytes;
> -		*diff_iovecs -= old_lv->lv_niovecs;
>  		lv->lv_item->li_lv_shadow = old_lv;
>  	}
>  
> @@ -333,12 +328,10 @@ static void
>  xlog_cil_insert_format_items(
>  	struct xlog		*log,
>  	struct xfs_trans	*tp,
> -	int			*diff_len,
> -	int			*diff_iovecs)
> +	int			*diff_len)
>  {
>  	struct xfs_log_item	*lip;
>  
> -
>  	/* Bail out if we didn't find a log item.  */
>  	if (list_empty(&tp->t_items)) {
>  		ASSERT(0);
> @@ -381,7 +374,6 @@ xlog_cil_insert_format_items(
>  			 * set the item up as though it is a new insertion so
>  			 * that the space reservation accounting is correct.
>  			 */
> -			*diff_iovecs -= lv->lv_niovecs;
>  			*diff_len -= lv->lv_bytes;
>  
>  			/* Ensure the lv is set up according to ->iop_size */
> @@ -406,7 +398,7 @@ xlog_cil_insert_format_items(
>  		ASSERT(IS_ALIGNED((unsigned long)lv->lv_buf, sizeof(uint64_t)));
>  		lip->li_ops->iop_format(lip, lv);
>  insert:
> -		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
> +		xfs_cil_prepare_item(log, lv, old_lv, diff_len);
>  	}
>  }
>  
> @@ -426,7 +418,6 @@ xlog_cil_insert_items(
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
> -	int			diff_iovecs = 0;
>  	int			iclog_space;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  
> @@ -436,7 +427,7 @@ xlog_cil_insert_items(
>  	 * We can do this safely because the context can't checkpoint until we
>  	 * are done so it doesn't matter exactly how we update the CIL.
>  	 */
> -	xlog_cil_insert_format_items(log, tp, &len, &diff_iovecs);
> +	xlog_cil_insert_format_items(log, tp, &len);
>  
>  	spin_lock(&cil->xc_cil_lock);
>  
> @@ -471,7 +462,6 @@ xlog_cil_insert_items(
>  	}
>  	tp->t_ticket->t_curr_res -= len;
>  	ctx->space_used += len;
> -	ctx->nvecs += diff_iovecs;
>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a16ffdc8ae97..02c94b6d0642 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -217,7 +217,6 @@ struct xfs_cil_ctx {
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> -	int			nvecs;		/* number of regions */
>  	int			space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 28/39] xfs: rework per-iclog header CIL reservation
  2021-05-19 12:13 ` [PATCH 28/39] xfs: rework per-iclog header CIL reservation Dave Chinner
@ 2021-05-27 18:17   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:06PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For every iclog that a CIL push will use up, we need to ensure we
> have space reserved for the iclog header in each iclog. It is
> extremely difficult to do this accurately with a per-cpu counter
> without expensive summing of the counter in every commit. However,
> we know what the maximum CIL size is going to be because of the
> hard space limit we have, and hence we know exactly how many iclogs
> we are going to need to write out the CIL.
> 
> We are constrained by the requirement that small transactions only
> have reservation space for a single iclog header built into them.
> At commit time we don't know how much of the current transaction
> reservation is made up of iclog header reservations as calculated by
> xfs_log_calc_unit_res() when the ticket was reserved. As larger
> reservations have multiple header spaces reserved, we can steal
> more than one iclog header reservation at a time, but we only steal
> the exact number needed for the given log vector size delta.
> 
> As a result, we don't know exactly when we are going to steal iclog
> header reservations, nor do we know exactly how many we are going to
> need for a given CIL.
> 
> To make things simple, start by calculating the worst case number of
> iclog headers a full CIL push will require. Record this into an
> atomic variable in the CIL. Then add a byte counter to the log
> ticket that records exactly how much iclog header space has been
> reserved in this ticket by xfs_log_calc_unit_res(). This tells us
> exactly how much space we can steal from the ticket at transaction
> commit time.
> 
> Now, at transaction commit time, we can check if the CIL has a full
> iclog header reservation and, if not, steal the entire reservation
> the current ticket holds for iclog headers. This minimises the
> number of times we need to do atomic operations in the fast path,
> but still guarantees we get all the reservations we need.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
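
As a back-of-the-envelope check on the worst case described above: the
usable payload of an iclog is its size minus its own header, and a full
CIL is capped at the blocking space limit, so the most iclog headers a
single push can need is the division of the two. A sketch matching the
xlog_cil_set_iclog_hdr_count() hunk below (illustrative only):

	/* payload an iclog can hold once its own header is excluded */
	int payload = log->l_iclog_size - log->l_iclog_hsize;

	/* upper bound on iclog headers a single CIL push can need */
	atomic_set(&cil->xc_iclog_hdrs,
		   XLOG_CIL_BLOCKING_SPACE_LIMIT(log) / payload);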

No new questions since last time, so:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      |  9 ++++---
>  fs/xfs/xfs_log_cil.c  | 55 +++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_log_priv.h | 20 +++++++++-------
>  3 files changed, 59 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 65b28fce4db4..77d9ea7daf26 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3307,7 +3307,8 @@ xfs_log_ticket_get(
>  static int
>  xlog_calc_unit_res(
>  	struct xlog		*log,
> -	int			unit_bytes)
> +	int			unit_bytes,
> +	int			*niclogs)
>  {
>  	int			iclog_space;
>  	uint			num_headers;
> @@ -3387,6 +3388,8 @@ xlog_calc_unit_res(
>  	/* roundoff padding for transaction data and one for commit record */
>  	unit_bytes += 2 * log->l_iclog_roundoff;
>  
> +	if (niclogs)
> +		*niclogs = num_headers;
>  	return unit_bytes;
>  }
>  
> @@ -3395,7 +3398,7 @@ xfs_log_calc_unit_res(
>  	struct xfs_mount	*mp,
>  	int			unit_bytes)
>  {
> -	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes, NULL);
>  }
>  
>  /*
> @@ -3413,7 +3416,7 @@ xlog_ticket_alloc(
>  
>  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>  
> -	unit_res = xlog_calc_unit_res(log, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes, &tic->t_iclog_hdrs);
>  
>  	atomic_set(&tic->t_ref, 1);
>  	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4637f8711ada..87d4eb321fdc 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -44,9 +44,20 @@ xlog_cil_ticket_alloc(
>  	 * transaction overhead reservation from the first transaction commit.
>  	 */
>  	tic->t_curr_res = 0;
> +	tic->t_iclog_hdrs = 0;
>  	return tic;
>  }
>  
> +static inline void
> +xlog_cil_set_iclog_hdr_count(struct xfs_cil *cil)
> +{
> +	struct xlog	*log = cil->xc_log;
> +
> +	atomic_set(&cil->xc_iclog_hdrs,
> +		   (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) /
> +			(log->l_iclog_size - log->l_iclog_hsize)));
> +}
> +
>  /*
>   * Unavoidable forward declaration - xlog_cil_push_work() calls
>   * xlog_cil_ctx_alloc() itself.
> @@ -70,6 +81,7 @@ xlog_cil_ctx_switch(
>  	struct xfs_cil		*cil,
>  	struct xfs_cil_ctx	*ctx)
>  {
> +	xlog_cil_set_iclog_hdr_count(cil);
>  	set_bit(XLOG_CIL_EMPTY, &cil->xc_flags);
>  	ctx->sequence = ++cil->xc_current_sequence;
>  	ctx->cil = cil;
> @@ -92,6 +104,7 @@ xlog_cil_init_post_recovery(
>  {
>  	log->l_cilp->xc_ctx->ticket = xlog_cil_ticket_alloc(log);
>  	log->l_cilp->xc_ctx->sequence = 1;
> +	xlog_cil_set_iclog_hdr_count(log->l_cilp);
>  }
>  
>  static inline int
> @@ -419,7 +432,6 @@ xlog_cil_insert_items(
>  	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
> -	int			iclog_space;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  
>  	ASSERT(tp);
> @@ -442,19 +454,36 @@ xlog_cil_insert_items(
>  	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
>  		ctx_res = ctx->ticket->t_unit_res;
>  
> -	spin_lock(&cil->xc_cil_lock);
> -
> -	/* do we need space for more log record headers? */
> -	iclog_space = log->l_iclog_size - log->l_iclog_hsize;
> -	if (len > 0 && (ctx->space_used / iclog_space !=
> -				(ctx->space_used + len) / iclog_space)) {
> -		split_res = (len + iclog_space - 1) / iclog_space;
> -		/* need to take into account split region headers, too */
> -		split_res *= log->l_iclog_hsize + sizeof(struct xlog_op_header);
> -		ctx->ticket->t_unit_res += split_res;
> +	/*
> +	 * Check if we need to steal iclog headers. atomic_read() is not a
> +	 * locked atomic operation, so we can check the value before we do any
> +	 * real atomic ops in the fast path. If we've already taken the CIL unit
> +	 * reservation from this commit, we've already got one iclog header
> +	 * space reserved so we have to account for that otherwise we risk
> +	 * overrunning the reservation on this ticket.
> +	 *
> +	 * If the CIL is already at the hard limit, we might need more header
> +	 * space than originally reserved. So steal more header space from every
> +	 * commit that occurs once we are over the hard limit to ensure the CIL
> +	 * push won't run out of reservation space.
> +	 *
> +	 * This can steal more than we need, but that's OK.
> +	 */
> +	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> +	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +		int	split_res = log->l_iclog_hsize +
> +					sizeof(struct xlog_op_header);
> +		if (ctx_res)
> +			ctx_res += split_res * (tp->t_ticket->t_iclog_hdrs - 1);
> +		else
> +			ctx_res = split_res * tp->t_ticket->t_iclog_hdrs;
> +		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
>  	}
> -	tp->t_ticket->t_curr_res -= split_res + ctx_res + len;
> -	ctx->ticket->t_curr_res += split_res + ctx_res;
> +
> +	spin_lock(&cil->xc_cil_lock);
> +	tp->t_ticket->t_curr_res -= ctx_res + len;
> +	ctx->ticket->t_unit_res += ctx_res;
> +	ctx->ticket->t_curr_res += ctx_res;
>  	ctx->space_used += len;
>  
>  	/*
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 11606c378b7f..85a85ab569fe 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -137,15 +137,16 @@ enum xlog_iclog_state {
>  #define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
>  
>  typedef struct xlog_ticket {
> -	struct list_head   t_queue;	 /* reserve/write queue */
> -	struct task_struct *t_task;	 /* task that owns this ticket */
> -	xlog_tid_t	   t_tid;	 /* transaction identifier	 : 4  */
> -	atomic_t	   t_ref;	 /* ticket reference count       : 4  */
> -	int		   t_curr_res;	 /* current reservation in bytes : 4  */
> -	int		   t_unit_res;	 /* unit reservation in bytes    : 4  */
> -	char		   t_ocnt;	 /* original count		 : 1  */
> -	char		   t_cnt;	 /* current count		 : 1  */
> -	char		   t_flags;	 /* properties of reservation	 : 1  */
> +	struct list_head	t_queue;	/* reserve/write queue */
> +	struct task_struct	*t_task;	/* task that owns this ticket */
> +	xlog_tid_t		t_tid;		/* transaction identifier */
> +	atomic_t		t_ref;		/* ticket reference count */
> +	int			t_curr_res;	/* current reservation */
> +	int			t_unit_res;	/* unit reservation */
> +	char			t_ocnt;		/* original count */
> +	char			t_cnt;		/* current count */
> +	char			t_flags;	/* properties of reservation */
> +	int			t_iclog_hdrs;	/* iclog hdrs in t_curr_res */
>  } xlog_ticket_t;
>  
>  /*
> @@ -245,6 +246,7 @@ struct xfs_cil_ctx {
>  struct xfs_cil {
>  	struct xlog		*xc_log;
>  	unsigned long		xc_flags;
> +	atomic_t		xc_iclog_hdrs;
>  	struct list_head	xc_cil;
>  	spinlock_t		xc_cil_lock;
>  
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure
  2021-05-19 12:13 ` [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure Dave Chinner
@ 2021-05-27 18:31   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The CIL push lock is highly contended on larger machines, becoming a
> hard bottleneck at about 700,000 transaction commits/s on >16p
> machines. To address this, start moving the CIL tracking
> infrastructure to utilise per-CPU structures.
> 
> We need to track the space used, the amount of log reservation space
> reserved to write the CIL, the log items in the CIL and the busy
> extents that need to be completed by the CIL commit.  This requires
> a couple of per-cpu counters, an unordered per-cpu list and a
> globally ordered per-cpu list.
> 
> Create a per-cpu structure to hold these and all the management
> interfaces needed, as well as the hooks to handle hotplug CPUs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
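
This patch only adds the skeleton, but it helps to keep in mind the
commit-side pattern the following patches layer on top of it. Roughly
(a sketch, not code from this patch):

	struct xlog_cil_pcp	*cilpcp;

	/* pin this CPU so the per-cpu pointer stays stable */
	cilpcp = get_cpu_ptr(cil->xc_pcp);
	cilpcp->space_used += len;	/* no shared cachelines touched */
	put_cpu_ptr(cilpcp);

Cross-CPU access only happens when the push work (or the CPU dead
notifier added here) walks every CPU with per_cpu_ptr() and folds the
local state back into the global CIL context.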
> ---
>  fs/xfs/xfs_log_cil.c       | 106 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_priv.h      |  15 ++++++
>  include/linux/cpuhotplug.h |   1 +
>  3 files changed, 122 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 87d4eb321fdc..ba1c6979a4c7 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1370,6 +1370,105 @@ xfs_log_item_in_current_chkpt(
>  	return lip->li_seq == cil->xc_ctx->sequence;
>  }
>  
> +#ifdef CONFIG_HOTPLUG_CPU
> +static LIST_HEAD(xlog_cil_pcp_list);
> +static DEFINE_SPINLOCK(xlog_cil_pcp_lock);
> +static bool xlog_cil_pcp_init;
> +
> +/*
> + * Move dead percpu state to the relevant CIL context structures.
> + *
> + * We have to lock the CIL context here to ensure that nothing is modifying
> + * the percpu state, either addition or removal. Both of these are done under
> + * the CIL context lock, so grabbing that exclusively here will ensure we can
> + * safely drain the cilpcp for the CPU that is dying.
> + */
> +static int
> +xlog_cil_pcp_dead(
> +	unsigned int		cpu)
> +{
> +	struct xfs_cil		*cil, *n;
> +
> +	spin_lock(&xlog_cil_pcp_lock);
> +	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
> +		spin_unlock(&xlog_cil_pcp_lock);
> +		down_write(&cil->xc_ctx_lock);
> +		/* move stuff on dead CPU to context */
> +		up_write(&cil->xc_ctx_lock);
> +		spin_lock(&xlog_cil_pcp_lock);
> +	}
> +	spin_unlock(&xlog_cil_pcp_lock);
> +	return 0;
> +}
> +
> +static int
> +xlog_cil_pcp_hpadd(
> +	struct xfs_cil		*cil)
> +{
> +	if (!xlog_cil_pcp_init) {
> +		int	ret;
> +		ret = cpuhp_setup_state_nocalls(CPUHP_XFS_CIL_DEAD,

Nit: blank line between variable declarations and code.

> +						"xfs/cil_pcp:dead", NULL,
> +						xlog_cil_pcp_dead);
> +		if (ret < 0) {
> +			xfs_warn(cil->xc_log->l_mp,
> +	"Failed to initialise CIL hotplug, error %d. XFS is non-functional.",
> +				ret);
> +			ASSERT(0);
> +			return -ENOMEM;
> +		}
> +		xlog_cil_pcp_init = true;
> +	}
> +
> +	INIT_LIST_HEAD(&cil->xc_pcp_list);
> +	spin_lock(&xlog_cil_pcp_lock);
> +	list_add(&cil->xc_pcp_list, &xlog_cil_pcp_list);
> +	spin_unlock(&xlog_cil_pcp_lock);
> +	return 0;
> +}
> +
> +static void
> +xlog_cil_pcp_hpremove(
> +	struct xfs_cil		*cil)
> +{
> +	spin_lock(&xlog_cil_pcp_lock);
> +	list_del(&cil->xc_pcp_list);
> +	spin_unlock(&xlog_cil_pcp_lock);
> +}
> +
> +#else /* !CONFIG_HOTPLUG_CPU */
> +static inline void xlog_cil_pcp_hpadd(struct xfs_cil *cil) {}
> +static inline void xlog_cil_pcp_hpremove(struct xfs_cil *cil) {}
> +#endif
> +
> +static void __percpu *
> +xlog_cil_pcp_alloc(
> +	struct xfs_cil		*cil)
> +{
> +	void __percpu		*pcp;
> +
> +	pcp = alloc_percpu(struct xlog_cil_pcp);
> +	if (!pcp)
> +		return NULL;
> +
> +	if (xlog_cil_pcp_hpadd(cil) < 0) {
> +		free_percpu(pcp);
> +		return NULL;
> +	}
> +	return pcp;
> +}
> +
> +static void
> +xlog_cil_pcp_free(
> +	struct xfs_cil		*cil,
> +	void __percpu		*pcp)
> +{
> +	if (!pcp)
> +		return;
> +	xlog_cil_pcp_hpremove(cil);
> +	free_percpu(pcp);
> +}
> +
>  /*
>   * Perform initial CIL structure initialisation.
>   */
> @@ -1384,6 +1483,12 @@ xlog_cil_init(
>  	if (!cil)
>  		return -ENOMEM;
>  
> +	cil->xc_pcp = xlog_cil_pcp_alloc(cil);
> +	if (!cil->xc_pcp) {
> +		kmem_free(cil);
> +		return -ENOMEM;
> +	}
> +
>  	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
>  	spin_lock_init(&cil->xc_cil_lock);
> @@ -1414,6 +1519,7 @@ xlog_cil_destroy(
>  
>  	ASSERT(list_empty(&cil->xc_cil));
>  	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
> +	xlog_cil_pcp_free(cil, cil->xc_pcp);
>  	kmem_free(cil);
>  }
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 85a85ab569fe..aaa1e7f7fb66 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -227,6 +227,16 @@ struct xfs_cil_ctx {
>  	struct work_struct	push_work;
>  };
>  
> +/*
> + * Per-cpu CIL tracking items
> + */
> +struct xlog_cil_pcp {
> +	uint32_t		space_used;
> +	uint32_t		curr_res;

I don't think these fields need to be in this patch.

Especially since you rename one of them in patch 31.

I think this looks like the skeleton of adding per-cpu structures to the
log, though I'll read the next few patches to see if the comments make
more sense once you actually starting using the pcpu data.

--D

> +	struct list_head	busy_extents;
> +	struct list_head	log_items;
> +};
> +
>  /*
>   * Committed Item List structure
>   *
> @@ -260,6 +270,11 @@ struct xfs_cil {
>  	wait_queue_head_t	xc_commit_wait;
>  	xfs_csn_t		xc_current_sequence;
>  	wait_queue_head_t	xc_push_wait;	/* background push throttle */
> +
> +	void __percpu		*xc_pcp;	/* percpu CIL structures */
> +#ifdef CONFIG_HOTPLUG_CPU
> +	struct list_head	xc_pcp_list;
> +#endif
>  } ____cacheline_aligned_in_smp;
>  
>  /* xc_flags bit values */
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 4a62b3980642..3d3ccde9e9c8 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -52,6 +52,7 @@ enum cpuhp_state {
>  	CPUHP_FS_BUFF_DEAD,
>  	CPUHP_PRINTK_DEAD,
>  	CPUHP_MM_MEMCQ_DEAD,
> +	CPUHP_XFS_CIL_DEAD,
>  	CPUHP_PERCPU_CNT_DEAD,
>  	CPUHP_RADIX_DEAD,
>  	CPUHP_PAGE_ALLOC_DEAD,
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-05-19 12:13 ` [PATCH 30/39] xfs: implement percpu cil space used calculation Dave Chinner
@ 2021-05-27 18:41   ` Darrick J. Wong
  2021-06-02 23:47     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we have the CIL percpu structures in place, implement the
> space used counter with a fast sum check similar to the
> percpu_counter infrastructure.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
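
A worked example of the fold threshold below, with invented numbers: for
a 64MB XLOG_CIL_BLOCKING_SPACE_LIMIT(), 32MB already accounted in
ctx->space_used and 16 online CPUs, a CPU folds its local count into the
global counter once it exceeds (64MB - 32MB) / 16 = 2MB. As the global
count climbs towards the blocking limit the per-cpu batch shrinks
towards zero, so the global counter becomes progressively more accurate
exactly when the hard limit needs to be enforced precisely.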
> ---
>  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  2 files changed, 55 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index ba1c6979a4c7..72693fba929b 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
>  	return ctx;
>  }
>  
> +/*
> + * Aggregate the CIL per cpu structures into global counts, lists, etc and
> + * clear the percpu state ready for the next context to use.
> + */
> +static void
> +xlog_cil_pcp_aggregate(
> +	struct xfs_cil		*cil,
> +	struct xfs_cil_ctx	*ctx)
> +{
> +	struct xlog_cil_pcp	*cilpcp;
> +	int			cpu;
> +
> +	for_each_online_cpu(cpu) {
> +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +		cilpcp->space_used = 0;

How does this aggregate anything?  All I see here is zeroing a counter?
I see that we /can/ add the percpu space_used counter to the cil context
if we're over the space limits, but I don't actually see where...

> +	}
> +}
> +
>  static void
>  xlog_cil_ctx_switch(
>  	struct xfs_cil		*cil,
> @@ -433,6 +451,8 @@ xlog_cil_insert_items(
>  	struct xfs_log_item	*lip;
>  	int			len = 0;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
> +	int			space_used;
> +	struct xlog_cil_pcp	*cilpcp;
>  
>  	ASSERT(tp);
>  
> @@ -469,8 +489,9 @@ xlog_cil_insert_items(
>  	 *
>  	 * This can steal more than we need, but that's OK.
>  	 */
> +	space_used = atomic_read(&ctx->space_used);
>  	if (atomic_read(&cil->xc_iclog_hdrs) > 0 ||
> -	    ctx->space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	    space_used + len >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
>  		int	split_res = log->l_iclog_hsize +
>  					sizeof(struct xlog_op_header);
>  		if (ctx_res)
> @@ -480,16 +501,34 @@ xlog_cil_insert_items(
>  		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
>  	}
>  
> +	/*
> +	 * Update the CIL percpu pointer. This updates the global counter when
> +	 * over the percpu batch size or when the CIL is over the space limit.
> +	 * This means low lock overhead for normal updates, and when over the
> +	 * limit the space used is immediately accounted. This makes enforcing
> +	 * the hard limit much more accurate. The per cpu fold threshold is
> +	 * based on how close we are to the hard limit.
> +	 */
> +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> +	cilpcp->space_used += len;
> +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> +	    cilpcp->space_used >
> +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> +					num_online_cpus())) {
> +		atomic_add(cilpcp->space_used, &ctx->space_used);
> +		cilpcp->space_used = 0;
> +	}
> +	put_cpu_ptr(cilpcp);
> +
>  	spin_lock(&cil->xc_cil_lock);
> -	tp->t_ticket->t_curr_res -= ctx_res + len;
>  	ctx->ticket->t_unit_res += ctx_res;
>  	ctx->ticket->t_curr_res += ctx_res;
> -	ctx->space_used += len;

...this update happens if we're not over the space limit?

--D

>  
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
>  	 * the log items. Shutdown is imminent...
>  	 */
> +	tp->t_ticket->t_curr_res -= ctx_res + len;
>  	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
>  		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
>  		xfs_warn(log->l_mp,
> @@ -846,6 +885,8 @@ xlog_cil_push_work(
>  	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
>  				&bdev_flush);
>  
> +	xlog_cil_pcp_aggregate(cil, ctx);
> +
>  	/*
>  	 * Pull all the log vectors off the items in the CIL, and remove the
>  	 * items from the CIL. We don't need the CIL lock here because it's only
> @@ -1043,6 +1084,7 @@ xlog_cil_push_background(
>  	struct xlog	*log) __releases(cil->xc_ctx_lock)
>  {
>  	struct xfs_cil	*cil = log->l_cilp;
> +	int		space_used = atomic_read(&cil->xc_ctx->space_used);
>  
>  	/*
>  	 * The cil won't be empty because we are called while holding the
> @@ -1055,7 +1097,7 @@ xlog_cil_push_background(
>  	 * Don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
> -	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> @@ -1084,10 +1126,10 @@ xlog_cil_push_background(
>  	 * The ctx->xc_push_lock provides the serialisation necessary for safely
>  	 * using the lockless waitqueue_active() check in this context.
>  	 */
> -	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
> +	if (space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
>  	    waitqueue_active(&cil->xc_push_wait)) {
>  		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
> -		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
> +		ASSERT(space_used < log->l_logsize);
>  		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>  		return;
>  	}
> @@ -1391,9 +1433,14 @@ xlog_cil_pcp_dead(
>  
>  	spin_lock(&xlog_cil_pcp_lock);
>  	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
> +		struct xlog_cil_pcp	*cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +
>  		spin_unlock(&xlog_cil_pcp_lock);
>  		down_write(&cil->xc_ctx_lock);
> -		/* move stuff on dead CPU to context */
> +
> +		atomic_add(cilpcp->space_used, &cil->xc_ctx->space_used);
> +		cilpcp->space_used = 0;
> +
>  		up_write(&cil->xc_ctx_lock);
>  		spin_lock(&xlog_cil_pcp_lock);
>  	}
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index aaa1e7f7fb66..7dc6275818de 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -218,7 +218,7 @@ struct xfs_cil_ctx {
>  	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
>  	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
> -	int			space_used;	/* aggregate size of regions */
> +	atomic_t		space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure
  2021-05-19 12:13 ` [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure Dave Chinner
@ 2021-05-27 18:48   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To get it out from under the cil spinlock.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ... straightforward enough for a percpu thing ;)
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c  | 20 +++++++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  2 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 72693fba929b..4ddc302a766b 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -90,6 +90,10 @@ xlog_cil_pcp_aggregate(
>  
>  	for_each_online_cpu(cpu) {
>  		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +
> +		ctx->ticket->t_curr_res += cilpcp->space_reserved;
> +		ctx->ticket->t_unit_res += cilpcp->space_reserved;
> +		cilpcp->space_reserved = 0;
>  		cilpcp->space_used = 0;
>  	}
>  }
> @@ -510,6 +514,7 @@ xlog_cil_insert_items(
>  	 * based on how close we are to the hard limit.
>  	 */
>  	cilpcp = get_cpu_ptr(cil->xc_pcp);
> +	cilpcp->space_reserved += ctx_res;
>  	cilpcp->space_used += len;
>  	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>  	    cilpcp->space_used >
> @@ -520,10 +525,6 @@ xlog_cil_insert_items(
>  	}
>  	put_cpu_ptr(cilpcp);
>  
> -	spin_lock(&cil->xc_cil_lock);
> -	ctx->ticket->t_unit_res += ctx_res;
> -	ctx->ticket->t_curr_res += ctx_res;
> -
>  	/*
>  	 * If we've overrun the reservation, dump the tx details before we move
>  	 * the log items. Shutdown is imminent...
> @@ -545,6 +546,7 @@ xlog_cil_insert_items(
>  	 * We do this here so we only need to take the CIL lock once during
>  	 * the transaction commit.
>  	 */
> +	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
>  		/* Skip items which aren't dirty in this transaction. */
> @@ -1434,12 +1436,20 @@ xlog_cil_pcp_dead(
>  	spin_lock(&xlog_cil_pcp_lock);
>  	list_for_each_entry_safe(cil, n, &xlog_cil_pcp_list, xc_pcp_list) {
>  		struct xlog_cil_pcp	*cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> +		struct xfs_cil_ctx	*ctx;
>  
>  		spin_unlock(&xlog_cil_pcp_lock);
>  		down_write(&cil->xc_ctx_lock);
> +		ctx = cil->xc_ctx;
> +
> +		atomic_add(cilpcp->space_used, &ctx->space_used);
> +		if (ctx->ticket) {
> +			ctx->ticket->t_curr_res += cilpcp->space_reserved;
> +			ctx->ticket->t_unit_res += cilpcp->space_reserved;
> +		}
>  
> -		atomic_add(cilpcp->space_used, &cil->xc_ctx->space_used);
>  		cilpcp->space_used = 0;
> +		cilpcp->space_reserved = 0;
>  
>  		up_write(&cil->xc_ctx_lock);
>  		spin_lock(&xlog_cil_pcp_lock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 7dc6275818de..b80cb3a0edb7 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -232,7 +232,7 @@ struct xfs_cil_ctx {
>   */
>  struct xlog_cil_pcp {
>  	uint32_t		space_used;
> -	uint32_t		curr_res;
> +	uint32_t		space_reserved;
>  	struct list_head	busy_extents;
>  	struct list_head	log_items;
>  };
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 32/39] xfs: convert CIL busy extents to per-cpu
  2021-05-19 12:13 ` [PATCH 32/39] xfs: convert CIL busy extents to per-cpu Dave Chinner
@ 2021-05-27 18:49   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 18:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To get them out from under the CIL lock.
> 
> This is an unordered list, so we can simply punt it to per-cpu lists
> during transaction commits and reaggregate it back into a single
> list during the CIL push work.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
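
Because the list is unordered, both directions reduce to a single list
splice with no ordering fixup in between. The two halves, as in the
hunks below (sketch only):

	/* commit side, on the local CPU, no global lock held */
	if (!list_empty(&tp->t_busy))
		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);

	/* push side, single threaded per CIL context */
	for_each_online_cpu(cpu) {
		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
		list_splice_init(&cilpcp->busy_extents,
				&ctx->busy_extents);
	}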

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4ddc302a766b..b12a2f9ba23a 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -93,6 +93,11 @@ xlog_cil_pcp_aggregate(
>  
>  		ctx->ticket->t_curr_res += cilpcp->space_reserved;
>  		ctx->ticket->t_unit_res += cilpcp->space_reserved;
> +		if (!list_empty(&cilpcp->busy_extents)) {
> +			list_splice_init(&cilpcp->busy_extents,
> +					&ctx->busy_extents);
> +		}
> +
>  		cilpcp->space_reserved = 0;
>  		cilpcp->space_used = 0;
>  	}
> @@ -523,6 +528,9 @@ xlog_cil_insert_items(
>  		atomic_add(cilpcp->space_used, &ctx->space_used);
>  		cilpcp->space_used = 0;
>  	}
> +	/* attach the transaction to the CIL if it has any busy extents */
> +	if (!list_empty(&tp->t_busy))
> +		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
>  	put_cpu_ptr(cilpcp);
>  
>  	/*
> @@ -562,9 +570,6 @@ xlog_cil_insert_items(
>  			list_move_tail(&lip->li_cil, &cil->xc_cil);
>  	}
>  
> -	/* attach the transaction to the CIL if it has any busy extents */
> -	if (!list_empty(&tp->t_busy))
> -		list_splice_init(&tp->t_busy, &ctx->busy_extents);
>  	spin_unlock(&cil->xc_cil_lock);
>  
>  	if (tp->t_ticket->t_curr_res < 0)
> @@ -1447,6 +1452,10 @@ xlog_cil_pcp_dead(
>  			ctx->ticket->t_curr_res += cilpcp->space_reserved;
>  			ctx->ticket->t_unit_res += cilpcp->space_reserved;
>  		}
> +		if (!list_empty(&cilpcp->busy_extents)) {
> +			list_splice_init(&cilpcp->busy_extents,
> +					&ctx->busy_extents);
> +		}
>  
>  		cilpcp->space_used = 0;
>  		cilpcp->space_reserved = 0;
> @@ -1502,7 +1511,9 @@ static void __percpu *
>  xlog_cil_pcp_alloc(
>  	struct xfs_cil		*cil)
>  {
> +	struct xlog_cil_pcp	*cilpcp;
>  	void __percpu		*pcp;
> +	int			cpu;
>  
>  	pcp = alloc_percpu(struct xlog_cil_pcp);
>  	if (!pcp)
> @@ -1512,6 +1523,11 @@ xlog_cil_pcp_alloc(
>  		free_percpu(pcp);
>  		return NULL;
>  	}
> +
> +	for_each_possible_cpu(cpu) {
> +		cilpcp = per_cpu_ptr(pcp, cpu);
> +		INIT_LIST_HEAD(&cilpcp->busy_extents);
> +	}
>  	return pcp;
>  }
>  
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-05-19 12:13 ` [PATCH 33/39] xfs: Add order IDs to log items in CIL Dave Chinner
@ 2021-05-27 19:00   ` Darrick J. Wong
  2021-06-03  0:16     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Before we split the ordered CIL up into per cpu lists, we need a
> mechanism to track the order of the items in the CIL. We need to do
> this because there are rules around the order in which related items
> must physically appear in the log even inside a single checkpoint
> transaction.
> 
> An example of this is intents - an intent must appear in the log
> before its intent done record so taht log recovery can cancel the

s/taht/that/

> intent correctly. If we have these two records misordered in the
> CIL, then they will not be recovered correctly by journal replay.
> 
> We also will not be able to move items to the tail of
> the CIL list when they are relogged, hence the log items will need
> some mechanism to allow the correct log item order to be recreated
> before we write log items to the journal.
> 
> Hence we need to have a mechanism for recording global order of
> transactions in the log items so that we can recover that order
> from unordered per-cpu lists.
> 
> Do this with a simple monotonically increasing commit counter in the CIL
> context. Each log item in the transaction gets stamped with the
> current commit order ID before it is added to the CIL. If the item
> is already in the CIL, leave it where it is rather than moving it to
> the tail of the list, and instead sort the list before we start the
> push work.
> 
> XXX: list_sort() under the cil_ctx_lock held exclusive starts
> hurting at >16 threads. Front end commits are waiting on the push
> to switch contexts much longer. The item order id should likely be
> moved into the logvecs when they are detached from the items, then
> the sort can be done on the logvec after the cil_ctx_lock has been
> released. logvecs will need to use a list_head for this rather than
> a single linked list like they do now....

...which I guess happens in patch 35 now?

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
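
A property worth noting: list_sort() is stable, and the compare function
added below returns non-zero only when the first order ID is strictly
greater, so items stamped with the same ID keep the order in which they
were added to the CIL. A hypothetical illustration (IDs invented):

	before sort:  [BUI:7] [INODE:6] [CUI:7] [BUF:6]
	after sort:   [INODE:6] [BUF:6] [BUI:7] [CUI:7]

The two id-7 intents retain their relative insertion order, which is
what preserves the required ordering of dependent intents logged in a
single transaction.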
> ---
>  fs/xfs/xfs_log_cil.c  | 38 ++++++++++++++++++++++++++++++--------
>  fs/xfs/xfs_log_priv.h |  1 +
>  fs/xfs/xfs_trans.h    |  1 +
>  3 files changed, 32 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b12a2f9ba23a..ca6e411e388e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -461,6 +461,7 @@ xlog_cil_insert_items(
>  	int			len = 0;
>  	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
>  	int			space_used;
> +	int			order;
>  	struct xlog_cil_pcp	*cilpcp;
>  
>  	ASSERT(tp);
> @@ -550,10 +551,12 @@ xlog_cil_insert_items(
>  	}
>  
>  	/*
> -	 * Now (re-)position everything modified at the tail of the CIL.
> +	 * Now update the order of everything modified in the transaction
> +	 * and insert items into the CIL if they aren't already there.
>  	 * We do this here so we only need to take the CIL lock once during
>  	 * the transaction commit.
>  	 */
> +	order = atomic_inc_return(&ctx->order_id);
>  	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
> @@ -561,13 +564,10 @@ xlog_cil_insert_items(
>  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
>  			continue;
>  
> -		/*
> -		 * Only move the item if it isn't already at the tail. This is
> -		 * to prevent a transient list_empty() state when reinserting
> -		 * an item that is already the only item in the CIL.
> -		 */
> -		if (!list_is_last(&lip->li_cil, &cil->xc_cil))
> -			list_move_tail(&lip->li_cil, &cil->xc_cil);
> +		lip->li_order_id = order;
> +		if (!list_empty(&lip->li_cil))
> +			continue;
> +		list_add_tail(&lip->li_cil, &cil->xc_cil);
>  	}
>  
>  	spin_unlock(&cil->xc_cil_lock);
> @@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
>  	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
>  
> +/*
> + * CIL item reordering compare function. We want to order in ascending ID order,
> + * but we want to leave items with the same ID in the order they were added to

When do we have items with the same id?

I guess that happens if we have multiple transactions adding items to
the cil at the same time?  I guess that's not a big deal since each of
those threads will hold a disjoint set of locks, so even if the order
ids are the same for a bunch of items, they're never going to be
touching the same AG/inode/metadata object, right?

If that's correct, then:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> + * the list. This is important for operations like reflink where we log 4 order
> + * dependent intents in a single transaction when we overwrite an existing
> + * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
> + * CUI (inc), BUI(remap)...
> + */
> +static int
> +xlog_cil_order_cmp(
> +	void			*priv,
> +	const struct list_head	*a,
> +	const struct list_head	*b)
> +{
> +	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
> +	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
> +
> +	return l1->li_order_id > l2->li_order_id;
> +}
> +
>  /*
>   * Push the Committed Item List to the log.
>   *
> @@ -900,6 +920,7 @@ xlog_cil_push_work(
>  	 * needed on the transaction commit side which is currently locked out
>  	 * by the flush lock.
>  	 */
> +	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
>  	lv = NULL;
>  	while (!list_empty(&cil->xc_cil)) {
>  		struct xfs_log_item	*item;
> @@ -907,6 +928,7 @@ xlog_cil_push_work(
>  		item = list_first_entry(&cil->xc_cil,
>  					struct xfs_log_item, li_cil);
>  		list_del_init(&item->li_cil);
> +		item->li_order_id = 0;
>  		if (!ctx->lv_chain)
>  			ctx->lv_chain = item->li_lv;
>  		else
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b80cb3a0edb7..466862a943ba 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -225,6 +225,7 @@ struct xfs_cil_ctx {
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
>  	struct work_struct	push_work;
> +	atomic_t		order_id;
>  };
>  
>  /*
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index 50da47f23a07..2d1cc1ff93c7 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -44,6 +44,7 @@ struct xfs_log_item {
>  	struct xfs_log_vec		*li_lv;		/* active log vector */
>  	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
>  	xfs_csn_t			li_seq;		/* CIL commit seq */
> +	uint32_t			li_order_id;	/* CIL commit order */
>  };
>  
>  /*
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 34/39] xfs: convert CIL to unordered per cpu lists
  2021-05-19 12:13 ` [PATCH 34/39] xfs: convert CIL to unordered per cpu lists Dave Chinner
@ 2021-05-27 19:03   ` Darrick J. Wong
  2021-06-03  0:27     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 19:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:12PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> So that we can remove the cil_lock which is a global serialisation
> point. We've already got ordering sorted, so all we need to do is
> treat the CIL list like the busy extent list and reconstruct it
> before the push starts.
> 
> This is what we're trying to avoid:
> 
>  -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
>     - 46.35% xfs_log_commit_cil
>        - 41.54% _raw_spin_lock
>           - 67.30% do_raw_spin_lock
>                66.96% __pv_queued_spin_lock_slowpath
> 
> Which happens on a 32p system when running a 32-way 'rm -rf'
> workload. After this patch:
> 
> -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
>    - 17.67% xfs_log_commit_cil
>       - 6.51% xfs_log_ticket_ungrant
>            1.40% xfs_log_space_wake
>         2.32% memcpy_erms
>       - 2.18% xfs_buf_item_committing
>          - 2.12% xfs_buf_item_release
>             - 1.03% xfs_buf_unlock
>                  0.96% up
>               0.72% xfs_buf_rele
>         1.33% xfs_inode_item_format
>         1.19% down_read
>         0.91% up_read
>         0.76% xfs_buf_item_format
>       - 0.68% kmem_alloc_large
>          - 0.67% kmem_alloc
>               0.64% __kmalloc
>         0.50% xfs_buf_item_size
> 
> It kinda looks like the workload is running out of log space all
> the time. But all the spinlock contention is gone and the
> transaction commit rate has gone from 800k/s to 1.3M/s so the amount
> of real work being done has gone up a *lot*.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c  | 69 +++++++++++++++++++------------------------
>  fs/xfs/xfs_log_priv.h |  3 +-
>  2 files changed, 31 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index ca6e411e388e..287dc7d0d508 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
>  	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
> +	INIT_LIST_HEAD(&ctx->log_items);

I see you moved the log item list to the cil ctx for benefit of
_pcp_dead, correct?

If so, then this isn't especially different from the last version.
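
The pattern is the same lockless batching already used for the busy
extent list: transaction commit appends to a per-cpu list with only
preemption disabled, and the single-threaded push (or the CPU dead
notifier) splices every per-cpu list back into the context. A condensed
sketch of that shape, with hypothetical structure names:

	struct pcp_bucket {
		struct list_head	items;
	};

	/* producer side: no shared lock, only preemption disabled */
	bucket = get_cpu_ptr(pcp);
	list_add_tail(&item->list, &bucket->items);
	put_cpu_ptr(pcp);

	/*
	 * consumer side: serialised against producers by holding the
	 * xc_ctx_lock exclusively, so unlocked splices are safe here.
	 */
	for_each_possible_cpu(cpu) {
		bucket = per_cpu_ptr(pcp, cpu);
		if (!list_empty(&bucket->items))
			list_splice_init(&bucket->items, &ctx->items);
	}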

Yay for shortening lock critical sections,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
>  	return ctx;
>  }
> @@ -97,6 +98,8 @@ xlog_cil_pcp_aggregate(
>  			list_splice_init(&cilpcp->busy_extents,
>  					&ctx->busy_extents);
>  		}
> +		if (!list_empty(&cilpcp->log_items))
> +			list_splice_init(&cilpcp->log_items, &ctx->log_items);
>  
>  		cilpcp->space_reserved = 0;
>  		cilpcp->space_used = 0;
> @@ -475,10 +478,9 @@ xlog_cil_insert_items(
>  	/*
>  	 * We need to take the CIL checkpoint unit reservation on the first
>  	 * commit into the CIL. Test the XLOG_CIL_EMPTY bit first so we don't
> -	 * unnecessarily do an atomic op in the fast path here. We don't need to
> -	 * hold the xc_cil_lock here to clear the XLOG_CIL_EMPTY bit as we are
> -	 * under the xc_ctx_lock here and that needs to be held exclusively to
> -	 * reset the XLOG_CIL_EMPTY bit.
> +	 * unnecessarily do an atomic op in the fast path here. We can clear the
> +	 * XLOG_CIL_EMPTY bit as we are under the xc_ctx_lock here and that
> +	 * needs to be held exclusively to reset the XLOG_CIL_EMPTY bit.
>  	 */
>  	if (test_bit(XLOG_CIL_EMPTY, &cil->xc_flags) &&
>  	    test_and_clear_bit(XLOG_CIL_EMPTY, &cil->xc_flags))
> @@ -532,24 +534,6 @@ xlog_cil_insert_items(
>  	/* attach the transaction to the CIL if it has any busy extents */
>  	if (!list_empty(&tp->t_busy))
>  		list_splice_init(&tp->t_busy, &cilpcp->busy_extents);
> -	put_cpu_ptr(cilpcp);
> -
> -	/*
> -	 * If we've overrun the reservation, dump the tx details before we move
> -	 * the log items. Shutdown is imminent...
> -	 */
> -	tp->t_ticket->t_curr_res -= ctx_res + len;
> -	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> -		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> -		xfs_warn(log->l_mp,
> -			 "  log items: %d bytes (iov hdrs: %d bytes)",
> -			 len, iovhdr_res);
> -		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> -			 split_res);
> -		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> -		xlog_print_trans(tp);
> -	}
> -
>  	/*
>  	 * Now update the order of everything modified in the transaction
>  	 * and insert items into the CIL if they aren't already there.
> @@ -557,7 +541,6 @@ xlog_cil_insert_items(
>  	 * the transaction commit.
>  	 */
>  	order = atomic_inc_return(&ctx->order_id);
> -	spin_lock(&cil->xc_cil_lock);
>  	list_for_each_entry(lip, &tp->t_items, li_trans) {
>  
>  		/* Skip items which aren't dirty in this transaction. */
> @@ -567,10 +550,25 @@ xlog_cil_insert_items(
>  		lip->li_order_id = order;
>  		if (!list_empty(&lip->li_cil))
>  			continue;
> -		list_add_tail(&lip->li_cil, &cil->xc_cil);
> +		list_add_tail(&lip->li_cil, &cilpcp->log_items);
>  	}
> +	put_cpu_ptr(cilpcp);
>  
> -	spin_unlock(&cil->xc_cil_lock);
> +	/*
> +	 * If we've overrun the reservation, dump the tx details before we move
> +	 * the log items. Shutdown is imminent...
> +	 */
> +	tp->t_ticket->t_curr_res -= ctx_res + len;
> +	if (WARN_ON(tp->t_ticket->t_curr_res < 0)) {
> +		xfs_warn(log->l_mp, "Transaction log reservation overrun:");
> +		xfs_warn(log->l_mp,
> +			 "  log items: %d bytes (iov hdrs: %d bytes)",
> +			 len, iovhdr_res);
> +		xfs_warn(log->l_mp, "  split region headers: %d bytes",
> +			 split_res);
> +		xfs_warn(log->l_mp, "  ctx ticket: %d bytes", ctx_res);
> +		xlog_print_trans(tp);
> +	}
>  
>  	if (tp->t_ticket->t_curr_res < 0)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -914,18 +912,12 @@ xlog_cil_push_work(
>  
>  	xlog_cil_pcp_aggregate(cil, ctx);
>  
> -	/*
> -	 * Pull all the log vectors off the items in the CIL, and remove the
> -	 * items from the CIL. We don't need the CIL lock here because it's only
> -	 * needed on the transaction commit side which is currently locked out
> -	 * by the flush lock.
> -	 */
> -	list_sort(NULL, &cil->xc_cil, xlog_cil_order_cmp);
> -	lv = NULL;
> -	while (!list_empty(&cil->xc_cil)) {
> +	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
> +
> +	while (!list_empty(&ctx->log_items)) {
>  		struct xfs_log_item	*item;
>  
> -		item = list_first_entry(&cil->xc_cil,
> +		item = list_first_entry(&ctx->log_items,
>  					struct xfs_log_item, li_cil);
>  		list_del_init(&item->li_cil);
>  		item->li_order_id = 0;
> @@ -1119,7 +1111,6 @@ xlog_cil_push_background(
>  	 * The cil won't be empty because we are called while holding the
>  	 * context lock so whatever we added to the CIL will still be there.
>  	 */
> -	ASSERT(!list_empty(&cil->xc_cil));
>  	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  
>  	/*
> @@ -1478,6 +1469,8 @@ xlog_cil_pcp_dead(
>  			list_splice_init(&cilpcp->busy_extents,
>  					&ctx->busy_extents);
>  		}
> +		if (!list_empty(&cilpcp->log_items))
> +			list_splice_init(&cilpcp->log_items, &ctx->log_items);
>  
>  		cilpcp->space_used = 0;
>  		cilpcp->space_reserved = 0;
> @@ -1549,6 +1542,7 @@ xlog_cil_pcp_alloc(
>  	for_each_possible_cpu(cpu) {
>  		cilpcp = per_cpu_ptr(pcp, cpu);
>  		INIT_LIST_HEAD(&cilpcp->busy_extents);
> +		INIT_LIST_HEAD(&cilpcp->log_items);
>  	}
>  	return pcp;
>  }
> @@ -1584,9 +1578,7 @@ xlog_cil_init(
>  		return -ENOMEM;
>  	}
>  
> -	INIT_LIST_HEAD(&cil->xc_cil);
>  	INIT_LIST_HEAD(&cil->xc_committing);
> -	spin_lock_init(&cil->xc_cil_lock);
>  	spin_lock_init(&cil->xc_push_lock);
>  	init_waitqueue_head(&cil->xc_push_wait);
>  	init_rwsem(&cil->xc_ctx_lock);
> @@ -1612,7 +1604,6 @@ xlog_cil_destroy(
>  		kmem_free(cil->xc_ctx);
>  	}
>  
> -	ASSERT(list_empty(&cil->xc_cil));
>  	ASSERT(test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  	xlog_cil_pcp_free(cil, cil->xc_pcp);
>  	kmem_free(cil);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 466862a943ba..d3bf3b367370 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -220,6 +220,7 @@ struct xfs_cil_ctx {
>  	struct xlog_ticket	*ticket;	/* chkpt ticket */
>  	atomic_t		space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
> +	struct list_head	log_items;	/* log items in chkpt */
>  	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
> @@ -258,8 +259,6 @@ struct xfs_cil {
>  	struct xlog		*xc_log;
>  	unsigned long		xc_flags;
>  	atomic_t		xc_iclog_hdrs;
> -	struct list_head	xc_cil;
> -	spinlock_t		xc_cil_lock;
>  
>  	struct rw_semaphore	xc_ctx_lock ____cacheline_aligned_in_smp;
>  	struct xfs_cil_ctx	*xc_ctx;
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 35/39] xfs: convert log vector chain to use list heads
  2021-05-19 12:13 ` [PATCH 35/39] xfs: convert log vector chain to use list heads Dave Chinner
@ 2021-05-27 19:13   ` Darrick J. Wong
  2021-06-03  0:38     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 19:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:13PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because the next change is going to require sorting log vectors, and
> that requires arbitrary rearrangement of the list which cannot be
> done easily with a single linked list.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c        | 35 +++++++++++++++++++++++++---------
>  fs/xfs/xfs_log.h        |  2 +-
>  fs/xfs/xfs_log_cil.c    | 42 +++++++++++++++++++++++------------------
>  fs/xfs/xfs_log_priv.h   |  4 ++--
>  fs/xfs/xfs_trans.c      |  4 ++--
>  fs/xfs/xfs_trans_priv.h |  3 ++-
>  6 files changed, 57 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 77d9ea7daf26..5511c5de6b78 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -848,6 +848,9 @@ xlog_write_unmount_record(
>  		.lv_niovecs = 1,
>  		.lv_iovecp = &reg,
>  	};
> +	LIST_HEAD(lv_chain);
> +	INIT_LIST_HEAD(&vec.lv_list);
> +	list_add(&vec.lv_list, &lv_chain);
>  
>  	BUILD_BUG_ON((sizeof(struct xlog_op_header) +
>  		      sizeof(struct xfs_unmount_log_format)) !=
> @@ -863,7 +866,7 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, reg.i_len);
> +	return xlog_write(log, &lv_chain, ticket, NULL, NULL, reg.i_len);
>  }
>  
>  /*
> @@ -1581,13 +1584,16 @@ xlog_commit_record(
>  		.lv_iovecp = &reg,
>  	};
>  	int	error;
> +	LIST_HEAD(lv_chain);
> +	INIT_LIST_HEAD(&vec.lv_list);
> +	list_add(&vec.lv_list, &lv_chain);
>  
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, reg.i_len);
> +	error = xlog_write(log, &lv_chain, ticket, lsn, iclog, reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2118,6 +2124,7 @@ xlog_print_trans(
>   */
>  static struct xfs_log_vec *
>  xlog_write_single(
> +	struct list_head	*lv_chain,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	*iclog,
> @@ -2134,7 +2141,9 @@ xlog_write_single(
>  		iclog->ic_state == XLOG_STATE_WANT_SYNC);
>  
>  	ptr = iclog->ic_datap + *log_offset;
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
> +	for (lv = log_vector;
> +	     !list_entry_is_head(lv, lv_chain, lv_list);
> +	     lv = list_next_entry(lv, lv_list)) {
>  		/*
>  		 * If the entire log vec does not fit in the iclog, punt it to
>  		 * the partial copy loop which can handle this case.
> @@ -2163,6 +2172,8 @@ xlog_write_single(
>  			*data_cnt += reg->i_len;
>  		}
>  	}
> +	if (list_entry_is_head(lv, lv_chain, lv_list))
> +		lv = NULL;
>  	ASSERT(*len == 0 || lv);
>  	return lv;
>  }
> @@ -2208,6 +2219,7 @@ xlog_write_get_more_iclog_space(
>  static struct xfs_log_vec *
>  xlog_write_partial(
>  	struct xlog		*log,
> +	struct list_head	*lv_chain,
>  	struct xfs_log_vec	*log_vector,
>  	struct xlog_ticket	*ticket,
>  	struct xlog_in_core	**iclogp,
> @@ -2347,7 +2359,10 @@ xlog_write_partial(
>  	 * the caller so it can go back to fast path copying.
>  	 */
>  	*iclogp = iclog;
> -	return lv->lv_next;
> +	lv = list_next_entry(lv, lv_list);
> +	if (list_entry_is_head(lv, lv_chain, lv_list))
> +		return NULL;
> +	return lv;
>  }
>  
>  /*
> @@ -2393,14 +2408,14 @@ xlog_write_partial(
>  int
>  xlog_write(
>  	struct xlog		*log,
> -	struct xfs_log_vec	*log_vector,
> +	struct list_head	*lv_chain,
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
>  	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
> -	struct xfs_log_vec	*lv = log_vector;
> +	struct xfs_log_vec	*lv;
>  	int			record_cnt = 0;
>  	int			data_cnt = 0;
>  	int			error = 0;
> @@ -2422,14 +2437,16 @@ xlog_write(
>  	if (start_lsn)
>  		*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> +	lv = list_first_entry_or_null(lv_chain, struct xfs_log_vec, lv_list);
>  	while (lv) {
> -		lv = xlog_write_single(lv, ticket, iclog, &log_offset,
> +		lv = xlog_write_single(lv_chain, lv, ticket, iclog, &log_offset,
>  					&len, &record_cnt, &data_cnt);
>  		if (!lv)
>  			break;
>  
> -		lv = xlog_write_partial(log, lv, ticket, &iclog, &log_offset,
> -					&len, &record_cnt, &data_cnt);
> +		lv = xlog_write_partial(log, lv_chain, lv, ticket, &iclog,
> +					&log_offset, &len, &record_cnt,
> +					&data_cnt);
>  		if (IS_ERR_OR_NULL(lv)) {
>  			error = PTR_ERR_OR_ZERO(lv);
>  			break;
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index af54ea3f8c90..b4ad0e37a0c5 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -9,7 +9,7 @@
>  struct xfs_cil_ctx;
>  
>  struct xfs_log_vec {
> -	struct xfs_log_vec	*lv_next;	/* next lv in build list */
> +	struct list_head	lv_list;	/* CIL lv chain ptrs */
>  	int			lv_niovecs;	/* number of iovecs in lv */
>  	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
>  	struct xfs_log_item	*lv_item;	/* owner */
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 287dc7d0d508..035f0a60040a 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -73,6 +73,7 @@ xlog_cil_ctx_alloc(void)
>  	INIT_LIST_HEAD(&ctx->committing);
>  	INIT_LIST_HEAD(&ctx->busy_extents);
>  	INIT_LIST_HEAD(&ctx->log_items);
> +	INIT_LIST_HEAD(&ctx->lv_chain);
>  	INIT_WORK(&ctx->push_work, xlog_cil_push_work);
>  	return ctx;
>  }
> @@ -267,6 +268,7 @@ xlog_cil_alloc_shadow_bufs(
>  			lv = kmem_alloc_large(buf_size, KM_NOFS);
>  			memset(lv, 0, xlog_cil_iovec_space(niovecs));
>  
> +			INIT_LIST_HEAD(&lv->lv_list);
>  			lv->lv_item = lip;
>  			lv->lv_size = buf_size;
>  			if (ordered)
> @@ -282,7 +284,6 @@ xlog_cil_alloc_shadow_bufs(
>  			else
>  				lv->lv_buf_len = 0;
>  			lv->lv_bytes = 0;
> -			lv->lv_next = NULL;
>  		}
>  
>  		/* Ensure the lv is set up according to ->iop_size */
> @@ -409,7 +410,6 @@ xlog_cil_insert_format_items(
>  		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
>  			/* same or smaller, optimise common overwrite case */
>  			lv = lip->li_lv;
> -			lv->lv_next = NULL;
>  
>  			if (ordered)
>  				goto insert;
> @@ -576,14 +576,14 @@ xlog_cil_insert_items(
>  
>  static void
>  xlog_cil_free_logvec(
> -	struct xfs_log_vec	*log_vector)
> +	struct list_head	*lv_chain)
>  {
>  	struct xfs_log_vec	*lv;
>  
> -	for (lv = log_vector; lv; ) {
> -		struct xfs_log_vec *next = lv->lv_next;
> +	while(!list_empty(lv_chain)) {

Nit: space after 'while'.

> +		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_list);
> +		list_del_init(&lv->lv_list);
>  		kmem_free(lv);
> -		lv = next;
>  	}
>  }
>  
> @@ -682,7 +682,7 @@ xlog_cil_committed(
>  		spin_unlock(&ctx->cil->xc_push_lock);
>  	}
>  
> -	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
> +	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, &ctx->lv_chain,
>  					ctx->start_lsn, abort);
>  
>  	xfs_extent_busy_sort(&ctx->busy_extents);
> @@ -693,7 +693,7 @@ xlog_cil_committed(
>  	list_del(&ctx->committing);
>  	spin_unlock(&ctx->cil->xc_push_lock);
>  
> -	xlog_cil_free_logvec(ctx->lv_chain);
> +	xlog_cil_free_logvec(&ctx->lv_chain);
>  
>  	if (!list_empty(&ctx->busy_extents))
>  		xlog_discard_busy_extents(mp, ctx);
> @@ -773,7 +773,6 @@ xlog_cil_build_trans_hdr(
>  	lvhdr->lv_niovecs = 2;
>  	lvhdr->lv_iovecp = &hdr->lhdr[0];
>  	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> -	lvhdr->lv_next = ctx->lv_chain;
>  
>  	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
> @@ -913,25 +912,23 @@ xlog_cil_push_work(
>  	xlog_cil_pcp_aggregate(cil, ctx);
>  
>  	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
> -
>  	while (!list_empty(&ctx->log_items)) {
>  		struct xfs_log_item	*item;
>  
>  		item = list_first_entry(&ctx->log_items,
>  					struct xfs_log_item, li_cil);
> +		lv = item->li_lv;
>  		list_del_init(&item->li_cil);
>  		item->li_order_id = 0;
> -		if (!ctx->lv_chain)
> -			ctx->lv_chain = item->li_lv;
> -		else
> -			lv->lv_next = item->li_lv;
> -		lv = item->li_lv;
>  		item->li_lv = NULL;
> -		num_iovecs += lv->lv_niovecs;
>  
> +		num_iovecs += lv->lv_niovecs;

Not sure why "lv = item->li_lv" needed to move up?

I think the only change needed here is replacing the lv_chain/lv_next
business with the list_add_tail?

>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
>  			num_bytes += lv->lv_bytes;
> +
> +		list_add_tail(&lv->lv_list, &ctx->lv_chain);
> +
>  	}
>  
>  	/*
> @@ -968,10 +965,13 @@ xlog_cil_push_work(
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
>  	 * transaction header here as it is not accounted for in xlog_write().
> +	 * Add the lvhdr to the head of the lv chain we pass to xlog_write() so
> +	 * it gets written into the iclog first.
>  	 */
>  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
>  	num_iovecs += lvhdr.lv_niovecs;
>  	num_bytes += lvhdr.lv_bytes;
> +	list_add(&lvhdr.lv_list, &ctx->lv_chain);
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -985,8 +985,14 @@ xlog_cil_push_work(
>  	 * use the commit record lsn then we can move the tail beyond the grant
>  	 * write head.
>  	 */
> -	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				num_bytes);
> +	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
> +				NULL, num_bytes);
> +
> +	/*
> +	 * Take the lvhdr back off the lv_chain as it should not be passed
> +	 * to log IO completion.
> +	 */
> +	list_del(&lvhdr.lv_list);

Seems a little clunky, but I guess I see why it's needed.

I /think/ I don't see any place where the onstack lvhdr can escape out
of the chain after _push_work returns, so this is safe enough.
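
The list_del() is load-bearing precisely because lvhdr lives on the push
worker's stack. A sketch of the failure mode if it were omitted
(hypothetical names):

	static void push(struct list_head *chain)
	{
		struct xfs_log_vec hdr;		/* on stack, like lvhdr */

		list_add(&hdr.lv_list, chain);
		write_chain(chain);		/* think xlog_write() */
		/*
		 * Without this del, chain->next would still point at
		 * hdr after we return - i.e. into a dead stack frame -
		 * and a later traversal (say, log IO completion walking
		 * ctx->lv_chain) would be a use-after-return.
		 */
		list_del(&hdr.lv_list);
	}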

--D

>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index d3bf3b367370..071367a96d8d 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -221,7 +221,7 @@ struct xfs_cil_ctx {
>  	atomic_t		space_used;	/* aggregate size of regions */
>  	struct list_head	busy_extents;	/* busy extents in chkpt */
>  	struct list_head	log_items;	/* log items in chkpt */
> -	struct xfs_log_vec	*lv_chain;	/* logvecs being pushed */
> +	struct list_head	lv_chain;	/* logvecs being pushed */
>  	struct list_head	iclog_entry;
>  	struct list_head	committing;	/* ctx committing list */
>  	struct work_struct	discard_endio_work;
> @@ -477,7 +477,7 @@ xlog_write_adv_cnt(void **ptr, int *len, int *off, size_t bytes)
>  
>  void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
> -int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
> +int	xlog_write(struct xlog *log, struct list_head *lv_chain,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
>  		struct xlog_in_core **commit_iclog, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index bc72826d1f97..0f8300adb12d 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -735,7 +735,7 @@ xfs_log_item_batch_insert(
>  void
>  xfs_trans_committed_bulk(
>  	struct xfs_ail		*ailp,
> -	struct xfs_log_vec	*log_vector,
> +	struct list_head	*lv_chain,
>  	xfs_lsn_t		commit_lsn,
>  	bool			aborted)
>  {
> @@ -750,7 +750,7 @@ xfs_trans_committed_bulk(
>  	spin_unlock(&ailp->ail_lock);
>  
>  	/* unpin all the log items */
> -	for (lv = log_vector; lv; lv = lv->lv_next ) {
> +	list_for_each_entry(lv, lv_chain, lv_list) {
>  		struct xfs_log_item	*lip = lv->lv_item;
>  		xfs_lsn_t		item_lsn;
>  
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 3004aeac9110..fc8667c728e3 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -18,7 +18,8 @@ void	xfs_trans_add_item(struct xfs_trans *, struct xfs_log_item *);
>  void	xfs_trans_del_item(struct xfs_log_item *);
>  void	xfs_trans_unreserve_and_mod_sb(struct xfs_trans *tp);
>  
> -void	xfs_trans_committed_bulk(struct xfs_ail *ailp, struct xfs_log_vec *lv,
> +void	xfs_trans_committed_bulk(struct xfs_ail *ailp,
> +				struct list_head *lv_chain,
>  				xfs_lsn_t commit_lsn, bool aborted);
>  /*
>   * AIL traversal cursor.
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 36/39] xfs: move CIL ordering to the logvec chain
  2021-05-19 12:13 ` [PATCH 36/39] xfs: move CIL ordering to the logvec chain Dave Chinner
@ 2021-05-27 19:14   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 19:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:14PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Adding a list_sort() call to the CIL push work while the xc_ctx_lock
> is held exclusively has resulted in fairly long lock hold times and
> that stops all front end transaction commits from making progress.
> 
> We can move the sorting out of the xc_ctx_lock if we can transfer
> the ordering information to the log vectors as they are detached
> from the log items and then we can sort the log vectors.  With these
> changes, we can move the list_sort() call to just before we call
> xlog_write() when we aren't holding any locks at all.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D
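
In effect the push work now does the following (condensed sketch of the
flow in the diff below, not literal code):

	down_write(&cil->xc_ctx_lock);
	/* detach each item's lv, copying li_order_id to lv_order_id */
	up_write(&cil->xc_ctx_lock);

	/* no locks held from here on */
	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
	xlog_cil_build_trans_hdr(...);	/* header goes to the list head */
	xlog_write(log, &ctx->lv_chain, ...);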

> ---
>  fs/xfs/xfs_log.h     |  1 +
>  fs/xfs/xfs_log_cil.c | 23 ++++++++++++++---------
>  2 files changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index b4ad0e37a0c5..93aaee7c276e 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -10,6 +10,7 @@ struct xfs_cil_ctx;
>  
>  struct xfs_log_vec {
>  	struct list_head	lv_list;	/* CIL lv chain ptrs */
> +	uint32_t		lv_order_id;	/* chain ordering info */
>  	int			lv_niovecs;	/* number of iovecs in lv */
>  	struct xfs_log_iovec	*lv_iovecp;	/* iovec array */
>  	struct xfs_log_item	*lv_item;	/* owner */
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 035f0a60040a..cfd3128399f6 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -791,10 +791,10 @@ xlog_cil_order_cmp(
>  	const struct list_head	*a,
>  	const struct list_head	*b)
>  {
> -	struct xfs_log_item	*l1 = container_of(a, struct xfs_log_item, li_cil);
> -	struct xfs_log_item	*l2 = container_of(b, struct xfs_log_item, li_cil);
> +	struct xfs_log_vec	*l1 = container_of(a, struct xfs_log_vec, lv_list);
> +	struct xfs_log_vec	*l2 = container_of(b, struct xfs_log_vec, lv_list);
>  
> -	return l1->li_order_id > l2->li_order_id;
> +	return l1->lv_order_id > l2->lv_order_id;
>  }
>  
>  /*
> @@ -911,24 +911,22 @@ xlog_cil_push_work(
>  
>  	xlog_cil_pcp_aggregate(cil, ctx);
>  
> -	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
>  	while (!list_empty(&ctx->log_items)) {
>  		struct xfs_log_item	*item;
>  
>  		item = list_first_entry(&ctx->log_items,
>  					struct xfs_log_item, li_cil);
>  		lv = item->li_lv;
> -		list_del_init(&item->li_cil);
> -		item->li_order_id = 0;
> -		item->li_lv = NULL;
> -
> +		lv->lv_order_id = item->li_order_id;
>  		num_iovecs += lv->lv_niovecs;
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
>  			num_bytes += lv->lv_bytes;
>  
>  		list_add_tail(&lv->lv_list, &ctx->lv_chain);
> -
> +		list_del_init(&item->li_cil);
> +		item->li_order_id = 0;
> +		item->li_lv = NULL;
>  	}
>  
>  	/*
> @@ -961,6 +959,13 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  	up_write(&cil->xc_ctx_lock);
>  
> +	/*
> +	 * Sort the log vector chain before we add the transaction headers.
> +	 * This ensures we always have the transaction headers at the start
> +	 * of the chain.
> +	 */
> +	list_sort(NULL, &ctx->lv_chain, xlog_cil_order_cmp);
> +
>  	/*
>  	 * Build a checkpoint transaction header and write it to the log to
>  	 * begin the transaction. We need to account for the space used by the
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 37/39] xfs: avoid cil push lock if possible
  2021-05-19 12:13 ` [PATCH 37/39] xfs: avoid cil push lock if possible Dave Chinner
@ 2021-05-27 19:18   ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 19:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:15PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because now it hurts when the CIL fills up.
> 
>   - 37.20% __xfs_trans_commit
>       - 35.84% xfs_log_commit_cil
>          - 19.34% _raw_spin_lock
>             - do_raw_spin_lock
>                  19.01% __pv_queued_spin_lock_slowpath
>          - 4.20% xfs_log_ticket_ungrant
>               0.90% xfs_log_space_wake
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index cfd3128399f6..672cbaa4606c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1125,10 +1125,18 @@ xlog_cil_push_background(
>  	ASSERT(!test_bit(XLOG_CIL_EMPTY, &cil->xc_flags));
>  
>  	/*
> -	 * Don't do a background push if we haven't used up all the
> -	 * space available yet.
> +	 * We are done if:
> +	 * - we haven't used up all the space available yet; or
> +	 * - we've already queued up a push; and
> +	 * - we're not over the hard limit; and
> +	 * - nothing has been over the hard limit.
> +	 *
> +	 * If so, we don't need to take the push lock as there's nothing to do.
>  	 */
> -	if (space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> +	if (space_used < XLOG_CIL_SPACE_LIMIT(log) ||
> +	    (cil->xc_push_seq == cil->xc_current_sequence &&
> +	     space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log) &&
> +	     !waitqueue_active(&cil->xc_push_wait))) {
>  		up_read(&cil->xc_ctx_lock);
>  		return;
>  	}
> -- 
> 2.31.1
> 
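
The new condition reads more easily with the subexpressions named
(sketch, equivalent to the hunk above):

	bool under_soft  = space_used < XLOG_CIL_SPACE_LIMIT(log);
	bool push_queued = cil->xc_push_seq == cil->xc_current_sequence;
	bool under_hard  = space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log);
	bool throttled   = waitqueue_active(&cil->xc_push_wait);

	if (under_soft || (push_queued && under_hard && !throttled)) {
		up_read(&cil->xc_ctx_lock);	/* nothing to do */
		return;
	}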

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 39/39] xfs: expanding delayed logging design with background material
  2021-05-19 12:13 ` [PATCH 39/39] xfs: expanding delayed logging design with background material Dave Chinner
@ 2021-05-27 20:38   ` Darrick J. Wong
  2021-06-03  0:57     ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-05-27 20:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, May 19, 2021 at 10:13:17PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I wrote up a description of how transactions, space reservations and
> relogging work together in response to a question for background
> material on the delayed logging design. Add this to the existing
> document for ease of future reference.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  .../xfs-delayed-logging-design.rst            | 361 ++++++++++++++++--
>  1 file changed, 322 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
> index 464405d2801e..395c63ca5b27 100644
> --- a/Documentation/filesystems/xfs-delayed-logging-design.rst
> +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
> @@ -1,29 +1,314 @@
>  .. SPDX-License-Identifier: GPL-2.0
>  
> -==========================
> -XFS Delayed Logging Design
> -==========================
> -
> -Introduction to Re-logging in XFS
> -=================================
> -
> -XFS logging is a combination of logical and physical logging. Some objects,
> -such as inodes and dquots, are logged in logical format where the details
> -logged are made up of the changes to in-core structures rather than on-disk
> -structures. Other objects - typically buffers - have their physical changes
> -logged. The reason for these differences is to reduce the amount of log space
> -required for objects that are frequently logged. Some parts of inodes are more
> -frequently logged than others, and inodes are typically more frequently logged
> -than any other object (except maybe the superblock buffer) so keeping the
> -amount of metadata logged low is of prime importance.
> -
> -The reason that this is such a concern is that XFS allows multiple separate
> -modifications to a single object to be carried in the log at any given time.
> -This allows the log to avoid needing to flush each change to disk before
> -recording a new change to the object. XFS does this via a method called
> -"re-logging". Conceptually, this is quite simple - all it requires is that any
> -new change to the object is recorded with a *new copy* of all the existing
> -changes in the new transaction that is written to the log.
> +==================
> +XFS Logging Design
> +==================
> +
> +Preamble
> +========
> +
> +This document describes the design and algorithms that the XFS journalling
> +subsystem is based on, so that readers may familiarize themselves with the
> +general concepts of how transaction processing in XFS works.
> +
> +We begin with an overview of transactions in XFS, followed by describing how
> +transaction reservations are structured and accounted, and then move into how we
> +guarantee forwards progress for long running transactions with finite initial
> +reservations bounds. At this point we need to explain how relogging works. With
> +the basic concepts covered, the design of the delayed logging mechanism is
> +documented.
> +
> +
> +Introduction
> +============
> +
> +XFS uses Write Ahead Logging to ensure changes to the filesystem metadata
> +are atomic and recoverable. For reasons of space and time efficiency, the
> +logging mechanisms are varied and complex, combining intent, logical and
> +physical logging to provide the necessary recovery guarantees the
> +filesystem requires.
> +
> +Some objects, such as inodes and dquots, are logged in logical format where the
> +details logged are made up of the changes to in-core structures rather than
> +on-disk structures. Other objects - typically buffers - have their physical
> +changes logged. Long running atomic modifications have individual changes
> +chained together by intents, ensuring that journal recovery can restart and
> +finish an operation that was only partially done when the system stopped
> +functioning.
> +
> +The reason for these differences is to keep the amount of log space and CPU time
> +required to process objects being modified as small as possible and hence the
> +logging overhead as low as possible. Some items are very frequently modified,
> +and some parts of objects are more frequently modified than others, so keeping
> +the overhead of metadata logging low is of prime importance.
> +
> +The method used to log an item or chain modifications together isn't
> +particularly important in the scope of this document. It suffices to know that
> +the methods used for logging a particular object or chaining modifications
> +together differ and depend on the object and/or modification being
> +performed. The logging subsystem only cares that certain specific rules are
> +followed to guarantee forwards progress and prevent deadlocks.
> +
> +
> +Transactions in XFS
> +===================
> +
> +XFS has two types of high level transactions, defined by the type of log space
> +reservation they take. These are known as "one shot" and "permanent"
> +transactions. Permanent transaction reservations can take reservations that span
> +commit boundaries, whilst "one shot" transactions are for a single atomic
> +modification.
> +
> +The type and size of reservation must be matched to the modification taking
> +place.  This means that permanent transactions can be used for one-shot
> +modifications, but one-shot reservations cannot be used for permanent
> +transactions.
> +
> +In the code, a one-shot transaction pattern looks somewhat like this::
> +
> +	tp = xfs_trans_alloc(<reservation>)
> +	<lock items>
> +	<join item to transaction>
> +	<do modification>
> +	xfs_trans_commit(tp);
> +
> +As items are modified in the transaction, the dirty regions in those items are
> +tracked via the transaction handle.  Once the transaction is committed, all
> +resources joined to it are released, along with the remaining unused reservation
> +space that was taken at the transaction allocation time.
> +
> +In contrast, a permanent transaction is made up of multiple linked individual
> +transactions, and the pattern looks like this::
> +
> +	tp = xfs_trans_alloc(<reservation>)
> +	xfs_ilock(ip, XFS_ILOCK_EXCL)
> +
> +	loop {
> +		xfs_trans_ijoin(tp, 0);
> +		<do modification>
> +		xfs_trans_log_inode(tp, ip);
> +		xfs_trans_roll(&tp);
> +	}
> +
> +	xfs_trans_commit(tp);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +While this might look similar to a one-shot transaction, there is an important
> +difference: xfs_trans_roll() performs a specific operation that links two
> +transactions together::
> +
> +	ntp = xfs_trans_dup(tp);
> +	xfs_trans_commit(tp);
> +	xfs_log_reserve(ntp);
> +
> +This results in a series of "rolling transactions" where the inode is locked
> +across the entire chain of transactions.  Hence while this series of rolling
> +transactions is running, nothing else can read from or write to the inode and
> +this provides a mechanism for complex changes to appear atomic from an external
> +observer's point of view.
> +
> +It is important to note that a series of rolling transactions in a permanent
> +transaction does not form an atomic change in the journal. While each
> +individual modification is atomic, the chain is *not atomic*. If we crash half
> +way through, then recovery will only replay up to the last transactional
> +modification the loop made that was committed to the journal.
> +
> +This affects long running permanent transactions in that it is not possible to
> +predict how much of a long running operation will actually be recovered because
> +there is no guarantee of how much of the operation reached stable storage. Hence
> +if a long running operation requires multiple transactions to fully complete,
> +the high level operation must use intents and deferred operations to guarantee
> +recovery can complete the operation once the first transaction is persisted in
> +the on-disk journal.
> +
> +
> +Transactions are Asynchronous
> +=============================
> +
> +In XFS, all high level transactions are asynchronous by default. This means that
> +xfs_trans_commit() does not guarantee that the modification has been committed
> +to stable storage when it returns. Hence when a system crashes, not all the
> +completed transactions will be replayed during recovery.
> +
> +However, the logging subsystem does provide global ordering guarantees, such
> +that if a specific change is seen after recovery, all metadata modifications
> +that were committed prior to that change will also be seen.
> +
> +For single shot operations that need to reach stable storage immediately, or
> +to ensure that a long running permanent transaction is fully committed once it is
> +complete, we can explicitly tag a transaction as synchronous. This will trigger
> +a "log force" to flush the outstanding committed transactions to stable storage
> +in the journal and wait for that to complete.
> +
> +Synchronous transactions are rarely used, however, because they limit logging
> +throughput to the IO latency limitations of the underlying storage. Instead, we
> +tend to use log forces to ensure modifications are on stable storage only when
> +a user operation requires a synchronisation point to occur (e.g. fsync).
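
In code terms, tagging a transaction synchronous is a one-liner before
the commit (a sketch in the pseudocode style used above; the helpers are
xfs_trans_set_sync() and xfs_log_force()):

	tp = xfs_trans_alloc(<reservation>)
	<do modification>
	xfs_trans_set_sync(tp);		/* commit waits on the log force */
	xfs_trans_commit(tp);

	/* or, for an explicit synchronisation point such as fsync: */
	xfs_log_force(mp, XFS_LOG_SYNC);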
> +
> +
> +Transaction Reservations
> +========================
> +
> +It has been mentioned a number of times now that the logging subsystem needs to
> +provide a forwards progress guarantee so that no modification ever stalls
> +because it can't be written to the journal due to a lack of space in the
> +journal. This is achieved by the transaction reservations that are made when
> +a transaction is first allocated. For permanent transactions, these reservations
> +are maintained as part of the transaction rolling mechanism.
> +
> +A transaction reservation provides a guarantee that there is physical log space
> +available to write the modification into the journal before we start making
> +modifications to objects and items. As such, the reservation needs to be large
> +enough to take into account the amount of metadata that the change might need to
> +log in the worst case. This means that if we are modifying a btree in the
> +transaction, we have to reserve enough space to record a full leaf-to-root split
> +of the btree. As such, the reservations are quite complex because we have to
> +take into account all the hidden changes that might occur.
> +
> +For example, a user data extent allocation involves allocating an extent from
> +free space, which modifies the free space trees. That's two btrees.  Inserting
> +the extent into the inode's extent map might require a split of the extent map
> +btree, which requires another allocation that can modify the free space trees
> +again.  Then we might have to update reverse mappings, which modifies yet
> +another btree which might require more space. And so on.  Hence the amount of
> +metadata that a "simple" operation can modify can be quite large.
> +
> +This "worst case" calculation provides us with the static "unit reservation"
> +for the transaction that is calculated at mount time. We must guarantee that the
> +log has this much space available before the transaction is allowed to proceed
> +so that when we come to write the dirty metadata into the log we don't run out
> +of log space half way through the write.
> +
> +For one-shot transactions, a single unit space reservation is all that is
> +required for the transaction to proceed. For permanent transactions, however, we
> +also have a "log count" that affects the size of the reservation that is to be
> +made.
> +
> +While a permanent transaction can get by with a single unit of space
> +reservation, it is somewhat inefficient to do this as it requires the
> +transaction rolling mechanism to re-reserve space on every transaction roll. We
> +know from the implementation of the permanent transactions how many transaction
> +rolls are likely for the common modifications that need to be made.
> +
> +For example, an inode allocation is typically two transactions - one to
> +physically allocate a free inode chunk on disk, and another to allocate an inode
> +from an inode chunk that has free inodes in it.  Hence for an inode allocation
> +transaction, we might set the reservation log count to a value of 2 to indicate
> +that the common/fast path transaction will commit two linked transactions in a
> +chain. Each time a permanent transaction rolls, it consumes an entire unit
> +reservation.
> +
> +Hence when the permanent transaction is first allocated, the log space
> +reservation is increased from a single unit reservation to multiple unit
> +reservations. That multiple is defined by the reservation log count, and this
> +means we can roll the transaction multiple times before we have to re-reserve
> +log space. This ensures that the common
> +modifications we make only need to reserve log space once.
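
A worked example with made-up numbers: if the inode allocation unit
reservation is 192kB and the reservation log count is 2, then:

	initial reservation = unit reservation * log count
	                    = 192kB * 2 = 384kB

	roll #1 consumes one unit  -> 192kB of reservation remaining
	roll #2 consumes the last  -> 0 remaining, must re-reserve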
> +
> +If the log count for a permanent transaction reaches zero, then it needs to
> +re-reserve physical space in the log. This is somewhat complex, and requires
> +an understanding of how the log accounts for space that has been reserved.
> +
> +
> +Log Space Accounting
> +====================
> +
> +The position in the log is typically referred to as a Log Sequence Number (LSN).
> +The log is circular, so the positions in the log are defined by the combination
> +of a cycle number - the number of times the log has been overwritten - and the
> +offset into the log.  A LSN carries the cycle in the upper 32 bits and the
> +offset in the lower 32 bits. The offset is in units of "basic blocks" (512
> +bytes). Hence we can do relatively simple LSN based math to keep track of
> +available space in the log.
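
As a sketch of that packing (the kernel's own helpers for this are the
CYCLE_LSN() and BLOCK_LSN() macros):

	lsn    = ((uint64_t)cycle << 32) | offset;	/* offset in BBs */
	cycle  = lsn >> 32;				/* CYCLE_LSN()   */
	offset = lsn & 0xffffffff;			/* BLOCK_LSN()   */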
> +
> +Log space accounting is done via a pair of constructs called "grant heads".  The
> +position of the grant heads is an absolute value, so the amount of space
> +available in the log is defined by the distance between the position of the
> +grant head and the current log tail. That is, how much space can be
> +reserved/consumed before the grant heads would fully wrap the log and overtake
> +the tail position.
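
Expressed as arithmetic, treating head and tail as linear positions:

	space consumed  = grant head - log tail
	space available = log size - (grant head - log tail)

so a reservation of R bytes can only be granted while (grant head + R) -
log tail does not exceed the log size.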
> +
> +The first grant head is the "reserve" head. This tracks the byte count of the
> +reservations currently held by active transactions. It is a purely in-memory
> +accounting of the space reservation and, as such, actually tracks byte offsets
> +into the log rather than basic blocks. Hence it technically isn't using LSNs to
> +represent the log position, but it is still treated like a split {cycle,offset}
> +tuple for the purposes of tracking reservation space.
> +
> +The reserve grant head is used to accurately account for exact transaction
> +reservation amounts and the exact byte count that modifications actually make
> +and need to write into the log. The reserve head is used to prevent new
> +transactions from taking new reservations when the head reaches the current
> +tail. It will block new reservations in a FIFO queue and as the log tail moves
> +forward it will wake them in order once sufficient space is available. This FIFO
> +mechanism ensures no transaction is starved of resources when log space
> +shortages occur.
> +
> +The other grant head is the "write" head. Unlike the reserve head, this grant
> +head contains an LSN and it tracks the physical space usage in the log. While
> +this might sound like it is accounting the same state as the reserve grant head
> +- and it mostly does track exactly the same location as the reserve grant head -
> +there are critical differences in behaviour between them that provide the
> +forwards progress guarantees that rolling permanent transactions require.
> +
> +These differences come into play when a permanent transaction is rolled, the
> +internal "log count" reaches zero and the initial set of unit reservations has
> +been exhausted. At this point, we still require a log space reservation to
> +continue the next transaction in the sequence, but we have none remaining. We cannot
> +sleep during the transaction commit process waiting for new log space to become
> +available, as we may end up on the end of the FIFO queue and the items we have
> +locked while we sleep could end up pinning the tail of the log before there is
> +enough free space in the log to fulfil all of the pending reservations and
> +then wake up the transaction commit in progress.
> +
> +To take a new reservation without sleeping requires us to be able to take a
> +reservation even if there is no reservation space currently available. That is,
> +we need to be able to *overcommit* the log reservation space. As has already
> +been detailed, we cannot overcommit physical log space. However, the reserve
> +grant head does not track physical space - it only accounts for the amount of
> +reservations we currently have outstanding. Hence if the reserve head passes
> +over the tail of the log all it means is that new reservations will be throttled
> +immediately and remain throttled until the log tail is moved forward far enough
> +to remove the overcommit and start taking new reservations. In other words, we
> +can overcommit the reserve head without violating the physical log head and tail
> +rules.
> +
> +As a result, permanent transactions only "regrant" reservation space during
> +xfs_trans_commit() calls, while the physical log space reservation - tracked by
> +the write head - is then reserved separately by a call to xfs_log_reserve()
> +after the commit completes. Once the commit completes, we can sleep waiting for
> +physical log space to be reserved from the write grant head, but only if one
> +critical rule has been observed::
> +
> +	Code using permanent reservations must always log the items they hold
> +	locked across each transaction they roll in the chain.
> +
> +"Re-logging" the locked items on every transaction roll ensures that the items
> +the transaction chain is rolling are always relocated to the physical head of

This reads (to me) a little awkwardly.  One could ask if the transaction
chain itself is rolling the items?  Which is not really what's
happening.  How about:

"...ensures that the items attached to the transaction chain being
rolled are always relocated..."

> +the log and so do not pin the tail of the log. If a locked item pins the tail of
> +the log when we sleep on the write reservation, then we will deadlock the log as
> +we cannot take the locks needed to write back that item and move the tail of the
> +log forwards to free up write grant space. Re-logging the locked items avoids
> +this deadlock and guarantees that the log reservation we are making cannot
> +self-deadlock.
> +
> +If all rolling transactions obey this rule, then they can all make forwards
> +progress independently because nothing will block the progress of the log
> +tail moving forwards and hence ensuring that write grant space is always
> +(eventually) made available to permanent transactions no matter how many times
> +they roll.
> +
> +
> +Re-logging Explained
> +====================
> +
> +XFS allows multiple separate modifications to a single object to be carried in
> +the log at any given time.  This allows the log to avoid needing to flush each
> +change to disk before recording a new change to the object. XFS does this via a
> +method called "re-logging". Conceptually, this is quite simple - all it requires
> +is that any new change to the object is recorded with a *new copy* of all the
> +existing changes in the new transaction that is written to the log.
>  
>  That is, if we have a sequence of changes A through to F, and the object was
>  written to disk after change D, we would see in the log the following series
> @@ -42,16 +327,13 @@ transaction::
>  In other words, each time an object is relogged, the new transaction contains
>  the aggregation of all the previous changes currently held only in the log.
>  
> -This relogging technique also allows objects to be moved forward in the log so
> -that an object being relogged does not prevent the tail of the log from ever
> -moving forward.  This can be seen in the table above by the changing
> -(increasing) LSN of each subsequent transaction - the LSN is effectively a
> -direct encoding of the location in the log of the transaction.
> +This relogging technique allows objects to be moved forward in the log so that
> +an object being relogged does not prevent the tail of the log from ever moving
> +forward.  This can be seen in the table above by the changing (increasing) LSN
> +of each subsequent transaction, and it's the technique that allows us to
> +implement long-running, multiple-commit permanent transactions. 
>  
> -This relogging is also used to implement long-running, multiple-commit
> -transactions.  These transaction are known as rolling transactions, and require
> -a special log reservation known as a permanent transaction reservation. A
> -typical example of a rolling transaction is the removal of extents from an
> +A typical example of a rolling transaction is the removal of extents from an
>  inode which can only be done at a rate of two extents per transaction because

Ignoring rt files, do we even have /that/ limit anymore?  Especially
considering the other patchset you just sent... :)

With that one odd sentence up there reworked,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  of reservation size limitations. Hence a rolling extent removal transaction
>  keeps relogging the inode and btree buffers as they get modified in each
> @@ -67,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get
>  dirtier as they get relogged, so each subsequent transaction is writing more
>  metadata into the log.
>  
> -Another feature of the XFS transaction subsystem is that most transactions are
> -asynchronous. That is, they don't commit to disk until either a log buffer is
> -filled (a log buffer can hold multiple transactions) or a synchronous operation
> -forces the log buffers holding the transactions to disk. This means that XFS is
> -doing aggregation of transactions in memory - batching them, if you like - to
> -minimise the impact of the log IO on transaction throughput.
> +It should now also be obvious how relogging and asynchronous transactions go
> +hand in hand. That is, transactions don't get written to the physical journal
> +until either a log buffer is filled (a log buffer can hold multiple
> +transactions) or a synchronous operation forces the log buffers holding the
> +transactions to disk. This means that XFS is doing aggregation of transactions
> +in memory - batching them, if you like - to minimise the impact of the log IO on
> +transaction throughput.
>  
>  The limitation on asynchronous transaction throughput is the number and size of
>  log buffers made available by the log manager. By default there are 8 log
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 01/39] xfs: log stripe roundoff is a property of the log
  2021-05-19 12:12 ` [PATCH 01/39] xfs: log stripe roundoff is a property of the log Dave Chinner
@ 2021-05-28  0:54   ` Allison Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Allison Henderson @ 2021-05-28  0:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 5/19/21 5:12 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We don't need to look at the xfs_mount and superblock every time we
> need to do an iclog roundoff calculation. The property is fixed for
> the life of the log, so store the roundoff in the log at mount time
> and use that everywhere.
> 
> On a debug build:
> 
> $ size fs/xfs/xfs_log.o.*
>     text	   data	    bss	    dec	    hex	filename
>    27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
>    27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Ok makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
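
A quick worked example of the simplified calculation, with made-up sizes
(v2 log, 32kB stripe unit, so l_iclog_roundoff = 32768):

	count_init = l_iclog_hsize + ic_offset;		/* say 41000 */
	count      = roundup(count_init, 32768);	/* -> 65536  */
	roundoff   = count - count_init;		/* -> 24536  */

	/* without a v2 stripe unit, l_iclog_roundoff is BBSIZE (512) */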

> ---
>   fs/xfs/libxfs/xfs_log_format.h |  3 --
>   fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
>   fs/xfs/xfs_log_priv.h          |  2 ++
>   3 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 3e15ea29fb8d..d548ea4b6aab 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
>   #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
>   #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
>   #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
> -#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
> -                                 (log)->l_mp->m_sb.sb_logsunit)
> -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
>   
>   #define XLOG_HEADER_SIZE	512
>   
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index c19a82adea1e..0e563ff8cd3b 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1401,6 +1401,11 @@ xlog_alloc_log(
>   	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
>   	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
>   
> +	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
> +		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
> +	else
> +		log->l_iclog_roundoff = BBSIZE;
> +
>   	xlog_grant_head_init(&log->l_reserve_head);
>   	xlog_grant_head_init(&log->l_write_head);
>   
> @@ -1854,29 +1859,15 @@ xlog_calc_iclog_size(
>   	uint32_t		*roundoff)
>   {
>   	uint32_t		count_init, count;
> -	bool			use_lsunit;
> -
> -	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -			log->l_mp->m_sb.sb_logsunit > 1;
>   
>   	/* Add for LR header */
>   	count_init = log->l_iclog_hsize + iclog->ic_offset;
> +	count = roundup(count_init, log->l_iclog_roundoff);
>   
> -	/* Round out the log write size */
> -	if (use_lsunit) {
> -		/* we have a v2 stripe unit to use */
> -		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
> -	} else {
> -		count = BBTOB(BTOBB(count_init));
> -	}
> -
> -	ASSERT(count >= count_init);
>   	*roundoff = count - count_init;
>   
> -	if (use_lsunit)
> -		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
> -	else
> -		ASSERT(*roundoff < BBTOB(1));
> +	ASSERT(count >= count_init);
> +	ASSERT(*roundoff < log->l_iclog_roundoff);
>   	return count;
>   }
>   
> @@ -3151,10 +3142,9 @@ xlog_state_switch_iclogs(
>   	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
>   
>   	/* Round up to next log-sunit */
> -	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -	    log->l_mp->m_sb.sb_logsunit > 1) {
> -		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
> -		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
> +	if (log->l_iclog_roundoff > BBSIZE) {
> +		log->l_curr_block = roundup(log->l_curr_block,
> +						BTOBB(log->l_iclog_roundoff));
>   	}
>   
>   	if (log->l_curr_block >= log->l_logBBsize) {
> @@ -3406,12 +3396,11 @@ xfs_log_ticket_get(
>    * Figure out the total log space unit (in bytes) that would be
>    * required for a log ticket.
>    */
> -int
> -xfs_log_calc_unit_res(
> -	struct xfs_mount	*mp,
> +static int
> +xlog_calc_unit_res(
> +	struct xlog		*log,
>   	int			unit_bytes)
>   {
> -	struct xlog		*log = mp->m_log;
>   	int			iclog_space;
>   	uint			num_headers;
>   
> @@ -3487,18 +3476,20 @@ xfs_log_calc_unit_res(
>   	/* for commit-rec LR header - note: padding will subsume the ophdr */
>   	unit_bytes += log->l_iclog_hsize;
>   
> -	/* for roundoff padding for transaction data and one for commit record */
> -	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
> -		/* log su roundoff */
> -		unit_bytes += 2 * mp->m_sb.sb_logsunit;
> -	} else {
> -		/* BB roundoff */
> -		unit_bytes += 2 * BBSIZE;
> -        }
> +	/* roundoff padding for transaction data and one for commit record */
> +	unit_bytes += 2 * log->l_iclog_roundoff;
>   
>   	return unit_bytes;
>   }
>   
> +int
> +xfs_log_calc_unit_res(
> +	struct xfs_mount	*mp,
> +	int			unit_bytes)
> +{
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +}
> +
>   /*
>    * Allocate and initialise a new log ticket.
>    */
> @@ -3515,7 +3506,7 @@ xlog_ticket_alloc(
>   
>   	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>   
> -	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes);
>   
>   	atomic_set(&tic->t_ref, 1);
>   	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 1c6fdbf3d506..037950cf1061 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -436,6 +436,8 @@ struct xlog {
>   #endif
>   	/* log recovery lsn tracking (for buffer submission */
>   	xfs_lsn_t		l_recovery_lsn;
> +
> +	uint32_t		l_iclog_roundoff;/* padding roundoff */
>   };
>   
>   #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 02/39] xfs: separate CIL commit record IO
  2021-05-19 12:12 ` [PATCH 02/39] xfs: separate CIL commit record IO Dave Chinner
@ 2021-05-28  0:54   ` Allison Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Allison Henderson @ 2021-05-28  0:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 5/19/21 5:12 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log.c      | 8 +++++---
>   fs/xfs/xfs_log_cil.c  | 9 +++++++++
>   fs/xfs/xfs_log_priv.h | 2 ++
>   3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 0e563ff8cd3b..4cd5840e953a 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -786,10 +786,12 @@ xfs_log_mount_cancel(
>   }
>   
>   /*
> - * Wait for the iclog to be written disk, or return an error if the log has been
> - * shut down.
> + * Wait for the iclog and all prior iclogs to be written disk as required by the
> + * log force state machine. Waiting on ic_force_wait ensures iclog completions
> + * have been ordered and callbacks run before we are woken here, hence
> + * guaranteeing that all the iclogs up to this one are on stable storage.
>    */
> -static int
> +int
>   xlog_wait_on_iclog(
>   	struct xlog_in_core	*iclog)
>   		__releases(iclog->ic_log->l_icloglock)
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..1e5fd6f268c2 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,15 @@ xlog_cil_push_work(
>   	wake_up_all(&cil->xc_commit_wait);
>   	spin_unlock(&cil->xc_push_lock);
>   
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn) {
> +		spin_lock(&log->l_icloglock);
> +		xlog_wait_on_iclog(commit_iclog->ic_prev);
> +	}
> +
>   	/* release the hounds! */
>   	xfs_log_release_iclog(commit_iclog);
>   	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..ee7786b33da9 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>   	remove_wait_queue(wq, &wait);
>   }
>   
> +int xlog_wait_on_iclog(struct xlog_in_core *iclog);
> +
>   /*
>    * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>    * means that the next log record that includes this metadata could have a
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush
  2021-05-19 12:12 ` [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush Dave Chinner
@ 2021-05-28  0:54   ` Allison Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Allison Henderson @ 2021-05-28  0:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 5/19/21 5:12 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It's a one line wrapper around blkdev_issue_flush(). Just replace it
> with direct calls to blkdev_issue_flush().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
Looks fine
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_buf.c   | 2 +-
>   fs/xfs/xfs_file.c  | 6 +++---
>   fs/xfs/xfs_log.c   | 2 +-
>   fs/xfs/xfs_super.c | 7 -------
>   fs/xfs/xfs_super.h | 1 -
>   5 files changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index a10d49facadf..ebfcba2e8a77 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1945,7 +1945,7 @@ xfs_free_buftarg(
>   	percpu_counter_destroy(&btp->bt_io_count);
>   	list_lru_destroy(&btp->bt_lru);
>   
> -	xfs_blkdev_issue_flush(btp);
> +	blkdev_issue_flush(btp->bt_bdev);
>   
>   	kmem_free(btp);
>   }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c068dcd414f4..e7e9af57e788 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -197,9 +197,9 @@ xfs_file_fsync(
>   	 * inode size in case of an extending write.
>   	 */
>   	if (XFS_IS_REALTIME_INODE(ip))
> -		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
> +		blkdev_issue_flush(mp->m_rtdev_targp->bt_bdev);
>   	else if (mp->m_logdev_targp != mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>   
>   	/*
>   	 * Any inode that has dirty modifications in the log is pinned.  The
> @@ -219,7 +219,7 @@ xfs_file_fsync(
>   	 */
>   	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>   	    mp->m_logdev_targp == mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		blkdev_issue_flush(mp->m_ddev_targp->bt_bdev);
>   
>   	return error;
>   }
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 4cd5840e953a..969eebbf3f64 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1964,7 +1964,7 @@ xlog_sync(
>   	 * layer state machine for preflushes.
>   	 */
>   	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
> +		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
>   		need_flush = false;
>   	}
>   
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 688309dbe18b..e339d1de2419 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -340,13 +340,6 @@ xfs_blkdev_put(
>   		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>   }
>   
> -void
> -xfs_blkdev_issue_flush(
> -	xfs_buftarg_t		*buftarg)
> -{
> -	blkdev_issue_flush(buftarg->bt_bdev);
> -}
> -
>   STATIC void
>   xfs_close_devices(
>   	struct xfs_mount	*mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index d2b40dc60dfc..167d23f92ffe 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -87,7 +87,6 @@ struct xfs_buftarg;
>   struct block_device;
>   
>   extern void xfs_flush_inodes(struct xfs_mount *mp);
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
>   extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
>   					   xfs_agnumber_t agcount);
>   
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 04/39] xfs: async blkdev cache flush
  2021-05-19 12:12 ` [PATCH 04/39] xfs: async blkdev cache flush Dave Chinner
  2021-05-20 23:53   ` Darrick J. Wong
@ 2021-05-28  0:54   ` Allison Henderson
  1 sibling, 0 replies; 86+ messages in thread
From: Allison Henderson @ 2021-05-28  0:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 5/19/21 5:12 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
> 
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
> 
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_bio_io.c | 35 +++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_linux.h  |  2 ++
>   2 files changed, 37 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 17f36db2f792..de727532e137 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -9,6 +9,41 @@ static inline unsigned int bio_max_vecs(unsigned int count)
>   	return bio_max_segs(howmany(count, PAGE_SIZE));
>   }
>   
> +static void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	complete(bio->bi_private);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the request queue does
> + * not require flush operations, just skip it altogether. If the caller needsi
typo nit: needs
Otherwise looks fine
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> + * to wait for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.  The
> + * caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct bio		*bio,
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct request_queue	*q = bdev->bd_disk->queue;
> +
> +	if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
> +		complete(done);
> +		return;
> +	}
> +
> +	bio_init(bio, NULL, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +	bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
> +	submit_bio(bio);
> +}
>   int
>   xfs_rw_bdev(
>   	struct block_device	*bdev,
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index 7688663b9773..c174262a074e 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,8 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>   
>   int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>   		char *data, unsigned int op);
> +void xfs_flush_bdev_async(struct bio *bio, struct block_device *bdev,
> +		struct completion *done);
>   
>   #define ASSERT_ALWAYS(expr)	\
>   	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally
  2021-05-19 12:12 ` [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-05-28  0:54   ` Allison Henderson
  0 siblings, 0 replies; 86+ messages in thread
From: Allison Henderson @ 2021-05-28  0:54 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs



On 5/19/21 5:12 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
> 
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
> 
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
> 
> These rules apply to the atomic transactions recorded in the
> journal, not to the journal IO itself. Hence we need to ensure
> metadata is stable before we start writing a new transaction to the
> journal (guarantee #1), and we need to ensure the entire transaction
> is stable in the journal before we start metadata writeback
> (guarantee #2).
> 
> The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
> being added to every iclog IO. This causes the journal IO to issue a
> cache flush and wait for it to complete before issuing the write IO
> to the journal. Hence all completed metadata IO is guaranteed to be
> stable before the journal overwrites the old metadata.
> 
> However, for long running CIL checkpoints that might do a thousand
> journal IOs, we don't need every single one of these iclog IOs to
> issue a cache flush - the cache flush done before the first iclog is
> submitted is sufficient to cover the entire range in the log that
> the checkpoint will overwrite because the CIL space reservation
> guarantees the tail of the log (completed metadata) is already
> beyond the range of the checkpoint write.
> 
> Hence we only need a full cache flush between closing off the CIL
> checkpoint context (i.e. when the push switches it out) and issuing
> the first journal IO. Rather than plumbing this through to the
> journal IO, we can start this cache flush the moment the CIL context
> is owned exclusively by the push worker. The cache flush can be in
> progress while we process the CIL ready for writing, hence
> reducing the latency of the initial iclog write. This is especially
> true for large checkpoints, where we might have to process hundreds
> of thousands of log vectors before we issue the first iclog write.
> In these cases, it is likely the cache flush has already been
> completed by the time we have built the CIL log vector chain.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_cil.c | 25 +++++++++++++++++++++----
>   1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 1e5fd6f268c2..7b8b7ac85ea9 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -656,6 +656,8 @@ xlog_cil_push_work(
>   	struct xfs_log_vec	lvhdr = { NULL };
>   	xfs_lsn_t		commit_lsn;
>   	xfs_lsn_t		push_seq;
> +	struct bio		bio;
> +	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>   
>   	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>   	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -719,10 +721,19 @@ xlog_cil_push_work(
>   	spin_unlock(&cil->xc_push_lock);
>   
>   	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 */
> +	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> +				&bdev_flush);
> +
> +	/*
> +	 * Pull all the log vectors off the items in the CIL, and remove the
> +	 * items from the CIL. We don't need the CIL lock here because it's only
> +	 * needed on the transaction commit side which is currently locked out
> +	 * by the flush lock.
>   	 */
>   	lv = NULL;
>   	num_iovecs = 0;
> @@ -806,6 +817,12 @@ xlog_cil_push_work(
>   	lvhdr.lv_iovecp = &lhdr;
>   	lvhdr.lv_next = ctx->lv_chain;
>   
> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);
> +
>   	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
>   	if (error)
>   		goto out_abort_free_ticket;
> 

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 20/39] xfs: pass lv chain length into xlog_write()
  2021-05-27 17:20   ` Darrick J. Wong
@ 2021-06-02 22:18     ` Dave Chinner
  2021-06-02 22:24       ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-02 22:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 10:20:27AM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:12:58PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The caller of xlog_write() usually has a close accounting of the
> > aggregated vector length contained in the log vector chain passed to
> > xlog_write(). There is no need to iterate the chain to calculate the
> > length of the data in xlog_write_calc_vec_length() if the caller is
> > already iterating that chain to build it.
> > 
> > Passing in the vector length avoids doing an extra chain iteration,
> > which can be a significant amount of work given that large CIL
> > commits can have hundreds of thousands of vectors attached to the
> > chain.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
....
> > @@ -849,6 +850,10 @@ xlog_cil_push_work(
> >  		lv = item->li_lv;
> >  		item->li_lv = NULL;
> >  		num_iovecs += lv->lv_niovecs;
> > +
> > +		/* we don't write ordered log vectors */
> > +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> > +			num_bytes += lv->lv_bytes;
> >  	}
> >  
> >  	/*
> > @@ -887,6 +892,8 @@ xlog_cil_push_work(
> >  	 * transaction header here as it is not accounted for in xlog_write().
> >  	 */
> >  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> > +	num_iovecs += lvhdr.lv_niovecs;
> 
> I have the same question that Brian had last time, which is: What's the
> point of updating num_iovecs here?  It's not used after
> xlog_cil_build_trans_hdr, either here or at the end of the patchset.
> 
> Is the idea that num_{iovecs,bytes} will always reflect everything
> in the cil context chain that's about to be passed to xlog_write?

I left it there because I did want to keep the two variables up to
date for future use. i.e. I didn't want to leave a landmine later
down the track if I need to use num_iovecs in future changes. I've
also used it a few times for temporary debugging code, so I'd
prefer to keep it even though it isn't used.

But if "not used" is the only reason for people not giving rvbs,
then I can remove it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 22/39] xfs: introduce xlog_write_partial()
  2021-05-27 18:06   ` Darrick J. Wong
@ 2021-06-02 22:21     ` Dave Chinner
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-06-02 22:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 11:06:59AM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:00PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Handle writing of a logvec chain into an iclog that doesn't have
> > enough space to fit it all. The iclog has already been changed to
> > WANT_SYNC by xlog_get_iclog_space(), so the entire remaining space
> > in the iclog is exclusively owned by this logvec chain.
> > 
> > The difference between the single and partial cases is that
> > we end up with partial iovec writes in the iclog and have to split
> > a log vec regions across two iclogs. The state handling for this is
> > currently awful and so we're building up the pieces needed to
> > handle this more cleanly one at a time.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Egad this diff is hard to read.  Brian's right, the patience diff is
> easier to understand and shorter to boot.
> 
> That said, I actually understand what the new code does now, so:
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thx!

> Might be nice to hoist:
> 
> 	memcpy(ptr, reg->i_addr + reg_offset, rlen);
> 	xlog_write_adv_cnt(&ptr, len, log_offset, rlen);
> 	(*record_cnt)++;
> 	*data_cnt += rlen;
> 
> into a helper but it's only four lines so I'm not gonna fuss any
> further.

Agreed, there are opportunities for further factoring and
simplification of this code, but I'll leave that for another
patchset rather than risking destabilisation at this late point.
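
For the record, the hoisted helper would look something like this
(sketch only: the name is invented here, the types follow the existing
xlog_write_adv_cnt() helper in xfs_log_priv.h, and nothing like it is
added in this series):

/*
 * Illustrative helper, not part of this series: copy one region (or
 * the partial remainder of one) into the iclog data area and advance
 * the write state to match.
 */
static void
xlog_write_copy(
	void		**ptr,
	int		*len,
	int		*log_offset,
	void		*src,
	uint32_t	rlen,
	unsigned int	*record_cnt,
	uint32_t	*data_cnt)
{
	memcpy(*ptr, src, rlen);
	xlog_write_adv_cnt(ptr, len, log_offset, rlen);
	(*record_cnt)++;
	*data_cnt += rlen;
}

Each copy site would then collapse to a single xlog_write_copy()
call.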

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 20/39] xfs: pass lv chain length into xlog_write()
  2021-06-02 22:18     ` Dave Chinner
@ 2021-06-02 22:24       ` Darrick J. Wong
  2021-06-02 22:58         ` [PATCH 20/39 V2] " Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-02 22:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 08:18:52AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2021 at 10:20:27AM -0700, Darrick J. Wong wrote:
> > On Wed, May 19, 2021 at 10:12:58PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The caller of xlog_write() usually has a close accounting of the
> > > aggregated vector length contained in the log vector chain passed to
> > > xlog_write(). There is no need to iterate the chain to calculate the
> > > length of the data in xlog_write_calc_vec_length() if the caller is
> > > already iterating that chain to build it.
> > > 
> > > Passing in the vector length avoids doing an extra chain iteration,
> > > which can be a significant amount of work given that large CIL
> > > commits can have hundreds of thousands of vectors attached to the
> > > chain.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ....
> > > @@ -849,6 +850,10 @@ xlog_cil_push_work(
> > >  		lv = item->li_lv;
> > >  		item->li_lv = NULL;
> > >  		num_iovecs += lv->lv_niovecs;
> > > +
> > > +		/* we don't write ordered log vectors */
> > > +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> > > +			num_bytes += lv->lv_bytes;
> > >  	}
> > >  
> > >  	/*
> > > @@ -887,6 +892,8 @@ xlog_cil_push_work(
> > >  	 * transaction header here as it is not accounted for in xlog_write().
> > >  	 */
> > >  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> > > +	num_iovecs += lvhdr.lv_niovecs;
> > 
> > I have the same question that Brian had last time, which is: What's the
> > point of updating num_iovecs here?  It's not used after
> > xlog_cil_build_trans_hdr, either here or at the end of the patchset.
> > 
> > Is the idea that num_{iovecs,bytes} will always reflect everything
> > in the cil context chain that's about to be passed to xlog_write?
> 
> I left it there because I did want to keep the two variables up to
> date for future use. i.e. I didn't want to leave a landmine later
> down the track if I need to use num_iovecs in future changes. I've
> also used it a few times for temporary debugging code, so I'd
> prefer to keep it even though it isn't used.
> 
> But if "not used" is the only reason for people not giving rvbs,
> then I can remove it...

...or feed it to a tracepoint, if you find it useful for debugging the
size of log writes?  <shrug>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* [PATCH 20/39 V2] xfs: pass lv chain length into xlog_write()
  2021-06-02 22:24       ` Darrick J. Wong
@ 2021-06-02 22:58         ` Dave Chinner
  2021-06-02 23:01           ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-02 22:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The caller of xlog_write() usually has a close accounting of the
aggregated vector length contained in the log vector chain passed to
xlog_write(). There is no need to iterate the chain to calculate the
length of the data in xlog_write_calc_vec_length() if the caller is
already iterating that chain to build it.

Passing in the vector length avoids doing an extra chain iteration,
which can be a significant amount of work given that large CIL
commits can have hundreds of thousands of vectors attached to the
chain.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
V2: removed unnecessary update of num_iovecs.

 fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
 fs/xfs/xfs_log_cil.c  | 16 +++++++++++-----
 fs/xfs/xfs_log_priv.h |  2 +-
 3 files changed, 18 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e849f15e9e04..58f9aafce29e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -864,7 +864,8 @@ xlog_write_unmount_record(
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
 		blkdev_issue_flush(log->l_targ->bt_bdev);
-	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
+				reg.i_len);
 }
 
 /*
@@ -1588,7 +1589,8 @@ xlog_commit_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= reg.i_len;
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
+				reg.i_len);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2108,32 +2110,6 @@ xlog_print_trans(
 	}
 }
 
-/*
- * Calculate the potential space needed by the log vector. All regions contain
- * their own opheaders and they are accounted for in region space so we don't
- * need to add them to the vector length here.
- */
-static int
-xlog_write_calc_vec_length(
-	struct xlog_ticket	*ticket,
-	struct xfs_log_vec	*log_vector,
-	uint			optype)
-{
-	struct xfs_log_vec	*lv;
-	int			len = 0;
-	int			i;
-
-	for (lv = log_vector; lv; lv = lv->lv_next) {
-		/* we don't write ordered log vectors */
-		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
-			continue;
-
-		for (i = 0; i < lv->lv_niovecs; i++)
-			len += lv->lv_iovecp[i].i_len;
-	}
-	return len;
-}
-
 static xlog_op_header_t *
 xlog_write_setup_ophdr(
 	struct xlog_op_header	*ophdr,
@@ -2296,13 +2272,13 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			optype)
+	uint			optype,
+	uint32_t		len)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
 	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
 	int			index = 0;
-	int			len;
 	int			partial_copy = 0;
 	int			partial_copy_len = 0;
 	int			contwr = 0;
@@ -2317,7 +2293,6 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
 	if (start_lsn)
 		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 58900171de09..68bec4b81052 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
 				sizeof(struct xfs_trans_header);
 	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
 
-	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
-
 	lvhdr->lv_niovecs = 2;
 	lvhdr->lv_iovecp = &hdr->lhdr[0];
+	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
 	lvhdr->lv_next = ctx->lv_chain;
+
+	tic->t_curr_res -= lvhdr->lv_bytes;
 }
 
 /*
@@ -742,7 +743,8 @@ xlog_cil_push_work(
 	struct xfs_log_vec	*lv;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
-	int			num_iovecs;
+	int			num_iovecs = 0;
+	int			num_bytes = 0;
 	int			error = 0;
 	struct xlog_cil_trans_hdr thdr;
 	struct xfs_log_vec	lvhdr = { NULL };
@@ -835,7 +837,6 @@ xlog_cil_push_work(
 	 * by the flush lock.
 	 */
 	lv = NULL;
-	num_iovecs = 0;
 	while (!list_empty(&cil->xc_cil)) {
 		struct xfs_log_item	*item;
 
@@ -849,6 +850,10 @@ xlog_cil_push_work(
 		lv = item->li_lv;
 		item->li_lv = NULL;
 		num_iovecs += lv->lv_niovecs;
+
+		/* we don't write ordered log vectors */
+		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
+			num_bytes += lv->lv_bytes;
 	}
 
 	/*
@@ -887,6 +892,7 @@ xlog_cil_push_work(
 	 * transaction header here as it is not accounted for in xlog_write().
 	 */
 	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
+	num_bytes += lvhdr.lv_bytes;
 
 	/*
 	 * Before we format and submit the first iclog, we have to ensure that
@@ -901,7 +907,7 @@ xlog_cil_push_work(
 	 * write head.
 	 */
 	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
-				XLOG_START_TRANS);
+				XLOG_START_TRANS, num_bytes);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 301c36165974..eba905c273b0 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint optype);
+		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply related	[flat|nested] 86+ messages in thread

* Re: [PATCH 20/39 V2] xfs: pass lv chain length into xlog_write()
  2021-06-02 22:58         ` [PATCH 20/39 V2] " Dave Chinner
@ 2021-06-02 23:01           ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-02 23:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 08:58:00AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The caller of xlog_write() usually has a close accounting of the
> aggregated vector length contained in the log vector chain passed to
> > xlog_write(). There is no need to iterate the chain to calculate the
> > length of the data in xlog_write_calc_vec_length() if the caller is
> already iterating that chain to build it.
> 
> Passing in the vector length avoids doing an extra chain iteration,
> which can be a significant amount of work given that large CIL
> commits can have hundreds of thousands of vectors attached to the
> chain.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> V2: removed unnecessary update of num_iovecs.

Heh, nice refactor :)
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
>  fs/xfs/xfs_log.c      | 37 ++++++-------------------------------
>  fs/xfs/xfs_log_cil.c  | 16 +++++++++++-----
>  fs/xfs/xfs_log_priv.h |  2 +-
>  3 files changed, 18 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e849f15e9e04..58f9aafce29e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -864,7 +864,8 @@ xlog_write_unmount_record(
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
>  		blkdev_issue_flush(log->l_targ->bt_bdev);
> -	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS,
> +				reg.i_len);
>  }
>  
>  /*
> @@ -1588,7 +1589,8 @@ xlog_commit_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= reg.i_len;
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> +				reg.i_len);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2108,32 +2110,6 @@ xlog_print_trans(
>  	}
>  }
>  
> -/*
> - * Calculate the potential space needed by the log vector. All regions contain
> - * their own opheaders and they are accounted for in region space so we don't
> - * need to add them to the vector length here.
> - */
> -static int
> -xlog_write_calc_vec_length(
> -	struct xlog_ticket	*ticket,
> -	struct xfs_log_vec	*log_vector,
> -	uint			optype)
> -{
> -	struct xfs_log_vec	*lv;
> -	int			len = 0;
> -	int			i;
> -
> -	for (lv = log_vector; lv; lv = lv->lv_next) {
> -		/* we don't write ordered log vectors */
> -		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> -			continue;
> -
> -		for (i = 0; i < lv->lv_niovecs; i++)
> -			len += lv->lv_iovecp[i].i_len;
> -	}
> -	return len;
> -}
> -
>  static xlog_op_header_t *
>  xlog_write_setup_ophdr(
>  	struct xlog_op_header	*ophdr,
> @@ -2296,13 +2272,13 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			optype)
> +	uint			optype,
> +	uint32_t		len)
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
>  	struct xfs_log_iovec	*vecp = lv->lv_iovecp;
>  	int			index = 0;
> -	int			len;
>  	int			partial_copy = 0;
>  	int			partial_copy_len = 0;
>  	int			contwr = 0;
> @@ -2317,7 +2293,6 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
>  	if (start_lsn)
>  		*start_lsn = 0;
>  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 58900171de09..68bec4b81052 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -710,11 +710,12 @@ xlog_cil_build_trans_hdr(
>  				sizeof(struct xfs_trans_header);
>  	hdr->lhdr[1].i_type = XLOG_REG_TYPE_TRANSHDR;
>  
> -	tic->t_curr_res -= hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
> -
>  	lvhdr->lv_niovecs = 2;
>  	lvhdr->lv_iovecp = &hdr->lhdr[0];
> +	lvhdr->lv_bytes = hdr->lhdr[0].i_len + hdr->lhdr[1].i_len;
>  	lvhdr->lv_next = ctx->lv_chain;
> +
> +	tic->t_curr_res -= lvhdr->lv_bytes;
>  }
>  
>  /*
> @@ -742,7 +743,8 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	*lv;
>  	struct xfs_cil_ctx	*new_ctx;
>  	struct xlog_in_core	*commit_iclog;
> -	int			num_iovecs;
> +	int			num_iovecs = 0;
> +	int			num_bytes = 0;
>  	int			error = 0;
>  	struct xlog_cil_trans_hdr thdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> @@ -835,7 +837,6 @@ xlog_cil_push_work(
>  	 * by the flush lock.
>  	 */
>  	lv = NULL;
> -	num_iovecs = 0;
>  	while (!list_empty(&cil->xc_cil)) {
>  		struct xfs_log_item	*item;
>  
> @@ -849,6 +850,10 @@ xlog_cil_push_work(
>  		lv = item->li_lv;
>  		item->li_lv = NULL;
>  		num_iovecs += lv->lv_niovecs;
> +
> +		/* we don't write ordered log vectors */
> +		if (lv->lv_buf_len != XFS_LOG_VEC_ORDERED)
> +			num_bytes += lv->lv_bytes;
>  	}
>  
>  	/*
> @@ -887,6 +892,7 @@ xlog_cil_push_work(
>  	 * transaction header here as it is not accounted for in xlog_write().
>  	 */
>  	xlog_cil_build_trans_hdr(ctx, &thdr, &lvhdr, num_iovecs);
> +	num_bytes += lvhdr.lv_bytes;
>  
>  	/*
>  	 * Before we format and submit the first iclog, we have to ensure that
> @@ -901,7 +907,7 @@ xlog_cil_push_work(
>  	 * write head.
>  	 */
>  	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> -				XLOG_START_TRANS);
> +				XLOG_START_TRANS, num_bytes);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 301c36165974..eba905c273b0 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -459,7 +459,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint optype);
> +		struct xlog_in_core **commit_iclog, uint optype, uint32_t len);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-05-27 18:41   ` Darrick J. Wong
@ 2021-06-02 23:47     ` Dave Chinner
  2021-06-03  1:26       ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-02 23:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 11:41:21AM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:08PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that we have the CIL percpu structures in place, implement the
> > space used counter with a fast sum check similar to the
> > percpu_counter infrastructure.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
> >  fs/xfs/xfs_log_priv.h |  2 +-
> >  2 files changed, 55 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index ba1c6979a4c7..72693fba929b 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
> >  	return ctx;
> >  }
> >  
> > +/*
> > + * Aggregate the CIL per cpu structures into global counts, lists, etc and
> > + * clear the percpu state ready for the next context to use.
> > + */
> > +static void
> > +xlog_cil_pcp_aggregate(
> > +	struct xfs_cil		*cil,
> > +	struct xfs_cil_ctx	*ctx)
> > +{
> > +	struct xlog_cil_pcp	*cilpcp;
> > +	int			cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > +		cilpcp->space_used = 0;
> 
> How does this aggregate anything?  All I see here is zeroing a counter?

Yup, zeroing all the percpu counters is an aggregation function....

By definition "aggregate != sum".

An aggregate is formed by the collection of discrete units into a
larger whole; the collective definition involves manipulating all
discrete units as a single whole entity. e.g. a percpu counter is
an aggregate of percpu variables that, via aggregation, can sum the
discrete variables into a single value. IOWs, percpu_counter_sum()
is an aggregation function that sums...

> I see that we /can/ add the percpu space_used counter to the cil context
> if we're over the space limits, but I don't actually see where...

In this case, the global CIL space used counter is summed by the
per-cpu counter update context and not an aggregation context. For
it to work as a global counter from a distinct point in time, it
needs an aggregation operation that zeros all the discrete units of
the counter at a single point in time. IOWs, the aggregation
function of this counter is a zeroing operation, not a summing
operation. This is what xlog_cil_pcp_aggregate() is doing here.

Put simply, an aggregation function is not a summing function, but a
function that operates on all the discrete units of the
aggregate so that it can operate correctly as a single unit....

I don't know of a better way of describing what this function does.
At the end of the series, this function will zero some units. In
other cases it will sum units. In some cases it will do both. Not to
mention that it will merge discrete lists into a global list. And so
on. The only common thing between these operations is that they are
all aggregation functions that allow the CIL context to operate as a
whole unit...

If you've got a better name, then I'm all ears :)
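
If it helps, here's a minimal self-contained sketch of the idea (the
demo_* names are made up for illustration; this is not the patch
code). The update side folds, the aggregation side zeros:

#include <linux/percpu.h>
#include <linux/atomic.h>

struct demo_pcp {
	int			space_used;
};

struct demo_ctx {
	atomic_t		space_used;	/* global sum */
	struct demo_pcp __percpu *pcp;
};

/*
 * Update side: accumulate locally, fold into the global counter only
 * once the local delta exceeds the current fold threshold.
 */
static void
demo_account(struct demo_ctx *ctx, int len, int threshold)
{
	struct demo_pcp	*p = get_cpu_ptr(ctx->pcp);

	p->space_used += len;
	if (p->space_used > threshold) {
		atomic_add(p->space_used, &ctx->space_used);
		p->space_used = 0;
	}
	put_cpu_ptr(ctx->pcp);
}

/*
 * Aggregation side: runs once when the context is switched out. The
 * aggregation here is the zeroing pass - any sub-threshold residue is
 * deliberately dropped as slop, because accuracy only matters near
 * the limit, where the update side folds immediately.
 */
static void
demo_aggregate(struct demo_ctx *ctx)
{
	int	cpu;

	for_each_online_cpu(cpu)
		per_cpu_ptr(ctx->pcp, cpu)->space_used = 0;
}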

....

> > @@ -480,16 +501,34 @@ xlog_cil_insert_items(
> >  		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
> >  	}
> >  
> > +	/*
> > +	 * Update the CIL percpu pointer. This updates the global counter when
> > +	 * over the percpu batch size or when the CIL is over the space limit.
> > +	 * This means low lock overhead for normal updates, and when over the
> > +	 * limit the space used is immediately accounted. This makes enforcing
> > +	 * the hard limit much more accurate. The per cpu fold threshold is
> > +	 * based on how close we are to the hard limit.
> > +	 */
> > +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > +	cilpcp->space_used += len;
> > +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> > +	    cilpcp->space_used >
> > +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > +					num_online_cpus())) {
> > +		atomic_add(cilpcp->space_used, &ctx->space_used);
> > +		cilpcp->space_used = 0;
> > +	}
> > +	put_cpu_ptr(cilpcp);
> > +
> >  	spin_lock(&cil->xc_cil_lock);
> > -	tp->t_ticket->t_curr_res -= ctx_res + len;
> >  	ctx->ticket->t_unit_res += ctx_res;
> >  	ctx->ticket->t_curr_res += ctx_res;
> > -	ctx->space_used += len;
> 
> ...this update happens if we're not over the space limit?

It's the second case in the above if statement. As the space used in
the percpu pointer goes over its fraction of the remaining space
limit (limit remaining / num_cpus_online), then it adds the
pcp counter back into the global counter. Essentially it is:

	if (over push threshold ||
>>>>>>	    pcp->used > ((hard limit - ctx->space_used) / cpus)) {
		ctx->space_used += pcp->used;
		pcp->used = 0;
	}

Hence, to begin with, the percpu counter is allowed to sum a large
chunk of space before it trips the per CPU summing threshold. When
summing occurs, the per-cpu threshold goes down, meaning the pcp
counters will trip sooner in the next cycle.

IOWs, the summing threshold gets closer to zero the closer the
global count gets to the hard limit. Hence when there's lots of
space available, we have little summing contention, but when we
are close to the blocking limit we essentially update the global
counter on every modification.

As such, we get scalability when the CIL is empty by trading off
accuracy, but we get accuracy when it is nearing full by trading off
scalability. We might need to tweak it for really large CPU counts
(maybe use log2(num_online_cpus())), but fundamentally the algorithm
is designed to scale according to how close we are to the push
thresholds....
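
To put numbers on it, the fold threshold is just each CPU's share of
the remaining headroom below the blocking limit. A sketch (name and
numbers illustrative, not the patch code):

/*
 * Each cpu may accumulate its share of the headroom below the
 * blocking limit before it must fold into the global counter. As
 * space_used approaches the hard limit this tends towards zero, so
 * updates near the limit hit the global counter every time.
 */
static int
demo_fold_threshold(int hard_limit, int space_used)
{
	return (hard_limit - space_used) / num_online_cpus();
}

e.g. with a 64MB blocking limit and 32 CPUs, an empty context gives
each CPU 2MB of local slack before it folds; at 62MB used that slack
has shrunk to 64kB per CPU.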

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-05-27 19:00   ` Darrick J. Wong
@ 2021-06-03  0:16     ` Dave Chinner
  2021-06-03  0:49       ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  0:16 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 12:00:23PM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:11PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Before we split the ordered CIL up into per cpu lists, we need a
> > mechanism to track the order of the items in the CIL. We need to do
> > this because there are rules around the order in which related items
> > must physically appear in the log even inside a single checkpoint
> > transaction.
> > 
> > An example of this is intents - an intent must appear in the log
> > before it's intent done record so taht log recovery can cancel the
> 
> s/taht/that/
> 
> > intent correctly. If we have these two records misordered in the
> > CIL, then they will not be recovered correctly by journal replay.
> > 
> > We also will not be able to move items to the tail of
> > the CIL list when they are relogged, hence the log items will need
> > some mechanism to allow the correct log item order to be recreated
> > before we write log items to the journal.
> > 
> > Hence we need to have a mechanism for recording global order of
> > transactions in the log items so that we can recover that order
> > from un-ordered per-cpu lists.
> > 
> > Do this with a simple monotonic increasing commit counter in the CIL
> > context. Each log item in the transaction gets stamped with the
> > current commit order ID before it is added to the CIL. If the item
> > is already in the CIL, leave it where it is instead of moving it to
> > the tail of the list and instead sort the list before we start the
> > push work.
> > 
> > XXX: list_sort() under the cil_ctx_lock held exclusive starts
> > hurting that >16 threads. Front end commits are waiting on the push
> > to switch contexts much longer. The item order id should likely be
> > moved into the logvecs when they are detached from the items, then
> > the sort can be done on the logvec after the cil_ctx_lock has been
> > released. logvecs will need to use a list_head for this rather than
> > a single linked list like they do now....
> 
> ...which I guess happens in patch 35 now?

Right. I'll just remove this from the commit message.

> > @@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
> >  	tic->t_curr_res -= lvhdr->lv_bytes;
> >  }
> >  
> > +/*
> > + * CIL item reordering compare function. We want to order in ascending ID order,
> > + * but we want to leave items with the same ID in the order they were added to
> 
> When do we have items with the same id?

All the items in a single transaction have the same id. The order id
increments before we tag all the items in the transaction and insert
them into the CIL.

> I guess that happens if we have multiple transactions adding items to
> the cil at the same time?  I guess that's not a big deal since each of
> those threads will hold a disjoint set of locks, so even if the order
> ids are the same for a bunch of items, they're never going to be
> touching the same AG/inode/metadata object, right?
>
> If that's correct, then:
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>


While true, it's not the way this works so I won't immediately
accept your RVB. The reason for not changing the ordering within a
single transaction is actually intent logging.  i.e. this:

> > + * the list. This is important for operations like reflink where we log 4 order
> > + * dependent intents in a single transaction when we overwrite an existing
> > + * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
> > + * CUI (inc), BUI(remap)...

There's a specific order of operations that recovery must run these
intents in, and so if we re-order them here in the CIL they'll be
out of order in the log and recovery will replay the intents in the
wrong order. Replaying the intents in the wrong order results in
corruption warnings and assert failures during log recovery, hence
the constraint of not re-ordering items within the same transaction.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 34/39] xfs: convert CIL to unordered per cpu lists
  2021-05-27 19:03   ` Darrick J. Wong
@ 2021-06-03  0:27     ` Dave Chinner
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  0:27 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 12:03:18PM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:12PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > So that we can remove the cil_lock which is a global serialisation
> > point. We've already got ordering sorted, so all we need to do is
> > treat the CIL list like the busy extent list and reconstruct it
> > before the push starts.
> > 
> > This is what we're trying to avoid:
> > 
> >  -   75.35%     1.83%  [kernel]            [k] xfs_log_commit_cil
> >     - 46.35% xfs_log_commit_cil
> >        - 41.54% _raw_spin_lock
> >           - 67.30% do_raw_spin_lock
> >                66.96% __pv_queued_spin_lock_slowpath
> > 
> > Which happens on a 32p system when running a 32-way 'rm -rf'
> > workload. After this patch:
> > 
> > -   20.90%     3.23%  [kernel]               [k] xfs_log_commit_cil
> >    - 17.67% xfs_log_commit_cil
> >       - 6.51% xfs_log_ticket_ungrant
> >            1.40% xfs_log_space_wake
> >         2.32% memcpy_erms
> >       - 2.18% xfs_buf_item_committing
> >          - 2.12% xfs_buf_item_release
> >             - 1.03% xfs_buf_unlock
> >                  0.96% up
> >               0.72% xfs_buf_rele
> >         1.33% xfs_inode_item_format
> >         1.19% down_read
> >         0.91% up_read
> >         0.76% xfs_buf_item_format
> >       - 0.68% kmem_alloc_large
> >          - 0.67% kmem_alloc
> >               0.64% __kmalloc
> >         0.50% xfs_buf_item_size
> > 
> > It kinda looks like the workload is running out of log space all
> > the time. But all the spinlock contention is gone and the
> > transaction commit rate has gone from 800k/s to 1.3M/s so the amount
> > of real work being done has gone up a *lot*.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c  | 69 +++++++++++++++++++------------------------
> >  fs/xfs/xfs_log_priv.h |  3 +-
> >  2 files changed, 31 insertions(+), 41 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index ca6e411e388e..287dc7d0d508 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -72,6 +72,7 @@ xlog_cil_ctx_alloc(void)
> >  	ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS);
> >  	INIT_LIST_HEAD(&ctx->committing);
> >  	INIT_LIST_HEAD(&ctx->busy_extents);
> > +	INIT_LIST_HEAD(&ctx->log_items);
> 
> I see you moved the log item list to the cil ctx for benefit of
> _pcp_dead, correct?

Largely, yes. It also helps to have the item push list rooted in the
structure that holds all of the push specific state (i.e. the CIL
ctx) once we detach that from the CIL itself.

> If so, then this isn't especially different from the last version.

*nod*

> Yay for shortening lock critical sections,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Ta.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 35/39] xfs: convert log vector chain to use list heads
  2021-05-27 19:13   ` Darrick J. Wong
@ 2021-06-03  0:38     ` Dave Chinner
  2021-06-03  0:50       ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  0:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 12:13:19PM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:13PM +1000, Dave Chinner wrote:
.....
> > @@ -913,25 +912,23 @@ xlog_cil_push_work(
> >  	xlog_cil_pcp_aggregate(cil, ctx);
> >  
> >  	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
> > -
> >  	while (!list_empty(&ctx->log_items)) {
> >  		struct xfs_log_item	*item;
> >  
> >  		item = list_first_entry(&ctx->log_items,
> >  					struct xfs_log_item, li_cil);
> > +		lv = item->li_lv;
> >  		list_del_init(&item->li_cil);
> >  		item->li_order_id = 0;
> > -		if (!ctx->lv_chain)
> > -			ctx->lv_chain = item->li_lv;
> > -		else
> > -			lv->lv_next = item->li_lv;
> > -		lv = item->li_lv;
> >  		item->li_lv = NULL;
> > -		num_iovecs += lv->lv_niovecs;
> >  
> > +		num_iovecs += lv->lv_niovecs;
> 
> Not sure why "lv = item->li_lv" needed to move up?
>
> I think the only change needed here is replacing the lv_chain/lv_next
> business with the list_add_tail?

Yes, but someone complained about the awful diff in the next patch,
so moving the "lv = item->li_lv" made the diff in the next patch
much, much cleaner...

<shrug>

I can move it back to the next patch if you really want, but it's
really just shuffling deck chairs at this point...

> > @@ -985,8 +985,14 @@ xlog_cil_push_work(
> >  	 * use the commit record lsn then we can move the tail beyond the grant
> >  	 * write head.
> >  	 */
> > -	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> > -				num_bytes);
> > +	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
> > +				NULL, num_bytes);
> > +
> > +	/*
> > +	 * Take the lvhdr back off the lv_chain as it should not be passed
> > +	 * to log IO completion.
> > +	 */
> > +	list_del(&lvhdr.lv_list);
> 
> Seems a little clunky, but I guess I see why it's needed.

I could replace the stack structure with a memory allocation and
then we wouldn't need to care, but I'm trying to keep memory
allocation out of this fast path as much as possible....

> I /think/ I don't see any place where the onstack lvhdr can escape out
> of the chain after _push_work returns, so this is safe enough.

It can't, because we own the chain here and are completely
responsible for cleaning it up on failure.
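
The pattern itself is trivial to sketch (demo_* names are
illustrative; demo_write() stands in for xlog_write() and is assumed
to consume the chain synchronously):

#include <linux/list.h>

struct demo_vec {
	struct list_head	lv_list;
	int			lv_bytes;
};

int demo_write(struct list_head *chain);

/*
 * Link an on-stack header into a caller-owned chain for the duration
 * of one synchronous call, then unlink it before the stack frame dies
 * so no pointer to stack memory can escape via the chain.
 */
static int
demo_push(struct list_head *chain)
{
	struct demo_vec	hdr = { .lv_bytes = 0 };
	int		error;

	list_add(&hdr.lv_list, chain);	/* header leads the chain */
	error = demo_write(chain);
	list_del(&hdr.lv_list);		/* must precede return */
	return error;
}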

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-06-03  0:16     ` Dave Chinner
@ 2021-06-03  0:49       ` Darrick J. Wong
  2021-06-03  2:13         ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-03  0:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 10:16:22AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2021 at 12:00:23PM -0700, Darrick J. Wong wrote:
> > On Wed, May 19, 2021 at 10:13:11PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Before we split the ordered CIL up into per cpu lists, we need a
> > > mechanism to track the order of the items in the CIL. We need to do
> > > this because there are rules around the order in which related items
> > > must physically appear in the log even inside a single checkpoint
> > > transaction.
> > > 
> > > An example of this is intents - an intent must appear in the log
> > > before it's intent done record so taht log recovery can cancel the
> > 
> > s/taht/that/
> > 
> > > intent correctly. If we have these two records misordered in the
> > > CIL, then they will not be recovered correctly by journal replay.
> > > 
> > > We also will not be able to move items to the tail of
> > > the CIL list when they are relogged, hence the log items will need
> > > some mechanism to allow the correct log item order to be recreated
> > > before we write log items to the journal.
> > > 
> > > Hence we need to have a mechanism for recording global order of
> > > transactions in the log items so that we can recover that order
> > > from un-ordered per-cpu lists.
> > > 
> > > Do this with a simple monotonic increasing commit counter in the CIL
> > > context. Each log item in the transaction gets stamped with the
> > > current commit order ID before it is added to the CIL. If the item
> > > is already in the CIL, leave it where it is instead of moving it to
> > > the tail of the list and instead sort the list before we start the
> > > push work.
> > > 
> > > XXX: list_sort() under the cil_ctx_lock held exclusive starts
> > > hurting at >16 threads. Front end commits are waiting on the push
> > > to switch contexts much longer. The item order id should likely be
> > > moved into the logvecs when they are detached from the items, then
> > > the sort can be done on the logvec after the cil_ctx_lock has been
> > > released. logvecs will need to use a list_head for this rather than
> > > a single linked list like they do now....
> > 
> > ...which I guess happens in patch 35 now?
> 
> Right. I'll just remove this from the commit message.
> 
> > > @@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
> > >  	tic->t_curr_res -= lvhdr->lv_bytes;
> > >  }
> > >  
> > > +/*
> > > + * CIL item reordering compare function. We want to order in ascending ID order,
> > > + * but we want to leave items with the same ID in the order they were added to
> > 
> > When do we have items with the same id?
> 
> All the items in a single transaction have the same id. The order id
> increments before we tag all the items in the transaction and insert
> them into the CIL.
> 
> > I guess that happens if we have multiple transactions adding items to
> > the cil at the same time?  I guess that's not a big deal since each of
> > those threads will hold a disjoint set of locks, so even if the order
> > ids are the same for a bunch of items, they're never going to be
> > touching the same AG/inode/metadata object, right?
> >
> > If that's correct, then:
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> 
> While true, it's not the way this works so I won't immediately
> accept your RVB. The reason for not changing the ordering within a
> single transaction is actually intent logging.  i.e. this:
> 
> > > + * the list. This is important for operations like reflink where we log 4 order
> > > + * dependent intents in a single transaction when we overwrite an existing
> > > + * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
> > > + * CUI (inc), BUI(remap)...
> 
> There's a specific order of operations that recovery must run these
> intents in, and so if we re-order them here in the CIL they'll be
> out of order in the log and recovery will replay the intents in the
> wrong order. Replaying the intents in the wrong order results in
> corruption warnings and assert failures during log recovery, hence
> the constraint of not re-ordering items within the same transaction.

<ding> lightbulb comes on.  I think I understood this better the last
time I read all these patches. :/

Basically, for each item that can be attached to a transaction, you're
assigning it an "order id" that is a monotonically increasing counter
that (roughly) records the last time the item was committed.  Certain
items (like inodes) can be relogged and committed multiple times in
rapid fire succession, in which case the order_id will get bumped
forward.

In the /next/ patch you'll change the cil item list to be per-cpu and
only splice the mess together at cil push time.  For that to work
properly, you have to re-sort that resulting list in commit order (aka
the order_id) to keep the items in order of commit.

For items *within* a transaction, you take advantage of the property
of list_sort that it won't reorder items with cmp(a, b) == 0, which
means that all the intents logged to a transaction will maintain the
same order that the author of higher level code wrote into the software.
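
Something like this, if I'm remembering the patch right - ascending
order_id, returning 0 for equal IDs so list_sort() leaves them alone
(a sketch from memory, not a quote of the patch):

	static int
	xlog_cil_order_cmp(
		void			*priv,
		const struct list_head	*a,
		const struct list_head	*b)
	{
		struct xfs_log_item	*l1 =
			container_of(a, struct xfs_log_item, li_cil);
		struct xfs_log_item	*l2 =
			container_of(b, struct xfs_log_item, li_cil);

		/* equal IDs compare as 0, so their relative order is kept */
		return l1->li_order_id > l2->li_order_id;
	}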

Question: xlog_cil_push_work zeroes the order_id of pushed log items.
Is there any potential problem here when ctx->order_id wraps around to
zero?  I think the answer is that we'll move on to a new cil context
long before we hit 2^32-1 transactions?

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 35/39] xfs: convert log vector chain to use list heads
  2021-06-03  0:38     ` Dave Chinner
@ 2021-06-03  0:50       ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-03  0:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 10:38:19AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2021 at 12:13:19PM -0700, Darrick J. Wong wrote:
> > On Wed, May 19, 2021 at 10:13:13PM +1000, Dave Chinner wrote:
> .....
> > > @@ -913,25 +912,23 @@ xlog_cil_push_work(
> > >  	xlog_cil_pcp_aggregate(cil, ctx);
> > >  
> > >  	list_sort(NULL, &ctx->log_items, xlog_cil_order_cmp);
> > > -
> > >  	while (!list_empty(&ctx->log_items)) {
> > >  		struct xfs_log_item	*item;
> > >  
> > >  		item = list_first_entry(&ctx->log_items,
> > >  					struct xfs_log_item, li_cil);
> > > +		lv = item->li_lv;
> > >  		list_del_init(&item->li_cil);
> > >  		item->li_order_id = 0;
> > > -		if (!ctx->lv_chain)
> > > -			ctx->lv_chain = item->li_lv;
> > > -		else
> > > -			lv->lv_next = item->li_lv;
> > > -		lv = item->li_lv;
> > >  		item->li_lv = NULL;
> > > -		num_iovecs += lv->lv_niovecs;
> > >  
> > > +		num_iovecs += lv->lv_niovecs;
> > 
> > Not sure why "lv = item->li_lv" needed to move up?
> >
> > I think the only change needed here is replacing the lv_chain/lv_next
> > business with the list_add_tail?
> 
> Yes, but someone complained about the awful diff in the next patch,
> so moving the "lv = item->li_lv" made the diff in the next patch
> much, much cleaner...
> 
> <shrug>
> 
> I can move it back to the next patch if you really want, but it's
> really just shuffling deck chairs at this point...

Nope, don't care that much.

> > > @@ -985,8 +985,14 @@ xlog_cil_push_work(
> > >  	 * use the commit record lsn then we can move the tail beyond the grant
> > >  	 * write head.
> > >  	 */
> > > -	error = xlog_write(log, &lvhdr, ctx->ticket, &ctx->start_lsn, NULL,
> > > -				num_bytes);
> > > +	error = xlog_write(log, &ctx->lv_chain, ctx->ticket, &ctx->start_lsn,
> > > +				NULL, num_bytes);
> > > +
> > > +	/*
> > > +	 * Take the lvhdr back off the lv_chain as it should not be passed
> > > +	 * to log IO completion.
> > > +	 */
> > > +	list_del(&lvhdr.lv_list);
> > 
> > Seems a little clunky, but I guess I see why it's needed.
> 
> I could replace the stack structure with a memory allocation and
> then we wouldn't need to care, but I'm trying to keep memory
> allocation out of this fast path as much as possible....

Oh, that's much worse.

> > I /think/ I don't see any place where the onstack lvhdr can escape out
> > of the chain after _push_work returns, so this is safe enough.
> 
> It can't, because we own the chain here and are completely
> responsible for cleaning it up on failure.

Ok.  I think I'm satisfied now:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 39/39] xfs: expanding delayed logging design with background material
  2021-05-27 20:38   ` Darrick J. Wong
@ 2021-06-03  0:57     ` Dave Chinner
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, May 27, 2021 at 01:38:44PM -0700, Darrick J. Wong wrote:
> On Wed, May 19, 2021 at 10:13:17PM +1000, Dave Chinner wrote:
> > +As a result, permanent transactions only "regrant" reservation space during
> > +xfs_trans_commit() calls, while the physical log space reservation - tracked by
> > +the write head - is then reserved separately by a call to xfs_log_reserve()
> > +after the commit completes. Once the commit completes, we can sleep waiting for
> > +physical log space to be reserved from the write grant head, but only if one
> > +critical rule has been observed::
> > +
> > +	Code using permanent reservations must always log the items they hold
> > +	locked across each transaction they roll in the chain.
> > +
> > +"Re-logging" the locked items on every transaction roll ensures that the items
> > +the transaction chain is rolling are always relocated to the physical head of
> 
> This reads (to me) a little awkwardly.  One could ask if the transaction
> chain itself is rolling the items?  Which is not really what's
> happening.  How about:
> 
> "...ensures that the items attached to the transaction chain being
> rolled are always relocated..."

Fixed.
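
To illustrate the rule for anyone unfamiliar with it, it is just the
usual rolling transaction pattern - a sketch of the common inode case
with a hypothetical do_one_unit_of_work() helper, not code from this
patch:

	/* the inode is held locked and joined across the whole chain */
	xfs_trans_ijoin(tp, ip, 0);

	while (!done) {
		done = do_one_unit_of_work(tp, ip);

		/* relog the held item so it moves to the log head... */
		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);

		/* ...then commit and regrant the permanent reservation */
		error = xfs_trans_roll(&tp);
		if (error)
			break;

		/* rejoin the still-locked inode to the new transaction */
		xfs_trans_ijoin(tp, ip, 0);
	}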

> > -This relogging is also used to implement long-running, multiple-commit
> > -transactions.  These transaction are known as rolling transactions, and require
> > -a special log reservation known as a permanent transaction reservation. A
> > -typical example of a rolling transaction is the removal of extents from an
> > +A typical example of a rolling transaction is the removal of extents from an
> >  inode which can only be done at a rate of two extents per transaction because
> 
> Ignoring rt files, do we even have /that/ limit anymore?  Especially
> considering the other patchset you just sent... :)

Well.... Remind me to make this update in that patch set if I
forget to do it. :P

> With that one odd sentence up there reworked,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thx!

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-06-02 23:47     ` Dave Chinner
@ 2021-06-03  1:26       ` Darrick J. Wong
  2021-06-03  2:28         ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-03  1:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 09:47:47AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2021 at 11:41:21AM -0700, Darrick J. Wong wrote:
> > On Wed, May 19, 2021 at 10:13:08PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now that we have the CIL percpu structures in place, implement the
> > > space used counter with a fast sum check similar to the
> > > percpu_counter infrastructure.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
> > >  fs/xfs/xfs_log_priv.h |  2 +-
> > >  2 files changed, 55 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index ba1c6979a4c7..72693fba929b 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
> > >  	return ctx;
> > >  }
> > >  
> > > +/*
> > > + * Aggregate the CIL per cpu structures into global counts, lists, etc and
> > > + * clear the percpu state ready for the next context to use.
> > > + */
> > > +static void
> > > +xlog_cil_pcp_aggregate(
> > > +	struct xfs_cil		*cil,
> > > +	struct xfs_cil_ctx	*ctx)
> > > +{
> > > +	struct xlog_cil_pcp	*cilpcp;
> > > +	int			cpu;
> > > +
> > > +	for_each_online_cpu(cpu) {
> > > +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > > +		cilpcp->space_used = 0;
> > 
> > How does this aggregate anything?  All I see here is zeroing a counter?
> 
> Yup, zeroing all the percpu counters is an aggregation function....
> 
> By definition "aggregate != sum".
> 
> An aggregate is formed by the collection of discrete units into a
> larger whole; the collective definition involves manipulating all
> discrete units as a single whole entity. e.g. a percpu counter is
> an aggregate of percpu variables that, via aggregation, can sum the
> discrete variables into a single value. IOWs, percpu_counter_sum()
> is an aggregation function that sums...
> 
> > I see that we /can/ add the percpu space_used counter to the cil context
> > if we're over the space limits, but I don't actually see where...
> 
> In this case, the global CIL space used counter is summed by the
> per-cpu counter update context and not an aggregation context. For
> it to work as a global counter since a distinct point in time, it
> needs an aggregation operation that zeros all the discrete units of
> the counter at a single point in time. IOWs, the aggregation
> function of this counter is a zeroing operation, not a summing
> operation. This is what xlog_cil_pcp_aggregate() is doing here.
> 
> Put simply, an aggregation function is not a summing function, but a
> function that operates on all the discrete units of the
> aggregate so that it can operate correctly as a single unit....

<nod> I grok what 'aggregate' means as a general term, though perhaps I
was too quick to associate it with 'sum' here.

> I don't know of a better way of describing what this function does.
> At the end of the series, this function will zero some units. In
> other cases it will sum units. In some cases it will do both. Not to
> mention that it will merge discrete lists into a global list. And so
> on. The only common thing between these operations is that they are
> all aggregation functions that allow the CIL context to operate as a
> whole unit...

*Oh* I think I realized where my understanding gap lies.

space_used isn't part of some global space accounting scheme that has to
be kept accurate.  It's a per-cil-context variable that we use to
throttle incoming commits when the current context starts to get full.
That's why _cil_insert_items only bothers to add the per-cpu space_used
to the ctx space_used if either (a) the current cpu has used up more
than its slice of space or (b) enough cpus have hit (a) to push the
ctx's space_used above the XLOG_CIL_SPACE_LIMIT.  If no cpus have hit
(a) then the current cil context still has plenty of space and there
isn't any need to throttle the frontend.
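
(Hypothetical numbers to check my understanding: with a 64MB blocking
limit, nothing yet folded into the global count and 16 online CPUs,
each CPU can batch up to 64MB / 16 = 4MB locally before folding; once
the global count has grown to, say, 48MB, the per-cpu batch shrinks to
(64MB - 48MB) / 16 = 1MB.)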

By the time we get to the aggregation step in _cil_push_work we've
already decided to install a new context and write the current context
to disk.  We don't care about throttling the frontend on this (closed)
context and we hold xc_ctx_lock so nobody can see the new context.
That's why the aggregate function zeroes the per-cpu space_used.
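
So by the end of the series I'd guess the aggregate ends up looking
something like this sketch - zeroing, summing and list splicing in a
single pass, reusing the local variables of the posted function; my
guess at the shape, not the posted code:

	for_each_online_cpu(cpu) {
		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);

		/* return unused ticket reservation to the context */
		ctx->ticket->t_curr_res += cilpcp->space_reserved;
		cilpcp->space_reserved = 0;

		/* merge the per-cpu lists into the global context lists */
		if (!list_empty(&cilpcp->busy_extents))
			list_splice_init(&cilpcp->busy_extents,
					&ctx->busy_extents);
		if (!list_empty(&cilpcp->log_items))
			list_splice_init(&cilpcp->log_items,
					&ctx->log_items);

		/* reset the space counter for the next context */
		cilpcp->space_used = 0;
	}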

> If you've got a better name, then I'm all ears :)

Not really.

> ....
> 
> > > @@ -480,16 +501,34 @@ xlog_cil_insert_items(
> > >  		atomic_sub(tp->t_ticket->t_iclog_hdrs, &cil->xc_iclog_hdrs);
> > >  	}
> > >  
> > > +	/*
> > > +	 * Update the CIL percpu pointer. This updates the global counter when
> > > +	 * over the percpu batch size or when the CIL is over the space limit.
> > > +	 * This means low lock overhead for normal updates, and when over the
> > > +	 * limit the space used is immediately accounted. This makes enforcing
> > > +	 * the hard limit much more accurate. The per cpu fold threshold is
> > > +	 * based on how close we are to the hard limit.
> > > +	 */
> > > +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > > +	cilpcp->space_used += len;
> > > +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> > > +	    cilpcp->space_used >
> > > +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > > +					num_online_cpus())) {
> > > +		atomic_add(cilpcp->space_used, &ctx->space_used);
> > > +		cilpcp->space_used = 0;
> > > +	}
> > > +	put_cpu_ptr(cilpcp);
> > > +
> > >  	spin_lock(&cil->xc_cil_lock);
> > > -	tp->t_ticket->t_curr_res -= ctx_res + len;
> > >  	ctx->ticket->t_unit_res += ctx_res;
> > >  	ctx->ticket->t_curr_res += ctx_res;
> > > -	ctx->space_used += len;
> > 
> > ...this update happens if we're not over the space limit?
> 
> It's the second case in the above if statement. As the space used in
> the percpu pointer goes over its fraction of the remaining space
> limit (limit remaining / num_cpus_online), then it adds the
> pcp counter back into the global counter. Essentially it is:
> 
> 	if (over push threshold ||
> >>>>>>	    pcp->used > ((hard limit - ctx->space_used) / cpus)) {
> 		ctx->space_used += pcp->used;
> 		pcp->used = 0;
> 	}
> 
> Hence, to begin with, the percpu counter is allowed to sum a large
> chunk of space before it trips the per CPU summing threshold. When
> summing occurs, the per-cpu threshold goes down, meaning the pcp
> counters will trip sooner in the next cycle.
> 
> IOWs, the summing threshold gets closer to zero the closer the
> global count gets to the hard limit. Hence when there's lots of
> space available, we have little summing contention, but when we
> are close to the blocking limit we essentially update the global
> counter on every modification.

Ok, I think I get it now.  The statement "ctx->space_used = 0" is part
of clearing the percpu state and is not part of aggregating the CIL per
cpu structure into the context.

So assuming that I grokked it all on the second try, maybe a comment is
in order for the aggregate function?

	/*
	 * We're in the middle of switching cil contexts.  Reset the
	 * counter we use to detect when the current context is nearing
	 * full.
	 */
	ctx->space_used = 0;

> As such, we get scalability when the CIL is empty by trading off
> accuracy, but we get accuracy when it is nearing full by trading off
> scalability. We might need to tweak it for really large CPU counts
> (maybe use log2(num_online_cpus())), but fundamentally the algorithm
> is designed to scale according to how close we are to the push
> thresholds....

<nod> I suppose if you had a very large number of CPUs and a very small
log then the slivers could be zero ... which just means we'd lose
performance due to overly-careful accounting.  I wonder if you want to
leave a breadcrumb to warn if:

XLOG_CIL_BLOCKING_SPACE_LIMIT() / num_online_cpus() == 0

just in case someone ever wanders in with such a configuration?
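
Something like this hypothetical breadcrumb is what I have in mind
(untested, message text made up):

	if (XLOG_CIL_BLOCKING_SPACE_LIMIT(log) / num_online_cpus() == 0)
		xfs_warn(log->l_mp,
	"CIL space limit too small for %d CPUs, accounting will be slow",
				num_online_cpus());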

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-06-03  0:49       ` Darrick J. Wong
@ 2021-06-03  2:13         ` Dave Chinner
  2021-06-03  3:02           ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  2:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 02, 2021 at 05:49:14PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 03, 2021 at 10:16:22AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2021 at 12:00:23PM -0700, Darrick J. Wong wrote:
> > > On Wed, May 19, 2021 at 10:13:11PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Before we split the ordered CIL up into per cpu lists, we need a
> > > > mechanism to track the order of the items in the CIL. We need to do
> > > > this because there are rules around the order in which related items
> > > > must physically appear in the log even inside a single checkpoint
> > > > transaction.
> > > > 
> > > > An example of this is intents - an intent must appear in the log
> > > > before it's intent done record so taht log recovery can cancel the
> > > 
> > > s/taht/that/
> > > 
> > > > intent correctly. If we have these two records misordered in the
> > > > CIL, then they will not be recovered correctly by journal replay.
> > > > 
> > > > We also will not be able to move items to the tail of
> > > > the CIL list when they are relogged, hence the log items will need
> > > > some mechanism to allow the correct log item order to be recreated
> > > > before we write log items to the journal.
> > > > 
> > > > Hence we need to have a mechanism for recording global order of
> > > > transactions in the log items  so that we can recover that order
> > > > from un-ordered per-cpu lists.
> > > > 
> > > > Do this with a simple monotonic increasing commit counter in the CIL
> > > > context. Each log item in the transaction gets stamped with the
> > > > current commit order ID before it is added to the CIL. If the item
> > > > is already in the CIL, leave it where it is instead of moving it to
> > > > the tail of the list and instead sort the list before we start the
> > > > push work.
> > > > 
> > > > XXX: list_sort() under the cil_ctx_lock held exclusive starts
> > > > hurting at >16 threads. Front end commits are waiting on the push
> > > > to switch contexts much longer. The item order id should likely be
> > > > moved into the logvecs when they are detached from the items, then
> > > > the sort can be done on the logvec after the cil_ctx_lock has been
> > > > released. logvecs will need to use a list_head for this rather than
> > > > a single linked list like they do now....
> > > 
> > > ...which I guess happens in patch 35 now?
> > 
> > Right. I'll just remove this from the commit message.
> > 
> > > > @@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
> > > >  	tic->t_curr_res -= lvhdr->lv_bytes;
> > > >  }
> > > >  
> > > > +/*
> > > > + * CIL item reordering compare function. We want to order in ascending ID order,
> > > > + * but we want to leave items with the same ID in the order they were added to
> > > 
> > > When do we have items with the same id?
> > 
> > All the items in a single transaction have the same id. The order id
> > increments before we tag all the items in the transaction and insert
> > them into the CIL.
> > 
> > > I guess that happens if we have multiple transactions adding items to
> > > the cil at the same time?  I guess that's not a big deal since each of
> > > those threads will hold a disjoint set of locks, so even if the order
> > > ids are the same for a bunch of items, they're never going to be
> > > touching the same AG/inode/metadata object, right?
> > >
> > > If that's correct, then:
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > 
> > While true, it's not the way this works so I won't immediately
> > accept your RVB. The reason for not changing the ordering within a
> > single transaction is actually intent logging.  i.e. this:
> > 
> > > > + * the list. This is important for operations like reflink where we log 4 order
> > > > + * dependent intents in a single transaction when we overwrite an existing
> > > > + * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
> > > > + * CUI (inc), BUI(remap)...
> > 
> > There's a specific order of operations that recovery must run these
> > intents in, and so if we re-order them here in the CIL they'll be
> > out of order in the log and recovery will replay the intents in the
> > wrong order. Replaying the intents in the wrong order results in
> > corruption warnings and assert failures during log recovery, hence
> > the constraint of not re-ordering items within the same transaction.
> 
> <ding> lightbulb comes on.  I think I understood this better the last
> time I read all these patches. :/
> 
> Basically, for each item that can be attached to a transaction, you're
> assigning it an "order id" that is a monotonically increasing counter
> that (roughly) records the last time the item was committed.  Certain
> items (like inodes) can be relogged and committed multiple times in
> rapid fire succession, in which case the order_id will get bumped
> forward.

Effectively, yes.

> In the /next/ patch you'll change the cil item list to be per-cpu and
> only splice the mess together at cil push time.  For that to work
> properly, you have to re-sort that resulting list in commit order (aka
> the order_id) to keep the items in order of commit.
> 
> For items *within* a transaction, you take advantage of the property
> of list_sort that it won't reorder items with cmp(a, b) == 0, which
> means that all the intents logged to a transaction will maintain the
> same order that the author of higher level code wrote into the software.

Correct.

> Question: xlog_cil_push_work zeroes the order_id of pushed log items.
> Is there any potential problem here when ctx->order_id wraps around to
> zero?  I think the answer is that we'll move on to a new cil context
> long before we hit 2^32-1 transactions?

Yes. At the moment, the max transaction rate is about 800k/s, which
means it'd take a couple of hours to run 4 billion transactions. So
we're in no danger of overrunning the number of transactions in a CIL
commit any time soon. And if we ever get near that, we can just bump
the counter to a 64 bit value...
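
(Back of the envelope: 2^32 commits at ~800k/s is about 5400 seconds,
i.e. an hour and a half flat out, and background pushes retire a CIL
context long before it accumulates anywhere near that many commits.)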

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-06-03  1:26       ` Darrick J. Wong
@ 2021-06-03  2:28         ` Dave Chinner
  2021-06-03  3:01           ` Darrick J. Wong
  0 siblings, 1 reply; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  2:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 02, 2021 at 06:26:09PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 03, 2021 at 09:47:47AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2021 at 11:41:21AM -0700, Darrick J. Wong wrote:
> > > On Wed, May 19, 2021 at 10:13:08PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Now that we have the CIL percpu structures in place, implement the
> > > > space used counter with a fast sum check similar to the
> > > > percpu_counter infrastructure.
> > > > 
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
> > > >  fs/xfs/xfs_log_priv.h |  2 +-
> > > >  2 files changed, 55 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index ba1c6979a4c7..72693fba929b 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
> > > >  	return ctx;
> > > >  }
> > > >  
> > > > +/*
> > > > + * Aggregate the CIL per cpu structures into global counts, lists, etc and
> > > > + * clear the percpu state ready for the next context to use.
> > > > + */
> > > > +static void
> > > > +xlog_cil_pcp_aggregate(
> > > > +	struct xfs_cil		*cil,
> > > > +	struct xfs_cil_ctx	*ctx)
> > > > +{
> > > > +	struct xlog_cil_pcp	*cilpcp;
> > > > +	int			cpu;
> > > > +
> > > > +	for_each_online_cpu(cpu) {
> > > > +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > > > +		cilpcp->space_used = 0;
> > > 
> > > How does this aggregate anything?  All I see here is zeroing a counter?
> > 
> > Yup, zeroing all the percpu counters is an aggregation function....
> > 
> > By definition "aggregate != sum".
> > 
> > An aggregate is formed by the collection of discrete units into a
> > larger whole; the collective definition involves manipulating all
> > discrete units as a single whole entity. e.g. a percpu counter is
> > an aggregate of percpu variables that, via aggregation, can sum the
> > discrete variables into a single value. IOWs, percpu_counter_sum()
> > is an aggregation function that sums...
> > 
> > > I see that we /can/ add the percpu space_used counter to the cil context
> > > if we're over the space limits, but I don't actually see where...
> > 
> > In this case, the global CIL space used counter is summed by the
> > per-cpu counter update context and not an aggregation context. For
> > it to work as a global counter since a distinct point in time, it
> > needs an aggregation operation that zeros all the discrete units of
> > the counter at a single point in time. IOWs, the aggregation
> > function of this counter is a zeroing operation, not a summing
> > operation. This is what xlog_cil_pcp_aggregate() is doing here.
> > 
> > Put simply, an aggregation function is not a summing function, but a
> > function that operates on all the discrete units of the
> > aggregate so that it can operate correctly as a single unit....
> 
> <nod> I grok what 'aggregate' means as a general term, though perhaps I
> was too quick to associate it with 'sum' here.
> 
> > I don't know of a better way of describing what this function does.
> > At the end of the series, this function will zero some units. In
> > other cases it will sum units. In some cases it will do both. Not to
> > mention that it will merge discrete lists into a global list. And so
> > on. The only common thing between these operations is that they are
> > all aggregation functions that allow the CIL context to operate as a
> > whole unit...
> 
> *Oh* I think I realized where my understanding gap lies.
> 
> space_used isn't part of some global space accounting scheme that has to
> be kept accurate.  It's a per-cil-context variable that we use to
> throttle incoming commits when the current context starts to get full.
> That's why _cil_insert_items only bothers to add the per-cpu space_used
> to the ctx space_used if either (a) the current cpu has used up more
> than its slice of space or (b) enough cpus have hit (a) to push the
> ctx's space_used above the XLOG_CIL_SPACE_LIMIT.  If no cpus have hit
> (a) then the current cil context still has plenty of space and there
> isn't any need to throttle the frontend.
> 
> By the time we get to the aggregation step in _cil_push_work we've
> already decided to install a new context and write the current context
> to disk.  We don't care about throttling the frontend on this (closed)
> context and we hold xc_ctx_lock so nobody can see the new context.
> That's why the aggregate function zeroes the per-cpu space_used.

Exactly.

> > > > +	/*
> > > > +	 * Update the CIL percpu pointer. This updates the global counter when
> > > > +	 * over the percpu batch size or when the CIL is over the space limit.
> > > > +	 * This means low lock overhead for normal updates, and when over the
> > > > +	 * limit the space used is immediately accounted. This makes enforcing
> > > > +	 * the hard limit much more accurate. The per cpu fold threshold is
> > > > +	 * based on how close we are to the hard limit.
> > > > +	 */
> > > > +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > > > +	cilpcp->space_used += len;
> > > > +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> > > > +	    cilpcp->space_used >
> > > > +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > > > +					num_online_cpus())) {
> > > > +		atomic_add(cilpcp->space_used, &ctx->space_used);
> > > > +		cilpcp->space_used = 0;
> > > > +	}
> > > > +	put_cpu_ptr(cilpcp);
> > > > +
> > > >  	spin_lock(&cil->xc_cil_lock);
> > > > -	tp->t_ticket->t_curr_res -= ctx_res + len;
> > > >  	ctx->ticket->t_unit_res += ctx_res;
> > > >  	ctx->ticket->t_curr_res += ctx_res;
> > > > -	ctx->space_used += len;
> > > 
> > > ...this update happens if we're not over the space limit?
> > 
> > It's the second case in the above if statement. As the space used in
> > the percpu pointer goes over its fraction of the remaining space
> > limit (limit remaining / num_cpus_online), then it adds the
> > pcp counter back into the global counter. Essentially it is:
> > 
> > 	if (over push threshold ||
> > >>>>>>	    pcp->used > ((hard limit - ctx->space_used) / cpus)) {
> > 		ctx->space_used += pcp->used;
> > 		pcp->used = 0;
> > 	}
> > 
> > Hence, to begin with, the percpu counter is allowed to sum a large
> > chunk of space before it trips the per CPU summing threshold. When
> > summing occurs, the per-cpu threshold goes down, meaning the pcp
> > counters will trip sooner in the next cycle.
> > 
> > IOWs, the summing threshold gets closer to zero the closer the
> > global count gets to the hard limit. Hence when there's lots of
> > space available, we have little summing contention, but when we
> > are close to the blocking limit we essentially update the global
> > counter on every modification.
> 
> Ok, I think I get it now.  The statement "ctx->space_used = 0" is part
> of clearing the percpu state and is not part of aggregating the CIL per
> cpu structure into the context.

If you mean the 'cilpcp->space_used = 0' statement in the
_aggregate() function, then yes.

We don't actually ever zero the ctx->space_used because it is always
initialised to zero by the allocation of a new context and the switch
to it...

> So assuming that I grokked it all on the second try, maybe a comment is
> in order for the aggregate function?
> 
> 	/*
> 	 * We're in the middle of switching cil contexts.  Reset the
> 	 * counter we use to detect when the current context is nearing
> 	 * full.
> 	 */
> 	ctx->space_used = 0;

Hmmmm - I'm not sure where you are asking I put this comment...

> > As such, we get scalability when the CIL is empty by trading off
> > accuracy, but we get accuracy when it is nearing full by trading off
> > scalability. We might need to tweak it for really large CPU counts
> > (maybe use log2(num_online_cpus())), but fundamentally the algorithm
> > is designed to scale according to how close we are to the push
> > thresholds....
> 
> <nod> I suppose if you had a very large number of CPUs and a very small
> log then the slivers could be zero ... which just means we'd lose
> performance due to overly-careful accounting.  I wonder if you want to
> leave a breadcrumb to warn if:
> 
> XLOG_CIL_BLOCKING_SPACE_LIMIT() / num_online_cpus() == 0
> 
> just in case someone ever wanders in with such a configuration?

I don't think that can happen. The smallest blocking space limit will
be a quarter of the smallest log size, which puts us at around 800KB
as the blocking limit. I can't see us supporting 800k CPUs any time
soon, and even if you have a couple of hundred thousand CPUs, you're
going to have lots of other basic problems trying to do concurrent
operations on such a tiny log before we even consider this
threshold...

I'm not too worried about scalability and performance on tiny logs -
if they result in an atomic update every transaction, it'll still be
sufficient to scale to the concurrency the transaction reservation
code will let into the commit path in the first place...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-06-03  2:28         ` Dave Chinner
@ 2021-06-03  3:01           ` Darrick J. Wong
  2021-06-03  3:56             ` Dave Chinner
  0 siblings, 1 reply; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-03  3:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 12:28:14PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 06:26:09PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 03, 2021 at 09:47:47AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2021 at 11:41:21AM -0700, Darrick J. Wong wrote:
> > > > On Wed, May 19, 2021 at 10:13:08PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Now that we have the CIL percpu structures in place, implement the
> > > > > space used counter with a fast sum check similar to the
> > > > > percpu_counter infrastructure.
> > > > > 
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c  | 61 ++++++++++++++++++++++++++++++++++++++-----
> > > > >  fs/xfs/xfs_log_priv.h |  2 +-
> > > > >  2 files changed, 55 insertions(+), 8 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index ba1c6979a4c7..72693fba929b 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > @@ -76,6 +76,24 @@ xlog_cil_ctx_alloc(void)
> > > > >  	return ctx;
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * Aggregate the CIL per cpu structures into global counts, lists, etc and
> > > > > + * clear the percpu state ready for the next context to use.
> > > > > + */
> > > > > +static void
> > > > > +xlog_cil_pcp_aggregate(
> > > > > +	struct xfs_cil		*cil,
> > > > > +	struct xfs_cil_ctx	*ctx)
> > > > > +{
> > > > > +	struct xlog_cil_pcp	*cilpcp;
> > > > > +	int			cpu;
> > > > > +
> > > > > +	for_each_online_cpu(cpu) {
> > > > > +		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
> > > > > +		cilpcp->space_used = 0;
> > > > 
> > > > How does this aggregate anything?  All I see here is zeroing a counter?
> > > 
> > > Yup, zeroing all the percpu counters is an aggregation function....
> > > 
> > > By definition "aggregate != sum".
> > > 
> > > An aggregate is formed by the collection of discrete units into a
> > > larger whole; the collective definition involves manipulating all
> > > discrete units as a single whole entity. e.g. a percpu counter is
> > > an aggregate of percpu variables that, via aggregation, can sum the
> > > discrete variables into a single value. IOWs, percpu_counter_sum()
> > > is an aggregation function that sums...
> > > 
> > > > I see that we /can/ add the percpu space_used counter to the cil context
> > > > if we're over the space limits, but I don't actually see where...
> > > 
> > > In this case, the global CIL space used counter is summed by the
> > > per-cpu counter update context and not an aggregation context. For
> > > it to work as a global counter since a distinct point in time, it
> > > needs an aggregation operation that zeros all the discrete units of
> > > the counter at a single point in time. IOWs, the aggregation
> > > function of this counter is a zeroing operation, not a summing
> > > operation. This is what xlog_cil_pcp_aggregate() is doing here.
> > > 
> > > Put simply, an aggregation function is not a summing function, but a
> > > function that operates on all the discrete units of the
> > > aggregate so that it can operate correctly as a single unit....
> > 
> > <nod> I grok what 'aggregate' means as a general term, though perhaps I
> > was too quick to associate it with 'sum' here.
> > 
> > > I don't know of a better way of describing what this function does.
> > > At the end of the series, this function will zero some units. In
> > > other cases it will sum units. In some cases it will do both. Not to
> > > mention that it will merge discrete lists into a global list. And so
> > > on. The only common thing between these operations is that they are
> > > all aggregation functions that allow the CIL context to operate as a
> > > whole unit...
> > 
> > *Oh* I think I realized where my understanding gap lies.
> > 
> > space_used isn't part of some global space accounting scheme that has to
> > be kept accurate.  It's a per-cil-context variable that we use to
> > throttle incoming commits when the current context starts to get full.
> > That's why _cil_insert_items only bothers to add the per-cpu space_used
> > to the ctx space_used if either (a) the current cpu has used up more
> > than its slice of space or (b) enough cpus have hit (a) to push the
> > ctx's space_used above the XLOG_CIL_SPACE_LIMIT.  If no cpus have hit
> > (a) then the current cil context still has plenty of space and there
> > isn't any need to throttle the frontend.
> > 
> > By the time we get to the aggregation step in _cil_push_work we've
> > already decided to install a new context and write the current context
> > to disk.  We don't care about throttling the frontend on this (closed)
> > context and we hold xc_ctx_lock so nobody can see the new context.
> > That's why the aggregate function zeroes the per-cpu space_used.
> 
> Exactly.

oh good!

> > > > > +	/*
> > > > > +	 * Update the CIL percpu pointer. This updates the global counter when
> > > > > +	 * over the percpu batch size or when the CIL is over the space limit.
> > > > > +	 * This means low lock overhead for normal updates, and when over the
> > > > > +	 * limit the space used is immediately accounted. This makes enforcing
> > > > > +	 * the hard limit much more accurate. The per cpu fold threshold is
> > > > > +	 * based on how close we are to the hard limit.
> > > > > +	 */
> > > > > +	cilpcp = get_cpu_ptr(cil->xc_pcp);
> > > > > +	cilpcp->space_used += len;
> > > > > +	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
> > > > > +	    cilpcp->space_used >
> > > > > +			((XLOG_CIL_BLOCKING_SPACE_LIMIT(log) - space_used) /
> > > > > +					num_online_cpus())) {
> > > > > +		atomic_add(cilpcp->space_used, &ctx->space_used);
> > > > > +		cilpcp->space_used = 0;
> > > > > +	}
> > > > > +	put_cpu_ptr(cilpcp);
> > > > > +
> > > > >  	spin_lock(&cil->xc_cil_lock);
> > > > > -	tp->t_ticket->t_curr_res -= ctx_res + len;
> > > > >  	ctx->ticket->t_unit_res += ctx_res;
> > > > >  	ctx->ticket->t_curr_res += ctx_res;
> > > > > -	ctx->space_used += len;
> > > > 
> > > > ...this update happens if we're not over the space limit?
> > > 
> > > It's the second case in the above if statement. As the space used in
> > > the percpu pointer goes over its fraction of the remaining space
> > > limit (limit remaining / num_cpus_online), then it adds the
> > > pcp counter back into the global counter. Essentially it is:
> > > 
> > > 	if (over push threshold ||
> > > >>>>>>	    pcp->used > ((hard limit - ctx->space_used) / cpus)) {
> > > 		ctx->space_used += pcp->used;
> > > 		pcp->used = 0;
> > > 	}
> > > 
> > > Hence, to begin with, the percpu counter is allowed to sum a large
> > > chunk of space before it trips the per CPU summing threshold. When
> > > summing occurs, the per-cpu threshold goes down, meaning the pcp
> > > counters will trip sooner in the next cycle.
> > > 
> > > IOWs, the summing threshold gets closer to zero the closer the
> > > global count gets to the hard limit. Hence when there's lots of
> > > space available, we have little summing contention, but when we
> > > are close to the blocking limit we essentially update the global
> > > counter on every modification.
> > 
> > Ok, I think I get it now.  The statement "ctx->space_used = 0" is part
> > of clearing the percpu state and is not part of aggregating the CIL per
> > cpu structure into the context.
> 
> If you mean the 'cilpcp->space_used = 0' statement in the
> _aggregate() function, then yes.

Yes.  My brain is tired.

(Also it's really hot here.)

> We don't actually ever zero the ctx->space_used because it is always
> initialised to zero by the allocation of a new context and the switch
> to it...
> 
> > So assuming that I grokked it all on the second try, maybe a comment is
> > in order for the aggregate function?
> > 
> > 	/*
> > 	 * We're in the middle of switching cil contexts.  Reset the
> > 	 * counter we use to detect when the current context is nearing
> > 	 * full.
> > 	 */
> > 	ctx->space_used = 0;
> 
> Hmmmm - I'm not sure where you are asking I put this comment...

Sorry, I meant the comment to be placed above the
"cilpcp->space_used = 0;" in the _aggregate function.

> > > As such, we get scalability when the CIL is empty by trading off
> > > accuracy, but we get accuracy when it is nearing full by trading off
> > > scalability. We might need to tweak it for really large CPU counts
> > > (maybe use log2(num_online_cpus())), but fundamentally the algorithm
> > > is designed to scale according to how close we are to the push
> > > thresholds....
> > 
> > <nod> I suppose if you had a very large number of CPUs and a very small
> > log then the slivers could be zero ... which just means we'd lose
> > performance due to overly-careful accounting.  I wonder if you want to
> > leave a breadcrumb to warn if:
> > 
> > XLOG_CIL_BLOCKING_SPACE_LIMIT() / num_online_cpus() == 0
> > 
> > just in case someone ever wanders in with such a configuration?
> 
> I don't think that can happen. The smallest blocking space limit will
> be a quarter of the smallest log size, which puts us at around 800KB
> as the blocking limit. I can't see us supporting 800k CPUs any time
> soon, and even if you have a couple of hundred thousand CPUs, you're
> going to have lots of other basic problems trying to do concurrent
> operations on such a tiny log before we even consider this
> threshold...

Just think of what IPIs look like! :D
> 
> I'm not too worried about scalability and performance on tiny logs -
> if they result in an atomic update every transaction, it'll still be
> sufficient to scale to the concurrency the transaction reservation
> code will let into the commit path in the first place...

Ok.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 33/39] xfs: Add order IDs to log items in CIL
  2021-06-03  2:13         ` Dave Chinner
@ 2021-06-03  3:02           ` Darrick J. Wong
  0 siblings, 0 replies; 86+ messages in thread
From: Darrick J. Wong @ 2021-06-03  3:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jun 03, 2021 at 12:13:30PM +1000, Dave Chinner wrote:
> On Wed, Jun 02, 2021 at 05:49:14PM -0700, Darrick J. Wong wrote:
> > On Thu, Jun 03, 2021 at 10:16:22AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2021 at 12:00:23PM -0700, Darrick J. Wong wrote:
> > > > On Wed, May 19, 2021 at 10:13:11PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > Before we split the ordered CIL up into per cpu lists, we need a
> > > > > mechanism to track the order of the items in the CIL. We need to do
> > > > > this because there are rules around the order in which related items
> > > > > must physically appear in the log even inside a single checkpoint
> > > > > transaction.
> > > > > 
> > > > > An example of this is intents - an intent must appear in the log
> > > > > before it's intent done record so taht log recovery can cancel the
> > > > 
> > > > s/taht/that/
> > > > 
> > > > > intent correctly. If we have these two records misordered in the
> > > > > CIL, then they will not be recovered correctly by journal replay.
> > > > > 
> > > > > We also will not be able to move items to the tail of
> > > > > the CIL list when they are relogged, hence the log items will need
> > > > > some mechanism to allow the correct log item order to be recreated
> > > > > before we write log items to the journal.
> > > > > 
> > > > > Hence we need to have a mechanism for recording global order of
> > > > > transactions in the log items  so that we can recover that order
> > > > > from un-ordered per-cpu lists.
> > > > > 
> > > > > Do this with a simple monotonic increasing commit counter in the CIL
> > > > > context. Each log item in the transaction gets stamped with the
> > > > > current commit order ID before it is added to the CIL. If the item
> > > > > is already in the CIL, leave it where it is instead of moving it to
> > > > > the tail of the list and instead sort the list before we start the
> > > > > push work.
> > > > > 
> > > > > XXX: list_sort() under the cil_ctx_lock held exclusive starts
> > > > > hurting at >16 threads. Front end commits are waiting on the push
> > > > > to switch contexts much longer. The item order id should likely be
> > > > > moved into the logvecs when they are detached from the items, then
> > > > > the sort can be done on the logvec after the cil_ctx_lock has been
> > > > > released. logvecs will need to use a list_head for this rather than
> > > > > a single linked list like they do now....
> > > > 
> > > > ...which I guess happens in patch 35 now?
> > > 
> > > Right. I'll just remove this from the commit message.
> > > 
> > > > > @@ -780,6 +780,26 @@ xlog_cil_build_trans_hdr(
> > > > >  	tic->t_curr_res -= lvhdr->lv_bytes;
> > > > >  }
> > > > >  
> > > > > +/*
> > > > > + * CIL item reordering compare function. We want to order in ascending ID order,
> > > > > + * but we want to leave items with the same ID in the order they were added to
> > > > 
> > > > When do we have items with the same id?
> > > 
> > > All the items in a single transaction have the same id. The order id
> > > increments before we tag all the items in the transaction and insert
> > > them into the CIL.
> > > 
> > > > I guess that happens if we have multiple transactions adding items to
> > > > the cil at the same time?  I guess that's not a big deal since each of
> > > > those threads will hold a disjoint set of locks, so even if the order
> > > > ids are the same for a bunch of items, they're never going to be
> > > > touching the same AG/inode/metadata object, right?
> > > >
> > > > If that's correct, then:
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > 
> > > While true, it's not the way this works so I won't immediately
> > > accept your RVB. The reason for not changing the ordering within a
> > > single transaction is actually intent logging.  i.e. this:
> > > 
> > > > > + * the list. This is important for operations like reflink where we log 4 order
> > > > > + * dependent intents in a single transaction when we overwrite an existing
> > > > > + * shared extent with a new shared extent. i.e. BUI(unmap), CUI(drop),
> > > > > + * CUI (inc), BUI(remap)...
> > > 
> > > There's a specific order of operations that recovery must run these
> > > intents in, and so if we re-order them here in the CIL they'll be
> > > out of order in the log and recovery will replay the intents in the
> > > wrong order. Replaying the intents in the wrong order results in
> > > corruption warnings and assert failures during log recovery, hence
> > > the constraint of not re-ordering items within the same transaction.
> > 
> > <ding> lightbulb comes on.  I think I understood this better the last
> > time I read all these patches. :/
> > 
> > Basically, for each item that can be attached to a transaction, you're
> > assigning it an "order id" that is a monotonically increasing counter
> > that (roughly) records the last time the item was committed.  Certain
> > items (like inodes) can be relogged and committed multiple times in
> > rapid fire succession, in which case the order_id will get bumped
> > forward.
> 
> Effectively, yes.
> 
> > In the /next/ patch you'll change the cil item list to be per-cpu and
> > only splice the mess together at cil push time.  For that to work
> > properly, you have to re-sort that resulting list in commit order (aka
> > the order_id) to keep the items in order of commit.
> > 
> > For items *within* a transaction, you take advantage of the property
> > of list_sort that it won't reorder items with cmp(a, b) == 0, which
> > means that all the intents logged to a transaction will maintain the
> > same order that the author of higher level code wrote into the software.
> 
> Correct.

Ok, good.

> > Question: xlog_cil_push_work zeroes the order_id of pushed log items.
> > Is there any potential problem here when ctx->order_id wraps around to
> > zero?  I think the answer is that we'll move on to a new cil context
> > long before we hit 2^32-1 transactions?
> 
> Yes. At the moment, the max transaction rate is about 800k/s, which
> means it'd take a couple of hours to run 4 billion transactions. So
> we're in no danger of overrunning the number of transactions in a CIL
> commit any time soon. And if we ever get near that, we can just bump
> the counter to a 64 bit value...

Ok.

With the "taht" in the commit message fixed,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 30/39] xfs: implement percpu cil space used calculation
  2021-06-03  3:01           ` Darrick J. Wong
@ 2021-06-03  3:56             ` Dave Chinner
  0 siblings, 0 replies; 86+ messages in thread
From: Dave Chinner @ 2021-06-03  3:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Jun 02, 2021 at 08:01:48PM -0700, Darrick J. Wong wrote:
> On Thu, Jun 03, 2021 at 12:28:14PM +1000, Dave Chinner wrote:
> > On Wed, Jun 02, 2021 at 06:26:09PM -0700, Darrick J. Wong wrote:
> > > So assuming that I grokked it all on the second try, maybe a comment is
> > > in order for the aggregate function?
> > > 
> > > 	/*
> > > 	 * We're in the middle of switching cil contexts.  Reset the
> > > 	 * counter we use to detect when the current context is nearing
> > > 	 * full.
> > > 	 */
> > > 	ctx->space_used = 0;
> > 
> > Hmmmm - I'm not sure where you are asking I put this comment...
> 
> Sorry, I meant the comment to be placed above the
> "cilpcp->space_used = 0;" in the _aggregate function.

Ok. That'll cause rejects on all the subsequent patches so there's
not much point in posting just an update to this patch with that
done. I guess I'm sending out another 39 patches before the end of
the day.... :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 86+ messages
2021-05-19 12:12 [PATCH 00/39 v4] xfs: CIL and log optimisations Dave Chinner
2021-05-19 12:12 ` [PATCH 01/39] xfs: log stripe roundoff is a property of the log Dave Chinner
2021-05-28  0:54   ` Allison Henderson
2021-05-19 12:12 ` [PATCH 02/39] xfs: separate CIL commit record IO Dave Chinner
2021-05-28  0:54   ` Allison Henderson
2021-05-19 12:12 ` [PATCH 03/39] xfs: remove xfs_blkdev_issue_flush Dave Chinner
2021-05-28  0:54   ` Allison Henderson
2021-05-19 12:12 ` [PATCH 04/39] xfs: async blkdev cache flush Dave Chinner
2021-05-20 23:53   ` Darrick J. Wong
2021-05-28  0:54   ` Allison Henderson
2021-05-19 12:12 ` [PATCH 05/39] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
2021-05-28  0:54   ` Allison Henderson
2021-05-19 12:12 ` [PATCH 06/39] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
2021-05-19 12:12 ` [PATCH 07/39] xfs: journal IO cache flush reductions Dave Chinner
2021-05-21  0:16   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 08/39] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
2021-05-19 12:12 ` [PATCH 09/39] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
2021-05-21  0:20   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 10/39] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-05-21  0:33   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 11/39] xfs: CIL work is serialised, not pipelined Dave Chinner
2021-05-21  0:32   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 12/39] xfs: factor out the CIL transaction header building Dave Chinner
2021-05-19 12:12 ` [PATCH 13/39] xfs: only CIL pushes require a start record Dave Chinner
2021-05-19 12:12 ` [PATCH 14/39] xfs: embed the xlog_op_header in the unmount record Dave Chinner
2021-05-21  0:35   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 15/39] xfs: embed the xlog_op_header in the commit record Dave Chinner
2021-05-19 12:12 ` [PATCH 16/39] xfs: log tickets don't need log client id Dave Chinner
2021-05-21  0:38   ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 17/39] xfs: move log iovec alignment to preparation function Dave Chinner
2021-05-19 12:12 ` [PATCH 18/39] xfs: reserve space and initialise xlog_op_header in item formatting Dave Chinner
2021-05-19 12:12 ` [PATCH 19/39] xfs: log ticket region debug is largely useless Dave Chinner
2021-05-19 12:12 ` [PATCH 20/39] xfs: pass lv chain length into xlog_write() Dave Chinner
2021-05-27 17:20   ` Darrick J. Wong
2021-06-02 22:18     ` Dave Chinner
2021-06-02 22:24       ` Darrick J. Wong
2021-06-02 22:58         ` [PATCH 20/39 V2] " Dave Chinner
2021-06-02 23:01           ` Darrick J. Wong
2021-05-19 12:12 ` [PATCH 21/39] xfs: introduce xlog_write_single() Dave Chinner
2021-05-27 17:27   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 22/39] xfs:_introduce xlog_write_partial() Dave Chinner
2021-05-27 18:06   ` Darrick J. Wong
2021-06-02 22:21     ` Dave Chinner
2021-05-19 12:13 ` [PATCH 23/39] xfs: xlog_write() no longer needs contwr state Dave Chinner
2021-05-19 12:13 ` [PATCH 24/39] xfs: xlog_write() doesn't need optype anymore Dave Chinner
2021-05-27 18:07   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 25/39] xfs: CIL context doesn't need to count iovecs Dave Chinner
2021-05-27 18:08   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 26/39] xfs: use the CIL space used counter for emptiness checks Dave Chinner
2021-05-19 12:13 ` [PATCH 27/39] xfs: lift init CIL reservation out of xc_cil_lock Dave Chinner
2021-05-19 12:13 ` [PATCH 28/39] xfs: rework per-iclog header CIL reservation Dave Chinner
2021-05-27 18:17   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 29/39] xfs: introduce per-cpu CIL tracking structure Dave Chinner
2021-05-27 18:31   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 30/39] xfs: implement percpu cil space used calculation Dave Chinner
2021-05-27 18:41   ` Darrick J. Wong
2021-06-02 23:47     ` Dave Chinner
2021-06-03  1:26       ` Darrick J. Wong
2021-06-03  2:28         ` Dave Chinner
2021-06-03  3:01           ` Darrick J. Wong
2021-06-03  3:56             ` Dave Chinner
2021-05-19 12:13 ` [PATCH 31/39] xfs: track CIL ticket reservation in percpu structure Dave Chinner
2021-05-27 18:48   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 32/39] xfs: convert CIL busy extents to per-cpu Dave Chinner
2021-05-27 18:49   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 33/39] xfs: Add order IDs to log items in CIL Dave Chinner
2021-05-27 19:00   ` Darrick J. Wong
2021-06-03  0:16     ` Dave Chinner
2021-06-03  0:49       ` Darrick J. Wong
2021-06-03  2:13         ` Dave Chinner
2021-06-03  3:02           ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 34/39] xfs: convert CIL to unordered per cpu lists Dave Chinner
2021-05-27 19:03   ` Darrick J. Wong
2021-06-03  0:27     ` Dave Chinner
2021-05-19 12:13 ` [PATCH 35/39] xfs: convert log vector chain to use list heads Dave Chinner
2021-05-27 19:13   ` Darrick J. Wong
2021-06-03  0:38     ` Dave Chinner
2021-06-03  0:50       ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 36/39] xfs: move CIL ordering to the logvec chain Dave Chinner
2021-05-27 19:14   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 37/39] xfs: avoid cil push lock if possible Dave Chinner
2021-05-27 19:18   ` Darrick J. Wong
2021-05-19 12:13 ` [PATCH 38/39] xfs: xlog_sync() manually adjusts grant head space Dave Chinner
2021-05-19 12:13 ` [PATCH 39/39] xfs: expanding delayed logging design with background material Dave Chinner
2021-05-27 20:38   ` Darrick J. Wong
2021-06-03  0:57     ` Dave Chinner
