* [PATCH v2] xfs: various log stuff...
@ 2021-02-23  3:34 Dave Chinner
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

Version 2 of this set of changes to the log code. First version
was posted here:

https://lore.kernel.org/linux-xfs/20210128044154.806715-1-david@fromorbit.com/

"Quick patch dump for y'all. A couple of minor cleanups to the
log behaviour, a fix for the CIL throttle hang and a couple of
patches to rework the cache flushing that journal IO does to reduce
the number of cache flushes by a couple of orders of magnitude."

Version 2:
- fix ticket reservation roundoff to include 2 roundoffs
- removed stale copied comment from roundoff initialisation.
- clarified "separation" to mean "separation for ordering purposes" in commit
  message.
- added comment that newly activated, clean, empty iclogs have an LSN of 0 and so are
  captured by the "iclog lsn < start_lsn" case that avoids needing to wait
  before releasing the commit iclog to be written.
- added async cache flush infrastructure
- convert CIL checkpoint push work to issue an unconditional metadata device
  cache flush rather than asking the first iclog write to issue it via
  REQ_PREFLUSH.
- cleaned up xlog_write() to remove a redundant parameter and prepare the logic
  for setting flags on the iclog based on the type of operation being written
  to the log.
- added XLOG_ICL_NEED_FUA flag to complement the NEED_FLUSH flag, allowing
  callers to issue explicit flushes and clear the NEED_FLUSH flag before the
  iclog is written without dropping the REQ_FUA requirement.
- added CIL commit-in-start-iclog optimisation that clears the NEED_FLUSH flag
  to avoid an unnecessary cache flush when issuing the iclog.
- fixed typo in CIL throttle bugfix comment.
- fixed trailing whitespace in commit message.




* [PATCH 1/8] xfs: log stripe roundoff is a property of the log
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-23 10:29   ` Chandan Babu R
                     ` (3 more replies)
  2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We don't need to look at the xfs_mount and superblock every time we
need to do an iclog roundoff calculation. The property is fixed for
the life of the log, so store the roundoff in the log at mount time
and use that everywhere.
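
As a rough worked example (numbers invented for illustration, field
names as in the diff below): with a 32k log stripe unit cached in
l_iclog_roundoff, an iclog carrying 13000 bytes of data behind the
512 byte header rounds up like this:

	count_init = log->l_iclog_hsize + iclog->ic_offset;	/* 512 + 13000 = 13512 */
	count = roundup(count_init, log->l_iclog_roundoff);	/* rounds up to 32768 */
	roundoff = count - count_init;				/* 19256 bytes of padding */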

On a debug build:

$ size fs/xfs/xfs_log.o.*
   text	   data	    bss	    dec	    hex	filename
  27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
  27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h |  3 --
 fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
 fs/xfs/xfs_log_priv.h          |  2 ++
 3 files changed, 27 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 8bd00da6d2a4..16587219549c 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
 #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
 #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
 #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
-#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
-                                 (log)->l_mp->m_sb.sb_logsunit)
-#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
 
 #define XLOG_HEADER_SIZE	512
 
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 06041834daa3..fa284f26d10e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1399,6 +1399,11 @@ xlog_alloc_log(
 	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
 	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
 
+	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
+		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
+	else
+		log->l_iclog_roundoff = BBSIZE;
+
 	xlog_grant_head_init(&log->l_reserve_head);
 	xlog_grant_head_init(&log->l_write_head);
 
@@ -1852,29 +1857,15 @@ xlog_calc_iclog_size(
 	uint32_t		*roundoff)
 {
 	uint32_t		count_init, count;
-	bool			use_lsunit;
-
-	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-			log->l_mp->m_sb.sb_logsunit > 1;
 
 	/* Add for LR header */
 	count_init = log->l_iclog_hsize + iclog->ic_offset;
+	count = roundup(count_init, log->l_iclog_roundoff);
 
-	/* Round out the log write size */
-	if (use_lsunit) {
-		/* we have a v2 stripe unit to use */
-		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
-	} else {
-		count = BBTOB(BTOBB(count_init));
-	}
-
-	ASSERT(count >= count_init);
 	*roundoff = count - count_init;
 
-	if (use_lsunit)
-		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
-	else
-		ASSERT(*roundoff < BBTOB(1));
+	ASSERT(count >= count_init);
+	ASSERT(*roundoff < log->l_iclog_roundoff);
 	return count;
 }
 
@@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs(
 	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
 
 	/* Round up to next log-sunit */
-	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
-	    log->l_mp->m_sb.sb_logsunit > 1) {
-		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
-		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
+	if (log->l_iclog_roundoff > BBSIZE) {
+		log->l_curr_block = roundup(log->l_curr_block,
+						BTOBB(log->l_iclog_roundoff));
 	}
 
 	if (log->l_curr_block >= log->l_logBBsize) {
@@ -3404,12 +3394,11 @@ xfs_log_ticket_get(
  * Figure out the total log space unit (in bytes) that would be
  * required for a log ticket.
  */
-int
-xfs_log_calc_unit_res(
-	struct xfs_mount	*mp,
+static int
+xlog_calc_unit_res(
+	struct xlog		*log,
 	int			unit_bytes)
 {
-	struct xlog		*log = mp->m_log;
 	int			iclog_space;
 	uint			num_headers;
 
@@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res(
 	/* for commit-rec LR header - note: padding will subsume the ophdr */
 	unit_bytes += log->l_iclog_hsize;
 
-	/* for roundoff padding for transaction data and one for commit record */
-	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
-		/* log su roundoff */
-		unit_bytes += 2 * mp->m_sb.sb_logsunit;
-	} else {
-		/* BB roundoff */
-		unit_bytes += 2 * BBSIZE;
-        }
+	/* roundoff padding for transaction data and one for commit record */
+	unit_bytes += 2 * log->l_iclog_roundoff;
 
 	return unit_bytes;
 }
 
+int
+xfs_log_calc_unit_res(
+	struct xfs_mount	*mp,
+	int			unit_bytes)
+{
+	return xlog_calc_unit_res(mp->m_log, unit_bytes);
+}
+
 /*
  * Allocate and initialise a new log ticket.
  */
@@ -3513,7 +3504,7 @@ xlog_ticket_alloc(
 
 	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
 
-	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
+	unit_res = xlog_calc_unit_res(log, unit_bytes);
 
 	atomic_set(&tic->t_ref, 1);
 	tic->t_task		= current;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1c6fdbf3d506..037950cf1061 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -436,6 +436,8 @@ struct xlog {
 #endif
 	/* log recovery lsn tracking (for buffer submission */
 	xfs_lsn_t		l_recovery_lsn;
+
+	uint32_t		l_iclog_roundoff;/* padding roundoff */
 };
 
 #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
-- 
2.28.0



* [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-23 12:12   ` Chandan Babu R
                     ` (2 more replies)
  2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

To allow for iclog IO device cache flush behaviour to be optimised,
we first need to separate out the commit record iclog IO from the
rest of the checkpoint so we can wait for the checkpoint IO to
complete before we issue the commit record.

This separation is only necessary if the commit record is being
written into a different iclog to the start of the checkpoint as the
upcoming cache flushing changes require completion ordering against
the other iclogs submitted by the checkpoint.

If the entire checkpoint and commit is in the one iclog, then they
are both covered by the one set of cache flush primitives on the
iclog and hence there is no need to separate them for ordering.

Otherwise, we need to wait for all the previous iclogs to complete
so they are ordered correctly and made stable by the REQ_PREFLUSH
that the commit record iclog IO issues. This guarantees that if a
reader sees the commit record in the journal, they will also see the
entire checkpoint that commit record closes off.

This also provides the guarantee that when the commit record IO
completes, we can safely unpin all the log items in the checkpoint
so they can be written back because the entire checkpoint is stable
in the journal.
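
In the CIL push this ordering boils down to the following shape (see
the xfs_log_cil.c hunk below), with the wait only taken in the
multi-iclog case:

	/* checkpoint spans multiple iclogs: wait for the earlier ones */
	if (ctx->start_lsn != commit_lsn)
		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);

	/* now the commit record iclog can be released for writing */
	xfs_log_release_iclog(commit_iclog);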

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log_cil.c  |  7 ++++++
 fs/xfs/xfs_log_priv.h |  2 ++
 3 files changed, 64 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index fa284f26d10e..ff26fb46d70f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -808,6 +808,61 @@ xlog_wait_on_iclog(
 	return 0;
 }
 
+/*
+ * Wait on any iclogs that are still flushing in the range of start_lsn to the
+ * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
+ * holds no log locks.
+ *
+ * We walk backwards through the iclogs to find the iclog with the highest lsn
+ * in the range that we need to wait for and then wait for it to complete.
+ * Completion ordering of iclog IOs ensures that all prior iclogs to the
+ * candidate iclog we need to sleep on have completed by the time our
+ * candidate has completed its IO.
+ *
+ * Therefore we only need to find the first iclog that isn't clean within the
+ * span of our flush range. If we come across a clean, newly activated iclog
+ * with a lsn of 0, it means IO has completed on this iclog and all previous
+ * iclogs will have been completed prior to this one. Hence finding a newly
+ * activated iclog indicates that there are no iclogs in the range we need to
+ * wait on and we are done searching.
+ */
+int
+xlog_wait_on_iclog_lsn(
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		start_lsn)
+{
+	struct xlog		*log = iclog->ic_log;
+	struct xlog_in_core	*prev;
+	int			error = -EIO;
+
+	spin_lock(&log->l_icloglock);
+	if (XLOG_FORCED_SHUTDOWN(log))
+		goto out_unlock;
+
+	error = 0;
+	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
+
+		/* Done if the lsn is before our start lsn */
+		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
+				start_lsn) < 0)
+			break;
+
+		/* Don't need to wait on completed, clean iclogs */
+		if (prev->ic_state == XLOG_STATE_DIRTY ||
+		    prev->ic_state == XLOG_STATE_ACTIVE) {
+			continue;
+		}
+
+		/* wait for completion on this iclog */
+		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
+		return 0;
+	}
+
+out_unlock:
+	spin_unlock(&log->l_icloglock);
+	return error;
+}
+
 /*
  * Write out an unmount record using the ticket provided. We have to account for
  * the data space used in the unmount ticket as this write is not done from a
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..c5cc1b7ad25e 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -870,6 +870,13 @@ xlog_cil_push_work(
 	wake_up_all(&cil->xc_commit_wait);
 	spin_unlock(&cil->xc_push_lock);
 
+	/*
+	 * If the checkpoint spans multiple iclogs, wait for all previous
+	 * iclogs to complete before we submit the commit_iclog.
+	 */
+	if (ctx->start_lsn != commit_lsn)
+		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
+
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
 	return;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 037950cf1061..a7ac85aaff4e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -584,6 +584,8 @@ xlog_wait(
 	remove_wait_queue(wq, &wait);
 }
 
+int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
+
 /*
  * The LSN is valid so long as it is behind the current LSN. If it isn't, this
  * means that the next log record that includes this metadata could have a
-- 
2.28.0



* [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
  2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-23 12:57   ` Chandan Babu R
                     ` (2 more replies)
  2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Move it to xfs_bio_io.c as we are about to add new async cache flush
functionality that uses bios directly, so all this stuff should be
in the same place. Rename the function to xfs_flush_bdev() to match
the xfs_rw_bdev() function that already exists in this file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 8 ++++++++
 fs/xfs/xfs_buf.c    | 2 +-
 fs/xfs/xfs_file.c   | 6 +++---
 fs/xfs/xfs_linux.h  | 1 +
 fs/xfs/xfs_log.c    | 2 +-
 fs/xfs/xfs_super.c  | 7 -------
 fs/xfs/xfs_super.h  | 1 -
 7 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index e2148f2d5d6b..5abf653a45d4 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -59,3 +59,11 @@ xfs_rw_bdev(
 		invalidate_kernel_vmap_range(data, count);
 	return error;
 }
+
+void
+xfs_flush_bdev(
+	struct block_device	*bdev)
+{
+	blkdev_issue_flush(bdev, GFP_NOFS);
+}
+
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f6e5235df7c9..b1d6c530c693 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1958,7 +1958,7 @@ xfs_free_buftarg(
 	percpu_counter_destroy(&btp->bt_io_count);
 	list_lru_destroy(&btp->bt_lru);
 
-	xfs_blkdev_issue_flush(btp);
+	xfs_flush_bdev(btp->bt_bdev);
 
 	kmem_free(btp);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 38528e59030e..dd33ef2d0e20 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -196,9 +196,9 @@ xfs_file_fsync(
 	 * inode size in case of an extending write.
 	 */
 	if (XFS_IS_REALTIME_INODE(ip))
-		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
+		xfs_flush_bdev(mp->m_rtdev_targp->bt_bdev);
 	else if (mp->m_logdev_targp != mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
 
 	/*
 	 * Any inode that has dirty modifications in the log is pinned.  The
@@ -218,7 +218,7 @@ xfs_file_fsync(
 	 */
 	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
 	    mp->m_logdev_targp == mp->m_ddev_targp)
-		xfs_blkdev_issue_flush(mp->m_ddev_targp);
+		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
 
 	return error;
 }
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index af6be9b9ccdf..e94a2aeefee8 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -196,6 +196,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
+void xfs_flush_bdev(struct block_device *bdev);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index ff26fb46d70f..493454c98c6f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2015,7 +2015,7 @@ xlog_sync(
 	 * layer state machine for preflushes.
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
-		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
+		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
 		need_flush = false;
 	}
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 21b1d034aca3..85dd9593b40b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -339,13 +339,6 @@ xfs_blkdev_put(
 		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
 }
 
-void
-xfs_blkdev_issue_flush(
-	xfs_buftarg_t		*buftarg)
-{
-	blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS);
-}
-
 STATIC void
 xfs_close_devices(
 	struct xfs_mount	*mp)
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index 1ca484b8357f..79cb2dece811 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -88,7 +88,6 @@ struct block_device;
 
 extern void xfs_quiesce_attr(struct xfs_mount *mp);
 extern void xfs_flush_inodes(struct xfs_mount *mp);
-extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
 extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
 					   xfs_agnumber_t agcount);
 
-- 
2.28.0



* [PATCH 4/8] xfs: async blkdev cache flush
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
                   ` (2 preceding siblings ...)
  2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-23  5:29   ` Chaitanya Kulkarni
                     ` (2 more replies)
  2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The new checkpoint cache flush mechanism requires us to issue an
unconditional cache flush before we start a new checkpoint. We don't
want to block for this if we can help it, and we have a fair chunk
of CPU work to do between starting the checkpoint and issuing the
first journal IO.

Hence it makes sense to amortise the latency cost of the cache flush
by issuing it asynchronously and then waiting for it only when we
need to issue the first IO in the transaction.

To do this, we need async cache flush primitives to submit the cache
flush bio and to wait on it. The block layer has no such primitives
for filesystems, so roll our own for the moment.
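
The intended usage pattern pairs the flush with an on-stack
completion; a minimal sketch of a caller (the real caller is wired up
later in this series):

	DECLARE_COMPLETION_ONSTACK(bdev_flush);

	/* kick the cache flush off early; it runs while we do CPU work */
	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);

	/* ... format the checkpoint and build the log vector chain ... */

	/* only block on the flush when the first journal IO is issued */
	wait_for_completion(&bdev_flush);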

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 30 ++++++++++++++++++++++++++++++
 fs/xfs/xfs_linux.h  |  1 +
 2 files changed, 31 insertions(+)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index 5abf653a45d4..d55420bc72b5 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -67,3 +67,33 @@ xfs_flush_bdev(
 	blkdev_issue_flush(bdev, GFP_NOFS);
 }
 
+void
+xfs_flush_bdev_async_endio(
+	struct bio	*bio)
+{
+	if (bio->bi_private)
+		complete(bio->bi_private);
+	bio_put(bio);
+}
+
+/*
+ * Submit a request for an async cache flush to run. If the caller needs to wait
+ * for the flush completion at a later point in time, they must supply a
+ * valid completion. This will be signalled when the flush completes.
+ * The caller never sees the bio that is issued here.
+ */
+void
+xfs_flush_bdev_async(
+	struct block_device	*bdev,
+	struct completion	*done)
+{
+	struct bio *bio;
+
+	bio = bio_alloc(GFP_NOFS, 0);
+	bio_set_dev(bio, bdev);
+	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
+	bio->bi_private = done;
+        bio->bi_end_io = xfs_flush_bdev_async_endio;
+
+	submit_bio(bio);
+}
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index e94a2aeefee8..293ff2355e80 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -197,6 +197,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
 int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
 		char *data, unsigned int op);
 void xfs_flush_bdev(struct block_device *bdev);
+void xfs_flush_bdev_async(struct block_device *bdev, struct completion *done);
 
 #define ASSERT_ALWAYS(expr)	\
 	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
-- 
2.28.0



* [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
                   ` (3 preceding siblings ...)
  2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-24  7:16   ` Chandan Babu R
                     ` (2 more replies)
  2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

These rules apply to the atomic transactions recorded in the
journal, not to the journal IO itself. Hence we need to ensure
metadata is stable before we start writing a new transaction to the
journal (guarantee #1), and we need to ensure the entire transaction
is stable in the journal before we start metadata writeback
(guarantee #2).

The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
being added to every iclog IO. This causes the journal IO to issue a
cache flush and wait for it to complete before issuing the write IO
to the journal. Hence all completed metadata IO is guaranteed to be
stable before the journal overwrites the old metadata.

However, for long running CIL checkpoints that might do a thousand
journal IOs, we don't need every single one of these iclog IOs to
issue a cache flush - the cache flush done before the first iclog is
submitted is sufficient to cover the entire range in the log that
the checkpoint will overwrite because the CIL space reservation
guarantees the tail of the log (completed metadata) is already
beyond the range of the checkpoint write.

Hence we only need a full cache flush between closing off the CIL
checkpoint context (i.e. when the push switches it out) and issuing
the first journal IO. Rather than plumbing this through to the
journal IO, we can start this cache flush the moment the CIL context
is owned exclusively by the push worker. The cache flush can be in
progress while we process the CIL ready for writing, hence
reducing the latency of the initial iclog write. This is especially
true for large checkpoints, where we might have to process hundreds
of thousands of log vectors before we issue the first iclog write.
In these cases, it is likely the cache flush has already been
completed by the time we have built the CIL log vector chain.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index c5cc1b7ad25e..8bcacd463f06 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -656,6 +656,7 @@ xlog_cil_push_work(
 	struct xfs_log_vec	lvhdr = { NULL };
 	xfs_lsn_t		commit_lsn;
 	xfs_lsn_t		push_seq;
+	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
 	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);
@@ -719,10 +720,24 @@ xlog_cil_push_work(
 	spin_unlock(&cil->xc_push_lock);
 
 	/*
-	 * pull all the log vectors off the items in the CIL, and
-	 * remove the items from the CIL. We don't need the CIL lock
-	 * here because it's only needed on the transaction commit
-	 * side which is currently locked out by the flush lock.
+	 * The CIL is stable at this point - nothing new will be added to it
+	 * because we hold the flush lock exclusively. Hence we can now issue
+	 * a cache flush to ensure all the completed metadata in the journal we
+	 * are about to overwrite is on stable storage.
+	 *
+	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
+	 * cache flushes to provide this ordering guarantee, and hence for CIL
+	 * checkpoints that require hundreds or thousands of log writes no
+	 * longer need to issue device cache flushes to provide metadata
+	 * writeback ordering.
+	 */
+	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);
+
+	/*
+	 * Pull all the log vectors off the items in the CIL, and remove the
+	 * items from the CIL. We don't need the CIL lock here because it's only
+	 * needed on the transaction commit side which is currently locked out
+	 * by the flush lock.
 	 */
 	lv = NULL;
 	num_iovecs = 0;
@@ -806,6 +821,12 @@ xlog_cil_push_work(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
+	/*
+	 * Before we format and submit the first iclog, we have to ensure that
+	 * the metadata writeback ordering cache flush is complete.
+	 */
+	wait_for_completion(&bdev_flush);
+
 	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
 	if (error)
 		goto out_abort_free_ticket;
-- 
2.28.0



* [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
                   ` (4 preceding siblings ...)
  2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-24  7:17   ` Chandan Babu R
                     ` (2 more replies)
  2021-02-23  3:34 ` [PATCH 7/8] xfs: journal IO cache flush reductions Dave Chinner
  2021-02-23  3:34 ` [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
  7 siblings, 3 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The CIL push is the only call to xlog_write that sets this variable
to true. The other callers don't need a start rec, and they tell
xlog_write what to do by passing the type of ophdr they need written
in the flags field. The need_start_rec parameter essentially tells
xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
so get rid of the variable to do this and pass XLOG_START_TRANS as
the flag value into xlog_write() from the CIL push.
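
The CIL push call site (see the xfs_log_cil.c hunk below) therefore
changes from passing a trailing boolean to passing the operation type:

	/* before: trailing bool selects the start record */
	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);

	/* after: the optype flag tells xlog_write() to emit the start record */
	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
				XLOG_START_TRANS);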

$ size fs/xfs/xfs_log.o*
  text	   data	    bss	    dec	    hex	filename
 27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
 27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
 fs/xfs/xfs_log_cil.c  |  3 ++-
 fs/xfs/xfs_log_priv.h |  3 +--
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 493454c98c6f..6c3fb6dcb505 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -871,9 +871,7 @@ xlog_wait_on_iclog_lsn(
 static int
 xlog_write_unmount_record(
 	struct xlog		*log,
-	struct xlog_ticket	*ticket,
-	xfs_lsn_t		*lsn,
-	uint			flags)
+	struct xlog_ticket	*ticket)
 {
 	struct xfs_unmount_log_format ulf = {
 		.magic = XLOG_UNMOUNT_TYPE,
@@ -890,7 +888,7 @@ xlog_write_unmount_record(
 
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
-	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
+	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
 /*
@@ -904,15 +902,13 @@ xlog_unmount_write(
 	struct xfs_mount	*mp = log->l_mp;
 	struct xlog_in_core	*iclog;
 	struct xlog_ticket	*tic = NULL;
-	xfs_lsn_t		lsn;
-	uint			flags = XLOG_UNMOUNT_TRANS;
 	int			error;
 
 	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
 	if (error)
 		goto out_err;
 
-	error = xlog_write_unmount_record(log, tic, &lsn, flags);
+	error = xlog_write_unmount_record(log, tic);
 	/*
 	 * At this point, we're umounting anyway, so there's no point in
 	 * transitioning log state to IOERROR. Just continue...
@@ -1604,8 +1600,7 @@ xlog_commit_record(
 	if (XLOG_FORCED_SHUTDOWN(log))
 		return -EIO;
 
-	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
-			   false);
+	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
 	if (error)
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	return error;
@@ -2202,13 +2197,16 @@ static int
 xlog_write_calc_vec_length(
 	struct xlog_ticket	*ticket,
 	struct xfs_log_vec	*log_vector,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xfs_log_vec	*lv;
-	int			headers = need_start_rec ? 1 : 0;
+	int			headers = 0;
 	int			len = 0;
 	int			i;
 
+	if (optype & XLOG_START_TRANS)
+		headers++;
+
 	for (lv = log_vector; lv; lv = lv->lv_next) {
 		/* we don't write ordered log vectors */
 		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
@@ -2428,8 +2426,7 @@ xlog_write(
 	struct xlog_ticket	*ticket,
 	xfs_lsn_t		*start_lsn,
 	struct xlog_in_core	**commit_iclog,
-	uint			flags,
-	bool			need_start_rec)
+	uint			optype)
 {
 	struct xlog_in_core	*iclog = NULL;
 	struct xfs_log_vec	*lv = log_vector;
@@ -2457,8 +2454,9 @@ xlog_write(
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
 	}
 
-	len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec);
-	*start_lsn = 0;
+	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
+	if (start_lsn)
+		*start_lsn = 0;
 	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
 		void		*ptr;
 		int		log_offset;
@@ -2472,7 +2470,7 @@ xlog_write(
 		ptr = iclog->ic_datap + log_offset;
 
 		/* start_lsn is the first lsn written to. That's all we need. */
-		if (!*start_lsn)
+		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
 		/*
@@ -2485,6 +2483,7 @@ xlog_write(
 			int			copy_len;
 			int			copy_off;
 			bool			ordered = false;
+			bool			wrote_start_rec = false;
 
 			/* ordered log vectors have no regions to write */
 			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
@@ -2502,13 +2501,15 @@ xlog_write(
 			 * write a start record. Only do this for the first
 			 * iclog we write to.
 			 */
-			if (need_start_rec) {
+			if (optype & XLOG_START_TRANS) {
 				xlog_write_start_rec(ptr, ticket);
 				xlog_write_adv_cnt(&ptr, &len, &log_offset,
 						sizeof(struct xlog_op_header));
+				optype &= ~XLOG_START_TRANS;
+				wrote_start_rec = true;
 			}
 
-			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
+			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
 			if (!ophdr)
 				return -EIO;
 
@@ -2539,14 +2540,13 @@ xlog_write(
 			}
 			copy_len += sizeof(struct xlog_op_header);
 			record_cnt++;
-			if (need_start_rec) {
+			if (wrote_start_rec) {
 				copy_len += sizeof(struct xlog_op_header);
 				record_cnt++;
-				need_start_rec = false;
 			}
 			data_cnt += contwr ? copy_len : 0;
 
-			error = xlog_write_copy_finish(log, iclog, flags,
+			error = xlog_write_copy_finish(log, iclog, optype,
 						       &record_cnt, &data_cnt,
 						       &partial_copy,
 						       &partial_copy_len,
@@ -2590,7 +2590,7 @@ xlog_write(
 	spin_lock(&log->l_icloglock);
 	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
 	if (commit_iclog) {
-		ASSERT(flags & XLOG_COMMIT_TRANS);
+		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
 		error = xlog_state_release_iclog(log, iclog);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 8bcacd463f06..4093d2d0db7c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -827,7 +827,8 @@ xlog_cil_push_work(
 	 */
 	wait_for_completion(&bdev_flush);
 
-	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
+	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
+				XLOG_START_TRANS);
 	if (error)
 		goto out_abort_free_ticket;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index a7ac85aaff4e..10a41b1dd895 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
 void	xlog_print_trans(struct xfs_trans *);
 int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
 		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
-		struct xlog_in_core **commit_iclog, uint flags,
-		bool need_start_rec);
+		struct xlog_in_core **commit_iclog, uint optype);
 int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
-- 
2.28.0



* [PATCH 7/8] xfs: journal IO cache flush reductions
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
                   ` (5 preceding siblings ...)
  2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
  2021-02-23  3:34 ` [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
  7 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
causes the journal IO to issue a cache flush and wait for it to
complete before issuing the write IO to the journal. Hence all
completed metadata IO is guaranteed to be stable before the journal
overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which
ensures the journal writes do not complete until they are on stable
storage. Hence by the time the last journal IO in a checkpoint
completes, we know that the entire checkpoint is on stable storage
and we can unpin the dirty metadata and allow it to be written back.

This is the mechanism by which ordering was first implemented in XFS
way back in 2002 by this commit:

commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
Author: Steve Lord <lord@sgi.com>
Date:   Fri May 24 14:30:21 2002 +0000

    Add support for drive write cache flushing - should the kernel
    have the infrastructure

A lot has changed since then, most notably we now use delayed
logging to checkpoint the filesystem to the journal rather than
write each individual transaction to the journal. Cache flushes on
journal IO are necessary when individual transactions are wholly
contained within a single iclog. However, CIL checkpoints are single
transactions that typically span hundreds to thousands of individual
journal writes, and so the requirements for device cache flushing
have changed.

That is, the ordering rules I state above apply to ordering of
atomic transactions recorded in the journal, not to the journal IO
itself. Hence we need to ensure metadata is stable before we start
writing a new transaction to the journal (guarantee #1), and we need
to ensure the entire transaction is stable in the journal before we
start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a
new journal transaction to provide #1, and it is not needed on any other
journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts
writing to the log, so we no longer need the iclog IO to issue a
REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
to xlog_write(), we no longer need to mark the first iclog in
the log write with REQ_PREFLUSH for this case.

Given the new ordering semantics of commit records for the CIL, we
need iclogs containing commit records to issue a REQ_PREFLUSH. We also
require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them
immediately on stable storage, so we also mark the iclogs that
contain these records with REQ_FUA. That means if a
record is split across multiple iclogs, they are all marked REQ_FUA
and not just the last one so that when the transaction is completed
all the parts of the record are on stable storage.

As an optimisation, when the commit record lands in the same iclog
as the journal transaction starts, we don't need to wait for
anything and can simply use REQ_FUA to provide guarantee #2.  This
means that for fsync() heavy workloads, the cache flush behaviour is
completely unchanged and there is no degradation in performance as a
result of optimising the multi-IO transaction case.
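
Summarising the resulting behaviour for a checkpoint that spans N
iclogs (a sketch of the intent, not code from the patch):

	/*
	 * CIL push issues an explicit async cache flush before formatting.
	 *
	 * iclogs 1 .. N-1 (checkpoint data): REQ_OP_WRITE only, no flush/FUA
	 * iclog N (commit record):           XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA,
	 *                                    issued as REQ_PREFLUSH | REQ_FUA after
	 *                                    waiting for iclogs 1 .. N-1 to complete
	 *
	 * If the commit record lands in the same iclog as the checkpoint
	 * start, NEED_FLUSH is cleared and the iclog is issued with just
	 * REQ_FUA.
	 */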

The most notable sign that there is less IO latency on my test
machine (nvme SSDs) is that the "noiclogs" rate has dropped
substantially. This metric indicates that the CIL push is blocking
in xlog_get_iclog_space() waiting for iclog IO completion to occur.
With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
is blocking waiting for log IO. With the changes in this patch, this
drops to 1 noiclog event for every 100 iclog writes. Hence it is
clear that log IO is completing much faster than it was previously,
but it is also clear that for large iclog sizes, this isn't the
performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial
difference. With the cache flush modifications, the journal is now
running at over 4000 write IOPS, and the journal throughput is
largely identical to the 256kB iclogs and the noiclog event rate
stays low at about 1:50 iclog writes. The existing code tops out at
about 2500 IOPS as the number of cache flushes dominate performance
and latency. The noiclog event rate is about 1:4, and the
performance variance is quite large as the journal throughput can
fall to less than half the peak sustained rate when the cache flush
rate prevents metadata writeback from keeping up and the log runs
out of space and throttles reservations.

As a result:

	logbsize	fsmark create rate	rm -rf
before	32kb		152851+/-5.3e+04	5m28s
patched	32kb		221533+/-1.1e+04	5m24s

before	256kb		220239+/-6.2e+03	4m58s
patched	256kb		228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the
differences are largely noise. This workload is largely metadata
read IO latency bound and the changes to the journal cache flushing
don't really make any noticeable difference to behaviour apart from
a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
 fs/xfs/xfs_log_cil.c  |  7 ++++++-
 fs/xfs/xfs_log_priv.h |  4 ++++
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6c3fb6dcb505..08d68a6161ae 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1806,8 +1806,7 @@ xlog_write_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
 	uint64_t		bno,
-	unsigned int		count,
-	bool			need_flush)
+	unsigned int		count)
 {
 	ASSERT(bno < log->l_logBBsize);
 
@@ -1845,10 +1844,12 @@ xlog_write_iclog(
 	 * writeback throttle from throttling log writes behind background
 	 * metadata writeback and causing priority inversions.
 	 */
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
-				REQ_IDLE | REQ_FUA;
-	if (need_flush)
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
+		iclog->ic_bio.bi_opf |= REQ_FUA;
+	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -1951,7 +1952,7 @@ xlog_sync(
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
 	uint64_t		bno;
 	unsigned int		size;
-	bool			need_flush = true, split = false;
+	bool			split = false;
 
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
 
@@ -2009,13 +2010,14 @@ xlog_sync(
 	 * synchronously here; for an internal log we can simply use the block
 	 * layer state machine for preflushes.
 	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
+	if (log->l_targ != log->l_mp->m_ddev_targp ||
+	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
 		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
-		need_flush = false;
+		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 	}
 
 	xlog_verify_iclog(log, iclog, count);
-	xlog_write_iclog(log, iclog, bno, count, need_flush);
+	xlog_write_iclog(log, iclog, bno, count);
 }
 
 /*
@@ -2469,10 +2471,21 @@ xlog_write(
 		ASSERT(log_offset <= iclog->ic_size - 1);
 		ptr = iclog->ic_datap + log_offset;
 
-		/* start_lsn is the first lsn written to. That's all we need. */
+		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
+		/*
+		 * iclogs containing commit records or unmount records need
+		 * to issue ordering cache flushes and commit immediately
+		 * to stable storage to guarantee journal vs metadata ordering
+		 * is correctly maintained in the storage media.
+		 */
+		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
+			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
+						XLOG_ICL_NEED_FUA);
+		}
+
 		/*
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 4093d2d0db7c..370da7c2bfc8 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -894,10 +894,15 @@ xlog_cil_push_work(
 
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog.
+	 * iclogs to complete before we submit the commit_iclog. If it is in the
+	 * same iclog as the start of the checkpoint, then we can skip the iclog
+	 * cache flush because there are no other iclogs we need to order
+	 * against.
 	 */
 	if (ctx->start_lsn != commit_lsn)
 		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
+	else
+		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 10a41b1dd895..24acdc54e44e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -133,6 +133,9 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
@@ -201,6 +204,7 @@ typedef struct xlog_in_core {
 	u32			ic_size;
 	u32			ic_offset;
 	enum xlog_iclog_state	ic_state;
+	unsigned int		ic_flags;
 	char			*ic_datap;	/* pointer to iclog data */
 
 	/* Callback structures need their own cacheline */
-- 
2.28.0



* [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards
  2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
                   ` (6 preceding siblings ...)
  2021-02-23  3:34 ` [PATCH 7/8] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-02-23  3:34 ` Dave Chinner
  2021-02-24 21:18   ` Darrick J. Wong
  7 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  3:34 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A hang with tasks stuck on the CIL hard throttle was reported and
largely diagnosed by Donald Buczek, who discovered that it was a
result of the CIL context space usage decrementing in committed
transactions once the hard throttle limit had been hit and processes
were already blocked.  This resulted in the CIL push not waking up
those waiters because the CIL context was no longer over the hard
throttle limit.

The surprising aspect of this was the CIL space usage going
backwards regularly enough to trigger this situation. Assumptions
had been made in design that the relogging process would only
increase the size of the objects in the CIL, and so that space would
only increase.

This change and commit message fixes the issue and documents the
result of an audit of the triggers that can cause the CIL space to
go backwards, how large the backwards steps tend to be, the
frequency in which they occur, and what the impact on the CIL
accounting code is.

Even though the CIL ctx->space_used can go backwards, it will only
do so if the log item is already logged to the CIL and contains a
space reservation for its entire logged state. This is tracked by
the shadow buffer state on the log item. If the item is not
previously logged in the CIL it has no shadow buffer nor log vector,
and hence the entire size of the logged item copied to the log
vector is accounted to the CIL space usage. i.e.  it will always go
up in this case.

If the item has a log vector (i.e. already in the CIL) and the size
decreases, then the existing log vector will be overwritten and the
space usage will go down. This is the only condition where the space
usage reduces, and it can only occur when an item is already tracked
in the CIL. Hence we are safe from CIL space usage underruns as a
result of log items decreasing in size when they are relogged.

Typically this reduction in CIL usage occurs from metadata blocks
being freed, such as when a btree block merge occurs or a directory
entry/xattr entry is removed and the da-tree is reduced in size. This
generally results in a reduction in size of around a single block in
the CIL, but also tends to increase the number of log vectors because
the parent and sibling nodes in the tree need to be updated when a
btree block is removed. If a multi-level merge occurs, then we see a
reduction in size of 2+ blocks, but again the log vector count goes
up.

The other vector is inode fork size changes, which only log the
current size of the fork and ignore the previously logged size when
the fork is relogged. Hence if we are removing items from the inode
fork (dir/xattr removal in shortform, extent record removal in
extent form, etc) the relogged size of the inode fork can decrease.

No other log items can decrease in size either because they are a
fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
an intent actually creates a new intent log item and doesn't relog
the old item at all.) Hence the only two vectors for CIL context
size reduction are relogging inode forks and marking buffers active
in the CIL as stale.

Long story short: the majority of the code does the right thing and
handles the reduction in log item size correctly, and only the CIL
hard throttle implementation is problematic and needs fixing. This
patch makes that fix, as well as adds comments in the log item code
that result in items shrinking in size when they are relogged as a
clear reminder that this can and does happen frequently.

The throttle fix is based upon the change Donald proposed, though it
goes further to ensure that once the throttle is activated, it
captures all tasks until the CIL push issues a wakeup, regardless of
whether the CIL space used has gone back under the throttle
threshold.

This ensures that we prevent tasks reducing the CIL slightly under
the throttle threshold and then making more changes that push it
well over the throttle limit. This is achieved by checking if the
throttle wait queue is already active as a condition of throttling.
Hence once we start throttling, we continue to apply the throttle
until the CIL context push wakes everything on the wait queue.

We can use waitqueue_active() for the waitqueue manipulations and
checks as they are all done under the ctx->xc_push_lock. Hence the
waitqueue has external serialisation and we can safely peek inside
the wait queue without holding the internal waitqueue locks.
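
The resulting throttle check looks roughly like this (a sketch of the
xlog_cil_push_background() change in the hunk below):

	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
	    waitqueue_active(&cil->xc_push_wait)) {
		/* cil->xc_push_lock is held, so the lockless check is safe */
		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
		return;
	}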

Many thanks to Donald for his diagnostic and analysis work to
isolate the cause of this hang.

Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/xfs_buf_item.c   | 37 ++++++++++++++++++-------------------
 fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
 fs/xfs/xfs_log_cil.c    | 22 +++++++++++++++++-----
 3 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index dc0be2a639cc..17960b1ce5ef 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -56,14 +56,12 @@ xfs_buf_log_format_size(
 }
 
 /*
- * This returns the number of log iovecs needed to log the
- * given buf log item.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item segment.
  *
- * It calculates this as 1 iovec for the buf log format structure
- * and 1 for each stretch of non-contiguous chunks to be logged.
- * Contiguous chunks are logged in a single iovec.
- *
- * If the XFS_BLI_STALE flag has been set, then log nothing.
+ * It calculates this as 1 iovec for the buf log format structure and 1 for each
+ * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
+ * in a single iovec.
  */
 STATIC void
 xfs_buf_item_size_segment(
@@ -119,11 +117,8 @@ xfs_buf_item_size_segment(
 }
 
 /*
- * This returns the number of log iovecs needed to log the given buf log item.
- *
- * It calculates this as 1 iovec for the buf log format structure and 1 for each
- * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
- * in a single iovec.
+ * Return the number of log iovecs and space needed to log the given buf log
+ * item.
  *
  * Discontiguous buffers need a format structure per region that is being
  * logged. This makes the changes in the buffer appear to log recovery as though
@@ -133,7 +128,11 @@ xfs_buf_item_size_segment(
  * what ends up on disk.
  *
  * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
- * format structures.
+ * format structures. If the item has previously been logged and has dirty
+ * regions, we do not relog them in stale buffers. This has the effect of
+ * reducing the size of the relogged item by the amount of dirty data tracked
+ * by the log item. This can result in the committing transaction reducing the
+ * amount of space being consumed by the CIL.
  */
 STATIC void
 xfs_buf_item_size(
@@ -147,9 +146,9 @@ xfs_buf_item_size(
 	ASSERT(atomic_read(&bip->bli_refcount) > 0);
 	if (bip->bli_flags & XFS_BLI_STALE) {
 		/*
-		 * The buffer is stale, so all we need to log
-		 * is the buf log format structure with the
-		 * cancel flag in it.
+		 * The buffer is stale, so all we need to log is the buf log
+		 * format structure with the cancel flag in it as we are never
+		 * going to replay the changes tracked in the log item.
 		 */
 		trace_xfs_buf_item_size_stale(bip);
 		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
@@ -164,9 +163,9 @@ xfs_buf_item_size(
 
 	if (bip->bli_flags & XFS_BLI_ORDERED) {
 		/*
-		 * The buffer has been logged just to order it.
-		 * It is not being included in the transaction
-		 * commit, so no vectors are used at all.
+		 * The buffer has been logged just to order it. It is not being
+		 * included in the transaction commit, so no vectors are used at
+		 * all.
 		 */
 		trace_xfs_buf_item_size_ordered(bip);
 		*nvecs = XFS_LOG_VEC_ORDERED;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 17e20a6d8b4e..6ff91e5bf3cd 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_inode_log_item, ili_item);
 }
 
+/*
+ * The logged size of an inode fork is always the current size of the inode
+ * fork. This means that when an inode fork is relogged, the size of the logged
+ * region is determined by the current state, not the combination of the
+ * previously logged state + the current state. This is different relogging
+ * behaviour to most other log items which will retain the size of the
+ * previously logged changes when smaller regions are relogged.
+ *
+ * Hence operations that remove data from the inode fork (e.g. shortform
+ * dir/attr remove, extent form extent removal, etc), the size of the relogged
+ * inode gets -smaller- rather than stays the same size as the previously logged
+ * size and this can result in the committing transaction reducing the amount of
+ * space being consumed by the CIL.
+ */
 STATIC void
 xfs_inode_item_data_fork_size(
 	struct xfs_inode_log_item *iip,
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 370da7c2bfc8..0a00c3c9610c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -669,9 +669,14 @@ xlog_cil_push_work(
 	ASSERT(push_seq <= ctx->sequence);
 
 	/*
-	 * Wake up any background push waiters now this context is being pushed.
+	 * As we are about to switch to a new, empty CIL context, we no longer
+	 * need to throttle tasks on CIL space overruns. Wake any waiters that
+	 * the hard push throttle may have caught so they can start committing
+	 * to the new context. The ctx->xc_push_lock provides the serialisation
+	 * necessary for safely using the lockless waitqueue_active() check in
+	 * this context.
 	 */
-	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+	if (waitqueue_active(&cil->xc_push_wait))
 		wake_up_all(&cil->xc_push_wait);
 
 	/*
@@ -941,7 +946,7 @@ xlog_cil_push_background(
 	ASSERT(!list_empty(&cil->xc_cil));
 
 	/*
-	 * don't do a background push if we haven't used up all the
+	 * Don't do a background push if we haven't used up all the
 	 * space available yet.
 	 */
 	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
@@ -965,9 +970,16 @@ xlog_cil_push_background(
 
 	/*
 	 * If we are well over the space limit, throttle the work that is being
-	 * done until the push work on this context has begun.
+	 * done until the push work on this context has begun. Enforce the hard
+	 * throttle on all transaction commits once it has been activated, even
+	 * if the committing transactions have resulted in the space usage
+	 * dipping back down under the hard limit.
+	 *
+	 * The ctx->xc_push_lock provides the serialisation necessary for safely
+	 * using the lockless waitqueue_active() check in this context.
 	 */
-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
+	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
+	    waitqueue_active(&cil->xc_push_wait)) {
 		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
 		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
 		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
-- 
2.28.0



* Re: [PATCH 4/8] xfs: async blkdev cache flush
  2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
@ 2021-02-23  5:29   ` Chaitanya Kulkarni
  2021-02-23 14:02   ` Chandan Babu R
  2021-02-24 20:51   ` Darrick J. Wong
  2 siblings, 0 replies; 59+ messages in thread
From: Chaitanya Kulkarni @ 2021-02-23  5:29 UTC (permalink / raw)
  To: Dave Chinner, linux-xfs

On 2/22/21 19:35, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
>
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
>
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 30 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  1 +
>  2 files changed, 31 insertions(+)
>
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 5abf653a45d4..d55420bc72b5 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -67,3 +67,33 @@ xfs_flush_bdev(
>  	blkdev_issue_flush(bdev, GFP_NOFS);
>  }
>  
> +void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +	bio_put(bio);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the caller needs to wait
> + * for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.
> + * The caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct bio *bio;
> +
> +	bio = bio_alloc(GFP_NOFS, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +        bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
nit: the above line needs to be aligned with the rest of the code; this can
be done at the time of applying the patch.
> +	submit_bio(bio);
> +}
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index e94a2aeefee8..293ff2355e80 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -197,6 +197,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
>  void xfs_flush_bdev(struct block_device *bdev);
> +void xfs_flush_bdev_async(struct block_device *bdev, struct completion *done);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  3:34 ` [PATCH 7/8] xfs: journal IO cache flush reductions Dave Chinner
@ 2021-02-23  8:05   ` Dave Chinner
  2021-02-24 12:27     ` Chandan Babu R
                       ` (4 more replies)
  0 siblings, 5 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-23  8:05 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
guarantee the ordering requirements the journal has w.r.t. metadata
writeback. The two ordering constraints are:

1. we cannot overwrite metadata in the journal until we guarantee
that the dirty metadata has been written back in place and is
stable.

2. we cannot write back dirty metadata until it has been written to
the journal and guaranteed to be stable (and hence recoverable) in
the journal.

The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
causes the journal IO to issue a cache flush and wait for it to
complete before issuing the write IO to the journal. Hence all
completed metadata IO is guaranteed to be stable before the journal
overwrites the old metadata.

The ordering guarantees of #2 are provided by the REQ_FUA, which
ensures the journal writes do not complete until they are on stable
storage. Hence by the time the last journal IO in a checkpoint
completes, we know that the entire checkpoint is on stable storage
and we can unpin the dirty metadata and allow it to be written back.
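
As a simplified sketch of the current xlog_write_iclog() flag setup
(trimmed to the relevant lines, the code this patch removes), every
iclog bio ends up tagged with both flags:

	/* current code: every journal write both flushes and FUAs */
	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
				REQ_IDLE | REQ_FUA;
	if (need_flush)
		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;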

This is the mechanism by which ordering was first implemented in XFS
way back in 2002 by this commit:

commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
Author: Steve Lord <lord@sgi.com>
Date:   Fri May 24 14:30:21 2002 +0000

    Add support for drive write cache flushing - should the kernel
    have the infrastructure

A lot has changed since then, most notably we now use delayed
logging to checkpoint the filesystem to the journal rather than
write each individual transaction to the journal. Cache flushes on
journal IO are necessary when individual transactions are wholly
contained within a single iclog. However, CIL checkpoints are single
transactions that typically span hundreds to thousands of individual
journal writes, and so the requirements for device cache flushing
have changed.

That is, the ordering rules I state above apply to ordering of
atomic transactions recorded in the journal, not to the journal IO
itself. Hence we need to ensure metadata is stable before we start
writing a new transaction to the journal (guarantee #1), and we need
to ensure the entire transaction is stable in the journal before we
start metadata writeback (guarantee #2).

Hence we only need a REQ_PREFLUSH on the journal IO that starts a
new journal transaction to provide #1, and it is not needed on any other
journal IO done within the context of that journal transaction.

The CIL checkpoint already issues a cache flush before it starts
writing to the log, so we no longer need the iclog IO to issue a
REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
to xlog_write(), we no longer need to mark the first iclog in
the log write with REQ_PREFLUSH for this case.

Given the new ordering semantics of commit records for the CIL, we
need iclogs containing commit records to issue a REQ_PREFLUSH. We also
require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
the first iclog being written with REQ_PREFLUSH.

For both commit records and unmount records, we also want them
immediately on stable storage, so we also mark the iclogs
that contain these records with REQ_FUA. That means if a
record is split across multiple iclogs, they are all marked REQ_FUA
and not just the last one so that when the transaction is completed
all the parts of the record are on stable storage.

As an optimisation, when the commit record lands in the same iclog
as the journal transaction starts, we don't need to wait for
anything and can simply use REQ_FUA to provide guarantee #2.  This
means that for fsync() heavy workloads, the cache flush behaviour is
completely unchanged and there is no degradation in performance as a
result of optimising the multi-IO transaction case.
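
In sketch form, the flag decisions introduced below reduce to the
following (a summary, not the exact code):

	/* xlog_write(): commit/unmount records order and persist the iclog */
	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;

	/* xlog_cil_push_work(): commit in the start iclog needs no flush */
	if (ctx->start_lsn != commit_lsn)
		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
	else
		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;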

The most notable sign that there is less IO latency on my test
machine (nvme SSDs) is that the "noiclogs" rate has dropped
substantially. This metric indicates that the CIL push is blocking
in xlog_get_iclog_space() waiting for iclog IO completion to occur.
With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
is blocking waiting for log IO. With the changes in this patch, this
drops to 1 noiclog event for every 100 iclog writes. Hence it is
clear that log IO is completing much faster than it was previously,
but it is also clear that for large iclog sizes, this isn't the
performance limiting factor on this hardware.

With smaller iclogs (32kB), however, there is a substantial
difference. With the cache flush modifications, the journal is now
running at over 4000 write IOPS, and the journal throughput is
largely identical to the 256kB iclogs and the noiclog event rate
stays low at about 1:50 iclog writes. The existing code tops out at
about 2500 IOPS as the number of cache flushes dominate performance
and latency. The noiclog event rate is about 1:4, and the
performance variance is quite large as the journal throughput can
fall to less than half the peak sustained rate when the cache flush
rate prevents metadata writeback from keeping up and the log runs
out of space and throttles reservations.

As a result:

	logbsize	fsmark create rate	rm -rf
before	32kb		152851+/-5.3e+04	5m28s
patched	32kb		221533+/-1.1e+04	5m24s

before	256kb		220239+/-6.2e+03	4m58s
patched	256kb		228286+/-9.2e+03	5m06s

The rm -rf times are included because I ran them, but the
differences are largely noise. This workload is largely metadata
read IO latency bound and the changes to the journal cache flushing
don't really make any noticeable difference to behaviour apart from
a reduction in noiclog events from background CIL pushing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- repost manually without git/guilt mangling the patch author
- fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
  ordering bug in generic/45[57] until testing the CIL pipelining
  changes much later in the series.

 fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
 fs/xfs/xfs_log_cil.c  |  7 ++++++-
 fs/xfs/xfs_log_priv.h |  4 ++++
 3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 6c3fb6dcb505..08d68a6161ae 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1806,8 +1806,7 @@ xlog_write_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog,
 	uint64_t		bno,
-	unsigned int		count,
-	bool			need_flush)
+	unsigned int		count)
 {
 	ASSERT(bno < log->l_logBBsize);
 
@@ -1845,10 +1844,12 @@ xlog_write_iclog(
 	 * writeback throttle from throttling log writes behind background
 	 * metadata writeback and causing priority inversions.
 	 */
-	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
-				REQ_IDLE | REQ_FUA;
-	if (need_flush)
+	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
+		iclog->ic_bio.bi_opf |= REQ_FUA;
+	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
 		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
@@ -1951,7 +1952,7 @@ xlog_sync(
 	unsigned int		roundoff;       /* roundoff to BB or stripe */
 	uint64_t		bno;
 	unsigned int		size;
-	bool			need_flush = true, split = false;
+	bool			split = false;
 
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
 
@@ -2009,13 +2010,14 @@ xlog_sync(
 	 * synchronously here; for an internal log we can simply use the block
 	 * layer state machine for preflushes.
 	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
+	if (log->l_targ != log->l_mp->m_ddev_targp ||
+	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
 		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
-		need_flush = false;
+		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 	}
 
 	xlog_verify_iclog(log, iclog, count);
-	xlog_write_iclog(log, iclog, bno, count, need_flush);
+	xlog_write_iclog(log, iclog, bno, count);
 }
 
 /*
@@ -2469,10 +2471,21 @@ xlog_write(
 		ASSERT(log_offset <= iclog->ic_size - 1);
 		ptr = iclog->ic_datap + log_offset;
 
-		/* start_lsn is the first lsn written to. That's all we need. */
+		/* Start_lsn is the first lsn written to. */
 		if (start_lsn && !*start_lsn)
 			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 
+		/*
+		 * iclogs containing commit records or unmount records need
+		 * to issue ordering cache flushes and commit immediately
+		 * to stable storage to guarantee journal vs metadata ordering
+		 * is correctly maintained in the storage media.
+		 */
+		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
+			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
+						XLOG_ICL_NEED_FUA);
+		}
+
 		/*
 		 * This loop writes out as many regions as can fit in the amount
 		 * of space which was allocated by xlog_state_get_iclog_space().
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 4093d2d0db7c..370da7c2bfc8 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -894,10 +894,15 @@ xlog_cil_push_work(
 
 	/*
 	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog.
+	 * iclogs to complete before we submit the commit_iclog. If it is in the
+	 * same iclog as the start of the checkpoint, then we can skip the iclog
+	 * cache flush because there are no other iclogs we need to order
+	 * against.
 	 */
 	if (ctx->start_lsn != commit_lsn)
 		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
+	else
+		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
 
 	/* release the hounds! */
 	xfs_log_release_iclog(commit_iclog);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 10a41b1dd895..a77e00b7789a 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -133,6 +133,9 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
@@ -201,6 +204,7 @@ typedef struct xlog_in_core {
 	u32			ic_size;
 	u32			ic_offset;
 	enum xlog_iclog_state	ic_state;
+	unsigned int		ic_flags;
 	char			*ic_datap;	/* pointer to iclog data */
 
 	/* Callback structures need their own cacheline */

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/8] xfs: log stripe roundoff is a property of the log
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
@ 2021-02-23 10:29   ` Chandan Babu R
  2021-02-24 20:14   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-23 10:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We don't need to look at the xfs_mount and superblock every time we
> need to do an iclog roundoff calculation. The property is fixed for
> the life of the log, so store the roundoff in the log at mount time
> and use that everywhere.
>
> On a debug build:
>
> $ size fs/xfs/xfs_log.o.*
>    text	   data	    bss	    dec	    hex	filename
>   27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
>   27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched
>

The changes look good to me.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h |  3 --
>  fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h          |  2 ++
>  3 files changed, 27 insertions(+), 37 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 8bd00da6d2a4..16587219549c 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
>  #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
>  #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
>  #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
> -#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
> -                                 (log)->l_mp->m_sb.sb_logsunit)
> -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
>  
>  #define XLOG_HEADER_SIZE	512
>  
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 06041834daa3..fa284f26d10e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1399,6 +1399,11 @@ xlog_alloc_log(
>  	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
>  	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
>  
> +	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
> +		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
> +	else
> +		log->l_iclog_roundoff = BBSIZE;
> +
>  	xlog_grant_head_init(&log->l_reserve_head);
>  	xlog_grant_head_init(&log->l_write_head);
>  
> @@ -1852,29 +1857,15 @@ xlog_calc_iclog_size(
>  	uint32_t		*roundoff)
>  {
>  	uint32_t		count_init, count;
> -	bool			use_lsunit;
> -
> -	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -			log->l_mp->m_sb.sb_logsunit > 1;
>  
>  	/* Add for LR header */
>  	count_init = log->l_iclog_hsize + iclog->ic_offset;
> +	count = roundup(count_init, log->l_iclog_roundoff);
>  
> -	/* Round out the log write size */
> -	if (use_lsunit) {
> -		/* we have a v2 stripe unit to use */
> -		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
> -	} else {
> -		count = BBTOB(BTOBB(count_init));
> -	}
> -
> -	ASSERT(count >= count_init);
>  	*roundoff = count - count_init;
>  
> -	if (use_lsunit)
> -		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
> -	else
> -		ASSERT(*roundoff < BBTOB(1));
> +	ASSERT(count >= count_init);
> +	ASSERT(*roundoff < log->l_iclog_roundoff);
>  	return count;
>  }
>  
> @@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs(
>  	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
>  
>  	/* Round up to next log-sunit */
> -	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -	    log->l_mp->m_sb.sb_logsunit > 1) {
> -		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
> -		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
> +	if (log->l_iclog_roundoff > BBSIZE) {
> +		log->l_curr_block = roundup(log->l_curr_block,
> +						BTOBB(log->l_iclog_roundoff));
>  	}
>  
>  	if (log->l_curr_block >= log->l_logBBsize) {
> @@ -3404,12 +3394,11 @@ xfs_log_ticket_get(
>   * Figure out the total log space unit (in bytes) that would be
>   * required for a log ticket.
>   */
> -int
> -xfs_log_calc_unit_res(
> -	struct xfs_mount	*mp,
> +static int
> +xlog_calc_unit_res(
> +	struct xlog		*log,
>  	int			unit_bytes)
>  {
> -	struct xlog		*log = mp->m_log;
>  	int			iclog_space;
>  	uint			num_headers;
>  
> @@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res(
>  	/* for commit-rec LR header - note: padding will subsume the ophdr */
>  	unit_bytes += log->l_iclog_hsize;
>  
> -	/* for roundoff padding for transaction data and one for commit record */
> -	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
> -		/* log su roundoff */
> -		unit_bytes += 2 * mp->m_sb.sb_logsunit;
> -	} else {
> -		/* BB roundoff */
> -		unit_bytes += 2 * BBSIZE;
> -        }
> +	/* roundoff padding for transaction data and one for commit record */
> +	unit_bytes += 2 * log->l_iclog_roundoff;
>  
>  	return unit_bytes;
>  }
>  
> +int
> +xfs_log_calc_unit_res(
> +	struct xfs_mount	*mp,
> +	int			unit_bytes)
> +{
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +}
> +
>  /*
>   * Allocate and initialise a new log ticket.
>   */
> @@ -3513,7 +3504,7 @@ xlog_ticket_alloc(
>  
>  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>  
> -	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes);
>  
>  	atomic_set(&tic->t_ref, 1);
>  	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 1c6fdbf3d506..037950cf1061 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -436,6 +436,8 @@ struct xlog {
>  #endif
>  	/* log recovery lsn tracking (for buffer submission */
>  	xfs_lsn_t		l_recovery_lsn;
> +
> +	uint32_t		l_iclog_roundoff;/* padding roundoff */
>  };
>  
>  #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \


-- 
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
@ 2021-02-23 12:12   ` Chandan Babu R
  2021-02-24 20:34   ` Darrick J. Wong
  2021-03-01 15:19   ` Brian Foster
  2 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-23 12:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
>
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
>
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
>
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
>
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
>

The changes seem to be logically correct.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have been complete by the time our
> + * candidate has completed it's IO.
> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with an lsn of 0, it means IO has completed on this iclog and all previous
> + * iclogs will have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.
> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a


-- 
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
  2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
@ 2021-02-23 12:57   ` Chandan Babu R
  2021-02-24 20:45   ` Darrick J. Wong
  2021-02-25  8:36   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-23 12:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Move it to xfs_bio_io.c as we are about to add new async cache flush
> functionality that uses bios directly, so all this stuff should be
> in the same place. Rename the function to xfs_flush_bdev() to match
> the xfs_rw_bdev() function that already exists in this file.
>

Looks good to me.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 8 ++++++++
>  fs/xfs/xfs_buf.c    | 2 +-
>  fs/xfs/xfs_file.c   | 6 +++---
>  fs/xfs/xfs_linux.h  | 1 +
>  fs/xfs/xfs_log.c    | 2 +-
>  fs/xfs/xfs_super.c  | 7 -------
>  fs/xfs/xfs_super.h  | 1 -
>  7 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index e2148f2d5d6b..5abf653a45d4 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -59,3 +59,11 @@ xfs_rw_bdev(
>  		invalidate_kernel_vmap_range(data, count);
>  	return error;
>  }
> +
> +void
> +xfs_flush_bdev(
> +	struct block_device	*bdev)
> +{
> +	blkdev_issue_flush(bdev, GFP_NOFS);
> +}
> +
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f6e5235df7c9..b1d6c530c693 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1958,7 +1958,7 @@ xfs_free_buftarg(
>  	percpu_counter_destroy(&btp->bt_io_count);
>  	list_lru_destroy(&btp->bt_lru);
>  
> -	xfs_blkdev_issue_flush(btp);
> +	xfs_flush_bdev(btp->bt_bdev);
>  
>  	kmem_free(btp);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 38528e59030e..dd33ef2d0e20 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -196,9 +196,9 @@ xfs_file_fsync(
>  	 * inode size in case of an extending write.
>  	 */
>  	if (XFS_IS_REALTIME_INODE(ip))
> -		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
> +		xfs_flush_bdev(mp->m_rtdev_targp->bt_bdev);
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
>  
>  	/*
>  	 * Any inode that has dirty modifications in the log is pinned.  The
> @@ -218,7 +218,7 @@ xfs_file_fsync(
>  	 */
>  	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>  	    mp->m_logdev_targp == mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index af6be9b9ccdf..e94a2aeefee8 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
> +void xfs_flush_bdev(struct block_device *bdev);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index ff26fb46d70f..493454c98c6f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2015,7 +2015,7 @@ xlog_sync(
>  	 * layer state machine for preflushes.
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
> +		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
>  		need_flush = false;
>  	}
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 21b1d034aca3..85dd9593b40b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -339,13 +339,6 @@ xfs_blkdev_put(
>  		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>  }
>  
> -void
> -xfs_blkdev_issue_flush(
> -	xfs_buftarg_t		*buftarg)
> -{
> -	blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS);
> -}
> -
>  STATIC void
>  xfs_close_devices(
>  	struct xfs_mount	*mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index 1ca484b8357f..79cb2dece811 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -88,7 +88,6 @@ struct block_device;
>  
>  extern void xfs_quiesce_attr(struct xfs_mount *mp);
>  extern void xfs_flush_inodes(struct xfs_mount *mp);
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
>  extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
>  					   xfs_agnumber_t agcount);


-- 
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/8] xfs: async blkdev cache flush
  2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
  2021-02-23  5:29   ` Chaitanya Kulkarni
@ 2021-02-23 14:02   ` Chandan Babu R
  2021-02-24 20:51   ` Darrick J. Wong
  2 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-23 14:02 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
>
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
>
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
>

Thanks for the detailed commit message explaining the reasoning behind the
requirement for an async cache flush primitive.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 30 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  1 +
>  2 files changed, 31 insertions(+)
>
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 5abf653a45d4..d55420bc72b5 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -67,3 +67,33 @@ xfs_flush_bdev(
>  	blkdev_issue_flush(bdev, GFP_NOFS);
>  }
>
> +void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +	bio_put(bio);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the caller needs to wait
> + * for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.
> + * The caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct block_device	*bdev,
> +	struct completion	*done)
> +{
> +	struct bio *bio;
> +
> +	bio = bio_alloc(GFP_NOFS, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +        bio->bi_end_io = xfs_flush_bdev_async_endio;
> +
> +	submit_bio(bio);
> +}
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index e94a2aeefee8..293ff2355e80 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -197,6 +197,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
>  void xfs_flush_bdev(struct block_device *bdev);
> +void xfs_flush_bdev_async(struct block_device *bdev, struct completion *done);
>
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))


--
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
  2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
@ 2021-02-24  7:16   ` Chandan Babu R
  2021-02-24 20:57   ` Darrick J. Wong
  2021-02-25  8:42   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-24  7:16 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
>
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
>
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
>
> These rules apply to the atomic transactions recorded in the
> journal, not to the journal IO itself. Hence we need to ensure
> metadata is stable before we start writing a new transaction to the
> journal (guarantee #1), and we need to ensure the entire transaction
> is stable in the journal before we start metadata writeback
> (guarantee #2).
>
> The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
> being added to every iclog IO. This causes the journal IO to issue a
> cache flush and wait for it to complete before issuing the write IO
> to the journal. Hence all completed metadata IO is guaranteed to be
> stable before the journal overwrites the old metadata.
>
> However, for long running CIL checkpoints that might do a thousand
> journal IOs, we don't need every single one of these iclog IOs to
> issue a cache flush - the cache flush done before the first iclog is
> submitted is sufficient to cover the entire range in the log that
> the checkpoint will overwrite because the CIL space reservation
> guarantees the tail of the log (completed metadata) is already
> beyond the range of the checkpoint write.
>
> Hence we only need a full cache flush between closing off the CIL
> checkpoint context (i.e. when the push switches it out) and issuing
> the first journal IO. Rather than plumbing this through to the
> journal IO, we can start this cache flush the moment the CIL context
> is owned exclusively by the push worker. The cache flush can be in
> progress while we process the CIL ready for writing, hence
> reducing the latency of the initial iclog write. This is especially
> true for large checkpoints, where we might have to process hundreds
> of thousands of log vectors before we issue the first iclog write.
> In these cases, it is likely the cache flush has already been
> completed by the time we have built the CIL log vector chain.
>

Indeed, a single cache flush of the "data device" that is issued before
writing the first iclog of a CIL context is sufficient to make sure that the
metadata has really reached non-volatile storage.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index c5cc1b7ad25e..8bcacd463f06 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -656,6 +656,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_lsn_t		commit_lsn;
>  	xfs_lsn_t		push_seq;
> +	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>
>  	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -719,10 +720,24 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 *
> +	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
> +	 * cache flushes to provide this ordering guarantee, and hence for CIL
> +	 * checkpoints that require hundreds or thousands of log writes no
> +	 * longer need to issue device cache flushes to provide metadata
> +	 * writeback ordering.
> +	 */
> +	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);
> +
> +	/*
> +	 * Pull all the log vectors off the items in the CIL, and remove the
> +	 * items from the CIL. We don't need the CIL lock here because it's only
> +	 * needed on the transaction commit side which is currently locked out
> +	 * by the flush lock.
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> @@ -806,6 +821,12 @@ xlog_cil_push_work(
>  	lvhdr.lv_iovecp = &lhdr;
>  	lvhdr.lv_next = ctx->lv_chain;
>
> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);
> +
>  	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
>  	if (error)
>  		goto out_abort_free_ticket;


--
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
  2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
@ 2021-02-24  7:17   ` Chandan Babu R
  2021-02-24 20:59   ` Darrick J. Wong
  2021-02-25  8:49   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-24  7:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 09:04, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The CIL push is the only call to xlog_write that sets this variable
> to true. The other callers don't need a start rec, and they tell
> xlog_write what to do by passing the type of ophdr they need written
> in the flags field. The need_start_rec parameter essentially tells
> xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
> so get rid of the variable to do this and pass XLOG_START_TRANS as
> the flag value into xlog_write() from the CIL push.
>
> $ size fs/xfs/xfs_log.o*
>   text	   data	    bss	    dec	    hex	filename
>  27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
>  27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
>

Looks good to me.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
>  fs/xfs/xfs_log_cil.c  |  3 ++-
>  fs/xfs/xfs_log_priv.h |  3 +--
>  3 files changed, 25 insertions(+), 25 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 493454c98c6f..6c3fb6dcb505 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -871,9 +871,7 @@ xlog_wait_on_iclog_lsn(
>  static int
>  xlog_write_unmount_record(
>  	struct xlog		*log,
> -	struct xlog_ticket	*ticket,
> -	xfs_lsn_t		*lsn,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	struct xfs_unmount_log_format ulf = {
>  		.magic = XLOG_UNMOUNT_TYPE,
> @@ -890,7 +888,7 @@ xlog_write_unmount_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
> -	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
>  /*
> @@ -904,15 +902,13 @@ xlog_unmount_write(
>  	struct xfs_mount	*mp = log->l_mp;
>  	struct xlog_in_core	*iclog;
>  	struct xlog_ticket	*tic = NULL;
> -	xfs_lsn_t		lsn;
> -	uint			flags = XLOG_UNMOUNT_TRANS;
>  	int			error;
>  
>  	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
>  	if (error)
>  		goto out_err;
>  
> -	error = xlog_write_unmount_record(log, tic, &lsn, flags);
> +	error = xlog_write_unmount_record(log, tic);
>  	/*
>  	 * At this point, we're umounting anyway, so there's no point in
>  	 * transitioning log state to IOERROR. Just continue...
> @@ -1604,8 +1600,7 @@ xlog_commit_record(
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> -			   false);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2202,13 +2197,16 @@ static int
>  xlog_write_calc_vec_length(
>  	struct xlog_ticket	*ticket,
>  	struct xfs_log_vec	*log_vector,
> -	bool			need_start_rec)
> +	uint			optype)
>  {
>  	struct xfs_log_vec	*lv;
> -	int			headers = need_start_rec ? 1 : 0;
> +	int			headers = 0;
>  	int			len = 0;
>  	int			i;
>  
> +	if (optype & XLOG_START_TRANS)
> +		headers++;
> +
>  	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> @@ -2428,8 +2426,7 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			flags,
> -	bool			need_start_rec)
> +	uint			optype)
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> @@ -2457,8 +2454,9 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec);
> -	*start_lsn = 0;
> +	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
> +	if (start_lsn)
> +		*start_lsn = 0;
>  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
>  		void		*ptr;
>  		int		log_offset;
> @@ -2472,7 +2470,7 @@ xlog_write(
>  		ptr = iclog->ic_datap + log_offset;
>  
>  		/* start_lsn is the first lsn written to. That's all we need. */
> -		if (!*start_lsn)
> +		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
>  		/*
> @@ -2485,6 +2483,7 @@ xlog_write(
>  			int			copy_len;
>  			int			copy_off;
>  			bool			ordered = false;
> +			bool			wrote_start_rec = false;
>  
>  			/* ordered log vectors have no regions to write */
>  			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> @@ -2502,13 +2501,15 @@ xlog_write(
>  			 * write a start record. Only do this for the first
>  			 * iclog we write to.
>  			 */
> -			if (need_start_rec) {
> +			if (optype & XLOG_START_TRANS) {
>  				xlog_write_start_rec(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  						sizeof(struct xlog_op_header));
> +				optype &= ~XLOG_START_TRANS;
> +				wrote_start_rec = true;
>  			}
>  
> -			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
> +			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
>  			if (!ophdr)
>  				return -EIO;
>  
> @@ -2539,14 +2540,13 @@ xlog_write(
>  			}
>  			copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
> -			if (need_start_rec) {
> +			if (wrote_start_rec) {
>  				copy_len += sizeof(struct xlog_op_header);
>  				record_cnt++;
> -				need_start_rec = false;
>  			}
>  			data_cnt += contwr ? copy_len : 0;
>  
> -			error = xlog_write_copy_finish(log, iclog, flags,
> +			error = xlog_write_copy_finish(log, iclog, optype,
>  						       &record_cnt, &data_cnt,
>  						       &partial_copy,
>  						       &partial_copy_len,
> @@ -2590,7 +2590,7 @@ xlog_write(
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
>  	if (commit_iclog) {
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> +		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
>  		error = xlog_state_release_iclog(log, iclog);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 8bcacd463f06..4093d2d0db7c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -827,7 +827,8 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
> +	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
> +				XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a7ac85aaff4e..10a41b1dd895 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint flags,
> -		bool need_start_rec);
> +		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);


-- 
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
@ 2021-02-24 12:27     ` Chandan Babu R
  2021-02-24 20:32       ` Dave Chinner
  2021-02-24 21:13     ` Darrick J. Wong
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Chandan Babu R @ 2021-02-24 12:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
>
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
>
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
>
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
>
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
>
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by this commit:
>
> commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> Author: Steve Lord <lord@sgi.com>
> Date:   Fri May 24 14:30:21 2002 +0000
>
>     Add support for drive write cache flushing - should the kernel
>     have the infrastructure
>
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
>
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
>
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not needed on any other
> journal IO done within the context of that journal transaction.
>
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case.
>
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We also

We flush the data device before writing the first iclog (containing
XLOG_START_TRANS) to the disk. This satisfies the first ordering constraint
listed above. Why is it required to have another REQ_PREFLUSH when writing the
iclog containing XLOG_COMMIT_TRANS? I am guessing that it is required to
make sure that the previous iclogs (belonging to the same checkpoint
transaction) have indeed been written to the disk.

> require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
> and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
> the first iclog being written with REQ_PREFLUSH.
>
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we also mark the iclogs
> that contain these records with REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
>
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
>
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
>
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
>
> As a result:
>
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
>
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
>
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - repost manually without git/guilt mangling the patch author
> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>   ordering bug in generic/45[57] until testing the CIL pipelining
>   changes much later in the series.
>
>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  3 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6c3fb6dcb505..08d68a6161ae 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1806,8 +1806,7 @@ xlog_write_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
>  	uint64_t		bno,
> -	unsigned int		count,
> -	bool			need_flush)
> +	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
>
> @@ -1845,10 +1844,12 @@ xlog_write_iclog(
>  	 * writeback throttle from throttling log writes behind background
>  	 * metadata writeback and causing priority inversions.
>  	 */
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> -				REQ_IDLE | REQ_FUA;
> -	if (need_flush)
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
> +		iclog->ic_bio.bi_opf |= REQ_FUA;
> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -1951,7 +1952,7 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
> +	bool			split = false;
>
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>
> @@ -2009,13 +2010,14 @@ xlog_sync(
>  	 * synchronously here; for an internal log we can simply use the block
>  	 * layer state machine for preflushes.
>  	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}
>
>  	xlog_verify_iclog(log, iclog, count);
> -	xlog_write_iclog(log, iclog, bno, count, need_flush);
> +	xlog_write_iclog(log, iclog, bno, count);
>  }
>
>  /*
> @@ -2469,10 +2471,21 @@ xlog_write(
>  		ASSERT(log_offset <= iclog->ic_size - 1);
>  		ptr = iclog->ic_datap + log_offset;
>
> -		/* start_lsn is the first lsn written to. That's all we need. */
> +		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>
> +		/*
> +		 * iclogs containing commit records or unmount records need
> +		 * to issue ordering cache flushes and commit immediately
> +		 * to stable storage to guarantee journal vs metadata ordering
> +		 * is correctly maintained in the storage media.
> +		 */
> +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> +						XLOG_ICL_NEED_FUA);
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4093d2d0db7c..370da7c2bfc8 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -894,10 +894,15 @@ xlog_cil_push_work(
>
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> +	 * cache flush because there are no other iclogs we need to order
> +	 * against.
>  	 */
>  	if (ctx->start_lsn != commit_lsn)
>  		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +	else
> +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 10a41b1dd895..a77e00b7789a 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -133,6 +133,9 @@ enum xlog_iclog_state {
>
>  #define XLOG_COVER_OPS		5
>
> +#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
> +#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
> +
>  /* Ticket reservation region accounting */
>  #define XLOG_TIC_LEN_MAX	15
>
> @@ -201,6 +204,7 @@ typedef struct xlog_in_core {
>  	u32			ic_size;
>  	u32			ic_offset;
>  	enum xlog_iclog_state	ic_state;
> +	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
>
>  	/* Callback structures need their own cacheline */


--
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/8] xfs: log stripe roundoff is a property of the log
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
  2021-02-23 10:29   ` Chandan Babu R
@ 2021-02-24 20:14   ` Darrick J. Wong
  2021-02-25  8:32   ` Christoph Hellwig
  2021-03-01 15:13   ` Brian Foster
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We don't need to look at the xfs_mount and superblock every time we
> need to do an iclog roundoff calculation. The property is fixed for
> the life of the log, so store the roundoff in the log at mount time
> and use that everywhere.
> 
> On a debug build:
> 
> $ size fs/xfs/xfs_log.o.*
>    text	   data	    bss	    dec	    hex	filename
>   27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
>   27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Seems ok to me...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_log_format.h |  3 --
>  fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h          |  2 ++
>  3 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 8bd00da6d2a4..16587219549c 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
>  #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
>  #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
>  #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
> -#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
> -                                 (log)->l_mp->m_sb.sb_logsunit)
> -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
>  
>  #define XLOG_HEADER_SIZE	512
>  
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 06041834daa3..fa284f26d10e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1399,6 +1399,11 @@ xlog_alloc_log(
>  	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
>  	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
>  
> +	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
> +		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
> +	else
> +		log->l_iclog_roundoff = BBSIZE;
> +
>  	xlog_grant_head_init(&log->l_reserve_head);
>  	xlog_grant_head_init(&log->l_write_head);
>  
> @@ -1852,29 +1857,15 @@ xlog_calc_iclog_size(
>  	uint32_t		*roundoff)
>  {
>  	uint32_t		count_init, count;
> -	bool			use_lsunit;
> -
> -	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -			log->l_mp->m_sb.sb_logsunit > 1;
>  
>  	/* Add for LR header */
>  	count_init = log->l_iclog_hsize + iclog->ic_offset;
> +	count = roundup(count_init, log->l_iclog_roundoff);
>  
> -	/* Round out the log write size */
> -	if (use_lsunit) {
> -		/* we have a v2 stripe unit to use */
> -		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
> -	} else {
> -		count = BBTOB(BTOBB(count_init));
> -	}
> -
> -	ASSERT(count >= count_init);
>  	*roundoff = count - count_init;
>  
> -	if (use_lsunit)
> -		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
> -	else
> -		ASSERT(*roundoff < BBTOB(1));
> +	ASSERT(count >= count_init);
> +	ASSERT(*roundoff < log->l_iclog_roundoff);
>  	return count;
>  }
>  
> @@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs(
>  	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
>  
>  	/* Round up to next log-sunit */
> -	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -	    log->l_mp->m_sb.sb_logsunit > 1) {
> -		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
> -		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
> +	if (log->l_iclog_roundoff > BBSIZE) {
> +		log->l_curr_block = roundup(log->l_curr_block,
> +						BTOBB(log->l_iclog_roundoff));
>  	}
>  
>  	if (log->l_curr_block >= log->l_logBBsize) {
> @@ -3404,12 +3394,11 @@ xfs_log_ticket_get(
>   * Figure out the total log space unit (in bytes) that would be
>   * required for a log ticket.
>   */
> -int
> -xfs_log_calc_unit_res(
> -	struct xfs_mount	*mp,
> +static int
> +xlog_calc_unit_res(
> +	struct xlog		*log,
>  	int			unit_bytes)
>  {
> -	struct xlog		*log = mp->m_log;
>  	int			iclog_space;
>  	uint			num_headers;
>  
> @@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res(
>  	/* for commit-rec LR header - note: padding will subsume the ophdr */
>  	unit_bytes += log->l_iclog_hsize;
>  
> -	/* for roundoff padding for transaction data and one for commit record */
> -	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
> -		/* log su roundoff */
> -		unit_bytes += 2 * mp->m_sb.sb_logsunit;
> -	} else {
> -		/* BB roundoff */
> -		unit_bytes += 2 * BBSIZE;
> -        }
> +	/* roundoff padding for transaction data and one for commit record */
> +	unit_bytes += 2 * log->l_iclog_roundoff;
>  
>  	return unit_bytes;
>  }
>  
> +int
> +xfs_log_calc_unit_res(
> +	struct xfs_mount	*mp,
> +	int			unit_bytes)
> +{
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +}
> +
>  /*
>   * Allocate and initialise a new log ticket.
>   */
> @@ -3513,7 +3504,7 @@ xlog_ticket_alloc(
>  
>  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>  
> -	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes);
>  
>  	atomic_set(&tic->t_ref, 1);
>  	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 1c6fdbf3d506..037950cf1061 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -436,6 +436,8 @@ struct xlog {
>  #endif
>  	/* log recovery lsn tracking (for buffer submission */
>  	xfs_lsn_t		l_recovery_lsn;
> +
> +	uint32_t		l_iclog_roundoff;/* padding roundoff */
>  };
>  
>  #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
> -- 
> 2.28.0
> 
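
For a concrete feel for the roundoff calculation introduced above, here is a
minimal userspace sketch (the values are arbitrary and the roundup() define
simply mirrors the kernel macro):

#include <stdio.h>

/* userspace stand-in for the kernel's roundup() macro */
#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

int main(void)
{
	unsigned int l_iclog_roundoff = 32768;	/* e.g. a 32k log stripe unit */
	unsigned int count_init = 17000;	/* LR header + iclog data */
	unsigned int count = roundup(count_init, l_iclog_roundoff);

	/* prints "count 32768 roundoff 15768" */
	printf("count %u roundoff %u\n", count, count - count_init);
	return 0;
}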

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-24 12:27     ` Chandan Babu R
@ 2021-02-24 20:32       ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-24 20:32 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 05:57:20PM +0530, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. The two ordering constraints are:
> >
> > 1. we cannot overwrite metadata in the journal until we guarantee
> > that the dirty metadata has been written back in place and is
> > stable.
> >
> > 2. we cannot write back dirty metadata until it has been written to
> > the journal and guaranteed to be stable (and hence recoverable) in
> > the journal.
> >
> > The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> > causes the journal IO to issue a cache flush and wait for it to
> > complete before issuing the write IO to the journal. Hence all
> > completed metadata IO is guaranteed to be stable before the journal
> > overwrites the old metadata.
> >
> > The ordering guarantees of #2 are provided by the REQ_FUA, which
> > ensures the journal writes do not complete until they are on stable
> > storage. Hence by the time the last journal IO in a checkpoint
> > completes, we know that the entire checkpoint is on stable storage
> > and we can unpin the dirty metadata and allow it to be written back.
> >
> > This is the mechanism by which ordering was first implemented in XFS
> > way back in 2002 by this commit:
> >
> > commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> > Author: Steve Lord <lord@sgi.com>
> > Date:   Fri May 24 14:30:21 2002 +0000
> >
> >     Add support for drive write cache flushing - should the kernel
> >     have the infrastructure
> >
> > A lot has changed since then, most notably we now use delayed
> > logging to checkpoint the filesystem to the journal rather than
> > write each individual transaction to the journal. Cache flushes on
> > journal IO are necessary when individual transactions are wholly
> > contained within a single iclog. However, CIL checkpoints are single
> > transactions that typically span hundreds to thousands of individual
> > journal writes, and so the requirements for device cache flushing
> > have changed.
> >
> > That is, the ordering rules I state above apply to ordering of
> > atomic transactions recorded in the journal, not to the journal IO
> > itself. Hence we need to ensure metadata is stable before we start
> > writing a new transaction to the journal (guarantee #1), and we need
> > to ensure the entire transaction is stable in the journal before we
> > start metadata writeback (guarantee #2).
> >
> > Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> > new journal transaction to provide #1, and it is not needed on any other
> > journal IO done within the context of that journal transaction.
> >
> > The CIL checkpoint already issues a cache flush before it starts
> > writing to the log, so we no longer need the iclog IO to issue a
> > REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> > to xlog_write(), we no longer need to mark the first iclog in
> > the log write with REQ_PREFLUSH for this case.
> >
> > Given the new ordering semantics of commit records for the CIL, we
> > need iclogs containing commit records to issue a REQ_PREFLUSH. We also
> 
> We flush the data device before writing the first iclog (containing
> XLOG_START_TRANS) to the disk. This satisfies the first ordering constraint
> listed above. Why is it required to have another REQ_PREFLUSH when writing the
> iclog containing XLOG_COMMIT_TRANS? I am guessing that it is required to
> make sure that the previous iclogs (belonging to the same checkpoint
> transaction) have indeed been written to the disk.

Yes, that is correct.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
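
For readers following along, a rough sketch of the checkpoint write path once
the whole series is applied (illustrative pseudo-code based on the patches in
this thread; error handling, locking and the unmount path are omitted):

	/* guarantee #1: the CIL push has already flushed the data device
	 * (xfs_flush_bdev_async() + wait_for_completion(), see patch 5/8),
	 * so the start and intermediate iclogs are plain writes with no
	 * REQ_PREFLUSH or REQ_FUA */
	xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, XLOG_START_TRANS);

	/* the commit record iclog is marked XLOG_ICL_NEED_FLUSH |
	 * XLOG_ICL_NEED_FUA and hence issued as REQ_PREFLUSH | REQ_FUA:
	 * the PREFLUSH orders all earlier iclogs of this checkpoint (the
	 * point discussed above), the FUA makes the commit record itself
	 * stable (guarantee #2) */
	xlog_commit_record(log, tic, &commit_iclog, &commit_lsn);

	/* wait for the prior checkpoint iclogs, unless the checkpoint fits
	 * in a single iclog, in which case the extra flush can be dropped */
	if (ctx->start_lsn != commit_lsn)
		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
	else
		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;

	xfs_log_release_iclog(commit_iclog);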

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
  2021-02-23 12:12   ` Chandan Babu R
@ 2021-02-24 20:34   ` Darrick J. Wong
  2021-02-24 21:44     ` Dave Chinner
  2021-03-01 15:19   ` Brian Foster
  2 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes requires completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have completed by the time our
> + * candidate has completed its IO.

Hmm, I guess this means that iclog header lsns are supposed to increase
as one walks forwards through the list?

> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with an lsn of 0, it means IO has completed on this iclog and all previous
> + * iclogs will have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.

I don't see an explicit check for an iclog with a zero lsn?  Is that
implied by XLOG_STATE_ACTIVE?

Also, do you have any idea what Christoph was talking about wrt devices
with no-op flushes the last time this patch was posted?  This change
seems straightforward to me (assuming the answers to my two questions are
'yes') but I didn't grok what subtlety he was alluding to...?

--D

> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
  2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
  2021-02-23 12:57   ` Chandan Babu R
@ 2021-02-24 20:45   ` Darrick J. Wong
  2021-02-24 22:01     ` Dave Chinner
  2021-02-25  8:36   ` Christoph Hellwig
  2 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Move it to xfs_bio_io.c as we are about to add new async cache flush
> functionality that uses bios directly, so all this stuff should be
> in the same place. Rename the function to xfs_flush_bdev() to match
> the xfs_rw_bdev() function that already exists in this file.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

I don't get why it's necessary to consolidate the synchronous flush
function with the (future) async flush, since all the sync flush callers
go through a buftarg, including the log.  All this seems to do is shift
pointer dereferencing into callers.

Why not make the async flush function take a buftarg?

--D

> ---
>  fs/xfs/xfs_bio_io.c | 8 ++++++++
>  fs/xfs/xfs_buf.c    | 2 +-
>  fs/xfs/xfs_file.c   | 6 +++---
>  fs/xfs/xfs_linux.h  | 1 +
>  fs/xfs/xfs_log.c    | 2 +-
>  fs/xfs/xfs_super.c  | 7 -------
>  fs/xfs/xfs_super.h  | 1 -
>  7 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index e2148f2d5d6b..5abf653a45d4 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -59,3 +59,11 @@ xfs_rw_bdev(
>  		invalidate_kernel_vmap_range(data, count);
>  	return error;
>  }
> +
> +void
> +xfs_flush_bdev(
> +	struct block_device	*bdev)
> +{
> +	blkdev_issue_flush(bdev, GFP_NOFS);
> +}
> +
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f6e5235df7c9..b1d6c530c693 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1958,7 +1958,7 @@ xfs_free_buftarg(
>  	percpu_counter_destroy(&btp->bt_io_count);
>  	list_lru_destroy(&btp->bt_lru);
>  
> -	xfs_blkdev_issue_flush(btp);
> +	xfs_flush_bdev(btp->bt_bdev);
>  
>  	kmem_free(btp);
>  }
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 38528e59030e..dd33ef2d0e20 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -196,9 +196,9 @@ xfs_file_fsync(
>  	 * inode size in case of an extending write.
>  	 */
>  	if (XFS_IS_REALTIME_INODE(ip))
> -		xfs_blkdev_issue_flush(mp->m_rtdev_targp);
> +		xfs_flush_bdev(mp->m_rtdev_targp->bt_bdev);
>  	else if (mp->m_logdev_targp != mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
>  
>  	/*
>  	 * Any inode that has dirty modifications in the log is pinned.  The
> @@ -218,7 +218,7 @@ xfs_file_fsync(
>  	 */
>  	if (!log_flushed && !XFS_IS_REALTIME_INODE(ip) &&
>  	    mp->m_logdev_targp == mp->m_ddev_targp)
> -		xfs_blkdev_issue_flush(mp->m_ddev_targp);
> +		xfs_flush_bdev(mp->m_ddev_targp->bt_bdev);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index af6be9b9ccdf..e94a2aeefee8 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -196,6 +196,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
> +void xfs_flush_bdev(struct block_device *bdev);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index ff26fb46d70f..493454c98c6f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2015,7 +2015,7 @@ xlog_sync(
>  	 * layer state machine for preflushes.
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> -		xfs_blkdev_issue_flush(log->l_mp->m_ddev_targp);
> +		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
>  		need_flush = false;
>  	}
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 21b1d034aca3..85dd9593b40b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -339,13 +339,6 @@ xfs_blkdev_put(
>  		blkdev_put(bdev, FMODE_READ|FMODE_WRITE|FMODE_EXCL);
>  }
>  
> -void
> -xfs_blkdev_issue_flush(
> -	xfs_buftarg_t		*buftarg)
> -{
> -	blkdev_issue_flush(buftarg->bt_bdev, GFP_NOFS);
> -}
> -
>  STATIC void
>  xfs_close_devices(
>  	struct xfs_mount	*mp)
> diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
> index 1ca484b8357f..79cb2dece811 100644
> --- a/fs/xfs/xfs_super.h
> +++ b/fs/xfs/xfs_super.h
> @@ -88,7 +88,6 @@ struct block_device;
>  
>  extern void xfs_quiesce_attr(struct xfs_mount *mp);
>  extern void xfs_flush_inodes(struct xfs_mount *mp);
> -extern void xfs_blkdev_issue_flush(struct xfs_buftarg *);
>  extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
>  					   xfs_agnumber_t agcount);
>  
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 4/8] xfs: async blkdev cache flush
  2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
  2021-02-23  5:29   ` Chaitanya Kulkarni
  2021-02-23 14:02   ` Chandan Babu R
@ 2021-02-24 20:51   ` Darrick J. Wong
  2 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:38PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The new checkpoint cache flush mechanism requires us to issue an
> unconditional cache flush before we start a new checkpoint. We don't
> want to block for this if we can help it, and we have a fair chunk
> of CPU work to do between starting the checkpoint and issuing the
> first journal IO.
> 
> Hence it makes sense to amortise the latency cost of the cache flush
> by issuing it asynchronously and then waiting for it only when we
> need to issue the first IO in the transaction.
> 
> To do this, we need async cache flush primitives to submit the cache
> flush bio and to wait on it. The block layer has no such primitives
> for filesystems, so roll our own for the moment.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bio_io.c | 30 ++++++++++++++++++++++++++++++
>  fs/xfs/xfs_linux.h  |  1 +
>  2 files changed, 31 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
> index 5abf653a45d4..d55420bc72b5 100644
> --- a/fs/xfs/xfs_bio_io.c
> +++ b/fs/xfs/xfs_bio_io.c
> @@ -67,3 +67,33 @@ xfs_flush_bdev(
>  	blkdev_issue_flush(bdev, GFP_NOFS);
>  }
>  
> +void
> +xfs_flush_bdev_async_endio(
> +	struct bio	*bio)
> +{
> +	if (bio->bi_private)
> +		complete(bio->bi_private);
> +	bio_put(bio);
> +}
> +
> +/*
> + * Submit a request for an async cache flush to run. If the caller needs to wait
> + * for the flush completion at a later point in time, they must supply a
> + * valid completion. This will be signalled when the flush completes.
> + * The caller never sees the bio that is issued here.
> + */
> +void
> +xfs_flush_bdev_async(
> +	struct block_device	*bdev,

Not sure why this isn't a buftarg function, since (AFAICT) this is the
only caller in the ~30 patches you've sent to the list.  Is there
something else coming down the pipeline such that you only have a raw
block_device pointer?

> +	struct completion	*done)
> +{
> +	struct bio *bio;
> +
> +	bio = bio_alloc(GFP_NOFS, 0);
> +	bio_set_dev(bio, bdev);
> +	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
> +	bio->bi_private = done;
> +        bio->bi_end_io = xfs_flush_bdev_async_endio;

Weird indent here.

--D

> +
> +	submit_bio(bio);
> +}
> diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
> index e94a2aeefee8..293ff2355e80 100644
> --- a/fs/xfs/xfs_linux.h
> +++ b/fs/xfs/xfs_linux.h
> @@ -197,6 +197,7 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y)
>  int xfs_rw_bdev(struct block_device *bdev, sector_t sector, unsigned int count,
>  		char *data, unsigned int op);
>  void xfs_flush_bdev(struct block_device *bdev);
> +void xfs_flush_bdev_async(struct block_device *bdev, struct completion *done);
>  
>  #define ASSERT_ALWAYS(expr)	\
>  	(likely(expr) ? (void)0 : assfail(NULL, #expr, __FILE__, __LINE__))
> -- 
> 2.28.0
> 
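
For context, the intended usage pattern for the new primitive (the first
caller appears in the next patch of this series) looks roughly like:

	DECLARE_COMPLETION_ONSTACK(bdev_flush);

	/* kick the cache flush off early so it overlaps with CPU work */
	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);

	/* ... e.g. format the CIL into a log vector chain ... */

	/* don't issue the first journal IO until the flush has completed */
	wait_for_completion(&bdev_flush);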

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
  2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
  2021-02-24  7:16   ` Chandan Babu R
@ 2021-02-24 20:57   ` Darrick J. Wong
  2021-02-25  8:42   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
> 
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
> 
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
> 
> These rules apply to the atomic transactions recorded in the
> journal, not to the journal IO itself. Hence we need to ensure
> metadata is stable before we start writing a new transaction to the
> journal (guarantee #1), and we need to ensure the entire transaction
> is stable in the journal before we start metadata writeback
> (guarantee #2).
> 
> The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
> being added to every iclog IO. This causes the journal IO to issue a
> cache flush and wait for it to complete before issuing the write IO
> to the journal. Hence all completed metadata IO is guaranteed to be
> stable before the journal overwrites the old metadata.
> 
> However, for long running CIL checkpoints that might do a thousand
> journal IOs, we don't need every single one of these iclog IOs to
> issue a cache flush - the cache flush done before the first iclog is
> submitted is sufficient to cover the entire range in the log that
> the checkpoint will overwrite because the CIL space reservation
> guarantees the tail of the log (completed metadata) is already
> beyond the range of the checkpoint write.
> 
> Hence we only need a full cache flush between closing off the CIL
> checkpoint context (i.e. when the push switches it out) and issuing
> the first journal IO. Rather than plumbing this through to the
> journal IO, we can start this cache flush the moment the CIL context
> is owned exclusively by the push worker. The cache flush can be in
> progress while we process the CIL ready for writing, hence
> reducing the latency of the initial iclog write. This is especially
> true for large checkpoints, where we might have to process hundreds
> of thousands of log vectors before we issue the first iclog write.
> In these cases, it is likely the cache flush has already been
> completed by the time we have built the CIL log vector chain.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index c5cc1b7ad25e..8bcacd463f06 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -656,6 +656,7 @@ xlog_cil_push_work(
>  	struct xfs_log_vec	lvhdr = { NULL };
>  	xfs_lsn_t		commit_lsn;
>  	xfs_lsn_t		push_seq;
> +	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  
>  	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -719,10 +720,24 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 *
> +	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
> +	 * cache flushes to provide this ordering guarantee, and hence for CIL
> +	 * checkpoints that require hundreds or thousands of log writes no
> +	 * longer need to issue device cache flushes to provide metadata
> +	 * writeback ordering.
> +	 */
> +	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);
> +
> +	/*
> +	 * Pull all the log vectors off the items in the CIL, and remove the
> +	 * items from the CIL. We don't need the CIL lock here because it's only
> +	 * needed on the transaction commit side which is currently locked out
> +	 * by the flush lock.
>  	 */
>  	lv = NULL;
>  	num_iovecs = 0;
> @@ -806,6 +821,12 @@ xlog_cil_push_work(
>  	lvhdr.lv_iovecp = &lhdr;
>  	lvhdr.lv_next = ctx->lv_chain;
>  
> +	/*
> +	 * Before we format and submit the first iclog, we have to ensure that
> +	 * the metadata writeback ordering cache flush is complete.
> +	 */
> +	wait_for_completion(&bdev_flush);
> +
>  	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
>  	if (error)
>  		goto out_abort_free_ticket;
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
  2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
  2021-02-24  7:17   ` Chandan Babu R
@ 2021-02-24 20:59   ` Darrick J. Wong
  2021-02-25  8:49   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 20:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:40PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The CIL push is the only call to xlog_write that sets this variable
> to true. The other callers don't need a start rec, and they tell
> xlog_write what to do by passing the type of ophdr they need written
> in the flags field. The need_start_rec parameter essentially tells
> xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
> so get rid of the variable to do this and pass XLOG_START_TRANS as
> the flag value into xlog_write() from the CIL push.
> 
> $ size fs/xfs/xfs_log.o*
>   text	   data	    bss	    dec	    hex	filename
>  27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
>  27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c      | 44 +++++++++++++++++++++----------------------
>  fs/xfs/xfs_log_cil.c  |  3 ++-
>  fs/xfs/xfs_log_priv.h |  3 +--
>  3 files changed, 25 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 493454c98c6f..6c3fb6dcb505 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -871,9 +871,7 @@ xlog_wait_on_iclog_lsn(
>  static int
>  xlog_write_unmount_record(
>  	struct xlog		*log,
> -	struct xlog_ticket	*ticket,
> -	xfs_lsn_t		*lsn,
> -	uint			flags)
> +	struct xlog_ticket	*ticket)
>  {
>  	struct xfs_unmount_log_format ulf = {
>  		.magic = XLOG_UNMOUNT_TYPE,
> @@ -890,7 +888,7 @@ xlog_write_unmount_record(
>  
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
> -	return xlog_write(log, &vec, ticket, lsn, NULL, flags, false);
> +	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
>  /*
> @@ -904,15 +902,13 @@ xlog_unmount_write(
>  	struct xfs_mount	*mp = log->l_mp;
>  	struct xlog_in_core	*iclog;
>  	struct xlog_ticket	*tic = NULL;
> -	xfs_lsn_t		lsn;
> -	uint			flags = XLOG_UNMOUNT_TRANS;
>  	int			error;
>  
>  	error = xfs_log_reserve(mp, 600, 1, &tic, XFS_LOG, 0);
>  	if (error)
>  		goto out_err;
>  
> -	error = xlog_write_unmount_record(log, tic, &lsn, flags);
> +	error = xlog_write_unmount_record(log, tic);
>  	/*
>  	 * At this point, we're umounting anyway, so there's no point in
>  	 * transitioning log state to IOERROR. Just continue...
> @@ -1604,8 +1600,7 @@ xlog_commit_record(
>  	if (XLOG_FORCED_SHUTDOWN(log))
>  		return -EIO;
>  
> -	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS,
> -			   false);
> +	error = xlog_write(log, &vec, ticket, lsn, iclog, XLOG_COMMIT_TRANS);
>  	if (error)
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	return error;
> @@ -2202,13 +2197,16 @@ static int
>  xlog_write_calc_vec_length(
>  	struct xlog_ticket	*ticket,
>  	struct xfs_log_vec	*log_vector,
> -	bool			need_start_rec)
> +	uint			optype)
>  {
>  	struct xfs_log_vec	*lv;
> -	int			headers = need_start_rec ? 1 : 0;
> +	int			headers = 0;
>  	int			len = 0;
>  	int			i;
>  
> +	if (optype & XLOG_START_TRANS)
> +		headers++;
> +
>  	for (lv = log_vector; lv; lv = lv->lv_next) {
>  		/* we don't write ordered log vectors */
>  		if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED)
> @@ -2428,8 +2426,7 @@ xlog_write(
>  	struct xlog_ticket	*ticket,
>  	xfs_lsn_t		*start_lsn,
>  	struct xlog_in_core	**commit_iclog,
> -	uint			flags,
> -	bool			need_start_rec)
> +	uint			optype)
>  {
>  	struct xlog_in_core	*iclog = NULL;
>  	struct xfs_log_vec	*lv = log_vector;
> @@ -2457,8 +2454,9 @@ xlog_write(
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>  	}
>  
> -	len = xlog_write_calc_vec_length(ticket, log_vector, need_start_rec);
> -	*start_lsn = 0;
> +	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
> +	if (start_lsn)
> +		*start_lsn = 0;
>  	while (lv && (!lv->lv_niovecs || index < lv->lv_niovecs)) {
>  		void		*ptr;
>  		int		log_offset;
> @@ -2472,7 +2470,7 @@ xlog_write(
>  		ptr = iclog->ic_datap + log_offset;
>  
>  		/* start_lsn is the first lsn written to. That's all we need. */
> -		if (!*start_lsn)
> +		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
>  		/*
> @@ -2485,6 +2483,7 @@ xlog_write(
>  			int			copy_len;
>  			int			copy_off;
>  			bool			ordered = false;
> +			bool			wrote_start_rec = false;
>  
>  			/* ordered log vectors have no regions to write */
>  			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> @@ -2502,13 +2501,15 @@ xlog_write(
>  			 * write a start record. Only do this for the first
>  			 * iclog we write to.
>  			 */
> -			if (need_start_rec) {
> +			if (optype & XLOG_START_TRANS) {
>  				xlog_write_start_rec(ptr, ticket);
>  				xlog_write_adv_cnt(&ptr, &len, &log_offset,
>  						sizeof(struct xlog_op_header));
> +				optype &= ~XLOG_START_TRANS;
> +				wrote_start_rec = true;
>  			}
>  
> -			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, flags);
> +			ophdr = xlog_write_setup_ophdr(log, ptr, ticket, optype);
>  			if (!ophdr)
>  				return -EIO;
>  
> @@ -2539,14 +2540,13 @@ xlog_write(
>  			}
>  			copy_len += sizeof(struct xlog_op_header);
>  			record_cnt++;
> -			if (need_start_rec) {
> +			if (wrote_start_rec) {
>  				copy_len += sizeof(struct xlog_op_header);
>  				record_cnt++;
> -				need_start_rec = false;
>  			}
>  			data_cnt += contwr ? copy_len : 0;
>  
> -			error = xlog_write_copy_finish(log, iclog, flags,
> +			error = xlog_write_copy_finish(log, iclog, optype,
>  						       &record_cnt, &data_cnt,
>  						       &partial_copy,
>  						       &partial_copy_len,
> @@ -2590,7 +2590,7 @@ xlog_write(
>  	spin_lock(&log->l_icloglock);
>  	xlog_state_finish_copy(log, iclog, record_cnt, data_cnt);
>  	if (commit_iclog) {
> -		ASSERT(flags & XLOG_COMMIT_TRANS);
> +		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
>  		error = xlog_state_release_iclog(log, iclog);
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 8bcacd463f06..4093d2d0db7c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -827,7 +827,8 @@ xlog_cil_push_work(
>  	 */
>  	wait_for_completion(&bdev_flush);
>  
> -	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0, true);
> +	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL,
> +				XLOG_START_TRANS);
>  	if (error)
>  		goto out_abort_free_ticket;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index a7ac85aaff4e..10a41b1dd895 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -480,8 +480,7 @@ void	xlog_print_tic_res(struct xfs_mount *mp, struct xlog_ticket *ticket);
>  void	xlog_print_trans(struct xfs_trans *);
>  int	xlog_write(struct xlog *log, struct xfs_log_vec *log_vector,
>  		struct xlog_ticket *tic, xfs_lsn_t *start_lsn,
> -		struct xlog_in_core **commit_iclog, uint flags,
> -		bool need_start_rec);
> +		struct xlog_in_core **commit_iclog, uint optype);
>  int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  		struct xlog_in_core **iclog, xfs_lsn_t *lsn);
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
  2021-02-24 12:27     ` Chandan Babu R
@ 2021-02-24 21:13     ` Darrick J. Wong
  2021-02-24 22:03       ` Dave Chinner
  2021-02-25  4:09     ` Chandan Babu R
                       ` (2 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 07:05:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
> 
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
> 
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
> 
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
> 
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
> 
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by this commit:
> 
> commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> Author: Steve Lord <lord@sgi.com>
> Date:   Fri May 24 14:30:21 2002 +0000
> 
>     Add support for drive write cache flushing - should the kernel
>     have the infrastructure
> 
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
> 
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
> 
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not needed on any other
> journal IO done within the context of that journal transaction.
> 
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case.
> 
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We also
> require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
> and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
> the first iclog being written with REQ_PREFLUSH.
> 
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we want to also mark the iclogs
> that contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
> 
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
> 
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
> 
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
> 
> As a result:
> 
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
> 
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
> 
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

If I've gotten this right so far, patch 4 introduces the ability to
flush the data device while we get ready to overwrite parts of the log.
This patch makes it so that writing a commit (or unmount) record to an
iclog causes that iclog write to be issued (to the log device) with at
least FUA set, and possibly PREFLUSH if we've written out more than one
iclog since the last commit?

IOWs, the data device flush is now asynchronous with CIL processing and
we've ripped out all the cache flushes and FUA writes for iclogs except
for the last one in a chain, which should maintain data integrity
requirements 1 and 2?

If that's correct,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
> Version 2:
> - repost manually without git/guilt mangling the patch author
> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>   ordering bug in generic/45[57] until testing the CIL pipelining
>   changes much later in the series.
> 
>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  3 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6c3fb6dcb505..08d68a6161ae 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1806,8 +1806,7 @@ xlog_write_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
>  	uint64_t		bno,
> -	unsigned int		count,
> -	bool			need_flush)
> +	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
>  
> @@ -1845,10 +1844,12 @@ xlog_write_iclog(
>  	 * writeback throttle from throttling log writes behind background
>  	 * metadata writeback and causing priority inversions.
>  	 */
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> -				REQ_IDLE | REQ_FUA;
> -	if (need_flush)
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
> +		iclog->ic_bio.bi_opf |= REQ_FUA;
> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -1951,7 +1952,7 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
> +	bool			split = false;
>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>  
> @@ -2009,13 +2010,14 @@ xlog_sync(
>  	 * synchronously here; for an internal log we can simply use the block
>  	 * layer state machine for preflushes.
>  	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}
>  
>  	xlog_verify_iclog(log, iclog, count);
> -	xlog_write_iclog(log, iclog, bno, count, need_flush);
> +	xlog_write_iclog(log, iclog, bno, count);
>  }
>  
>  /*
> @@ -2469,10 +2471,21 @@ xlog_write(
>  		ASSERT(log_offset <= iclog->ic_size - 1);
>  		ptr = iclog->ic_datap + log_offset;
>  
> -		/* start_lsn is the first lsn written to. That's all we need. */
> +		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> +		/*
> +		 * iclogs containing commit records or unmount records need
> +		 * to issue ordering cache flushes and commit immediately
> +		 * to stable storage to guarantee journal vs metadata ordering
> +		 * is correctly maintained in the storage media.
> +		 */
> +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> +						XLOG_ICL_NEED_FUA);
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4093d2d0db7c..370da7c2bfc8 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -894,10 +894,15 @@ xlog_cil_push_work(
>  
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> +	 * cache flush because there are no other iclogs we need to order
> +	 * against.
>  	 */
>  	if (ctx->start_lsn != commit_lsn)
>  		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +	else
> +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 10a41b1dd895..a77e00b7789a 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -133,6 +133,9 @@ enum xlog_iclog_state {
>  
>  #define XLOG_COVER_OPS		5
>  
> +#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
> +#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
> +
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  
> @@ -201,6 +204,7 @@ typedef struct xlog_in_core {
>  	u32			ic_size;
>  	u32			ic_offset;
>  	enum xlog_iclog_state	ic_state;
> +	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
>  
>  	/* Callback structures need their own cacheline */

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards
  2021-02-23  3:34 ` [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
@ 2021-02-24 21:18   ` Darrick J. Wong
  2021-02-24 22:05     ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 21:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:42PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A hang with tasks stuck on the CIL hard throttle was reported and
> largely diagnosed by Donald Buczek, who discovered that it was a
> result of the CIL context space usage decrementing in committed
> transactions once the hard throttle limit had been hit and processes
> were already blocked.  This resulted in the CIL push not waking up
> those waiters because the CIL context was no longer over the hard
> throttle limit.
> 
> The surprising aspect of this was the CIL space usage going
> backwards regularly enough to trigger this situation. Assumptions
> had been made in design that the relogging process would only
> increase the size of the objects in the CIL, and so that space would
> only increase.
> 
> This change and commit message fixes the issue and documents the
> result of an audit of the triggers that can cause the CIL space to
> go backwards, how large the backwards steps tend to be, the
> frequency in which they occur, and what the impact on the CIL
> accounting code is.
> 
> Even though the CIL ctx->space_used can go backwards, it will only
> do so if the log item is already logged to the CIL and contains a
> space reservation for its entire logged state. This is tracked by
> the shadow buffer state on the log item. If the item is not
> previously logged in the CIL it has no shadow buffer nor log vector,
> and hence the entire size of the logged item copied to the log
> vector is accounted to the CIL space usage. i.e.  it will always go
> up in this case.
> 
> If the item has a log vector (i.e. already in the CIL) and the size
> decreases, then the existing log vector will be overwritten and the
> space usage will go down. This is the only condition where the space
> usage reduces, and it can only occur when an item is already tracked
> in the CIL. Hence we are safe from CIL space usage underruns as a
> result of log items decreasing in size when they are relogged.
> 
> Typically this reduction in CIL usage occurs from metadta blocks

"metadata"...

> being freed, such as when a btree block merge
> occurs or a directory entry/xattr entry is removed and the da-tree
> is reduced in size. This generally results in a reduction in size of
> around a single block in the CIL, but also tends to increase the
> number of log vectors because the parent and sibling nodes in the
> tree need to be updated when a btree block is removed. If a
> multi-level merge occurs, then we see a reduction in size of 2+
> blocks, but again the log vector count goes up.
> 
> The other vector is inode fork size changes, which only log the
> current size of the fork and ignore the previously logged size when
> the fork is relogged. Hence if we are removing items from the inode
> fork (dir/xattr removal in shortform, extent record removal in
> extent form, etc) the relogged size of the inode fork can decrease.
> 
> No other log items can decrease in size either because they are a
> fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
> an intent actually creates a new intent log item and doesn't relog
> the old item at all.) Hence the only two vectors for CIL context
> size reduction are relogging inode forks and marking buffers active
> in the CIL as stale.
> 
> Long story short: the majority of the code does the right thing and
> handles the reduction in log item size correctly, and only the CIL
> hard throttle implementation is problematic and needs fixing. This
> patch makes that fix, as well as adds comments in the log item code
> that result in items shrinking in size when they are relogged as a
> clear reminder that this can and does happen frequently.
> 
> The throttle fix is based upon the change Donald proposed, though it
> goes further to ensure that once the throttle is activated, it
> captures all tasks until the CIL push issues a wakeup, regardless of
> whether the CIL space used has gone back under the throttle
> threshold.
> 
> This ensures that we prevent tasks reducing the CIL slightly under
> the throttle threshold and then making more changes that push it
> well over the throttle limit. This is achieved by checking if the
> throttle wait queue is already active as a condition of throttling.
> Hence once we start throttling, we continue to apply the throttle
> until the CIL context push wakes everything on the wait queue.
> 
> We can use waitqueue_active() for the waitqueue manipulations and
> checks as they are all done under the ctx->xc_push_lock. Hence the
> waitqueue has external serialisation and we can safely peek inside
> the wait queue without holding the internal waitqueue locks.
> 
> Many thanks to Donald for his diagnostic and analysis work to
> isolate the cause of this hang.
> 
> Reported-by: Donald Buczek <buczek@molgen.mpg.de>

Does this whole series fix the Donald's problem?

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

Looks ok to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_buf_item.c   | 37 ++++++++++++++++++-------------------
>  fs/xfs/xfs_inode_item.c | 14 ++++++++++++++
>  fs/xfs/xfs_log_cil.c    | 22 +++++++++++++++++-----
>  3 files changed, 49 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index dc0be2a639cc..17960b1ce5ef 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -56,14 +56,12 @@ xfs_buf_log_format_size(
>  }
>  
>  /*
> - * This returns the number of log iovecs needed to log the
> - * given buf log item.
> + * Return the number of log iovecs and space needed to log the given buf log
> + * item segment.
>   *
> - * It calculates this as 1 iovec for the buf log format structure
> - * and 1 for each stretch of non-contiguous chunks to be logged.
> - * Contiguous chunks are logged in a single iovec.
> - *
> - * If the XFS_BLI_STALE flag has been set, then log nothing.
> + * It calculates this as 1 iovec for the buf log format structure and 1 for each
> + * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
> + * in a single iovec.
>   */
>  STATIC void
>  xfs_buf_item_size_segment(
> @@ -119,11 +117,8 @@ xfs_buf_item_size_segment(
>  }
>  
>  /*
> - * This returns the number of log iovecs needed to log the given buf log item.
> - *
> - * It calculates this as 1 iovec for the buf log format structure and 1 for each
> - * stretch of non-contiguous chunks to be logged.  Contiguous chunks are logged
> - * in a single iovec.
> + * Return the number of log iovecs and space needed to log the given buf log
> + * item.
>   *
>   * Discontiguous buffers need a format structure per region that is being
>   * logged. This makes the changes in the buffer appear to log recovery as though
> @@ -133,7 +128,11 @@ xfs_buf_item_size_segment(
>   * what ends up on disk.
>   *
>   * If the XFS_BLI_STALE flag has been set, then log nothing but the buf log
> - * format structures.
> + * format structures. If the item has previously been logged and has dirty
> + * regions, we do not relog them in stale buffers. This has the effect of
> + * reducing the size of the relogged item by the amount of dirty data tracked
> + * by the log item. This can result in the committing transaction reducing the
> + * amount of space being consumed by the CIL.
>   */
>  STATIC void
>  xfs_buf_item_size(
> @@ -147,9 +146,9 @@ xfs_buf_item_size(
>  	ASSERT(atomic_read(&bip->bli_refcount) > 0);
>  	if (bip->bli_flags & XFS_BLI_STALE) {
>  		/*
> -		 * The buffer is stale, so all we need to log
> -		 * is the buf log format structure with the
> -		 * cancel flag in it.
> +		 * The buffer is stale, so all we need to log is the buf log
> +		 * format structure with the cancel flag in it as we are never
> +		 * going to replay the changes tracked in the log item.
>  		 */
>  		trace_xfs_buf_item_size_stale(bip);
>  		ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
> @@ -164,9 +163,9 @@ xfs_buf_item_size(
>  
>  	if (bip->bli_flags & XFS_BLI_ORDERED) {
>  		/*
> -		 * The buffer has been logged just to order it.
> -		 * It is not being included in the transaction
> -		 * commit, so no vectors are used at all.
> +		 * The buffer has been logged just to order it. It is not being
> +		 * included in the transaction commit, so no vectors are used at
> +		 * all.
>  		 */
>  		trace_xfs_buf_item_size_ordered(bip);
>  		*nvecs = XFS_LOG_VEC_ORDERED;
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 17e20a6d8b4e..6ff91e5bf3cd 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -28,6 +28,20 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_inode_log_item, ili_item);
>  }
>  
> +/*
> + * The logged size of an inode fork is always the current size of the inode
> + * fork. This means that when an inode fork is relogged, the size of the logged
> + * region is determined by the current state, not the combination of the
> + * previously logged state + the current state. This is different relogging
> + * behaviour to most other log items which will retain the size of the
> + * previously logged changes when smaller regions are relogged.
> + *
> + * Hence operations that remove data from the inode fork (e.g. shortform
> + * dir/attr remove, extent form extent removal, etc), the size of the relogged
> + * inode gets -smaller- rather than stays the same size as the previously logged
> + * size and this can result in the committing transaction reducing the amount of
> + * space being consumed by the CIL.
> + */
>  STATIC void
>  xfs_inode_item_data_fork_size(
>  	struct xfs_inode_log_item *iip,
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 370da7c2bfc8..0a00c3c9610c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -669,9 +669,14 @@ xlog_cil_push_work(
>  	ASSERT(push_seq <= ctx->sequence);
>  
>  	/*
> -	 * Wake up any background push waiters now this context is being pushed.
> +	 * As we are about to switch to a new, empty CIL context, we no longer
> +	 * need to throttle tasks on CIL space overruns. Wake any waiters that
> +	 * the hard push throttle may have caught so they can start committing
> +	 * to the new context. The ctx->xc_push_lock provides the serialisation
> +	 * necessary for safely using the lockless waitqueue_active() check in
> +	 * this context.
>  	 */
> -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> +	if (waitqueue_active(&cil->xc_push_wait))
>  		wake_up_all(&cil->xc_push_wait);
>  
>  	/*
> @@ -941,7 +946,7 @@ xlog_cil_push_background(
>  	ASSERT(!list_empty(&cil->xc_cil));
>  
>  	/*
> -	 * don't do a background push if we haven't used up all the
> +	 * Don't do a background push if we haven't used up all the
>  	 * space available yet.
>  	 */
>  	if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) {
> @@ -965,9 +970,16 @@ xlog_cil_push_background(
>  
>  	/*
>  	 * If we are well over the space limit, throttle the work that is being
> -	 * done until the push work on this context has begun.
> +	 * done until the push work on this context has begun. Enforce the hard
> +	 * throttle on all transaction commits once it has been activated, even
> +	 * if the committing transactions have resulted in the space usage
> +	 * dipping back down under the hard limit.
> +	 *
> +	 * The ctx->xc_push_lock provides the serialisation necessary for safely
> +	 * using the lockless waitqueue_active() check in this context.
>  	 */
> -	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
> +	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
> +	    waitqueue_active(&cil->xc_push_wait)) {
>  		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
>  		ASSERT(cil->xc_ctx->space_used < log->l_logsize);
>  		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> -- 
> 2.28.0
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-24 20:34   ` Darrick J. Wong
@ 2021-02-24 21:44     ` Dave Chinner
  2021-02-24 23:06       ` Darrick J. Wong
  2021-02-25  8:34       ` Christoph Hellwig
  0 siblings, 2 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-24 21:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 12:34:29PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To allow for iclog IO device cache flush behaviour to be optimised,
> > we first need to separate out the commit record iclog IO from the
> > rest of the checkpoint so we can wait for the checkpoint IO to
> > complete before we issue the commit record.
> > 
> > This separation is only necessary if the commit record is being
> > written into a different iclog to the start of the checkpoint as the
> > upcoming cache flushing changes requires completion ordering against
> > the other iclogs submitted by the checkpoint.
> > 
> > If the entire checkpoint and commit is in the one iclog, then they
> > are both covered by the one set of cache flush primitives on the
> > iclog and hence there is no need to separate them for ordering.
> > 
> > Otherwise, we need to wait for all the previous iclogs to complete
> > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > that the commit record iclog IO issues. This guarantees that if a
> > reader sees the commit record in the journal, they will also see the
> > entire checkpoint that commit record closes off.
> > 
> > This also provides the guarantee that when the commit record IO
> > completes, we can safely unpin all the log items in the checkpoint
> > so they can be written back because the entire checkpoint is stable
> > in the journal.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> >  fs/xfs/xfs_log_priv.h |  2 ++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index fa284f26d10e..ff26fb46d70f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > + * holds no log locks.
> > + *
> > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > + * in the range that we need to wait for and then wait for it to complete.
> > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > + * candidate iclog we need to sleep on have been complete by the time our
> > + * candidate has completed it's IO.
> 
> Hmm, I guess this means that iclog header lsns are supposed to increase
> as one walks forwards through the list?

yes, the iclogs are written sequentially to the log - we don't
switch the log->l_iclog pointer to the current active iclog until we
switch it out, and then the next iclog in the loop is physically
located at a higher lsn than the one we just switched out.

> > + *
> > + * Therefore we only need to find the first iclog that isn't clean within the
> > + * span of our flush range. If we come across a clean, newly activated iclog
> > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > + * activated iclog indicates that there are no iclogs in the range we need to
> > + * wait on and we are done searching.
> 
> I don't see an explicit check for an iclog with a zero lsn?  Is that
> implied by XLOG_STATE_ACTIVE?

It's handled by the XFS_LSN_CMP(prev_lsn, start_lsn) < 0 check. If
the prev_lsn is zero because the iclog is clean, then this check
will always be true.
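
For anyone following along, the logic being described looks roughly like
this - a sketch only, with locking, shutdown checks and the actual
function naming omitted; it assumes the iclogs form a circular list
linked via ic_prev and carry their LSN in ic_header.h_lsn:

	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
		xfs_lsn_t prev_lsn = be64_to_cpu(prev->ic_header.h_lsn);

		/*
		 * A clean, newly activated iclog has h_lsn == 0, so this
		 * comparison also terminates the walk in that case.
		 */
		if (XFS_LSN_CMP(prev_lsn, start_lsn) < 0)
			break;

		/* Highest-lsn iclog in our range; wait for its IO if needed. */
		if (prev->ic_state != XLOG_STATE_ACTIVE &&
		    prev->ic_state != XLOG_STATE_DIRTY)
			xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
		break;
	}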

> Also, do you have any idea what was Christoph talking about wrt devices
> with no-op flushes the last time this patch was posted?  This change
seems straightforward to me (assuming the answers to my two questions are
> 'yes') but I didn't grok what subtlety he was alluding to...?

He was wondering what devices benefited from this. It has no impact
on highspeed devices that do not require flushes/FUA (e.g. high end
intel optane SSDs) but those are not the devices this change is
aimed at. There are no regressions on these high end devices,
either, so they are largely irrelevant to the patch and what it
targets...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
  2021-02-24 20:45   ` Darrick J. Wong
@ 2021-02-24 22:01     ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-24 22:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 12:45:29PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 02:34:37PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Move it to xfs_bio_io.c as we are about to add new async cache flush
> > functionality that uses bios directly, so all this stuff should be
> > in the same place. Rename the function to xfs_flush_bdev() to match
> > the xfs_rw_bdev() function that already exists in this file.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> I don't get why it's necessary to consolidate the synchronous flush
> function with the (future) async flush, since all the sync flush callers
> go through a buftarg, including the log.  All this seems to do is shift
> pointer dereferencing into callers.
> 
> Why not make the async flush function take a buftarg?

Because we pretty much got rid of the buffer/buftarg abstraction
from the log IO code completely by going direct to bios. The async
flush goes direct to bios like the rest of the log code does. And
given that xfs_blkdev_issue_flush() is just a one line wrapper
around the blkdev interface, it just seems totally weird to wrap it
differently to other interfaces that go direct to the bios and block
devices.
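
(FWIW, the helper being discussed really is a one-liner; something like
the following, where the exact gfp flag is from memory and may differ:)

	void
	xfs_flush_bdev(
		struct block_device	*bdev)
	{
		blkdev_issue_flush(bdev, GFP_KERNEL);
	}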

Part of the problem here is that we've completely screwed up the
separation/abstraction of the log from the xfs_mount and buftargs.
The log just doesn't use buftargs anymore except as a holder of the
log bdev, and the only other interaction it has with them is when
the metadata device cache needs to be invalidated after log recovery
and during unmount.

It just doesn't make sense to me to have bdev flush interfaces that
the log uses hidden behind an abstraction that the rest of the
log subsystem doesn't use...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-24 21:13     ` Darrick J. Wong
@ 2021-02-24 22:03       ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-24 22:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 01:13:37PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 07:05:03PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> > guarantee the ordering requirements the journal has w.r.t. metadata
> > writeback. The two ordering constraints are:
> > 
> > 1. we cannot overwrite metadata in the journal until we guarantee
> > that the dirty metadata has been written back in place and is
> > stable.
> > 
> > 2. we cannot write back dirty metadata until it has been written to
> > the journal and guaranteed to be stable (and hence recoverable) in
> > the journal.
> > 
> > The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> > causes the journal IO to issue a cache flush and wait for it to
> > complete before issuing the write IO to the journal. Hence all
> > completed metadata IO is guaranteed to be stable before the journal
> > overwrites the old metadata.
> > 
> > The ordering guarantees of #2 are provided by the REQ_FUA, which
> > ensures the journal writes do not complete until they are on stable
> > storage. Hence by the time the last journal IO in a checkpoint
> > completes, we know that the entire checkpoint is on stable storage
> > and we can unpin the dirty metadata and allow it to be written back.

....

> If I've gotten this right so far, patch 4 introduces the ability to
> flush the data device while we get ready to overwrite parts of the log.

Yes.

> This patch makes it so that writing a commit (or unmount) record to an
> iclog causes that iclog write to be issued (to the log device) with at
> least FUA set, and possibly PREFLUSH if we've written out more than one
> iclog since the last commit?

Yes.

> IOWs, the data device flush is now asynchronous with CIL processing and
> we've ripped out all the cache flushes and FUA writes for iclogs except
> for the last one in a chain, which should maintain data integrity
> requirements 1 and 2?

Yes.

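In code terms, roughly this (a sketch of the behaviour being confirmed;
the condition name below is a placeholder, not the literal diff):

	/* xlog_write(): commit and unmount records want ordered, durable iclogs */
	if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS))
		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;

	/*
	 * CIL push: if the commit record landed in the same iclog as the
	 * start of the checkpoint, there are no earlier iclogs to order
	 * against, so the pre-flush can be dropped and REQ_FUA alone is
	 * enough.
	 */
	if (commit_record_in_start_iclog)
		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
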
> If that's correct,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards
  2021-02-24 21:18   ` Darrick J. Wong
@ 2021-02-24 22:05     ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-24 22:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 24, 2021 at 01:18:10PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 23, 2021 at 02:34:42PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > A hang with tasks stuck on the CIL hard throttle was reported and
> > largely diagnosed by Donald Buczek, who discovered that it was a
> > result of the CIL context space usage decrementing in committed
> > transactions once the hard throttle limit had been hit and processes
> > were already blocked.  This resulted in the CIL push not waking up
> > those waiters because the CIL context was no longer over the hard
> > throttle limit.
> > 
> > The surprising aspect of this was the CIL space usage going
> > backwards regularly enough to trigger this situation. Assumptions
> > had been made in design that the relogging process would only
> > increase the size of the objects in the CIL, and so that space would
> > only increase.
> > 
> > This change and commit message fixes the issue and documents the
> > result of an audit of the triggers that can cause the CIL space to
> > go backwards, how large the backwards steps tend to be, the
> > frequency in which they occur, and what the impact on the CIL
> > accounting code is.

....

> Does this whole series fix the Donald's problem?

No, just this patch is needed to fix that problem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-24 21:44     ` Dave Chinner
@ 2021-02-24 23:06       ` Darrick J. Wong
  2021-02-25  8:34       ` Christoph Hellwig
  1 sibling, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-24 23:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> On Wed, Feb 24, 2021 at 12:34:29PM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > To allow for iclog IO device cache flush behaviour to be optimised,
> > > we first need to separate out the commit record iclog IO from the
> > > rest of the checkpoint so we can wait for the checkpoint IO to
> > > complete before we issue the commit record.
> > > 
> > > This separation is only necessary if the commit record is being
> > > written into a different iclog to the start of the checkpoint as the
> > > upcoming cache flushing changes requires completion ordering against
> > > the other iclogs submitted by the checkpoint.
> > > 
> > > If the entire checkpoint and commit is in the one iclog, then they
> > > are both covered by the one set of cache flush primitives on the
> > > iclog and hence there is no need to separate them for ordering.
> > > 
> > > Otherwise, we need to wait for all the previous iclogs to complete
> > > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > > that the commit record iclog IO issues. This guarantees that if a
> > > reader sees the commit record in the journal, they will also see the
> > > entire checkpoint that commit record closes off.
> > > 
> > > This also provides the guarantee that when the commit record IO
> > > completes, we can safely unpin all the log items in the checkpoint
> > > so they can be written back because the entire checkpoint is stable
> > > in the journal.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> > >  fs/xfs/xfs_log_priv.h |  2 ++
> > >  3 files changed, 64 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index fa284f26d10e..ff26fb46d70f 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > > + * holds no log locks.
> > > + *
> > > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > > + * in the range that we need to wait for and then wait for it to complete.
> > > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > > + * candidate iclog we need to sleep on have been complete by the time our
> > > + * candidate has completed it's IO.
> > 
> > Hmm, I guess this means that iclog header lsns are supposed to increase
> > as one walks forwards through the list?
> 
> yes, the iclogs are written sequentially to the log - we don't
> switch the log->l_iclog pointer to the current active iclog until we
> switch it out, and then the next iclog in the loop is physically
> located at a higher lsn than the one we just switched out.
> 
> > > + *
> > > + * Therefore we only need to find the first iclog that isn't clean within the
> > > + * span of our flush range. If we come across a clean, newly activated iclog
> > > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > > + * iclogs will be have been completed prior to this one. Hence finding a newly
> > > + * activated iclog indicates that there are no iclogs in the range we need to
> > > + * wait on and we are done searching.
> > 
> > I don't see an explicit check for an iclog with a zero lsn?  Is that
> > implied by XLOG_STATE_ACTIVE?
> 
> It's handled by the XFS_LSN_CMP(prev_lsn, start_lsn) < 0 check. If
> the prev_lsn is zero because the iclog is clean, then this check
> will always be true.
> 
> > Also, do you have any idea what was Christoph talking about wrt devices
> > with no-op flushes the last time this patch was posted?  This change
> > seems straightforward to me (assuming the answers to my two questions are
> > 'yes') but I didn't grok what subtlety he was alluding to...?
> 
> He was wondering what devices benefited from this. It has no impact
> on highspeed devices that do not require flushes/FUA (e.g. high end
> intel optane SSDs) but those are not the devices this change is
> aimed at. There are no regressions on these high end devices,
> either, so they are largely irrelevant to the patch and what it
> targets...

Ok, that's what I thought.  It seemed fairly self-evident to me that
high speed devices wouldn't care.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
  2021-02-24 12:27     ` Chandan Babu R
  2021-02-24 21:13     ` Darrick J. Wong
@ 2021-02-25  4:09     ` Chandan Babu R
  2021-02-25  7:13       ` Chandan Babu R
  2021-03-01  5:44       ` Dave Chinner
  2021-02-25  8:58     ` Christoph Hellwig
  2021-03-01 19:29     ` Brian Foster
  4 siblings, 2 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-25  4:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
> guarantee the ordering requirements the journal has w.r.t. metadata
> writeback. The two ordering constraints are:
>
> 1. we cannot overwrite metadata in the journal until we guarantee
> that the dirty metadata has been written back in place and is
> stable.
>
> 2. we cannot write back dirty metadata until it has been written to
> the journal and guaranteed to be stable (and hence recoverable) in
> the journal.
>
> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
> causes the journal IO to issue a cache flush and wait for it to
> complete before issuing the write IO to the journal. Hence all
> completed metadata IO is guaranteed to be stable before the journal
> overwrites the old metadata.
>
> The ordering guarantees of #2 are provided by the REQ_FUA, which
> ensures the journal writes do not complete until they are on stable
> storage. Hence by the time the last journal IO in a checkpoint
> completes, we know that the entire checkpoint is on stable storage
> and we can unpin the dirty metadata and allow it to be written back.
>
> This is the mechanism by which ordering was first implemented in XFS
> way back in 2002 by this commit:
>
> commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
> Author: Steve Lord <lord@sgi.com>
> Date:   Fri May 24 14:30:21 2002 +0000
>
>     Add support for drive write cache flushing - should the kernel
>     have the infrastructure
>
> A lot has changed since then, most notably we now use delayed
> logging to checkpoint the filesystem to the journal rather than
> write each individual transaction to the journal. Cache flushes on
> journal IO are necessary when individual transactions are wholly
> contained within a single iclog. However, CIL checkpoints are single
> transactions that typically span hundreds to thousands of individual
> journal writes, and so the requirements for device cache flushing
> have changed.
>
> That is, the ordering rules I state above apply to ordering of
> atomic transactions recorded in the journal, not to the journal IO
> itself. Hence we need to ensure metadata is stable before we start
> writing a new transaction to the journal (guarantee #1), and we need
> to ensure the entire transaction is stable in the journal before we
> start metadata writeback (guarantee #2).
>
> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
> new journal transaction to provide #1, and it is not on any other
> journal IO done within the context of that journal transaction.
>
> The CIL checkpoint already issues a cache flush before it starts
> writing to the log, so we no longer need the iclog IO to issue a
> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
> to xlog_write(), we no longer need to mark the first iclog in
> the log write with REQ_PREFLUSH for this case.
>
> Given the new ordering semantics of commit records for the CIL, we
> need iclogs containing commit records to issue a REQ_PREFLUSH. We also
> require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
> and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
> the first iclog being written with REQ_PREFLUSH.
>
> For both commit records and unmount records, we also want them
> immediately on stable storage, so we want to also mark the iclogs
> that contain these records to be marked REQ_FUA. That means if a
> record is split across multiple iclogs, they are all marked REQ_FUA
> and not just the last one so that when the transaction is completed
> all the parts of the record are on stable storage.
>
> As an optimisation, when the commit record lands in the same iclog
> as the journal transaction starts, we don't need to wait for
> anything and can simply use REQ_FUA to provide guarantee #2.  This
> means that for fsync() heavy workloads, the cache flush behaviour is
> completely unchanged and there is no degradation in performance as a
> result of optimising the multi-IO transaction case.
>
> The most notable sign that there is less IO latency on my test
> machine (nvme SSDs) is that the "noiclogs" rate has dropped
> substantially. This metric indicates that the CIL push is blocking
> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
> is blocking waiting for log IO. With the changes in this patch, this
> drops to 1 noiclog event for every 100 iclog writes. Hence it is
> clear that log IO is completing much faster than it was previously,
> but it is also clear that for large iclog sizes, this isn't the
> performance limiting factor on this hardware.
>
> With smaller iclogs (32kB), however, there is a substantial
> difference. With the cache flush modifications, the journal is now
> running at over 4000 write IOPS, and the journal throughput is
> largely identical to the 256kB iclogs and the noiclog event rate
> stays low at about 1:50 iclog writes. The existing code tops out at
> about 2500 IOPS as the number of cache flushes dominate performance
> and latency. The noiclog event rate is about 1:4, and the
> performance variance is quite large as the journal throughput can
> fall to less than half the peak sustained rate when the cache flush
> rate prevents metadata writeback from keeping up and the log runs
> out of space and throttles reservations.
>
> As a result:
>
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
>
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
>
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - repost manually without git/guilt mangling the patch author
> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>   ordering bug in generic/45[57] until testing the CIL pipelining
>   changes much later in the series.
>
>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  3 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6c3fb6dcb505..08d68a6161ae 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1806,8 +1806,7 @@ xlog_write_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog,
>  	uint64_t		bno,
> -	unsigned int		count,
> -	bool			need_flush)
> +	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
>
> @@ -1845,10 +1844,12 @@ xlog_write_iclog(
>  	 * writeback throttle from throttling log writes behind background
>  	 * metadata writeback and causing priority inversions.
>  	 */
> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
> -				REQ_IDLE | REQ_FUA;
> -	if (need_flush)
> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
> +		iclog->ic_bio.bi_opf |= REQ_FUA;
> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
> @@ -1951,7 +1952,7 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
> +	bool			split = false;
>
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>
> @@ -2009,13 +2010,14 @@ xlog_sync(
>  	 * synchronously here; for an internal log we can simply use the block
>  	 * layer state machine for preflushes.
>  	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}

If a checkpoint transaction spans across 2 or more iclogs and the log is
stored on an external device, then the above would remove XLOG_ICL_NEED_FLUSH
flag from iclog->ic_flags causing xlog_write_iclog() to include only REQ_FUA
flag in the corresponding bio.

Documentation/block/writeback_cache_control.rst seems to suggest that REQ_FUA
guarantees only that the data associated with the bio is stable on disk before
I/O completion is signalled. So it looks like REQ_PREFLUSH is required in this
scenario.
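
To spell out the cache control semantics this relies on (paraphrasing
Documentation/block/writeback_cache_control.rst):

	/*
	 * REQ_FUA      - only this bio's own data is guaranteed to be on
	 *                stable media when the bio completes.
	 * REQ_PREFLUSH - the device's volatile cache is flushed before this
	 *                bio is processed, so previously *completed* writes
	 *                (here: the earlier iclogs of the checkpoint on the
	 *                log device) become stable first.
	 *
	 * Hence a commit record bio that must order against earlier iclogs
	 * on an external log device still needs REQ_PREFLUSH (or an explicit
	 * flush of the log device); flushing the data device plus REQ_FUA on
	 * the commit bio does not cover them.
	 */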

--
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-25  4:09     ` Chandan Babu R
@ 2021-02-25  7:13       ` Chandan Babu R
  2021-03-01  5:44       ` Dave Chinner
  1 sibling, 0 replies; 59+ messages in thread
From: Chandan Babu R @ 2021-02-25  7:13 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: Dave Chinner, linux-xfs

On 25 Feb 2021 at 09:39, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
>> guarantee the ordering requirements the journal has w.r.t. metadata
>> writeback. The two ordering constraints are:
>>
>> 1. we cannot overwrite metadata in the journal until we guarantee
>> that the dirty metadata has been written back in place and is
>> stable.
>>
>> 2. we cannot write back dirty metadata until it has been written to
>> the journal and guaranteed to be stable (and hence recoverable) in
>> the journal.
>>
>> The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
>> causes the journal IO to issue a cache flush and wait for it to
>> complete before issuing the write IO to the journal. Hence all
>> completed metadata IO is guaranteed to be stable before the journal
>> overwrites the old metadata.
>>
>> The ordering guarantees of #2 are provided by the REQ_FUA, which
>> ensures the journal writes do not complete until they are on stable
>> storage. Hence by the time the last journal IO in a checkpoint
>> completes, we know that the entire checkpoint is on stable storage
>> and we can unpin the dirty metadata and allow it to be written back.
>>
>> This is the mechanism by which ordering was first implemented in XFS
>> way back in 2002 by this commit:
>>
>> commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
>> Author: Steve Lord <lord@sgi.com>
>> Date:   Fri May 24 14:30:21 2002 +0000
>>
>>     Add support for drive write cache flushing - should the kernel
>>     have the infrastructure
>>
>> A lot has changed since then, most notably we now use delayed
>> logging to checkpoint the filesystem to the journal rather than
>> write each individual transaction to the journal. Cache flushes on
>> journal IO are necessary when individual transactions are wholly
>> contained within a single iclog. However, CIL checkpoints are single
>> transactions that typically span hundreds to thousands of individual
>> journal writes, and so the requirements for device cache flushing
>> have changed.
>>
>> That is, the ordering rules I state above apply to ordering of
>> atomic transactions recorded in the journal, not to the journal IO
>> itself. Hence we need to ensure metadata is stable before we start
>> writing a new transaction to the journal (guarantee #1), and we need
>> to ensure the entire transaction is stable in the journal before we
>> start metadata writeback (guarantee #2).
>>
>> Hence we only need a REQ_PREFLUSH on the journal IO that starts a
>> new journal transaction to provide #1, and it is not on any other
>> journal IO done within the context of that journal transaction.
>>
>> The CIL checkpoint already issues a cache flush before it starts
>> writing to the log, so we no longer need the iclog IO to issue a
>> REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
>> to xlog_write(), we no longer need to mark the first iclog in
>> the log write with REQ_PREFLUSH for this case.
>>
>> Given the new ordering semantics of commit records for the CIL, we
>> need iclogs containing commit records to issue a REQ_PREFLUSH. We also
>> require unmount records to do this. Hence for both XLOG_COMMIT_TRANS
>> and XLOG_UNMOUNT_TRANS xlog_write() calls we need to mark
>> the first iclog being written with REQ_PREFLUSH.
>>
>> For both commit records and unmount records, we also want them
>> immediately on stable storage, so we want to also mark the iclogs
>> that contain these records to be marked REQ_FUA. That means if a
>> record is split across multiple iclogs, they are all marked REQ_FUA
>> and not just the last one so that when the transaction is completed
>> all the parts of the record are on stable storage.
>>
>> As an optimisation, when the commit record lands in the same iclog
>> as the journal transaction starts, we don't need to wait for
>> anything and can simply use REQ_FUA to provide guarantee #2.  This
>> means that for fsync() heavy workloads, the cache flush behaviour is
>> completely unchanged and there is no degradation in performance as a
>> result of optimising the multi-IO transaction case.
>>
>> The most notable sign that there is less IO latency on my test
>> machine (nvme SSDs) is that the "noiclogs" rate has dropped
>> substantially. This metric indicates that the CIL push is blocking
>> in xlog_get_iclog_space() waiting for iclog IO completion to occur.
>> With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
>> every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
>> is blocking waiting for log IO. With the changes in this patch, this
>> drops to 1 noiclog event for every 100 iclog writes. Hence it is
>> clear that log IO is completing much faster than it was previously,
>> but it is also clear that for large iclog sizes, this isn't the
>> performance limiting factor on this hardware.
>>
>> With smaller iclogs (32kB), however, there is a substantial
>> difference. With the cache flush modifications, the journal is now
>> running at over 4000 write IOPS, and the journal throughput is
>> largely identical to the 256kB iclogs and the noiclog event rate
>> stays low at about 1:50 iclog writes. The existing code tops out at
>> about 2500 IOPS as the number of cache flushes dominate performance
>> and latency. The noiclog event rate is about 1:4, and the
>> performance variance is quite large as the journal throughput can
>> fall to less than half the peak sustained rate when the cache flush
>> rate prevents metadata writeback from keeping up and the log runs
>> out of space and throttles reservations.
>>
>> As a result:
>>
>> 	logbsize	fsmark create rate	rm -rf
>> before	32kb		152851+/-5.3e+04	5m28s
>> patched	32kb		221533+/-1.1e+04	5m24s
>>
>> before	256kb		220239+/-6.2e+03	4m58s
>> patched	256kb		228286+/-9.2e+03	5m06s
>>
>> The rm -rf times are included because I ran them, but the
>> differences are largely noise. This workload is largely metadata
>> read IO latency bound and the changes to the journal cache flushing
>> don't really make any noticeable difference to behaviour apart from
>> a reduction in noiclog events from background CIL pushing.
>>
>> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>> ---
>> Version 2:
>> - repost manually without git/guilt mangling the patch author
>> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>>   ordering bug in generic/45[57] until testing the CIL pipelining
>>   changes much later in the series.
>>
>>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>>  fs/xfs/xfs_log_priv.h |  4 ++++
>>  3 files changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
>> index 6c3fb6dcb505..08d68a6161ae 100644
>> --- a/fs/xfs/xfs_log.c
>> +++ b/fs/xfs/xfs_log.c
>> @@ -1806,8 +1806,7 @@ xlog_write_iclog(
>>  	struct xlog		*log,
>>  	struct xlog_in_core	*iclog,
>>  	uint64_t		bno,
>> -	unsigned int		count,
>> -	bool			need_flush)
>> +	unsigned int		count)
>>  {
>>  	ASSERT(bno < log->l_logBBsize);
>>
>> @@ -1845,10 +1844,12 @@ xlog_write_iclog(
>>  	 * writeback throttle from throttling log writes behind background
>>  	 * metadata writeback and causing priority inversions.
>>  	 */
>> -	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC |
>> -				REQ_IDLE | REQ_FUA;
>> -	if (need_flush)
>> +	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
>> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
>>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
>> +	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
>> +		iclog->ic_bio.bi_opf |= REQ_FUA;
>> +	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>>
>>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
>>  		xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
>> @@ -1951,7 +1952,7 @@ xlog_sync(
>>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>>  	uint64_t		bno;
>>  	unsigned int		size;
>> -	bool			need_flush = true, split = false;
>> +	bool			split = false;
>>
>>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>>
>> @@ -2009,13 +2010,14 @@ xlog_sync(
>>  	 * synchronously here; for an internal log we can simply use the block
>>  	 * layer state machine for preflushes.
>>  	 */
>> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
>> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
>> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
>> -		need_flush = false;
>> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>>  	}
>
> If a checkpoint transaction spans across 2 or more iclogs and the log is
> stored on an external device, then the above would remove XLOG_ICL_NEED_FLUSH
> flag from iclog->ic_flags causing xlog_write_iclog() to include only REQ_FUA

... would remove XLOG_ICL_NEED_FLUSH flag from *commit iclog's* ic_flags

> flag in the corresponding bio.
>
> Documentation/block/writeback_cache_control.rst seems to suggest that REQ_FUA
> guarantees only that the data associated with the bio is stable on disk before
> I/O completion is signalled. So it looks like REQ_PREFLUSH is required in this
> scenario.


-- 
chandan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/8] xfs: log stripe roundoff is a property of the log
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
  2021-02-23 10:29   ` Chandan Babu R
  2021-02-24 20:14   ` Darrick J. Wong
@ 2021-02-25  8:32   ` Christoph Hellwig
  2021-03-01 15:13   ` Brian Foster
  3 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-24 21:44     ` Dave Chinner
  2021-02-24 23:06       ` Darrick J. Wong
@ 2021-02-25  8:34       ` Christoph Hellwig
  2021-02-25 20:47         ` Dave Chinner
  2021-02-26  2:48         ` Darrick J. Wong
  1 sibling, 2 replies; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > Also, do you have any idea what was Christoph talking about wrt devices
> > with no-op flushes the last time this patch was posted?  This change
> > seems straightforward to me (assuming the answers to my two questions are
> > 'yes') but I didn't grok what subtlety he was alluding to...?
> 
> He was wondering what devices benefited from this. It has no impact
> on highspeed devices that do not require flushes/FUA (e.g. high end
> intel optane SSDs) but those are not the devices this change is
> aimed at. There are no regressions on these high end devices,
> either, so they are largely irrelevant to the patch and what it
> targets...

I don't think it is that simple.  Pretty much every device aimed at
enterprise use does not enable a volatile write cache by default.  That
also includes hard drives, arrays and NAND based SSDs.

Especially for hard drives (or slower arrays) the actual I/O wait might
matter.  What is the argument against making this conditional?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush
  2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
  2021-02-23 12:57   ` Chandan Babu R
  2021-02-24 20:45   ` Darrick J. Wong
@ 2021-02-25  8:36   ` Christoph Hellwig
  2 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Move it to xfs_bio_io.c as we are about to add new async cache flush
> functionality that uses bios directly, so all this stuff should be
> in the same place. Rename the function to xfs_flush_bdev() to match
> the xfs_rw_bdev() function that already exists in this file.

I'd rather kill it off.  Note that as of the latest Linus tree,
blkdev_issue_flush has also lost the gfp_mask argument.
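
For reference, the signature change being referred to is, as far as I know:

	/* up to and including v5.11 */
	int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask);

	/* current Linus tree */
	int blkdev_issue_flush(struct block_device *bdev);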

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
  2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
  2021-02-24  7:16   ` Chandan Babu R
  2021-02-24 20:57   ` Darrick J. Wong
@ 2021-02-25  8:42   ` Christoph Hellwig
  2021-02-25 21:07     ` Dave Chinner
  2 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

This looks ok, but please add two trivial checks that the device
actually supports/needs flushes.  All that magic of allocating a bio


On Tue, Feb 23, 2021 at 02:34:39PM +1100, Dave Chinner wrote:
>  	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
>  	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -719,10 +720,24 @@ xlog_cil_push_work(
>  	spin_unlock(&cil->xc_push_lock);
>  
>  	/*
> -	 * pull all the log vectors off the items in the CIL, and
> -	 * remove the items from the CIL. We don't need the CIL lock
> -	 * here because it's only needed on the transaction commit
> -	 * side which is currently locked out by the flush lock.
> +	 * The CIL is stable at this point - nothing new will be added to it
> +	 * because we hold the flush lock exclusively. Hence we can now issue
> +	 * a cache flush to ensure all the completed metadata in the journal we
> +	 * are about to overwrite is on stable storage.
> +	 *
> +	 * This avoids the need to have the iclogs issue REQ_PREFLUSH based
> +	 * cache flushes to provide this ordering guarantee, and hence for CIL
> +	 * checkpoints that require hundreds or thousands of log writes no
> +	 * longer need to issue device cache flushes to provide metadata
> +	 * writeback ordering.
> +	 */
> +	xfs_flush_bdev_async(log->l_mp->m_ddev_targp->bt_bdev, &bdev_flush);

This still causes a bio allocation, even if the device does not need a
flush.  Please also use bio_init on a bio passed into xfs_flush_bdev_async to
avoid that, and make the whole code conditional so it only runs if we actually
need to flush caches.
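
Something along these lines, perhaps - a rough sketch only, where the
QUEUE_FLAG_WC test is my assumption for "device actually has a volatile
write cache to flush" and the endio helper name is made up:

	static void
	xfs_flush_bdev_async_endio(
		struct bio	*bio)
	{
		complete(bio->bi_private);
	}

	void
	xfs_flush_bdev_async(
		struct bio		*bio,
		struct block_device	*bdev,
		struct completion	*done)
	{
		struct request_queue	*q = bdev_get_queue(bdev);

		/* Nothing to flush if the device has no volatile write cache. */
		if (!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
			complete(done);
			return;
		}

		bio_init(bio, NULL, 0);
		bio_set_dev(bio, bdev);
		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
		bio->bi_private = done;
		bio->bi_end_io = xfs_flush_bdev_async_endio;
		submit_bio(bio);
	}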

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
  2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
  2021-02-24  7:17   ` Chandan Babu R
  2021-02-24 20:59   ` Darrick J. Wong
@ 2021-02-25  8:49   ` Christoph Hellwig
  2021-02-25 20:55     ` Dave Chinner
  2 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> +	if (optype & XLOG_START_TRANS)
> +		headers++;

This deserves a comment.
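
Maybe something like this? (My reading of why the extra header is
accounted - treat the wording as a suggestion only.)

	/*
	 * A delayed logging checkpoint is prefixed by a start record, which
	 * is written as an additional op header into the first iclog of the
	 * checkpoint, so account for that extra header here.
	 */
	if (optype & XLOG_START_TRANS)
		headers++;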

> +	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
> +	if (start_lsn)
> +		*start_lsn = 0;

I'd slightly prefer that allowing a NULL start_lsn was a separate prep
patch.  As-is it really clutters the patch and detracts from the real
change.

>  			int			copy_len;
>  			int			copy_off;
>  			bool			ordered = false;
> +			bool			wrote_start_rec = false;
>  
>  			/* ordered log vectors have no regions to write */
>  			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> @@ -2502,13 +2501,15 @@ xlog_write(
>  			 * write a start record. Only do this for the first
>  			 * iclog we write to.
>  			 */
> -			if (need_start_rec) {
> +			if (optype & XLOG_START_TRANS) {

So this relies on the fact that the only callers that pass an optype of
XLOG_START_TRANS only write a single lv.  I think we want an assert for
that somewhere to avoid a bad surprise later.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
                       ` (2 preceding siblings ...)
  2021-02-25  4:09     ` Chandan Babu R
@ 2021-02-25  8:58     ` Christoph Hellwig
  2021-02-25 21:06       ` Dave Chinner
  2021-03-01 19:29     ` Brian Foster
  4 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2021-02-25  8:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> As a result:
> 
> 	logbsize	fsmark create rate	rm -rf
> before	32kb		152851+/-5.3e+04	5m28s
> patched	32kb		221533+/-1.1e+04	5m24s
> 
> before	256kb		220239+/-6.2e+03	4m58s
> patched	256kb		228286+/-9.2e+03	5m06s
> 
> The rm -rf times are included because I ran them, but the
> differences are largely noise. This workload is largely metadata
> read IO latency bound and the changes to the journal cache flushing
> don't really make any noticeable difference to behaviour apart from
> a reduction in noiclog events from background CIL pushing.

The 256kb rm -rf case actually seems like a regression, not in the noise
here.  Does this reproduce over multiple runs?

> @@ -2009,13 +2010,14 @@ xlog_sync(
>  	 * synchronously here; for an internal log we can simply use the block
>  	 * layer state machine for preflushes.
>  	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;

Once you touch all the buffer flags anyway we should optimize the
log wraparound case here - instead of the synchronous flush we just
need to set REQ_PREFLUSH on the first log bio, which should be nicely
doable with your infrastructure.

> +		/*
> +		 * iclogs containing commit records or unmount records need
> +		 * to issue ordering cache flushes and commit immediately
> +		 * to stable storage to guarantee journal vs metadata ordering
> +		 * is correctly maintained in the storage media.
> +		 */
> +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> +						XLOG_ICL_NEED_FUA);
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4093d2d0db7c..370da7c2bfc8 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -894,10 +894,15 @@ xlog_cil_push_work(
>  
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> +	 * cache flush because there are no other iclogs we need to order
> +	 * against.

Nit: the iclogs in the first changed line would easily fit onto the previous
line.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-25  8:34       ` Christoph Hellwig
@ 2021-02-25 20:47         ` Dave Chinner
  2021-03-01  9:09           ` Christoph Hellwig
  2021-02-26  2:48         ` Darrick J. Wong
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2021-02-25 20:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > Also, do you have any idea what was Christoph talking about wrt devices
> > > with no-op flushes the last time this patch was posted?  This change
> > > seems straightforward to me (assuming the answers to my two question are
> > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > 
> > He was wondering what devices benefited from this. It has no impact
> > on highspeed devices that do not require flushes/FUA (e.g. high end
> > intel optane SSDs) but those are not the devices this change is
> > aimed at. There are no regressions on these high end devices,
> > either, so they are largely irrelevant to the patch and what it
> > targets...
> 
> I don't think it is that simple.  Pretty much every device aimed at
> enterprise use does not enable a volatile write cache by default.  That
> also includes hard drives, arrays and NAND based SSDs.
> 
> Especially for hard drives (or slower arrays) the actual I/O wait might
> matter. 

Sorry, I/O wait might matter for what?

I'm really not sure what you're objecting to - you've hand-waved
about hardware that doesn't need cache flushes twice now and
inferred that they'd be adversely affected by removing cache
flushes. That just doesn't make any sense at all, and I have numbers
to back it up.

You also asked what storage it improved performance on and I told
you and then also pointed out all the software layers that it
massively helps, too, regardless of the physical storage
characteristics.

https://lore.kernel.org/linux-xfs/20210203212013.GV4662@dread.disaster.area/

I have numbers to back it up. You did not reply to me, so I'm not
going to waste time repeating myself here.

> What is the argument against making this conditional?

There is no argument for making this conditional. You've created an
undefined strawman and are demanding that I prove it wrong. If
you've got anything concrete, then tell us about it directly and
provide numbers.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write()
  2021-02-25  8:49   ` Christoph Hellwig
@ 2021-02-25 20:55     ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-25 20:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 09:49:20AM +0100, Christoph Hellwig wrote:
> > +	if (optype & XLOG_START_TRANS)
> > +		headers++;
> 
> This deserves a comment.

It gets killed off later, so it's a waste of time to prettify this.

> > +	len = xlog_write_calc_vec_length(ticket, log_vector, optype);
> > +	if (start_lsn)
> > +		*start_lsn = 0;
> 
> I'd slightly prefer that allowing a NULL start_lsn was a separate prep
> patch.  As-is it really clutters the patch and detracts from the real
> change.

No, I've already got enough patches in this whole series to deal
with. I'm not splitting out simple, obvious changes into tiny two
line patches that require me to do more work for zero gain.

> >  			int			copy_len;
> >  			int			copy_off;
> >  			bool			ordered = false;
> > +			bool			wrote_start_rec = false;
> >  
> >  			/* ordered log vectors have no regions to write */
> >  			if (lv->lv_buf_len == XFS_LOG_VEC_ORDERED) {
> > @@ -2502,13 +2501,15 @@ xlog_write(
> >  			 * write a start record. Only do this for the first
> >  			 * iclog we write to.
> >  			 */
> > -			if (need_start_rec) {
> > +			if (optype & XLOG_START_TRANS) {
> 
> So this relies on the fact that the only callers that passes an optype of
> XLOG_START_TRANS only writes a single lv.  I think we want an assert for
> that somewhere to avoid a bad surprise later.

This also gets killed off later, so again such things are largely a
waste of my time as all it does is cause rebase conflicts in
multiple patches  and doesn't actually change the end result. So,
again, no.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-25  8:58     ` Christoph Hellwig
@ 2021-02-25 21:06       ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-25 21:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 09:58:42AM +0100, Christoph Hellwig wrote:
> > As a result:
> > 
> > 	logbsize	fsmark create rate	rm -rf
> > before	32kb		152851+/-5.3e+04	5m28s
> > patched	32kb		221533+/-1.1e+04	5m24s
> > 
> > before	256kb		220239+/-6.2e+03	4m58s
> > patched	256kb		228286+/-9.2e+03	5m06s
> > 
> > The rm -rf times are included because I ran them, but the
> > differences are largely noise. This workload is largely metadata
> > read IO latency bound and the changes to the journal cache flushing
> > doesn't really make any noticable difference to behaviour apart from
> > a reduction in noiclog events from background CIL pushing.
> 
> The 256kb rm -rf case actually seems like a regression not in the noise
> here.  Does this reproduce over multiple runs?

It's noise. The unlink repeat times on this machine at 16 threads
are at least +/-15s because the removals are not synchronised in
groups like the creates are.

These are CPU bound workloads when the log is not limiting the
transaction rate (only the {before, 32kB} numbers in this test are
log IO bound) so there's always some variation in performance due to
non-deterministic factors like memory reclaim, AG lock-stepping
between threads, etc.

Hence there's a bit of unfairness between the threads and often the
first thread finishes 30s before the last thread. The times are for
the last thread completing and there can be significant variation on
that.

> > @@ -2009,13 +2010,14 @@ xlog_sync(
> >  	 * synchronously here; for an internal log we can simply use the block
> >  	 * layer state machine for preflushes.
> >  	 */
> > -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> > +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> > +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
> >  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> > -		need_flush = false;
> > +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> 
> Once you touch all the buffer flags anyway we should optimize the
> log wraparound case here - instead of the synchronous flush we just
> need to set REQ_PREFLUSH on the first log bio, which should be nicely
> doable with your infrastructure.

That sounds like another patch because it's a change of behaviour.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally
  2021-02-25  8:42   ` Christoph Hellwig
@ 2021-02-25 21:07     ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-02-25 21:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 09:42:02AM +0100, Christoph Hellwig wrote:
> This looks ok, but please make add two trivial checks that the device
> actually supports/needs flushes.  All that magic of allocating a bio

Ok, should be easy enough to do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-25  8:34       ` Christoph Hellwig
  2021-02-25 20:47         ` Dave Chinner
@ 2021-02-26  2:48         ` Darrick J. Wong
  2021-02-28 16:36           ` Brian Foster
  1 sibling, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2021-02-26  2:48 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > Also, do you have any idea what was Christoph talking about wrt devices
> > > with no-op flushes the last time this patch was posted?  This change
> > > seems straightforward to me (assuming the answers to my two question are
> > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > 
> > He was wondering what devices benefited from this. It has no impact
> > on highspeed devices that do not require flushes/FUA (e.g. high end
> > intel optane SSDs) but those are not the devices this change is
> > aimed at. There are no regressions on these high end devices,
> > either, so they are largely irrelevant to the patch and what it
> > targets...
> 
> I don't think it is that simple.  Pretty much every device aimed at
> enterprise use does not enable a volatile write cache by default.  That
> also includes hard drives, arrays and NAND based SSDs.
> 
> Especially for hard drives (or slower arrays) the actual I/O wait might
> matter.  What is the argument against making this conditional?

I still don't understand what you're asking about here --

AFAICT the net effect of this patchset is that it reduces the number of
preflushes and FUA log writes.  To my knowledge, on a high end device
with no volatile write cache, flushes are a no-op (because all writes
are persisted somewhere immediately) and a FUA write should be the exact
same thing as a non-FUA write.  Because XFS will now issue fewer no-op
persistence commands to the device, there should be no effect at all.

In contrast, a dumb stone tablet with a write cache hooked up to SATA
will have agonizingly slow cache flushes.  XFS will issue fewer
persistence commands to the rock, which in turn makes things faster
because we're calling the engravers less often.

What am I missing here?  Are you saying that the cost of a cache flush
goes up much faster than the amount of data that has to be flushed?

--D

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-26  2:48         ` Darrick J. Wong
@ 2021-02-28 16:36           ` Brian Foster
  2021-02-28 23:46             ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2021-02-28 16:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, Dave Chinner, linux-xfs

On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > with no-op flushes the last time this patch was posted?  This change
> > > > seems straightforward to me (assuming the answers to my two question are
> > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > 
> > > He was wondering what devices benefited from this. It has no impact
> > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > intel optane SSDs) but those are not the devices this change is
> > > aimed at. There are no regressions on these high end devices,
> > > either, so they are largely irrelevant to the patch and what it
> > > targets...
> > 
> > I don't think it is that simple.  Pretty much every device aimed at
> > enterprise use does not enable a volatile write cache by default.  That
> > also includes hard drives, arrays and NAND based SSDs.
> > 
> > Especially for hard drives (or slower arrays) the actual I/O wait might
> > matter.  What is the argument against making this conditional?
> 
> I still don't understand what you're asking about here --
> 
> AFAICT the net effect of this patchset is that it reduces the number of
> preflushes and FUA log writes.  To my knowledge, on a high end device
> with no volatile write cache, flushes are a no-op (because all writes
> are persisted somewhere immediately) and a FUA write should be the exact
> same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> persistence commands to the device, there should be no effect at all.
> 

Except the cost of the new iowaits used to implement iclog ordering...
which I think is what Christoph has been asking about..?

IOW, considering the storage configuration noted above where the impact
of the flush/fua optimizations is neutral, the net effect of this change
is whatever impact is introduced by intra-checkpoint iowaits and iclog
ordering. What is that impact?

Note that it's not clear enough to me to suggest whether that impact
might be significant or not. Hopefully it's neutral (?), but that seems
like best case scenario so I do think it's a reasonable question.

Brian

> In contrast, a dumb stone tablet with a write cache hooked up to SATA
> will have agonizingly slow cache flushes.  XFS will issue fewer
> persistence commands to the rock, which in turn makes things faster
> because we're calling the engravers less often.
> 
> What am I missing here?  Are you saying that the cost of a cache flush
> goes up much faster than the amount of data that has to be flushed?
> 
> --D
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-28 16:36           ` Brian Foster
@ 2021-02-28 23:46             ` Dave Chinner
  2021-03-01 15:33               ` Brian Foster
  0 siblings, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2021-02-28 23:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs

On Sun, Feb 28, 2021 at 11:36:13AM -0500, Brian Foster wrote:
> On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > > with no-op flushes the last time this patch was posted?  This change
> > > > > seems straightforward to me (assuming the answers to my two question are
> > > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > > 
> > > > He was wondering what devices benefited from this. It has no impact
> > > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > > intel optane SSDs) but those are not the devices this change is
> > > > aimed at. There are no regressions on these high end devices,
> > > > either, so they are largely irrelevant to the patch and what it
> > > > targets...
> > > 
> > > I don't think it is that simple.  Pretty much every device aimed at
> > > enterprise use does not enable a volatile write cache by default.  That
> > > also includes hard drives, arrays and NAND based SSDs.
> > > 
> > > Especially for hard drives (or slower arrays) the actual I/O wait might
> > > matter.  What is the argument against making this conditional?
> > 
> > I still don't understand what you're asking about here --
> > 
> > AFAICT the net effect of this patchset is that it reduces the number of
> > preflushes and FUA log writes.  To my knowledge, on a high end device
> > with no volatile write cache, flushes are a no-op (because all writes
> > are persisted somewhere immediately) and a FUA write should be the exact
> > same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> > persistence commands to the device, there should be no effect at all.
> > 
> 
> Except the cost of the new iowaits used to implement iclog ordering...
> which I think is what Christoph has been asking about..?

And I've already answered - it is largely just noise.

> IOW, considering the storage configuration noted above where the impact
> of the flush/fua optimizations is neutral, the net effect of this change
> is whatever impact is introduced by intra-checkpoint iowaits and iclog
> ordering. What is that impact?

All I've really noticed is that long tail latencies on operations go
down a bit. That seems to correlate with spending less time waiting
for log space when the log is full, but it's a marginal improvement
at best.

Otherwise I cannot measure any significant difference in performance
or behaviour across any of the metrics I monitor during performance
testing.

> Note that it's not clear enough to me to suggest whether that impact
> might be significant or not. Hopefully it's neutral (?), but that seems
> like best case scenario so I do think it's a reasonable question.

Yes, it's a reasonable question, but I answered it entirely and in
great detail the first time.  Repeating the same question multiple
times just with slightly different phrasing does not change the
answer, nor explain to me what the undocumented concern might be...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-25  4:09     ` Chandan Babu R
  2021-02-25  7:13       ` Chandan Babu R
@ 2021-03-01  5:44       ` Dave Chinner
  2021-03-01  5:56         ` Dave Chinner
  1 sibling, 1 reply; 59+ messages in thread
From: Dave Chinner @ 2021-03-01  5:44 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Thu, Feb 25, 2021 at 09:39:05AM +0530, Chandan Babu R wrote:
> On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> > @@ -2009,13 +2010,14 @@ xlog_sync(
> >  	 * synchronously here; for an internal log we can simply use the block
> >  	 * layer state machine for preflushes.
> >  	 */
> > -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> > +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> > +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
> >  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> > -		need_flush = false;
> > +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> >  	}
> 
> If a checkpoint transaction spans across 2 or more iclogs and the log is
> stored on an external device, then the above would remove the XLOG_ICL_NEED_FLUSH
> flag from iclog->ic_flags, causing xlog_write_iclog() to include only the REQ_FUA
> flag in the corresponding bio.

Yup, good catch, this is a subtle change of behaviour only for
external logs and only for the commit iclog that needs to flush the
previous log writes to stable storage. I'll rework the logic here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-03-01  5:44       ` Dave Chinner
@ 2021-03-01  5:56         ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-03-01  5:56 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Mon, Mar 01, 2021 at 04:44:53PM +1100, Dave Chinner wrote:
> On Thu, Feb 25, 2021 at 09:39:05AM +0530, Chandan Babu R wrote:
> > On 23 Feb 2021 at 13:35, Dave Chinner wrote:
> > > @@ -2009,13 +2010,14 @@ xlog_sync(
> > >  	 * synchronously here; for an internal log we can simply use the block
> > >  	 * layer state machine for preflushes.
> > >  	 */
> > > -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> > > +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> > > +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
> > >  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> > > -		need_flush = false;
> > > +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
> > >  	}
> > 
> > If a checkpoint transaction spans across 2 or more iclogs and the log is
> > stored on an external device, then the above would remove the XLOG_ICL_NEED_FLUSH
> > flag from iclog->ic_flags, causing xlog_write_iclog() to include only the REQ_FUA
> > flag in the corresponding bio.
> 
> Yup, good catch, this is a subtle change of behaviour only for
> external logs and only for the commit iclog that needs to flush the
> previous log writes to stable storage. I'll rework the logic here.

And now that I think about it, we can simply remove this code if we
put an explicit data device cache flush in the unmount record write
code. The CIL has already guaranteed metadata vs journal ordering
before we start writing the checkpoint, meaning the
XLOG_ICL_NEED_FLUSH flag only has meaning for internal iclog write
ordering, not external metadata ordering. And for the split, we can
simply clear the REQ_PREFLUSH flag from the split bio before
submitting it....

Much simpler and faster for external logs, too.
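
A rough sketch of the first half of that (illustrative only and
untested; it assumes the unmount record is still written out of a
helper like xlog_unmount_write() and reuses the existing
xfs_flush_bdev() call from the hunk above):

	/*
	 * Illustrative sketch, not part of the posted series: flush the
	 * data device explicitly before the unmount record is written so
	 * xlog_sync() no longer needs the synchronous flush for external
	 * logs.
	 */
	static void
	xlog_unmount_write(
		struct xlog	*log)
	{
		/* make all metadata writeback stable before the unmount record */
		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);

		/* ... format and write the unmount record as before ... */
	}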

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-25 20:47         ` Dave Chinner
@ 2021-03-01  9:09           ` Christoph Hellwig
  2021-03-03  0:11             ` Dave Chinner
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2021-03-01  9:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs

On Fri, Feb 26, 2021 at 07:47:55AM +1100, Dave Chinner wrote:
> Sorry, I/O wait might matter for what?

Say you have a SAS hard drive, WCE=0, with a typical queue depth of a
few dozen commands.

Before that we'd submit a bunch of iclogs, which are generally
sequential except of course for the log wrap around case.  The drive
can now easily take all the iclogs and write them in one rotation.

Now if we wait for the previous iclogs before submitting the
commit_iclog we need at least one more additional full roundtrip.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 1/8] xfs: log stripe roundoff is a property of the log
  2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
                     ` (2 preceding siblings ...)
  2021-02-25  8:32   ` Christoph Hellwig
@ 2021-03-01 15:13   ` Brian Foster
  3 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2021-03-01 15:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:35PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We don't need to look at the xfs_mount and superblock every time we
> need to do an iclog roundoff calculation. The property is fixed for
> the life of the log, so store the roundoff in the log at mount time
> and use that everywhere.
> 
> On a debug build:
> 
> $ size fs/xfs/xfs_log.o.*
>    text	   data	    bss	    dec	    hex	filename
>   27360	    560	      8	  27928	   6d18	fs/xfs/xfs_log.o.orig
>   27219	    560	      8	  27787	   6c8b	fs/xfs/xfs_log.o.patched
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_log_format.h |  3 --
>  fs/xfs/xfs_log.c               | 59 ++++++++++++++--------------------
>  fs/xfs/xfs_log_priv.h          |  2 ++
>  3 files changed, 27 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 8bd00da6d2a4..16587219549c 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -34,9 +34,6 @@ typedef uint32_t xlog_tid_t;
>  #define XLOG_MIN_RECORD_BSHIFT	14		/* 16384 == 1 << 14 */
>  #define XLOG_BIG_RECORD_BSHIFT	15		/* 32k == 1 << 15 */
>  #define XLOG_MAX_RECORD_BSHIFT	18		/* 256k == 1 << 18 */
> -#define XLOG_BTOLSUNIT(log, b)  (((b)+(log)->l_mp->m_sb.sb_logsunit-1) / \
> -                                 (log)->l_mp->m_sb.sb_logsunit)
> -#define XLOG_LSUNITTOB(log, su) ((su) * (log)->l_mp->m_sb.sb_logsunit)
>  
>  #define XLOG_HEADER_SIZE	512
>  
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 06041834daa3..fa284f26d10e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1399,6 +1399,11 @@ xlog_alloc_log(
>  	xlog_assign_atomic_lsn(&log->l_last_sync_lsn, 1, 0);
>  	log->l_curr_cycle  = 1;	    /* 0 is bad since this is initial value */
>  
> +	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1)
> +		log->l_iclog_roundoff = mp->m_sb.sb_logsunit;
> +	else
> +		log->l_iclog_roundoff = BBSIZE;
> +
>  	xlog_grant_head_init(&log->l_reserve_head);
>  	xlog_grant_head_init(&log->l_write_head);
>  
> @@ -1852,29 +1857,15 @@ xlog_calc_iclog_size(
>  	uint32_t		*roundoff)
>  {
>  	uint32_t		count_init, count;
> -	bool			use_lsunit;
> -
> -	use_lsunit = xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -			log->l_mp->m_sb.sb_logsunit > 1;
>  
>  	/* Add for LR header */
>  	count_init = log->l_iclog_hsize + iclog->ic_offset;
> +	count = roundup(count_init, log->l_iclog_roundoff);
>  
> -	/* Round out the log write size */
> -	if (use_lsunit) {
> -		/* we have a v2 stripe unit to use */
> -		count = XLOG_LSUNITTOB(log, XLOG_BTOLSUNIT(log, count_init));
> -	} else {
> -		count = BBTOB(BTOBB(count_init));
> -	}
> -
> -	ASSERT(count >= count_init);
>  	*roundoff = count - count_init;
>  
> -	if (use_lsunit)
> -		ASSERT(*roundoff < log->l_mp->m_sb.sb_logsunit);
> -	else
> -		ASSERT(*roundoff < BBTOB(1));
> +	ASSERT(count >= count_init);
> +	ASSERT(*roundoff < log->l_iclog_roundoff);
>  	return count;
>  }
>  
> @@ -3149,10 +3140,9 @@ xlog_state_switch_iclogs(
>  	log->l_curr_block += BTOBB(eventual_size)+BTOBB(log->l_iclog_hsize);
>  
>  	/* Round up to next log-sunit */
> -	if (xfs_sb_version_haslogv2(&log->l_mp->m_sb) &&
> -	    log->l_mp->m_sb.sb_logsunit > 1) {
> -		uint32_t sunit_bb = BTOBB(log->l_mp->m_sb.sb_logsunit);
> -		log->l_curr_block = roundup(log->l_curr_block, sunit_bb);
> +	if (log->l_iclog_roundoff > BBSIZE) {
> +		log->l_curr_block = roundup(log->l_curr_block,
> +						BTOBB(log->l_iclog_roundoff));
>  	}
>  
>  	if (log->l_curr_block >= log->l_logBBsize) {
> @@ -3404,12 +3394,11 @@ xfs_log_ticket_get(
>   * Figure out the total log space unit (in bytes) that would be
>   * required for a log ticket.
>   */
> -int
> -xfs_log_calc_unit_res(
> -	struct xfs_mount	*mp,
> +static int
> +xlog_calc_unit_res(
> +	struct xlog		*log,
>  	int			unit_bytes)
>  {
> -	struct xlog		*log = mp->m_log;
>  	int			iclog_space;
>  	uint			num_headers;
>  
> @@ -3485,18 +3474,20 @@ xfs_log_calc_unit_res(
>  	/* for commit-rec LR header - note: padding will subsume the ophdr */
>  	unit_bytes += log->l_iclog_hsize;
>  
> -	/* for roundoff padding for transaction data and one for commit record */
> -	if (xfs_sb_version_haslogv2(&mp->m_sb) && mp->m_sb.sb_logsunit > 1) {
> -		/* log su roundoff */
> -		unit_bytes += 2 * mp->m_sb.sb_logsunit;
> -	} else {
> -		/* BB roundoff */
> -		unit_bytes += 2 * BBSIZE;
> -        }
> +	/* roundoff padding for transaction data and one for commit record */
> +	unit_bytes += 2 * log->l_iclog_roundoff;
>  
>  	return unit_bytes;
>  }
>  
> +int
> +xfs_log_calc_unit_res(
> +	struct xfs_mount	*mp,
> +	int			unit_bytes)
> +{
> +	return xlog_calc_unit_res(mp->m_log, unit_bytes);
> +}
> +
>  /*
>   * Allocate and initialise a new log ticket.
>   */
> @@ -3513,7 +3504,7 @@ xlog_ticket_alloc(
>  
>  	tic = kmem_cache_zalloc(xfs_log_ticket_zone, GFP_NOFS | __GFP_NOFAIL);
>  
> -	unit_res = xfs_log_calc_unit_res(log->l_mp, unit_bytes);
> +	unit_res = xlog_calc_unit_res(log, unit_bytes);
>  
>  	atomic_set(&tic->t_ref, 1);
>  	tic->t_task		= current;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 1c6fdbf3d506..037950cf1061 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -436,6 +436,8 @@ struct xlog {
>  #endif
>  	/* log recovery lsn tracking (for buffer submission */
>  	xfs_lsn_t		l_recovery_lsn;
> +
> +	uint32_t		l_iclog_roundoff;/* padding roundoff */
>  };
>  
>  #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
  2021-02-23 12:12   ` Chandan Babu R
  2021-02-24 20:34   ` Darrick J. Wong
@ 2021-03-01 15:19   ` Brian Foster
  2021-03-03  0:41     ` Dave Chinner
  2 siblings, 1 reply; 59+ messages in thread
From: Brian Foster @ 2021-03-01 15:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> To allow for iclog IO device cache flush behaviour to be optimised,
> we first need to separate out the commit record iclog IO from the
> rest of the checkpoint so we can wait for the checkpoint IO to
> complete before we issue the commit record.
> 
> This separation is only necessary if the commit record is being
> written into a different iclog to the start of the checkpoint as the
> upcoming cache flushing changes require completion ordering against
> the other iclogs submitted by the checkpoint.
> 
> If the entire checkpoint and commit is in the one iclog, then they
> are both covered by the one set of cache flush primitives on the
> iclog and hence there is no need to separate them for ordering.
> 
> Otherwise, we need to wait for all the previous iclogs to complete
> so they are ordered correctly and made stable by the REQ_PREFLUSH
> that the commit record iclog IO issues. This guarantees that if a
> reader sees the commit record in the journal, they will also see the
> entire checkpoint that commit record closes off.
> 
> This also provides the guarantee that when the commit record IO
> completes, we can safely unpin all the log items in the checkpoint
> so they can be written back because the entire checkpoint is stable
> in the journal.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_log_cil.c  |  7 ++++++
>  fs/xfs/xfs_log_priv.h |  2 ++
>  3 files changed, 64 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index fa284f26d10e..ff26fb46d70f 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
>  	return 0;
>  }
>  
> +/*
> + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> + * holds no log locks.
> + *
> + * We walk backwards through the iclogs to find the iclog with the highest lsn
> + * in the range that we need to wait for and then wait for it to complete.
> + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> + * candidate iclog we need to sleep on have been complete by the time our
> > + * candidate has completed its IO.
> + *
> + * Therefore we only need to find the first iclog that isn't clean within the
> + * span of our flush range. If we come across a clean, newly activated iclog
> + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will have been completed prior to this one. Hence finding a newly
> + * activated iclog indicates that there are no iclogs in the range we need to
> + * wait on and we are done searching.
> + */
> +int
> +xlog_wait_on_iclog_lsn(
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		start_lsn)
> +{
> +	struct xlog		*log = iclog->ic_log;
> +	struct xlog_in_core	*prev;
> +	int			error = -EIO;
> +
> +	spin_lock(&log->l_icloglock);
> +	if (XLOG_FORCED_SHUTDOWN(log))
> +		goto out_unlock;
> +
> +	error = 0;
> +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> +
> +		/* Done if the lsn is before our start lsn */
> +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> +				start_lsn) < 0)
> +			break;
> +
> +		/* Don't need to wait on completed, clean iclogs */
> +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> +			continue;
> +		}
> +
> +		/* wait for completion on this iclog */
> +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);

You haven't addressed my feedback from the previous version. In
particular the bit about whether it is safe to block on ->ic_force_wait
from here considering some of our more quirky buffer locking behavior.

That aside, this iteration logic all seems a bit overengineered to me.
We have the commit record iclog of the current checkpoint and thus the
immediately previous iclog in the ring. We know that previous record
isn't earlier than start_lsn because the caller confirmed that start_lsn
!= commit_lsn. We also know that iclog can't become dirty -> active
until it and all previous iclog writes have completed because the
callback ordering implemented by xlog_state_do_callback() won't clean
the iclog until that point. Given that, can't this whole thing be
replaced with a check of iclog->prev to either see if it's been cleaned
or to otherwise xlog_wait() for that condition and return?
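
Roughly something like the following, say (an untested sketch of that
suggestion using the names from the patch above, not code from this
series):

	spin_lock(&log->l_icloglock);
	if (XLOG_FORCED_SHUTDOWN(log)) {
		spin_unlock(&log->l_icloglock);
		return -EIO;
	}
	prev = iclog->ic_prev;
	if (prev->ic_state != XLOG_STATE_ACTIVE &&
	    prev->ic_state != XLOG_STATE_DIRTY) {
		/* xlog_wait() drops l_icloglock before sleeping */
		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
		return 0;
	}
	spin_unlock(&log->l_icloglock);
	return 0;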

Brian

> +		return 0;
> +	}
> +
> +out_unlock:
> +	spin_unlock(&log->l_icloglock);
> +	return error;
> +}
> +
>  /*
>   * Write out an unmount record using the ticket provided. We have to account for
>   * the data space used in the unmount ticket as this write is not done from a
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..c5cc1b7ad25e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -870,6 +870,13 @@ xlog_cil_push_work(
>  	wake_up_all(&cil->xc_commit_wait);
>  	spin_unlock(&cil->xc_push_lock);
>  
> +	/*
> +	 * If the checkpoint spans multiple iclogs, wait for all previous
> +	 * iclogs to complete before we submit the commit_iclog.
> +	 */
> +	if (ctx->start_lsn != commit_lsn)
> +		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
>  	return;
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 037950cf1061..a7ac85aaff4e 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -584,6 +584,8 @@ xlog_wait(
>  	remove_wait_queue(wq, &wait);
>  }
>  
> +int xlog_wait_on_iclog_lsn(struct xlog_in_core *iclog, xfs_lsn_t start_lsn);
> +
>  /*
>   * The LSN is valid so long as it is behind the current LSN. If it isn't, this
>   * means that the next log record that includes this metadata could have a
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-02-28 23:46             ` Dave Chinner
@ 2021-03-01 15:33               ` Brian Foster
  0 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2021-03-01 15:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Christoph Hellwig, linux-xfs

On Mon, Mar 01, 2021 at 10:46:42AM +1100, Dave Chinner wrote:
> On Sun, Feb 28, 2021 at 11:36:13AM -0500, Brian Foster wrote:
> > On Thu, Feb 25, 2021 at 06:48:28PM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 25, 2021 at 09:34:47AM +0100, Christoph Hellwig wrote:
> > > > On Thu, Feb 25, 2021 at 08:44:17AM +1100, Dave Chinner wrote:
> > > > > > Also, do you have any idea what was Christoph talking about wrt devices
> > > > > > with no-op flushes the last time this patch was posted?  This change
> > > > > > seems straightforward to me (assuming the answers to my two question are
> > > > > > 'yes') but I didn't grok what subtlety he was alluding to...?
> > > > > 
> > > > > He was wondering what devices benefited from this. It has no impact
> > > > > on highspeed devices that do not require flushes/FUA (e.g. high end
> > > > > intel optane SSDs) but those are not the devices this change is
> > > > > aimed at. There are no regressions on these high end devices,
> > > > > either, so they are largely irrelevant to the patch and what it
> > > > > targets...
> > > > 
> > > > I don't think it is that simple.  Pretty much every device aimed at
> > > > enterprise use does not enable a volatile write cache by default.  That
> > > > also includes hard drives, arrays and NAND based SSDs.
> > > > 
> > > > Especially for hard drives (or slower arrays) the actual I/O wait might
> > > > matter.  What is the argument against making this conditional?
> > > 
> > > I still don't understand what you're asking about here --
> > > 
> > > AFAICT the net effect of this patchset is that it reduces the number of
> > > preflushes and FUA log writes.  To my knowledge, on a high end device
> > > with no volatile write cache, flushes are a no-op (because all writes
> > > are persisted somewhere immediately) and a FUA write should be the exact
> > > same thing as a non-FUA write.  Because XFS will now issue fewer no-op
> > > persistence commands to the device, there should be no effect at all.
> > > 
> > 
> > Except the cost of the new iowaits used to implement iclog ordering...
> > which I think is what Christoph has been asking about..?
> 
> And I've already answered - it is largely just noise.
> 
> > IOW, considering the storage configuration noted above where the impact
> > of the flush/fua optimizations is neutral, the net effect of this change
> > is whatever impact is introduced by intra-checkpoint iowaits and iclog
> > ordering. What is that impact?
> 
> All I've really noticed is that long tail latencies on operations go
> down a bit. That seems to correlate with spending less time waiting
> for log space when the log is full, but it's a marginal improvement
> at best.
> 
> Otherwise I cannot measure any significant difference in performance
> or behaviour across any of the metrics I monitor during performance
> testing.
> 

Ok.

> > Note that it's not clear enough to me to suggest whether that impact
> > might be significant or not. Hopefully it's neutral (?), but that seems
> > like best case scenario so I do think it's a reasonable question.
> 
> Yes, it's a reasonable question, but I answered it entirely and in
> great detail the first time.  Repeating the same question multiple
> times just with slightly different phrasing does not change the
> answer, nor explain to me what the undocumented concern might be...
> 

Darrick noted he wasn't clear on the question being asked. I rephrased
it to hopefully add some clarity, not change the answer (?).

(FWIW, the response in the previous version of this series didn't
clearly answer the question from my perspective either, so perhaps that
is why you're seeing it repeated by multiple reviewers. Regardless,
Christoph already replied with more detail so I'll just follow along in
that sub-thread..)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8 v2] xfs: journal IO cache flush reductions
  2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
                       ` (3 preceding siblings ...)
  2021-02-25  8:58     ` Christoph Hellwig
@ 2021-03-01 19:29     ` Brian Foster
  4 siblings, 0 replies; 59+ messages in thread
From: Brian Foster @ 2021-03-01 19:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 07:05:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> Version 2:
> - repost manually without git/guilt mangling the patch author
> - fix bug in XLOG_ICL_NEED_FUA definition that didn't manifest as an
>   ordering bug in generic/45[57] until testing the CIL pipelining
>   changes much later in the series.
> 
>  fs/xfs/xfs_log.c      | 33 +++++++++++++++++++++++----------
>  fs/xfs/xfs_log_cil.c  |  7 ++++++-
>  fs/xfs/xfs_log_priv.h |  4 ++++
>  3 files changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 6c3fb6dcb505..08d68a6161ae 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
...
> @@ -1951,7 +1952,7 @@ xlog_sync(
>  	unsigned int		roundoff;       /* roundoff to BB or stripe */
>  	uint64_t		bno;
>  	unsigned int		size;
> -	bool			need_flush = true, split = false;
> +	bool			split = false;

I find the mash of the logic change (i.e. when to flush) and rework of
the bool to ->ic_flags changes in this patch unnecessarily convoluted.
IMO, this should be two patches.

>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
>  
> @@ -2009,13 +2010,14 @@ xlog_sync(
>  	 * synchronously here; for an internal log we can simply use the block
>  	 * layer state machine for preflushes.
>  	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp || split) {
> +	if (log->l_targ != log->l_mp->m_ddev_targp ||
> +	    (split && (iclog->ic_flags & XLOG_ICL_NEED_FLUSH))) {
>  		xfs_flush_bdev(log->l_mp->m_ddev_targp->bt_bdev);
> -		need_flush = false;
> +		iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
>  	}
>  
>  	xlog_verify_iclog(log, iclog, count);
> -	xlog_write_iclog(log, iclog, bno, count, need_flush);
> +	xlog_write_iclog(log, iclog, bno, count);
>  }
>  
>  /*
> @@ -2469,10 +2471,21 @@ xlog_write(
>  		ASSERT(log_offset <= iclog->ic_size - 1);
>  		ptr = iclog->ic_datap + log_offset;
>  
> -		/* start_lsn is the first lsn written to. That's all we need. */
> +		/* Start_lsn is the first lsn written to. */
>  		if (start_lsn && !*start_lsn)
>  			*start_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  
> +		/*
> +		 * iclogs containing commit records or unmount records need
> +		 * to issue ordering cache flushes and commit immediately
> +		 * to stable storage to guarantee journal vs metadata ordering
> +		 * is correctly maintained in the storage media.
> +		 */
> +		if (optype & (XLOG_COMMIT_TRANS | XLOG_UNMOUNT_TRANS)) {
> +			iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH |
> +						XLOG_ICL_NEED_FUA);
> +		}
> +
>  		/*
>  		 * This loop writes out as many regions as can fit in the amount
>  		 * of space which was allocated by xlog_state_get_iclog_space().
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 4093d2d0db7c..370da7c2bfc8 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -894,10 +894,15 @@ xlog_cil_push_work(
>  
>  	/*
>  	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog.
> +	 * iclogs to complete before we submit the commit_iclog. If it is in the
> +	 * same iclog as the start of the checkpoint, then we can skip the iclog
> +	 * cache flush because there are no other iclogs we need to order
> +	 * against.
>  	 */
>  	if (ctx->start_lsn != commit_lsn)
>  		xlog_wait_on_iclog_lsn(commit_iclog, ctx->start_lsn);
> +	else
> +		commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;

Is there a reason we're making these iclog changes (here and in
xlog_write(), at least) outside of ->l_icloglock? I suspect this is safe
atm due to the fact that the CIL push is single threaded via the
workqueue, but the rest of the iclog management code is written with
proper internal exclusion in mind. This seems like a bit of a landmine
if the execution context is the only thing protecting us here... hm?
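
Purely as an illustration of what I mean here (not a suggested patch),
the flag update could follow the same locking convention as the rest
of the iclog state changes:

	spin_lock(&log->l_icloglock);
	commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH;
	spin_unlock(&log->l_icloglock);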

Brian

>  
>  	/* release the hounds! */
>  	xfs_log_release_iclog(commit_iclog);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 10a41b1dd895..a77e00b7789a 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -133,6 +133,9 @@ enum xlog_iclog_state {
>  
>  #define XLOG_COVER_OPS		5
>  
> +#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
> +#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
> +
>  /* Ticket reservation region accounting */ 
>  #define XLOG_TIC_LEN_MAX	15
>  
> @@ -201,6 +204,7 @@ typedef struct xlog_in_core {
>  	u32			ic_size;
>  	u32			ic_offset;
>  	enum xlog_iclog_state	ic_state;
> +	unsigned int		ic_flags;
>  	char			*ic_datap;	/* pointer to iclog data */
>  
>  	/* Callback structures need their own cacheline */
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-03-01  9:09           ` Christoph Hellwig
@ 2021-03-03  0:11             ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-03-03  0:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs

On Mon, Mar 01, 2021 at 09:09:01AM +0000, Christoph Hellwig wrote:
> On Fri, Feb 26, 2021 at 07:47:55AM +1100, Dave Chinner wrote:
> > Sorry, I/O wait might matter for what?
> 
> Say you have a SAS hard drive, WCE=0, with a typical queue depth of a
> few dozen commands.

Yup, so typical IO latency of 2-3ms, assuming 10krpm, assuming we
aren't doing metadata writeback which will blow this out.

I've tested on slower iSCSI devices than this (7-8ms typical av.
seek time), and it didn't show up any performance anomalies.

> Before that we'd submit a bunch of iclogs, which are generally
> sequential except of course for the log wrap around case.  The drive
> can now easily take all the iclogs and write them in one rotation.

Even if we take the best case for your example, this still means we
block on every 8 iclogs waiting 2-3ms for the spindle to rotate and
complete the IOs. Hence for a checkpoint of 32MB with 256kB iclogs,
we're blocking for 2-3ms at least 16 times before we get to the
commit iclog. With default iclog size of 32kB, we'll block a couple
of hundred times waiting on iclog IO...

IOWs, we're already talking about a best case checkpoint commit
latency of 30-50ms here.

[ And this isn't even considering media bandwidth there - 32MB on a
drive that can do maybe 200MB/s in the middle of the spindle where
the log is. That's another 150ms of data transfer time to physical
media. So if the drive is actually writing to physical media because
WCE=0, then we're taking *at least* 200ms per 32MB checkpoint. ]

> Now if we wait for the previous iclogs before submitting the
> commit_iclog we need at least one more additional full roundtrip.

So we add an average of 2-3ms to what is already taking, in the best
case, 30-50ms.

And these are mostly async commits this overhead is added to, so
there's rarely anything waiting on it and hence the extra small
latency is almost always lost in the noise. Even if the extra delay
is larger, there is rarely anything waiting on it so it's still
noise...

I just don't see anything relevant that stands out from the noise on
my systems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-03-01 15:19   ` Brian Foster
@ 2021-03-03  0:41     ` Dave Chinner
  2021-03-03 15:22       ` Brian Foster
  2021-03-05  0:44       ` Dave Chinner
  0 siblings, 2 replies; 59+ messages in thread
From: Dave Chinner @ 2021-03-03  0:41 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > To allow for iclog IO device cache flush behaviour to be optimised,
> > we first need to separate out the commit record iclog IO from the
> > rest of the checkpoint so we can wait for the checkpoint IO to
> > complete before we issue the commit record.
> > 
> > This separation is only necessary if the commit record is being
> > written into a different iclog to the start of the checkpoint as the
> > upcoming cache flushing changes require completion ordering against
> > the other iclogs submitted by the checkpoint.
> > 
> > If the entire checkpoint and commit is in the one iclog, then they
> > are both covered by the one set of cache flush primitives on the
> > iclog and hence there is no need to separate them for ordering.
> > 
> > Otherwise, we need to wait for all the previous iclogs to complete
> > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > that the commit record iclog IO issues. This guarantees that if a
> > reader sees the commit record in the journal, they will also see the
> > entire checkpoint that commit record closes off.
> > 
> > This also provides the guarantee that when the commit record IO
> > completes, we can safely unpin all the log items in the checkpoint
> > so they can be written back because the entire checkpoint is stable
> > in the journal.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> >  fs/xfs/xfs_log_priv.h |  2 ++
> >  3 files changed, 64 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index fa284f26d10e..ff26fb46d70f 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> >  	return 0;
> >  }
> >  
> > +/*
> > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > + * holds no log locks.
> > + *
> > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > + * in the range that we need to wait for and then wait for it to complete.
> > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > + * candidate iclog we need to sleep on have been complete by the time our
> > + * candidate has completed its IO.
> > + *
> > + * Therefore we only need to find the first iclog that isn't clean within the
> > + * span of our flush range. If we come across a clean, newly activated iclog
> > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > + * iclogs will have been completed prior to this one. Hence finding a newly
> > + * activated iclog indicates that there are no iclogs in the range we need to
> > + * wait on and we are done searching.
> > + */
> > +int
> > +xlog_wait_on_iclog_lsn(
> > +	struct xlog_in_core	*iclog,
> > +	xfs_lsn_t		start_lsn)
> > +{
> > +	struct xlog		*log = iclog->ic_log;
> > +	struct xlog_in_core	*prev;
> > +	int			error = -EIO;
> > +
> > +	spin_lock(&log->l_icloglock);
> > +	if (XLOG_FORCED_SHUTDOWN(log))
> > +		goto out_unlock;
> > +
> > +	error = 0;
> > +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> > +
> > +		/* Done if the lsn is before our start lsn */
> > +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> > +				start_lsn) < 0)
> > +			break;
> > +
> > +		/* Don't need to wait on completed, clean iclogs */
> > +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> > +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> > +			continue;
> > +		}
> > +
> > +		/* wait for completion on this iclog */
> > +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> 
> You haven't addressed my feedback from the previous version. In
> particular the bit about whether it is safe to block on ->ic_force_wait
> from here considering some of our more quirky buffer locking behavior.

Sorry, first I've heard about this. I don't have any such email in
my inbox.

I don't know what waiting on an iclog in the middle of a checkpoint
has to do with buffer locking behaviour, because iclogs don't use
buffers and we block waiting on iclog IO completion all the time in
xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
completion here, then it's not safe to block on an iclog in
xlog_state_get_iclog_space(). That's obviously not true, so I'm
really not sure what the concern here is...

> That aside, this iteration logic all seems a bit overengineered to me.
> We have the commit record iclog of the current checkpoint and thus the
> immediately previous iclog in the ring. We know that previous record
> isn't earlier than start_lsn because the caller confirmed that start_lsn
> != commit_lsn. We also know that iclog can't become dirty -> active
> until it and all previous iclog writes have completed because the
> callback ordering implemented by xlog_state_do_callback() won't clean
> the iclog until that point. Given that, can't this whole thing be
> replaced with a check of iclog->prev to either see if it's been cleaned
> or to otherwise xlog_wait() for that condition and return?

Maybe. I was more concerned about ensuring that it did the right
thing so I checked all the things that came to mind. There was more
than enough complexity in other parts of this patchset to fill my
brain that a minimal implementation was not a concern. I'll go take
another look at it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-03-03  0:41     ` Dave Chinner
@ 2021-03-03 15:22       ` Brian Foster
  2021-03-04 22:57         ` Dave Chinner
  2021-03-05  0:44       ` Dave Chinner
  1 sibling, 1 reply; 59+ messages in thread
From: Brian Foster @ 2021-03-03 15:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > To allow for iclog IO device cache flush behaviour to be optimised,
> > > we first need to separate out the commit record iclog IO from the
> > > rest of the checkpoint so we can wait for the checkpoint IO to
> > > complete before we issue the commit record.
> > > 
> > > This separation is only necessary if the commit record is being
> > > written into a different iclog to the start of the checkpoint as the
> > > upcoming cache flushing changes require completion ordering against
> > > the other iclogs submitted by the checkpoint.
> > > 
> > > If the entire checkpoint and commit is in the one iclog, then they
> > > are both covered by the one set of cache flush primitives on the
> > > iclog and hence there is no need to separate them for ordering.
> > > 
> > > Otherwise, we need to wait for all the previous iclogs to complete
> > > so they are ordered correctly and made stable by the REQ_PREFLUSH
> > > that the commit record iclog IO issues. This guarantees that if a
> > > reader sees the commit record in the journal, they will also see the
> > > entire checkpoint that commit record closes off.
> > > 
> > > This also provides the guarantee that when the commit record IO
> > > completes, we can safely unpin all the log items in the checkpoint
> > > so they can be written back because the entire checkpoint is stable
> > > in the journal.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c      | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_log_cil.c  |  7 ++++++
> > >  fs/xfs/xfs_log_priv.h |  2 ++
> > >  3 files changed, 64 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index fa284f26d10e..ff26fb46d70f 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -808,6 +808,61 @@ xlog_wait_on_iclog(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Wait on any iclogs that are still flushing in the range of start_lsn to the
> > > + * current iclog's lsn. The caller holds a reference to the iclog, but otherwise
> > > + * holds no log locks.
> > > + *
> > > + * We walk backwards through the iclogs to find the iclog with the highest lsn
> > > + * in the range that we need to wait for and then wait for it to complete.
> > > + * Completion ordering of iclog IOs ensures that all prior iclogs to the
> > > + * candidate iclog we need to sleep on have been complete by the time our
> > > + * candidate has completed its IO.
> > > + *
> > > + * Therefore we only need to find the first iclog that isn't clean within the
> > > + * span of our flush range. If we come across a clean, newly activated iclog
> > > + * with a lsn of 0, it means IO has completed on this iclog and all previous
> > > + * iclogs will have been completed prior to this one. Hence finding a newly
> > > + * activated iclog indicates that there are no iclogs in the range we need to
> > > + * wait on and we are done searching.
> > > + */
> > > +int
> > > +xlog_wait_on_iclog_lsn(
> > > +	struct xlog_in_core	*iclog,
> > > +	xfs_lsn_t		start_lsn)
> > > +{
> > > +	struct xlog		*log = iclog->ic_log;
> > > +	struct xlog_in_core	*prev;
> > > +	int			error = -EIO;
> > > +
> > > +	spin_lock(&log->l_icloglock);
> > > +	if (XLOG_FORCED_SHUTDOWN(log))
> > > +		goto out_unlock;
> > > +
> > > +	error = 0;
> > > +	for (prev = iclog->ic_prev; prev != iclog; prev = prev->ic_prev) {
> > > +
> > > +		/* Done if the lsn is before our start lsn */
> > > +		if (XFS_LSN_CMP(be64_to_cpu(prev->ic_header.h_lsn),
> > > +				start_lsn) < 0)
> > > +			break;
> > > +
> > > +		/* Don't need to wait on completed, clean iclogs */
> > > +		if (prev->ic_state == XLOG_STATE_DIRTY ||
> > > +		    prev->ic_state == XLOG_STATE_ACTIVE) {
> > > +			continue;
> > > +		}
> > > +
> > > +		/* wait for completion on this iclog */
> > > +		xlog_wait(&prev->ic_force_wait, &log->l_icloglock);
> > 
> > You haven't addressed my feedback from the previous version. In
> > particular the bit about whether it is safe to block on ->ic_force_wait
> > from here considering some of our more quirky buffer locking behavior.
> 
> Sorry, first I've heard about this. I don't have any such email in
> my inbox.
> 

For reference, the last bit of this mail:

https://lore.kernel.org/linux-xfs/20210201160737.GA3252048@bfoster/

> I don't know what waiting on an iclog in the middle of a checkpoint
> has to do with buffer locking behaviour, because iclogs don't use
> buffers and we block waiting on iclog IO completion all the time in
> xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
> completion here, then it's not safe to block on an iclog in
> xlog_state_get_iclog_space(). That's obviously not true, so I'm
> really not sure what the concern here is...
> 

I think the broader question is not so much whether it's safe to block
here or not, but whether our current use of async log forces might have
a deadlock vector (which may or may not also include the
_get_iclog_space() scenario, I'd need to stare at that one a bit). I
referred to buffer locking because the buffer ->iop_unpin() handler can
attempt to acquire a buffer lock.

Looking again, that is the only place I see that blocks in iclog
completion callbacks and it's actually an abort scenario, which means
shutdown. I am slightly concerned that introducing more regular blocking
in the CIL push might lead to more frequent async log forces that block
on callback iclogs and thus exacerbate that issue (i.e. somebody might
be able to now reproduce yet another shutdown deadlock scenario to track
down that might not have been reproducible before, for whatever reason),
but that's probably not a serious enough problem to block this patch and
the advantages of the series overall.

Brian

> > That aside, this iteration logic all seems a bit overengineered to me.
> > We have the commit record iclog of the current checkpoint and thus the
> > immediately previous iclog in the ring. We know that previous record
> > isn't earlier than start_lsn because the caller confirmed that start_lsn
> > != commit_lsn. We also know that iclog can't become dirty -> active
> > until it and all previous iclog writes have completed because the
> > callback ordering implemented by xlog_state_do_callback() won't clean
> > the iclog until that point. Given that, can't this whole thing be
> > replaced with a check of iclog->prev to either see if it's been cleaned
> > or to otherwise xlog_wait() for that condition and return?
> 
> Maybe. I was more concerned about ensuring that it did the right
> thing so I checked all the things that came to mind. There was more
> than enough complexity in other parts of this patchset to fill my
> brain that a minimal implementation was not a concern. I'll go take
> another look at it.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-03-03 15:22       ` Brian Foster
@ 2021-03-04 22:57         ` Dave Chinner
  0 siblings, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-03-04 22:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 10:22:05AM -0500, Brian Foster wrote:
> On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> > On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > > You haven't addressed my feedback from the previous version. In
> > > particular the bit about whether it is safe to block on ->ic_force_wait
> > > from here considering some of our more quirky buffer locking behavior.
> > 
> > Sorry, first I've heard about this. I don't have any such email in
> > my inbox.
> > 
> 
> For reference, the last bit of this mail:
> 
> https://lore.kernel.org/linux-xfs/20210201160737.GA3252048@bfoster/
> 
> > I don't know what waiting on an iclog in the middle of a checkpoint
> > has to do with buffer locking behaviour, because iclogs don't use
> > buffers and we block waiting on iclog IO completion all the time in
> > xlog_state_get_iclog_space(). If it's not safe to block on iclog IO
> > completion here, then it's not safe to block on an iclog in
> > xlog_state_get_iclog_space(). That's obviously not true, so I'm
> > really not sure what the concern here is...
> > 
> 
> I think the broader question is not so much whether it's safe to block
> here or not, but whether our current use of async log forces might have
> a deadlock vector (which may or may not also include the
> _get_iclog_space() scenario, I'd need to stare at that one a bit). I
> referred to buffer locking because the buffer ->iop_unpin() handler can
> attempt to acquire a buffer lock.

There are none that I know of, and I'm not changing any of the log
write blocking rules. Hence if there is a problem, it's a zero-day
that we have never triggered nor have any awareness about at all.
Hence for the purposes of development and review, we can assume such
unknown design problems don't actually exist because there's
absolutely zero evidence to indicate there is a problem here...

> Looking again, that is the only place I see that blocks in iclog
> completion callbacks and it's actually an abort scenario, which means
> shutdown.

Yup. The AIL simply needs to abort writeback of such locked, pinned
buffers and then everything works just fine.

> I am slightly concerned that introducing more regular blocking in
> the CIL push might lead to more frequent async log forces that block
> on callback iclogs and thus exacerbate that issue (i.e. somebody
> might now be able to reproduce yet another shutdown deadlock scenario
> that, for whatever reason, wasn't reproducible before), but that's
> probably not a serious enough problem to hold up this patch given the
> advantages of the series overall.

And that's why I updated the log force stats accounting to capture
the async log forces and how we account log forces that block. That
gives me direct visibility into the blocking behaviour while I'm
running tests. And even with this new visibility, I can't see any
change in the metrics that are above the noise floor...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 2/8] xfs: separate CIL commit record IO
  2021-03-03  0:41     ` Dave Chinner
  2021-03-03 15:22       ` Brian Foster
@ 2021-03-05  0:44       ` Dave Chinner
  1 sibling, 0 replies; 59+ messages in thread
From: Dave Chinner @ 2021-03-05  0:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 03, 2021 at 11:41:19AM +1100, Dave Chinner wrote:
> On Mon, Mar 01, 2021 at 10:19:36AM -0500, Brian Foster wrote:
> > On Tue, Feb 23, 2021 at 02:34:36PM +1100, Dave Chinner wrote:
> > That aside, this iteration logic all seems a bit overengineered to me.
> > We have the commit record iclog of the current checkpoint and thus the
> > immediately previous iclog in the ring. We know that previous record
> > isn't earlier than start_lsn because the caller confirmed that start_lsn
> > != commit_lsn. We also know that iclog can't become dirty -> active
> > until it and all previous iclog writes have completed because the
> > callback ordering implemented by xlog_state_do_callback() won't clean
> > the iclog until that point. Given that, can't this whole thing be
> > replaced with a check of iclog->prev to either see if it's been cleaned
> > or to otherwise xlog_wait() for that condition and return?
> 
> Maybe. I was more concerned about ensuring that it did the right
> thing, so I checked all the things that came to mind. There was more
> than enough complexity in other parts of this patchset to fill my
> brain that a minimal implementation was not a concern. I'll go take
> another look at it.

Ok, so we can just use xlog_wait_on_iclog() here. I didn't look too
closely at the implementation of that function, just took the
comment above it at face value that it only waited for an iclog to
hit the disk.

We actually have two different iclog IO completion wait points - one
to wait for an iclog to hit the disk, and one to wait for it to hit
the disk and run its completion callbacks. i.e. one is not ordered
against other iclogs and the other is strictly ordered.

The ordered version runs completion callbacks before waking waiters,
thereby guaranteeing that all previous iclogs have completed before
the current iclog is completed and its waiters are woken.

The CIL code needs the latter, so yes, this can be simplified down
to a single xlog_wait_on_iclog(commit_iclog->ic_prev); call from the
CIL.
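
(For illustration, a minimal sketch of what that simplification might
look like in the CIL push path. The wrapper name is hypothetical, and
it assumes xlog_wait_on_iclog() is entered with l_icloglock held,
drops it, and does not return until completion callbacks for all prior
iclogs have run:)

static void
xlog_cil_wait_for_prior_iclogs(
	struct xlog		*log,
	struct xlog_in_core	*commit_iclog,
	xfs_lsn_t		start_lsn,
	xfs_lsn_t		commit_lsn)
{
	/* Checkpoint fits in a single iclog: nothing to order against. */
	if (start_lsn == commit_lsn)
		return;

	/*
	 * Wait for the iclog immediately before the commit record iclog
	 * to complete and run its callbacks, which guarantees all earlier
	 * iclogs in the checkpoint have also completed.
	 */
	spin_lock(&log->l_icloglock);
	xlog_wait_on_iclog(commit_iclog->ic_prev);	/* drops l_icloglock */
}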

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2021-03-05  0:45 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-23  3:34 [PATCH v2] xfs: various log stuff Dave Chinner
2021-02-23  3:34 ` [PATCH 1/8] xfs: log stripe roundoff is a property of the log Dave Chinner
2021-02-23 10:29   ` Chandan Babu R
2021-02-24 20:14   ` Darrick J. Wong
2021-02-25  8:32   ` Christoph Hellwig
2021-03-01 15:13   ` Brian Foster
2021-02-23  3:34 ` [PATCH 2/8] xfs: separate CIL commit record IO Dave Chinner
2021-02-23 12:12   ` Chandan Babu R
2021-02-24 20:34   ` Darrick J. Wong
2021-02-24 21:44     ` Dave Chinner
2021-02-24 23:06       ` Darrick J. Wong
2021-02-25  8:34       ` Christoph Hellwig
2021-02-25 20:47         ` Dave Chinner
2021-03-01  9:09           ` Christoph Hellwig
2021-03-03  0:11             ` Dave Chinner
2021-02-26  2:48         ` Darrick J. Wong
2021-02-28 16:36           ` Brian Foster
2021-02-28 23:46             ` Dave Chinner
2021-03-01 15:33               ` Brian Foster
2021-03-01 15:19   ` Brian Foster
2021-03-03  0:41     ` Dave Chinner
2021-03-03 15:22       ` Brian Foster
2021-03-04 22:57         ` Dave Chinner
2021-03-05  0:44       ` Dave Chinner
2021-02-23  3:34 ` [PATCH 3/8] xfs: move and rename xfs_blkdev_issue_flush Dave Chinner
2021-02-23 12:57   ` Chandan Babu R
2021-02-24 20:45   ` Darrick J. Wong
2021-02-24 22:01     ` Dave Chinner
2021-02-25  8:36   ` Christoph Hellwig
2021-02-23  3:34 ` [PATCH 4/8] xfs: async blkdev cache flush Dave Chinner
2021-02-23  5:29   ` Chaitanya Kulkarni
2021-02-23 14:02   ` Chandan Babu R
2021-02-24 20:51   ` Darrick J. Wong
2021-02-23  3:34 ` [PATCH 5/8] xfs: CIL checkpoint flushes caches unconditionally Dave Chinner
2021-02-24  7:16   ` Chandan Babu R
2021-02-24 20:57   ` Darrick J. Wong
2021-02-25  8:42   ` Christoph Hellwig
2021-02-25 21:07     ` Dave Chinner
2021-02-23  3:34 ` [PATCH 6/8] xfs: remove need_start_rec parameter from xlog_write() Dave Chinner
2021-02-24  7:17   ` Chandan Babu R
2021-02-24 20:59   ` Darrick J. Wong
2021-02-25  8:49   ` Christoph Hellwig
2021-02-25 20:55     ` Dave Chinner
2021-02-23  3:34 ` [PATCH 7/8] xfs: journal IO cache flush reductions Dave Chinner
2021-02-23  8:05   ` [PATCH 7/8 v2] " Dave Chinner
2021-02-24 12:27     ` Chandan Babu R
2021-02-24 20:32       ` Dave Chinner
2021-02-24 21:13     ` Darrick J. Wong
2021-02-24 22:03       ` Dave Chinner
2021-02-25  4:09     ` Chandan Babu R
2021-02-25  7:13       ` Chandan Babu R
2021-03-01  5:44       ` Dave Chinner
2021-03-01  5:56         ` Dave Chinner
2021-02-25  8:58     ` Christoph Hellwig
2021-02-25 21:06       ` Dave Chinner
2021-03-01 19:29     ` Brian Foster
2021-02-23  3:34 ` [PATCH 8/8] xfs: Fix CIL throttle hang when CIL space used going backwards Dave Chinner
2021-02-24 21:18   ` Darrick J. Wong
2021-02-24 22:05     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.