All of lore.kernel.org
* [PATCH 0/5] xfs: fix log cache flush regressions
@ 2021-07-22  1:53 Dave Chinner
  2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

The first four patches in this series fix regressions introduced by
commit eef983ffeae7 ("xfs: journal IO cache flush reductions") and
exposed by generic/482. I have one test machine that reproduces
the failures, and in general a failure occurs 1 in every 30 test
iterations.

The first two patches are related to external logs. These were found
by code inspection while looking for the cause of the generic/482
failures.

The third patch addresses a race condition between the new
unconditional async data device cache flush run by the CIL push and
the log tail changing between that flush completing and the commit
iclog being synced. In this situation, if the commit_iclog doesn't
issue a REQ_PREFLUSH, we fail to guarantee that the metadata IO
completions that moved the log tail are on stable storage. The fix
for this issue is to sample the log tail before we issue the async
cache flush, and if the log tail has changed when we release the
commit_iclog we set the XLOG_ICL_NEED_FLUSH flag on the iclog to
guarantee a cache flush is issued before the commit record that
moves the tail of the log forward is written to the log.

The fourth patch addresses an oversight about log forces. Log forces
imply a cache flush if they result in iclog IO being issued,
allowing the caller to elide cache flushes that they might require
for data integrity. The change to requiring the iclog owner to
signal that a cache flush is required completely missed the log
force code. This patch ensures that all ACTIVE and WANT_SYNC iclogs
that we either flush or wait on issue cache flushes and so guarantee
that they are on stable storage at IO completion before they signal
completion to log force waiters.

The last patch is not a regression fix, but addresses another log
force iclog flush problem I noticed while triaging the above
problems. Occasionally the tests would sit idle doing nothing
for up to 30s waiting on an xfs_log_force_seq() call to complete. The
trigger for them to wake up was the background log worker issuing a
new log force. This appears to be another "incorrectly wait on
future iclog state changes" situation, and it's notable that
xfs_log_force() specifically handles this situation while
xfs_log_force_seq() does not. Changing xfs_log_force_seq() to use
the same "iclog IO already completed" detection as xfs_log_force()
made those random delays go away.

I'm still not sure that I've caught all the cases where we don't
issue cache flushes correctly - debugging this has been a slow
process of whack-a-mole, and there are still questionable failures
occurring where an fsync does not appear to issue an iclog with a
cache flush correctly, even after hundreds of test cycles. Capturing a
failure with enough debug information for it to be meaningful is a
slow, time-consuming process...

However, this patchset as it stands makes things a lot better, so I
figure it's better to get this out there now and get more eyes on it
sooner rather than later...

Version 1:
- based on 5.14-rc1


* [PATCH 1/5] xfs: flush data dev on external log write
  2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
@ 2021-07-22  1:53 ` Dave Chinner
  2021-07-22  6:41   ` Christoph Hellwig
  2021-07-22 15:52   ` Darrick J. Wong
  2021-07-22  1:53 ` [PATCH 2/5] xfs: external logs need to flush data device Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We incorrectly flush the log device instead of the data device when
trying to ensure metadata is correctly on disk before writing the
unmount record.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 36fa2650b081..96434cc4df6e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -833,7 +833,7 @@ xlog_write_unmount_record(
 	 * stamp the tail LSN into the unmount record.
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
-		blkdev_issue_flush(log->l_targ->bt_bdev);
+		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
-- 
2.31.1


* [PATCH 2/5] xfs: external logs need to flush data device
  2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
  2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
@ 2021-07-22  1:53 ` Dave Chinner
  2021-07-22  6:48   ` Christoph Hellwig
  2021-07-22 18:14   ` Darrick J. Wong
  2021-07-22  1:53 ` [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The recent journal flush/FUA changes replaced the flushing of the
data device on every iclog write with an up-front async data device
cache flush. Unfortunately, the assumption this was based on has
been proven incorrect by the flush vs log tail update ordering
issue. As the fix for that issue uses the XLOG_ICL_NEED_FLUSH flag
to indicate that the data device needs a cache flush, we now need to
(once again) ensure that iclog writes to external logs that require
a cache flush also issue a cache flush to the data device as well as
the log device.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 96434cc4df6e..a3c4d48195d9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -827,13 +827,6 @@ xlog_write_unmount_record(
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
 
-	/*
-	 * For external log devices, we need to flush the data device cache
-	 * first to ensure all metadata writeback is on stable storage before we
-	 * stamp the tail LSN into the unmount record.
-	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp)
-		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
@@ -1796,10 +1789,20 @@ xlog_write_iclog(
 	 * metadata writeback and causing priority inversions.
 	 */
 	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
-	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) {
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+		/*
+		 * For external log devices, we also need to flush the data
+		 * device cache first to ensure all metadata writeback covered
+		 * by the LSN in this iclog is on stable storage. This is slow,
+		 * but it *must* complete before we issue the external log IO.
+		 */
+		if (log->l_targ != log->l_mp->m_ddev_targp)
+			blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
+	}
 	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
 		iclog->ic_bio.bi_opf |= REQ_FUA;
+
 	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
-- 
2.31.1


* [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
  2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
  2021-07-22  1:53 ` [PATCH 2/5] xfs: external logs need to flush data device Dave Chinner
@ 2021-07-22  1:53 ` Dave Chinner
  2021-07-22  7:06   ` Christoph Hellwig
  2021-07-22  1:53 ` [PATCH 4/5] xfs: log forces imply data device cache flushes Dave Chinner
  2021-07-22  1:53 ` [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
  4 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

There is a race between the new CIL async data device metadata IO
completion cache flush and the update of the log tail in the iclog
that the flush covers. This can be seen by repeating generic/482 in
a loop; eventually log recovery fails with failures such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA write @ 124 and the FUA write @ 125,
but log recovery @ 125 failed. The difference was that the log write
@ 125 moved the tail of the log forwards from (1,8) to (1,32), so
the inode create intent in (1,8) was not replayed and the inode
cluster was zero on disk when replay of the first inode item in
(1,32) was attempted.

What this meant was that the journal write that occurred @ 125
did not ensure that the metadata which completed before the iclog
was written was correctly on stable storage. The tail of the log
moved forward, so IO must have completed between the two iclog
writes. This means that there is a race condition between the
unconditional async cache flush in the CIL push work and the tail
LSN that is written to the iclog. This happens like so:

CIL push work				AIL push work
-------------				-------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  ....					push inode create buffer
					<start IO>
					.....
xlog_write(commit record)
  ....					<IO completes>
  					log tail moves
  					  xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
on stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. This requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.
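The sample-and-compare logic can be sketched as a simplified userspace
model (structure layouts and the flag value here are illustrative, not
the kernel's):

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* illustrative flag value */

struct model_log   { uint64_t tail_lsn; };
struct model_iclog { unsigned int flags; };

/*
 * The CIL push samples the tail LSN *before* issuing the async data
 * device cache flush. At iclog release time, a non-zero sampled value
 * that no longer matches the current tail means metadata IO completed
 * (and moved the tail) after that flush, so the commit iclog must
 * issue its own cache flush to cover that IO.
 */
static void release_iclog(struct model_log *log, struct model_iclog *iclog,
			  uint64_t old_tail_lsn)
{
	if (old_tail_lsn && log->tail_lsn != old_tail_lsn)
		iclog->flags |= XLOG_ICL_NEED_FLUSH;
}
```

If the tail never moved between the sample and the release, no extra
flush is needed and the optimisation is preserved.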

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 81 ++++++++++++++++++++++++-------------------
 fs/xfs/xfs_log_cil.c  | 13 +++++--
 fs/xfs/xfs_log_priv.h |  3 +-
 3 files changed, 58 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a3c4d48195d9..c5ccef6ab423 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -487,51 +487,60 @@ xfs_log_reserve(
 	return error;
 }
 
-static bool
-__xlog_state_release_iclog(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog)
-{
-	lockdep_assert_held(&log->l_icloglock);
-
-	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
-		/* update tail before writing to iclog */
-		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
-
-		iclog->ic_state = XLOG_STATE_SYNCING;
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-		xlog_verify_tail_lsn(log, iclog, tail_lsn);
-		/* cycle incremented when incrementing curr_block */
-		trace_xlog_iclog_syncing(iclog, _RET_IP_);
-		return true;
-	}
-
-	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
-	return false;
-}
-
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.
+ * it is in the WANT_SYNC state. If the caller passes in a non-zero
+ * @old_tail_lsn, then we need to check if the log tail is different to the
+ * caller's value. If it is different, this indicates that the log tail has
+ * moved since the caller sampled the log tail and issued a cache flush and so
+ * there may be metadata on disk taht we need to flush before this iclog is
+ * written. In that case, set the XLOG_ICL_NEED_FLUSH flag so taht when the
+ * iclog is synced we correctly issue a cache flush before the iclog is
+ * submitted.
  */
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		old_tail_lsn)
 {
+	xfs_lsn_t		tail_lsn;
+
 	lockdep_assert_held(&log->l_icloglock);
 
 	trace_xlog_iclog_release(iclog, _RET_IP_);
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
-	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
-	    __xlog_state_release_iclog(log, iclog)) {
-		spin_unlock(&log->l_icloglock);
-		xlog_sync(log, iclog);
-		spin_lock(&log->l_icloglock);
+	/*
+	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
+	 * of the tail LSN into the iclog so we guarantee that the log tail does
+	 * not move between deciding if a cache flush is required and writing
+	 * the LSN into the iclog below.
+	 */
+	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
+		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+
+		if (old_tail_lsn && tail_lsn != old_tail_lsn)
+			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
+	if (!atomic_dec_and_test(&iclog->ic_refcnt))
+		return 0;
+
+	if (iclog->ic_state != XLOG_STATE_WANT_SYNC) {
+		ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
+		return 0;
+	}
+
+	/* update tail before writing to iclog */
+	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	iclog->ic_state = XLOG_STATE_SYNCING;
+	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	spin_unlock(&log->l_icloglock);
+	xlog_sync(log, iclog);
+	spin_lock(&log->l_icloglock);
+
 	return 0;
 }
 
@@ -869,7 +878,7 @@ xlog_unmount_write(
 	 * iclog containing the unmount record is written.
 	 */
 	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -2313,7 +2322,7 @@ xlog_write_copy_finish(
 	return 0;
 
 release_iclog:
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	spin_unlock(&log->l_icloglock);
 	return error;
 }
@@ -2532,7 +2541,7 @@ xlog_write(
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, 0);
 	}
 	spin_unlock(&log->l_icloglock);
 
@@ -2970,7 +2979,7 @@ xlog_state_get_iclog_space(
 		 * reference to the iclog.
 		 */
 		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-			error = xlog_state_release_iclog(log, iclog);
+			error = xlog_state_release_iclog(log, iclog, 0);
 		spin_unlock(&log->l_icloglock);
 		if (error)
 			return error;
@@ -3206,7 +3215,7 @@ xfs_log_force(
 			atomic_inc(&iclog->ic_refcnt);
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog))
+			if (xlog_state_release_iclog(log, iclog, 0))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3286,7 +3295,7 @@ xlog_force_lsn(
 		}
 		atomic_inc(&iclog->ic_refcnt);
 		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog))
+		if (xlog_state_release_iclog(log, iclog, 0))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b128aaa9b870..4c44bc3786c0 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -654,8 +654,9 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
+	xfs_lsn_t		preflush_tail_lsn;
 	xfs_lsn_t		commit_lsn;
-	xfs_lsn_t		push_seq;
+	xfs_csn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
@@ -730,7 +731,15 @@ xlog_cil_push_work(
 	 * because we hold the flush lock exclusively. Hence we can now issue
 	 * a cache flush to ensure all the completed metadata in the journal we
 	 * are about to overwrite is on stable storage.
+	 *
+	 * Because we are issuing this cache flush before we've written the
+	 * tail lsn to the iclog, we can have metadata IO completions move the
+	 * tail forwards between the completion of this flush and the iclog
+	 * being written. In this case, we need to re-issue the cache flush
+	 * before the iclog write. To detect whether the log tail moves, sample
+	 * the tail LSN *before* we issue the flush.
 	 */
+	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
 	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
 				&bdev_flush);
 
@@ -941,7 +950,7 @@ xlog_cil_push_work(
 	 * storage.
 	 */
 	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
-	xlog_state_release_iclog(log, commit_iclog);
+	xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
 	spin_unlock(&log->l_icloglock);
 	return;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4c41bbfa33b0..7cbde0b4f990 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -497,7 +497,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
+int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
+		xfs_lsn_t log_tail_lsn);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
-- 
2.31.1


* [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
                   ` (2 preceding siblings ...)
  2021-07-22  1:53 ` [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
@ 2021-07-22  1:53 ` Dave Chinner
  2021-07-22  7:14   ` Christoph Hellwig
  2021-07-22 19:30   ` Darrick J. Wong
  2021-07-22  1:53 ` [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

After fixing the tail_lsn vs cache flush race, generic/482 continued
to fail in a similar way where cache flushes were missing before
iclog FUA writes. Tracing of iclog state changes during the fsstress
workload portion of the test (via xlog_iclog* events) indicated that
iclog writes were coming from two sources - CIL pushes and log
forces (due to fsync/O_SYNC operations). All of the cases where a
recovery problem was triggered indicated that the log force was the
source of the iclog write that was not preceded by a cache flush.

This was an oversight in the modifications made in commit
eef983ffeae7 ("xfs: journal IO cache flush reductions"). Log forces
for fsync imply that a data device cache flush has been issued if an
iclog was flushed to disk; this is indicated to the caller via the
log_flushed parameter so the caller can elide the device cache flush
if the journal issued one.

The change in eef983ffeae7 results in iclogs only issuing a cache
flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
added to the iclogs that the log force code flushes to disk. Hence
log forces no longer guarantee that a cache flush is issued, opening
up a potential on-disk ordering failure.

Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
the actual iclogs it forces to the journal are also on stable
storage before it returns to the caller.
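How those two flags translate into block layer request flags can be
sketched in a simplified model (the flag values below are illustrative
stand-ins, not the kernel's REQ_* definitions):

```c
#include <assert.h>

/* illustrative flag values */
#define XLOG_ICL_NEED_FLUSH	(1 << 0)
#define XLOG_ICL_NEED_FUA	(1 << 1)
#define MODEL_REQ_PREFLUSH	(1 << 8)
#define MODEL_REQ_FUA		(1 << 9)

/*
 * NEED_FLUSH makes the iclog bio issue a cache flush before the write
 * (ordering all prior completed IO), and NEED_FUA makes the write
 * itself reach stable media before completing. A log force must set
 * both so the iclog is stable before waiters are woken.
 */
static unsigned int iclog_bio_flags(unsigned int iclog_flags)
{
	unsigned int opf = 0;	/* base write op omitted for brevity */

	if (iclog_flags & XLOG_ICL_NEED_FLUSH)
		opf |= MODEL_REQ_PREFLUSH;
	if (iclog_flags & XLOG_ICL_NEED_FUA)
		opf |= MODEL_REQ_FUA;
	return opf;
}
```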

This patch introduces the xlog_force_iclog() helper function to
encapsulate the process of taking a reference to an iclog, switching
its state to WANT_SYNC if it is still ACTIVE, and flushing it to
stable storage correctly.

Both xfs_log_force() and xfs_log_force_lsn() are converted to use
it, as is xlog_unmount_write() which has an elaborate method of
doing exactly the same "write this iclog to stable storage"
operation.

Further, if the log force code needs to wait on an iclog in the
WANT_SYNC state, it needs to ensure that iclog also results in a
cache flush being issued. This covers the case where the iclog
contains the commit record of the CIL flush that the log force
triggered, but it hasn't been written yet because there is still an
active reference to the iclog.

Note: this whole cache flush whack-a-mole patch is a result of log
forces still being iclog state centric rather than being CIL
sequence centric. Most of this nasty code will go away in future
when log forces are converted to wait on CIL sequence push
completion rather than iclog completion. With the CIL push algorithm
guaranteeing that the CIL checkpoint is fully on stable storage when
it completes, we no longer need to iterate iclogs and push them to
ensure a CIL sequence push has completed and so all this nasty iclog
iteration and flushing code will go away.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 103 ++++++++++++++++++++++++++++-------------------
 1 file changed, 62 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index c5ccef6ab423..7da42c0656e3 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -782,6 +782,21 @@ xfs_log_mount_cancel(
 	xfs_log_unmount(mp);
 }
 
+/*
+ * Flush out the iclog to disk ensuring that device caches are flushed and
+ * the iclog hits stable storage before any completion waiters are woken.
+ */
+static inline int
+xlog_force_iclog(
+	struct xlog_in_core	*iclog)
+{
+	atomic_inc(&iclog->ic_refcnt);
+	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
+	if (iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
+	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
+}
+
 /*
  * Wait for the iclog and all prior iclogs to be written disk as required by the
  * log force state machine. Waiting on ic_force_wait ensures iclog completions
@@ -862,29 +877,18 @@ xlog_unmount_write(
 	 * transitioning log state to IOERROR. Just continue...
 	 */
 out_err:
-	if (error)
-		xfs_alert(mp, "%s: unmount record failed", __func__);
-
 	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
-	atomic_inc(&iclog->ic_refcnt);
-	if (iclog->ic_state == XLOG_STATE_ACTIVE)
-		xlog_state_switch_iclogs(log, iclog, 0);
-	else
-		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-		       iclog->ic_state == XLOG_STATE_IOERROR);
-	/*
-	 * Ensure the journal is fully flushed and on stable storage once the
-	 * iclog containing the unmount record is written.
-	 */
-	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_force_iclog(iclog);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
 		trace_xfs_log_umount_write(log, tic);
 		xfs_log_ticket_ungrant(log, tic);
 	}
+	if (error)
+		xfs_alert(mp, "%s: unmount record failed", __func__);
+
 }
 
 static void
@@ -3205,39 +3209,36 @@ xfs_log_force(
 		iclog = iclog->ic_prev;
 	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
 		if (atomic_read(&iclog->ic_refcnt) == 0) {
-			/*
-			 * We are the only one with access to this iclog.
-			 *
-			 * Flush it out now.  There should be a roundoff of zero
-			 * to show that someone has already taken care of the
-			 * roundoff from the previous sync.
-			 */
-			atomic_inc(&iclog->ic_refcnt);
+			/* We have exclusive access to this iclog. */
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
-			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog, 0))
+			if (xlog_force_iclog(iclog))
 				goto out_error;
-
+			/*
+			 * If the iclog has already been completed and reused
+			 * the header LSN will have been rewritten. Don't wait
+			 * on these iclogs that contain future modifications.
+			 */
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
 				goto out_unlock;
 		} else {
 			/*
-			 * Someone else is writing to this iclog.
-			 *
-			 * Use its call to flush out the data.  However, the
-			 * other thread may not force out this LR, so we mark
-			 * it WANT_SYNC.
+			 * Someone else is still writing to this iclog, so we
+			 * need to ensure that when they release the iclog it
+			 * gets synced immediately as we may be waiting on it.
 			 */
 			xlog_state_switch_iclogs(log, iclog, 0);
 		}
-	} else {
-		/*
-		 * If the head iclog is not active nor dirty, we just attach
-		 * ourselves to the head and go to sleep if necessary.
-		 */
-		;
 	}
 
+	/*
+	 * The iclog we are about to wait on may contain the checkpoint pushed
+	 * by the above xlog_cil_force() call, but it may not have been pushed
+	 * to disk yet. Like the ACTIVE case above, we need to make sure caches
+	 * are flushed when this iclog is written.
+	 */
+	if (iclog->ic_state == XLOG_STATE_WANT_SYNC)
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
+
 	if (flags & XFS_LOG_SYNC)
 		return xlog_wait_on_iclog(iclog);
 out_unlock:
@@ -3270,7 +3271,8 @@ xlog_force_lsn(
 			goto out_unlock;
 	}
 
-	if (iclog->ic_state == XLOG_STATE_ACTIVE) {
+	switch (iclog->ic_state) {
+	case XLOG_STATE_ACTIVE:
 		/*
 		 * We sleep here if we haven't already slept (e.g. this is the
 		 * first time we've looked at the correct iclog buf) and the
@@ -3293,12 +3295,31 @@ xlog_force_lsn(
 					&log->l_icloglock);
 			return -EAGAIN;
 		}
-		atomic_inc(&iclog->ic_refcnt);
-		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog, 0))
+		if (xlog_force_iclog(iclog))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
+		break;
+	case XLOG_STATE_WANT_SYNC:
+		/*
+		 * This iclog may contain the checkpoint pushed by the 
+		 * xlog_cil_force_seq() call, but there are other writers still
+		 * accessing it so it hasn't been pushed to disk yet. Like the
+		 * ACTIVE case above, we need to make sure caches are flushed
+		 * when this iclog is written.
+		 */
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
+		if (log_flushed)
+			*log_flushed = 1;
+		break;
+	default:
+		/*
+		 * The entire checkpoint was written by the CIL force and is on
+		 * its way to disk already. It will be stable when it
+		 * completes, so we don't need to manipulate caches here at all.
+		 * We just need to wait for completion if necessary.
+		 */
+		break;
 	}
 
 	if (flags & XFS_LOG_SYNC)
-- 
2.31.1


* [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn()
  2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
                   ` (3 preceding siblings ...)
  2021-07-22  1:53 ` [PATCH 4/5] xfs: log forces imply data device cache flushes Dave Chinner
@ 2021-07-22  1:53 ` Dave Chinner
  2021-07-22  7:15   ` Christoph Hellwig
  2021-07-22 19:13   ` Darrick J. Wong
  4 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  1:53 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Before waiting on an iclog in xfs_log_force_lsn(), we don't check
to see if the iclog has already been completed and its contents are
on stable storage. We check for completed iclogs in xfs_log_force(),
so we should do the same thing in xfs_log_force_lsn().

This fixes some random up-to-30s pauses seen when unmounting
filesystems in some tests. A log force ends up waiting on a
completed iclog, which doesn't then get flushed (and hence the log
force doesn't complete) until the background log worker issues a new
log force that flushes the iclog in question. Then the unmount
unblocks and continues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 47 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7da42c0656e3..baee9871cd65 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3147,6 +3147,35 @@ xlog_state_switch_iclogs(
 	log->l_iclog = iclog->ic_next;
 }
 
+/*
+ * Force the iclog to disk and check if the iclog has been completed before
+ * xlog_force_iclog() returns. This can happen on synchronous (e.g.
+ * pmem) or fast async storage because we drop the icloglock to issue the IO.
+ * If completion has already occurred, tell the caller so that it can avoid an
+ * unnecessary wait on the iclog.
+ */
+static int
+xlog_force_and_check_iclog(
+	struct xlog_in_core	*iclog,
+	bool			*completed)
+{
+	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+	int			error;
+
+	*completed = false;
+	error = xlog_force_iclog(iclog);
+	if (error)
+		return error;
+
+	/*
+	 * If the iclog has already been completed and reused the header LSN
+	 * will have been rewritten by completion
+	 */
+	if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
+		*completed = true;
+	return 0;
+}
+
 /*
  * Write out all data in the in-core log as of this exact moment in time.
  *
@@ -3181,7 +3210,6 @@ xfs_log_force(
 {
 	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
-	xfs_lsn_t		lsn;
 
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, 0, _RET_IP_);
@@ -3210,15 +3238,11 @@ xfs_log_force(
 	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
 		if (atomic_read(&iclog->ic_refcnt) == 0) {
 			/* We have exclusive access to this iclog. */
-			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
-			if (xlog_force_iclog(iclog))
+			bool	completed;
+
+			if (xlog_force_and_check_iclog(iclog, &completed))
 				goto out_error;
-			/*
-			 * If the iclog has already been completed and reused
-			 * the header LSN will have been rewritten. Don't wait
-			 * on these iclogs that contain future modifications.
-			 */
-			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
+			if (completed)
 				goto out_unlock;
 		} else {
 			/*
@@ -3258,6 +3282,7 @@ xlog_force_lsn(
 	bool			already_slept)
 {
 	struct xlog_in_core	*iclog;
+	bool			completed;
 
 	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
@@ -3295,10 +3320,12 @@ xlog_force_lsn(
 					&log->l_icloglock);
 			return -EAGAIN;
 		}
-		if (xlog_force_iclog(iclog))
+		if (xlog_force_and_check_iclog(iclog, &completed))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
+		if (completed)
+			goto out_unlock;
 		break;
 	case XLOG_STATE_WANT_SYNC:
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] xfs: flush data dev on external log write
  2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
@ 2021-07-22  6:41   ` Christoph Hellwig
  2021-07-22 15:52   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  6:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:31AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We incorrectly flush the log device instead of the data device when
> trying to ensure metadata is correctly on disk before writing the
> unmount record.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Ooops, looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 2/5] xfs: external logs need to flush data device
  2021-07-22  1:53 ` [PATCH 2/5] xfs: external logs need to flush data device Dave Chinner
@ 2021-07-22  6:48   ` Christoph Hellwig
  2021-07-22 18:14   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  6:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:32AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The recent journal flush/FUA changes replaced the flushing of the
> data device on every iclog write with an up-front async data device
> cache flush. Unfortunately, the assumption this was based on has
> been proven incorrect by the flush vs log tail update ordering
> issue. As the fix for that issue uses the XLOG_ICL_NEED_FLUSH flag
> to indicate that the data device needs a cache flush, we now need
> to (once again) ensure that an iclog write to an external log that
> needs a cache flush actually issues a cache flush to the data
> device as well as the log device.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-22  1:53 ` [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
@ 2021-07-22  7:06   ` Christoph Hellwig
  2021-07-22  7:28     ` Dave Chinner
  2021-07-22 19:12     ` Darrick J. Wong
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  7:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 344 bytes --]

Due to the combination of the code move and the actual change I had
a really hard time reviewing this.

> -		trace_xlog_iclog_syncing(iclog, _RET_IP_);

This tracepoint got lost.

Otherwise this looks good.

As I had to split the move from the technical changes to follow
the change I might as well share my split, attached below as two
patches:


[-- Attachment #2: 0001-xfs-fold-__xlog_state_release_iclog-into-xlog_state_.patch --]
[-- Type: text/plain, Size: 2572 bytes --]

From 6f8ff7f10b25986518ff76e9a6ef61edb629a97f Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Thu, 22 Jul 2021 08:58:49 +0200
Subject: xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog

Fold __xlog_state_release_iclog into its only caller to make an
upcoming fix easier.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c | 46 ++++++++++++++++++----------------------------
 1 file changed, 18 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a3c4d48195d98a..d7ab9d0814c2c6 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -487,29 +487,6 @@ xfs_log_reserve(
 	return error;
 }
 
-static bool
-__xlog_state_release_iclog(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog)
-{
-	lockdep_assert_held(&log->l_icloglock);
-
-	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
-		/* update tail before writing to iclog */
-		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
-
-		iclog->ic_state = XLOG_STATE_SYNCING;
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-		xlog_verify_tail_lsn(log, iclog, tail_lsn);
-		/* cycle incremented when incrementing curr_block */
-		trace_xlog_iclog_syncing(iclog, _RET_IP_);
-		return true;
-	}
-
-	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
-	return false;
-}
-
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
@@ -519,19 +496,32 @@ xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
 {
+	xfs_lsn_t		tail_lsn;
+
 	lockdep_assert_held(&log->l_icloglock);
 
 	trace_xlog_iclog_release(iclog, _RET_IP_);
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
-	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
-	    __xlog_state_release_iclog(log, iclog)) {
-		spin_unlock(&log->l_icloglock);
-		xlog_sync(log, iclog);
-		spin_lock(&log->l_icloglock);
+	if (!atomic_dec_and_test(&iclog->ic_refcnt))
+		return 0;
+
+	if (iclog->ic_state != XLOG_STATE_WANT_SYNC) {
+		ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
+		return 0;
 	}
 
+	/* update tail before writing to iclog */
+	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+	iclog->ic_state = XLOG_STATE_SYNCING;
+	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	trace_xlog_iclog_syncing(iclog, _RET_IP_);
+
+	spin_unlock(&log->l_icloglock);
+	xlog_sync(log, iclog);
+	spin_lock(&log->l_icloglock);
 	return 0;
 }
 
-- 
2.30.2


[-- Attachment #3: 0002-xfs-fix-ordering-violation-between-cache-flushes-and.patch --]
[-- Type: text/plain, Size: 10270 bytes --]

From 7a592dfeef83ca61ae001d612b8932027e0080cf Mon Sep 17 00:00:00 2001
From: Dave Chinner <dchinner@redhat.com>
Date: Thu, 22 Jul 2021 09:00:08 +0200
Subject: xfs: fix ordering violation between cache flushes and tail updates

There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop until log recovery eventually fails with a failure such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA write @ 124 and the FUA write @
125, but log recovery @ 125 failed. The difference was that the log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32),
so the inode create intent in (1,8) was not replayed and the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.

What this meant was that the journal write @ 125 did not ensure
that metadata whose IO completed before the iclog was written was
on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:

CIL push work				AIL push work
-------------				-------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  ....					push inode create buffer
					<start IO>
					.....
xlog_write(commit record)
  ....					<IO completes>
  					log tail moves
  					  xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
on stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. This requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 38 +++++++++++++++++++++++++++++---------
 fs/xfs/xfs_log_cil.c  | 13 +++++++++++--
 fs/xfs/xfs_log_priv.h |  3 ++-
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index d7ab9d0814c2c6..5767fd3ab90555 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -489,12 +489,20 @@ xfs_log_reserve(
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.
+ * it is in the WANT_SYNC state. If the caller passes in a non-zero
+ * @old_tail_lsn, then we need to check if the log tail is different to the
+ * caller's value. If it is different, this indicates that the log tail has
+ * moved since the caller sampled the log tail and issued a cache flush and so
+ * there may be metadata on disk that we need to flush before this iclog is
+ * written. In that case, set the XLOG_ICL_NEED_FLUSH flag so that when the
+ * iclog is synced we correctly issue a cache flush before the iclog is
+ * submitted.
  */
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		old_tail_lsn)
 {
 	xfs_lsn_t		tail_lsn;
 
@@ -504,6 +512,19 @@ xlog_state_release_iclog(
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
+	/*
+	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
+	 * of the tail LSN into the iclog so we guarantee that the log tail does
+	 * not move between deciding if a cache flush is required and writing
+	 * the LSN into the iclog below.
+	 */
+	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
+		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+
+		if (old_tail_lsn && tail_lsn != old_tail_lsn)
+			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+	}
+
 	if (!atomic_dec_and_test(&iclog->ic_refcnt))
 		return 0;
 
@@ -513,7 +534,6 @@ xlog_state_release_iclog(
 	}
 
 	/* update tail before writing to iclog */
-	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
 	iclog->ic_state = XLOG_STATE_SYNCING;
 	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	xlog_verify_tail_lsn(log, iclog, tail_lsn);
@@ -859,7 +879,7 @@ xlog_unmount_write(
 	 * iclog containing the unmount record is written.
 	 */
 	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -2303,7 +2323,7 @@ xlog_write_copy_finish(
 	return 0;
 
 release_iclog:
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	spin_unlock(&log->l_icloglock);
 	return error;
 }
@@ -2522,7 +2542,7 @@ xlog_write(
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, 0);
 	}
 	spin_unlock(&log->l_icloglock);
 
@@ -2960,7 +2980,7 @@ xlog_state_get_iclog_space(
 		 * reference to the iclog.
 		 */
 		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-			error = xlog_state_release_iclog(log, iclog);
+			error = xlog_state_release_iclog(log, iclog, 0);
 		spin_unlock(&log->l_icloglock);
 		if (error)
 			return error;
@@ -3196,7 +3216,7 @@ xfs_log_force(
 			atomic_inc(&iclog->ic_refcnt);
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog))
+			if (xlog_state_release_iclog(log, iclog, 0))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3276,7 +3296,7 @@ xlog_force_lsn(
 		}
 		atomic_inc(&iclog->ic_refcnt);
 		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog))
+		if (xlog_state_release_iclog(log, iclog, 0))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b128aaa9b870d5..4c44bc3786c0f0 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -654,8 +654,9 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
+	xfs_lsn_t		preflush_tail_lsn;
 	xfs_lsn_t		commit_lsn;
-	xfs_lsn_t		push_seq;
+	xfs_csn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
@@ -730,7 +731,15 @@ xlog_cil_push_work(
 	 * because we hold the flush lock exclusively. Hence we can now issue
 	 * a cache flush to ensure all the completed metadata in the journal we
 	 * are about to overwrite is on stable storage.
+	 *
+	 * Because we are issuing this cache flush before we've written the
+	 * tail lsn to the iclog, we can have metadata IO completions move the
+	 * tail forwards between the completion of this flush and the iclog
+	 * being written. In this case, we need to re-issue the cache flush
+	 * before the iclog write. To detect whether the log tail moves, sample
+	 * the tail LSN *before* we issue the flush.
 	 */
+	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
 	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
 				&bdev_flush);
 
@@ -941,7 +950,7 @@ xlog_cil_push_work(
 	 * storage.
 	 */
 	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
-	xlog_state_release_iclog(log, commit_iclog);
+	xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
 	spin_unlock(&log->l_icloglock);
 	return;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4c41bbfa33b0df..7cbde0b4f9901d 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -497,7 +497,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
+int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
+		xfs_lsn_t log_tail_lsn);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
-- 
2.30.2



* Re: [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22  1:53 ` [PATCH 4/5] xfs: log forces imply data device cache flushes Dave Chinner
@ 2021-07-22  7:14   ` Christoph Hellwig
  2021-07-22  7:32     ` Dave Chinner
  2021-07-22 19:30   ` Darrick J. Wong
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  7:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> +static inline int
> +xlog_force_iclog(
> +	struct xlog_in_core	*iclog)
> +{
> +	atomic_inc(&iclog->ic_refcnt);
> +	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> +	if (iclog->ic_state == XLOG_STATE_ACTIVE)
> +		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
> +	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
> +}

This also combines code move with technical changes.  At least it is
small enough to be reviewable.

>  out_err:
> -	if (error)
> -		xfs_alert(mp, "%s: unmount record failed", __func__);
> -
>  

> +	if (error)
> +		xfs_alert(mp, "%s: unmount record failed", __func__);
> +
>  }

This now doesn't print an error when the log reservation or log write
fails, but only one when the log force fails.  Also there is a spurious
empty line at the end.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn()
  2021-07-22  1:53 ` [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
@ 2021-07-22  7:15   ` Christoph Hellwig
  2021-07-22 19:13   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  7:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-22  7:06   ` Christoph Hellwig
@ 2021-07-22  7:28     ` Dave Chinner
  2021-07-22 19:12     ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  7:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 08:06:47AM +0100, Christoph Hellwig wrote:
> Due to the combination of the code move and the actual change I had
> a really hard time reviewing this.
> 
> > -		trace_xlog_iclog_syncing(iclog, _RET_IP_);
> 
> This tracepoint got lost.

Ah, will fix that.

> Otherwise this looks good.
> 
> As I had to split the move from the technical changes to follow
> the change I might as well share my split, attached below as two
> patches:

Yeah, I haven't got to that point yet - I'm still chasing a wild
goose that is now only showing up once every 3 or 4 hours of
cyclic testing, so I've been trying to capture that in sufficient
detail to be able to diagnose the failure. I wanted to get these out
there so new eyes might see what I'm missing or suggest a better
approach...

I'll add this split into my patchset tomorrow when I go looking at
the corpses overnight testing created...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22  7:14   ` Christoph Hellwig
@ 2021-07-22  7:32     ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2021-07-22  7:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 08:14:32AM +0100, Christoph Hellwig wrote:
> On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> > +static inline int
> > +xlog_force_iclog(
> > +	struct xlog_in_core	*iclog)
> > +{
> > +	atomic_inc(&iclog->ic_refcnt);
> > +	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> > +	if (iclog->ic_state == XLOG_STATE_ACTIVE)
> > +		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
> > +	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
> > +}
> 
> This also combines code move with technical changes.  At least it is
> small enough to be reviewable.

Yup, another split really needed here. And the need to catch
WANT_SYNC iclogs for flushing should really be split out, too,
because that's more than just "oops, I forgot to mark ACTIVE iclogs
we force out for flushing"...

> >  out_err:
> > -	if (error)
> > -		xfs_alert(mp, "%s: unmount record failed", __func__);
> > -
> >  
> 
> > +	if (error)
> > +		xfs_alert(mp, "%s: unmount record failed", __func__);
> > +
> >  }
> 
> This now doesn't print an error when the log reservation or log write
> fails, but only one when the log force fails.  Also there i a spurious
> empty line at the end.

Ah, I thought it caught them - oh, error gets overwritten. Doh! I'll
fix that up.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks for looking at these quickly, Christoph.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/5] xfs: flush data dev on external log write
  2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
  2021-07-22  6:41   ` Christoph Hellwig
@ 2021-07-22 15:52   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 15:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:31AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We incorrectly flush the log device instead of the data device when
> trying to ensure metadata is correctly on disk before writing the
> unmount record.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

DOH

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 36fa2650b081..96434cc4df6e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -833,7 +833,7 @@ xlog_write_unmount_record(
>  	 * stamp the tail LSN into the unmount record.
>  	 */
>  	if (log->l_targ != log->l_mp->m_ddev_targp)
> -		blkdev_issue_flush(log->l_targ->bt_bdev);
> +		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
>  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
> -- 
> 2.31.1
> 


* Re: [PATCH 2/5] xfs: external logs need to flush data device
  2021-07-22  1:53 ` [PATCH 2/5] xfs: external logs need to flush data device Dave Chinner
  2021-07-22  6:48   ` Christoph Hellwig
@ 2021-07-22 18:14   ` Darrick J. Wong
  2021-07-22 21:45     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 18:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:32AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The recent journal flush/FUA changes replaced the flushing of the
> data device on every iclog write with an up-front async data device
> cache flush. Unfortunately, the assumption this was based on has
> been proven incorrect by the flush vs log tail update ordering
> issue. As the fix for that issue uses the XLOG_ICL_NEED_FLUSH flag
> to indicate that the data device needs a cache flush, we now need
> to (once again) ensure that an iclog write to an external log that
> needs a cache flush actually issues a cache flush to the data
> device as well as the log device.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 96434cc4df6e..a3c4d48195d9 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -827,13 +827,6 @@ xlog_write_unmount_record(
>  	/* account for space used by record data */
>  	ticket->t_curr_res -= sizeof(ulf);
>  
> -	/*
> -	 * For external log devices, we need to flush the data device cache
> -	 * first to ensure all metadata writeback is on stable storage before we
> -	 * stamp the tail LSN into the unmount record.
> -	 */
> -	if (log->l_targ != log->l_mp->m_ddev_targp)
> -		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
>  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
>  }
>  
> @@ -1796,10 +1789,20 @@ xlog_write_iclog(
>  	 * metadata writeback and causing priority inversions.
>  	 */
>  	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> -	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
> +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) {
>  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> +		/*
> +		 * For external log devices, we also need to flush the data
> +		 * device cache first to ensure all metadata writeback covered
> +		 * by the LSN in this iclog is on stable storage. This is slow,
> +		 * but it *must* complete before we issue the external log IO.

I'm a little confused about what's going on here.  We're about to write
a log record to disk, with h_tail_lsn reflecting the tail of the log and
h_lsn reflecting the current head of the log (i.e. this record).

If the log tail has moved forward since the last log record was written
and this fs has an external log, we need to flush the data device
because the AIL could have written logged items back into the filesystem
and we need to ensure those items have been persisted before we write to
the log the fact that the tail moved forward.  The AIL itself doesn't
issue cache flushes (nor does it need to), so that's why we do that
here.

Why don't we need a flush like this if only FUA is set?  Is it not
possible to write a checkpoint that fits within a single iclog after the
log tail has moved forward?

--D

> +		 */
> +		if (log->l_targ != log->l_mp->m_ddev_targp)
> +			blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
> +	}
>  	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
>  		iclog->ic_bio.bi_opf |= REQ_FUA;
> +
>  	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
>  
>  	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
> -- 
> 2.31.1
> 


* Re: [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-22  7:06   ` Christoph Hellwig
  2021-07-22  7:28     ` Dave Chinner
@ 2021-07-22 19:12     ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 19:12 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, linux-xfs

On Thu, Jul 22, 2021 at 08:06:47AM +0100, Christoph Hellwig wrote:
> Due to the combination of the code move and the actual change I had
> a really hard time reviewing this.
> 
> > -		trace_xlog_iclog_syncing(iclog, _RET_IP_);
> 
> This tracepoint got lost.
> 
> Otherwise this looks good.
> 
> As I had to split the move from the technical changes to follow
> the change I might as well share my split, attached below as two
> patches:
> 

> From 6f8ff7f10b25986518ff76e9a6ef61edb629a97f Mon Sep 17 00:00:00 2001
> From: Dave Chinner <dchinner@redhat.com>
> Date: Thu, 22 Jul 2021 08:58:49 +0200
> Subject: xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
> 
> Fold __xlog_state_release_iclog intos it's only caller to prepare
> make an upcomding fix easier.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [hch: split from a larger patch]
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I like this split out as a refactoring patch, it /does/ make things
easier to read.  For this patch,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

> ---
>  fs/xfs/xfs_log.c | 46 ++++++++++++++++++----------------------------
>  1 file changed, 18 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index a3c4d48195d98a..d7ab9d0814c2c6 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -487,29 +487,6 @@ xfs_log_reserve(
>  	return error;
>  }
>  
> -static bool
> -__xlog_state_release_iclog(
> -	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> -{
> -	lockdep_assert_held(&log->l_icloglock);
> -
> -	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
> -		/* update tail before writing to iclog */
> -		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> -
> -		iclog->ic_state = XLOG_STATE_SYNCING;
> -		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> -		xlog_verify_tail_lsn(log, iclog, tail_lsn);
> -		/* cycle incremented when incrementing curr_block */
> -		trace_xlog_iclog_syncing(iclog, _RET_IP_);
> -		return true;
> -	}
> -
> -	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
> -	return false;
> -}
> -
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
>   * it is in the WANT_SYNC state.
> @@ -519,19 +496,32 @@ xlog_state_release_iclog(
>  	struct xlog		*log,
>  	struct xlog_in_core	*iclog)
>  {
> +	xfs_lsn_t		tail_lsn;
> +
>  	lockdep_assert_held(&log->l_icloglock);
>  
>  	trace_xlog_iclog_release(iclog, _RET_IP_);
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		return -EIO;
>  
> -	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
> -	    __xlog_state_release_iclog(log, iclog)) {
> -		spin_unlock(&log->l_icloglock);
> -		xlog_sync(log, iclog);
> -		spin_lock(&log->l_icloglock);
> +	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> +		return 0;
> +
> +	if (iclog->ic_state != XLOG_STATE_WANT_SYNC) {
> +		ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
> +		return 0;
>  	}
>  
> +	/* update tail before writing to iclog */
> +	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> +	iclog->ic_state = XLOG_STATE_SYNCING;
> +	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> +	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> +	trace_xlog_iclog_syncing(iclog, _RET_IP_);
> +
> +	spin_unlock(&log->l_icloglock);
> +	xlog_sync(log, iclog);
> +	spin_lock(&log->l_icloglock);
>  	return 0;
>  }
>  
> -- 
> 2.30.2
> 

> From 7a592dfeef83ca61ae001d612b8932027e0080cf Mon Sep 17 00:00:00 2001
> From: Dave Chinner <dchinner@redhat.com>
> Date: Thu, 22 Jul 2021 09:00:08 +0200
> Subject: xfs: fix ordering violation between cache flushes and tail updates
> 
> There is a race between the new CIL async data device metadata IO
> completion cache flush and the log tail in the iclog the flush
> covers being updated. This can be seen by repeating generic/482 in a
> loop until log recovery eventually fails with a failure such as this:
> 
> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
> XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
> XFS (dm-3): Unmount and run xfs_repair
> XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
> 
> Analysis of the logwrite replay shows that there were no writes to
> the data device between the FUA write @ 124 and the FUA write @
> 125, but log recovery @ 125 failed. The difference was that the one
> log write @ 125 moved the tail of the log forwards from (1,8) to
> (1,32)
> and so the inode create intent in (1,8) was not replayed and so the
> inode cluster was zero on disk when replay of the first inode item
> in (1,32) was attempted.
> 
> What this meant was that the journal write that occurred @ 125 did
> not guarantee that the metadata whose writeback completed before
> the iclog was written was correctly on stable storage. The tail of
> the log moved forward,
> so IO must have been completed between the two iclog writes. This
> means that there is a race condition between the unconditional async
> cache flush in the CIL push work and the tail LSN that is written to
> the iclog. This happens like so:
> 
> CIL push work				AIL push work
> -------------				-------------
> Add to committing list
> start async data dev cache flush
> .....
> <flush completes>
> <all writes to old tail lsn are stable>
> xlog_write
>   ....					push inode create buffer
> 					<start IO>
> 					.....
> xlog_write(commit record)
>   ....					<IO completes>
>   					log tail moves
>   					  xlog_assign_tail_lsn()
> start_lsn == commit_lsn
>   <no iclog preflush!>
> xlog_state_release_iclog
>   __xlog_state_release_iclog()
>     <writes *new* tail_lsn into iclog>
>   xlog_sync()
>     ....
>     submit_bio()
> <tail in log moves forward without flushing written metadata>
> 
> Essentially, this can only occur if the commit iclog is issued
> without a cache flush. If the iclog bio is submitted with
> REQ_PREFLUSH, then it will guarantee that all the completed IO is
> on stable storage before the iclog bio with the new tail LSN in it
> is written to the log.
> 
> IOWs, the tail lsn that is written to the iclog needs to be sampled
> *before* we issue the cache flush that guarantees all IO up to that
> LSN has been completed.
> 
> To fix this without giving up the performance advantage of the
> flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
> compared to 5.13), we need to ensure that we always issue a cache
> flush if the tail LSN changes between the initial async flush and
> the commit record being written. This requires sampling the tail_lsn
> before we start the flush, and then passing the sampled tail LSN to
> xlog_state_release_iclog() so it can determine if the tail LSN
> has changed while writing the checkpoint. If the tail LSN has
> changed, then it needs to set the NEED_FLUSH flag on the iclog and
> we'll issue another cache flush before writing the iclog.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 38 +++++++++++++++++++++++++++++---------
>  fs/xfs/xfs_log_cil.c  | 13 +++++++++++--
>  fs/xfs/xfs_log_priv.h |  3 ++-
>  3 files changed, 42 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index d7ab9d0814c2c6..5767fd3ab90555 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -489,12 +489,20 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.
> + * it is in the WANT_SYNC state. If the caller passes in a non-zero
> + * @old_tail_lsn, then we need to check if the log tail is different to the

Nit: NULLCOMMITLSN instead of zero?

> + * caller's value. If it is different, this indicates that the log tail has
> + * moved since the caller sampled the log tail and issued a cache flush and so

"...sampled the log tail as part of issuing a data device cache flush, so..."

> + * there may be metadata on disk taht we need to flush before this iclog is

s/taht/that/

> + * written. In that case, set the XLOG_ICL_NEED_FLUSH flag so taht when the

s/taht/that/

> + * iclog is synced we correctly issue a cache flush before the iclog is

"...issue a cache flush to the log and data devices before the iclog..."

With those issues addressed I think this look decent, modulo my
question in the previous patch.

--D

> + * submitted.
>   */
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		old_tail_lsn)
>  {
>  	xfs_lsn_t		tail_lsn;
>  
> @@ -504,6 +512,19 @@ xlog_state_release_iclog(
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		return -EIO;
>  
> +	/*
> +	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
> +	 * of the tail LSN into the iclog so we guarantee that the log tail does
> +	 * not move between deciding if a cache flush is required and writing
> +	 * the LSN into the iclog below.
> +	 */
> +	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
> +		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> +
> +		if (old_tail_lsn && tail_lsn != old_tail_lsn)
> +			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +	}
> +
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
>  		return 0;
>  
> @@ -513,7 +534,6 @@ xlog_state_release_iclog(
>  	}
>  
>  	/* update tail before writing to iclog */
> -	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
>  	iclog->ic_state = XLOG_STATE_SYNCING;
>  	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> @@ -859,7 +879,7 @@ xlog_unmount_write(
>  	 * iclog containing the unmount record is written.
>  	 */
>  	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, 0);
>  	xlog_wait_on_iclog(iclog);
>  
>  	if (tic) {
> @@ -2303,7 +2323,7 @@ xlog_write_copy_finish(
>  	return 0;
>  
>  release_iclog:
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, 0);
>  	spin_unlock(&log->l_icloglock);
>  	return error;
>  }
> @@ -2522,7 +2542,7 @@ xlog_write(
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
> -		error = xlog_state_release_iclog(log, iclog);
> +		error = xlog_state_release_iclog(log, iclog, 0);
>  	}
>  	spin_unlock(&log->l_icloglock);
>  
> @@ -2960,7 +2980,7 @@ xlog_state_get_iclog_space(
>  		 * reference to the iclog.
>  		 */
>  		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
> -			error = xlog_state_release_iclog(log, iclog);
> +			error = xlog_state_release_iclog(log, iclog, 0);
>  		spin_unlock(&log->l_icloglock);
>  		if (error)
>  			return error;
> @@ -3196,7 +3216,7 @@ xfs_log_force(
>  			atomic_inc(&iclog->ic_refcnt);
>  			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  			xlog_state_switch_iclogs(log, iclog, 0);
> -			if (xlog_state_release_iclog(log, iclog))
> +			if (xlog_state_release_iclog(log, iclog, 0))
>  				goto out_error;
>  
>  			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> @@ -3276,7 +3296,7 @@ xlog_force_lsn(
>  		}
>  		atomic_inc(&iclog->ic_refcnt);
>  		xlog_state_switch_iclogs(log, iclog, 0);
> -		if (xlog_state_release_iclog(log, iclog))
> +		if (xlog_state_release_iclog(log, iclog, 0))
>  			goto out_error;
>  		if (log_flushed)
>  			*log_flushed = 1;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b128aaa9b870d5..4c44bc3786c0f0 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -654,8 +654,9 @@ xlog_cil_push_work(
>  	struct xfs_trans_header thdr;
>  	struct xfs_log_iovec	lhdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> +	xfs_lsn_t		preflush_tail_lsn;
>  	xfs_lsn_t		commit_lsn;
> -	xfs_lsn_t		push_seq;
> +	xfs_csn_t		push_seq;
>  	struct bio		bio;
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  
> @@ -730,7 +731,15 @@ xlog_cil_push_work(
>  	 * because we hold the flush lock exclusively. Hence we can now issue
>  	 * a cache flush to ensure all the completed metadata in the journal we
>  	 * are about to overwrite is on stable storage.
> +	 *
> +	 * Because we are issuing this cache flush before we've written the
> +	 * tail lsn to the iclog, we can have metadata IO completions move the
> +	 * tail forwards between the completion of this flush and the iclog
> +	 * being written. In this case, we need to re-issue the cache flush
> +	 * before the iclog write. To detect whether the log tail moves, sample
> +	 * the tail LSN *before* we issue the flush.
>  	 */
> +	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
>  	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
>  				&bdev_flush);
>  
> @@ -941,7 +950,7 @@ xlog_cil_push_work(
>  	 * storage.
>  	 */
>  	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> -	xlog_state_release_iclog(log, commit_iclog);
> +	xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
>  	spin_unlock(&log->l_icloglock);
>  	return;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 4c41bbfa33b0df..7cbde0b4f9901d 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -497,7 +497,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> -int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
> +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
> +		xfs_lsn_t log_tail_lsn);
>  
>  /*
>   * When we crack an atomic LSN, we sample it first so that the value will not
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn()
  2021-07-22  1:53 ` [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
  2021-07-22  7:15   ` Christoph Hellwig
@ 2021-07-22 19:13   ` Darrick J. Wong
  1 sibling, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 19:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:35AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Before waiting on an iclog in xfs_log_force_lsn(), we don't check to
> see if the iclog has already been completed and its contents are on
> stable storage. We check for completed iclogs in xfs_log_force(), so
> we should do the same thing for xfs_log_force_lsn().
> 
> This fixed some random up-to-30s pauses seen in unmounting
> filesystems in some tests. A log force ends up waiting on a
> completed iclog, and that doesn't then get flushed (and hence the
> log force doesn't get completed) until the background log worker
> issues a log force
> that flushes the iclog in question. Then the unmount unblocks and
> continues.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Heh.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_log.c | 47 +++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 37 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 7da42c0656e3..baee9871cd65 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3147,6 +3147,35 @@ xlog_state_switch_iclogs(
>  	log->l_iclog = iclog->ic_next;
>  }
>  
> +/*
> + * Force the iclog to disk and check if the iclog has been completed before
> + * xlog_force_iclog() returns. This can happen on synchronous (e.g.
> + * pmem) or fast async storage because we drop the icloglock to issue the IO.
> + * If completion has already occurred, tell the caller so that it can avoid an
> + * unnecessary wait on the iclog.
> + */
> +static int
> +xlog_force_and_check_iclog(
> +	struct xlog_in_core	*iclog,
> +	bool			*completed)
> +{
> +	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +	int			error;
> +
> +	*completed = false;
> +	error = xlog_force_iclog(iclog);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * If the iclog has already been completed and reused, the header
> +	 * LSN will have been rewritten by completion.
> +	 */
> +	if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> +		*completed = true;
> +	return 0;
> +}
> +
>  /*
>   * Write out all data in the in-core log as of this exact moment in time.
>   *
> @@ -3181,7 +3210,6 @@ xfs_log_force(
>  {
>  	struct xlog		*log = mp->m_log;
>  	struct xlog_in_core	*iclog;
> -	xfs_lsn_t		lsn;
>  
>  	XFS_STATS_INC(mp, xs_log_force);
>  	trace_xfs_log_force(mp, 0, _RET_IP_);
> @@ -3210,15 +3238,11 @@ xfs_log_force(
>  	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
>  		if (atomic_read(&iclog->ic_refcnt) == 0) {
>  			/* We have exclusive access to this iclog. */
> -			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> -			if (xlog_force_iclog(iclog))
> +			bool	completed;
> +
> +			if (xlog_force_and_check_iclog(iclog, &completed))
>  				goto out_error;
> -			/*
> -			 * If the iclog has already been completed and reused
> -			 * the header LSN will have been rewritten. Don't wait
> -			 * on these iclogs that contain future modifications.
> -			 */
> -			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> +			if (completed)
>  				goto out_unlock;
>  		} else {
>  			/*
> @@ -3258,6 +3282,7 @@ xlog_force_lsn(
>  	bool			already_slept)
>  {
>  	struct xlog_in_core	*iclog;
> +	bool			completed;
>  
>  	spin_lock(&log->l_icloglock);
>  	iclog = log->l_iclog;
> @@ -3295,10 +3320,12 @@ xlog_force_lsn(
>  					&log->l_icloglock);
>  			return -EAGAIN;
>  		}
> -		if (xlog_force_iclog(iclog))
> +		if (xlog_force_and_check_iclog(iclog, &completed))
>  			goto out_error;
>  		if (log_flushed)
>  			*log_flushed = 1;
> +		if (completed)
> +			goto out_unlock;
>  		break;
>  	case XLOG_STATE_WANT_SYNC:
>  		/*
> -- 
> 2.31.1
> 


* Re: [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22  1:53 ` [PATCH 4/5] xfs: log forces imply data device cache flushes Dave Chinner
  2021-07-22  7:14   ` Christoph Hellwig
@ 2021-07-22 19:30   ` Darrick J. Wong
  2021-07-22 22:12     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 19:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> After fixing the tail_lsn vs cache flush race, generic/482 continued
> to fail in a similar way where cache flushes were missing before
> iclog FUA writes. Tracing of iclog state changes during the fsstress

Heh. ;)

> workload portion of the test (via xlog_iclog* events) indicated that
> iclog writes were coming from two sources - CIL pushes and log
> forces (due to fsync/O_SYNC operations). All of the cases where a
> recovery problem was triggered indicated that the log force was the
> source of the iclog write that was not preceded by a cache flush.
> 
> This was an oversight in the modifications made in commit
> eef983ffeae7 ("xfs: journal IO cache flush reductions"). Log forces
> for fsync imply a data device cache flush has been issued if an
> iclog was flushed to disk and is indicated to the caller via the
> log_flushed parameter so they can elide the device cache flush if
> the journal issued one.
> 
> The change in eef983ffeae7 results in iclogs only issuing a cache
> flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
> added to the iclogs that the log force code flushes to disk. Hence
> log forces are no longer guaranteeing that a cache flush is issued,
> hence opening up a potential on-disk ordering failure.
> 
> Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
> the actual iclogs it forces to the journal are also on stable
> storage before it returns to the caller.
> 
> This patch introduces the xlog_force_iclog() helper function to
> encapsulate the process of taking a reference to an iclog, switching
> its state if WANT_SYNC and flushing it to stable storage correctly.
> 
> Both xfs_log_force() and xfs_log_force_lsn() are converted to use
> it, as is xlog_unmount_write() which has an elaborate method of
> doing exactly the same "write this iclog to stable storage"
> operation.
> 
> Further, if the log force code needs to wait on a iclog in the
> WANT_SYNC state, it needs to ensure that iclog also results in a
> cache flush being issued. This covers the case where the iclog
> contains the commit record of the CIL flush that the log force
> triggered, but it hasn't been written yet because there is still an
> active reference to the iclog.
> 
> Note: this whole cache flush whack-a-mole patch is a result of log
> forces still being iclog state centric rather than being CIL
> sequence centric. Most of this nasty code will go away in future
> when log forces are converted to wait on CIL sequence push
> completion rather than iclog completion. With the CIL push algorithm
> guaranteeing that the CIL checkpoint is fully on stable storage when
> it completes, we no longer need to iterate iclogs and push them to
> ensure a CIL sequence push has completed and so all this nasty iclog
> iteration and flushing code will go away.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 103 ++++++++++++++++++++++++++++-------------------
>  1 file changed, 62 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index c5ccef6ab423..7da42c0656e3 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -782,6 +782,21 @@ xfs_log_mount_cancel(
>  	xfs_log_unmount(mp);
>  }
>  
> +/*
> + * Flush out the iclog to disk ensuring that device caches are flushed and
> + * the iclog hits stable storage before any completion waiters are woken.
> + */
> +static inline int
> +xlog_force_iclog(
> +	struct xlog_in_core	*iclog)
> +{
> +	atomic_inc(&iclog->ic_refcnt);
> +	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> +	if (iclog->ic_state == XLOG_STATE_ACTIVE)
> +		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
> +	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
> +}
> +
>  /*
>   * Wait for the iclog and all prior iclogs to be written disk as required by the
>   * log force state machine. Waiting on ic_force_wait ensures iclog completions
> @@ -862,29 +877,18 @@ xlog_unmount_write(
>  	 * transitioning log state to IOERROR. Just continue...
>  	 */
>  out_err:
> -	if (error)
> -		xfs_alert(mp, "%s: unmount record failed", __func__);

Not sure why this was removed.

> -
>  	spin_lock(&log->l_icloglock);
>  	iclog = log->l_iclog;
> -	atomic_inc(&iclog->ic_refcnt);
> -	if (iclog->ic_state == XLOG_STATE_ACTIVE)
> -		xlog_state_switch_iclogs(log, iclog, 0);
> -	else
> -		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
> -		       iclog->ic_state == XLOG_STATE_IOERROR);
> -	/*
> -	 * Ensure the journal is fully flushed and on stable storage once the
> -	 * iclog containing the unmount record is written.
> -	 */
> -	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> -	error = xlog_state_release_iclog(log, iclog, 0);
> +	error = xlog_force_iclog(iclog);
>  	xlog_wait_on_iclog(iclog);
>  
>  	if (tic) {
>  		trace_xfs_log_umount_write(log, tic);
>  		xfs_log_ticket_ungrant(log, tic);
>  	}
> +	if (error)
> +		xfs_alert(mp, "%s: unmount record failed", __func__);
> +
>  }
>  
>  static void
> @@ -3205,39 +3209,36 @@ xfs_log_force(
>  		iclog = iclog->ic_prev;
>  	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
>  		if (atomic_read(&iclog->ic_refcnt) == 0) {
> -			/*
> -			 * We are the only one with access to this iclog.
> -			 *
> -			 * Flush it out now.  There should be a roundoff of zero
> -			 * to show that someone has already taken care of the
> -			 * roundoff from the previous sync.
> -			 */
> -			atomic_inc(&iclog->ic_refcnt);
> +			/* We have exclusive access to this iclog. */
>  			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> -			xlog_state_switch_iclogs(log, iclog, 0);
> -			if (xlog_state_release_iclog(log, iclog, 0))
> +			if (xlog_force_iclog(iclog))
>  				goto out_error;
> -
> +			/*
> +			 * If the iclog has already been completed and reused
> +			 * the header LSN will have been rewritten. Don't wait
> +			 * on these iclogs that contain future modifications.
> +			 */
>  			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
>  				goto out_unlock;
>  		} else {
>  			/*
> -			 * Someone else is writing to this iclog.
> -			 *
> -			 * Use its call to flush out the data.  However, the
> -			 * other thread may not force out this LR, so we mark
> -			 * it WANT_SYNC.
> +			 * Someone else is still writing to this iclog, so we
> +			 * need to ensure that when they release the iclog it
> +			 * gets synced immediately as we may be waiting on it.
>  			 */
>  			xlog_state_switch_iclogs(log, iclog, 0);
>  		}
> -	} else {
> -		/*
> -		 * If the head iclog is not active nor dirty, we just attach
> -		 * ourselves to the head and go to sleep if necessary.
> -		 */
> -		;
>  	}
>  
> +	/*
> +	 * The iclog we are about to wait on may contain the checkpoint pushed
> +	 * by the above xlog_cil_force() call, but it may not have been pushed
> +	 * to disk yet. Like the ACTIVE case above, we need to make sure caches
> +	 * are flushed when this iclog is written.
> +	 */
> +	if (iclog->ic_state == XLOG_STATE_WANT_SYNC)
> +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> +
>  	if (flags & XFS_LOG_SYNC)
>  		return xlog_wait_on_iclog(iclog);
>  out_unlock:
> @@ -3270,7 +3271,8 @@ xlog_force_lsn(
>  			goto out_unlock;
>  	}
>  
> -	if (iclog->ic_state == XLOG_STATE_ACTIVE) {
> +	switch (iclog->ic_state) {
> +	case XLOG_STATE_ACTIVE:
>  		/*
>  		 * We sleep here if we haven't already slept (e.g. this is the
>  		 * first time we've looked at the correct iclog buf) and the
> @@ -3293,12 +3295,31 @@ xlog_force_lsn(
>  					&log->l_icloglock);
>  			return -EAGAIN;
>  		}
> -		atomic_inc(&iclog->ic_refcnt);
> -		xlog_state_switch_iclogs(log, iclog, 0);
> -		if (xlog_state_release_iclog(log, iclog, 0))
> +		if (xlog_force_iclog(iclog))
>  			goto out_error;
>  		if (log_flushed)
>  			*log_flushed = 1;
> +		break;
> +	case XLOG_STATE_WANT_SYNC:
> +		/*
> +		 * This iclog may contain the checkpoint pushed by the 

Nit: extra whitespace at end of line.

> +		 * xlog_cil_force_seq() call, but there are other writers still
> +		 * accessing it so it hasn't been pushed to disk yet. Like the
> +		 * ACTIVE case above, we need to make sure caches are flushed
> +		 * when this iclog is written.
> +		 */
> +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> +		if (log_flushed)
> +			*log_flushed = 1;
> +		break;
> +	default:
> +		/*
> +		 * The entire checkpoint was written by the CIL force and is on
> +		 * it's way to disk already. It will be stable when it

s/it's/its/

So now that we're at the end of this series, what are the rules for
when we issue cache flushes and FUA writes?

- Writing the unmount record always flushes the data and log devices.
  Does it need to flush the rt device too?  I guess xfs_free_buftarg
  does that.

- Start an async flush of the data device when doing CIL push work so
  that anything the AIL wrote to disk (and pushed the tail) is persisted
  before we assign a tail to the log record that we write at the end?

- If any other AIL work completes (and pushes the tail ahead) by the
  time we actually write the log record, flush the data device a second
  time?

- If a log checkpoint spans multiple iclogs, flush the *log* device
  before writing the iclog with the commit record in it.

- Any time we write an iclog that commits a checkpoint, write that
  record with FUA to ensure it's persisted.

- If we're forcing the log to disk as part of an integrity operation
  (fsync, syncfs, etc.) then issue cache flushes for ... each? iclog
  written to disk?  And use FUA for that write too?

--D

> +		 * completes, so we don't need to manipulate caches here at all.
> +		 * We just need to wait for completion if necessary.
> +		 */
> +		break;
>  	}
>  
>  	if (flags & XFS_LOG_SYNC)
> -- 
> 2.31.1
> 


* Re: [PATCH 2/5] xfs: external logs need to flush data device
  2021-07-22 18:14   ` Darrick J. Wong
@ 2021-07-22 21:45     ` Dave Chinner
  2021-07-22 23:10       ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2021-07-22 21:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 11:14:45AM -0700, Darrick J. Wong wrote:
> On Thu, Jul 22, 2021 at 11:53:32AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The recent journal flush/FUA changes replaced the flushing of the
> > data device on every iclog write with an up-front async data device
> > cache flush. Unfortunately, the assumption on which this was based
> > on has been proven incorrect by the flush vs log tail update
> > ordering issue. As the fix for that issue uses the
> > XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
> > flush, we now need to (once again) ensure that an iclog write to
> > external logs that need a cache flush to be issued actually issue a
> > cache flush to the data device as well as the log device.
> > 
> > Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 19 +++++++++++--------
> >  1 file changed, 11 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 96434cc4df6e..a3c4d48195d9 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -827,13 +827,6 @@ xlog_write_unmount_record(
> >  	/* account for space used by record data */
> >  	ticket->t_curr_res -= sizeof(ulf);
> >  
> > -	/*
> > -	 * For external log devices, we need to flush the data device cache
> > -	 * first to ensure all metadata writeback is on stable storage before we
> > -	 * stamp the tail LSN into the unmount record.
> > -	 */
> > -	if (log->l_targ != log->l_mp->m_ddev_targp)
> > -		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
> >  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
> >  }
> >  
> > @@ -1796,10 +1789,20 @@ xlog_write_iclog(
> >  	 * metadata writeback and causing priority inversions.
> >  	 */
> >  	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> > -	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
> > +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) {
> >  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> > +		/*
> > +		 * For external log devices, we also need to flush the data
> > +		 * device cache first to ensure all metadata writeback covered
> > +		 * by the LSN in this iclog is on stable storage. This is slow,
> > +		 * but it *must* complete before we issue the external log IO.
> 
> I'm a little confused about what's going on here.  We're about to write
> a log record to disk, with h_tail_lsn reflecting the tail of the log and
> h_lsn reflecting the current head of the log (i.e. this record).
> 
> If the log tail has moved forward since the last log record was written
> and this fs has an external log, we need to flush the data device
> because the AIL could have written logged items back into the filesystem
> and we need to ensure those items have been persisted before we write to
> the log the fact that the tail moved forward.  The AIL itself doesn't
> issue cache flushes (nor does it need to), so that's why we do that
> here.
> 
> Why don't we need a flush like this if only FUA is set?  Is it not
> possible to write a checkpoint that fits within a single iclog after the
> log tail has moved forward?

Yes, it is, and that race condition is exactly what the next
patch in the series addresses. If the log tail moves after the data
device cache flush that was issued before we started writing the
checkpoint to the iclogs, then we detect that when releasing the
commit iclog and set the XLOG_ICL_NEED_FLUSH flag on it. That will
then trigger this code to issue a data device cache flush....

IOWs, for external logs, the XLOG_ICL_NEED_FLUSH flag indicates that
both the data device and the log device need a cache flush, rather
than just the log device. I think it could be split into two flags,
but then my head explodes thinking about log forces and trying to
determine what type of flush is implied (and what flags we'd need to
set) when we return log_flushed = true....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22 19:30   ` Darrick J. Wong
@ 2021-07-22 22:12     ` Dave Chinner
  2021-07-22 23:13       ` Darrick J. Wong
  0 siblings, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2021-07-22 22:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Jul 22, 2021 at 12:30:18PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > After fixing the tail_lsn vs cache flush race, generic/482 continued
> > to fail in a similar way where cache flushes were missing before
> > iclog FUA writes. Tracing of iclog state changes during the fsstress
> 
> Heh. ;)
....
> > +		 * xlog_cil_force_seq() call, but there are other writers still
> > +		 * accessing it so it hasn't been pushed to disk yet. Like the
> > +		 * ACTIVE case above, we need to make sure caches are flushed
> > +		 * when this iclog is written.
> > +		 */
> > +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> > +		if (log_flushed)
> > +			*log_flushed = 1;
> > +		break;
> > +	default:
> > +		/*
> > +		 * The entire checkpoint was written by the CIL force and is on
> > +		 * it's way to disk already. It will be stable when it
> 
> s/it's/its/
> 
> So now that we're at the end of this series, what are the rules for when
> when issue cache flushes and FUA writes?
> 
> - Writing the unmount record always flushes the data and log devices.
>   Does it need to flush the rt device too?  I guess xfs_free_buftarg
>   does that.

Correct. RT device behaviour is unchanged as it only contains data
and data is already guaranteed to be on stable storage before we
write the unmount record.

> - Start an async flush of the data device when doing CIL push work so
>   that anything the AIL wrote to disk (and pushed the tail) is persisted
>   before we assign a tail to the log record that we write at the end?
> 
> - If any other AIL work completes (and pushes the tail ahead) by the
>   time we actually write the log record, flush the data device a second
>   time?

Yes.

> - If a log checkpoint spans multiple iclogs, flush the *log* device
>   before writing the iclog with the commit record in it.

Yes. And for internal logs we have the natural optimisation that
these two cases are handled by the same cache flush, and so for
large checkpoints on internal logs we don't see log tail update
races.

> - Any time we write an iclog that commits a checkpoint, write that
>   record with FUA to ensure it's persisted.

*nod*

> - If we're forcing the log to disk as part of an integrity operation
>   (fsync, syncfs, etc.) then issue cache flushes for ... each? iclog
>   written to disk?  And use FUA for that write too?

This is where it gets messy, because log forces are not based around
checkpoint completions. Hence we have no idea what is actually in
the iclog we are flushing and so must treat them all as if they
contain a commit record, close off a multi-iclog checkpoint, and
might have raced with a log tail update. We don't know - and can't
know from the iclog state - which conditions exist and so we have to
assume that at least one of the above states exists for any ACTIVE or
WANT_SYNC iclog we end up flushing or waiting on.

If the iclog is already on its way to disk, and it contains a
commit record, then the cache flush requirements for
metadata/journal ordering have already been met and we don't need to
do anything other than wait. But if we have to flush the iclog or
wait for a flush by a third party, we need to ensure that cache
flushes occur so that the log force semantics are upheld.

If the iclog doesn't contain a commit record (i.e. a log force in
the middle of a new, racing checkpoint write) we don't actually care
if the iclog contains flushes or not, because a crash immediately
after the log force won't actually recover the checkpoint contained
in that iclog. From the log force perspective, the iclog contains
future changes, so we don't care about whether it can be recovered.
But we don't know this, so we have to issue cache flushes on every
iclog we flush from the log force code.

This is why I mentioned that the log force code needs to be turned
inside out to guarantee CIL checkpoints are flushed and stable
rather than iclogs. We care about whole checkpoints being
recoverable, not whether some random iclog in the middle of a
checkpoint write is stable....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] xfs: external logs need to flush data device
  2021-07-22 21:45     ` Dave Chinner
@ 2021-07-22 23:10       ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 23:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 23, 2021 at 07:45:39AM +1000, Dave Chinner wrote:
> On Thu, Jul 22, 2021 at 11:14:45AM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 22, 2021 at 11:53:32AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The recent journal flush/FUA changes replaced the flushing of the
> > > data device on every iclog write with an up-front async data device
> > > cache flush. Unfortunately, the assumption on which this was based
> > > has been proven incorrect by the flush vs log tail update
> > > ordering issue. As the fix for that issue uses the
> > > XLOG_ICL_NEED_FLUSH flag to indicate that the data device needs a
> > > cache flush, we now need to (once again) ensure that an iclog write
> > > to an external log that needs a cache flush actually issues a cache
> > > flush to the data device as well as the log device.
> > > 
> > > Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 19 +++++++++++--------
> > >  1 file changed, 11 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 96434cc4df6e..a3c4d48195d9 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -827,13 +827,6 @@ xlog_write_unmount_record(
> > >  	/* account for space used by record data */
> > >  	ticket->t_curr_res -= sizeof(ulf);
> > >  
> > > -	/*
> > > -	 * For external log devices, we need to flush the data device cache
> > > -	 * first to ensure all metadata writeback is on stable storage before we
> > > -	 * stamp the tail LSN into the unmount record.
> > > -	 */
> > > -	if (log->l_targ != log->l_mp->m_ddev_targp)
> > > -		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
> > >  	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
> > >  }
> > >  
> > > @@ -1796,10 +1789,20 @@ xlog_write_iclog(
> > >  	 * metadata writeback and causing priority inversions.
> > >  	 */
> > >  	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
> > > -	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
> > > +	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) {
> > >  		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
> > > +		/*
> > > +		 * For external log devices, we also need to flush the data
> > > +		 * device cache first to ensure all metadata writeback covered
> > > +		 * by the LSN in this iclog is on stable storage. This is slow,
> > > +		 * but it *must* complete before we issue the external log IO.
> > 
> > I'm a little confused about what's going on here.  We're about to write
> > a log record to disk, with h_tail_lsn reflecting the tail of the log and
> > h_lsn reflecting the current head of the log (i.e. this record).
> > 
> > If the log tail has moved forward since the last log record was written
> > and this fs has an external log, we need to flush the data device
> > because the AIL could have written logged items back into the filesystem
> > and we need to ensure those items have been persisted before we write to
> > the log the fact that the tail moved forward.  The AIL itself doesn't
> > issue cache flushes (nor does it need to), so that's why we do that
> > here.
> > 
> > Why don't we need a flush like this if only FUA is set?  Is it not
> > possible to write a checkpoint that fits within a single iclog after the
> > log tail has moved forward?
> 
> Yes, it is, and that race condition is exactly what the next
> patch in the series addresses. If the log tail moves after the data
> device cache flush that was issued before we started writing the
> checkpoint to the iclogs, then we detect that when releasing the
> commit iclog and set the XLOG_ICL_NEED_FLUSH flag on it. That will
> then trigger this code to issue a data device cache flush....

Aha, yeah, I noticed that after scanning the next few patches.

> IOWs, for external logs, the XLOG_ICL_NEED_FLUSH flag indicates that
> both the data device and the log device need a cache flush, rather
> than just the log device. I think it could be split into two flags,
> but then my head explodes thinking about log forces and trying to
> determine what type of flush is implied (and what flags we'd need to
> set) when we return log_flushed = true....

Maybe later when we're not focussed on recovery failures.

In the meantime, I'm satisfied enough to
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 4/5] xfs: log forces imply data device cache flushes
  2021-07-22 22:12     ` Dave Chinner
@ 2021-07-22 23:13       ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2021-07-22 23:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Jul 23, 2021 at 08:12:58AM +1000, Dave Chinner wrote:
> On Thu, Jul 22, 2021 at 12:30:18PM -0700, Darrick J. Wong wrote:
> > On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > After fixing the tail_lsn vs cache flush race, generic/482 continued
> > > to fail in a similar way where cache flushes were missing before
> > > iclog FUA writes. Tracing of iclog state changes during the fsstress
> > 
> > Heh. ;)
> ....
> > > +		 * xlog_cil_force_seq() call, but there are other writers still
> > > +		 * accessing it so it hasn't been pushed to disk yet. Like the
> > > +		 * ACTIVE case above, we need to make sure caches are flushed
> > > +		 * when this iclog is written.
> > > +		 */
> > > +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> > > +		if (log_flushed)
> > > +			*log_flushed = 1;
> > > +		break;
> > > +	default:
> > > +		/*
> > > +		 * The entire checkpoint was written by the CIL force and is on
> > > +		 * it's way to disk already. It will be stable when it
> > 
> > s/it's/its/
> > 
> > So now that we're at the end of this series, what are the rules for
> > when we issue cache flushes and FUA writes?
> > 
> > - Writing the unmount record always flushes the data and log devices.
> >   Does it need to flush the rt device too?  I guess xfs_free_buftarg
> >   does that.
> 
> Correct. RT device behaviour is unchanged as it only contains data
> and data is already guaranteed to be on stable storage before we
> write the unmount record.
> 
> > - Start an async flush of the data device when doing CIL push work so
> >   that anything the AIL wrote to disk (and pushed the tail) is persisted
> >   before we assign a tail to the log record that we write at the end?
> > 
> > - If any other AIL work completes (and pushes the tail ahead) by the
> >   time we actually write the log record, flush the data device a second
> >   time?
> 
> Yes.
> 
> > - If a log checkpoint spans multiple iclogs, flush the *log* device
> >   before writing the iclog with the commit record in it.
> 
> Yes. And for internal logs we have the natural optimisation that
> these two cases are handled by the same cache flush, so for large
> checkpoints on internal logs we don't see log tail update races.
> 
> > - Any time we write an iclog that commits a checkpoint, write that
> >   record with FUA to ensure it's persisted.
> 
> *nod*
> 
> > - If we're forcing the log to disk as part of an integrity operation
> >   (fsync, syncfs, etc.) then issue cache flushes for ... each? iclog
> >   written to disk?  And use FUA for that write too?
> 
> This is where it gets messy, because log forces are not based around
> checkpoint completions. Hence we have no idea what is actually in
> the iclog we are flushing and so must treat them all as if they
> contain a commit record, close off a multi-iclog checkpoint, and
> might have raced with a log tail update. We don't know - and can't
> know from the iclog state - which conditions exist and so we have to
> assume that at least one of the above states exists for any ACTIVE or
> WANT_SYNC iclog we end up flushing or waiting on.
> 
> If the iclog is already on its way to disk, and it contains a
> commit record, then the cache flush requirements for
> metadata/journal ordering have already been met and we don't need to
> do anything other than wait. But if we have to flush the iclog or
> wait for a flush by a third party, we need to ensure that cache
> flushes occur so that the log force semantics are upheld.
> 
> If the iclog doesn't contain a commit record (i.e. a log force in
> the middle of a new, racing checkpoint write) we don't actually care
> if the iclog contains flushes or not, because a crash immediately
> after the log force won't actually recover the checkpoint contained
> in that iclog. From the log force perspective, the iclog contains
> future changes, so we don't care about whether it can be recovered.
> But we don't know this, so we have to issue cache flushes on every
> iclog we flush from the log force code.
> 
> This is why I mentioned that the log force code needs to be turned
> inside out to guarantee CIL checkpoints are flushed and stable
> rather than iclogs. We care about whole checkpoints being
> recoverable, not whether some random iclog in the middle of a
> checkpoint write is stable....

<nod> Ok, that's kinda what I thought -- the first few "where do we
flush?" cases were pretty straightforward, and the last one is murky.

With all the random little changes applied,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


end of thread, other threads:[~2021-07-22 23:13 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22  1:53 [PATCH 0/5] xfs: fix log cache flush regressions Dave Chinner
2021-07-22  1:53 ` [PATCH 1/5] xfs: flush data dev on external log write Dave Chinner
2021-07-22  6:41   ` Christoph Hellwig
2021-07-22 15:52   ` Darrick J. Wong
2021-07-22  1:53 ` [PATCH 2/5] xfs: external logs need to flush data device Dave Chinner
2021-07-22  6:48   ` Christoph Hellwig
2021-07-22 18:14   ` Darrick J. Wong
2021-07-22 21:45     ` Dave Chinner
2021-07-22 23:10       ` Darrick J. Wong
2021-07-22  1:53 ` [PATCH 3/5] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-22  7:06   ` Christoph Hellwig
2021-07-22  7:28     ` Dave Chinner
2021-07-22 19:12     ` Darrick J. Wong
2021-07-22  1:53 ` [PATCH 4/5] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-22  7:14   ` Christoph Hellwig
2021-07-22  7:32     ` Dave Chinner
2021-07-22 19:30   ` Darrick J. Wong
2021-07-22 22:12     ` Dave Chinner
2021-07-22 23:13       ` Darrick J. Wong
2021-07-22  1:53 ` [PATCH 5/5] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-22  7:15   ` Christoph Hellwig
2021-07-22 19:13   ` Darrick J. Wong
