linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs
@ 2021-07-27  7:10 Dave Chinner
  2021-07-27  7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

tl;dr: I may have found all the cache flush problems that I'm
reproducing with generic/482. The last two failures I've RCA'd have
been repair complaining about invalid attribute because attr fork
setup is non-atomic (xfs_attr_shortform_to_leaf() on an empty attr
fork). I'm still running tests to see if another cache flush related
failure occurs, but I'm more confident than I was yesterday that the
light I see isn't another train...

The first seven patches in this series fix regressions introduced by
commit eef983ffeae7 ("xfs: journal IO cache flush reductions") and
exposed by generic/482. I have one test machine that reproduces
the failures, and in general a failure occurs 1 in every 30 test
iterations.

The first two patches are related to external logs. These were found
by code inspection while looking for the cause of the generic/482
failures.

The third and fourth patch addresses a race condition between the
new unconditional async data device cache flush run by the CIL push
and the log tail changing between that flush completing and the
commit iclog being synced. In this situation, if the commit_iclog
doesn't issue a REQ_PREFLUSH, we fail to guarantee that the metadata
IO completions that moved the log tail are on stable storage. The
fix for this issue is to sample the log tail before we issue the
async cache flush, and if the log tail has changed when we release
the commit_iclog we set the XLOG_ICL_NEED_FLUSH flag on the iclog to
guarantee a cache flush is issued before the commit record that
moves the tail of the log forward is written to the log.

The fifth and sixth patches address an oversight about log forces.
Log forces imply a cache flush if they result in iclog IO being
issued, allowing the caller to elide cache flushes that they might
require for data integrity. The change to requiring the iclog owner
to signal that a cache flush is required completely missed the log
force code. This patch ensures that all ACTIVE and WANT_SYNC iclogs
that we either flush or wait on issue cache flushes and so guarantee
that they are on stable storage at IO completion before they signal
completion to log force waiters.

The seventh patch is not a regression, but a fix for another log
force iclog related flush problem I noticed and found while triaging
the above problems. Occasionally the tests would sit idle doing
nothing for up to 30s waiting on a xfs_log_force_seq() call to
complete. The trigger for them to wake up was the background log
worker issuing a new log force. This appears to be another
"incorrectly wait on future iclog state changes" situation, and it's
notable that xfs_log_force() specifically handles this situation
while xfs_log_force_seq() does not. Changing xfs_log_force_seq() to
use the same "iclog IO already completed" detection as
xfs_log_force() made those random delays go away.

The eighth patch is a fix for inode logging being able to make the
on-disk LSN in the inode go backwards in time. This is a regression
from 2016 when we shrunk the inode by not tracking the on-disk inode
LSN in memory any more. This can cause inodes to be replayed
incorrectly, leading to silent corruption during log recovery.

The ninth patch is a fix for a zero-day bug in the on-disk LSN log
recovery ordering for attribute leaf blocks. We fail to recognise
them and extract their LSN, so they are always recovered
immediately. This can result in recovery taking them backwards in
time, leading to silent corruption during log recovery. This has
been around since recovery ordering was introduced for v5
filesystems way back in 2013....

The tenth patch is an addition to the iclog state tracing that dumps
the iclog flags which tells me whether the NEED_FLUSH/NEED_FUA flags
are set on the iclog as it is released for IO. This helps validate
whether the iclog flushing is doing the right thing when a g/482
failure occurs with a suspected cache flushing issue.

The last patch is a fix for the new race condition found. It's
tricky and complex and requires a commit record in the iclog left in
ACTIVE state with NEED_FUA set on it, then to have a log tail update
occur after the next CIL push flush is done but before the iclog is
flushed when the checkpoint overruns the remaining iclog space.

Please consider this patchset for a 5.14-rc4 merge, because we need
to get the obvious failures and regressions fixed sooner rather
later.

Version 3:
- fixed typos in commits
- Updated tail lsn vs flush comments on xlog_state_release_iclog()
- Added new patch to fix another tail lsn vs iclog write race

Version 2:
- rebased on 5.14-rc3
- https://lore.kernel.org/linux-xfs/20210726060716.3295008-1-david@fromorbit.com/T/#t
- split __xlog_state_release_iclog() folding into separate patch (Christoph)
- Reinstated lost trace_xlog_iclog_syncing tracepoint.
- split xlog_force_iclog() helper into separate patch.
- In xlog_force_lsn(), don't set log_flushed for the WANT_SYNC case as there is
  no guarantee that the log has been flushed when we return if we are not
  waiting.
- appended inode LSN logging fix to series from
  https://lore.kernel.org/linux-xfs/20210722015335.3063274-1-david@fromorbit.com/T/#t
- Fixed inode recovery failure when two checkpoints have the same recovery LSN
  (start records in the same iclog) and an inode is logged in both checkpoints.
  The second transaction would see an equal LSN to the inode on disk and skip
  replay, even though it was required.
- added more comments to indicate the LSN in the xfs_log_dinode is never valid
  for recovery or replay purposes
- added new patch for ATTR3_LEAF magic number detection and recovery (yet
  another zero day!).

Version 1:
- based on 5.14-rc1
- https://lore.kernel.org/linux-xfs/20210722015335.3063274-1-david@fromorbit.com/T/#t


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 01/11] xfs: flush data dev on external log write
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We incorrectly flush the log device instead of the data device when
trying to ensure metadata is correctly on disk before writing the
unmount record.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 36fa2650b081..96434cc4df6e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -833,7 +833,7 @@ xlog_write_unmount_record(
 	 * stamp the tail LSN into the unmount record.
 	 */
 	if (log->l_targ != log->l_mp->m_ddev_targp)
-		blkdev_issue_flush(log->l_targ->bt_bdev);
+		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 02/11] xfs: external logs need to flush data device
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
  2021-07-27  7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The recent journal flush/FUA changes replaced the flushing of the
data device on every iclog write with an up-front async data device
cache flush. Unfortunately, the assumption of which this was based
on has been proven incorrect by the flush vs log tail update
ordering issue. As the fix for that issue uses the
XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
flush, we now need to (once again) ensure that an iclog write to
external logs that need a cache flush to be issued actually issue a
cache flush to the data device as well as the log device.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 96434cc4df6e..a3c4d48195d9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -827,13 +827,6 @@ xlog_write_unmount_record(
 	/* account for space used by record data */
 	ticket->t_curr_res -= sizeof(ulf);
 
-	/*
-	 * For external log devices, we need to flush the data device cache
-	 * first to ensure all metadata writeback is on stable storage before we
-	 * stamp the tail LSN into the unmount record.
-	 */
-	if (log->l_targ != log->l_mp->m_ddev_targp)
-		blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
 	return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS);
 }
 
@@ -1796,10 +1789,20 @@ xlog_write_iclog(
 	 * metadata writeback and causing priority inversions.
 	 */
 	iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
-	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
+	if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) {
 		iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
+		/*
+		 * For external log devices, we also need to flush the data
+		 * device cache first to ensure all metadata writeback covered
+		 * by the LSN in this iclog is on stable storage. This is slow,
+		 * but it *must* complete before we issue the external log IO.
+		 */
+		if (log->l_targ != log->l_mp->m_ddev_targp)
+			blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev);
+	}
 	if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
 		iclog->ic_bio.bi_opf |= REQ_FUA;
+
 	iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 
 	if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
  2021-07-27  7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
  2021-07-27  7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Fold __xlog_state_release_iclog into its only caller to prepare
make an upcoming fix easier.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 45 +++++++++++++++++----------------------------
 1 file changed, 17 insertions(+), 28 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a3c4d48195d9..82f5996d3889 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -487,29 +487,6 @@ xfs_log_reserve(
 	return error;
 }
 
-static bool
-__xlog_state_release_iclog(
-	struct xlog		*log,
-	struct xlog_in_core	*iclog)
-{
-	lockdep_assert_held(&log->l_icloglock);
-
-	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
-		/* update tail before writing to iclog */
-		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
-
-		iclog->ic_state = XLOG_STATE_SYNCING;
-		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-		xlog_verify_tail_lsn(log, iclog, tail_lsn);
-		/* cycle incremented when incrementing curr_block */
-		trace_xlog_iclog_syncing(iclog, _RET_IP_);
-		return true;
-	}
-
-	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
-	return false;
-}
-
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
  * it is in the WANT_SYNC state.
@@ -519,19 +496,31 @@ xlog_state_release_iclog(
 	struct xlog		*log,
 	struct xlog_in_core	*iclog)
 {
+	xfs_lsn_t		tail_lsn;
 	lockdep_assert_held(&log->l_icloglock);
 
 	trace_xlog_iclog_release(iclog, _RET_IP_);
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
-	if (atomic_dec_and_test(&iclog->ic_refcnt) &&
-	    __xlog_state_release_iclog(log, iclog)) {
-		spin_unlock(&log->l_icloglock);
-		xlog_sync(log, iclog);
-		spin_lock(&log->l_icloglock);
+	if (!atomic_dec_and_test(&iclog->ic_refcnt))
+		return 0;
+
+	if (iclog->ic_state != XLOG_STATE_WANT_SYNC) {
+		ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
+		return 0;
 	}
 
+	/* update tail before writing to iclog */
+	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+	iclog->ic_state = XLOG_STATE_SYNCING;
+	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	trace_xlog_iclog_syncing(iclog, _RET_IP_);
+
+	spin_unlock(&log->l_icloglock);
+	xlog_sync(log, iclog);
+	spin_lock(&log->l_icloglock);
 	return 0;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (2 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27 18:50   ` Darrick J. Wong
  2021-07-27  7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

There is a race between the new CIL async data device metadata IO
completion cache flush and the log tail in the iclog the flush
covers being updated. This can be seen by repeating generic/482 in a
loop and eventually log recovery fails with a failures such as this:

XFS (dm-3): Starting recovery (logdev: internal)
XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117

Analysis of the logwrite replay shows that there were no writes to
the data device between the FUA @ write 124 and the FUA at write @
125, but log recovery @ 125 failed. The difference was the one log
write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
and so the inode create intent in (1,8) was not replayed and so the
inode cluster was zero on disk when replay of the first inode item
in (1,32) was attempted.

What this meant was that the journal write that occurred at @ 125
did not ensure that metadata completed before the iclog was written
was correctly on stable storage. The tail of the log moved forward,
so IO must have been completed between the two iclog writes. This
means that there is a race condition between the unconditional async
cache flush in the CIL push work and the tail LSN that is written to
the iclog. This happens like so:

CIL push work				AIL push work
-------------				-------------
Add to committing list
start async data dev cache flush
.....
<flush completes>
<all writes to old tail lsn are stable>
xlog_write
  ....					push inode create buffer
					<start IO>
					.....
xlog_write(commit record)
  ....					<IO completes>
  					log tail moves
  					  xlog_assign_tail_lsn()
start_lsn == commit_lsn
  <no iclog preflush!>
xlog_state_release_iclog
  __xlog_state_release_iclog()
    <writes *new* tail_lsn into iclog>
  xlog_sync()
    ....
    submit_bio()
<tail in log moves forward without flushing written metadata>

Essentially, this can only occur if the commit iclog is issued
without a cache flush. If the iclog bio is submitted with
REQ_PREFLUSH, then it will guarantee that all the completed IO is
one stable storage before the iclog bio with the new tail LSN in it
is written to the log.

IOWs, the tail lsn that is written to the iclog needs to be sampled
*before* we issue the cache flush that guarantees all IO up to that
LSN has been completed.

To fix this without giving up the performance advantage of the
flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
compared to 5.13), we need to ensure that we always issue a cache
flush if the tail LSN changes between the initial async flush and
the commit record being written. THis requires sampling the tail_lsn
before we start the flush, and then passing the sampled tail LSN to
xlog_state_release_iclog() so it can determine if the the tail LSN
has changed while writing the checkpoint. If the tail LSN has
changed, then it needs to set the NEED_FLUSH flag on the iclog and
we'll issue another cache flush before writing the iclog.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_log.c      | 36 ++++++++++++++++++++++++++----------
 fs/xfs/xfs_log_cil.c  | 13 +++++++++++--
 fs/xfs/xfs_log_priv.h |  3 ++-
 3 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 82f5996d3889..e8c6c96d4f7c 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -489,12 +489,17 @@ xfs_log_reserve(
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.
+ * it is in the WANT_SYNC state.  If the caller passes in a non-zero
+ * @old_tail_lsn and the current log tail does not match, there may be metadata
+ * on disk that must be persisted before this iclog is written.  To satisfy that
+ * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
+ * iclog with the new log tail value.
  */
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog)
+	struct xlog_in_core	*iclog,
+	xfs_lsn_t		old_tail_lsn)
 {
 	xfs_lsn_t		tail_lsn;
 	lockdep_assert_held(&log->l_icloglock);
@@ -503,6 +508,19 @@ xlog_state_release_iclog(
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
+	/*
+	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
+	 * of the tail LSN into the iclog so we guarantee that the log tail does
+	 * not move between deciding if a cache flush is required and writing
+	 * the LSN into the iclog below.
+	 */
+	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
+		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
+
+		if (old_tail_lsn && tail_lsn != old_tail_lsn)
+			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+	}
+
 	if (!atomic_dec_and_test(&iclog->ic_refcnt))
 		return 0;
 
@@ -511,8 +529,6 @@ xlog_state_release_iclog(
 		return 0;
 	}
 
-	/* update tail before writing to iclog */
-	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
 	iclog->ic_state = XLOG_STATE_SYNCING;
 	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	xlog_verify_tail_lsn(log, iclog, tail_lsn);
@@ -858,7 +874,7 @@ xlog_unmount_write(
 	 * iclog containing the unmount record is written.
 	 */
 	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -2302,7 +2318,7 @@ xlog_write_copy_finish(
 	return 0;
 
 release_iclog:
-	error = xlog_state_release_iclog(log, iclog);
+	error = xlog_state_release_iclog(log, iclog, 0);
 	spin_unlock(&log->l_icloglock);
 	return error;
 }
@@ -2521,7 +2537,7 @@ xlog_write(
 		ASSERT(optype & XLOG_COMMIT_TRANS);
 		*commit_iclog = iclog;
 	} else {
-		error = xlog_state_release_iclog(log, iclog);
+		error = xlog_state_release_iclog(log, iclog, 0);
 	}
 	spin_unlock(&log->l_icloglock);
 
@@ -2959,7 +2975,7 @@ xlog_state_get_iclog_space(
 		 * reference to the iclog.
 		 */
 		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
-			error = xlog_state_release_iclog(log, iclog);
+			error = xlog_state_release_iclog(log, iclog, 0);
 		spin_unlock(&log->l_icloglock);
 		if (error)
 			return error;
@@ -3195,7 +3211,7 @@ xfs_log_force(
 			atomic_inc(&iclog->ic_refcnt);
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog))
+			if (xlog_state_release_iclog(log, iclog, 0))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3275,7 +3291,7 @@ xlog_force_lsn(
 		}
 		atomic_inc(&iclog->ic_refcnt);
 		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog))
+		if (xlog_state_release_iclog(log, iclog, 0))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b128aaa9b870..4c44bc3786c0 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -654,8 +654,9 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
+	xfs_lsn_t		preflush_tail_lsn;
 	xfs_lsn_t		commit_lsn;
-	xfs_lsn_t		push_seq;
+	xfs_csn_t		push_seq;
 	struct bio		bio;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 
@@ -730,7 +731,15 @@ xlog_cil_push_work(
 	 * because we hold the flush lock exclusively. Hence we can now issue
 	 * a cache flush to ensure all the completed metadata in the journal we
 	 * are about to overwrite is on stable storage.
+	 *
+	 * Because we are issuing this cache flush before we've written the
+	 * tail lsn to the iclog, we can have metadata IO completions move the
+	 * tail forwards between the completion of this flush and the iclog
+	 * being written. In this case, we need to re-issue the cache flush
+	 * before the iclog write. To detect whether the log tail moves, sample
+	 * the tail LSN *before* we issue the flush.
 	 */
+	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
 	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
 				&bdev_flush);
 
@@ -941,7 +950,7 @@ xlog_cil_push_work(
 	 * storage.
 	 */
 	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
-	xlog_state_release_iclog(log, commit_iclog);
+	xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
 	spin_unlock(&log->l_icloglock);
 	return;
 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4c41bbfa33b0..7cbde0b4f990 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -497,7 +497,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
 void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
 void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
 
-int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
+int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
+		xfs_lsn_t log_tail_lsn);
 
 /*
  * When we crack an atomic LSN, we sample it first so that the value will not
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 05/11] xfs: factor out forced iclog flushes
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (3 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We force iclogs in several places - we need them all to have the
same cache flush semantics, so start by factoring out the iclog
force into a common helper.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e8c6c96d4f7c..184c68ea62bb 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -778,6 +778,20 @@ xfs_log_mount_cancel(
 	xfs_log_unmount(mp);
 }
 
+/*
+ * Flush out the iclog to disk ensuring that device caches are flushed and
+ * the iclog hits stable storage before any completion waiters are woken.
+ */
+static inline int
+xlog_force_iclog(
+	struct xlog_in_core	*iclog)
+{
+	atomic_inc(&iclog->ic_refcnt);
+	if (iclog->ic_state == XLOG_STATE_ACTIVE)
+		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
+	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
+}
+
 /*
  * Wait for the iclog and all prior iclogs to be written disk as required by the
  * log force state machine. Waiting on ic_force_wait ensures iclog completions
@@ -863,18 +877,8 @@ xlog_unmount_write(
 
 	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
-	atomic_inc(&iclog->ic_refcnt);
-	if (iclog->ic_state == XLOG_STATE_ACTIVE)
-		xlog_state_switch_iclogs(log, iclog, 0);
-	else
-		ASSERT(iclog->ic_state == XLOG_STATE_WANT_SYNC ||
-		       iclog->ic_state == XLOG_STATE_IOERROR);
-	/*
-	 * Ensure the journal is fully flushed and on stable storage once the
-	 * iclog containing the unmount record is written.
-	 */
 	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
-	error = xlog_state_release_iclog(log, iclog, 0);
+	error = xlog_force_iclog(iclog);
 	xlog_wait_on_iclog(iclog);
 
 	if (tic) {
@@ -3201,17 +3205,9 @@ xfs_log_force(
 		iclog = iclog->ic_prev;
 	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
 		if (atomic_read(&iclog->ic_refcnt) == 0) {
-			/*
-			 * We are the only one with access to this iclog.
-			 *
-			 * Flush it out now.  There should be a roundoff of zero
-			 * to show that someone has already taken care of the
-			 * roundoff from the previous sync.
-			 */
-			atomic_inc(&iclog->ic_refcnt);
+			/* We have exclusive access to this iclog. */
 			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
-			xlog_state_switch_iclogs(log, iclog, 0);
-			if (xlog_state_release_iclog(log, iclog, 0))
+			if (xlog_force_iclog(iclog))
 				goto out_error;
 
 			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
@@ -3289,9 +3285,7 @@ xlog_force_lsn(
 					&log->l_icloglock);
 			return -EAGAIN;
 		}
-		atomic_inc(&iclog->ic_refcnt);
-		xlog_state_switch_iclogs(log, iclog, 0);
-		if (xlog_state_release_iclog(log, iclog, 0))
+		if (xlog_force_iclog(iclog))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 06/11] xfs: log forces imply data device cache flushes
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (4 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

After fixing the tail_lsn vs cache flush race, generic/482 continued
to fail in a similar way where cache flushes were missing before
iclog FUA writes. Tracing of iclog state changes during the fsstress
workload portion of the test (via xlog_iclog* events) indicated that
iclog writes were coming from two sources - CIL pushes and log
forces (due to fsync/O_SYNC operations). All of the cases where a
recovery problem was triggered indicated that the log force was the
source of the iclog write that was not preceeded by a cache flush.

This was an oversight in the modifications made in commit
eef983ffeae7 ("xfs: journal IO cache flush reductions"). Log forces
for fsync imply a data device cache flush has been issued if an
iclog was flushed to disk and is indicated to the caller via the
log_flushed parameter so they can elide the device cache flush if
the journal issued one.

The change in eef983ffeae7 results in iclogs only issuing a cache
flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
added to the iclogs that the log force code flushes to disk. Hence
log forces are no longer guaranteeing that a cache flush is issued,
hence opening up a potential on-disk ordering failure.

Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
the actual iclogs it forces to the journal are also on stable
storage before it returns to the caller.

This patch introduces the xlog_force_iclog() helper function to
encapsulate the process of taking a reference to an iclog, switching
its state if WANT_SYNC and flushing it to stable storage correctly.

Both xfs_log_force() and xfs_log_force_lsn() are converted to use
it, as is xlog_unmount_write() which has an elaborate method of
doing exactly the same "write this iclog to stable storage"
operation.

Further, if the log force code needs to wait on a iclog in the
WANT_SYNC state, it needs to ensure that iclog also results in a
cache flush being issued. This covers the case where the iclog
contains the commit record of the CIL flush that the log force
triggered, but it hasn't been written yet because there is still an
active reference to the iclog.

Note: this whole cache flush whack-a-mole patch is a result of log
forces still being iclog state centric rather than being CIL
sequence centric. Most of this nasty code will go away in future
when log forces are converted to wait on CIL sequence push
completion rather than iclog completion. With the CIL push algorithm
guaranteeing that the CIL checkpoint is fully on stable storage when
it completes, we no longer need to iterate iclogs and push them to
ensure a CIL sequence push has completed and so all this nasty iclog
iteration and flushing code will go away.

Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 184c68ea62bb..160b8bb7ee60 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -787,6 +787,7 @@ xlog_force_iclog(
 	struct xlog_in_core	*iclog)
 {
 	atomic_inc(&iclog->ic_refcnt);
+	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
 	if (iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(iclog->ic_log, iclog, 0);
 	return xlog_state_release_iclog(iclog->ic_log, iclog, 0);
@@ -877,7 +878,6 @@ xlog_unmount_write(
 
 	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
-	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
 	error = xlog_force_iclog(iclog);
 	xlog_wait_on_iclog(iclog);
 
@@ -3214,22 +3214,23 @@ xfs_log_force(
 				goto out_unlock;
 		} else {
 			/*
-			 * Someone else is writing to this iclog.
-			 *
-			 * Use its call to flush out the data.  However, the
-			 * other thread may not force out this LR, so we mark
-			 * it WANT_SYNC.
+			 * Someone else is still writing to this iclog, so we
+			 * need to ensure that when they release the iclog it
+			 * gets synced immediately as we may be waiting on it.
 			 */
 			xlog_state_switch_iclogs(log, iclog, 0);
 		}
-	} else {
-		/*
-		 * If the head iclog is not active nor dirty, we just attach
-		 * ourselves to the head and go to sleep if necessary.
-		 */
-		;
 	}
 
+	/*
+	 * The iclog we are about to wait on may contain the checkpoint pushed
+	 * by the above xlog_cil_force() call, but it may not have been pushed
+	 * to disk yet. Like the ACTIVE case above, we need to make sure caches
+	 * are flushed when this iclog is written.
+	 */
+	if (iclog->ic_state == XLOG_STATE_WANT_SYNC)
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
+
 	if (flags & XFS_LOG_SYNC)
 		return xlog_wait_on_iclog(iclog);
 out_unlock:
@@ -3262,7 +3263,8 @@ xlog_force_lsn(
 			goto out_unlock;
 	}
 
-	if (iclog->ic_state == XLOG_STATE_ACTIVE) {
+	switch (iclog->ic_state) {
+	case XLOG_STATE_ACTIVE:
 		/*
 		 * We sleep here if we haven't already slept (e.g. this is the
 		 * first time we've looked at the correct iclog buf) and the
@@ -3289,6 +3291,25 @@ xlog_force_lsn(
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
+		break;
+	case XLOG_STATE_WANT_SYNC:
+		/*
+		 * This iclog may contain the checkpoint pushed by the
+		 * xlog_cil_force_seq() call, but there are other writers still
+		 * accessing it so it hasn't been pushed to disk yet. Like the
+		 * ACTIVE case above, we need to make sure caches are flushed
+		 * when this iclog is written.
+		 */
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
+		break;
+	default:
+		/*
+		 * The entire checkpoint was written by the CIL force and is on
+		 * its way to disk already. It will be stable when it
+		 * completes, so we don't need to manipulate caches here at all.
+		 * We just need to wait for completion if necessary.
+		 */
+		break;
 	}
 
 	if (flags & XFS_LOG_SYNC)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn()
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (5 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
see if the iclog has already been completed and the contents on
stable storage. We check for completed iclogs in xfs_log_force(), so
we should do the same thing for xfs_log_force_lsn().

This fixed some random up-to-30s pauses seen in unmounting
filesystems in some tests. A log force ends up waiting on completed
iclog, and that doesn't then get flushed (and hence the log force
get completed) until the background log worker issues a log force
that flushes the iclog in question. Then the unmount unblocks and
continues.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 160b8bb7ee60..1c328efdca66 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3143,6 +3143,35 @@ xlog_state_switch_iclogs(
 	log->l_iclog = iclog->ic_next;
 }
 
+/*
+ * Force the iclog to disk and check if the iclog has been completed before
+ * xlog_force_iclog() returns. This can happen on synchronous (e.g.
+ * pmem) or fast async storage because we drop the icloglock to issue the IO.
+ * If completion has already occurred, tell the caller so that it can avoid an
+ * unnecessary wait on the iclog.
+ */
+static int
+xlog_force_and_check_iclog(
+	struct xlog_in_core	*iclog,
+	bool			*completed)
+{
+	xfs_lsn_t		lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+	int			error;
+
+	*completed = false;
+	error = xlog_force_iclog(iclog);
+	if (error)
+		return error;
+
+	/*
+	 * If the iclog has already been completed and reused the header LSN
+	 * will have been rewritten by completion
+	 */
+	if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
+		*completed = true;
+	return 0;
+}
+
 /*
  * Write out all data in the in-core log as of this exact moment in time.
  *
@@ -3177,7 +3206,6 @@ xfs_log_force(
 {
 	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
-	xfs_lsn_t		lsn;
 
 	XFS_STATS_INC(mp, xs_log_force);
 	trace_xfs_log_force(mp, 0, _RET_IP_);
@@ -3206,11 +3234,12 @@ xfs_log_force(
 	} else if (iclog->ic_state == XLOG_STATE_ACTIVE) {
 		if (atomic_read(&iclog->ic_refcnt) == 0) {
 			/* We have exclusive access to this iclog. */
-			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
-			if (xlog_force_iclog(iclog))
+			bool	completed;
+
+			if (xlog_force_and_check_iclog(iclog, &completed))
 				goto out_error;
 
-			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
+			if (completed)
 				goto out_unlock;
 		} else {
 			/*
@@ -3250,6 +3279,7 @@ xlog_force_lsn(
 	bool			already_slept)
 {
 	struct xlog_in_core	*iclog;
+	bool			completed;
 
 	spin_lock(&log->l_icloglock);
 	iclog = log->l_iclog;
@@ -3287,10 +3317,12 @@ xlog_force_lsn(
 					&log->l_icloglock);
 			return -EAGAIN;
 		}
-		if (xlog_force_iclog(iclog))
+		if (xlog_force_and_check_iclog(iclog, &completed))
 			goto out_error;
 		if (log_flushed)
 			*log_flushed = 1;
+		if (completed)
+			goto out_unlock;
 		break;
 	case XLOG_STATE_WANT_SYNC:
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (6 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27 19:00   ` Darrick J. Wong
  2023-06-13  6:20   ` Chandan Babu R
  2021-07-27  7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode it is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not it's age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken by
commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. CLearly these are not the same thing.

Before 93f958f9c41f, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.

Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
 fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index d548ea4b6aab..2c5bcbc19264 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -411,7 +411,16 @@ struct xfs_log_dinode {
 	/* start of the extended dinode, writable fields */
 	uint32_t	di_crc;		/* CRC of the inode */
 	uint64_t	di_changecount;	/* number of attribute changes */
-	xfs_lsn_t	di_lsn;		/* flush sequence */
+
+	/*
+	 * The LSN we write to this field during formatting is not a reflection
+	 * of the current on-disk LSN. It should never be used for recovery
+	 * sequencing, nor should it be recovered into the on-disk inode at all.
+	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
+	 * for details.
+	 */
+	xfs_lsn_t	di_lsn;
+
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
 	uint8_t		di_pad2[12];	/* more padding for future expansion */
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 7b79518b6c20..e0072a6cd2d3 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
 STATIC void
 xfs_log_dinode_to_disk(
 	struct xfs_log_dinode	*from,
-	struct xfs_dinode	*to)
+	struct xfs_dinode	*to,
+	xfs_lsn_t		lsn)
 {
 	to->di_magic = cpu_to_be16(from->di_magic);
 	to->di_mode = cpu_to_be16(from->di_mode);
@@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
 		to->di_ino = cpu_to_be64(from->di_ino);
-		to->di_lsn = cpu_to_be64(from->di_lsn);
+		to->di_lsn = cpu_to_be64(lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
 		uuid_copy(&to->di_uuid, &from->di_uuid);
 		to->di_flushiter = 0;
@@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
 	}
 
 	/*
-	 * If the inode has an LSN in it, recover the inode only if it's less
-	 * than the lsn of the transaction we are replaying. Note: we still
-	 * need to replay an owner change even though the inode is more recent
-	 * than the transaction as there is no guarantee that all the btree
-	 * blocks are more recent than this transaction, too.
+	 * If the inode has an LSN in it, recover the inode only if the on-disk
+	 * inode's LSN is older than the lsn of the transaction we are
+	 * replaying. We can have multiple checkpoints with the same start LSN,
+	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
+	 * mean that the on-disk inode is more recent than the change being
+	 * replayed.
+	 *
+	 * We must check the current_lsn against the on-disk inode
+	 * here because the we can't trust the log dinode to contain a valid LSN
+	 * (see comment below before replaying the log dinode for details).
+	 *
+	 * Note: we still need to replay an owner change even though the inode
+	 * is more recent than the transaction as there is no guarantee that all
+	 * the btree blocks are more recent than this transaction, too.
 	 */
 	if (dip->di_version >= 3) {
 		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
 
-		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
+		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
 			trace_xfs_log_recover_inode_skip(log, in_f);
 			error = 0;
 			goto out_owner_change;
@@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
 		goto out_release;
 	}
 
-	/* recover the log dinode inode into the on disk inode */
-	xfs_log_dinode_to_disk(ldip, dip);
+	/*
+	 * Recover the log dinode inode into the on disk inode.
+	 *
+	 * The LSN in the log dinode is garbage - it can be zero or reflect
+	 * stale in-memory runtime state that isn't coherent with the changes
+	 * logged in this transaction or the changes written to the on-disk
+	 * inode.  Hence we write the current lSN into the inode because that
+	 * matches what xfs_iflush() would write inode the inode when flushing
+	 * the changes in this transaction.
+	 */
+	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
 
 	fields = in_f->ilf_fields;
 	if (fields & XFS_ILOG_DEV)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 09/11] xfs: Enforce attr3 buffer recovery order
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (7 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
  2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

From the department of "WTAF? How did we miss that!?"...

When we are recovering a buffer, the first thing we do is check the
buffer magic number and extract the LSN from the buffer. If the LSN
is older than the current LSN, we replay the modification to it. If
the metadata on disk is newer than the transaction in the log, we
skip it. This is a fundamental v5 filesystem metadata recovery
behaviour.

generic/482 failed with an attribute writeback failure during log
recovery. The write verifier caught the corruption before it got
written to disk, and the attr buffer dump looked like:

XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
XFS (dm-3): Unmount and run xfs_repair
XFS (dm-3): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                  ^^^^^^^^^^^^^^^^^^^^^^^
00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
.....

The highlighted bytes are the LSN that was replayed into the
buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
that block on disk looks like this:

$ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
hdr.info.hdr.forw = 0
hdr.info.hdr.back = 0
hdr.info.hdr.magic = 0x3bee
hdr.info.crc = 0xb5af0bc6 (correct)
hdr.info.bno = 105448
hdr.info.lsn = 0x100000900
               ^^^^^^^^^^^
hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
hdr.info.owner = 131203
hdr.count = 2
hdr.usedbytes = 120
hdr.firstused = 3796
hdr.holes = 1
hdr.freemap[0-2] = [base,size]

Note the LSN stamped into the buffer on disk: 1/0x900. The version
on disk is much newer than the log transaction that was being
replayed. That's a bug, and should -never- happen.

So I immediately went to look at xlog_recover_get_buf_lsn() to check
that we handled the LSN correctly. I was wondering if there was a
similar "two commits with the same start LSN skips the second
replay" problem with buffers. I didn't get that far, because I found
a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!

IOWs, attr3 leaf buffers fall through the magic number checks
unrecognised, so trigger the "recover immediately" behaviour instead
of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
buffers and that causes silent on disk corruption of inode attribute
forks and potentially other things....

Git history shows this is *another* zero day bug, this time
introduced in commit 50d5c8d8e938 ("xfs: check LSN ordering for v5
superblocks during recovery") which failed to handle the attr3 leaf
buffers in recovery. And we've failed to handle them ever since...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_buf_item_recover.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
index d44e8b4a3391..05fd816edf59 100644
--- a/fs/xfs/xfs_buf_item_recover.c
+++ b/fs/xfs/xfs_buf_item_recover.c
@@ -796,6 +796,7 @@ xlog_recover_get_buf_lsn(
 	switch (magicda) {
 	case XFS_DIR3_LEAF1_MAGIC:
 	case XFS_DIR3_LEAFN_MAGIC:
+	case XFS_ATTR3_LEAF_MAGIC:
 	case XFS_DA3_NODE_MAGIC:
 		lsn = be64_to_cpu(((struct xfs_da3_blkinfo *)blk)->lsn);
 		uuid = &((struct xfs_da3_blkinfo *)blk)->uuid;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 10/11] xfs: need to see iclog flags in tracing
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (8 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
  10 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Because I cannot tell if the NEED_FLUSH flag is being set correctly
by the log force and CIL push machinery without it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_log_priv.h | 13 ++++++++++---
 fs/xfs/xfs_trace.h    |  5 ++++-
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 7cbde0b4f990..f3e79a45d60a 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -59,6 +59,16 @@ enum xlog_iclog_state {
 	{ XLOG_STATE_DIRTY,	"XLOG_STATE_DIRTY" }, \
 	{ XLOG_STATE_IOERROR,	"XLOG_STATE_IOERROR" }
 
+/*
+ * In core log flags
+ */
+#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
+#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
+
+#define XLOG_ICL_STRINGS \
+	{ XLOG_ICL_NEED_FLUSH,	"XLOG_ICL_NEED_FLUSH" }, \
+	{ XLOG_ICL_NEED_FUA,	"XLOG_ICL_NEED_FUA" }
+
 
 /*
  * Log ticket flags
@@ -143,9 +153,6 @@ enum xlog_iclog_state {
 
 #define XLOG_COVER_OPS		5
 
-#define XLOG_ICL_NEED_FLUSH	(1 << 0)	/* iclog needs REQ_PREFLUSH */
-#define XLOG_ICL_NEED_FUA	(1 << 1)	/* iclog needs REQ_FUA */
-
 /* Ticket reservation region accounting */ 
 #define XLOG_TIC_LEN_MAX	15
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f9d8d605f9b1..19260291ff8b 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3944,6 +3944,7 @@ DECLARE_EVENT_CLASS(xlog_iclog_class,
 		__field(uint32_t, state)
 		__field(int32_t, refcount)
 		__field(uint32_t, offset)
+		__field(uint32_t, flags)
 		__field(unsigned long long, lsn)
 		__field(unsigned long, caller_ip)
 	),
@@ -3952,15 +3953,17 @@ DECLARE_EVENT_CLASS(xlog_iclog_class,
 		__entry->state = iclog->ic_state;
 		__entry->refcount = atomic_read(&iclog->ic_refcnt);
 		__entry->offset = iclog->ic_offset;
+		__entry->flags = iclog->ic_flags;
 		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
 		__entry->caller_ip = caller_ip;
 	),
-	TP_printk("dev %d:%d state %s refcnt %d offset %u lsn 0x%llx caller %pS",
+	TP_printk("dev %d:%d state %s refcnt %d offset %u lsn 0x%llx flags %s caller %pS",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __print_symbolic(__entry->state, XLOG_STATE_STRINGS),
 		  __entry->refcount,
 		  __entry->offset,
 		  __entry->lsn,
+		  __print_flags(__entry->flags, "|", XLOG_ICL_STRINGS),
 		  (char *)__entry->caller_ip)
 
 );
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 11/11] xfs: limit iclog tail updates
  2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
                   ` (9 preceding siblings ...)
  2021-07-27  7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
@ 2021-07-27  7:10 ` Dave Chinner
  2021-07-27 21:25   ` Darrick J. Wong
  2021-07-27 23:44   ` [PATCH 11/11 v2] " Dave Chinner
  10 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27  7:10 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

From the department of "generic/482 keeps giving", we bring you
another tail update race condition:

iclog:
	S1			C1
	+-----------------------+-----------------------+
				 S2			EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1:	Cache flush, log tail = X
At S1:		Metadata stable, write start record and checkpoint
At C1:		Write commit record, set NEED_FUA
		Single iclog checkpoint, so no need for NEED_FLUSH
		Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:	Cache flush, log tail = X
At S2:		Metadata stable, write start record and checkpoint
After S2:	Log tail moves to X+1
At EOIC:	End of iclog, more journal data to write
		Releases iclog
		Not a commit iclog, so no need for NEED_FLUSH
		Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. THis is sufficient
to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and it
doesn't actually make it obvious what underlying problem it is
actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 1c328efdca66..1b20fb479ebc 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -489,12 +489,31 @@ xfs_log_reserve(
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.  If the caller passes in a non-zero
- * @old_tail_lsn and the current log tail does not match, there may be metadata
- * on disk that must be persisted before this iclog is written.  To satisfy that
- * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
- * iclog with the new log tail value.
+ * it is in the WANT_SYNC state.
+ *
+ * If the caller passes in a non-zero @old_tail_lsn and the current log tail
+ * does not match, there may be metadata on disk that must be persisted before
+ * this iclog is written.  To satisfy that requirement, set the
+ * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
+ * log tail value.
+ *
+ * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
+ * log tail is updated correctly. NEED_FUA indicates that the iclog will be
+ * written to stable storage, and implies that a commit record is contained
+ * within the iclog. We need to ensure that the log tail does not move beyond
+ * the tail that the first commit record in the iclog ordered against, otherwise
+ * correct recovery of that checkpoint becomes dependent on future operations
+ * performed on this iclog.
+ *
+ * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
+ * current tail into iclog. Once the iclog tail is set, future operations must
+ * not modify it, otherwise they potentially violate ordering constraints for
+ * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
+ * the iclog will get zeroed on activation of the iclog after sync, so we
+ * always capture the tail lsn on the iclog on the first NEED_FUA release
+ * regardless of the number of active reference counts on this iclog.
  */
+
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
@@ -519,6 +538,10 @@ xlog_state_release_iclog(
 
 		if (old_tail_lsn && tail_lsn != old_tail_lsn)
 			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+
+		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
+		    !iclog->ic_header.h_tail_lsn)
+			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
 
 	if (!atomic_dec_and_test(&iclog->ic_refcnt))
@@ -530,7 +553,8 @@ xlog_state_release_iclog(
 	}
 
 	iclog->ic_state = XLOG_STATE_SYNCING;
-	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	if (!iclog->ic_header.h_tail_lsn)
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	xlog_verify_tail_lsn(log, iclog, tail_lsn);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
 
@@ -2579,6 +2603,7 @@ xlog_state_activate_iclog(
 	memset(iclog->ic_header.h_cycle_data, 0,
 		sizeof(iclog->ic_header.h_cycle_data));
 	iclog->ic_header.h_lsn = 0;
+	iclog->ic_header.h_tail_lsn = 0;
 }
 
 /*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates
  2021-07-27  7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
@ 2021-07-27 18:50   ` Darrick J. Wong
  0 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2021-07-27 18:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 27, 2021 at 05:10:05PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There is a race between the new CIL async data device metadata IO
> completion cache flush and the log tail in the iclog the flush
> covers being updated. This can be seen by repeating generic/482 in a
> loop and eventually log recovery fails with a failures such as this:

Looks good, can't remember why I didn't do this last night:
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> XFS (dm-3): Starting recovery (logdev: internal)
> XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
> XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
> XFS (dm-3): Unmount and run xfs_repair
> XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
> 
> Analysis of the logwrite replay shows that there were no writes to
> the data device between the FUA @ write 124 and the FUA at write @
> 125, but log recovery @ 125 failed. The difference was the one log
> write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
> and so the inode create intent in (1,8) was not replayed and so the
> inode cluster was zero on disk when replay of the first inode item
> in (1,32) was attempted.
> 
> What this meant was that the journal write that occurred at @ 125
> did not ensure that metadata completed before the iclog was written
> was correctly on stable storage. The tail of the log moved forward,
> so IO must have been completed between the two iclog writes. This
> means that there is a race condition between the unconditional async
> cache flush in the CIL push work and the tail LSN that is written to
> the iclog. This happens like so:
> 
> CIL push work				AIL push work
> -------------				-------------
> Add to committing list
> start async data dev cache flush
> .....
> <flush completes>
> <all writes to old tail lsn are stable>
> xlog_write
>   ....					push inode create buffer
> 					<start IO>
> 					.....
> xlog_write(commit record)
>   ....					<IO completes>
>   					log tail moves
>   					  xlog_assign_tail_lsn()
> start_lsn == commit_lsn
>   <no iclog preflush!>
> xlog_state_release_iclog
>   __xlog_state_release_iclog()
>     <writes *new* tail_lsn into iclog>
>   xlog_sync()
>     ....
>     submit_bio()
> <tail in log moves forward without flushing written metadata>
> 
> Essentially, this can only occur if the commit iclog is issued
> without a cache flush. If the iclog bio is submitted with
> REQ_PREFLUSH, then it will guarantee that all the completed IO is
> one stable storage before the iclog bio with the new tail LSN in it
> is written to the log.
> 
> IOWs, the tail lsn that is written to the iclog needs to be sampled
> *before* we issue the cache flush that guarantees all IO up to that
> LSN has been completed.
> 
> To fix this without giving up the performance advantage of the
> flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
> compared to 5.13), we need to ensure that we always issue a cache
> flush if the tail LSN changes between the initial async flush and
> the commit record being written. THis requires sampling the tail_lsn
> before we start the flush, and then passing the sampled tail LSN to
> xlog_state_release_iclog() so it can determine if the the tail LSN
> has changed while writing the checkpoint. If the tail LSN has
> changed, then it needs to set the NEED_FLUSH flag on the iclog and
> we'll issue another cache flush before writing the iclog.
> 
> Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/xfs_log.c      | 36 ++++++++++++++++++++++++++----------
>  fs/xfs/xfs_log_cil.c  | 13 +++++++++++--
>  fs/xfs/xfs_log_priv.h |  3 ++-
>  3 files changed, 39 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 82f5996d3889..e8c6c96d4f7c 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -489,12 +489,17 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.
> + * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> + * @old_tail_lsn and the current log tail does not match, there may be metadata
> + * on disk that must be persisted before this iclog is written.  To satisfy that
> + * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> + * iclog with the new log tail value.
>   */
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog)
> +	struct xlog_in_core	*iclog,
> +	xfs_lsn_t		old_tail_lsn)
>  {
>  	xfs_lsn_t		tail_lsn;
>  	lockdep_assert_held(&log->l_icloglock);
> @@ -503,6 +508,19 @@ xlog_state_release_iclog(
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		return -EIO;
>  
> +	/*
> +	 * Grabbing the current log tail needs to be atomic w.r.t. the writing
> +	 * of the tail LSN into the iclog so we guarantee that the log tail does
> +	 * not move between deciding if a cache flush is required and writing
> +	 * the LSN into the iclog below.
> +	 */
> +	if (old_tail_lsn || iclog->ic_state == XLOG_STATE_WANT_SYNC) {
> +		tail_lsn = xlog_assign_tail_lsn(log->l_mp);
> +
> +		if (old_tail_lsn && tail_lsn != old_tail_lsn)
> +			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +	}
> +
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
>  		return 0;
>  
> @@ -511,8 +529,6 @@ xlog_state_release_iclog(
>  		return 0;
>  	}
>  
> -	/* update tail before writing to iclog */
> -	tail_lsn = xlog_assign_tail_lsn(log->l_mp);
>  	iclog->ic_state = XLOG_STATE_SYNCING;
>  	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> @@ -858,7 +874,7 @@ xlog_unmount_write(
>  	 * iclog containing the unmount record is written.
>  	 */
>  	iclog->ic_flags |= (XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, 0);
>  	xlog_wait_on_iclog(iclog);
>  
>  	if (tic) {
> @@ -2302,7 +2318,7 @@ xlog_write_copy_finish(
>  	return 0;
>  
>  release_iclog:
> -	error = xlog_state_release_iclog(log, iclog);
> +	error = xlog_state_release_iclog(log, iclog, 0);
>  	spin_unlock(&log->l_icloglock);
>  	return error;
>  }
> @@ -2521,7 +2537,7 @@ xlog_write(
>  		ASSERT(optype & XLOG_COMMIT_TRANS);
>  		*commit_iclog = iclog;
>  	} else {
> -		error = xlog_state_release_iclog(log, iclog);
> +		error = xlog_state_release_iclog(log, iclog, 0);
>  	}
>  	spin_unlock(&log->l_icloglock);
>  
> @@ -2959,7 +2975,7 @@ xlog_state_get_iclog_space(
>  		 * reference to the iclog.
>  		 */
>  		if (!atomic_add_unless(&iclog->ic_refcnt, -1, 1))
> -			error = xlog_state_release_iclog(log, iclog);
> +			error = xlog_state_release_iclog(log, iclog, 0);
>  		spin_unlock(&log->l_icloglock);
>  		if (error)
>  			return error;
> @@ -3195,7 +3211,7 @@ xfs_log_force(
>  			atomic_inc(&iclog->ic_refcnt);
>  			lsn = be64_to_cpu(iclog->ic_header.h_lsn);
>  			xlog_state_switch_iclogs(log, iclog, 0);
> -			if (xlog_state_release_iclog(log, iclog))
> +			if (xlog_state_release_iclog(log, iclog, 0))
>  				goto out_error;
>  
>  			if (be64_to_cpu(iclog->ic_header.h_lsn) != lsn)
> @@ -3275,7 +3291,7 @@ xlog_force_lsn(
>  		}
>  		atomic_inc(&iclog->ic_refcnt);
>  		xlog_state_switch_iclogs(log, iclog, 0);
> -		if (xlog_state_release_iclog(log, iclog))
> +		if (xlog_state_release_iclog(log, iclog, 0))
>  			goto out_error;
>  		if (log_flushed)
>  			*log_flushed = 1;
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b128aaa9b870..4c44bc3786c0 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -654,8 +654,9 @@ xlog_cil_push_work(
>  	struct xfs_trans_header thdr;
>  	struct xfs_log_iovec	lhdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> +	xfs_lsn_t		preflush_tail_lsn;
>  	xfs_lsn_t		commit_lsn;
> -	xfs_lsn_t		push_seq;
> +	xfs_csn_t		push_seq;
>  	struct bio		bio;
>  	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  
> @@ -730,7 +731,15 @@ xlog_cil_push_work(
>  	 * because we hold the flush lock exclusively. Hence we can now issue
>  	 * a cache flush to ensure all the completed metadata in the journal we
>  	 * are about to overwrite is on stable storage.
> +	 *
> +	 * Because we are issuing this cache flush before we've written the
> +	 * tail lsn to the iclog, we can have metadata IO completions move the
> +	 * tail forwards between the completion of this flush and the iclog
> +	 * being written. In this case, we need to re-issue the cache flush
> +	 * before the iclog write. To detect whether the log tail moves, sample
> +	 * the tail LSN *before* we issue the flush.
>  	 */
> +	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
>  	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
>  				&bdev_flush);
>  
> @@ -941,7 +950,7 @@ xlog_cil_push_work(
>  	 * storage.
>  	 */
>  	commit_iclog->ic_flags |= XLOG_ICL_NEED_FUA;
> -	xlog_state_release_iclog(log, commit_iclog);
> +	xlog_state_release_iclog(log, commit_iclog, preflush_tail_lsn);
>  	spin_unlock(&log->l_icloglock);
>  	return;
>  
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 4c41bbfa33b0..7cbde0b4f990 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -497,7 +497,8 @@ int	xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket,
>  void	xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket);
>  void	xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket);
>  
> -int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog);
> +int xlog_state_release_iclog(struct xlog *log, struct xlog_in_core *iclog,
> +		xfs_lsn_t log_tail_lsn);
>  
>  /*
>   * When we crack an atomic LSN, we sample it first so that the value will not
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards
  2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
@ 2021-07-27 19:00   ` Darrick J. Wong
  2023-06-13  6:20   ` Chandan Babu R
  1 sibling, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2021-07-27 19:00 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log an inode, we format the "log inode" core and set an LSN
> in that inode core. We do that via xfs_inode_item_format_core(),
> which calls:
> 
> 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
> 
> to format the log inode. It writes the LSN from the inode item into
> the log inode, and if recovery decides the inode item needs to be
> replayed, it recovers the log inode LSN field and writes it into the
> on disk inode LSN field.
> 
> Now this might seem like a reasonable thing to do, but it is wrong
> on multiple levels. Firstly, if the item is not yet in the AIL,
> item->li_lsn is zero. i.e. the first time the inode it is logged and
> formatted, the LSN we write into the log inode will be zero. If we
> only log it once, recovery will run and can write this zero LSN into
> the inode.
> 
> This means that the next time the inode is logged and log recovery
> runs, it will *always* replay changes to the inode regardless of
> whether the inode is newer on disk than the version in the log and
> that violates the entire purpose of recording the LSN in the inode
> at writeback time (i.e. to stop it going backwards in time on disk
> during recovery).
> 
> Secondly, if we commit the CIL to the journal so the inode item
> moves to the AIL, and then relog the inode, the LSN that gets
> stamped into the log inode will be the LSN of the inode's current
> location in the AIL, not it's age on disk. And it's not the LSN that
> will be associated with the current change. That means when log
> recovery replays this inode item, the LSN that ends up on disk is
> the LSN for the previous changes in the log, not the current
> changes being replayed. IOWs, after recovery the LSN on disk is not
> in sync with the LSN of the modifications that were replayed into
> the inode. This, again, violates the recovery ordering semantics
> that on-disk writeback LSNs provide.
> 
> Hence the inode LSN in the log dinode is -always- invalid.
> 
> Thirdly, recovery actually has the LSN of the log transaction it is
> replaying right at hand - it uses it to determine if it should
> replay the inode by comparing it to the on-disk inode's LSN. But it
> doesn't use that LSN to stamp the LSN into the inode which will be
> written back when the transaction is fully replayed. It uses the one
> in the log dinode, which we know is always going to be incorrect.
> 
> Looking back at the change history, the inode logging was broken by
> commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
> back in 2016 by a stupid idiot who thought he knew how this code
> worked. i.e. me. That commit replaced an in memory di_lsn field that
> was updated only at inode writeback time from the inode item.li_lsn
> value - and hence always contained the same LSN that appeared in the
> on-disk inode - with a read of the inode item LSN at inode format
> time. CLearly these are not the same thing.
> 
> Before 93f958f9c41f, the log recovery behaviour was irrelevant,
> because the LSN in the log inode always matched the on-disk LSN at
> the time the inode was logged, hence recovery of the transaction
> would never make the on-disk LSN in the inode go backwards or get
> out of sync.
> 
> A symptom of the problem is this, caught from a failure of
> generic/482. Before log recovery, the inode has been allocated but
> never used:
> 
> xfs_db> inode 393388
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0
> ....
> v3.crc = 0x99126961 (correct)
> v3.change_count = 0
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jan  1 10:00:00 1970
> v3.crtime.nsec = 0
> 
> After log recovery:
> 
> xfs_db> p
> core.magic = 0x494e
> core.mode = 020444
> ....
> v3.crc = 0x23e68f23 (correct)
> v3.change_count = 2
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jul 22 17:03:03 2021
> v3.crtime.nsec = 751000000
> ...
> 
> You can see that the LSN of the on-disk inode is 0, even though it
> clearly has been written to disk. I point out this inode, because
> the generic/482 failure occurred because several adjacent inodes in
> this specific inode cluster were not replayed correctly and still
> appeared to be zero on disk when all the other metadata (inobt,
> finobt, directories, etc) indicated they should be allocated and
> written back.
> 
> The fix for this is two-fold. The first is that we need to either
> revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
> altogether. If we do the former, log recovery does not need to
> change but we add 8 bytes of memory per inode to store what is
> largely a write-only inode field. If we do the latter, log recovery
> needs to stamp the on-disk inode in the same manner that inode
> writeback does.
> 
> I prefer the latter, because we shouldn't really be trying to log
> and replay changes to the on disk LSN as the on-disk value is the
> canonical source of the on-disk version of the inode. It also
> matches the way we recover buffer items - we create a buf_log_item
> that carries the current recovery transaction LSN that gets stamped
> into the buffer by the write verifier when it gets written back
> when the transaction is fully recovered.
> 
> However, this might break log recovery on older kernels even more,
> so I'm going to simply ignore the logged value in recovery and stamp
> the on-disk inode with the LSN of the transaction being recovered
> that will trigger writeback on transaction recovery completion. This
> will ensure that the on-disk inode LSN always reflects the LSN of
> the last change that was written to disk, regardless of whether it
> comes from log recovery or runtime writeback.
> 
> Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
>  fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..2c5bcbc19264 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -411,7 +411,16 @@ struct xfs_log_dinode {
>  	/* start of the extended dinode, writable fields */
>  	uint32_t	di_crc;		/* CRC of the inode */
>  	uint64_t	di_changecount;	/* number of attribute changes */
> -	xfs_lsn_t	di_lsn;		/* flush sequence */
> +
> +	/*
> +	 * The LSN we write to this field during formatting is not a reflection
> +	 * of the current on-disk LSN. It should never be used for recovery
> +	 * sequencing, nor should it be recovered into the on-disk inode at all.
> +	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
> +	 * for details.
> +	 */
> +	xfs_lsn_t	di_lsn;

I wonder if we should change the name of this to di_lsn_DONOTUSE?  The
net effect of this patch is that xfs_log_dinode.di_lsn is written in the
same questionable manner that it always has been for the sake of not
breaking recovery on old kernels; but from now on we're never going to
read it ever again, not even in log recovery.

The AIL will flush the inode cluster (and stamp the /correct/ LSN into
the cluster buffer), and recovery will now compare the ondisk LSN
against the LSN of the recovered transaction to decide if it should skip
the update.

If I got that right,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
>  	uint8_t		di_pad2[12];	/* more padding for future expansion */
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 7b79518b6c20..e0072a6cd2d3 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
>  STATIC void
>  xfs_log_dinode_to_disk(
>  	struct xfs_log_dinode	*from,
> -	struct xfs_dinode	*to)
> +	struct xfs_dinode	*to,
> +	xfs_lsn_t		lsn)
>  {
>  	to->di_magic = cpu_to_be16(from->di_magic);
>  	to->di_mode = cpu_to_be16(from->di_mode);
> @@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
>  		to->di_ino = cpu_to_be64(from->di_ino);
> -		to->di_lsn = cpu_to_be64(from->di_lsn);
> +		to->di_lsn = cpu_to_be64(lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
>  		uuid_copy(&to->di_uuid, &from->di_uuid);
>  		to->di_flushiter = 0;
> @@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
>  	}
>  
>  	/*
> -	 * If the inode has an LSN in it, recover the inode only if it's less
> -	 * than the lsn of the transaction we are replaying. Note: we still
> -	 * need to replay an owner change even though the inode is more recent
> -	 * than the transaction as there is no guarantee that all the btree
> -	 * blocks are more recent than this transaction, too.
> +	 * If the inode has an LSN in it, recover the inode only if the on-disk
> +	 * inode's LSN is older than the lsn of the transaction we are
> +	 * replaying. We can have multiple checkpoints with the same start LSN,
> +	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
> +	 * mean that the on-disk inode is more recent than the change being
> +	 * replayed.
> +	 *
> +	 * We must check the current_lsn against the on-disk inode
> +	 * here because the we can't trust the log dinode to contain a valid LSN
> +	 * (see comment below before replaying the log dinode for details).
> +	 *
> +	 * Note: we still need to replay an owner change even though the inode
> +	 * is more recent than the transaction as there is no guarantee that all
> +	 * the btree blocks are more recent than this transaction, too.
>  	 */
>  	if (dip->di_version >= 3) {
>  		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
>  
> -		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
> +		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>  			trace_xfs_log_recover_inode_skip(log, in_f);
>  			error = 0;
>  			goto out_owner_change;
> @@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
>  		goto out_release;
>  	}
>  
> -	/* recover the log dinode inode into the on disk inode */
> -	xfs_log_dinode_to_disk(ldip, dip);
> +	/*
> +	 * Recover the log dinode inode into the on disk inode.
> +	 *
> +	 * The LSN in the log dinode is garbage - it can be zero or reflect
> +	 * stale in-memory runtime state that isn't coherent with the changes
> +	 * logged in this transaction or the changes written to the on-disk
> +	 * inode.  Hence we write the current lSN into the inode because that
> +	 * matches what xfs_iflush() would write inode the inode when flushing
> +	 * the changes in this transaction.
> +	 */
> +	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
>  
>  	fields = in_f->ilf_fields;
>  	if (fields & XFS_ILOG_DEV)
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 11/11] xfs: limit iclog tail updates
  2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
@ 2021-07-27 21:25   ` Darrick J. Wong
  2021-07-27 22:13     ` Dave Chinner
  2021-07-27 23:44   ` [PATCH 11/11 v2] " Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2021-07-27 21:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jul 27, 2021 at 05:10:12PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> From the department of "generic/482 keeps giving", we bring you
> another tail update race condition:
> 
> iclog:
> 	S1			C1
> 	+-----------------------+-----------------------+
> 				 S2			EOIC
> 
> Two checkpoints in a single iclog. One is complete, the other just
> contains the start record and overruns into a new iclog.
> 
> Timeline:
> 
> Before S1:	Cache flush, log tail = X
> At S1:		Metadata stable, write start record and checkpoint
> At C1:		Write commit record, set NEED_FUA
> 		Single iclog checkpoint, so no need for NEED_FLUSH
> 		Log tail still = X, so no need for NEED_FLUSH
> 
> After C1,
> Before S2:	Cache flush, log tail = X
> At S2:		Metadata stable, write start record and checkpoint
> After S2:	Log tail moves to X+1
> At EOIC:	End of iclog, more journal data to write
> 		Releases iclog
> 		Not a commit iclog, so no need for NEED_FLUSH
> 		Writes log tail X+1 into iclog.
> 
> At this point, the iclog has tail X+1 and NEED_FUA set. There has
> been no cache flush for the metadata between X and X+1, and the
> iclog writes the new tail permanently to the log. THis is sufficient
> to violate on disk metadata/journal ordering.
> 
> We have two options here. The first is to detect this case in some
> manner and ensure that the partial checkpoint write sets NEED_FLUSH
> when the iclog is already marked NEED_FUA and the log tail changes.
> This seems somewhat fragile and quite complex to get right, and it
> doesn't actually make it obvious what underlying problem it is
> actually addressing from reading the code.
> 
> The second option seems much cleaner to me, because it is derived
> directly from the requirements of the C1 commit record in the iclog.
> That is, when we write this commit record to the iclog, we've
> guaranteed that the metadata/data ordering is correct for tail
> update purposes. Hence if we only write the log tail into the iclog
> for the *first* commit record rather than the log tail at the last
> release, we guarantee that the log tail does not move past where the
> the first commit record in the log expects it to be.
> 
> IOWs, taking the first option means that replay of C1 becomes
> dependent on future operations doing the right thing, not just the
> C1 checkpoint itself doing the right thing. This makes log recovery
> almost impossible to reason about because now we have to take into
> account what might or might not have happened in the future when
> looking at checkpoints in the log rather than just having to
> reconstruct the past...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 37 +++++++++++++++++++++++++++++++------
>  1 file changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1c328efdca66..1b20fb479ebc 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -489,12 +489,31 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> - * @old_tail_lsn and the current log tail does not match, there may be metadata
> - * on disk that must be persisted before this iclog is written.  To satisfy that
> - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> - * iclog with the new log tail value.
> + * it is in the WANT_SYNC state.
> + *
> + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> + * does not match, there may be metadata on disk that must be persisted before
> + * this iclog is written.  To satisfy that requirement, set the
> + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> + * log tail value.
> + *
> + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> + * written to stable storage, and implies that a commit record is contained
> + * within the iclog. We need to ensure that the log tail does not move beyond
> + * the tail that the first commit record in the iclog ordered against, otherwise
> + * correct recovery of that checkpoint becomes dependent on future operations
> + * performed on this iclog.
> + *
> + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> + * current tail into iclog. Once the iclog tail is set, future operations must
> + * not modify it, otherwise they potentially violate ordering constraints for
> + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> + * the iclog will get zeroed on activation of the iclog after sync, so we
> + * always capture the tail lsn on the iclog on the first NEED_FUA release
> + * regardless of the number of active reference counts on this iclog.
>   */
> +
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> @@ -519,6 +538,10 @@ xlog_state_release_iclog(
>  
>  		if (old_tail_lsn && tail_lsn != old_tail_lsn)
>  			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +
> +		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> +		    !iclog->ic_header.h_tail_lsn)
> +			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	}
>  
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> @@ -530,7 +553,8 @@ xlog_state_release_iclog(
>  	}
>  
>  	iclog->ic_state = XLOG_STATE_SYNCING;
> -	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> +	if (!iclog->ic_header.h_tail_lsn)
> +		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	xlog_verify_tail_lsn(log, iclog, tail_lsn);

What is xlog_verify_tail_lsn protecting against?  I /think/ it is a
debug check that makes sure that there's enough space between the head
and the tail (i.e. the space that hasn't been written to yet) to write
the iclog that we're releasing, right?

I think that means that we're indirectly checking that if this iclog is
persisted then recovery will be able to find complete log records
between h_tail_lsn and h_lsn for replay, right?

Before this patch we'd write out this iclog with h_tail_lsn set to
wherever the log tail is /now/, but now we write out the iclog with
h_tail_lsn set to wherever the tail was when the first commit record was
added to this iclog.  That implies that recovery will now have to do
more work, but at least the flushes will be in order, right?

So then, the question is this: should this call be:

	xlog_verify_tail_lsn(log, iclog,
			be64_to_cpu(iclog->ic_header.h_tail_lsn));

since we need to guarantee that recovery will be able to start reading
in log records starting at h_tail_lsn?

--D

>  	trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  
> @@ -2579,6 +2603,7 @@ xlog_state_activate_iclog(
>  	memset(iclog->ic_header.h_cycle_data, 0,
>  		sizeof(iclog->ic_header.h_cycle_data));
>  	iclog->ic_header.h_lsn = 0;
> +	iclog->ic_header.h_tail_lsn = 0;
>  }
>  
>  /*
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 11/11] xfs: limit iclog tail updates
  2021-07-27 21:25   ` Darrick J. Wong
@ 2021-07-27 22:13     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2021-07-27 22:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jul 27, 2021 at 02:25:31PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 27, 2021 at 05:10:12PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > From the department of "generic/482 keeps giving", we bring you
> > another tail update race condition:
> > 
> > iclog:
> > 	S1			C1
> > 	+-----------------------+-----------------------+
> > 				 S2			EOIC
> > 
> > Two checkpoints in a single iclog. One is complete, the other just
> > contains the start record and overruns into a new iclog.
> > 
> > Timeline:
> > 
> > Before S1:	Cache flush, log tail = X
> > At S1:		Metadata stable, write start record and checkpoint
> > At C1:		Write commit record, set NEED_FUA
> > 		Single iclog checkpoint, so no need for NEED_FLUSH
> > 		Log tail still = X, so no need for NEED_FLUSH
> > 
> > After C1,
> > Before S2:	Cache flush, log tail = X
> > At S2:		Metadata stable, write start record and checkpoint
> > After S2:	Log tail moves to X+1
> > At EOIC:	End of iclog, more journal data to write
> > 		Releases iclog
> > 		Not a commit iclog, so no need for NEED_FLUSH
> > 		Writes log tail X+1 into iclog.
> > 
> > At this point, the iclog has tail X+1 and NEED_FUA set. There has
> > been no cache flush for the metadata between X and X+1, and the
> > iclog writes the new tail permanently to the log. THis is sufficient
> > to violate on disk metadata/journal ordering.
> > 
> > We have two options here. The first is to detect this case in some
> > manner and ensure that the partial checkpoint write sets NEED_FLUSH
> > when the iclog is already marked NEED_FUA and the log tail changes.
> > This seems somewhat fragile and quite complex to get right, and it
> > doesn't actually make it obvious what underlying problem it is
> > actually addressing from reading the code.
> > 
> > The second option seems much cleaner to me, because it is derived
> > directly from the requirements of the C1 commit record in the iclog.
> > That is, when we write this commit record to the iclog, we've
> > guaranteed that the metadata/data ordering is correct for tail
> > update purposes. Hence if we only write the log tail into the iclog
> > for the *first* commit record rather than the log tail at the last
> > release, we guarantee that the log tail does not move past where the
> > the first commit record in the log expects it to be.
> > 
> > IOWs, taking the first option means that replay of C1 becomes
> > dependent on future operations doing the right thing, not just the
> > C1 checkpoint itself doing the right thing. This makes log recovery
> > almost impossible to reason about because now we have to take into
> > account what might or might not have happened in the future when
> > looking at checkpoints in the log rather than just having to
> > reconstruct the past...
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 37 +++++++++++++++++++++++++++++++------
> >  1 file changed, 31 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 1c328efdca66..1b20fb479ebc 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -489,12 +489,31 @@ xfs_log_reserve(
> >  
> >  /*
> >   * Flush iclog to disk if this is the last reference to the given iclog and the
> > - * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> > - * @old_tail_lsn and the current log tail does not match, there may be metadata
> > - * on disk that must be persisted before this iclog is written.  To satisfy that
> > - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> > - * iclog with the new log tail value.
> > + * it is in the WANT_SYNC state.
> > + *
> > + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> > + * does not match, there may be metadata on disk that must be persisted before
> > + * this iclog is written.  To satisfy that requirement, set the
> > + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> > + * log tail value.
> > + *
> > + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> > + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> > + * written to stable storage, and implies that a commit record is contained
> > + * within the iclog. We need to ensure that the log tail does not move beyond
> > + * the tail that the first commit record in the iclog ordered against, otherwise
> > + * correct recovery of that checkpoint becomes dependent on future operations
> > + * performed on this iclog.
> > + *
> > + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> > + * current tail into iclog. Once the iclog tail is set, future operations must
> > + * not modify it, otherwise they potentially violate ordering constraints for
> > + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> > + * the iclog will get zeroed on activation of the iclog after sync, so we
> > + * always capture the tail lsn on the iclog on the first NEED_FUA release
> > + * regardless of the number of active reference counts on this iclog.
> >   */
> > +
> >  int
> >  xlog_state_release_iclog(
> >  	struct xlog		*log,
> > @@ -519,6 +538,10 @@ xlog_state_release_iclog(
> >  
> >  		if (old_tail_lsn && tail_lsn != old_tail_lsn)
> >  			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> > +
> > +		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> > +		    !iclog->ic_header.h_tail_lsn)
> > +			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> >  	}
> >  
> >  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> > @@ -530,7 +553,8 @@ xlog_state_release_iclog(
> >  	}
> >  
> >  	iclog->ic_state = XLOG_STATE_SYNCING;
> > -	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> > +	if (!iclog->ic_header.h_tail_lsn)
> > +		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> >  	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> 
> What is xlog_verify_tail_lsn protecting against?  I /think/ it is a
> debug check that makes sure that there's enough space between the head
> and the tail (i.e. the space that hasn't been written to yet) to write
> the iclog that we're releasing, right?

Yeah, I thinks so - it is a physical space check to see if the current iclog
will fit between the tail and the location that the log thinks the
iclog is going to be located at.

That said, prev_block/cycle is the location that the last iclog that
was switched into WANT_SYNC state starts at. It's not necessarily
the location this iclog starts at because xlog_state_switch_iclog()
isn't always called from the same lock context that releases the
iclog (e.g. log force). Hence I think this debug check is racy and
potentially wrong.

It seems kinda stupid, really, because the iclog has it's physical
location in the header, too. It doesn't need the prev block/cycle to
be able to check this.

I note that l_prev_cycle is only used for xlog_verify_tail_lsn(),
but l_prev_block is also written into the iclog header in
xlog_state_switch_iclog(). It's not used at all by log recovery and
only for information prints in userspace, and I really can't see how
it is useful in any way for recovery. So this all really looks like
code that can be simplified by removing it and fixing
xlog_verify_tail_lsn() just to look at internal iclog state.

So I'd suggest there are substantial cleanups around this debug
operation which would also remove unnecessary runtime overhead.

> I think that means that we're indirectly checking that if this iclog is
> persisted then recovery will be able to find complete log records
> between h_tail_lsn and h_lsn for replay, right?
> 
> Before this patch we'd write out this iclog with h_tail_lsn set to
> wherever the log tail is /now/, but now we write out the iclog with
> h_tail_lsn set to wherever the tail was when the first commit record was
> added to this iclog.  That implies that recovery will now have to do
> more work, but at least the flushes will be in order, right?
> 
> So then, the question is this: should this call be:
> 
> 	xlog_verify_tail_lsn(log, iclog,
> 			be64_to_cpu(iclog->ic_header.h_tail_lsn));
> 
> since we need to guarantee that recovery will be able to start reading
> in log records starting at h_tail_lsn?

Probably should be. I'll update the patch and sen dout a new version
once I've let the smoke out a couple of times.

Thanks, Darrick!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 11/11 v2] xfs: limit iclog tail updates
  2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
  2021-07-27 21:25   ` Darrick J. Wong
@ 2021-07-27 23:44   ` Dave Chinner
  2021-07-29 16:28     ` Darrick J. Wong
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2021-07-27 23:44 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

From the department of "generic/482 keeps on giving", we bring you
another tail update race condition:

iclog:
	S1			C1
	+-----------------------+-----------------------+
				 S2			EOIC

Two checkpoints in a single iclog. One is complete, the other just
contains the start record and overruns into a new iclog.

Timeline:

Before S1:	Cache flush, log tail = X
At S1:		Metadata stable, write start record and checkpoint
At C1:		Write commit record, set NEED_FUA
		Single iclog checkpoint, so no need for NEED_FLUSH
		Log tail still = X, so no need for NEED_FLUSH

After C1,
Before S2:	Cache flush, log tail = X
At S2:		Metadata stable, write start record and checkpoint
After S2:	Log tail moves to X+1
At EOIC:	End of iclog, more journal data to write
		Releases iclog
		Not a commit iclog, so no need for NEED_FLUSH
		Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has
been no cache flush for the metadata between X and X+1, and the
iclog writes the new tail permanently to the log. THis is sufficient
to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some
manner and ensure that the partial checkpoint write sets NEED_FLUSH
when the iclog is already marked NEED_FUA and the log tail changes.
This seems somewhat fragile and quite complex to get right, and it
doesn't actually make it obvious what underlying problem it is
actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived
directly from the requirements of the C1 commit record in the iclog.
That is, when we write this commit record to the iclog, we've
guaranteed that the metadata/data ordering is correct for tail
update purposes. Hence if we only write the log tail into the iclog
for the *first* commit record rather than the log tail at the last
release, we guarantee that the log tail does not move past where the
the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes
dependent on future operations doing the right thing, not just the
C1 checkpoint itself doing the right thing. This makes log recovery
almost impossible to reason about because now we have to take into
account what might or might not have happened in the future when
looking at checkpoints in the log rather than just having to
reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- fix xlog_verify_tail_lsn() debug code to look at the iclog tail
  lsn rather than the current tail lsn that we might not have
  written into the iclog.

 fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 1c328efdca66..60ac5fd63f1e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -78,13 +78,12 @@ xlog_verify_iclog(
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn);
+	struct xlog_in_core	*iclog);
 #else
 #define xlog_verify_dest_ptr(a,b)
 #define xlog_verify_grant_tail(a)
 #define xlog_verify_iclog(a,b,c)
-#define xlog_verify_tail_lsn(a,b,c)
+#define xlog_verify_tail_lsn(a,b)
 #endif
 
 STATIC int
@@ -489,12 +488,31 @@ xfs_log_reserve(
 
 /*
  * Flush iclog to disk if this is the last reference to the given iclog and the
- * it is in the WANT_SYNC state.  If the caller passes in a non-zero
- * @old_tail_lsn and the current log tail does not match, there may be metadata
- * on disk that must be persisted before this iclog is written.  To satisfy that
- * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
- * iclog with the new log tail value.
+ * it is in the WANT_SYNC state.
+ *
+ * If the caller passes in a non-zero @old_tail_lsn and the current log tail
+ * does not match, there may be metadata on disk that must be persisted before
+ * this iclog is written.  To satisfy that requirement, set the
+ * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
+ * log tail value.
+ *
+ * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
+ * log tail is updated correctly. NEED_FUA indicates that the iclog will be
+ * written to stable storage, and implies that a commit record is contained
+ * within the iclog. We need to ensure that the log tail does not move beyond
+ * the tail that the first commit record in the iclog ordered against, otherwise
+ * correct recovery of that checkpoint becomes dependent on future operations
+ * performed on this iclog.
+ *
+ * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
+ * current tail into iclog. Once the iclog tail is set, future operations must
+ * not modify it, otherwise they potentially violate ordering constraints for
+ * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
+ * the iclog will get zeroed on activation of the iclog after sync, so we
+ * always capture the tail lsn on the iclog on the first NEED_FUA release
+ * regardless of the number of active reference counts on this iclog.
  */
+
 int
 xlog_state_release_iclog(
 	struct xlog		*log,
@@ -519,6 +537,10 @@ xlog_state_release_iclog(
 
 		if (old_tail_lsn && tail_lsn != old_tail_lsn)
 			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+
+		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
+		    !iclog->ic_header.h_tail_lsn)
+			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 	}
 
 	if (!atomic_dec_and_test(&iclog->ic_refcnt))
@@ -530,8 +552,9 @@ xlog_state_release_iclog(
 	}
 
 	iclog->ic_state = XLOG_STATE_SYNCING;
-	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
-	xlog_verify_tail_lsn(log, iclog, tail_lsn);
+	if (!iclog->ic_header.h_tail_lsn)
+		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
+	xlog_verify_tail_lsn(log, iclog);
 	trace_xlog_iclog_syncing(iclog, _RET_IP_);
 
 	spin_unlock(&log->l_icloglock);
@@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
 	memset(iclog->ic_header.h_cycle_data, 0,
 		sizeof(iclog->ic_header.h_cycle_data));
 	iclog->ic_header.h_lsn = 0;
+	iclog->ic_header.h_tail_lsn = 0;
 }
 
 /*
@@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
 STATIC void
 xlog_verify_tail_lsn(
 	struct xlog		*log,
-	struct xlog_in_core	*iclog,
-	xfs_lsn_t		tail_lsn)
+	struct xlog_in_core	*iclog)
 {
-    int blocks;
+	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
+	int		blocks;
 
     if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
 	blocks =

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 11/11 v2] xfs: limit iclog tail updates
  2021-07-27 23:44   ` [PATCH 11/11 v2] " Dave Chinner
@ 2021-07-29 16:28     ` Darrick J. Wong
  0 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2021-07-29 16:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jul 28, 2021 at 09:44:10AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> From the department of "generic/482 keeps on giving", we bring you
> another tail update race condition:
> 
> iclog:
> 	S1			C1
> 	+-----------------------+-----------------------+
> 				 S2			EOIC
> 
> Two checkpoints in a single iclog. One is complete, the other just
> contains the start record and overruns into a new iclog.
> 
> Timeline:
> 
> Before S1:	Cache flush, log tail = X
> At S1:		Metadata stable, write start record and checkpoint
> At C1:		Write commit record, set NEED_FUA
> 		Single iclog checkpoint, so no need for NEED_FLUSH
> 		Log tail still = X, so no need for NEED_FLUSH
> 
> After C1,
> Before S2:	Cache flush, log tail = X
> At S2:		Metadata stable, write start record and checkpoint
> After S2:	Log tail moves to X+1
> At EOIC:	End of iclog, more journal data to write
> 		Releases iclog
> 		Not a commit iclog, so no need for NEED_FLUSH
> 		Writes log tail X+1 into iclog.
> 
> At this point, the iclog has tail X+1 and NEED_FUA set. There has
> been no cache flush for the metadata between X and X+1, and the
> iclog writes the new tail permanently to the log. THis is sufficient
> to violate on disk metadata/journal ordering.
> 
> We have two options here. The first is to detect this case in some
> manner and ensure that the partial checkpoint write sets NEED_FLUSH
> when the iclog is already marked NEED_FUA and the log tail changes.
> This seems somewhat fragile and quite complex to get right, and it
> doesn't actually make it obvious what underlying problem it is
> actually addressing from reading the code.
> 
> The second option seems much cleaner to me, because it is derived
> directly from the requirements of the C1 commit record in the iclog.
> That is, when we write this commit record to the iclog, we've
> guaranteed that the metadata/data ordering is correct for tail
> update purposes. Hence if we only write the log tail into the iclog
> for the *first* commit record rather than the log tail at the last
> release, we guarantee that the log tail does not move past where the
> the first commit record in the log expects it to be.
> 
> IOWs, taking the first option means that replay of C1 becomes
> dependent on future operations doing the right thing, not just the
> C1 checkpoint itself doing the right thing. This makes log recovery
> almost impossible to reason about because now we have to take into
> account what might or might not have happened in the future when
> looking at checkpoints in the log rather than just having to
> reconstruct the past...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good now,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
> Version 2:
> - fix xlog_verify_tail_lsn() debug code to look at the iclog tail
>   lsn rather than the current tail lsn that we might not have
>   written into the iclog.
> 
>  fs/xfs/xfs_log.c | 50 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 37 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 1c328efdca66..60ac5fd63f1e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -78,13 +78,12 @@ xlog_verify_iclog(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn);
> +	struct xlog_in_core	*iclog);
>  #else
>  #define xlog_verify_dest_ptr(a,b)
>  #define xlog_verify_grant_tail(a)
>  #define xlog_verify_iclog(a,b,c)
> -#define xlog_verify_tail_lsn(a,b,c)
> +#define xlog_verify_tail_lsn(a,b)
>  #endif
>  
>  STATIC int
> @@ -489,12 +488,31 @@ xfs_log_reserve(
>  
>  /*
>   * Flush iclog to disk if this is the last reference to the given iclog and the
> - * it is in the WANT_SYNC state.  If the caller passes in a non-zero
> - * @old_tail_lsn and the current log tail does not match, there may be metadata
> - * on disk that must be persisted before this iclog is written.  To satisfy that
> - * requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
> - * iclog with the new log tail value.
> + * it is in the WANT_SYNC state.
> + *
> + * If the caller passes in a non-zero @old_tail_lsn and the current log tail
> + * does not match, there may be metadata on disk that must be persisted before
> + * this iclog is written.  To satisfy that requirement, set the
> + * XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
> + * log tail value.
> + *
> + * If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
> + * log tail is updated correctly. NEED_FUA indicates that the iclog will be
> + * written to stable storage, and implies that a commit record is contained
> + * within the iclog. We need to ensure that the log tail does not move beyond
> + * the tail that the first commit record in the iclog ordered against, otherwise
> + * correct recovery of that checkpoint becomes dependent on future operations
> + * performed on this iclog.
> + *
> + * Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
> + * current tail into iclog. Once the iclog tail is set, future operations must
> + * not modify it, otherwise they potentially violate ordering constraints for
> + * the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
> + * the iclog will get zeroed on activation of the iclog after sync, so we
> + * always capture the tail lsn on the iclog on the first NEED_FUA release
> + * regardless of the number of active reference counts on this iclog.
>   */
> +
>  int
>  xlog_state_release_iclog(
>  	struct xlog		*log,
> @@ -519,6 +537,10 @@ xlog_state_release_iclog(
>  
>  		if (old_tail_lsn && tail_lsn != old_tail_lsn)
>  			iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +
> +		if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
> +		    !iclog->ic_header.h_tail_lsn)
> +			iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  	}
>  
>  	if (!atomic_dec_and_test(&iclog->ic_refcnt))
> @@ -530,8 +552,9 @@ xlog_state_release_iclog(
>  	}
>  
>  	iclog->ic_state = XLOG_STATE_SYNCING;
> -	iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> -	xlog_verify_tail_lsn(log, iclog, tail_lsn);
> +	if (!iclog->ic_header.h_tail_lsn)
> +		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> +	xlog_verify_tail_lsn(log, iclog);
>  	trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  
>  	spin_unlock(&log->l_icloglock);
> @@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
>  	memset(iclog->ic_header.h_cycle_data, 0,
>  		sizeof(iclog->ic_header.h_cycle_data));
>  	iclog->ic_header.h_lsn = 0;
> +	iclog->ic_header.h_tail_lsn = 0;
>  }
>  
>  /*
> @@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
>  STATIC void
>  xlog_verify_tail_lsn(
>  	struct xlog		*log,
> -	struct xlog_in_core	*iclog,
> -	xfs_lsn_t		tail_lsn)
> +	struct xlog_in_core	*iclog)
>  {
> -    int blocks;
> +	xfs_lsn_t	tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
> +	int		blocks;
>  
>      if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
>  	blocks =

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards
  2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
  2021-07-27 19:00   ` Darrick J. Wong
@ 2023-06-13  6:20   ` Chandan Babu R
  2023-06-14  9:13     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Chandan Babu R @ 2023-06-13  6:20 UTC (permalink / raw)
  To: david; +Cc: linux-xfs

On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:

Hi Dave,

I am trying to backport this patch to Linux-v5.4 LTS kernel. I am unable to
understand as to how log recovery could overwrite a disk inode (holding newer
contents) with a Log inode (containing stale contents). Please refer to my
query below.

> From: Dave Chinner <dchinner@redhat.com>

> When we log an inode, we format the "log inode" core and set an LSN
> in that inode core. We do that via xfs_inode_item_format_core(),
> which calls:
>
> 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

> to format the log inode. It writes the LSN from the inode item into
> the log inode, and if recovery decides the inode item needs to be
> replayed, it recovers the log inode LSN field and writes it into the
> on disk inode LSN field.

> Now this might seem like a reasonable thing to do, but it is wrong
> on multiple levels. Firstly, if the item is not yet in the AIL,
> item->li_lsn is zero. i.e. the first time the inode it is logged and
> formatted, the LSN we write into the log inode will be zero. If we
> only log it once, recovery will run and can write this zero LSN into
> the inode.

Assume that the following is the state before the inode is written back,
Disk inode LSN = x
Log inode LSN = 0
Transaction LSN = x + 1

At this point, if the filesystem shuts down abruptly, log recovery could
change the disk inode's LSN to zero.

> This means that the next time the inode is logged and log recovery
> runs, it will *always* replay changes to the inode regardless of
> whether the inode is newer on disk than the version in the log and
> that violates the entire purpose of recording the LSN in the inode
> at writeback time (i.e. to stop it going backwards in time on disk
> during recovery).

After the log recovery indicated by me above, if the filesystem modifies the
inode then the following is the state of metadata in memory and on disk,

Disk inode LSN = 0 (Due to changes made by log recovery during mount)
Log inode LSN = 0 (xfs_log_item->li_lsn is zero until the log item is moved to
                   the AIL in the current mount cycle)
Transaction LSN = x + 2

Now, if the filesystem shuts down abruptly once again, log recovery replays
the contents of the Log dinode since Disk inode's LSN is less than
transaction's LSN. In this example, the Log inode's contents were newer than
the disk inode.

Your description suggests that there could be a scenario where a Log inode
holding stale content can still overwrite the contents of the Disk inode
holding newer content. I am unable to come with an example of how that could
happen. Could please explain this to me.

> Secondly, if we commit the CIL to the journal so the inode item
> moves to the AIL, and then relog the inode, the LSN that gets
> stamped into the log inode will be the LSN of the inode's current
> location in the AIL, not it's age on disk. And it's not the LSN that
> will be associated with the current change. That means when log
> recovery replays this inode item, the LSN that ends up on disk is
> the LSN for the previous changes in the log, not the current
> changes being replayed. IOWs, after recovery the LSN on disk is not
> in sync with the LSN of the modifications that were replayed into
> the inode. This, again, violates the recovery ordering semantics
> that on-disk writeback LSNs provide.

> Hence the inode LSN in the log dinode is -always- invalid.

> Thirdly, recovery actually has the LSN of the log transaction it is
> replaying right at hand - it uses it to determine if it should
> replay the inode by comparing it to the on-disk inode's LSN. But it
> doesn't use that LSN to stamp the LSN into the inode which will be
> written back when the transaction is fully replayed. It uses the one
> in the log dinode, which we know is always going to be incorrect.

> Looking back at the change history, the inode logging was broken by
> commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
> back in 2016 by a stupid idiot who thought he knew how this code
> worked. i.e. me. That commit replaced an in memory di_lsn field that
> was updated only at inode writeback time from the inode item.li_lsn
> value - and hence always contained the same LSN that appeared in the
> on-disk inode - with a read of the inode item LSN at inode format
> time. CLearly these are not the same thing.

> Before 93f958f9c41f, the log recovery behaviour was irrelevant,
> because the LSN in the log inode always matched the on-disk LSN at
> the time the inode was logged, hence recovery of the transaction
> would never make the on-disk LSN in the inode go backwards or get
> out of sync.

> A symptom of the problem is this, caught from a failure of
> generic/482. Before log recovery, the inode has been allocated but
> never used:

> xfs_db> inode 393388
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0
> ....
> v3.crc = 0x99126961 (correct)
> v3.change_count = 0
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jan  1 10:00:00 1970
> v3.crtime.nsec = 0

> After log recovery:

> xfs_db> p
> core.magic = 0x494e
> core.mode = 020444
> ....
> v3.crc = 0x23e68f23 (correct)
> v3.change_count = 2
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jul 22 17:03:03 2021
> v3.crtime.nsec = 751000000
> ...

> You can see that the LSN of the on-disk inode is 0, even though it
> clearly has been written to disk. I point out this inode, because
> the generic/482 failure occurred because several adjacent inodes in
> this specific inode cluster were not replayed correctly and still
> appeared to be zero on disk when all the other metadata (inobt,
> finobt, directories, etc) indicated they should be allocated and
> written back.

> The fix for this is two-fold. The first is that we need to either
> revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
> altogether. If we do the former, log recovery does not need to
> change but we add 8 bytes of memory per inode to store what is
> largely a write-only inode field. If we do the latter, log recovery
> needs to stamp the on-disk inode in the same manner that inode
> writeback does.

> I prefer the latter, because we shouldn't really be trying to log
> and replay changes to the on disk LSN as the on-disk value is the
> canonical source of the on-disk version of the inode. It also
> matches the way we recover buffer items - we create a buf_log_item
> that carries the current recovery transaction LSN that gets stamped
> into the buffer by the write verifier when it gets written back
> when the transaction is fully recovered.

> However, this might break log recovery on older kernels even more,
> so I'm going to simply ignore the logged value in recovery and stamp
> the on-disk inode with the LSN of the transaction being recovered
> that will trigger writeback on transaction recovery completion. This
> will ensure that the on-disk inode LSN always reflects the LSN of
> the last change that was written to disk, regardless of whether it
> comes from log recovery or runtime writeback.

> Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  | 11 +++++++++-
>  fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
>  2 files changed, 39 insertions(+), 11 deletions(-)

> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..2c5bcbc19264 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -411,7 +411,16 @@ struct xfs_log_dinode {
>  	/* start of the extended dinode, writable fields */
>  	uint32_t	di_crc;		/* CRC of the inode */
>  	uint64_t	di_changecount;	/* number of attribute changes */
> -	xfs_lsn_t	di_lsn;		/* flush sequence */
> +
> +	/*
> +	 * The LSN we write to this field during formatting is not a reflection
> +	 * of the current on-disk LSN. It should never be used for recovery
> +	 * sequencing, nor should it be recovered into the on-disk inode at all.
> +	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
> +	 * for details.
> +	 */
> +	xfs_lsn_t	di_lsn;
> +
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
>  	uint8_t		di_pad2[12];	/* more padding for future expansion */
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 7b79518b6c20..e0072a6cd2d3 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
>  STATIC void
>  xfs_log_dinode_to_disk(
>  	struct xfs_log_dinode	*from,
> -	struct xfs_dinode	*to)
> +	struct xfs_dinode	*to,
> +	xfs_lsn_t		lsn)
>  {
>  	to->di_magic = cpu_to_be16(from->di_magic);
>  	to->di_mode = cpu_to_be16(from->di_mode);
> @@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
>  		to->di_ino = cpu_to_be64(from->di_ino);
> -		to->di_lsn = cpu_to_be64(from->di_lsn);
> +		to->di_lsn = cpu_to_be64(lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
>  		uuid_copy(&to->di_uuid, &from->di_uuid);
>  		to->di_flushiter = 0;
> @@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
>  	}
 
>  	/*
> -	 * If the inode has an LSN in it, recover the inode only if it's less
> -	 * than the lsn of the transaction we are replaying. Note: we still
> -	 * need to replay an owner change even though the inode is more recent
> -	 * than the transaction as there is no guarantee that all the btree
> -	 * blocks are more recent than this transaction, too.
> +	 * If the inode has an LSN in it, recover the inode only if the on-disk
> +	 * inode's LSN is older than the lsn of the transaction we are
> +	 * replaying. We can have multiple checkpoints with the same start LSN,
> +	 * so the current LSN being equal to the on-disk LSN doesn't necessarily
> +	 * mean that the on-disk inode is more recent than the change being
> +	 * replayed.
> +	 *
> +	 * We must check the current_lsn against the on-disk inode
> +	 * here because the we can't trust the log dinode to contain a valid LSN
> +	 * (see comment below before replaying the log dinode for details).
> +	 *
> +	 * Note: we still need to replay an owner change even though the inode
> +	 * is more recent than the transaction as there is no guarantee that all
> +	 * the btree blocks are more recent than this transaction, too.
>  	 */
>  	if (dip->di_version >= 3) {
>  		xfs_lsn_t	lsn = be64_to_cpu(dip->di_lsn);
 
> -		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
> +		if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>  			trace_xfs_log_recover_inode_skip(log, in_f);
>  			error = 0;
>  			goto out_owner_change;
> @@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
>  		goto out_release;
>  	}
 
> -	/* recover the log dinode inode into the on disk inode */
> -	xfs_log_dinode_to_disk(ldip, dip);
> +	/*
> +	 * Recover the log dinode inode into the on disk inode.
> +	 *
> +	 * The LSN in the log dinode is garbage - it can be zero or reflect
> +	 * stale in-memory runtime state that isn't coherent with the changes
> +	 * logged in this transaction or the changes written to the on-disk
> +	 * inode.  Hence we write the current lSN into the inode because that
> +	 * matches what xfs_iflush() would write inode the inode when flushing
> +	 * the changes in this transaction.
> +	 */
> +	xfs_log_dinode_to_disk(ldip, dip, current_lsn);
 
>  	fields = in_f->ilf_fields;
>  	if (fields & XFS_ILOG_DEV)
> -- 
> 2.31.17

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards
  2023-06-13  6:20   ` Chandan Babu R
@ 2023-06-14  9:13     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2023-06-14  9:13 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Tue, Jun 13, 2023 at 11:50:03AM +0530, Chandan Babu R wrote:
> On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:
> 
> Hi Dave,
> 
> I am trying to backport this patch to Linux-v5.4 LTS kernel. I am unable to
> understand as to how log recovery could overwrite a disk inode (holding newer
> contents) with a Log inode (containing stale contents). Please refer to my
> query below.

I've paged most of this out of my brain.

> 
> > From: Dave Chinner <dchinner@redhat.com>
> 
> > When we log an inode, we format the "log inode" core and set an LSN
> > in that inode core. We do that via xfs_inode_item_format_core(),
> > which calls:
> >
> > 	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
> 
> > to format the log inode. It writes the LSN from the inode item into
> > the log inode, and if recovery decides the inode item needs to be
> > replayed, it recovers the log inode LSN field and writes it into the
> > on disk inode LSN field.
> 
> > Now this might seem like a reasonable thing to do, but it is wrong
> > on multiple levels. Firstly, if the item is not yet in the AIL,
> > item->li_lsn is zero. i.e. the first time the inode it is logged and
> > formatted, the LSN we write into the log inode will be zero. If we
> > only log it once, recovery will run and can write this zero LSN into
> > the inode.
> 
> Assume that the following is the state before the inode is written back,
> Disk inode LSN = x
> Log inode LSN = 0
> Transaction LSN = x + 1
> 
> At this point, if the filesystem shuts down abruptly, log recovery could
> change the disk inode's LSN to zero.

Yes, that happened when we first logged a clean inode, so it was
dirty in memory but hadn't been placed in the AIL.

> > This means that the next time the inode is logged and log recovery
> > runs, it will *always* replay changes to the inode regardless of
> > whether the inode is newer on disk than the version in the log and
> > that violates the entire purpose of recording the LSN in the inode
> > at writeback time (i.e. to stop it going backwards in time on disk
> > during recovery).
> 
> After the log recovery indicated by me above, if the filesystem modifies the
> inode then the following is the state of metadata in memory and on disk,
> 
> Disk inode LSN = 0 (Due to changes made by log recovery during mount)
> Log inode LSN = 0 (xfs_log_item->li_lsn is zero until the log item is moved to
>                    the AIL in the current mount cycle)
> Transaction LSN = x + 2
>
> Now, if the filesystem shuts down abruptly once again, log recovery replays
> the contents of the Log dinode since Disk inode's LSN is less than
> transaction's LSN. In this example, the Log inode's contents were newer than
> the disk inode.

I'm not sure about the log shutdown aspects you talk about - the
test that exposed the issue wasn't doing shutdowns.  The test that
exposed this problem was generic/482, which uses dm-logwrites to
replay the filesystem up to each FUA operation, at which point it
mounts it to perform recovery and then runs reapir to check it for
consistency.

IOWs, it performs "recovery at every checkpoint in the log"
independently to determine if log recovery is doing the write
things. It tripped over recovery failing to replay inode changes
due to log/on-disk inode lsn inconsistencies like the following:

Checkpoint	log inode lsn	disk inode lsn		recovery disk inode lsn
1		0			0		0 (wrong!)
2		1			0		1 (wrong!)
3		2			0		2 (wrong!)
		<inode writeback>
					3		2 -should be 3-
		<inode reclaimed>
4		0 			3		-wont get replayed-
							-wrong!-
							-corruption!-

5		4			3		4 (wrong!)
6		5			3		5 (wrong!)
.....
25		6			3		6 (wrong!)
		<inode writeback>
					25		6 -should be 25-
26		25			25		25 -wrong!-


Do you see how the log inode LSN does not match up with the actual
LSN that is written into the inode at writeback time?

Do you see how the log inode LSN could magically go zero in the
middle of the log (Ckpt 4)? That's going to cause the inode log item
to be skipped, when it should have been replayed. That's a
corruption event.

Then if we look at checkpoint 26, where the on-disk inode LSN is 25,
and the log inode LSN is 25 - that log inode should not get replayed
because the LSNs are the same, but the old code did perform replay
and so this bug largely masked the typical off-by-one checkpoint the
log inode lsn contained.

> Your description suggests that there could be a scenario where a Log inode
> holding stale content can still overwrite the contents of the Disk inode
> holding newer content. I am unable to come with an example of how that could
> happen. Could please explain this to me.

I can't really remember. There are lots of ways using the wrong LSN
for recovery can go wrong; I was really just describing stuff I saw
in the test failures.....

> > Secondly, if we commit the CIL to the journal so the inode item
> > moves to the AIL, and then relog the inode, the LSN that gets
> > stamped into the log inode will be the LSN of the inode's current
> > location in the AIL, not it's age on disk. And it's not the LSN that
> > will be associated with the current change. That means when log
> > recovery replays this inode item, the LSN that ends up on disk is
> > the LSN for the previous changes in the log, not the current
> > changes being replayed. IOWs, after recovery the LSN on disk is not
> > in sync with the LSN of the modifications that were replayed into
> > the inode. This, again, violates the recovery ordering semantics
> > that on-disk writeback LSNs provide.

... and this is clearly evident in the example I give above.

> 
> > Hence the inode LSN in the log dinode is -always- invalid.
> 
> > Thirdly, recovery actually has the LSN of the log transaction it is
> > replaying right at hand - it uses it to determine if it should
> > replay the inode by comparing it to the on-disk inode's LSN. But it
> > doesn't use that LSN to stamp the LSN into the inode which will be
> > written back when the transaction is fully replayed. It uses the one
> > in the log dinode, which we know is always going to be incorrect.

This is also demonstrated in the example I give above.

Really, I don't remember any of the finer details of this fix now;
it was a long time ago and I don't really have the time to go back
an reconstruct it from first principles right now.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-06-14  9:13 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27  7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
2021-07-27  7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
2021-07-27  7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
2021-07-27  7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
2021-07-27  7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-27 18:50   ` Darrick J. Wong
2021-07-27  7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
2021-07-27  7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-27  7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-27  7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
2021-07-27 19:00   ` Darrick J. Wong
2023-06-13  6:20   ` Chandan Babu R
2023-06-14  9:13     ` Dave Chinner
2021-07-27  7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
2021-07-27  7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
2021-07-27  7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
2021-07-27 21:25   ` Darrick J. Wong
2021-07-27 22:13     ` Dave Chinner
2021-07-27 23:44   ` [PATCH 11/11 v2] " Dave Chinner
2021-07-29 16:28     ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).