From: "Darrick J. Wong" <djwong@kernel.org>
To: sandeen@sandeen.net, djwong@kernel.org
Cc: Dave Chinner <dchinner@redhat.com>, linux-xfs@vger.kernel.org
Subject: [PATCH 59/61] xfs: logging the on disk inode LSN can make it go backwards
Date: Wed, 15 Sep 2021 16:11:56 -0700
Message-ID: <163174751688.350433.9364841513650368999.stgit@magnolia>
In-Reply-To: <163174719429.350433.8562606396437219220.stgit@magnolia>

From: Dave Chinner <dchinner@redhat.com>

Source kernel commit: 32baa63d82ee3f5ab3bd51bae6bf7d1c15aed8c7

When we log an inode, we format the "log inode" core and set an LSN
in that inode core. We do that via xfs_inode_item_format_core(),
which calls:

xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);

to format the log inode. It writes the LSN from the inode item into
the log inode, and if recovery decides the inode item needs to be
replayed, it recovers the log inode LSN field and writes it into the
on disk inode LSN field.

Now this might seem like a reasonable thing to do, but it is wrong
on multiple levels. Firstly, if the item is not yet in the AIL,
item->li_lsn is zero. i.e. the first time the inode is logged and
formatted, the LSN we write into the log inode will be zero. If we
only log it once, recovery will run and can write this zero LSN into
the inode.

This means that the next time the inode is logged and log recovery
runs, it will *always* replay changes to the inode regardless of
whether the inode is newer on disk than the version in the log and
that violates the entire purpose of recording the LSN in the inode
at writeback time (i.e. to stop it going backwards in time on disk
during recovery).
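
To make the consequence of a zero on-disk LSN concrete, here is a minimal
sketch of the kind of comparison recovery performs. It is a simplified
model, not the kernel code: lsn_t, make_lsn() and should_replay() are
hypothetical names, and LSNs are compared as plain integers on the
assumption that the cycle number occupies the high 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for xfs_lsn_t: log cycle in the high 32 bits. */
typedef int64_t lsn_t;

/* Pack a log cycle number and a block offset into one LSN value. */
static inline lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/*
 * Simplified replay decision: skip the logged change only when the
 * on-disk inode carries a non-zero LSN that is newer than the LSN of
 * the transaction being recovered.
 */
static inline bool should_replay(lsn_t ondisk_lsn, lsn_t trans_lsn)
{
	if (ondisk_lsn != 0 && ondisk_lsn > trans_lsn)
		return false;	/* on-disk inode is newer, leave it alone */
	return true;		/* zero or older LSN: replay the change */
}
```

In this model an on-disk LSN of zero can never satisfy the "newer than the
transaction" test, so should_replay() returns true for every transaction
LSN, which is the "always replay" behaviour described above.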

Secondly, if we commit the CIL to the journal so the inode item
moves to the AIL, and then relog the inode, the LSN that gets
stamped into the log inode will be the LSN of the inode's current
location in the AIL, not its age on disk. And it's not the LSN that
will be associated with the current change. That means when log
recovery replays this inode item, the LSN that ends up on disk is
the LSN for the previous changes in the log, not the current
changes being replayed. IOWs, after recovery the LSN on disk is not
in sync with the LSN of the modifications that were replayed into
the inode. This, again, violates the recovery ordering semantics
that on-disk writeback LSNs provide.

Hence the inode LSN in the log dinode is -always- invalid.
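
The two cases above can be sketched with a toy model of formatting and
checkpoint commit. Everything here is an illustrative assumption: the
toy_* structures and functions are hypothetical and keep only the LSN
field, and the LSN values used with them are arbitrary integers.

```c
#include <stdint.h>

typedef int64_t lsn_t;	/* hypothetical stand-in for xfs_lsn_t */

/* Toy log item: li_lsn stays 0 until the item is inserted into the AIL. */
struct toy_log_item {
	lsn_t	li_lsn;
};

/* Toy log inode core: only the LSN field matters here. */
struct toy_log_dinode {
	lsn_t	di_lsn;
};

/* Formatting copies whatever LSN the item currently holds. */
static inline void toy_format_inode(const struct toy_log_item *item,
				    struct toy_log_dinode *dic)
{
	dic->di_lsn = item->li_lsn;
}

/* A checkpoint commit moves the item to the AIL at commit_lsn. */
static inline void toy_commit_to_ail(struct toy_log_item *item,
				     lsn_t commit_lsn)
{
	item->li_lsn = commit_lsn;
}
```

Formatting a fresh item records 0. Formatting it again after a commit at
LSN 100 records 100, even though that second change will itself be
committed at some later LSN, so the logged value always lags the change
it travels with.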

Thirdly, recovery actually has the LSN of the log transaction it is
replaying right at hand - it uses it to determine if it should
replay the inode by comparing it to the on-disk inode's LSN. But it
doesn't use that LSN to stamp the LSN into the inode which will be
written back when the transaction is fully replayed. It uses the one
in the log dinode, which we know is always going to be incorrect.

Looking back at the change history, the inode logging was broken
back in 2016 by a stupid idiot who thought he knew how this code
worked. i.e. me. That commit replaced an in memory di_lsn field that
was updated only at inode writeback time from the inode item.li_lsn
value - and hence always contained the same LSN that appeared in the
on-disk inode - with a read of the inode item LSN at inode format
time. Clearly these are not the same thing.

Before 93f958f9c41f, the log recovery behaviour was irrelevant,
because the LSN in the log inode always matched the on-disk LSN at
the time the inode was logged, hence recovery of the transaction
would never make the on-disk LSN in the inode go backwards or get
out of sync.

A symptom of the problem is this, caught from a failure of
generic/482. Before log recovery, the inode has been allocated but
never used:

xfs_db> inode 393388
xfs_db> p
core.magic = 0x494e
core.mode = 0
....
v3.crc = 0x99126961 (correct)
v3.change_count = 0
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jan  1 10:00:00 1970
v3.crtime.nsec = 0

After log recovery:

xfs_db> p
core.magic = 0x494e
core.mode = 020444
....
v3.crc = 0x23e68f23 (correct)
v3.change_count = 2
v3.lsn = 0
v3.flags2 = 0
v3.cowextsize = 0
v3.crtime.sec = Thu Jul 22 17:03:03 2021
v3.crtime.nsec = 751000000
...

You can see that the LSN of the on-disk inode is 0, even though it
clearly has been written to disk. I point out this inode, because
the generic/482 failure occurred because several adjacent inodes in
this specific inode cluster were not replayed correctly and still
appeared to be zero on disk when all the other metadata (inobt,
finobt, directories, etc) indicated they should be allocated and
written back.

The fix for this is two-fold. The first is that we need to either
revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
altogether. If we do the former, log recovery does not need to
change but we add 8 bytes of memory per inode to store what is
largely a write-only inode field. If we do the latter, log recovery
needs to stamp the on-disk inode in the same manner that inode
writeback does.

I prefer the latter, because we shouldn't really be trying to log
and replay changes to the on disk LSN as the on-disk value is the
canonical source of the on-disk version of the inode. It also
matches the way we recover buffer items - we create a buf_log_item
that carries the current recovery transaction LSN that gets stamped
into the buffer by the write verifier when it gets written back
when the transaction is fully recovered.

However, this might break log recovery on older kernels even more,
so I'm going to simply ignore the logged value in recovery and stamp
the on-disk inode with the LSN of the transaction being recovered
that will trigger writeback on transaction recovery completion. This
will ensure that the on-disk inode LSN always reflects the LSN of
the last change that was written to disk, regardless of whether it
comes from log recovery or runtime writeback.
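
As a rough illustration of the chosen approach, the sketch below copies
the logged contents into the on-disk inode but takes the LSN from the
transaction being recovered. It is not the kernel implementation: the
toy_* names are hypothetical and a single di_size field stands in for
the rest of the inode core.

```c
#include <stdint.h>

typedef int64_t lsn_t;	/* hypothetical stand-in for xfs_lsn_t */

/* Toy logged inode: di_lsn is whatever formatting happened to record. */
struct toy_logged_inode {
	uint64_t	di_size;
	lsn_t		di_lsn;
};

/* Toy on-disk inode: di_lsn should reflect the last change written. */
struct toy_ondisk_inode {
	uint64_t	di_size;
	lsn_t		di_lsn;
};

/*
 * Recover the logged contents, but ignore the logged LSN and stamp the
 * LSN of the transaction being recovered instead.
 */
static inline void toy_recover_inode(const struct toy_logged_inode *from,
				     struct toy_ondisk_inode *to,
				     lsn_t trans_lsn)
{
	to->di_size = from->di_size;
	to->di_lsn = trans_lsn;	/* from->di_lsn is deliberately unused */
}
```

With this shape, a logged inode carrying di_lsn == 0 that is recovered
from a transaction at LSN 300 ends up on disk with di_lsn == 300, so a
later recovery pass can order it correctly against older transactions.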

Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_log_format.h |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)


diff --git a/libxfs/xfs_log_format.h b/libxfs/xfs_log_format.h
index d548ea4b..2c5bcbc1 100644
--- a/libxfs/xfs_log_format.h
+++ b/libxfs/xfs_log_format.h
@@ -411,7 +411,16 @@ struct xfs_log_dinode {
 	/* start of the extended dinode, writable fields */
 	uint32_t	di_crc;		/* CRC of the inode */
 	uint64_t	di_changecount;	/* number of attribute changes */
-	xfs_lsn_t	di_lsn;		/* flush sequence */
+
+	/*
+	 * The LSN we write to this field during formatting is not a reflection
+	 * of the current on-disk LSN. It should never be used for recovery
+	 * sequencing, nor should it be recovered into the on-disk inode at all.
+	 * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
+	 * for details.
+	 */
+	xfs_lsn_t	di_lsn;
+
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
 	uint8_t		di_pad2[12];	/* more padding for future expansion */


