From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards
Date: Tue, 27 Jul 2021 12:00:59 -0700 [thread overview]
Message-ID: <20210727190059.GG559212@magnolia> (raw)
In-Reply-To: <20210727071012.3358033-9-david@fromorbit.com>
On Tue, Jul 27, 2021 at 05:10:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When we log an inode, we format the "log inode" core and set an LSN
> in that inode core. We do that via xfs_inode_item_format_core(),
> which calls:
>
> xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
>
> to format the log inode. It writes the LSN from the inode item into
> the log inode, and if recovery decides the inode item needs to be
> replayed, it recovers the log inode LSN field and writes it into the
> on disk inode LSN field.
>
> Now this might seem like a reasonable thing to do, but it is wrong
> on multiple levels. Firstly, if the item is not yet in the AIL,
> item->li_lsn is zero. i.e. the first time the inode it is logged and
> formatted, the LSN we write into the log inode will be zero. If we
> only log it once, recovery will run and can write this zero LSN into
> the inode.
>
> This means that the next time the inode is logged and log recovery
> runs, it will *always* replay changes to the inode regardless of
> whether the inode is newer on disk than the version in the log and
> that violates the entire purpose of recording the LSN in the inode
> at writeback time (i.e. to stop it going backwards in time on disk
> during recovery).
>
> Secondly, if we commit the CIL to the journal so the inode item
> moves to the AIL, and then relog the inode, the LSN that gets
> stamped into the log inode will be the LSN of the inode's current
> location in the AIL, not it's age on disk. And it's not the LSN that
> will be associated with the current change. That means when log
> recovery replays this inode item, the LSN that ends up on disk is
> the LSN for the previous changes in the log, not the current
> changes being replayed. IOWs, after recovery the LSN on disk is not
> in sync with the LSN of the modifications that were replayed into
> the inode. This, again, violates the recovery ordering semantics
> that on-disk writeback LSNs provide.
>
> Hence the inode LSN in the log dinode is -always- invalid.
>
> Thirdly, recovery actually has the LSN of the log transaction it is
> replaying right at hand - it uses it to determine if it should
> replay the inode by comparing it to the on-disk inode's LSN. But it
> doesn't use that LSN to stamp the LSN into the inode which will be
> written back when the transaction is fully replayed. It uses the one
> in the log dinode, which we know is always going to be incorrect.
>
> Looking back at the change history, the inode logging was broken by
> commit 93f958f9c41f ("xfs: cull unnecessary icdinode fields") way
> back in 2016 by a stupid idiot who thought he knew how this code
> worked. i.e. me. That commit replaced an in memory di_lsn field that
> was updated only at inode writeback time from the inode item.li_lsn
> value - and hence always contained the same LSN that appeared in the
> on-disk inode - with a read of the inode item LSN at inode format
> time. CLearly these are not the same thing.
>
> Before 93f958f9c41f, the log recovery behaviour was irrelevant,
> because the LSN in the log inode always matched the on-disk LSN at
> the time the inode was logged, hence recovery of the transaction
> would never make the on-disk LSN in the inode go backwards or get
> out of sync.
>
> A symptom of the problem is this, caught from a failure of
> generic/482. Before log recovery, the inode has been allocated but
> never used:
>
> xfs_db> inode 393388
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0
> ....
> v3.crc = 0x99126961 (correct)
> v3.change_count = 0
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jan 1 10:00:00 1970
> v3.crtime.nsec = 0
>
> After log recovery:
>
> xfs_db> p
> core.magic = 0x494e
> core.mode = 020444
> ....
> v3.crc = 0x23e68f23 (correct)
> v3.change_count = 2
> v3.lsn = 0
> v3.flags2 = 0
> v3.cowextsize = 0
> v3.crtime.sec = Thu Jul 22 17:03:03 2021
> v3.crtime.nsec = 751000000
> ...
>
> You can see that the LSN of the on-disk inode is 0, even though it
> clearly has been written to disk. I point out this inode, because
> the generic/482 failure occurred because several adjacent inodes in
> this specific inode cluster were not replayed correctly and still
> appeared to be zero on disk when all the other metadata (inobt,
> finobt, directories, etc) indicated they should be allocated and
> written back.
>
> The fix for this is two-fold. The first is that we need to either
> revert the LSN changes in 93f958f9c41f or stop logging the inode LSN
> altogether. If we do the former, log recovery does not need to
> change but we add 8 bytes of memory per inode to store what is
> largely a write-only inode field. If we do the latter, log recovery
> needs to stamp the on-disk inode in the same manner that inode
> writeback does.
>
> I prefer the latter, because we shouldn't really be trying to log
> and replay changes to the on disk LSN as the on-disk value is the
> canonical source of the on-disk version of the inode. It also
> matches the way we recover buffer items - we create a buf_log_item
> that carries the current recovery transaction LSN that gets stamped
> into the buffer by the write verifier when it gets written back
> when the transaction is fully recovered.
>
> However, this might break log recovery on older kernels even more,
> so I'm going to simply ignore the logged value in recovery and stamp
> the on-disk inode with the LSN of the transaction being recovered
> that will trigger writeback on transaction recovery completion. This
> will ensure that the on-disk inode LSN always reflects the LSN of
> the last change that was written to disk, regardless of whether it
> comes from log recovery or runtime writeback.
>
> Fixes: 93f958f9c41f ("xfs: cull unnecessary icdinode fields")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/libxfs/xfs_log_format.h | 11 +++++++++-
> fs/xfs/xfs_inode_item_recover.c | 39 ++++++++++++++++++++++++---------
> 2 files changed, 39 insertions(+), 11 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index d548ea4b6aab..2c5bcbc19264 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -411,7 +411,16 @@ struct xfs_log_dinode {
> /* start of the extended dinode, writable fields */
> uint32_t di_crc; /* CRC of the inode */
> uint64_t di_changecount; /* number of attribute changes */
> - xfs_lsn_t di_lsn; /* flush sequence */
> +
> + /*
> + * The LSN we write to this field during formatting is not a reflection
> + * of the current on-disk LSN. It should never be used for recovery
> + * sequencing, nor should it be recovered into the on-disk inode at all.
> + * See xlog_recover_inode_commit_pass2() and xfs_log_dinode_to_disk()
> + * for details.
> + */
> + xfs_lsn_t di_lsn;
I wonder if we should change the name of this to di_lsn_DONOTUSE? The
net effect of this patch is that xfs_log_dinode.di_lsn is written in the
same questionable manner that it always has been for the sake of not
breaking recovery on old kernels; but from now on we're never going to
read it ever again, not even in log recovery.
The AIL will flush the inode cluster (and stamp the /correct/ LSN into
the cluster buffer), and recovery will now compare the ondisk LSN
against the LSN of the recovered transaction to decide if it should skip
the update.
If I got that right,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
--D
> +
> uint64_t di_flags2; /* more random flags */
> uint32_t di_cowextsize; /* basic cow extent size for file */
> uint8_t di_pad2[12]; /* more padding for future expansion */
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 7b79518b6c20..e0072a6cd2d3 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -145,7 +145,8 @@ xfs_log_dinode_to_disk_ts(
> STATIC void
> xfs_log_dinode_to_disk(
> struct xfs_log_dinode *from,
> - struct xfs_dinode *to)
> + struct xfs_dinode *to,
> + xfs_lsn_t lsn)
> {
> to->di_magic = cpu_to_be16(from->di_magic);
> to->di_mode = cpu_to_be16(from->di_mode);
> @@ -182,7 +183,7 @@ xfs_log_dinode_to_disk(
> to->di_flags2 = cpu_to_be64(from->di_flags2);
> to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> to->di_ino = cpu_to_be64(from->di_ino);
> - to->di_lsn = cpu_to_be64(from->di_lsn);
> + to->di_lsn = cpu_to_be64(lsn);
> memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> uuid_copy(&to->di_uuid, &from->di_uuid);
> to->di_flushiter = 0;
> @@ -261,16 +262,25 @@ xlog_recover_inode_commit_pass2(
> }
>
> /*
> - * If the inode has an LSN in it, recover the inode only if it's less
> - * than the lsn of the transaction we are replaying. Note: we still
> - * need to replay an owner change even though the inode is more recent
> - * than the transaction as there is no guarantee that all the btree
> - * blocks are more recent than this transaction, too.
> + * If the inode has an LSN in it, recover the inode only if the on-disk
> + * inode's LSN is older than the lsn of the transaction we are
> + * replaying. We can have multiple checkpoints with the same start LSN,
> + * so the current LSN being equal to the on-disk LSN doesn't necessarily
> + * mean that the on-disk inode is more recent than the change being
> + * replayed.
> + *
> + * We must check the current_lsn against the on-disk inode
> + * here because the we can't trust the log dinode to contain a valid LSN
> + * (see comment below before replaying the log dinode for details).
> + *
> + * Note: we still need to replay an owner change even though the inode
> + * is more recent than the transaction as there is no guarantee that all
> + * the btree blocks are more recent than this transaction, too.
> */
> if (dip->di_version >= 3) {
> xfs_lsn_t lsn = be64_to_cpu(dip->di_lsn);
>
> - if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
> + if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
> trace_xfs_log_recover_inode_skip(log, in_f);
> error = 0;
> goto out_owner_change;
> @@ -368,8 +378,17 @@ xlog_recover_inode_commit_pass2(
> goto out_release;
> }
>
> - /* recover the log dinode inode into the on disk inode */
> - xfs_log_dinode_to_disk(ldip, dip);
> + /*
> + * Recover the log dinode inode into the on disk inode.
> + *
> + * The LSN in the log dinode is garbage - it can be zero or reflect
> + * stale in-memory runtime state that isn't coherent with the changes
> + * logged in this transaction or the changes written to the on-disk
> + * inode. Hence we write the current lSN into the inode because that
> + * matches what xfs_iflush() would write inode the inode when flushing
> + * the changes in this transaction.
> + */
> + xfs_log_dinode_to_disk(ldip, dip, current_lsn);
>
> fields = in_f->ilf_fields;
> if (fields & XFS_ILOG_DEV)
> --
> 2.31.1
>
next prev parent reply other threads:[~2021-07-27 19:01 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-07-27 7:10 [PATCH 0/11 v3] xfs: fix log cache flush regressions and bugs Dave Chinner
2021-07-27 7:10 ` [PATCH 01/11] xfs: flush data dev on external log write Dave Chinner
2021-07-27 7:10 ` [PATCH 02/11] xfs: external logs need to flush data device Dave Chinner
2021-07-27 7:10 ` [PATCH 03/11] xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog Dave Chinner
2021-07-27 7:10 ` [PATCH 04/11] xfs: fix ordering violation between cache flushes and tail updates Dave Chinner
2021-07-27 18:50 ` Darrick J. Wong
2021-07-27 7:10 ` [PATCH 05/11] xfs: factor out forced iclog flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 06/11] xfs: log forces imply data device cache flushes Dave Chinner
2021-07-27 7:10 ` [PATCH 07/11] xfs: avoid unnecessary waits in xfs_log_force_lsn() Dave Chinner
2021-07-27 7:10 ` [PATCH 08/11] xfs: logging the on disk inode LSN can make it go backwards Dave Chinner
2021-07-27 19:00 ` Darrick J. Wong [this message]
2023-06-13 6:20 ` Chandan Babu R
2023-06-14 9:13 ` Dave Chinner
2021-07-27 7:10 ` [PATCH 09/11] xfs: Enforce attr3 buffer recovery order Dave Chinner
2021-07-27 7:10 ` [PATCH 10/11] xfs: need to see iclog flags in tracing Dave Chinner
2021-07-27 7:10 ` [PATCH 11/11] xfs: limit iclog tail updates Dave Chinner
2021-07-27 21:25 ` Darrick J. Wong
2021-07-27 22:13 ` Dave Chinner
2021-07-27 23:44 ` [PATCH 11/11 v2] " Dave Chinner
2021-07-29 16:28 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210727190059.GG559212@magnolia \
--to=djwong@kernel.org \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).