Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width

From: Chandan Babu R <chandan.babu@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, djwong@kernel.org
Subject: Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
Date: Thu, 14 Oct 2021 15:37:45 +0530	[thread overview]
Message-ID: <87sfx3537z.fsf@debian-BULLSEYE-live-builder-AMD64> (raw)
In-Reply-To: <20211014020032.GM2361455@dread.disaster.area>

On 14 Oct 2021 at 07:30, Dave Chinner wrote:
> On Wed, Oct 13, 2021 at 08:14:01PM +0530, Chandan Babu R wrote:
>> On 11 Oct 2021 at 03:19, Dave Chinner wrote:
>> > On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>> >> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>> >> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> >> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> >> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> >> >
>> >> >> 
>> >> >> Ok. The above solution looks logically correct. I haven't been able to come up
>> >> >> with a scenario where the solution wouldn't work. I will implement it and see
>> >> >> if anything breaks.
>> >> >
>> >> > I think I can poke one hole in it - I missed the fact that if we
>> >> > upgrade and inode read time, and then we modify the inode without
>> >> > modifying the inode core (can we even do that - metadata mods should
>> >> > at least change timestamps right?) then we don't log the format
>> >> > change or the NREXT64 inode flag change and they only appear in the
>> >> > on-disk inode at writeback.
>> >> >
>> >> > Log recovery needs to be checked for correct behaviour here. I think
>> >> > that if the inode is in NREXT64 format when read in and the log
>> >> > inode core is not, then the on disk LSN must be more recent than
>> >> > what is being recovered from the log and should be skipped. If
>> >> > NREXT64 is present in the log inode, then we logged the core
>> >> > properly and we just don't care what format is on disk because we
>> >> > replay it into NREXT64 format and write that back.
>> >> 
>> >> xfs_inode_item_format() logs the inode core regardless of whether
>> >> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
>> >> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
>> >> should not result in a scenario where the corresponding
>> >> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
>> >
>> > Except that log recovery might be replaying lots of indoe changes
>> > such as:
>> >
>> > log inode
>> > commit A
>> > log inode
>> > commit B
>> > log inode
>> > set NREXT64
>> > commit C
>> > writeback inode
>> > <crash before log tail moves>
>> >
>> > Recovery will then replay commit A, B and C, in which case we *must
>> > not recover the log inode* in commit A or B because the LSN in the
>> > on-disk inode points at commit C. Hence replaying A or B will result
>> > in the on-disk inode going backwards in time and hence resulting in
>> > an inconsistent state on disk until commit C is recovered.
>> >
>> >> i.e. there is no need to compare LSNs of the checkpoint
>> >> transaction being replayed and that of the disk inode.
>> >
>> > Inncorrect: we -always- have to do this, regardless of the change
>> > being made.
>> >
>> >> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
>> >> field, then we can safely conclude that the ondisk inode has to be updated to
>> >> reflect this change
>> >
>> > We can't assume that. This makes an assumption that NREXT64 is
>> > only ever a one-way transition. There's nothing in the disk format that
>> > prevents us from -removing- NREXT64 for inodes that don't need large
>> > extent counts.
>> >
>> > Yes, the -current implementation- does not allow going back to small
>> > extent counts, but the on-disk format design still needs to allow
>> > for such things to be done as we may need such functionality and
>> > flexibility in the on-disk format in the future.
>> >
>> > Hence we have to ensure that log recovery handles both set and reset
>> > transistions from the start. If we don't ensure that log recovery
>> > handles reset conditions when we first add the feature bit, then
>> > we are going to have to add a log incompat or another feature bit
>> > to stop older kernels from trying to recover reset operations.
>> >
>> 
>> Ok. I had never considered the possibility of transitioning an inode back into
>> 32-bit data fork extent count format. With this new requirement, I now
>> understand the reasoning behind comparing ondisk inode's LSN and checkpoint
>> transaction's LSN.
>> 
>> As you have mentioned earlier, comparing LSNs is required not only for the
>> change introduced in this patch, but also for any other change in value of any
>> of the inode's fields. Without such a comparison, the inode can temporarily
>> end up being in an inconsistent state during log replay.
>> 
>> To that end, The following code snippet from xlog_recover_inode_commit_pass2()
>> skips playing back xfs_log_dinode entries when ondisk inode's LSN is greater
>> than checkpoint transaction's LSN,
>> 
>>         if (dip->di_version >= 3) {
>>                 xfs_lsn_t       lsn = be64_to_cpu(dip->di_lsn);
>> 
>>                 if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>>                         trace_xfs_log_recover_inode_skip(log, in_f);
>>                         error = 0;
>>                         goto out_owner_change;
>>                 }
>>         }
>> 
>> 
>> However, if the commits in the sequence below belong to three different
>> checkpoint transactions having the same LSN,
>> 
>> log inode
>> commit A
>> log inode
>> commit B
>> set NREXT64
>> log inode
>> commit C
>> writeback inode
>> <crash before log tail moves>
>> 
>> Then the above code snippet won't prevent an inode from becoming temporarily
>> inconsistent due to commits A and B being replayed.
>
> Ah, this is a very special corner case.  You snipped out the most
> important part of the comment above that code:
>
> 	/*
>          * If the inode has an LSN in it, recover the inode only if the on-disk
>          * inode's LSN is older than the lsn of the transaction we are
>          * replaying. We can have multiple checkpoints with the same start LSN,
>          * so the current LSN being equal to the on-disk LSN doesn't necessarily
>          * mean that the on-disk inode is more recent than the change being
>          * replayed.
> ....
>
> This is exactly the situation you are asking about here - what
> happens in recovery when the LSNs are the same and there are
> multiple checkpoints with the same LSN.
>
> The first thing to understand here is "how do we get checkpoints
> with the same LSN" and then understand what it implies.
>
> We get checkpoints with the same start/commit LSNs when multiple
> checkpoints are written in the same iclog. The start/commit LSNs are
> determined by the LSN of the iclog they are written in, and hence if
> they are the same they were written to the journal in a single
> "atomic" IO.
>
> I say "atomic" because it's not an atomic IO at the hardware level.
> It's atomic in that the entire iclog is protected by a CRC and hence
> if the CRC check for the iclog passes at recovery, then the iclog write has been
> recovered intact. If the write was torn, misdirected
> or some other physical media failure occurred, then we don't
> recovery the iclog at all. IOWs, none of the changes in the iclog
> are recovered. IOWs, we have atomic "all or nothing" iclog recovery
> semantics.
>
> Next, the fact that the inode has been written back and is up to
> date on disk means that the iclog is entirely on stable storage.
> The inode isn't unpinned until the flush/FUA associtate with the
> iclog was completed, which happens before the iclog IO is completed
> and the callbacks to unpin the inode are run. Hence ordering tells
> us the entire iclog is on disk and should be recovered.
>
> What this really means is that we cannot possibly see the
> intermediate commit A or commit B states on disk at runtime or
> before recovery is run. The metadata is not unpinned until the iclog
> that also contains commit C is written to the journal. Hence from
> the POV of the on-disk inode, we go from the original version to
> commit C in one step and we never, ever see A or B as intermediate
> states. IOWs, the iclog contents defines old -> C as an atomic
> on-disk modification, even though the contents are spread across
> multiple checkpoints.[1]
>
> Hence in this specific case, we have 3 individual modifications to
> the inode and it's related metadata sitting in the journal waiting
> for log recovery to replay them as an atomic unit. They will all get
> recovered, and each change that is replayed will be internally
> consistent. Therefore, after replaying commit A, the inode and it's
> metadata will be reverted to whatever was in that commit and it will
> be consistent in that context. Then replay of commit B and commit C
> bring it back up to being up to date on disk and providing the step
> change from old -> C as the runtime code would have also done.
>
> Hence at the end of replay, the inode and all it's related metadata
> will be consistent with commit C and so so this special transient
> corner case should resolve itself correctly (at least, as far as my
> poor dumb brain can reason about it being correct).
>

Thanks for the detailed explaination. I had figured out that multiple
checkpoints can end up having the same LSN because they were written to the
same iclog. The value of cil->xc_push_commit_stable is one of the things that
determine if the iclog is supposed to be flushed or not just after writing the
contents of a CIL context into it.

However the "atomic replay" behaviour had not occured to me.

>> To handle this, we should
>> probably go with the additional rule of "Replay log inode if both the log
>> inode and the ondisk inode have the same value for NREXT64 bit".
>
> No, we do not want case specific logic in recovery code like this
> because inode core updates are simply overwrites. As long as the
> overwrites are all replayed from A to C, we end up with the correct
> result of an "atomic" step change from old to C on disk...
>

W.r.t processing per-inode NREXT64 bit status during log recovery, I think we
can depend on the LSN comparison that is already implemented in
xlog_recover_inode_commit_pass2() to skip checkpoint transactions (with
different LSNs) which can make an ondisk inode enter an inconsistent state.

> Cheers,
>
> Dave.
>
> [1] There's more really subtle, complex details around start LSN vs
> commit LSN ordering with AIL, iclog and recovery LSNs and how to
> treat same start/different commit LSNs, different start/same commit
> LSNs, etc, but that's way beyond the scope of what is needed to be
> understood here. These play into why we replay all the changes at
> the same LSN as per above rather than skip them. Commit 32baa63d82ee
> ("xfs: logging the on disk inode LSN can make it go backwards")
> might give you some more insight into the complexities here.

Thanks for the commit ID. I will add this to my list of items to read.

-- 
chandan