All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chandan Babu R <chandan.babu@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, djwong@kernel.org
Subject: Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
Date: Thu, 14 Oct 2021 15:37:45 +0530	[thread overview]
Message-ID: <87sfx3537z.fsf@debian-BULLSEYE-live-builder-AMD64> (raw)
In-Reply-To: <20211014020032.GM2361455@dread.disaster.area>

On 14 Oct 2021 at 07:30, Dave Chinner wrote:
> On Wed, Oct 13, 2021 at 08:14:01PM +0530, Chandan Babu R wrote:
>> On 11 Oct 2021 at 03:19, Dave Chinner wrote:
>> > On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>> >> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>> >> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> >> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> >> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> >> >
>> >> >> 
>> >> >> Ok. The above solution looks logically correct. I haven't been able to come up
>> >> >> with a scenario where the solution wouldn't work. I will implement it and see
>> >> >> if anything breaks.
>> >> >
>> >> > I think I can poke one hole in it - I missed the fact that if we
>> >> > upgrade and inode read time, and then we modify the inode without
>> >> > modifying the inode core (can we even do that - metadata mods should
>> >> > at least change timestamps right?) then we don't log the format
>> >> > change or the NREXT64 inode flag change and they only appear in the
>> >> > on-disk inode at writeback.
>> >> >
>> >> > Log recovery needs to be checked for correct behaviour here. I think
>> >> > that if the inode is in NREXT64 format when read in and the log
>> >> > inode core is not, then the on disk LSN must be more recent than
>> >> > what is being recovered from the log and should be skipped. If
>> >> > NREXT64 is present in the log inode, then we logged the core
>> >> > properly and we just don't care what format is on disk because we
>> >> > replay it into NREXT64 format and write that back.
>> >> 
>> >> xfs_inode_item_format() logs the inode core regardless of whether
>> >> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
>> >> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
>> >> should not result in a scenario where the corresponding
>> >> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
>> >
>> > Except that log recovery might be replaying lots of indoe changes
>> > such as:
>> >
>> > log inode
>> > commit A
>> > log inode
>> > commit B
>> > log inode
>> > set NREXT64
>> > commit C
>> > writeback inode
>> > <crash before log tail moves>
>> >
>> > Recovery will then replay commit A, B and C, in which case we *must
>> > not recover the log inode* in commit A or B because the LSN in the
>> > on-disk inode points at commit C. Hence replaying A or B will result
>> > in the on-disk inode going backwards in time and hence resulting in
>> > an inconsistent state on disk until commit C is recovered.
>> >
>> >> i.e. there is no need to compare LSNs of the checkpoint
>> >> transaction being replayed and that of the disk inode.
>> >
>> > Inncorrect: we -always- have to do this, regardless of the change
>> > being made.
>> >
>> >> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
>> >> field, then we can safely conclude that the ondisk inode has to be updated to
>> >> reflect this change
>> >
>> > We can't assume that. This makes an assumption that NREXT64 is
>> > only ever a one-way transition. There's nothing in the disk format that
>> > prevents us from -removing- NREXT64 for inodes that don't need large
>> > extent counts.
>> >
>> > Yes, the -current implementation- does not allow going back to small
>> > extent counts, but the on-disk format design still needs to allow
>> > for such things to be done as we may need such functionality and
>> > flexibility in the on-disk format in the future.
>> >
>> > Hence we have to ensure that log recovery handles both set and reset
>> > transistions from the start. If we don't ensure that log recovery
>> > handles reset conditions when we first add the feature bit, then
>> > we are going to have to add a log incompat or another feature bit
>> > to stop older kernels from trying to recover reset operations.
>> >
>> 
>> Ok. I had never considered the possibility of transitioning an inode back into
>> 32-bit data fork extent count format. With this new requirement, I now
>> understand the reasoning behind comparing ondisk inode's LSN and checkpoint
>> transaction's LSN.
>> 
>> As you have mentioned earlier, comparing LSNs is required not only for the
>> change introduced in this patch, but also for any other change in value of any
>> of the inode's fields. Without such a comparison, the inode can temporarily
>> end up being in an inconsistent state during log replay.
>> 
>> To that end, The following code snippet from xlog_recover_inode_commit_pass2()
>> skips playing back xfs_log_dinode entries when ondisk inode's LSN is greater
>> than checkpoint transaction's LSN,
>> 
>>         if (dip->di_version >= 3) {
>>                 xfs_lsn_t       lsn = be64_to_cpu(dip->di_lsn);
>> 
>>                 if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>>                         trace_xfs_log_recover_inode_skip(log, in_f);
>>                         error = 0;
>>                         goto out_owner_change;
>>                 }
>>         }
>> 
>> 
>> However, if the commits in the sequence below belong to three different
>> checkpoint transactions having the same LSN,
>> 
>> log inode
>> commit A
>> log inode
>> commit B
>> set NREXT64
>> log inode
>> commit C
>> writeback inode
>> <crash before log tail moves>
>> 
>> Then the above code snippet won't prevent an inode from becoming temporarily
>> inconsistent due to commits A and B being replayed.
>
> Ah, this is a very special corner case.  You snipped out the most
> important part of the comment above that code:
>
> 	/*
>          * If the inode has an LSN in it, recover the inode only if the on-disk
>          * inode's LSN is older than the lsn of the transaction we are
>          * replaying. We can have multiple checkpoints with the same start LSN,
>          * so the current LSN being equal to the on-disk LSN doesn't necessarily
>          * mean that the on-disk inode is more recent than the change being
>          * replayed.
> ....
>
> This is exactly the situation you are asking about here - what
> happens in recovery when the LSNs are the same and there are
> multiple checkpoints with the same LSN.
>
> The first thing to understand here is "how do we get checkpoints
> with the same LSN" and then understand what it implies.
>
> We get checkpoints with the same start/commit LSNs when multiple
> checkpoints are written in the same iclog. The start/commit LSNs are
> determined by the LSN of the iclog they are written in, and hence if
> they are the same they were written to the journal in a single
> "atomic" IO.
>
> I say "atomic" because it's not an atomic IO at the hardware level.
> It's atomic in that the entire iclog is protected by a CRC and hence
> if the CRC check for the iclog passes at recovery, then the iclog write has been
> recovered intact. If the write was torn, misdirected
> or some other physical media failure occurred, then we don't
> recovery the iclog at all. IOWs, none of the changes in the iclog
> are recovered. IOWs, we have atomic "all or nothing" iclog recovery
> semantics.
>
> Next, the fact that the inode has been written back and is up to
> date on disk means that the iclog is entirely on stable storage.
> The inode isn't unpinned until the flush/FUA associtate with the
> iclog was completed, which happens before the iclog IO is completed
> and the callbacks to unpin the inode are run. Hence ordering tells
> us the entire iclog is on disk and should be recovered.
>
> What this really means is that we cannot possibly see the
> intermediate commit A or commit B states on disk at runtime or
> before recovery is run. The metadata is not unpinned until the iclog
> that also contains commit C is written to the journal. Hence from
> the POV of the on-disk inode, we go from the original version to
> commit C in one step and we never, ever see A or B as intermediate
> states. IOWs, the iclog contents defines old -> C as an atomic
> on-disk modification, even though the contents are spread across
> multiple checkpoints.[1]
>
> Hence in this specific case, we have 3 individual modifications to
> the inode and it's related metadata sitting in the journal waiting
> for log recovery to replay them as an atomic unit. They will all get
> recovered, and each change that is replayed will be internally
> consistent. Therefore, after replaying commit A, the inode and it's
> metadata will be reverted to whatever was in that commit and it will
> be consistent in that context. Then replay of commit B and commit C
> bring it back up to being up to date on disk and providing the step
> change from old -> C as the runtime code would have also done.
>
> Hence at the end of replay, the inode and all it's related metadata
> will be consistent with commit C and so so this special transient
> corner case should resolve itself correctly (at least, as far as my
> poor dumb brain can reason about it being correct).
>

Thanks for the detailed explaination. I had figured out that multiple
checkpoints can end up having the same LSN because they were written to the
same iclog. The value of cil->xc_push_commit_stable is one of the things that
determine if the iclog is supposed to be flushed or not just after writing the
contents of a CIL context into it.

However the "atomic replay" behaviour had not occured to me.

>> To handle this, we should
>> probably go with the additional rule of "Replay log inode if both the log
>> inode and the ondisk inode have the same value for NREXT64 bit".
>
> No, we do not want case specific logic in recovery code like this
> because inode core updates are simply overwrites. As long as the
> overwrites are all replayed from A to C, we end up with the correct
> result of an "atomic" step change from old to C on disk...
>

W.r.t processing per-inode NREXT64 bit status during log recovery, I think we
can depend on the LSN comparison that is already implemented in
xlog_recover_inode_commit_pass2() to skip checkpoint transactions (with
different LSNs) which can make an ondisk inode enter an inconsistent state.


> Cheers,
>
> Dave.
>
> [1] There's more really subtle, complex details around start LSN vs
> commit LSN ordering with AIL, iclog and recovery LSNs and how to
> treat same start/different commit LSNs, different start/same commit
> LSNs, etc, but that's way beyond the scope of what is needed to be
> understood here. These play into why we replay all the changes at
> the same LSN as per above rather than skip them. Commit 32baa63d82ee
> ("xfs: logging the on disk inode LSN can make it go backwards")
> might give you some more insight into the complexities here.

Thanks for the commit ID. I will add this to my list of items to read.

-- 
chandan

  reply	other threads:[~2021-10-14 10:08 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16 Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
2021-09-27 22:46   ` Dave Chinner
2021-09-28  9:46     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
2021-09-30  1:19   ` Dave Chinner
2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
2021-09-27 23:46   ` Dave Chinner
2021-09-28  4:04     ` Dave Chinner
2021-09-29 17:03       ` Chandan Babu R
2021-09-30  0:40         ` Dave Chinner
2021-09-30  4:31           ` Dave Chinner
2021-09-30  7:30             ` Chandan Babu R
2021-09-30 22:55               ` Dave Chinner
2021-10-07 10:52                 ` Chandan Babu R
2021-10-10 21:49                   ` Dave Chinner
2021-10-13 14:44                     ` Chandan Babu R
2021-10-14  2:00                       ` Dave Chinner
2021-10-14 10:07                         ` Chandan Babu R [this message]
2021-10-21 10:27                       ` Chandan Babu R
2021-09-28  9:47     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
2021-09-28  0:47   ` Dave Chinner
2021-09-28  9:47     ` Chandan Babu R
2021-09-28 23:08       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
2021-09-27 23:06   ` Dave Chinner
2021-09-28  9:49     ` Chandan Babu R
2021-09-28 23:39       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 10/12] xfs: Extend per-inode extent counter widths Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
2021-09-28  0:33   ` Dave Chinner
2021-09-28 10:07     ` Chandan Babu R
2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
2021-09-18  3:36   ` [External] : " Chandan Babu R

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87sfx3537z.fsf@debian-BULLSEYE-live-builder-AMD64 \
    --to=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.