linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Chandan Babu R <chandan.babu@oracle.com>
Cc: linux-xfs@vger.kernel.org, djwong@kernel.org
Subject: Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
Date: Mon, 11 Oct 2021 08:49:07 +1100	[thread overview]
Message-ID: <20211010214907.GK54211@dread.disaster.area> (raw)
In-Reply-To: <87pmshrtsm.fsf@debian-BULLSEYE-live-builder-AMD64>

On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
> >> >
> >> 
> >> Ok. The above solution looks logically correct. I haven't been able to come up
> >> with a scenario where the solution wouldn't work. I will implement it and see
> >> if anything breaks.
> >
> > I think I can poke one hole in it - I missed the fact that if we
> > upgrade and inode read time, and then we modify the inode without
> > modifying the inode core (can we even do that - metadata mods should
> > at least change timestamps right?) then we don't log the format
> > change or the NREXT64 inode flag change and they only appear in the
> > on-disk inode at writeback.
> >
> > Log recovery needs to be checked for correct behaviour here. I think
> > that if the inode is in NREXT64 format when read in and the log
> > inode core is not, then the on disk LSN must be more recent than
> > what is being recovered from the log and should be skipped. If
> > NREXT64 is present in the log inode, then we logged the core
> > properly and we just don't care what format is on disk because we
> > replay it into NREXT64 format and write that back.
> 
> xfs_inode_item_format() logs the inode core regardless of whether
> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
> should not result in a scenario where the corresponding
> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.

Except that log recovery might be replaying lots of indoe changes
such as:

log inode
commit A
log inode
commit B
log inode
set NREXT64
commit C
writeback inode
<crash before log tail moves>

Recovery will then replay commit A, B and C, in which case we *must
not recover the log inode* in commit A or B because the LSN in the
on-disk inode points at commit C. Hence replaying A or B will result
in the on-disk inode going backwards in time and hence resulting in
an inconsistent state on disk until commit C is recovered.

> i.e. there is no need to compare LSNs of the checkpoint
> transaction being replayed and that of the disk inode.

Inncorrect: we -always- have to do this, regardless of the change
being made.

> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
> field, then we can safely conclude that the ondisk inode has to be updated to
> reflect this change

We can't assume that. This makes an assumption that NREXT64 is
only ever a one-way transition. There's nothing in the disk format that
prevents us from -removing- NREXT64 for inodes that don't need large
extent counts.

Yes, the -current implementation- does not allow going back to small
extent counts, but the on-disk format design still needs to allow
for such things to be done as we may need such functionality and
flexibility in the on-disk format in the future.

Hence we have to ensure that log recovery handles both set and reset
transistions from the start. If we don't ensure that log recovery
handles reset conditions when we first add the feature bit, then
we are going to have to add a log incompat or another feature bit
to stop older kernels from trying to recover reset operations.

IOWs, the only determining factor as to whether we should replay an
inode is the LSN of the on-disk inode vs the LSN of the transaction
being replayed. Feature bits in either the on-disk ior log inode are
not reliable indicators of whether a dynamically set feature is
active or not at the time the inode item is being replayed...

> >> > FWIW, I also think doing something like this would help make the
> >> > code be easier to read and confirm that it is obviously correct when
> >> > reading it:
> >> >
> >> > 	__be32          di_gid;         /* owner's group id */
> >> > 	__be32          di_nlink;       /* number of links to file */
> >> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
> >> > 	__be16          di_projid_hi;   /* higher part owner's project id */
> >> > 	union {
> >> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
> >> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
> >> > 		struct {
> >> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
> >> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
> >> > 		};
> >> > 	};
> >> > 	xfs_timestamp_t di_atime;       /* time last accessed */
> >> > 	xfs_timestamp_t di_mtime;       /* time last modified */
> >> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
> >> > 	__be64          di_size;        /* number of bytes in file */
> >> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
> >> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
> >> > 	union {
> >> > 		struct {
> >> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
> >> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
> >> > 		};
> >> > 		struct {
> >> > 			__be32	di_nextents;    /* !NREXT64 data extents */
> >> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
> >> > 		}
> >> > 	}
> 
> The two structures above result in padding and hence result in a hole being
> introduced. The entire union above can be replaced with the following,
> 
>         union {
>                 __be32  di_big_aextcnt; /* NREXT64 attr extents */
>                 __be32  di_nextents;    /* !NREXT64 data extents */
>         };
>         union {
>                 __be16  di_nrext64_pad; /* NREXT64 unused, zero */
>                 __be16  di_anextents;   /* !NREXT64 attr extents */
>         };

I don't think this makes sense. This groups by field rather than
by feature layout. It doesn't make it clear at all that these
varaibles both change definition at the same time - they are either
{di_nexts, di_anexts} pair or a {di_big_aexts, pad} pair. That's the
whole point of using anonymous structs here - it defines and
documents the relationship between the layouts when certain features
are set rather than relying on people to parse the comments
correctly to determine the relationship....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  reply	other threads:[~2021-10-10 21:49 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16 Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
2021-09-27 22:46   ` Dave Chinner
2021-09-28  9:46     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
2021-09-30  1:19   ` Dave Chinner
2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
2021-09-27 23:46   ` Dave Chinner
2021-09-28  4:04     ` Dave Chinner
2021-09-29 17:03       ` Chandan Babu R
2021-09-30  0:40         ` Dave Chinner
2021-09-30  4:31           ` Dave Chinner
2021-09-30  7:30             ` Chandan Babu R
2021-09-30 22:55               ` Dave Chinner
2021-10-07 10:52                 ` Chandan Babu R
2021-10-10 21:49                   ` Dave Chinner [this message]
2021-10-13 14:44                     ` Chandan Babu R
2021-10-14  2:00                       ` Dave Chinner
2021-10-14 10:07                         ` Chandan Babu R
2021-10-21 10:27                       ` Chandan Babu R
2021-09-28  9:47     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
2021-09-28  0:47   ` Dave Chinner
2021-09-28  9:47     ` Chandan Babu R
2021-09-28 23:08       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
2021-09-27 23:06   ` Dave Chinner
2021-09-28  9:49     ` Chandan Babu R
2021-09-28 23:39       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 10/12] xfs: Extend per-inode extent counter widths Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
2021-09-28  0:33   ` Dave Chinner
2021-09-28 10:07     ` Chandan Babu R
2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
2021-09-18  3:36   ` [External] : " Chandan Babu R

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211010214907.GK54211@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=chandan.babu@oracle.com \
    --cc=djwong@kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).