From: Dave Chinner <david@fromorbit.com>
To: Jeff Layton <jlayton@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
"Darrick J . Wong" <darrick.wong@oracle.com>
Subject: Re: [PATCH] xfs: fix i_version handling in xfs
Date: Wed, 17 Aug 2022 08:42:57 +1000
Message-ID: <20220816224257.GV3600936@dread.disaster.area>
In-Reply-To: <e77fd4d19815fd661dbdb04ab27e687ff7e727eb.camel@kernel.org>
On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > including atime-only updates due to reads. We don't want to record those
> > > in the i_version, as they don't represent "real" changes. Remove that
> > > callsite.
> > >
> > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > updated.
> >
> > What about operations that don't touch the mtime but change the file
> > metadata anyway? There are a few of those, like the blockgc garbage
> > collector, deduperange, and the defrag tool.
> >
>
> Do those change the c/mtime at all?
>
> It's possible we're missing some places that should change the i_version
> as well. We may need some more call sites.
>
> > Zooming out a bit -- what does i_version signal, concretely? I thought
> > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > on the server has moved on, and the client needs to invalidate its
> > caches. I thought afs had a similar generation counter, though it's
> > only used to cache file data, not metadata? Does an i_version change
> > cause all of them to invalidate caches, or is there more behavior I
> > don't know about?
> >
>
> For NFS, a change to the change attribute indicates that
> there has been a change to the data or metadata for the file. atime
> changes due to reads are specifically exempted from this, but we do bump
> the i_version if someone (e.g.) changes the atime via utimes().
We have relatime behaviour to optimise away unnecessary atime
updates on reads. Trying to explicitly exclude i_version from atime
updates in one filesystem just because NFS doesn't need that
information seems .... misguided. The -on disk- i_version
field behaviour is defined by the filesystem implementation, not the
NFS requirements.
> The NFS client will generally invalidate its caches for the inode when
> it notices a change attribute change.
>
> FWIW, AFS may not meet this standard since it doesn't generally
> increment the counter on metadata changes. It may turn out that we don't
> want to expose this to the AFS client due to that (or maybe come up with
> some way to indicate this difference).
In XFS, we've defined the on-disk i_version field to mean
"increments with any persistent inode data or metadata change",
regardless of what the high level applications that use i_version
might actually require.
That some network filesystem might only need a subset of the
metadata to be covered by i_version is largely irrelevant - if we
don't cover every persistent inode metadata change with i_version,
then applications that *need* stuff like atime change notification
can't be supported.
> > Does that mean that we should bump i_version for any file data or
> > attribute that could be queried or observed by userspace? In which case
> > I suppose this change is still correct, even if it relaxes i_version
> > updates from "any change to the inode whatsoever" to "any change that
> > would bump mtime". Unless FIEMAP is part of "attributes observed by
> > userspace".
> >
> > (The other downside I can see is that now we have to remember to bump
> > timestamps for every new file operation we add, unlike the current code
> > which is centrally located in xfs_trans_log_inode.)
> >
>
> The main reason for the change attribute in NFS was that NFSv3 is
> plagued with cache-coherency problems due to coarse-grained timestamp
> granularity. It was conceived as a way to indicate that the inode had
> changed without relying on timestamps.
Yes, and the most important design consideration for a filesystem is
that it -must be persistent-. The constraints on i_version are much
stricter than timestamps, and they are directly related to how the
filesystem persists metadata changes, not how metadata is changed or
accessed in memory.
> In practice, we want to bump the i_version counter whenever the ctime or
> mtime would be changed.
What about O_NOCMTIME modifications? What about lazytime
filesystems? These explicitly avoid or delay persistent c/mtime
updates, and that means bumping i_version only based on c/mtime
updates cannot be relied on. i_version is supposed to track user
visible data and metadata changes, *not timestamp updates*.
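
To make the gap concrete, here is a toy userspace model, not XFS code. Every name in it is made up for illustration, and the "nocmtime" flag only stands in for any path that changes the file without touching c/mtime. It contrasts bumping the counter only alongside a timestamp update with bumping it on every change:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inode: just enough state to show the difference. */
struct toy_inode {
	uint64_t version;	/* stands in for i_version */
	uint64_t mtime;		/* stands in for the persistent mtime */
	uint64_t data;		/* stands in for the file contents */
};

/* Policy A: the counter is only bumped when the timestamp is updated. */
static void toy_write_ts_policy(struct toy_inode *ip, uint64_t data,
				uint64_t now, bool nocmtime)
{
	ip->data = data;
	if (!nocmtime) {
		ip->mtime = now;
		ip->version++;
	}
}

/* Policy B: the counter is bumped on every change, timestamp or not. */
static void toy_write_change_policy(struct toy_inode *ip, uint64_t data,
				    uint64_t now, bool nocmtime)
{
	ip->data = data;
	ip->version++;
	if (!nocmtime)
		ip->mtime = now;
}
```

With policy A, a write made with nocmtime set changes the data but leaves the counter where it was, so an observer comparing counters sees no change. With policy B the same write is always visible in the counter.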
> > > Cc: Darrick J. Wong <darrick.wong@oracle.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > fs/xfs/libxfs/xfs_trans_inode.c | 17 +++--------------
> > > fs/xfs/xfs_iops.c | 4 ++++
> > > 2 files changed, 7 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..78bf7f491462 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -71,6 +71,8 @@ xfs_trans_ichgtime(
> > > inode->i_ctime = tv;
> > > if (flags & XFS_ICHGTIME_CREATE)
> > > ip->i_crtime = tv;
> > > + if (flags & (XFS_ICHGTIME_MOD|XFS_ICHGTIME_CHG))
> > > + inode_inc_iversion(inode);
> > > }
That looks wrong - this is not the only path through XFS that
modifies timestamps, and I have to ask why this needs to be an
explicit i_version bump given that nobody may have looked at
i_version since the last time it was updated?
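
For anyone not familiar with the mechanism, here is a rough userspace sketch of the "only bump if somebody looked" behaviour. It is a single-threaded model with made-up names. It assumes the counter keeps a "queried" flag in its lowest bit and omits the atomic cmpxchg loop the kernel helpers use:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model only: the counter lives in the upper bits and the lowest
 * bit records whether anyone has queried the value since the last bump.
 */
#define MODEL_QUERIED	1ULL
#define MODEL_INCREMENT	2ULL

struct model_inode {
	uint64_t i_version;	/* (counter << 1) | queried flag */
};

/* Read the counter and remember that somebody has looked at it. */
static uint64_t model_query_iversion(struct model_inode *inode)
{
	inode->i_version |= MODEL_QUERIED;
	return inode->i_version >> 1;
}

/* Bump only if forced, or if the value was queried since the last bump. */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	if (!force && !(inode->i_version & MODEL_QUERIED))
		return false;
	inode->i_version = (inode->i_version & ~MODEL_QUERIED) + MODEL_INCREMENT;
	return true;
}

/* Unconditional bump, modelling an explicit inode_inc_iversion() call. */
static void model_inc_iversion(struct model_inode *inode)
{
	model_maybe_inc_iversion(inode, true);
}
```

In this model, a run of modifications with no query in between costs nothing under the "maybe" variant, whereas the unconditional variant advances the counter (and so dirties the inode core) on every call.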
What about xfs_fs_dirty_inode() when we actually persist lazytime
in-memory timestamp updates? We didn't bump i_version when setting
I_DIRTY_TIME, and this patch now removes the mechanism that is used
to bump iversion if it is needed when we persist those lazytime
updates.....
> > > /*
> > > @@ -116,20 +118,7 @@ xfs_trans_log_inode(
> > > spin_unlock(&inode->i_lock);
> > > }
> > >
> > > - /*
> > > - * First time we log the inode in a transaction, bump the inode change
> > > - * counter if it is configured for this to occur. While we have the
> > > - * inode locked exclusively for metadata modification, we can usually
> > > - * avoid setting XFS_ILOG_CORE if no one has queried the value since
> > > - * the last time it was incremented. If we have XFS_ILOG_CORE already
> > > - * set however, then go ahead and bump the i_version counter
> > > - * unconditionally.
> > > - */
> > > - if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > > - if (IS_I_VERSION(inode) &&
> > > - inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > - iversion_flags = XFS_ILOG_CORE;
> > > - }
> > > + set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags);
.... and this removes the sweep that captures in-memory timestamp
and i_version peeks between any persistent inode metadata
modifications that have been made, regardless of whether i_version
has already been bumped for them or not.
IOWs, this seems to rely on every future inode modification in XFS
calling xfs_trans_ichgtime() to bump i_version to sweep previous VFS
in-memory timestamp updates that this inode modification captures
and persists to disk.
This seems fragile and error prone - it's relying on the
developers always getting timestamp and iversion updates correct,
rather than the code always guaranteeing that it captures timestamp and
iversion updates without any extra effort.
Hence, I think that trying to modify how filesystems persist
and maintain i_version coherency because NFS "doesn't need i_version
to cover atime updates" is the wrong approach. On-disk i_version
coherency has to work for more than just one NFS implementation
(especially now i_version will be exported to userspace!).
Persistent atime updates are already optimised away by relatime, and
so I think that any further atime filtering is largely a NFS
application layer problem and not something that should be solved by
changing the on-disk definition of back end filesystem structure
persistence.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com