linux-kernel.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@redhat.com>
To: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
Date: Fri, 03 Mar 2017 19:53:57 -0500
Message-ID: <1488588837.11672.5.camel@redhat.com>
In-Reply-To: <20170303230018.GI13877@fieldses.org>

On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > counter, by exploiting the fact that we don't need to increment it
> > if no one is looking at it. We can also clean up the code to prepare
> > to eventually expose this value via statx().
> > 
> > The inode->i_version field is supposed to be a value that changes
> > whenever there is any data or metadata change to the inode. Some
> > filesystems use it internally to detect directory changes during
> > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > set. IMA will also use it (though it's not clear to me that that
> > works 100% -- no check for MS_I_VERSION there).
> 
> I'm a little confused about the internal uses for directories.  Is it
> necessarily kept on disk?

No, it isn't necessarily stored on disk, and in fact it isn't on most
(maybe all?) of those filesystems. The field is just a convenient place
to store a directory change value that is subsequently checked in
readdir operations.

That's part of the "fun" of untangling this mess. ;)
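
To make the readdir use above concrete, here is a rough user-space
sketch. It is not taken from any of those filesystems: struct dir,
struct dir_cursor and both helpers are hypothetical names, and the
"revalidation" is reduced to clamping the position. The point is only
that the counter lives in memory and is compared, never persisted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory directory; only the fields the sketch needs. */
struct dir {
	uint64_t version;	/* bumped on every change, never written to disk */
	size_t nentries;
};

/* Hypothetical per-open cursor, standing in for f_pos and f_version. */
struct dir_cursor {
	uint64_t seen_version;	/* directory version when pos was last validated */
	size_t pos;
};

/* Any add or remove changes the entry count and bumps the version. */
static void dir_modify(struct dir *d, size_t new_count)
{
	assert(d != NULL);
	d->nentries = new_count;
	d->version++;
}

/*
 * Called at the start of a readdir pass. Returns true if the directory
 * changed since the cursor was last validated, in which case the cached
 * position is re-checked (here: clamped to the new entry count).
 */
static bool cursor_revalidate(const struct dir *d, struct dir_cursor *c)
{
	assert(d != NULL && c != NULL);
	if (c->seen_version == d->version)
		return false;
	if (c->pos > d->nentries)
		c->pos = d->nentries;
	c->seen_version = d->version;
	return true;
}
```

A plain (non-atomic) increment is enough in this sketch because it
assumes the caller already serializes directory modifications.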

> In cases it's not, then there's not the same
> performance issue, right? 

Right, I don't think it's really a big deal for most of those and I'm
not terribly concerned about the i_version counter on directory change
operations. An extra atomic op on a directory change seems unlikely to
hurt anything.

The main purpose of this is to improve the situation with small writes.
This should also help pave the way for filesystems like NFS and Ceph
that implement a version counter but may not necessarily bump it on
every change.
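
For anyone skimming, the core idea from the cover letter quoted below
(skip the bump unless someone has looked at the value since the last
change) can be sketched in user-space C11 roughly as follows. This is
an illustration, not the code in the series: the patches track the
"queried" state with an i_state flag and then move the counter to an
atomic64_t, whereas this sketch assumes the flag is folded into the low
bit of one atomic word. struct iversion, the IV_* constants and both
function names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IV_QUERIED	1ULL	/* assumed layout: low bit = "someone looked" */
#define IV_INCREMENT	2ULL	/* counter occupies the remaining 63 bits */

struct iversion {
	_Atomic uint64_t raw;
};

/*
 * Writer side. Bumps the counter only if it has been queried since the
 * last bump, or if the caller forces it (e.g. the inode is going to
 * disk anyway). Returns true if the value changed, which is when the
 * caller would need to dirty the inode.
 */
static bool iversion_maybe_inc(struct iversion *iv, bool force)
{
	assert(iv != NULL);
	uint64_t cur = atomic_load(&iv->raw);

	for (;;) {
		if (!force && !(cur & IV_QUERIED))
			return false;	/* nobody looked: skip the update */

		/* Clear the flag and advance the counter in one step. */
		uint64_t next = (cur & ~IV_QUERIED) + IV_INCREMENT;

		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&iv->raw, &cur, next))
			return true;
	}
}

/*
 * Reader side. Returns the counter and marks it as queried, so that
 * the next change is guaranteed to produce a different value.
 */
static uint64_t iversion_query(struct iversion *iv)
{
	assert(iv != NULL);
	uint64_t cur = atomic_fetch_or(&iv->raw, IV_QUERIED);

	return cur >> 1;
}
```

With this layout a stream of small writes with no readers never
touches the counter after the first check, while a reader that queries
between two writes still sees two different values.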

I think once we have things more consistent, we'll be able to consider
exposing the i_version counter to userland via statx.

> Is there any risk these patches make
> performance slightly worse in that case?
> 

Maybe, but I think that risk is pretty low. Note that I haven't measured
that specifically here, so I could be completely wrong.

If it is a problem, we could consider unioning this thing with a non-
atomic field for the directory change cases, but that would add some
complexity and I'm not sure it's worth it. I'd want to measure it first.

> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes.
> 
> On those filesystems that's done for both files and directories, right?
> 

Yes.

> > That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> > 
> > It turns out though that none of these users of i_version require that
> > i_version change on every change to the file. The only real requirement
> > is that it be different if _something_ changed since the last time we
> > queried for it. [2]
> > 
> > So, if we simply keep track of when something queries the value, we
> > can avoid bumping the counter and that metadata update.
> > 
> > This patchset implements this:
> > 
> > It starts with some small cleanup patches to just remove any mention of
> > the i_version counter in filesystems that don't actually use it.
> > 
> > Then, we add a new set of static inlines for managing the counter. The
> > initial version should work identically to what we have now. Then, all
> > of the remaining filesystems that use i_version are converted to the new
> > inlines.
> > 
> > Once that's in place, we switch to a new implementation that allows us
> > to track readers of i_version counter, and only bump it when it's
> > necessary or convenient (i.e. we're going to disk anyway).
> > 
> > The final patch switches from a scheme that uses the i_lock to serialize
> > the counter updates during write to an atomic64_t. That's a wash
> > performance-wise in my testing, but I like not having to take the i_lock
> > down where it could end up nested inside other locks.
> > 
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> > 
> > The pessimal workload here is 1 byte writes, and it helps that
> > significantly. Of course, that's not a real-world workload.
> > 
> > A tiobench-example.fio workload also shows some modest performance
> > gains, and I've gotten mails from the kernel test robot that show some
> > significant performance gains on some microbenchmarks (case-msync-mt in
> > the vm-scalability testsuite to be specific).
> > 
> > I'm happy to run other workloads if anyone can suggest them.
> > 
> > At this point, the patchset works and does what it's expected to do in
> > my own testing. It seems like it's at least a modest performance win
> > across all 3 major disk-based filesystems. It may also encourage others
> > to implement i_version as well since it reduces that cost.
> > 
> > Is this an avenue that's worthwhile to pursue?
> > 
> > Note that I think we may have other changes coming in the future that
> > will make this sort of cleanup necessary anyway. I'd like to plug in the
> > Ceph change attribute here eventually, and that will require something
> > like this anyway.
> > 
> > Thoughts, comments and suggestions are welcome...
> > 
> > ---
> > 
> > [1]: On ext4 it must be turned on with the i_version mount option,
> >      mostly due to fears of incurring this impact, AFAICT.
> > 
> > [2]: NFS also recommends that it appear to increase in value over time, so
> >      that clients can discard metadata updates that are older than ones
> >      they've already seen.
> > 
> > Jeff Layton (30):
> >   lustre: don't set f_version in ll_readdir
> >   ecryptfs: remove unnecessary i_version bump
> >   ceph: remove the bump of i_version
> >   f2fs: don't bother setting i_version
> >   hpfs: don't bother with the i_version counter
> >   jfs: remove initialization of i_version counter
> >   nilfs2: remove inode->i_version initialization
> >   orangefs: remove initialization of i_version
> >   reiserfs: remove unneeded i_version bump
> >   ntfs: remove i_version handling
> >   fs: new API for handling i_version
> >   fat: convert to new i_version API
> >   affs: convert to new i_version API
> >   afs: convert to new i_version API
> >   btrfs: convert to new i_version API
> >   exofs: switch to new i_version API
> >   ext2: convert to new i_version API
> >   ext4: convert to new i_version API
> >   nfs: convert to new i_version API
> >   nfsd: convert to new i_version API
> >   ocfs2: convert to new i_version API
> >   ufs: use new i_version API
> >   xfs: convert to new i_version API
> >   IMA: switch IMA over to new i_version API
> >   fs: add a "force" parameter to inode_inc_iversion
> >   fs: only set S_VERSION when updating times if it has been queried
> >   xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
> >     incrementing
> >   btrfs: only dirty the inode in btrfs_update_time if something was
> >     changed
> >   fs: track whether the i_version has been queried with an i_state flag
> >   fs: convert i_version counter over to an atomic64_t
> > 
> >  drivers/staging/lustre/lustre/llite/dir.c |   3 -
> >  fs/affs/amigaffs.c                        |   4 +-
> >  fs/affs/dir.c                             |   4 +-
> >  fs/affs/super.c                           |   2 +-
> >  fs/afs/fsclient.c                         |   2 +-
> >  fs/afs/inode.c                            |   4 +-
> >  fs/btrfs/delayed-inode.c                  |   4 +-
> >  fs/btrfs/file.c                           |   4 +-
> >  fs/btrfs/inode.c                          |  41 ++++----
> >  fs/btrfs/ioctl.c                          |   4 +-
> >  fs/btrfs/tree-log.c                       |   2 +-
> >  fs/btrfs/xattr.c                          |   2 +-
> >  fs/ceph/inode.c                           |   1 -
> >  fs/ecryptfs/inode.c                       |   1 -
> >  fs/exofs/dir.c                            |   8 +-
> >  fs/exofs/super.c                          |   2 +-
> >  fs/ext2/dir.c                             |   8 +-
> >  fs/ext2/super.c                           |   4 +-
> >  fs/ext4/dir.c                             |   8 +-
> >  fs/ext4/inline.c                          |   6 +-
> >  fs/ext4/inode.c                           |  16 ++--
> >  fs/ext4/ioctl.c                           |   2 +-
> >  fs/ext4/namei.c                           |   8 +-
> >  fs/ext4/super.c                           |   2 +-
> >  fs/f2fs/super.c                           |   1 -
> >  fs/fat/dir.c                              |   2 +-
> >  fs/fat/inode.c                            |   8 +-
> >  fs/fat/namei_msdos.c                      |   6 +-
> >  fs/fat/namei_vfat.c                       |  20 ++--
> >  fs/hpfs/dir.c                             |   1 -
> >  fs/hpfs/dnode.c                           |   2 -
> >  fs/hpfs/super.c                           |   1 -
> >  fs/inode.c                                |   9 +-
> >  fs/jfs/super.c                            |   1 -
> >  fs/nfs/delegation.c                       |   2 +-
> >  fs/nfs/fscache-index.c                    |   4 +-
> >  fs/nfs/inode.c                            |  16 ++--
> >  fs/nfs/nfs4proc.c                         |   4 +-
> >  fs/nfs/nfstrace.h                         |   4 +-
> >  fs/nfs/write.c                            |   2 +-
> >  fs/nfsd/nfs3xdr.c                         |   2 +-
> >  fs/nfsd/nfs4xdr.c                         |   2 +-
> >  fs/nfsd/nfsfh.h                           |   2 +-
> >  fs/nilfs2/super.c                         |   1 -
> >  fs/ntfs/inode.c                           |   9 --
> >  fs/ntfs/mft.c                             |   6 --
> >  fs/ocfs2/dir.c                            |  14 +--
> >  fs/ocfs2/inode.c                          |   2 +-
> >  fs/ocfs2/namei.c                          |   2 +-
> >  fs/ocfs2/quota_global.c                   |   2 +-
> >  fs/orangefs/super.c                       |   2 -
> >  fs/reiserfs/super.c                       |   1 -
> >  fs/ufs/dir.c                              |   8 +-
> >  fs/ufs/inode.c                            |   2 +-
> >  fs/ufs/super.c                            |   2 +-
> >  fs/xfs/libxfs/xfs_inode_buf.c             |   4 +-
> >  fs/xfs/xfs_icache.c                       |   4 +-
> >  fs/xfs/xfs_inode.c                        |   2 +-
> >  fs/xfs/xfs_inode_item.c                   |   2 +-
> >  fs/xfs/xfs_trans_inode.c                  |  12 +--
> >  include/linux/fs.h                        | 151 ++++++++++++++++++++++++++++--
> >  security/integrity/ima/ima_api.c          |   2 +-
> >  security/integrity/ima/ima_main.c         |   2 +-
> >  63 files changed, 288 insertions(+), 173 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 

-- 
Jeff Layton <jlayton@redhat.com>
