* [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
@ 2017-12-13 14:19 Jeff Layton
  2017-12-13 14:19 ` [PATCH 01/19] fs: new API for handling inode->i_version Jeff Layton
                   ` (19 more replies)
  0 siblings, 20 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

About a year ago, I sent a pile of patches that overhauled how the
inode->i_version field is handled in filesystems. This is a follow-up
to that initial series.

tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it
if no one is looking at it. We can also clean up the code to prepare
to eventually expose this value via statx().

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION
set. IMA will also use it to optimize away some remeasurement if
it's available.

Only btrfs, ext4, and xfs implement it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]

It turns out though that none of these users of i_version require that
i_version change on every change to the file. The only real requirement
is that it be different if _something_ changed since the last time we
queried for it.

If we keep track of whether the value has been queried, we can skip both
the counter bump and the on-disk update whenever no one has queried it
since it was last incremented and nothing else in the inode has changed.

This patchset changes the code to only bump the i_version counter when
it's strictly necessary, or when we're updating the inode metadata
anyway (e.g. when times change).

It takes the approach of converting the existing accessors of i_version
to use a new API, while leaving the underlying implementation mostly the
same.  The last patch then converts the existing implementation to keep
track of whether the value has been queried since it was last
incremented and uses that to avoid incrementing the counter when it can.
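
To make the idea concrete, here is a minimal userspace sketch of one way
to track "queried since last increment". It assumes the flag is kept in
the low bit of the counter word and uses C11 atomics; the names
(fake_inode, fake_maybe_inc, fake_query, IV_QUERIED, IV_INCREMENT) are
hypothetical and this is not necessarily how the last patch does it.

```c
#include <assert.h>	/* only needed for the usage snippet below */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 0 is a "queried" flag, the counter sits above it. */
#define IV_QUERIED	((uint64_t)1)
#define IV_INCREMENT	((uint64_t)2)

/* Hypothetical stand-in for struct inode. */
struct fake_inode {
	_Atomic uint64_t i_version;
};

/*
 * Bump the counter only if the caller forces it or if someone has
 * queried the value since the last bump. Returns true if it was bumped.
 */
static bool fake_maybe_inc(struct fake_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);

	for (;;) {
		uint64_t new;

		if (!force && !(cur & IV_QUERIED))
			return false;	/* nobody looked, nothing to do */

		/* Clear the flag and advance the counter by one step. */
		new = (cur & ~IV_QUERIED) + IV_INCREMENT;
		if (atomic_compare_exchange_weak(&inode->i_version, &cur, new))
			return true;
		/* A failed exchange reloaded cur; go around again. */
	}
}

/* Read the counter and flag it as queried so the next change bumps it. */
static uint64_t fake_query(struct fake_inode *inode)
{
	uint64_t cur = atomic_fetch_or(&inode->i_version, IV_QUERIED);

	return cur >> 1;
}
```

With a counter initialized to 0, fake_maybe_inc(&ino, false) returns
false until fake_query(&ino) has been called; after the query, the next
unforced call bumps the counter once and the one after that is skipped
again.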

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (by far the most common case).

The pessimal workload here is 1-byte writes, and the patchset helps that
case significantly. Of course, that's not what we'd consider a real-world
workload.

A tiobench-example.fio workload also shows some modest performance
gains, and I've gotten mails from the kernel test robot that show some
significant performance gains on some microbenchmarks (case-msync-mt in
the vm-scalability testsuite to be specific), with an earlier version of
this set.

With larger writes, the gains with this patchset mostly evaporate,
but it does not seem to cause performance to regress anywhere, AFAICT.

I'm happy to run other workloads if anyone can suggest them.

At this point, the patchset works and does what it's expected to do in
my own testing. It seems like it's at least a modest performance win
across all 3 major disk-based filesystems. It may also encourage other
filesystems to implement i_version, since it reduces the cost.

[1]: On ext4 it must be turned on with the i_version mount option,
     mostly due to fears of incurring this impact, AFAICT.

Jeff Layton (19):
  fs: new API for handling inode->i_version
  fs: don't take the i_lock in inode_inc_iversion
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: only set S_VERSION when updating times if necessary
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
    incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
    changed
  fs: handle inode->i_version more efficiently

 fs/affs/amigaffs.c                |   4 +-
 fs/affs/dir.c                     |   4 +-
 fs/affs/super.c                   |   2 +-
 fs/afs/fsclient.c                 |   2 +-
 fs/afs/inode.c                    |   4 +-
 fs/btrfs/delayed-inode.c          |   6 +-
 fs/btrfs/inode.c                  |  11 +-
 fs/btrfs/tree-log.c               |   3 +-
 fs/exofs/dir.c                    |   8 +-
 fs/exofs/super.c                  |   2 +-
 fs/ext2/dir.c                     |   8 +-
 fs/ext2/super.c                   |   4 +-
 fs/ext4/dir.c                     |   8 +-
 fs/ext4/inline.c                  |   6 +-
 fs/ext4/inode.c                   |  12 +-
 fs/ext4/ioctl.c                   |   2 +-
 fs/ext4/namei.c                   |   4 +-
 fs/ext4/super.c                   |   2 +-
 fs/ext4/xattr.c                   |   4 +-
 fs/fat/dir.c                      |   2 +-
 fs/fat/inode.c                    |   8 +-
 fs/fat/namei_msdos.c              |   6 +-
 fs/fat/namei_vfat.c               |  20 +--
 fs/inode.c                        |   9 +-
 fs/nfs/delegation.c               |   2 +-
 fs/nfs/fscache-index.c            |   4 +-
 fs/nfs/inode.c                    |  16 +--
 fs/nfs/nfs4proc.c                 |   9 +-
 fs/nfs/nfstrace.h                 |   4 +-
 fs/nfs/write.c                    |   7 +-
 fs/nfsd/nfsfh.h                   |   2 +-
 fs/ocfs2/dir.c                    |  14 +--
 fs/ocfs2/inode.c                  |   2 +-
 fs/ocfs2/namei.c                  |   2 +-
 fs/ocfs2/quota_global.c           |   2 +-
 fs/ufs/dir.c                      |   8 +-
 fs/ufs/inode.c                    |   2 +-
 fs/ufs/super.c                    |   2 +-
 fs/xfs/libxfs/xfs_inode_buf.c     |   5 +-
 fs/xfs/xfs_icache.c               |   4 +-
 fs/xfs/xfs_inode.c                |   2 +-
 fs/xfs/xfs_inode_item.c           |   2 +-
 fs/xfs/xfs_trans_inode.c          |  14 ++-
 include/linux/fs.h                | 250 ++++++++++++++++++++++++++++++++++++--
 security/integrity/ima/ima_api.c  |   2 +-
 security/integrity/ima/ima_main.c |   2 +-
 46 files changed, 371 insertions(+), 127 deletions(-)

-- 
2.14.3

* [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
@ 2017-12-13 14:19 ` Jeff Layton
  2017-12-13 22:04   ` NeilBrown
  2017-12-13 14:20 ` [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:19 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Add a documentation blob that explains what the i_version field is, how
it is expected to work, and how it is currently implemented by various
filesystems.

We already have inode_inc_iversion. Add several other functions for
manipulating and accessing the i_version counter. For now, the
implementation is trivial and basically works the way that all of the
open-coded i_version accesses work today.

Future patches will convert existing users of i_version to use the new
API, and then convert the backend implementation to do things more
efficiently.
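
As a rough illustration of how a caller is expected to pair the query and
compare helpers, here is a userspace sketch of a cached-cursor check,
loosely modelled on the readdir-style users of i_version. Everything here
is a mock: mock_inode, mock_cursor and the mock_* helpers are hypothetical
stand-ins, not the kernel functions themselves.

```c
#include <assert.h>	/* only needed for the usage snippet below */
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for struct inode; only the field we care about. */
struct mock_inode {
	uint64_t i_version;
};

/* Hypothetical per-open-file state that remembers the version it saw. */
struct mock_cursor {
	uint64_t f_version;
};

static void mock_set_iversion(struct mock_inode *inode, uint64_t new)
{
	inode->i_version = new;
}

static void mock_inc_iversion(struct mock_inode *inode)
{
	inode->i_version++;
}

/* Read the value with the intent of comparing against it later. */
static uint64_t mock_query_iversion(struct mock_inode *inode)
{
	return inode->i_version;
}

/*
 * Returns 0 if unchanged, non-zero otherwise. The subtraction is done on
 * unsigned values and then cast, to avoid signed overflow in this sketch.
 */
static int64_t mock_cmp_iversion(const struct mock_inode *inode, uint64_t old)
{
	return (int64_t)(inode->i_version - old);
}

/* A saved cursor is only usable if the directory has not changed. */
static bool cursor_still_valid(const struct mock_inode *dir,
			       const struct mock_cursor *cur)
{
	return mock_cmp_iversion(dir, cur->f_version) == 0;
}
```

A caller would store mock_query_iversion(&dir) in the cursor when it
finishes a pass, and check cursor_still_valid() before resuming; any
mock_inc_iversion(&dir) in between makes the check fail.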

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h | 200 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 192 insertions(+), 8 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..5001e77342fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2036,19 +2036,203 @@ static inline void inode_dec_link_count(struct inode *inode)
 	mark_inode_dirty(inode);
 }
 
+/*
+ * The change attribute (i_version) is mandated by NFSv4 and is mostly for
+ * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
+ * appear different to observers if there was a change to the inode's data or
+ * metadata since it was last queried.
+ *
+ * It should be considered an opaque value by observers. If it remains the same
+ * since it was last checked, then nothing has changed in the inode. If it's
+ * different then something has changed. Observers cannot infer anything about
+ * the nature or magnitude of the changes from the value, only that the inode
+ * has changed in some fashion.
+ *
+ * Not all filesystems properly implement the i_version counter. Subsystems that
+ * want to use the i_version field on an inode should first check whether the
+ * filesystem sets the SB_I_VERSION flag (usually via the IS_I_VERSION macro).
+ *
+ * Those that set SB_I_VERSION will automatically have their i_version counter
+ * incremented on writes to normal files. If the SB_I_VERSION is not set, then
+ * the VFS will not touch it on writes, and the filesystem can use it how it
+ * wishes. Note that the filesystem is always responsible for updating the
+ * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
+ * We consider these sorts of filesystems to have a kernel-managed i_version.
+ *
+ * Note that some filesystems (e.g. NFS and AFS) just use the field to store
+ * a server-provided value (for the most part). For that reason, those
+ * filesystems do not set SB_I_VERSION. These filesystems are considered to
+ * have a self-managed i_version.
+ */
+
+/**
+ * inode_set_iversion_raw - set i_version to the specified raw value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for use by
+ * filesystems that self-manage the i_version.
+ *
+ * For example, the NFS client stores its NFSv4 change attribute in this way,
+ * and the AFS client stores the data_version from the server here.
+ */
+static inline void
+inode_set_iversion_raw(struct inode *inode, const u64 new)
+{
+	inode->i_version = new;
+}
+
+/**
+ * inode_set_iversion - set i_version to a particular value
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * Set @inode's i_version field to @new. This function is for filesystems with
+ * a kernel-managed i_version.
+ *
+ * For now, this just does the same thing as the _raw variant.
+ */
+static inline void
+inode_set_iversion(struct inode *inode, const u64 new)
+{
+	inode_set_iversion_raw(inode, new);
+}
+
+/**
+ * inode_set_iversion_queried - set i_version to a particular value and set
+ *                              flag to indicate that it has been viewed
+ * @inode: inode to set
+ * @new: new i_version value to set
+ *
+ * When loading in an i_version value from a backing store, we typically don't
+ * know whether it was previously viewed before being stored or not. Thus, we
+ * must assume that it was, to ensure that any changes will result in the
+ * value changing.
+ *
+ * This function will set the inode's i_version, and possibly flag the value
+ * as if it has already been viewed at least once.
+ *
+ * For now, this just does what inode_set_iversion does.
+ */
+static inline void
+inode_set_iversion_queried(struct inode *inode, const u64 new)
+{
+	inode_set_iversion(inode, new);
+}
+
+/**
+ * inode_maybe_inc_iversion - increments i_version
+ * @inode: inode with the i_version that should be updated
+ * @force: increment the counter even if it's not necessary
+ *
+ * Every time the inode is modified, the i_version field must be seen to have
+ * changed by any observer.
+ *
+ * In this implementation, we always increment it after taking the i_lock to
+ * ensure that we don't race with other incrementors.
+ *
+ * Returns true if counter was bumped, and false if it wasn't.
+ */
+static inline bool
+inode_maybe_inc_iversion(struct inode *inode, bool force)
+{
+	spin_lock(&inode->i_lock);
+	inode->i_version++;
+	spin_unlock(&inode->i_lock);
+	return true;
+}
+
+/**
+ * inode_inc_iversion - forcibly increment i_version
+ * @inode: inode that needs to be updated
+ *
+ * Forcibly increment the i_version field. This always results in a change to
+ * the observable value.
+ */
+static inline void
+inode_inc_iversion(struct inode *inode)
+{
+	inode_maybe_inc_iversion(inode, true);
+}
+
 /**
- * inode_inc_iversion - increments i_version
- * @inode: inode that need to be updated
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
  *
- * Every time the inode is modified, the i_version field will be incremented.
- * The filesystem has to be mounted with i_version flag
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change.
+ *
+ * For now, we assume that it always does.
  */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
+{
+	return true;
+}
 
-static inline void inode_inc_iversion(struct inode *inode)
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+	return inode->i_version;
+}
+
+/**
+ * inode_peek_iversion - read i_version without flagging it to be incremented
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter for an inode without registering it as a
+ * query.
+ *
+ * This is typically used by local filesystems that need to store an i_version
+ * on disk. In that situation, it's not necessary to flag it as having been
+ * viewed, as the result won't be used to gauge changes from that point.
+ */
+static inline u64
+inode_peek_iversion(const struct inode *inode)
+{
+	return inode_peek_iversion_raw(inode);
+}
+
+/**
+ * inode_query_iversion - read i_version for later use
+ * @inode: inode from which i_version should be read
+ *
+ * Read the inode i_version counter. This should be used by callers that wish
+ * to store the returned i_version for later comparison. This will guarantee
+ * that a later query of the i_version will result in a different value if
+ * anything has changed.
+ *
+ * This implementation just does a peek.
+ */
+static inline u64
+inode_query_iversion(struct inode *inode)
+{
+	return inode_peek_iversion(inode);
+}
+
+/**
+ * inode_cmp_iversion - check whether the i_version counter has changed
+ * @inode: inode to check
+ * @old: old value to check against its i_version
+ *
+ * Compare an i_version counter with a previous one. Returns 0 if they are
+ * the same or non-zero if they are different.
+ */
+static inline s64
+inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-       spin_lock(&inode->i_lock);
-       inode->i_version++;
-       spin_unlock(&inode->i_lock);
+	return (s64)inode_peek_iversion(inode) - (s64)old;
 }
 
 enum file_time_flags {
-- 
2.14.3

* [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
  2017-12-13 14:19 ` [PATCH 01/19] fs: new API for handling inode->i_version Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 21:52   ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 03/19] fat: convert to new i_version API Jeff Layton
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

The rationale for taking the i_lock when incrementing this value is
lost in antiquity. The readers of the field don't take it (at least
not universally), so my assumption is that it was only done here to
serialize incrementors.

If that is indeed the case, then we can drop the i_lock from this
codepath and treat it as an atomic64_t for the purposes of
incrementing it. This allows us to use inode_inc_iversion without
any danger of lock inversion.

Note that the read side is not fetched atomically with this change.
The assumption here is that that is not a critical issue since the
i_version is not fully synchronized with anything else anyway.
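
For readers more used to userspace, the lock-free increment can be
sketched with C11 atomics as below. This is only an analogy: the patch
casts the existing u64 field to atomic64_t rather than changing its type,
and its readers use a plain load, whereas this sketch declares the field
atomic and reads it with an atomic load. The names (demo_inode,
demo_inc_iversion, demo_peek_iversion) and the choice of relaxed ordering
are assumptions made for the example.

```c
#include <assert.h>	/* only needed for the usage snippet below */
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct inode. */
struct demo_inode {
	_Atomic uint64_t i_version;
};

/*
 * Increment without taking any lock. Concurrent incrementors cannot lose
 * an update, because the read-modify-write is a single atomic operation.
 */
static void demo_inc_iversion(struct demo_inode *inode)
{
	atomic_fetch_add_explicit(&inode->i_version, 1, memory_order_relaxed);
}

/* Read the current value; no ordering against other fields is implied. */
static uint64_t demo_peek_iversion(struct demo_inode *inode)
{
	return atomic_load_explicit(&inode->i_version, memory_order_relaxed);
}
```

Starting from a counter of 41, one call to demo_inc_iversion() makes
demo_peek_iversion() return 42, with no spinlock involved on either side.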

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5001e77342fd..c234fac4bb77 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2136,9 +2136,9 @@ inode_set_iversion_queried(struct inode *inode, const u64 new)
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	spin_lock(&inode->i_lock);
-	inode->i_version++;
-	spin_unlock(&inode->i_lock);
+	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+
+	atomic64_inc(ivp);
 	return true;
 }
 
-- 
2.14.3

* [PATCH 03/19] fat: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
  2017-12-13 14:19 ` [PATCH 01/19] fs: new API for handling inode->i_version Jeff Layton
  2017-12-13 14:20 ` [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 04/19] affs: " Jeff Layton
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/fat/dir.c         |  2 +-
 fs/fat/inode.c       |  8 ++++----
 fs/fat/namei_msdos.c |  6 +++---
 fs/fat/namei_vfat.c  | 20 ++++++++++----------
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index b833ffeee1e1..221d85529c5f 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1055,7 +1055,7 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo)
 	brelse(bh);
 	if (err)
 		return err;
-	dir->i_version++;
+	inode_inc_iversion(dir);
 
 	if (nr_slots) {
 		/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 20a0a89eaca5..a444336c1eed 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -507,7 +507,7 @@ int fat_fill_inode(struct inode *inode, struct msdos_dir_entry *de)
 	MSDOS_I(inode)->i_pos = 0;
 	inode->i_uid = sbi->options.fs_uid;
 	inode->i_gid = sbi->options.fs_gid;
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	inode->i_generation = get_seconds();
 
 	if ((de->attr & ATTR_DIR) && !IS_FREE(de->name)) {
@@ -590,7 +590,7 @@ struct inode *fat_build_inode(struct super_block *sb,
 		goto out;
 	}
 	inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
-	inode->i_version = 1;
+	inode_set_iversion(inode, 1);
 	err = fat_fill_inode(inode, de);
 	if (err) {
 		iput(inode);
@@ -1377,7 +1377,7 @@ static int fat_read_root(struct inode *inode)
 	MSDOS_I(inode)->i_pos = MSDOS_ROOT_INO;
 	inode->i_uid = sbi->options.fs_uid;
 	inode->i_gid = sbi->options.fs_gid;
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	inode->i_generation = 0;
 	inode->i_mode = fat_make_mode(sbi, ATTR_DIR, S_IRWXUGO);
 	inode->i_op = sbi->dir_ops;
@@ -1828,7 +1828,7 @@ int fat_fill_super(struct super_block *sb, void *data, int silent, int isvfat,
 	if (!root_inode)
 		goto out_fail;
 	root_inode->i_ino = MSDOS_ROOT_INO;
-	root_inode->i_version = 1;
+	inode_set_iversion(root_inode, 1);
 	error = fat_read_root(root_inode);
 	if (error < 0) {
 		iput(root_inode);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index d24d2758a363..7b71c1b242ce 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -480,7 +480,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
 			} else
 				mark_inode_dirty(old_inode);
 
-			old_dir->i_version++;
+			inode_inc_iversion(old_dir);
 			old_dir->i_ctime = old_dir->i_mtime = current_time(old_dir);
 			if (IS_DIRSYNC(old_dir))
 				(void)fat_sync_inode(old_dir);
@@ -508,7 +508,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
 			goto out;
 		new_i_pos = sinfo.i_pos;
 	}
-	new_dir->i_version++;
+	inode_inc_iversion(new_dir);
 
 	fat_detach(old_inode);
 	fat_attach(old_inode, new_i_pos);
@@ -540,7 +540,7 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
 	old_sinfo.bh = NULL;
 	if (err)
 		goto error_dotdot;
-	old_dir->i_version++;
+	inode_inc_iversion(old_dir);
 	old_dir->i_ctime = old_dir->i_mtime = ts;
 	if (IS_DIRSYNC(old_dir))
 		(void)fat_sync_inode(old_dir);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 02c066663a3a..fe1bb65149c8 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -46,7 +46,7 @@ static int vfat_revalidate_shortname(struct dentry *dentry)
 {
 	int ret = 1;
 	spin_lock(&dentry->d_lock);
-	if (vfat_d_version(dentry) != d_inode(dentry->d_parent)->i_version)
+	if (inode_cmp_iversion(d_inode(dentry->d_parent), vfat_d_version(dentry)))
 		ret = 0;
 	spin_unlock(&dentry->d_lock);
 	return ret;
@@ -759,7 +759,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 out:
 	mutex_unlock(&MSDOS_SB(sb)->s_lock);
 	if (!inode)
-		vfat_d_version_set(dentry, dir->i_version);
+		vfat_d_version_set(dentry, inode_query_iversion(dir));
 	return d_splice_alias(inode, dentry);
 error:
 	mutex_unlock(&MSDOS_SB(sb)->s_lock);
@@ -781,7 +781,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 	err = vfat_add_entry(dir, &dentry->d_name, 0, 0, &ts, &sinfo);
 	if (err)
 		goto out;
-	dir->i_version++;
+	inode_inc_iversion(dir);
 
 	inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
 	brelse(sinfo.bh);
@@ -789,7 +789,7 @@ static int vfat_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		err = PTR_ERR(inode);
 		goto out;
 	}
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
 	/* timestamp is already written, so mark_inode_dirty() is unneeded. */
 
@@ -823,7 +823,7 @@ static int vfat_rmdir(struct inode *dir, struct dentry *dentry)
 	clear_nlink(inode);
 	inode->i_mtime = inode->i_atime = current_time(inode);
 	fat_detach(inode);
-	vfat_d_version_set(dentry, dir->i_version);
+	vfat_d_version_set(dentry, inode_query_iversion(dir));
 out:
 	mutex_unlock(&MSDOS_SB(sb)->s_lock);
 
@@ -849,7 +849,7 @@ static int vfat_unlink(struct inode *dir, struct dentry *dentry)
 	clear_nlink(inode);
 	inode->i_mtime = inode->i_atime = current_time(inode);
 	fat_detach(inode);
-	vfat_d_version_set(dentry, dir->i_version);
+	vfat_d_version_set(dentry, inode_query_iversion(dir));
 out:
 	mutex_unlock(&MSDOS_SB(sb)->s_lock);
 
@@ -875,7 +875,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	err = vfat_add_entry(dir, &dentry->d_name, 1, cluster, &ts, &sinfo);
 	if (err)
 		goto out_free;
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	inc_nlink(dir);
 
 	inode = fat_build_inode(sb, sinfo.de, sinfo.i_pos);
@@ -885,7 +885,7 @@ static int vfat_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		/* the directory was completed, just return a error */
 		goto out;
 	}
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	set_nlink(inode, 2);
 	inode->i_mtime = inode->i_atime = inode->i_ctime = ts;
 	/* timestamp is already written, so mark_inode_dirty() is unneeded. */
@@ -951,7 +951,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
 			goto out;
 		new_i_pos = sinfo.i_pos;
 	}
-	new_dir->i_version++;
+	inode_inc_iversion(new_dir);
 
 	fat_detach(old_inode);
 	fat_attach(old_inode, new_i_pos);
@@ -979,7 +979,7 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
 	old_sinfo.bh = NULL;
 	if (err)
 		goto error_dotdot;
-	old_dir->i_version++;
+	inode_inc_iversion(old_dir);
 	old_dir->i_ctime = old_dir->i_mtime = ts;
 	if (IS_DIRSYNC(old_dir))
 		(void)fat_sync_inode(old_dir);
-- 
2.14.3

* [PATCH 04/19] affs: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (2 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 03/19] fat: convert to new i_version API Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 05/19] afs: " Jeff Layton
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/affs/amigaffs.c | 4 ++--
 fs/affs/dir.c      | 4 ++--
 fs/affs/super.c    | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index 0f0e6925e97d..085a93bea1e2 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -60,7 +60,7 @@ affs_insert_hash(struct inode *dir, struct buffer_head *bh)
 	affs_brelse(dir_bh);
 
 	dir->i_mtime = dir->i_ctime = current_time(dir);
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	mark_inode_dirty(dir);
 
 	return 0;
@@ -114,7 +114,7 @@ affs_remove_hash(struct inode *dir, struct buffer_head *rem_bh)
 	affs_brelse(bh);
 
 	dir->i_mtime = dir->i_ctime = current_time(dir);
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	mark_inode_dirty(dir);
 
 	return retval;
diff --git a/fs/affs/dir.c b/fs/affs/dir.c
index a105e77df2c1..1a8c7b067c44 100644
--- a/fs/affs/dir.c
+++ b/fs/affs/dir.c
@@ -80,7 +80,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 	 * we can jump directly to where we left off.
 	 */
 	ino = (u32)(long)file->private_data;
-	if (ino && file->f_version == inode->i_version) {
+	if (ino && inode_cmp_iversion(inode, file->f_version) == 0) {
 		pr_debug("readdir() left off=%d\n", ino);
 		goto inside;
 	}
@@ -130,7 +130,7 @@ affs_readdir(struct file *file, struct dir_context *ctx)
 		} while (ino);
 	}
 done:
-	file->f_version = inode->i_version;
+	file->f_version = inode_query_iversion(inode);
 	file->private_data = (void *)(long)ino;
 	affs_brelse(fh_bh);
 
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 1117e36134cc..70596d4084de 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -102,7 +102,7 @@ static struct inode *affs_alloc_inode(struct super_block *sb)
 	if (!i)
 		return NULL;
 
-	i->vfs_inode.i_version = 1;
+	inode_set_iversion(&i->vfs_inode, 1);
 	i->i_lc = NULL;
 	i->i_ext_bh = NULL;
 	i->i_pa_cnt = 0;
-- 
2.14.3

* [PATCH 05/19] afs: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (3 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 04/19] affs: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 06/19] btrfs: " Jeff Layton
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

For AFS, i_version is generally treated as an opaque value, so we use
the *_raw variants of the API here.

Note that AFS has quite a different definition for this counter. AFS
only increments it on changes to the data, not to the metadata. We'll
need to reconcile that somehow if we ever want to present this to
userspace via statx.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/afs/fsclient.c | 2 +-
 fs/afs/inode.c    | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index b90ef39ae914..67ed9bb5fe31 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -124,7 +124,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
 		vnode->vfs_inode.i_ctime.tv_sec	= status->mtime_client;
 		vnode->vfs_inode.i_mtime	= vnode->vfs_inode.i_ctime;
 		vnode->vfs_inode.i_atime	= vnode->vfs_inode.i_ctime;
-		vnode->vfs_inode.i_version	= data_version;
+		inode_set_iversion_raw(&vnode->vfs_inode, data_version);
 	}
 
 	expected_version = status->data_version;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 3415eb7484f6..af9577210a46 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -89,7 +89,7 @@ static int afs_inode_map_status(struct afs_vnode *vnode, struct key *key)
 	inode->i_atime		= inode->i_mtime = inode->i_ctime;
 	inode->i_blocks		= 0;
 	inode->i_generation	= vnode->fid.unique;
-	inode->i_version	= vnode->status.data_version;
+	inode_set_iversion_raw(inode, vnode->status.data_version);
 	inode->i_mapping->a_ops	= &afs_fs_aops;
 
 	read_sequnlock_excl(&vnode->cb_lock);
@@ -218,7 +218,7 @@ struct inode *afs_iget_autocell(struct inode *dir, const char *dev_name,
 	inode->i_ctime.tv_nsec	= 0;
 	inode->i_atime		= inode->i_mtime = inode->i_ctime;
 	inode->i_blocks		= 0;
-	inode->i_version	= 0;
+	inode_set_iversion_raw(inode, 0);
 	inode->i_generation	= 0;
 
 	set_bit(AFS_VNODE_PSEUDODIR, &vnode->flags);
-- 
2.14.3

* [PATCH 06/19] btrfs: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (4 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 05/19] afs: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 07/19] exofs: switch " Jeff Layton
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/btrfs/delayed-inode.c | 6 ++++--
 fs/btrfs/inode.c         | 6 ++++--
 fs/btrfs/tree-log.c      | 3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 5d73f79ded8b..0019cd0447f6 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1700,7 +1700,8 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_stack_inode_nbytes(inode_item, inode_get_bytes(inode));
 	btrfs_set_stack_inode_generation(inode_item,
 					 BTRFS_I(inode)->generation);
-	btrfs_set_stack_inode_sequence(inode_item, inode->i_version);
+	btrfs_set_stack_inode_sequence(inode_item,
+				       inode_query_iversion(inode));
 	btrfs_set_stack_inode_transid(inode_item, trans->transid);
 	btrfs_set_stack_inode_rdev(inode_item, inode->i_rdev);
 	btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
@@ -1754,7 +1755,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 	BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
         BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 
-	inode->i_version = btrfs_stack_inode_sequence(inode_item);
+	inode_set_iversion_queried(inode,
+				   btrfs_stack_inode_sequence(inode_item));
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
 	BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e1a7f3cb5be9..ac25389b39de 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3777,7 +3777,8 @@ static int btrfs_read_locked_inode(struct inode *inode)
 	BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
 	BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
-	inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+	inode_set_iversion_queried(inode,
+				   btrfs_inode_sequence(leaf, inode_item));
 	inode->i_generation = BTRFS_I(inode)->generation;
 	inode->i_rdev = 0;
 	rdev = btrfs_inode_rdev(leaf, inode_item);
@@ -3945,7 +3946,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 				     &token);
 	btrfs_set_token_inode_generation(leaf, item, BTRFS_I(inode)->generation,
 					 &token);
-	btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+	btrfs_set_token_inode_sequence(leaf, item,
+				       inode_query_iversion(inode), &token);
 	btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
 	btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
 	btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7bf9b31561db..3c58a68d2139 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3609,7 +3609,8 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_inode_nbytes(leaf, item, inode_get_bytes(inode),
 				     &token);
 
-	btrfs_set_token_inode_sequence(leaf, item, inode->i_version, &token);
+	btrfs_set_token_inode_sequence(leaf, item,
+				       inode_query_iversion(inode), &token);
 	btrfs_set_token_inode_transid(leaf, item, trans->transid, &token);
 	btrfs_set_token_inode_rdev(leaf, item, inode->i_rdev, &token);
 	btrfs_set_token_inode_flags(leaf, item, BTRFS_I(inode)->flags, &token);
-- 
2.14.3


* [PATCH 07/19] exofs: switch to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (5 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 06/19] btrfs: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 08/19] ext2: convert " Jeff Layton
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/exofs/dir.c   | 8 ++++----
 fs/exofs/super.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/exofs/dir.c b/fs/exofs/dir.c
index 98233a97b7b8..36e314083b80 100644
--- a/fs/exofs/dir.c
+++ b/fs/exofs/dir.c
@@ -60,7 +60,7 @@ static int exofs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 	struct inode *dir = mapping->host;
 	int err = 0;
 
-	dir->i_version++;
+	inode_inc_iversion(dir);
 
 	if (!PageUptodate(page))
 		SetPageUptodate(page);
@@ -241,7 +241,7 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
 	unsigned long n = pos >> PAGE_SHIFT;
 	unsigned long npages = dir_pages(inode);
 	unsigned chunk_mask = ~(exofs_chunk_size(inode)-1);
-	int need_revalidate = (file->f_version != inode->i_version);
+	bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
 	if (pos > inode->i_size - EXOFS_DIR_REC_LEN(1))
 		return 0;
@@ -264,8 +264,8 @@ exofs_readdir(struct file *file, struct dir_context *ctx)
 								chunk_mask);
 				ctx->pos = (n<<PAGE_SHIFT) + offset;
 			}
-			file->f_version = inode->i_version;
-			need_revalidate = 0;
+			file->f_version = inode_query_iversion(inode);
+			need_revalidate = false;
 		}
 		de = (struct exofs_dir_entry *)(kaddr + offset);
 		limit = kaddr + exofs_last_byte(inode, n) -
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 819624cfc8da..6746999137df 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -159,7 +159,7 @@ static struct inode *exofs_alloc_inode(struct super_block *sb)
 	if (!oi)
 		return NULL;
 
-	oi->vfs_inode.i_version = 1;
+	inode_set_iversion(&oi->vfs_inode, 1);
 	return &oi->vfs_inode;
 }
 
-- 
2.14.3


* [PATCH 08/19] ext2: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (6 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 07/19] exofs: switch " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-18 12:47   ` Jan Kara
  2017-12-13 14:20 ` [PATCH 09/19] ext4: " Jeff Layton
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ext2/dir.c   | 8 ++++----
 fs/ext2/super.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 987647986f47..c99f14fec3f3 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -92,7 +92,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
 	struct inode *dir = mapping->host;
 	int err = 0;
 
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 
 	if (pos+len > dir->i_size) {
@@ -293,7 +293,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
 	unsigned long npages = dir_pages(inode);
 	unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
 	unsigned char *types = NULL;
-	int need_revalidate = file->f_version != inode->i_version;
+	bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 
 	if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
 		return 0;
@@ -319,8 +319,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
 				offset = ext2_validate_entry(kaddr, offset, chunk_mask);
 				ctx->pos = (n<<PAGE_SHIFT) + offset;
 			}
-			file->f_version = inode->i_version;
-			need_revalidate = 0;
+			file->f_version = inode_query_iversion(inode);
+			need_revalidate = false;
 		}
 		de = (ext2_dirent *)(kaddr+offset);
 		limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 7646818ab266..dd7c3c81d918 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -184,7 +184,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
 	if (!ei)
 		return NULL;
 	ei->i_block_alloc_info = NULL;
-	ei->vfs_inode.i_version = 1;
+	inode_set_iversion(&ei->vfs_inode, 1);
 #ifdef CONFIG_QUOTA
 	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
 #endif
@@ -1569,7 +1569,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
 		return err;
 	if (inode->i_size < off+len-towrite)
 		i_size_write(inode, off+len-towrite);
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	inode->i_mtime = inode->i_ctime = current_time(inode);
 	mark_inode_dirty(inode);
 	return len - towrite;
-- 
2.14.3


* [PATCH 09/19] ext4: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (7 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 08/19] ext2: convert " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-14 21:52   ` Theodore Ts'o
  2017-12-13 14:20 ` [PATCH 10/19] nfs: " Jeff Layton
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ext4/dir.c    |  8 ++++----
 fs/ext4/inline.c |  6 +++---
 fs/ext4/inode.c  | 12 ++++++++----
 fs/ext4/ioctl.c  |  2 +-
 fs/ext4/namei.c  |  4 ++--
 fs/ext4/super.c  |  2 +-
 fs/ext4/xattr.c  |  4 ++--
 7 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..18eb37395a45 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -208,7 +208,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 		 * readdir(2), then we might be pointing to an invalid
 		 * dirent right now.  Scan from the start of the block
 		 * to make sure. */
-		if (file->f_version != inode->i_version) {
+		if (inode_cmp_iversion(inode, file->f_version)) {
 			for (i = 0; i < sb->s_blocksize && i < offset; ) {
 				de = (struct ext4_dir_entry_2 *)
 					(bh->b_data + i);
@@ -227,7 +227,7 @@ static int ext4_readdir(struct file *file, struct dir_context *ctx)
 			offset = i;
 			ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
 				| offset;
-			file->f_version = inode->i_version;
+			file->f_version = inode_query_iversion(inode);
 		}
 
 		while (ctx->pos < inode->i_size
@@ -568,10 +568,10 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx)
 		 * cached entries.
 		 */
 		if ((!info->curr_node) ||
-		    (file->f_version != inode->i_version)) {
+		    inode_cmp_iversion(inode, file->f_version)) {
 			info->curr_node = NULL;
 			free_rb_tree_fname(&info->root);
-			file->f_version = inode->i_version;
+			file->f_version = inode_query_iversion(inode);
 			ret = ext4_htree_fill_tree(file, info->curr_hash,
 						   info->curr_minor_hash,
 						   &info->next_hash);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 1367553c43bb..a4a621eb80c1 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1042,7 +1042,7 @@ static int ext4_add_dirent_to_inline(handle_t *handle,
 	 */
 	dir->i_mtime = dir->i_ctime = current_time(dir);
 	ext4_update_dx_flag(dir);
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	return 1;
 }
 
@@ -1494,7 +1494,7 @@ int ext4_read_inline_dir(struct file *file,
 	 * dirent right now.  Scan from the start of the inline
 	 * dir to make sure.
 	 */
-	if (file->f_version != inode->i_version) {
+	if (inode_cmp_iversion(inode, file->f_version)) {
 		for (i = 0; i < extra_size && i < offset;) {
 			/*
 			 * "." is with offset 0 and
@@ -1526,7 +1526,7 @@ int ext4_read_inline_dir(struct file *file,
 		}
 		offset = i;
 		ctx->pos = offset;
-		file->f_version = inode->i_version;
+		file->f_version = inode_query_iversion(inode);
 	}
 
 	while (ctx->pos < extra_size) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..1f8ec9cd042c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4873,12 +4873,14 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
 	EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
 	if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-		inode->i_version = le32_to_cpu(raw_inode->i_disk_version);
+		u64 ivers = le32_to_cpu(raw_inode->i_disk_version);
+
 		if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
 			if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
-				inode->i_version |=
+				ivers |=
 		    (__u64)(le32_to_cpu(raw_inode->i_version_hi)) << 32;
 		}
+		inode_set_iversion_queried(inode, ivers);
 	}
 
 	ret = 0;
@@ -5164,11 +5166,13 @@ static int ext4_do_update_inode(handle_t *handle,
 	}
 
 	if (likely(!test_opt2(inode->i_sb, HURD_COMPAT))) {
-		raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+		u64 ivers = inode_peek_iversion(inode);
+
+		raw_inode->i_disk_version = cpu_to_le32(ivers);
 		if (ei->i_extra_isize) {
 			if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
 				raw_inode->i_version_hi =
-					cpu_to_le32(inode->i_version >> 32);
+					cpu_to_le32(ivers >> 32);
 			raw_inode->i_extra_isize =
 				cpu_to_le16(ei->i_extra_isize);
 		}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1eec25014f62..f2bed34869cd 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -144,7 +144,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 		i_gid_write(inode_bl, 0);
 		inode_bl->i_flags = 0;
 		ei_bl->i_flags = 0;
-		inode_bl->i_version = 1;
+		inode_set_iversion(inode_bl, 1);
 		i_size_write(inode_bl, 0);
 		inode_bl->i_mode = S_IFREG;
 		if (ext4_has_feature_extents(sb)) {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 798b3ac680db..fe91d6a8c98d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2955,7 +2955,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 			     "empty directory '%.*s' has too many links (%u)",
 			     dentry->d_name.len, dentry->d_name.name,
 			     inode->i_nlink);
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	clear_nlink(inode);
 	/* There's no need to set i_disksize: the fact that i_nlink is
 	 * zero will ensure that the right thing happens during any
@@ -3361,7 +3361,7 @@ static int ext4_setent(handle_t *handle, struct ext4_renament *ent,
 	ent->de->inode = cpu_to_le32(ino);
 	if (ext4_has_feature_filetype(ent->dir->i_sb))
 		ent->de->file_type = file_type;
-	ent->dir->i_version++;
+	inode_inc_iversion(ent->dir);
 	ent->dir->i_ctime = ent->dir->i_mtime =
 		current_time(ent->dir);
 	ext4_mark_inode_dirty(handle, ent->dir);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7c46693a14d7..5663547e97f8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -967,7 +967,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	if (!ei)
 		return NULL;
 
-	ei->vfs_inode.i_version = 1;
+	inode_set_iversion(&ei->vfs_inode, 1);
 	spin_lock_init(&ei->i_raw_lock);
 	INIT_LIST_HEAD(&ei->i_prealloc_list);
 	spin_lock_init(&ei->i_prealloc_lock);
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 218a7ba57819..d8e7baa681b5 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -294,13 +294,13 @@ ext4_xattr_inode_hash(struct ext4_sb_info *sbi, const void *buffer, size_t size)
 static u64 ext4_xattr_inode_get_ref(struct inode *ea_inode)
 {
 	return ((u64)ea_inode->i_ctime.tv_sec << 32) |
-	       ((u32)ea_inode->i_version);
+		(u32)inode_peek_iversion(ea_inode);
 }
 
 static void ext4_xattr_inode_set_ref(struct inode *ea_inode, u64 ref_count)
 {
 	ea_inode->i_ctime.tv_sec = (u32)(ref_count >> 32);
-	ea_inode->i_version = (u32)ref_count;
+	inode_set_iversion(ea_inode, ref_count & 0xffffffff);
 }
 
 static u32 ext4_xattr_inode_get_hash(struct inode *ea_inode)
-- 
2.14.3


* [PATCH 10/19] nfs: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (8 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 09/19] ext4: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 11/19] nfsd: " Jeff Layton
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

For NFS, we just use the "raw" API since the i_version is mostly
managed by the server. The exception there is when the client
holds a write delegation, but we only need to bump it once
there anyway to handle CB_GETATTR.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/delegation.c    |  2 +-
 fs/nfs/fscache-index.c |  4 ++--
 fs/nfs/inode.c         | 16 ++++++++--------
 fs/nfs/nfs4proc.c      |  9 +++++----
 fs/nfs/nfstrace.h      |  4 ++--
 fs/nfs/write.c         |  7 ++-----
 6 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index ade44ca0c66c..89b22b2a111c 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -347,7 +347,7 @@ int nfs_inode_set_delegation(struct inode *inode, struct rpc_cred *cred, struct
 	nfs4_stateid_copy(&delegation->stateid, &res->delegation);
 	delegation->type = res->delegation_type;
 	delegation->pagemod_limit = res->pagemod_limit;
-	delegation->change_attr = inode->i_version;
+	delegation->change_attr = inode_peek_iversion_raw(inode);
 	delegation->cred = get_rpccred(cred);
 	delegation->inode = inode;
 	delegation->flags = 1<<NFS_DELEGATION_REFERENCED;
diff --git a/fs/nfs/fscache-index.c b/fs/nfs/fscache-index.c
index 3025fe8584a0..8bf080ce4984 100644
--- a/fs/nfs/fscache-index.c
+++ b/fs/nfs/fscache-index.c
@@ -211,7 +211,7 @@ static uint16_t nfs_fscache_inode_get_aux(const void *cookie_netfs_data,
 	auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
 	if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
-		auxdata.change_attr = nfsi->vfs_inode.i_version;
+		auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);
 
 	if (bufmax > sizeof(auxdata))
 		bufmax = sizeof(auxdata);
@@ -243,7 +243,7 @@ enum fscache_checkaux nfs_fscache_inode_check_aux(void *cookie_netfs_data,
 	auxdata.ctime = nfsi->vfs_inode.i_ctime;
 
 	if (NFS_SERVER(&nfsi->vfs_inode)->nfs_client->rpc_ops->version == 4)
-		auxdata.change_attr = nfsi->vfs_inode.i_version;
+		auxdata.change_attr = inode_peek_iversion_raw(&nfsi->vfs_inode);
 
 	if (memcmp(data, &auxdata, datalen) != 0)
 		return FSCACHE_CHECKAUX_OBSOLETE;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index b992d2382ffa..c55b86c87bce 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -483,7 +483,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
 		memset(&inode->i_atime, 0, sizeof(inode->i_atime));
 		memset(&inode->i_mtime, 0, sizeof(inode->i_mtime));
 		memset(&inode->i_ctime, 0, sizeof(inode->i_ctime));
-		inode->i_version = 0;
+		inode_set_iversion_raw(inode, 0);
 		inode->i_size = 0;
 		clear_nlink(inode);
 		inode->i_uid = make_kuid(&init_user_ns, -2);
@@ -508,7 +508,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr, st
 		else if (nfs_server_capable(inode, NFS_CAP_CTIME))
 			nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR);
 		if (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-			inode->i_version = fattr->change_attr;
+			inode_set_iversion_raw(inode, fattr->change_attr);
 		else
 			nfs_set_cache_invalid(inode, NFS_INO_INVALID_ATTR
 				| NFS_INO_REVAL_PAGECACHE);
@@ -1289,8 +1289,8 @@ static unsigned long nfs_wcc_update_inode(struct inode *inode, struct nfs_fattr
 
 	if ((fattr->valid & NFS_ATTR_FATTR_PRECHANGE)
 			&& (fattr->valid & NFS_ATTR_FATTR_CHANGE)
-			&& inode->i_version == fattr->pre_change_attr) {
-		inode->i_version = fattr->change_attr;
+			&& !inode_cmp_iversion(inode, fattr->pre_change_attr)) {
+		inode_set_iversion_raw(inode, fattr->change_attr);
 		if (S_ISDIR(inode->i_mode))
 			nfs_set_cache_invalid(inode, NFS_INO_INVALID_DATA);
 		ret |= NFS_INO_INVALID_ATTR;
@@ -1348,7 +1348,7 @@ static int nfs_check_inode_attributes(struct inode *inode, struct nfs_fattr *fat
 
 	if (!nfs_file_has_buffered_writers(nfsi)) {
 		/* Verify a few of the more important attributes */
-		if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode->i_version != fattr->change_attr)
+		if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 && inode_cmp_iversion(inode, fattr->change_attr))
 			invalid |= NFS_INO_INVALID_ATTR | NFS_INO_REVAL_PAGECACHE;
 
 		if ((fattr->valid & NFS_ATTR_FATTR_MTIME) && !timespec_equal(&inode->i_mtime, &fattr->mtime))
@@ -1642,7 +1642,7 @@ int nfs_post_op_update_inode_force_wcc_locked(struct inode *inode, struct nfs_fa
 	}
 	if ((fattr->valid & NFS_ATTR_FATTR_CHANGE) != 0 &&
 			(fattr->valid & NFS_ATTR_FATTR_PRECHANGE) == 0) {
-		fattr->pre_change_attr = inode->i_version;
+		fattr->pre_change_attr = inode_peek_iversion_raw(inode);
 		fattr->valid |= NFS_ATTR_FATTR_PRECHANGE;
 	}
 	if ((fattr->valid & NFS_ATTR_FATTR_CTIME) != 0 &&
@@ -1778,7 +1778,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 
 	/* More cache consistency checks */
 	if (fattr->valid & NFS_ATTR_FATTR_CHANGE) {
-		if (inode->i_version != fattr->change_attr) {
+		if (inode_cmp_iversion(inode, fattr->change_attr)) {
 			dprintk("NFS: change_attr change on server for file %s/%ld\n",
 					inode->i_sb->s_id, inode->i_ino);
 			/* Could it be a race with writeback? */
@@ -1790,7 +1790,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 				if (S_ISDIR(inode->i_mode))
 					nfs_force_lookup_revalidate(inode);
 			}
-			inode->i_version = fattr->change_attr;
+			inode_set_iversion_raw(inode, fattr->change_attr);
 		}
 	} else {
 		nfsi->cache_validity |= save_cache_validity;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 56fa5a16e097..67ce69bce04e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1045,16 +1045,16 @@ static void update_changeattr(struct inode *dir, struct nfs4_change_info *cinfo,
 
 	spin_lock(&dir->i_lock);
 	nfsi->cache_validity |= NFS_INO_INVALID_ATTR|NFS_INO_INVALID_DATA;
-	if (cinfo->atomic && cinfo->before == dir->i_version) {
+	if (cinfo->atomic && cinfo->before == inode_peek_iversion_raw(dir)) {
 		nfsi->cache_validity &= ~NFS_INO_REVAL_PAGECACHE;
 		nfsi->attrtimeo_timestamp = jiffies;
 	} else {
 		nfs_force_lookup_revalidate(dir);
-		if (cinfo->before != dir->i_version)
+		if (cinfo->before != inode_peek_iversion_raw(dir))
 			nfsi->cache_validity |= NFS_INO_INVALID_ACCESS |
 				NFS_INO_INVALID_ACL;
 	}
-	dir->i_version = cinfo->after;
+	inode_set_iversion_raw(dir, cinfo->after);
 	nfsi->read_cache_jiffies = timestamp;
 	nfsi->attr_gencount = nfs_inc_attr_generation_counter();
 	nfs_fscache_invalidate(dir);
@@ -2454,7 +2454,8 @@ static int _nfs4_proc_open(struct nfs4_opendata *data)
 			data->file_created = true;
 		else if (o_res->cinfo.before != o_res->cinfo.after)
 			data->file_created = true;
-		if (data->file_created || dir->i_version != o_res->cinfo.after)
+		if (data->file_created ||
+		    inode_peek_iversion_raw(dir) != o_res->cinfo.after)
 			update_changeattr(dir, &o_res->cinfo,
 					o_res->f_attr->time_start);
 	}
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 093290c42d7c..3e4b20d8994a 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -61,7 +61,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event,
 			__entry->dev = inode->i_sb->s_dev;
 			__entry->fileid = nfsi->fileid;
 			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
-			__entry->version = inode->i_version;
+			__entry->version = inode_peek_iversion_raw(inode);
 		),
 
 		TP_printk(
@@ -100,7 +100,7 @@ DECLARE_EVENT_CLASS(nfs_inode_event_done,
 			__entry->fileid = nfsi->fileid;
 			__entry->fhandle = nfs_fhandle_hash(&nfsi->fh);
 			__entry->type = nfs_umode_to_dtype(inode->i_mode);
-			__entry->version = inode->i_version;
+			__entry->version = inode_peek_iversion_raw(inode);
 			__entry->size = i_size_read(inode);
 			__entry->nfsi_flags = nfsi->flags;
 			__entry->cache_validity = nfsi->cache_validity;
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 5b5f464f6f2a..258f2c85b921 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -753,11 +753,8 @@ static void nfs_inode_add_request(struct inode *inode, struct nfs_page *req)
 	 */
 	spin_lock(&mapping->private_lock);
 	if (!nfs_have_writebacks(inode) &&
-	    NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE)) {
-		spin_lock(&inode->i_lock);
-		inode->i_version++;
-		spin_unlock(&inode->i_lock);
-	}
+	    NFS_PROTO(inode)->have_delegation(inode, FMODE_WRITE))
+		inode_inc_iversion(inode);
 	if (likely(!PageSwapCache(req->wb_page))) {
 		set_bit(PG_MAPPED, &req->wb_flags);
 		SetPagePrivate(req->wb_page);
-- 
2.14.3


* [PATCH 11/19] nfsd: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (9 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 10/19] nfs: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 12/19] ocfs2: " Jeff Layton
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Mostly just making sure we use the "query" wrapper so that we know when
the value is being fetched for later use.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfsd/nfsfh.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index 43f31cf49bae..7dedda5e3ed4 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -259,7 +259,7 @@ static inline u64 nfsd4_change_attribute(struct inode *inode)
 	chattr =  inode->i_ctime.tv_sec;
 	chattr <<= 30;
 	chattr += inode->i_ctime.tv_nsec;
-	chattr += inode->i_version;
+	chattr += inode_query_iversion(inode);
 	return chattr;
 }
 
-- 
2.14.3


* [PATCH 12/19] ocfs2: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (10 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 11/19] nfsd: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-18 12:49   ` Jan Kara
  2017-12-13 14:20 ` [PATCH 13/19] ufs: use " Jeff Layton
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ocfs2/dir.c          | 14 +++++++-------
 fs/ocfs2/inode.c        |  2 +-
 fs/ocfs2/namei.c        |  2 +-
 fs/ocfs2/quota_global.c |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index febe6312ceff..fe2c430a7809 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -1174,7 +1174,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir,
 				le16_add_cpu(&pde->rec_len,
 						le16_to_cpu(de->rec_len));
 			de->inode = 0;
-			dir->i_version++;
+			inode_inc_iversion(dir);
 			ocfs2_journal_dirty(handle, bh);
 			goto bail;
 		}
@@ -1729,7 +1729,7 @@ int __ocfs2_add_entry(handle_t *handle,
 			if (ocfs2_dir_indexed(dir))
 				ocfs2_recalc_free_list(dir, handle, lookup);
 
-			dir->i_version++;
+			inode_inc_iversion(dir);
 			ocfs2_journal_dirty(handle, insert_bh);
 			retval = 0;
 			goto bail;
@@ -1775,7 +1775,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 		 * readdir(2), then we might be pointing to an invalid
 		 * dirent right now.  Scan from the start of the block
 		 * to make sure. */
-		if (*f_version != inode->i_version) {
+		if (inode_cmp_iversion(inode, *f_version)) {
 			for (i = 0; i < i_size_read(inode) && i < offset; ) {
 				de = (struct ocfs2_dir_entry *)
 					(data->id_data + i);
@@ -1791,7 +1791,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
 				i += le16_to_cpu(de->rec_len);
 			}
 			ctx->pos = offset = i;
-			*f_version = inode->i_version;
+			*f_version = inode_query_iversion(inode);
 		}
 
 		de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
@@ -1869,7 +1869,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 		 * readdir(2), then we might be pointing to an invalid
 		 * dirent right now.  Scan from the start of the block
 		 * to make sure. */
-		if (*f_version != inode->i_version) {
+		if (inode_cmp_iversion(inode, *f_version)) {
 			for (i = 0; i < sb->s_blocksize && i < offset; ) {
 				de = (struct ocfs2_dir_entry *) (bh->b_data + i);
 				/* It's too expensive to do a full
@@ -1886,7 +1886,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
 			offset = i;
 			ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
 				| offset;
-			*f_version = inode->i_version;
+			*f_version = inode_query_iversion(inode);
 		}
 
 		while (ctx->pos < i_size_read(inode)
@@ -1940,7 +1940,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version,
  */
 int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
 {
-	u64 version = inode->i_version;
+	u64 version = inode_query_iversion(inode);
 	ocfs2_dir_foreach_blk(inode, &version, ctx, true);
 	return 0;
 }
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 1a1e0078ab38..71ce64665a18 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -302,7 +302,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
 	OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
 	OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
 
-	inode->i_version = 1;
+	inode_set_iversion(inode, 1);
 	inode->i_generation = le32_to_cpu(fe->i_generation);
 	inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
 	inode->i_mode = le16_to_cpu(fe->i_mode);
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 3b0a10d9b36f..c045826b716a 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
 			mlog_errno(status);
 			goto bail;
 		}
-		new_dir->i_version++;
+		inode_inc_iversion(new_dir);
 
 		if (S_ISDIR(new_inode->i_mode))
 			ocfs2_set_links_count(newfe, 0);
diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
index b39d14cbfa34..e7595a63da43 100644
--- a/fs/ocfs2/quota_global.c
+++ b/fs/ocfs2/quota_global.c
@@ -289,7 +289,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,
 		mlog_errno(err);
 		return err;
 	}
-	gqinode->i_version++;
+	inode_inc_iversion(gqinode);
 	ocfs2_mark_inode_dirty(handle, gqinode, oinfo->dqi_gqi_bh);
 	return len;
 }
-- 
2.14.3


* [PATCH 13/19] ufs: use new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (11 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 12/19] ocfs2: " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 14/19] xfs: convert to " Jeff Layton
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/ufs/dir.c   | 8 ++++----
 fs/ufs/inode.c | 2 +-
 fs/ufs/super.c | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/ufs/dir.c b/fs/ufs/dir.c
index 2edc1755b7c5..547c4d4c4db8 100644
--- a/fs/ufs/dir.c
+++ b/fs/ufs/dir.c
@@ -47,7 +47,7 @@ static int ufs_commit_chunk(struct page *page, loff_t pos, unsigned len)
 	struct inode *dir = mapping->host;
 	int err = 0;
 
-	dir->i_version++;
+	inode_inc_iversion(dir);
 	block_write_end(NULL, mapping, pos, len, len, page, NULL);
 	if (pos+len > dir->i_size) {
 		i_size_write(dir, pos+len);
@@ -428,7 +428,7 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
 	unsigned long n = pos >> PAGE_SHIFT;
 	unsigned long npages = dir_pages(inode);
 	unsigned chunk_mask = ~(UFS_SB(sb)->s_uspi->s_dirblksize - 1);
-	int need_revalidate = file->f_version != inode->i_version;
+	bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
 	unsigned flags = UFS_SB(sb)->s_flags;
 
 	UFSD("BEGIN\n");
@@ -455,8 +455,8 @@ ufs_readdir(struct file *file, struct dir_context *ctx)
 				offset = ufs_validate_entry(sb, kaddr, offset, chunk_mask);
 				ctx->pos = (n<<PAGE_SHIFT) + offset;
 			}
-			file->f_version = inode->i_version;
-			need_revalidate = 0;
+			file->f_version = inode_query_iversion(inode);
+			need_revalidate = false;
 		}
 		de = (struct ufs_dir_entry *)(kaddr+offset);
 		limit = kaddr + ufs_last_byte(inode, n) - UFS_DIR_REC_LEN(1);
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index afb601c0dda0..58cb6dfb116d 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -693,7 +693,7 @@ struct inode *ufs_iget(struct super_block *sb, unsigned long ino)
 	if (err)
 		goto bad_inode;
 
-	inode->i_version++;
+	inode_inc_iversion(inode);
 	ufsi->i_lastfrag =
 		(inode->i_size + uspi->s_fsize - 1) >> uspi->s_fshift;
 	ufsi->i_dir_start_lookup = 0;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 4d497e9c6883..7dee0b07a571 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1440,7 +1440,7 @@ static struct inode *ufs_alloc_inode(struct super_block *sb)
 	if (!ei)
 		return NULL;
 
-	ei->vfs_inode.i_version = 1;
+	inode_set_iversion(&ei->vfs_inode, 1);
 	seqlock_init(&ei->meta_lock);
 	mutex_init(&ei->truncate_mutex);
 	return &ei->vfs_inode;
-- 
2.14.3

* [PATCH 14/19] xfs: convert to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (12 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 13/19] ufs: use " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 22:48   ` Dave Chinner
  2017-12-13 14:20 ` [PATCH 15/19] IMA: switch IMA over " Jeff Layton
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
 fs/xfs/xfs_icache.c           | 4 ++--
 fs/xfs/xfs_inode.c            | 2 +-
 fs/xfs/xfs_inode_item.c       | 2 +-
 fs/xfs/xfs_trans_inode.c      | 2 +-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6b7989038d75..6b47de201391 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -264,7 +264,8 @@ xfs_inode_from_disk(
 	to->di_flags	= be16_to_cpu(from->di_flags);
 
 	if (to->di_version == 3) {
-		inode->i_version = be64_to_cpu(from->di_changecount);
+		inode_set_iversion_queried(inode,
+					   be64_to_cpu(from->di_changecount));
 		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
 		to->di_flags2 = be64_to_cpu(from->di_flags2);
@@ -314,7 +315,7 @@ xfs_inode_to_disk(
 	to->di_flags = cpu_to_be16(from->di_flags);
 
 	if (from->di_version == 3) {
-		to->di_changecount = cpu_to_be64(inode->i_version);
+		to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
 		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..4838462616fd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -293,14 +293,14 @@ xfs_reinit_inode(
 	int		error;
 	uint32_t	nlink = inode->i_nlink;
 	uint32_t	generation = inode->i_generation;
-	uint64_t	version = inode->i_version;
+	uint64_t	version = inode_peek_iversion_raw(inode);
 	umode_t		mode = inode->i_mode;
 
 	error = inode_init_always(mp->m_super, inode);
 
 	set_nlink(inode, nlink);
 	inode->i_generation = generation;
-	inode->i_version = version;
+	inode_set_iversion_queried(inode, version);
 	inode->i_mode = mode;
 	return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..be6d87980dd5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -833,7 +833,7 @@ xfs_ialloc(
 	ip->i_d.di_flags = 0;
 
 	if (ip->i_d.di_version == 3) {
-		inode->i_version = 1;
+		inode_set_iversion(inode, 1);
 		ip->i_d.di_flags2 = 0;
 		ip->i_d.di_cowextsize = 0;
 		ip->i_d.di_crtime.t_sec = (int32_t)tv.tv_sec;
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..1a20dbbd34e4 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -354,7 +354,7 @@ xfs_inode_to_log_dinode(
 	to->di_next_unlinked = NULLAGINO;
 
 	if (from->di_version == 3) {
-		to->di_changecount = inode->i_version;
+		to->di_changecount = inode_peek_iversion_raw(inode);
 		to->di_crtime.t_sec = from->di_crtime.t_sec;
 		to->di_crtime.t_nsec = from->di_crtime.t_nsec;
 		to->di_flags2 = from->di_flags2;
diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index daa7615497f9..b41a92d18140 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -117,7 +117,7 @@ xfs_trans_log_inode(
 	 */
 	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
 	    IS_I_VERSION(VFS_I(ip))) {
-		VFS_I(ip)->i_version++;
+		inode_inc_iversion(VFS_I(ip));
 		flags |= XFS_ILOG_CORE;
 	}
 
-- 
2.14.3

* [PATCH 15/19] IMA: switch IMA over to new i_version API
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (13 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 14/19] xfs: convert to " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 security/integrity/ima/ima_api.c  | 2 +-
 security/integrity/ima/ima_main.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index c7e8db0ea4c0..588d4c05eb1e 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -215,7 +215,7 @@ int ima_collect_measurement(struct integrity_iint_cache *iint,
 	 * which do not support i_version, support is limited to an initial
 	 * measurement/appraisal/audit.
 	 */
-	i_version = file_inode(file)->i_version;
+	i_version = inode_query_iversion(inode);
 	hash.hdr.algo = algo;
 
 	/* Initialize hash digest to 0's in case of failure */
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 50b82599994d..2ba3a12ff33c 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -128,7 +128,7 @@ static void ima_check_last_writer(struct integrity_iint_cache *iint,
 	inode_lock(inode);
 	if (atomic_read(&inode->i_writecount) == 1) {
 		if (!IS_I_VERSION(inode) ||
-		    (iint->version != inode->i_version) ||
+		    inode_cmp_iversion(inode, iint->version) ||
 		    (iint->flags & IMA_NEW_FILE)) {
 			iint->flags &= ~(IMA_DONE_MASK | IMA_NEW_FILE);
 			iint->measured_pcrs = 0;
-- 
2.14.3

* [PATCH 16/19] fs: only set S_VERSION when updating times if necessary
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (14 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 15/19] IMA: switch IMA over " Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-15 12:59   ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing Jeff Layton
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

We only really need to update i_version if someone has queried for it
since we last incremented it. By skipping the increment otherwise, we
can avoid having to update the inode if the times haven't changed.

If the times have changed, then we go ahead and forcibly increment the
counter, under the assumption that we'll be going to the storage
anyway, and the increment itself is relatively cheap.
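
As a rough illustration of the decision described above, here is a
minimal userspace sketch. It is not the kernel code: the M_* flag
values, struct model_inode and both model_* helpers are hypothetical
stand-ins, and lazytime handling is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for S_ATIME, S_MTIME, S_CTIME, S_VERSION. */
enum { M_ATIME = 1, M_MTIME = 2, M_CTIME = 4, M_VERSION = 8 };

/* Hypothetical, non-atomic stand-in for the relevant inode state. */
struct model_inode {
	uint64_t counter;	/* the change counter */
	bool queried;		/* has anyone looked since the last bump? */
};

/* Bump the counter only if forced or if it was queried since the last bump. */
static bool model_maybe_inc(struct model_inode *inode, bool force)
{
	if (!force && !inode->queried)
		return false;
	inode->counter++;
	inode->queried = false;
	return true;
}

/*
 * Returns true when the inode needs to be marked dirty: either a
 * timestamp is being updated, or the counter really had to change.
 */
static bool model_update_time(struct model_inode *inode, int flags)
{
	/* Any timestamp flag means we go to storage anyway. */
	bool dirty = flags & ~M_VERSION;

	/* A pending timestamp update forces the (cheap) increment. */
	if (flags & M_VERSION)
		dirty |= model_maybe_inc(inode, dirty);
	return dirty;
}
```

With only M_VERSION set and no query since the last bump, the sketch
reports nothing to write back; adding any timestamp flag forces the
increment.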

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/inode.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..7f4215f4309c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1634,17 +1634,18 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
 int generic_update_time(struct inode *inode, struct timespec *time, int flags)
 {
 	int iflags = I_DIRTY_TIME;
+	bool dirty = flags & ~S_VERSION;
 
 	if (flags & S_ATIME)
 		inode->i_atime = *time;
-	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
 	if (flags & S_CTIME)
 		inode->i_ctime = *time;
 	if (flags & S_MTIME)
 		inode->i_mtime = *time;
+	if (flags & S_VERSION)
+		dirty |= inode_maybe_inc_iversion(inode, dirty);
 
-	if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
+	if (dirty || !(inode->i_sb->s_flags & SB_LAZYTIME))
 		iflags |= I_DIRTY_SYNC;
 	__mark_inode_dirty(inode, iflags);
 	return 0;
@@ -1863,7 +1864,7 @@ int file_update_time(struct file *file)
 	if (!timespec_equal(&inode->i_ctime, &now))
 		sync_it |= S_CTIME;
 
-	if (IS_I_VERSION(inode))
+	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
 		sync_it |= S_VERSION;
 
 	if (!sync_it)
-- 
2.14.3

* [PATCH 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (15 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

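If XFS_ILOG_CORE is already set, then go ahead and increment i_version
unconditionally. Otherwise, only set XFS_ILOG_CORE if the counter
actually had to be bumped.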

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/xfs/xfs_trans_inode.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_trans_inode.c b/fs/xfs/xfs_trans_inode.c
index b41a92d18140..18635c5b5dc6 100644
--- a/fs/xfs/xfs_trans_inode.c
+++ b/fs/xfs/xfs_trans_inode.c
@@ -110,15 +110,17 @@ xfs_trans_log_inode(
 
 	/*
 	 * First time we log the inode in a transaction, bump the inode change
-	 * counter if it is configured for this to occur. We don't use
-	 * inode_inc_version() because there is no need for extra locking around
-	 * i_version as we already hold the inode locked exclusively for
-	 * metadata modification.
+	 * counter if it is configured for this to occur. While we have the
+	 * inode locked exclusively for metadata modification, we can usually
+	 * avoid setting XFS_ILOG_CORE if no one has queried the value since
+	 * the last time it was incremented. If we have XFS_ILOG_CORE already
+	 * set however, then go ahead and bump the i_version counter
+	 * unconditionally.
 	 */
 	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
 	    IS_I_VERSION(VFS_I(ip))) {
-		inode_inc_iversion(VFS_I(ip));
-		flags |= XFS_ILOG_CORE;
+		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
+			flags |= XFS_ILOG_CORE;
 	}
 
 	tp->t_flags |= XFS_TRANS_DIRTY;
-- 
2.14.3

* [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (16 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-15 13:03   ` Jeff Layton
  2017-12-13 14:20 ` [PATCH 19/19] fs: handle inode->i_version more efficiently Jeff Layton
  2017-12-13 15:05 ` [PATCH 00/19] fs: rework and optimize i_version handling in filesystems J. Bruce Fields
  19 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

At this point, we know that "now" and the file times may differ, and we
suspect that the i_version has been flagged to be bumped. Attempt to
bump the i_version, and only mark the inode dirty if that actually
occurred or if one of the times was updated.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/btrfs/inode.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ac25389b39de..2e50a977fb06 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6106,19 +6106,20 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
 			     int flags)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	bool dirty = flags & ~S_VERSION;
 
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
 	if (flags & S_VERSION)
-		inode_inc_iversion(inode);
+		dirty |= inode_maybe_inc_iversion(inode, dirty);
 	if (flags & S_CTIME)
 		inode->i_ctime = *now;
 	if (flags & S_MTIME)
 		inode->i_mtime = *now;
 	if (flags & S_ATIME)
 		inode->i_atime = *now;
-	return btrfs_dirty_inode(inode);
+	return dirty ? btrfs_dirty_inode(inode) : 0;
 }
 
 /*
-- 
2.14.3

* [PATCH 19/19] fs: handle inode->i_version more efficiently
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (17 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
@ 2017-12-13 14:20 ` Jeff Layton
  2017-12-13 15:05 ` [PATCH 00/19] fs: rework and optimize i_version handling in filesystems J. Bruce Fields
  19 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 14:20 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

From: Jeff Layton <jlayton@redhat.com>

Since i_version is mostly treated as an opaque value, we can exploit that
fact to avoid incrementing it when no one is watching. With that change,
we can avoid incrementing the counter on writes, unless someone has
queried for it since it was last incremented. If the a/c/mtime don't
change, and the i_version hasn't changed, then there's no need to dirty
the inode metadata on a write.

Convert the i_version counter to an atomic64_t, and use the lowest order
bit to hold a flag that will tell whether anyone has queried the value
since it was last incremented.

When we go to maybe increment it, we fetch the value and check the flag
bit. If it's clear and the update isn't being forced, then we don't
need to do anything.

If we do need to update, then we increment the counter by 2, and clear
the flag bit, and then use a CAS op to swap it into place. If that
works, we return true. If it doesn't then do it again with the value
that we fetch from the CAS operation.

On the query side, if the flag is already set, then we just shift the
value down by 1 bit and return it. Otherwise, we set the flag in our
on-stack value and again use cmpxchg to swap it into place if it hasn't
changed. If it has, then we use the value from the cmpxchg as the new
"old" value and try again.

This method allows us to avoid incrementing the counter on writes (and
dirtying the metadata) under typical workloads. We only need to increment
if it has been queried since it was last changed.
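
The scheme described above can be modeled in userspace roughly as
follows. This is only a sketch using C11 <stdatomic.h> rather than the
kernel's atomic64_t; struct model_inode, model_maybe_inc() and
model_query() are hypothetical names, and the macros mirror the
I_VERSION_* definitions in the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUERIED_SHIFT	1
#define QUERIED		(1ULL << (QUERIED_SHIFT - 1))	/* low bit: flag */
#define INCREMENT	(1ULL << QUERIED_SHIFT)		/* counter step: 2 */

/* Hypothetical stand-in for the i_version field of struct inode. */
struct model_inode {
	_Atomic uint64_t i_version;
};

/* Returns true if the counter was bumped, false if the bump was skipped. */
static bool model_maybe_inc(struct model_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);
	uint64_t new;

	do {
		/* Flag clear and not forced: nobody looked, nothing to do. */
		if (!force && !(cur & QUERIED))
			return false;
		/* Clear the flag and step past it. */
		new = (cur & ~QUERIED) + INCREMENT;
		/* On failure, cur is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(&inode->i_version, &cur, new));
	return true;
}

/* Returns the counter and marks it as seen, so the next change bumps it. */
static uint64_t model_query(struct model_inode *inode)
{
	uint64_t cur = atomic_load(&inode->i_version);

	/* If the flag is already set there is nothing to swap. */
	while (!(cur & QUERIED)) {
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | QUERIED))
			break;
	}
	return cur >> QUERIED_SHIFT;
}
```

Starting from a counter of 1 with the flag clear, an unforced
model_maybe_inc() is a no-op until model_query() has been called, after
which exactly one increment goes through.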

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/fs.h | 150 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 99 insertions(+), 51 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c234fac4bb77..84fe3ce8e45a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -639,7 +639,7 @@ struct inode {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
 	};
-	u64			i_version;
+	atomic64_t		i_version;
 	atomic_t		i_count;
 	atomic_t		i_dio_count;
 	atomic_t		i_writecount;
@@ -2059,86 +2059,116 @@ static inline void inode_dec_link_count(struct inode *inode)
  * i_version on namespace changes in directories (mkdir, rmdir, unlink, etc.).
  * We consider these sorts of filesystems to have a kernel-managed i_version.
  *
- * Note that some filesystems (e.g. NFS and AFS) just use the field to store
- * a server-provided value (for the most part). For that reason, those
- * filesystems do not set SB_I_VERSION. These filesystems are considered to
- * have a self-managed i_version.
+ * This implementation uses the low bit in the i_version field as a flag to
+ * track when the value has been queried. If it has not been queried since it
+ * was last incremented, we can skip the increment in most cases.
+ *
+ * In the event that we're updating the ctime, we will usually go ahead and
+ * bump the i_version anyway. Since that has to go to stable storage in some
+ * fashion, we might as well increment it too.
+ */
+
+/*
+ * We borrow the lowest bit in the i_version to use as a flag to tell whether
+ * it has been queried since we last incremented it. If it has, then we must
+ * increment it on the next change. After that, we can clear the flag and
+ * avoid incrementing it again until it has again been queried.
  */
+#define I_VERSION_QUERIED_SHIFT	(1)
+#define I_VERSION_QUERIED	(1ULL << (I_VERSION_QUERIED_SHIFT - 1))
+#define I_VERSION_INCREMENT	(1ULL << I_VERSION_QUERIED_SHIFT)
 
 /**
  * inode_set_iversion_raw - set i_version to the specified raw value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for use by
+ * Set @inode's i_version field to @val. This function is for use by
  * filesystems that self-manage the i_version.
  *
  * For example, the NFS client stores its NFSv4 change attribute in this way,
  * and the AFS client stores the data_version from the server here.
  */
 static inline void
-inode_set_iversion_raw(struct inode *inode, const u64 new)
+inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
-	inode->i_version = new;
+	atomic64_set(&inode->i_version, val);
 }
 
 /**
  * inode_set_iversion - set i_version to a particular value
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
  *
- * Set @inode's i_version field to @new. This function is for filesystems with
- * a kernel-managed i_version.
+ * Set @inode's i_version field to @val. This function is for filesystems with
+ * a kernel-managed i_version, for initializing a newly-created inode from
+ * scratch.
  *
- * For now, this just does the same thing as the _raw variant.
+ * In this case, we do not set the QUERIED flag since we know that this value
+ * has never been queried.
  */
 static inline void
-inode_set_iversion(struct inode *inode, const u64 new)
+inode_set_iversion(struct inode *inode, const u64 val)
 {
-	inode_set_iversion_raw(inode, new);
+	inode_set_iversion_raw(inode, val << I_VERSION_QUERIED_SHIFT);
 }
 
 /**
- * inode_set_iversion_queried - set i_version to a particular value and set
- *                              flag to indicate that it has been viewed
+ * inode_set_iversion_queried - set i_version to a particular value as queried
  * @inode: inode to set
- * @new: new i_version value to set
+ * @val: new i_version value to set
+ *
+ * Set @inode's i_version field to @val, and flag it for increment on the next
+ * change.
  *
  * When loading in an i_version value from a backing store, we typically don't
- * know whether it was previously viewed before being stored or not. Thus, we
+ * know whether it was previously viewed before being stored. Thus, we
  * must assume that it was, to ensure that any changes will result in the
  * value changing.
- *
- * This function will set the inode's i_version, and possibly flag the value
- * as if it has already been viewed at least once.
- *
- * For now, this just does what inode_set_iversion does.
  */
 static inline void
-inode_set_iversion_queried(struct inode *inode, const u64 new)
+inode_set_iversion_queried(struct inode *inode, const u64 val)
 {
-	inode_set_iversion(inode, new);
+	inode_set_iversion_raw(inode, (val << I_VERSION_QUERIED_SHIFT) |
+				I_VERSION_QUERIED);
 }
 
 /**
  * inode_maybe_inc_iversion - increments i_version
  * @inode: inode with the i_version that should be updated
- * @force: increment the counter even if it's not necessary
+ * @force: increment the counter even if it's not necessary?
  *
  * Every time the inode is modified, the i_version field must be seen to have
  * changed by any observer.
  *
- * In this implementation, we always increment it after taking the i_lock to
- * ensure that we don't race with other incrementors.
+ * If "force" is set or the QUERIED flag is set, then ensure that we increment
+ * the value, and clear the queried flag.
+ *
+ * In the common case where neither is set, then we can return "false" without
+ * updating i_version.
  *
- * Returns true if counter was bumped, and false if it wasn't.
+ * If this function returns false, and no other metadata has changed, then we
+ * can avoid logging the metadata.
  */
 static inline bool
 inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
-	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
+	u64 cur, old, new;
+
+	cur = (u64)atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is clear then we needn't do anything */
+		if (!force && !(cur & I_VERSION_QUERIED))
+			return false;
 
-	atomic64_inc(ivp);
+		/* Since lowest bit is flag, add 2 to avoid it */
+		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
+
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (likely(old == cur))
+			break;
+		cur = old;
+	}
 	return true;
 }
 
@@ -2155,21 +2185,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_iversion_need_inc - is the i_version in need of being incremented?
- * @inode: inode to check
- *
- * Returns whether the inode->i_version counter needs incrementing on the next
- * change.
- *
- * For now, we assume that it always does.
- */
-static inline bool
-inode_iversion_need_inc(struct inode *inode)
-{
-	return true;
-}
-
 /**
  * inode_peek_iversion_raw - grab a "raw" iversion value
  * @inode: inode from which i_version should be read
@@ -2184,7 +2199,20 @@ inode_iversion_need_inc(struct inode *inode)
 static inline u64
 inode_peek_iversion_raw(const struct inode *inode)
 {
-	return inode->i_version;
+	return atomic64_read(&inode->i_version);
+}
+
+/**
+ * inode_iversion_need_inc - is the i_version in need of being incremented?
+ * @inode: inode to check
+ *
+ * Returns whether the inode->i_version counter needs incrementing on the next
+ * change. Just fetch the value and check the QUERIED flag.
+ */
+static inline bool
+inode_iversion_need_inc(struct inode *inode)
+{
+	return inode_peek_iversion_raw(inode) & I_VERSION_QUERIED;
 }
 
 /**
@@ -2201,7 +2229,7 @@ inode_peek_iversion_raw(const struct inode *inode)
 static inline u64
 inode_peek_iversion(const struct inode *inode)
 {
-	return inode_peek_iversion_raw(inode);
+	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -2213,12 +2241,28 @@ inode_peek_iversion(const struct inode *inode)
  * that a later query of the i_version will result in a different value if
  * anything has changed.
  *
- * This implementation just does a peek.
+ * In this implementation, we fetch the current value, set the QUERIED flag and
+ * then try to swap it into place with a cmpxchg, if it wasn't already set. If
+ * that fails, we try again with the newly fetched value from the cmpxchg.
  */
 static inline u64
 inode_query_iversion(struct inode *inode)
 {
-	return inode_peek_iversion(inode);
+	u64 cur, old, new;
+
+	cur = atomic64_read(&inode->i_version);
+	for (;;) {
+		/* If flag is already set, then no need to swap */
+		if (cur & I_VERSION_QUERIED)
+			break;
+
+		new = cur | I_VERSION_QUERIED;
+		old = atomic64_cmpxchg(&inode->i_version, cur, new);
+		if (old == cur)
+			break;
+		cur = old;
+	}
+	return cur >> I_VERSION_QUERIED_SHIFT;
 }
 
 /**
@@ -2228,7 +2272,11 @@ inode_query_iversion(struct inode *inode)
  *
  * Compare an i_version counter with a previous one. Returns 0 if they are
  * the same or non-zero if they are different.
+ *
+ * Note that we don't need to set the QUERIED flag in this case, as the value
+ * in the inode is not being recorded for later use.
  */
+
 static inline s64
 inode_cmp_iversion(const struct inode *inode, const u64 old)
 {
-- 
2.14.3

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
                   ` (18 preceding siblings ...)
  2017-12-13 14:20 ` [PATCH 19/19] fs: handle inode->i_version more efficiently Jeff Layton
@ 2017-12-13 15:05 ` J. Bruce Fields
  2017-12-13 20:14   ` Jeff Layton
  19 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2017-12-13 15:05 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-fsdevel, linux-kernel, hch, neilb, amir73il, jack, viro

This is great, thanks.

On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
> 
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not what we'd consider a real-world
> workload.
> 
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific), with an earlier version of
> this set.
> 
> With larger writes, the gains with this patchset mostly vaporize,
> but it does not seem to cause performance to regress anywhere, AFAICT.
> 
> I'm happy to run other workloads if anyone can suggest them.
> 
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces the cost.

Do you have an idea what the remaining cost is?

Especially in the ext4 case, are you still able to measure any
difference in performance between the cases where i_version is turned on
and off, after these patches?

> 
> [1]: On ext4 it must be turned on with the i_version mount option,
>      mostly due to fears of incurring this impact, AFAICT.

So xfs and btrfs both have i_version updates on by default at this
point?  (Assuming the filesystem's created with recent enough tools,
etc.)

--b.

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-13 15:05 ` [PATCH 00/19] fs: rework and optimize i_version handling in filesystems J. Bruce Fields
@ 2017-12-13 20:14   ` Jeff Layton
  2017-12-13 22:10     ` Jeff Layton
  2017-12-13 23:03     ` Dave Chinner
  0 siblings, 2 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 20:14 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: linux-fsdevel, linux-kernel, hch, neilb, amir73il, jack, viro

[-- Attachment #1: Type: text/plain, Size: 2946 bytes --]

On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> This is great, thanks.
> 
> On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> > 
> > The pessimal workload here is 1 byte writes, and it helps that
> > significantly. Of course, that's not what we'd consider a real-world
> > workload.
> > 
> > A tiobench-example.fio workload also shows some modest performance
> > gains, and I've gotten mails from the kernel test robot that show some
> > significant performance gains on some microbenchmarks (case-msync-mt in
> > the vm-scalability testsuite to be specific), with an earlier version of
> > this set.
> > 
> > With larger writes, the gains with this patchset mostly vaporize,
> > but it does not seem to cause performance to regress anywhere, AFAICT.
> > 
> > I'm happy to run other workloads if anyone can suggest them.
> > 
> > At this point, the patchset works and does what it's expected to do in
> > my own testing. It seems like it's at least a modest performance win
> > across all 3 major disk-based filesystems. It may also encourage others
> > to implement i_version as well since it reduces the cost.
> 
> Do you have an idea what the remaining cost is?
> 
> Especially in the ext4 case, are you still able to measure any
> difference in performance between the cases where i_version is turned on
> and off, after these patches?

Attached is a fio jobfile + the output from 3 different runs using it
with ext4. This one is using 4k writes. There was no querying of
i_version during the runs.  I've done several runs with each and these
are pretty representative of the results:

old = 4.15-rc3, i_version enabled
ivers = 4.15-rc3 + these patches, i_version enabled
noivers = 4.15-rc3 + these patches, i_version disabled

To snip out the run status lines:

old:
WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec

ivers:
WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec

noivers:
WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec

So, I see some performance degradation with -o iversion compared to not
having it enabled (maybe due to the extra atomic fetches?), but this set
erases most of the difference.
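
As a rough check on "most of the difference", the aggregate bandwidth
figures above can be compared directly. The helpers below are
hypothetical and only do the arithmetic on the three bw= values from the
run status lines. They ignore run-to-run variance, so treat the result
as indicative only.

```c
#include <stdio.h>

/* Aggregate write bandwidth in MiB/s, copied from the run status lines. */
static const double bw_old = 85.6;	/* 4.15-rc3, i_version enabled */
static const double bw_ivers = 110.0;	/* patched, i_version enabled */
static const double bw_noivers = 117.0;	/* patched, i_version disabled */

/* Relative change from before to after, in percent. */
static double pct_gain(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Share of the old-vs-noivers gap that the patched ivers run recovers. */
static double pct_gap_closed(double old, double ivers, double noivers)
{
	return (ivers - old) / (noivers - old) * 100.0;
}

/* Print both figures for the numbers quoted in this mail. */
static void print_summary(void)
{
	printf("gain with patches, i_version on: %.1f%%\n",
	       pct_gain(bw_old, bw_ivers));
	printf("gap to i_version off closed:     %.1f%%\n",
	       pct_gap_closed(bw_old, bw_ivers, bw_noivers));
}
```

On these numbers the patched kernel with i_version enabled gains a bit
under 30% over the old one, and recovers a bit over three quarters of
the gap to running with i_version disabled.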

> > 
> > [1]: On ext4 it must be turned on with the i_version mount option,
> >      mostly due to fears of incurring this impact, AFAICT.
> 
> So xfs and btrfs both have i_version updates on by default at this
> point?  (Assuming the filesystem's created with recent enough tools,
> etc.)
> 

Yes. With xfs and btrfs, I don't think you can disable it these days.

-- 
Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: ext4-ivers-4k --]
[-- Type: text/plain, Size: 12418 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=104MiB/s][r=0,w=26.7k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1191: Wed Dec 13 14:48:18 2017
  write: IOPS=3478, BW=13.6MiB/s (14.2MB/s)(8152MiB/600001msec)
    slat (usec): min=4, max=1085.0k, avg=240.14, stdev=1750.72
    clat (usec): min=58, max=1088.8k, avg=4143.72, stdev=7043.30
     lat (usec): min=110, max=1088.9k, avg=4387.89, stdev=7296.91
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   12], 99.50th=[   35], 99.90th=[  102], 99.95th=[  136],
     | 99.99th=[  218]
   bw (  KiB/s): min=  993, max=19783, per=12.98%, avg=14575.57, stdev=3800.59, samples=1147
   iops        : min=  248, max= 4945, avg=3643.57, stdev=950.17, samples=1147
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.16%, 4=70.13%, 10=28.59%, 20=0.36%, 50=0.39%
  lat (msec)   : 100=0.25%, 250=0.10%, 500=0.01%, 2000=0.01%
  cpu          : usr=0.95%, sys=89.11%, ctx=452539, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2086817,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1192: Wed Dec 13 14:48:18 2017
  write: IOPS=3626, BW=14.2MiB/s (14.9MB/s)(8500MiB/600001msec)
    slat (usec): min=6, max=1084.3k, avg=234.36, stdev=1416.96
    clat (usec): min=44, max=1114.2k, avg=3971.34, stdev=5623.69
     lat (usec): min=145, max=1114.5k, avg=4208.76, stdev=5823.63
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[   20], 99.90th=[   78], 99.95th=[  106],
     | 99.99th=[  174]
   bw (  KiB/s): min= 1595, max=21112, per=13.53%, avg=15201.53, stdev=2969.29, samples=1147
   iops        : min=  398, max= 5278, avg=3800.04, stdev=742.34, samples=1147
  lat (usec)   : 50=0.01%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.19%, 4=69.92%, 10=29.06%, 20=0.28%, 50=0.27%
  lat (msec)   : 100=0.15%, 250=0.05%, 500=0.01%, 2000=0.01%
  cpu          : usr=0.97%, sys=92.24%, ctx=261214, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2175888,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1193: Wed Dec 13 14:48:18 2017
  write: IOPS=3452, BW=13.5MiB/s (14.1MB/s)(8092MiB/600001msec)
    slat (usec): min=4, max=1154.9k, avg=242.30, stdev=1896.64
    clat (usec): min=64, max=1164.4k, avg=4174.23, stdev=7654.32
     lat (usec): min=109, max=1164.5k, avg=4420.55, stdev=7936.35
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   13], 99.50th=[   36], 99.90th=[  106], 99.95th=[  144],
     | 99.99th=[  275]
   bw (  KiB/s): min=  529, max=21258, per=12.90%, avg=14493.62, stdev=3880.36, samples=1147
   iops        : min=  132, max= 5314, avg=3623.03, stdev=970.10, samples=1147
  lat (usec)   : 100=0.01%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.15%, 4=70.16%, 10=28.51%, 20=0.37%, 50=0.40%
  lat (msec)   : 100=0.24%, 250=0.10%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=0.94%, sys=88.67%, ctx=522998, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2071479,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1194: Wed Dec 13 14:48:18 2017
  write: IOPS=3511, BW=13.7MiB/s (14.4MB/s)(8230MiB/600001msec)
    slat (usec): min=4, max=1085.1k, avg=237.86, stdev=1687.05
    clat (usec): min=39, max=1089.1k, avg=4104.43, stdev=6668.41
     lat (usec): min=120, max=1089.4k, avg=4346.31, stdev=6898.94
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   12], 99.50th=[   30], 99.90th=[   96], 99.95th=[  132],
     | 99.99th=[  211]
   bw (  KiB/s): min= 1112, max=19776, per=13.09%, avg=14707.65, stdev=3393.04, samples=1147
   iops        : min=  278, max= 4944, avg=3676.69, stdev=848.28, samples=1147
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.11%, 4=69.99%, 10=28.82%, 20=0.37%, 50=0.39%
  lat (msec)   : 100=0.19%, 250=0.09%, 500=0.01%, 2000=0.01%
  cpu          : usr=1.00%, sys=89.85%, ctx=413785, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2106782,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1195: Wed Dec 13 14:48:18 2017
  write: IOPS=3463, BW=13.5MiB/s (14.2MB/s)(8117MiB/600001msec)
    slat (usec): min=4, max=1082.7k, avg=241.60, stdev=1838.26
    clat (usec): min=63, max=1088.4k, avg=4161.37, stdev=7332.29
     lat (usec): min=110, max=1088.7k, avg=4406.97, stdev=7592.92
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   13], 99.50th=[   35], 99.90th=[  107], 99.95th=[  144],
     | 99.99th=[  243]
   bw (  KiB/s): min=  496, max=21619, per=12.93%, avg=14528.62, stdev=3710.11, samples=1147
   iops        : min=  124, max= 5404, avg=3631.78, stdev=927.53, samples=1147
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.15%, 4=70.22%, 10=28.49%, 20=0.38%, 50=0.40%
  lat (msec)   : 100=0.24%, 250=0.10%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=0.99%, sys=88.74%, ctx=465671, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2077865,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1196: Wed Dec 13 14:48:18 2017
  write: IOPS=3550, BW=13.9MiB/s (14.5MB/s)(8321MiB/600001msec)
    slat (usec): min=4, max=1084.5k, avg=234.71, stdev=1655.30
    clat (usec): min=48, max=1089.2k, avg=4059.46, stdev=6479.65
     lat (usec): min=110, max=1089.4k, avg=4298.19, stdev=6697.09
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    5], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    9], 99.50th=[   25], 99.90th=[   88], 99.95th=[  124],
     | 99.99th=[  222]
   bw (  KiB/s): min= 1386, max=19895, per=13.27%, avg=14907.27, stdev=2940.34, samples=1147
   iops        : min=  346, max= 4973, avg=3726.44, stdev=735.09, samples=1147
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.12%, 4=69.78%, 10=29.15%, 20=0.34%, 50=0.33%
  lat (msec)   : 100=0.18%, 250=0.07%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=0.98%, sys=90.70%, ctx=321526, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2130283,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1197: Wed Dec 13 14:48:18 2017
  write: IOPS=3520, BW=13.8MiB/s (14.4MB/s)(8252MiB/600001msec)
    slat (usec): min=4, max=1083.3k, avg=237.04, stdev=1668.77
    clat (usec): min=45, max=1092.4k, avg=4093.45, stdev=6715.18
     lat (usec): min=110, max=1092.6k, avg=4334.50, stdev=6950.68
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   11], 99.50th=[   31], 99.90th=[   95], 99.95th=[  122],
     | 99.99th=[  215]
   bw (  KiB/s): min=  128, max=22949, per=13.16%, avg=14785.06, stdev=3438.70, samples=1147
   iops        : min=   32, max= 5737, avg=3695.90, stdev=859.68, samples=1147
  lat (usec)   : 50=0.01%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.15%, 4=70.79%, 10=27.97%, 20=0.35%, 50=0.38%
  lat (msec)   : 100=0.21%, 250=0.08%, 500=0.01%, 750=0.01%, 2000=0.01%
  cpu          : usr=1.00%, sys=89.70%, ctx=410771, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2112511,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1198: Wed Dec 13 14:48:18 2017
  write: IOPS=3477, BW=13.6MiB/s (14.2MB/s)(8151MiB/600001msec)
    slat (usec): min=4, max=1082.9k, avg=240.13, stdev=1795.84
    clat (usec): min=45, max=1115.8k, avg=4144.14, stdev=7175.43
     lat (usec): min=115, max=1116.1k, avg=4388.33, stdev=7427.75
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   12], 99.50th=[   34], 99.90th=[  103], 99.95th=[  146],
     | 99.99th=[  222]
   bw (  KiB/s): min=  963, max=19951, per=12.99%, avg=14591.74, stdev=3637.39, samples=1147
   iops        : min=  240, max= 4987, avg=3647.55, stdev=909.34, samples=1147
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.15%, 4=70.37%, 10=28.36%, 20=0.38%, 50=0.37%
  lat (msec)   : 100=0.24%, 250=0.10%, 500=0.01%, 2000=0.01%
  cpu          : usr=0.97%, sys=88.97%, ctx=452510, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2086619,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec

Disk stats (read/write):
  vdb: ios=0/46063, merge=0/6986, ticks=0/2294756, in_queue=1430067, util=8.76%

[-- Attachment #3: ext4-noivers-4k --]
[-- Type: text/plain, Size: 12221 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=123MiB/s][r=0,w=31.5k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1218: Wed Dec 13 14:58:41 2017
  write: IOPS=3839, BW=14.0MiB/s (15.7MB/s)(8999MiB/600001msec)
    slat (usec): min=4, max=369392, avg=227.73, stdev=1229.83
    clat (usec): min=113, max=373025, avg=3923.34, stdev=4848.17
     lat (usec): min=118, max=373433, avg=4154.83, stdev=5012.74
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   17], 99.90th=[   79], 99.95th=[  111],
     | 99.99th=[  194]
   bw (  KiB/s): min= 3732, max=22216, per=12.82%, avg=15393.36, stdev=2641.49, samples=1200
   iops        : min=  933, max= 5554, avg=3847.98, stdev=660.39, samples=1200
  lat (usec)   : 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.03%
  lat (msec)   : 2=0.28%, 4=70.76%, 10=28.16%, 20=0.26%, 50=0.25%
  lat (msec)   : 100=0.13%, 250=0.06%, 500=0.01%
  cpu          : usr=1.07%, sys=91.99%, ctx=265847, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2303836,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1219: Wed Dec 13 14:58:41 2017
  write: IOPS=3640, BW=14.2MiB/s (14.9MB/s)(8533MiB/600001msec)
    slat (usec): min=4, max=367205, avg=242.26, stdev=1701.75
    clat (usec): min=105, max=400086, avg=4137.17, stdev=6855.07
     lat (usec): min=110, max=400553, avg=4383.15, stdev=7106.79
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   11], 99.50th=[   35], 99.90th=[  112], 99.95th=[  148],
     | 99.99th=[  241]
   bw (  KiB/s): min=  264, max=23358, per=12.15%, avg=14587.82, stdev=3928.56, samples=1200
   iops        : min=   66, max= 5839, avg=3646.64, stdev=982.14, samples=1200
  lat (usec)   : 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.03%
  lat (msec)   : 2=0.22%, 4=70.42%, 10=28.26%, 20=0.29%, 50=0.37%
  lat (msec)   : 100=0.25%, 250=0.11%, 500=0.01%
  cpu          : usr=0.98%, sys=88.04%, ctx=464380, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2184434,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1220: Wed Dec 13 14:58:41 2017
  write: IOPS=3808, BW=14.9MiB/s (15.6MB/s)(8925MiB/600001msec)
    slat (usec): min=4, max=348744, avg=230.25, stdev=1217.66
    clat (usec): min=110, max=353791, avg=3955.93, stdev=4810.70
     lat (usec): min=115, max=354077, avg=4189.87, stdev=4976.17
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    8], 99.50th=[   22], 99.90th=[   79], 99.95th=[  104],
     | 99.99th=[  180]
   bw (  KiB/s): min= 1360, max=21672, per=12.71%, avg=15258.19, stdev=2808.41, samples=1200
   iops        : min=  340, max= 5418, avg=3814.23, stdev=702.12, samples=1200
  lat (usec)   : 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.04%
  lat (msec)   : 2=0.25%, 4=70.75%, 10=28.05%, 20=0.32%, 50=0.32%
  lat (msec)   : 100=0.15%, 250=0.05%, 500=0.01%
  cpu          : usr=1.03%, sys=91.51%, ctx=315742, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2284898,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1221: Wed Dec 13 14:58:41 2017
  write: IOPS=3699, BW=14.4MiB/s (15.2MB/s)(8670MiB/600001msec)
    slat (usec): min=4, max=384286, avg=237.31, stdev=1556.89
    clat (usec): min=69, max=456496, avg=4071.97, stdev=6252.37
     lat (usec): min=109, max=456673, avg=4313.09, stdev=6477.02
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   10], 99.50th=[   31], 99.90th=[  104], 99.95th=[  136],
     | 99.99th=[  211]
   bw (  KiB/s): min=  793, max=24016, per=12.35%, avg=14826.33, stdev=3502.09, samples=1200
   iops        : min=  198, max= 6004, avg=3706.22, stdev=875.53, samples=1200
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.03%
  lat (msec)   : 2=0.26%, 4=70.92%, 10=27.75%, 20=0.31%, 50=0.36%
  lat (msec)   : 100=0.22%, 250=0.10%, 500=0.01%
  cpu          : usr=1.03%, sys=89.25%, ctx=473168, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2219487,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1222: Wed Dec 13 14:58:41 2017
  write: IOPS=3625, BW=14.2MiB/s (14.9MB/s)(8498MiB/600001msec)
    slat (usec): min=4, max=375714, avg=243.16, stdev=1737.37
    clat (usec): min=59, max=387462, avg=4154.04, stdev=7013.84
     lat (usec): min=109, max=387723, avg=4400.98, stdev=7271.32
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   13], 99.50th=[   39], 99.90th=[  116], 99.95th=[  153],
     | 99.99th=[  236]
   bw (  KiB/s): min=  729, max=21739, per=12.09%, avg=14519.35, stdev=3986.29, samples=1200
   iops        : min=  182, max= 5434, avg=3629.62, stdev=996.59, samples=1200
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.03%
  lat (msec)   : 2=0.28%, 4=70.67%, 10=27.86%, 20=0.33%, 50=0.40%
  lat (msec)   : 100=0.25%, 250=0.13%, 500=0.01%
  cpu          : usr=1.03%, sys=87.52%, ctx=510004, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2175576,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1223: Wed Dec 13 14:58:41 2017
  write: IOPS=3891, BW=15.2MiB/s (15.9MB/s)(9120MiB/600001msec)
    slat (usec): min=4, max=256927, avg=228.15, stdev=981.91
    clat (usec): min=55, max=268950, avg=3869.74, stdev=3893.99
     lat (usec): min=145, max=300244, avg=4100.95, stdev=4027.72
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   15], 99.90th=[   65], 99.95th=[   90],
     | 99.99th=[  148]
   bw (  KiB/s): min= 1638, max=22276, per=12.99%, avg=15599.33, stdev=2421.87, samples=1200
   iops        : min=  409, max= 5569, avg=3899.47, stdev=605.49, samples=1200
  lat (usec)   : 100=0.01%, 250=0.05%, 500=0.02%, 750=0.02%, 1000=0.05%
  lat (msec)   : 2=0.34%, 4=70.12%, 10=28.76%, 20=0.26%, 50=0.24%
  lat (msec)   : 100=0.11%, 250=0.03%, 500=0.01%
  cpu          : usr=1.06%, sys=93.28%, ctx=212422, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2334616,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1224: Wed Dec 13 14:58:41 2017
  write: IOPS=3707, BW=14.5MiB/s (15.2MB/s)(8689MiB/600001msec)
    slat (usec): min=4, max=356512, avg=237.42, stdev=1554.30
    clat (usec): min=42, max=376204, avg=4063.23, stdev=6248.90
     lat (usec): min=109, max=376508, avg=4304.36, stdev=6472.77
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[   10], 99.50th=[   31], 99.90th=[   99], 99.95th=[  132],
     | 99.99th=[  239]
   bw (  KiB/s): min= 1124, max=22706, per=12.40%, avg=14881.48, stdev=3507.39, samples=1200
   iops        : min=  281, max= 5676, avg=3719.98, stdev=876.86, samples=1200
  lat (usec)   : 50=0.01%, 250=0.05%, 500=0.02%, 750=0.01%, 1000=0.03%
  lat (msec)   : 2=0.34%, 4=70.57%, 10=28.00%, 20=0.28%, 50=0.37%
  lat (msec)   : 100=0.22%, 250=0.09%, 500=0.01%
  cpu          : usr=1.02%, sys=89.30%, ctx=437871, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2224313,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1225: Wed Dec 13 14:58:41 2017
  write: IOPS=3799, BW=14.8MiB/s (15.6MB/s)(8905MiB/600001msec)
    slat (usec): min=4, max=361544, avg=230.35, stdev=1332.97
    clat (usec): min=51, max=368069, avg=3965.33, stdev=5387.38
     lat (usec): min=110, max=368366, avg=4199.48, stdev=5581.52
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    5], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[   21], 99.90th=[   89], 99.95th=[  123],
     | 99.99th=[  197]
   bw (  KiB/s): min=  963, max=22292, per=12.69%, avg=15237.54, stdev=3142.05, samples=1200
   iops        : min=  240, max= 5573, avg=3809.02, stdev=785.52, samples=1200
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.23%, 4=71.18%, 10=27.77%, 20=0.26%, 50=0.27%
  lat (msec)   : 100=0.16%, 250=0.08%, 500=0.01%
  cpu          : usr=1.01%, sys=91.25%, ctx=341973, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2279628,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec

Disk stats (read/write):
  vdb: ios=1/50323, merge=0/451, ticks=0/1016470, in_queue=362769, util=5.81%

[-- Attachment #4: ext4-old-4k --]
[-- Type: text/plain, Size: 12847 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=82.1MiB/s][r=0,w=21.0k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1178: Wed Dec 13 14:21:23 2017
  write: IOPS=2809, BW=10.0MiB/s (11.5MB/s)(6584MiB/600001msec)
    slat (usec): min=5, max=1731.0k, avg=321.00, stdev=4481.26
    clat (usec): min=76, max=1736.5k, avg=5171.63, stdev=17687.91
     lat (usec): min=141, max=1736.8k, avg=5495.42, stdev=18290.97
    clat percentiles (usec):
     |  1.00th=[  1860],  5.00th=[  2409], 10.00th=[  2802], 20.00th=[  3195],
     | 30.00th=[  3556], 40.00th=[  3818], 50.00th=[  4015], 60.00th=[  4228],
     | 70.00th=[  4490], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 14484], 99.50th=[ 69731], 99.90th=[250610], 99.95th=[337642],
     | 99.99th=[574620]
   bw (  KiB/s): min=    8, max=24673, per=13.42%, avg=11763.36, stdev=3943.17, samples=1148
   iops        : min=    2, max= 6168, avg=2940.54, stdev=985.79, samples=1148
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.12%
  lat (msec)   : 2=1.21%, 4=47.10%, 10=50.36%, 20=0.29%, 50=0.26%
  lat (msec)   : 100=0.21%, 250=0.29%, 500=0.09%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.68%, sys=79.81%, ctx=652388, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1685538,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1179: Wed Dec 13 14:21:23 2017
  write: IOPS=2557, BW=9.99MiB/s (10.5MB/s)(5993MiB/600001msec)
    slat (usec): min=5, max=1827.2k, avg=355.53, stdev=5641.37
    clat (usec): min=54, max=1833.8k, avg=5680.10, stdev=22700.65
     lat (usec): min=139, max=1834.2k, avg=6038.34, stdev=23494.13
    clat percentiles (usec):
     |  1.00th=[  1778],  5.00th=[  2343], 10.00th=[  2704], 20.00th=[  3097],
     | 30.00th=[  3490], 40.00th=[  3752], 50.00th=[  3982], 60.00th=[  4178],
     | 70.00th=[  4424], 80.00th=[  4817], 90.00th=[  5407], 95.00th=[  6128],
     | 99.00th=[ 35390], 99.50th=[122160], 99.90th=[320865], 99.95th=[387974],
     | 99.99th=[918553]
   bw (  KiB/s): min=   24, max=26497, per=12.28%, avg=10761.46, stdev=4879.74, samples=1143
   iops        : min=    6, max= 6624, avg=2690.01, stdev=1219.92, samples=1143
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.14%
  lat (msec)   : 2=1.48%, 4=49.67%, 10=47.13%, 20=0.31%, 50=0.34%
  lat (msec)   : 100=0.29%, 250=0.42%, 500=0.15%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.63%, sys=72.85%, ctx=984304, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1534249,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1180: Wed Dec 13 14:21:23 2017
  write: IOPS=2817, BW=11.0MiB/s (11.5MB/s)(6604MiB/600001msec)
    slat (usec): min=5, max=1751.3k, avg=320.34, stdev=4558.52
    clat (usec): min=69, max=1755.7k, avg=5156.99, stdev=18063.06
     lat (usec): min=138, max=1755.0k, avg=5480.10, stdev=18680.79
    clat percentiles (usec):
     |  1.00th=[  1876],  5.00th=[  2409], 10.00th=[  2802], 20.00th=[  3195],
     | 30.00th=[  3556], 40.00th=[  3785], 50.00th=[  4015], 60.00th=[  4228],
     | 70.00th=[  4490], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 14353], 99.50th=[ 67634], 99.90th=[263193], 99.95th=[337642],
     | 99.99th=[683672]
   bw (  KiB/s): min=   80, max=20811, per=13.49%, avg=11827.88, stdev=3987.58, samples=1146
   iops        : min=   20, max= 5202, avg=2956.61, stdev=996.89, samples=1146
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.04%, 1000=0.12%
  lat (msec)   : 2=1.20%, 4=48.00%, 10=49.47%, 20=0.30%, 50=0.27%
  lat (msec)   : 100=0.20%, 250=0.29%, 500=0.09%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.66%, sys=79.82%, ctx=604829, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1690561,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1181: Wed Dec 13 14:21:23 2017
  write: IOPS=2837, BW=11.1MiB/s (11.6MB/s)(6651MiB/600001msec)
    slat (usec): min=5, max=1700.6k, avg=317.64, stdev=4338.76
    clat (usec): min=36, max=1704.3k, avg=5119.44, stdev=16972.87
     lat (usec): min=148, max=1704.8k, avg=5439.85, stdev=17539.15
    clat percentiles (usec):
     |  1.00th=[  1975],  5.00th=[  2507], 10.00th=[  2835], 20.00th=[  3261],
     | 30.00th=[  3589], 40.00th=[  3851], 50.00th=[  4047], 60.00th=[  4293],
     | 70.00th=[  4555], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 13829], 99.50th=[ 62129], 99.90th=[240124], 99.95th=[320865],
     | 99.99th=[608175]
   bw (  KiB/s): min=   48, max=23214, per=13.56%, avg=11881.25, stdev=3784.78, samples=1148
   iops        : min=   12, max= 5803, avg=2970.05, stdev=946.19, samples=1148
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.10%
  lat (msec)   : 2=0.96%, 4=46.54%, 10=51.25%, 20=0.27%, 50=0.30%
  lat (msec)   : 100=0.21%, 250=0.26%, 500=0.08%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.66%, sys=80.94%, ctx=559902, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1702735,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1182: Wed Dec 13 14:21:23 2017
  write: IOPS=2498, BW=9994KiB/s (10.2MB/s)(5856MiB/600001msec)
    slat (usec): min=5, max=1758.3k, avg=364.78, stdev=5607.81
    clat (usec): min=51, max=1763.5k, avg=5813.05, stdev=22946.68
     lat (usec): min=137, max=1763.8k, avg=6180.47, stdev=23776.43
    clat percentiles (usec):
     |  1.00th=[  1745],  5.00th=[  2376], 10.00th=[  2769], 20.00th=[  3228],
     | 30.00th=[  3589], 40.00th=[  3818], 50.00th=[  4015], 60.00th=[  4228],
     | 70.00th=[  4490], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6128],
     | 99.00th=[ 39060], 99.50th=[128451], 99.90th=[341836], 99.95th=[434111],
     | 99.99th=[826278]
   bw (  KiB/s): min=    8, max=23134, per=11.96%, avg=10486.76, stdev=5172.17, samples=1146
   iops        : min=    2, max= 5783, avg=2621.32, stdev=1293.02, samples=1146
  lat (usec)   : 100=0.01%, 250=0.03%, 500=0.01%, 750=0.03%, 1000=0.13%
  lat (msec)   : 2=1.44%, 4=47.17%, 10=49.65%, 20=0.28%, 50=0.33%
  lat (msec)   : 100=0.29%, 250=0.42%, 500=0.17%, 750=0.02%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.63%, sys=72.48%, ctx=1079470, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1499031,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1183: Wed Dec 13 14:21:23 2017
  write: IOPS=2743, BW=10.7MiB/s (11.2MB/s)(6429MiB/600001msec)
    slat (usec): min=5, max=1738.5k, avg=330.39, stdev=4825.27
    clat (usec): min=46, max=1743.0k, avg=5295.46, stdev=19183.60
     lat (usec): min=138, max=1744.3k, avg=5628.41, stdev=19855.76
    clat percentiles (usec):
     |  1.00th=[  1860],  5.00th=[  2343], 10.00th=[  2737], 20.00th=[  3130],
     | 30.00th=[  3523], 40.00th=[  3785], 50.00th=[  3982], 60.00th=[  4178],
     | 70.00th=[  4424], 80.00th=[  4817], 90.00th=[  5342], 95.00th=[  6063],
     | 99.00th=[ 18744], 99.50th=[ 83362], 99.90th=[267387], 99.95th=[379585],
     | 99.99th=[675283]
   bw (  KiB/s): min=   64, max=23616, per=13.09%, avg=11474.50, stdev=4531.36, samples=1149
   iops        : min=   16, max= 5904, avg=2868.36, stdev=1132.85, samples=1149
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.20%
  lat (msec)   : 2=1.28%, 4=48.96%, 10=48.23%, 20=0.32%, 50=0.30%
  lat (msec)   : 100=0.25%, 250=0.31%, 500=0.10%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.63%, sys=77.48%, ctx=758312, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1645817,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1184: Wed Dec 13 14:21:23 2017
  write: IOPS=2850, BW=11.1MiB/s (11.7MB/s)(6680MiB/600001msec)
    slat (usec): min=5, max=1758.4k, avg=316.41, stdev=4456.60
    clat (usec): min=61, max=1763.5k, avg=5098.26, stdev=17493.16
     lat (usec): min=140, max=1763.7k, avg=5417.38, stdev=18076.15
    clat percentiles (usec):
     |  1.00th=[  1942],  5.00th=[  2442], 10.00th=[  2802], 20.00th=[  3228],
     | 30.00th=[  3589], 40.00th=[  3818], 50.00th=[  4047], 60.00th=[  4228],
     | 70.00th=[  4490], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 13042], 99.50th=[ 58983], 99.90th=[248513], 99.95th=[333448],
     | 99.99th=[599786]
   bw (  KiB/s): min=   32, max=21843, per=13.64%, avg=11953.28, stdev=3732.91, samples=1147
   iops        : min=    8, max= 5460, avg=2987.95, stdev=933.22, samples=1147
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.17%
  lat (msec)   : 2=1.03%, 4=47.02%, 10=50.64%, 20=0.30%, 50=0.29%
  lat (msec)   : 100=0.19%, 250=0.25%, 500=0.08%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.70%, sys=80.63%, ctx=521547, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1710059,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1185: Wed Dec 13 14:21:23 2017
  write: IOPS=2798, BW=10.9MiB/s (11.5MB/s)(6559MiB/600001msec)
    slat (usec): min=5, max=1848.6k, avg=322.54, stdev=4583.84
    clat (usec): min=38, max=1854.1k, avg=5191.27, stdev=18112.72
     lat (usec): min=141, max=1854.5k, avg=5516.58, stdev=18734.21
    clat percentiles (usec):
     |  1.00th=[  1909],  5.00th=[  2409], 10.00th=[  2769], 20.00th=[  3163],
     | 30.00th=[  3556], 40.00th=[  3785], 50.00th=[  4015], 60.00th=[  4228],
     | 70.00th=[  4490], 80.00th=[  4883], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 16057], 99.50th=[ 70779], 99.90th=[270533], 99.95th=[358613],
     | 99.99th=[616563]
   bw (  KiB/s): min=   80, max=23390, per=13.36%, avg=11711.75, stdev=4136.20, samples=1148
   iops        : min=   20, max= 5847, avg=2927.75, stdev=1034.06, samples=1148
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.14%
  lat (msec)   : 2=1.13%, 4=48.61%, 10=48.88%, 20=0.29%, 50=0.29%
  lat (msec)   : 100=0.22%, 250=0.28%, 500=0.10%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.69%, sys=79.14%, ctx=652094, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1679114,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec

Disk stats (read/write):
  vdb: ios=0/32561, merge=0/6598, ticks=0/5542527, in_queue=3752820, util=15.70%

[-- Attachment #5: seq-write.fio --]
[-- Type: text/plain, Size: 192 bytes --]

; fio-seq-write.job for fiotest

[global]
name=fio-seq-write
rw=write
bs=4k
direct=0
time_based=1
runtime=600
numjobs=8

[file1]
filename=/mnt/test/fio.out
size=10G
ioengine=libaio
iodepth=16

* Re: [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion
  2017-12-13 14:20 ` [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
@ 2017-12-13 21:52   ` Jeff Layton
  2017-12-13 22:07     ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 21:52 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, 2017-12-13 at 09:20 -0500, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> The rationale for taking the i_lock when incrementing this value is
> lost in antiquity. The readers of the field don't take it (at least
> not universally), so my assumption is that it was only done here to
> serialize incrementors.
> 
> If that is indeed the case, then we can drop the i_lock from this
> codepath and treat it as an atomic64_t for the purposes of
> incrementing it. This allows us to use inode_inc_iversion without
> any danger of lock inversion.
> 
> Note that the read side is not fetched atomically with this change.
> The assumption here is that that is not a critical issue since the
> i_version is not fully synchronized with anything else anyway.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  include/linux/fs.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5001e77342fd..c234fac4bb77 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2136,9 +2136,9 @@ inode_set_iversion_queried(struct inode *inode, const u64 new)
>  static inline bool
>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>  {
> -	spin_lock(&inode->i_lock);
> -	inode->i_version++;
> -	spin_unlock(&inode->i_lock);
> +	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> +
> +	atomic64_inc(ivp);
>  	return true;
>  }
>  

FWIW, I'm not sure this patch is strictly necessary as an interim step.

Adding the i_lock into all of the places where we currently just do
inode->i_version++ without properly auditing all of them gave me pause
though.

In any case, the last patch in the series cleans this nastiness up.
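
For anyone who wants to poke at the idea outside the kernel tree, the
lock-free increment can be sketched in userspace C11. This is only an
illustration under stated assumptions: struct fake_inode,
fake_inc_iversion() and fake_peek_iversion() are hypothetical names, and
<stdatomic.h> stands in for the kernel's atomic64_t API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the single inode field touched by the patch. */
struct fake_inode {
	_Atomic uint64_t i_version;
};

/*
 * Lock-free bump: one atomic read-modify-write, so concurrent callers
 * cannot lose an increment and no spinlock is needed to serialize them.
 * Returns the new value.
 */
static inline uint64_t fake_inc_iversion(struct fake_inode *inode)
{
	assert(inode != NULL);
	/* atomic_fetch_add() returns the value held before the addition. */
	return atomic_fetch_add(&inode->i_version, 1) + 1;
}

/* Atomic load of the current value (the sketch's read side). */
static inline uint64_t fake_peek_iversion(struct fake_inode *inode)
{
	assert(inode != NULL);
	return atomic_load(&inode->i_version);
}
```

Unlike the patch above, this sketch also loads the value atomically on
the read side; the patch deliberately leaves the readers alone.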
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-13 14:19 ` [PATCH 01/19] fs: new API for handling inode->i_version Jeff Layton
@ 2017-12-13 22:04   ` NeilBrown
  2017-12-14  0:27     ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: NeilBrown @ 2017-12-13 22:04 UTC (permalink / raw)
  To: Jeff Layton, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, Dec 13 2017, Jeff Layton wrote:

> +/*
> + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> + * appear different to observers if there was a change to the inode's data or
> + * metadata since it was last queried.
> + *
> + * It should be considered an opaque value by observers. If it remains the same
> + * since it was last checked, then nothing has changed in the inode. If it's
> + * different then something has changed. Observers cannot infer anything about
> + * the nature or magnitude of the changes from the value, only that the inode
> + * has changed in some fashion.

I agree that it "should be" considered opaque, but I have a suspicion
that NFSv4 doesn't consider it opaque.
There is something about write delegations and the server performing a
GETATTR callback to the delegated client so that it can answer GETATTR
from other clients without recalling the delegation.

Specifically section "10.4.3 Handling of CB_GETATTR" of RFC5661 contains
the text:

   o  The client will create a value greater than c that will be used
      for communicating that modified data is held at the client.  Let
      this value be represented by d.

"c" here is a 'change' attribute.

Then:

   While the change attribute is opaque to the client in the sense that
   it has no idea what units of time, if any, the server is counting
   change with, it is not opaque in that the client has to treat it as
   an unsigned integer, and the server has to be able to see the results
   of the client's changes to that integer.  Therefore, the server MUST
   encode the change attribute in network order when sending it to the
   client.  The client MUST decode it from network order to its native
   order when receiving it, and the client MUST encode it in network
   order when sending it to the server.  For this reason, change is
   defined as an unsigned integer rather than an opaque array of bytes.

This all suggests that nfsd needs to be certain that "incrementing" the
change id will produce a new changeid, which has not been used before,
and also suggests that nfsd needs to be able to control the changeid
stored after writes that result from a delegation being returned.
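
The two requirements quoted from the RFC (network byte order on the wire,
and a client-chosen value d that is greater than c) can be sketched
roughly as below. The helper names are hypothetical and are not taken from
nfsd or the NFS client; picking d = c + 1 is just the simplest choice that
satisfies "greater than c" and is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 64-bit change attribute in network (big-endian) byte order. */
static void change_attr_encode(uint64_t change, unsigned char out[8])
{
	assert(out != NULL);
	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(change >> (56 - 8 * i));
}

/* Decode a change attribute from network byte order to native order. */
static uint64_t change_attr_decode(const unsigned char in[8])
{
	uint64_t change = 0;

	assert(in != NULL);
	for (int i = 0; i < 8; i++)
		change = (change << 8) | in[i];
	return change;
}

/*
 * Hypothetical client-side step: produce a value d > c to signal that
 * modified data is held at the client. Wraparound at UINT64_MAX is not
 * handled in this sketch.
 */
static uint64_t change_attr_next(uint64_t c)
{
	assert(c < UINT64_MAX);
	return c + 1;
}
```

The point of the sketch is that the comparison d > c is only meaningful
once both sides agree on the integer interpretation of the eight bytes,
which is why the value cannot be treated as a fully opaque blob.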

I'd just like to say that this is one of the most annoying dumb features
of NFSv4, because it is trivial to fix and I suggested a fix before
NFSv4.0 was finalized.  Grumble.

Otherwise the patch set looks good.  I haven't gone over the code
closely, but the approach is spot-on.

NeilBrown


* Re: [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion
  2017-12-13 21:52   ` Jeff Layton
@ 2017-12-13 22:07     ` NeilBrown
  0 siblings, 0 replies; 46+ messages in thread
From: NeilBrown @ 2017-12-13 22:07 UTC (permalink / raw)
  To: Jeff Layton, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, Dec 13 2017, Jeff Layton wrote:

> On Wed, 2017-12-13 at 09:20 -0500, Jeff Layton wrote:
>> From: Jeff Layton <jlayton@redhat.com>
>> 
>> The rationale for taking the i_lock when incrementing this value is
>> lost in antiquity. The readers of the field don't take it (at least
>> not universally), so my assumption is that it was only done here to
>> serialize incrementors.
>> 
>> If that is indeed the case, then we can drop the i_lock from this
>> codepath and treat it as an atomic64_t for the purposes of
>> incrementing it. This allows us to use inode_inc_iversion without
>> any danger of lock inversion.
>> 
>> Note that the read side is not fetched atomically with this change.
>> The assumption here is that that is not a critical issue since the
>> i_version is not fully synchronized with anything else anyway.
>> 
>> Signed-off-by: Jeff Layton <jlayton@redhat.com>
>> ---
>>  include/linux/fs.h | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 5001e77342fd..c234fac4bb77 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2136,9 +2136,9 @@ inode_set_iversion_queried(struct inode *inode, const u64 new)
>>  static inline bool
>>  inode_maybe_inc_iversion(struct inode *inode, bool force)
>>  {
>> -	spin_lock(&inode->i_lock);
>> -	inode->i_version++;
>> -	spin_unlock(&inode->i_lock);
>> +	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
>> +
>> +	atomic64_inc(ivp);
>>  	return true;
>>  }
>>  
>
> FWIW, I'm not sure this patch is strictly necessary as an interim step.
>
> Adding the i_lock into all of the places where we currently just do
> inode->i_version++ without properly auditing all of them gave me pause
> though.
>
> In any case, the last patch in the series cleans this nastiness up.

Yes, I thought "nastiness" too, and was happy to see it cleaned up.

I would have guessed that the purpose of the spinlock was to avoid the
risk for torn-reads/writes on 32bit platforms that cannot access a 64bit
value atomically.  In either case, using atomic64_t is the right thing
to do.
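
To make the torn-read concern concrete, here is a small deterministic
sketch of a 64-bit counter accessed as two 32-bit halves, the way a 32-bit
CPU without a 64-bit atomic access would have to do it. The struct and
helper names are hypothetical; the increment is split into two explicit
steps so that a "reader" can be placed between them without needing
threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 64-bit counter stored as two independently written 32-bit words. */
struct split_u64 {
	uint32_t lo;
	uint32_t hi;
};

/* Non-atomic read: combines whatever is currently in each half. */
static uint64_t split_read(const struct split_u64 *v)
{
	assert(v != NULL);
	return ((uint64_t)v->hi << 32) | v->lo;
}

/* First half of a non-atomic increment: bump the low word, return carry. */
static uint32_t split_inc_lo(struct split_u64 *v)
{
	assert(v != NULL);
	v->lo += 1;
	return v->lo == 0;
}

/* Second half: propagate the carry into the high word. */
static void split_inc_hi(struct split_u64 *v, uint32_t carry)
{
	assert(v != NULL);
	v->hi += carry;
}
```

Starting from lo = 0xFFFFFFFF, hi = 0, a read that lands between
split_inc_lo() and split_inc_hi() sees 0, which is neither the old value
(0xFFFFFFFF) nor the new one (0x100000000). A real atomic 64-bit type
rules out that intermediate state.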

Thanks,
NeilBrown

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-13 20:14   ` Jeff Layton
@ 2017-12-13 22:10     ` Jeff Layton
  2017-12-13 23:03     ` Dave Chinner
  1 sibling, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-13 22:10 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: linux-fsdevel, linux-kernel, hch, neilb, amir73il, jack, viro

On Wed, 2017-12-13 at 15:14 -0500, Jeff Layton wrote:
> On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > This is great, thanks.
> > 
> > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > With this, we reduce inode metadata updates across all 3 filesystems
> > > down to roughly the frequency of the timestamp granularity, particularly
> > > when it's not being queried (the vastly common case).
> > > 
> > > The pessimal workload here is 1 byte writes, and it helps that
> > > significantly. Of course, that's not what we'd consider a real-world
> > > workload.
> > > 
> > > A tiobench-example.fio workload also shows some modest performance
> > > gains, and I've gotten mails from the kernel test robot that show some
> > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > the vm-scalability testsuite to be specific), with an earlier version of
> > > this set.
> > > 
> > > With larger writes, the gains with this patchset mostly vaporize,
> > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > 
> > > I'm happy to run other workloads if anyone can suggest them.
> > > 
> > > At this point, the patchset works and does what it's expected to do in
> > > my own testing. It seems like it's at least a modest performance win
> > > across all 3 major disk-based filesystems. It may also encourage others
> > > to implement i_version as well since it reduces the cost.
> > 
> > Do you have an idea what the remaining cost is?
> > 
> > Especially in the ext4 case, are you still able to measure any
> > difference in performance between the cases where i_version is turned on
> > and off, after these patches?
> 
> Attached is a fio jobfile + the output from 3 different runs using it
> with ext4. This one is using 4k writes. There was no querying of
> i_version during the runs.  I've done several runs with each and these
> are pretty representative of the results:
> 
> old = 4.15-rc3, i_version enabled
> ivers = 4.15-rc3 + these patches, i_version enabled
> noivers = 4.15-rc3 + these patches, i_version disabled
> 
> To snip out the run status lines:
> 
> old:
> WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> 
> ivers:
> WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> 
> noivers:
> WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> 
> So, I see some performance degradation with -o iversion compared to not
> having it enabled (maybe due to the extra atomic fetches?), but this set
> erases most of the difference.
> 
> > > 
> > > [1]: On ext4 it must be turned on with the i_version mount option,
> > >      mostly due to fears of incurring this impact, AFAICT.
> > 
> > So xfs and btrfs both have i_version updates on by default at this
> > point?  (Assuming the filesystem's created with recent enough tools,
> > etc.)
> > 
> 
> Yes. With xfs and btrfs, I don't think you can disable it these days.
> 

...and for completeness, I'm seeing better performance with 4k writes on
xfs with this set as well:


old:
WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec

new:
WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec


btrfs doesn't show much change ... it's possible that that patch isn't
quite right:

old:
WRITE: bw=49.1MiB/s (51.5MB/s), 5656KiB/s-7011KiB/s (5792kB/s-7179kB/s), io=28.8GiB (30.9GB), run=600001-600003msec

new:
WRITE: bw=50.3MiB/s (52.7MB/s), 5753KiB/s-7264KiB/s (5891kB/s-7438kB/s), io=29.5GiB (31.6GB), run=600001-600002msec
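
For a quick side-by-side, the aggregate bandwidth figures from the run
status lines in this thread can be turned into relative changes with a few
lines of C. This is only arithmetic on the numbers already posted (ext4
"old" vs. "ivers", xfs and btrfs "old" vs. "new"); the struct and function
names are made up for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Aggregate WRITE bandwidth in MiB/s, copied from the run status lines. */
struct bw_result {
	const char *fs;
	double old_mib_s;
	double new_mib_s;
};

static const struct bw_result results[] = {
	{ "ext4",  85.6, 110.0 },	/* old vs. ivers */
	{ "xfs",   86.7, 129.0 },	/* old vs. new */
	{ "btrfs", 49.1,  50.3 },	/* old vs. new */
};

/* Relative change from old_bw to new_bw, in percent. */
static double pct_change(double old_bw, double new_bw)
{
	assert(old_bw > 0.0);
	return (new_bw - old_bw) / old_bw * 100.0;
}

/* Print one line per filesystem with the signed percentage change. */
static void print_results(void)
{
	for (size_t i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-5s %+.1f%%\n", results[i].fs,
		       pct_change(results[i].old_mib_s, results[i].new_mib_s));
}
```

That works out to a gain of roughly 28% on ext4 and 49% on xfs, and a
little over 2% on btrfs, for this particular 4k buffered sequential write
job.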


This is with the kernel running in a KVM guest, virtio disk in guest
backed by a LVM volume on a bog-standard ssd. Performance
characteristics could be different on a different setup, but I think
this is encouraging.

Note that this patchset requires some prerequisite patches that Andrew
is carrying. The complete series is here for anyone who wants to play
with it:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=iversion

-- 
Jeff Layton <jlayton@kernel.org>

[-- Attachment #2: btrfs-new-4k --]
[-- Type: text/plain, Size: 13057 bytes --]

fio ./seq-write.fio 
file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=38.0MiB/s][r=0,w=9733 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1153: Wed Dec 13 16:45:33 2017
  write: IOPS=1570, BW=6284KiB/s (6435kB/s)(3682MiB/600002msec)
    slat (usec): min=11, max=2617.8k, avg=617.26, stdev=17969.35
    clat (usec): min=13, max=2676.9k, avg=9558.41, stdev=71784.85
     lat (usec): min=235, max=2677.4k, avg=10178.00, stdev=74245.06
    clat percentiles (usec):
     |  1.00th=[    865],  5.00th=[   1336], 10.00th=[   2008],
     | 20.00th=[   2343], 30.00th=[   2704], 40.00th=[   2999],
     | 50.00th=[   3359], 60.00th=[   3621], 70.00th=[   4228],
     | 80.00th=[   5014], 90.00th=[   6128], 95.00th=[   7111],
     | 99.00th=[ 137364], 99.50th=[ 383779], 99.90th=[1082131],
     | 99.95th=[1535116], 99.99th=[2365588]
   bw (  KiB/s): min=    8, max=44184, per=14.97%, avg=7707.74, stdev=6514.82, samples=980
   iops        : min=    2, max=11046, avg=1926.59, stdev=1628.66, samples=980
  lat (usec)   : 20=0.01%, 250=0.04%, 500=0.06%, 750=0.12%, 1000=2.60%
  lat (msec)   : 2=7.04%, 4=56.36%, 10=31.73%, 20=0.24%, 50=0.36%
  lat (msec)   : 100=0.30%, 250=0.41%, 500=0.35%, 750=0.16%, 1000=0.09%
  lat (msec)   : 2000=0.11%, >=2000=0.02%
  cpu          : usr=0.36%, sys=40.40%, ctx=916341, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,942585,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1154: Wed Dec 13 16:45:33 2017
  write: IOPS=1438, BW=5753KiB/s (5891kB/s)(3371MiB/600001msec)
    slat (usec): min=11, max=2902.8k, avg=676.21, stdev=20444.71
    clat (usec): min=23, max=4845.3k, avg=10439.08, stdev=82915.40
     lat (usec): min=229, max=4845.5k, avg=11117.62, stdev=85919.93
    clat percentiles (usec):
     |  1.00th=[    832],  5.00th=[   1237], 10.00th=[   1827],
     | 20.00th=[   2311], 30.00th=[   2704], 40.00th=[   2999],
     | 50.00th=[   3261], 60.00th=[   3490], 70.00th=[   4015],
     | 80.00th=[   4686], 90.00th=[   5866], 95.00th=[   6980],
     | 99.00th=[ 183501], 99.50th=[ 425722], 99.90th=[1333789],
     | 99.95th=[1971323], 99.99th=[2701132]
   bw (  KiB/s): min=    7, max=42276, per=14.19%, avg=7308.17, stdev=6556.14, samples=947
   iops        : min=    1, max=10569, avg=1826.70, stdev=1638.99, samples=947
  lat (usec)   : 50=0.01%, 250=0.05%, 500=0.07%, 750=0.18%, 1000=2.94%
  lat (msec)   : 2=7.98%, 4=58.60%, 10=27.93%, 20=0.26%, 50=0.43%
  lat (msec)   : 100=0.28%, 250=0.46%, 500=0.40%, 750=0.16%, 1000=0.12%
  lat (msec)   : 2000=0.10%, >=2000=0.05%
  cpu          : usr=0.29%, sys=36.57%, ctx=1089177, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,862978,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1155: Wed Dec 13 16:45:33 2017
  write: IOPS=1485, BW=5942KiB/s (6084kB/s)(3481MiB/600001msec)
    slat (usec): min=11, max=3129.5k, avg=654.20, stdev=19347.52
    clat (usec): min=38, max=4083.4k, avg=10108.44, stdev=78745.11
     lat (usec): min=231, max=4083.7k, avg=10764.95, stdev=81539.21
    clat percentiles (usec):
     |  1.00th=[    857],  5.00th=[   1303], 10.00th=[   1876],
     | 20.00th=[   2376], 30.00th=[   2769], 40.00th=[   3032],
     | 50.00th=[   3326], 60.00th=[   3621], 70.00th=[   4293],
     | 80.00th=[   5014], 90.00th=[   6128], 95.00th=[   7111],
     | 99.00th=[ 149947], 99.50th=[ 408945], 99.90th=[1266680],
     | 99.95th=[1769997], 99.99th=[2399142]
   bw (  KiB/s): min=    8, max=40801, per=14.48%, avg=7458.17, stdev=6168.79, samples=958
   iops        : min=    2, max=10200, avg=1864.21, stdev=1542.13, samples=958
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.06%, 750=0.27%, 1000=2.63%
  lat (msec)   : 2=8.13%, 4=55.45%, 10=31.31%, 20=0.25%, 50=0.37%
  lat (msec)   : 100=0.30%, 250=0.42%, 500=0.35%, 750=0.18%, 1000=0.09%
  lat (msec)   : 2000=0.11%, >=2000=0.03%
  cpu          : usr=0.33%, sys=39.00%, ctx=996024, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,891228,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1156: Wed Dec 13 16:45:33 2017
  write: IOPS=1785, BW=7142KiB/s (7314kB/s)(4185MiB/600001msec)
    slat (usec): min=11, max=3142.7k, avg=540.50, stdev=15841.26
    clat (usec): min=45, max=3915.5k, avg=8403.58, stdev=63204.82
     lat (usec): min=231, max=3915.7k, avg=8946.44, stdev=65380.05
    clat percentiles (usec):
     |  1.00th=[    881],  5.00th=[   1434], 10.00th=[   2180],
     | 20.00th=[   2442], 30.00th=[   2868], 40.00th=[   3195],
     | 50.00th=[   3425], 60.00th=[   3687], 70.00th=[   4293],
     | 80.00th=[   4948], 90.00th=[   6063], 95.00th=[   6980],
     | 99.00th=[  85459], 99.50th=[ 295699], 99.90th=[1019216],
     | 99.95th=[1384121], 99.99th=[2365588]
   bw (  KiB/s): min=   16, max=37184, per=16.54%, avg=8516.84, stdev=6147.10, samples=1007
   iops        : min=    4, max= 9296, avg=2129.02, stdev=1536.76, samples=1007
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.06%, 750=0.13%, 1000=1.83%
  lat (msec)   : 2=5.42%, 4=58.27%, 10=32.49%, 20=0.27%, 50=0.32%
  lat (msec)   : 100=0.24%, 250=0.37%, 500=0.29%, 750=0.12%, 1000=0.06%
  lat (msec)   : 2000=0.09%, >=2000=0.02%
  cpu          : usr=0.39%, sys=46.14%, ctx=811687, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1071347,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1157: Wed Dec 13 16:45:33 2017
  write: IOPS=1457, BW=5831KiB/s (5971kB/s)(3417MiB/600001msec)
    slat (usec): min=11, max=2990.9k, avg=666.92, stdev=19764.26
    clat (usec): min=40, max=3392.7k, avg=10300.16, stdev=78330.46
     lat (usec): min=231, max=3392.0k, avg=10969.40, stdev=80982.57
    clat percentiles (usec):
     |  1.00th=[    873],  5.00th=[   1319], 10.00th=[   1975],
     | 20.00th=[   2311], 30.00th=[   2737], 40.00th=[   2999],
     | 50.00th=[   3261], 60.00th=[   3458], 70.00th=[   3949],
     | 80.00th=[   4686], 90.00th=[   5800], 95.00th=[   6980],
     | 99.00th=[ 191890], 99.50th=[ 450888], 99.90th=[1182794],
     | 99.95th=[1719665], 99.99th=[2466251]
   bw (  KiB/s): min=    8, max=34228, per=14.39%, avg=7411.45, stdev=6196.08, samples=946
   iops        : min=    2, max= 8557, avg=1852.53, stdev=1548.99, samples=946
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.09%, 750=0.26%, 1000=2.29%
  lat (msec)   : 2=7.98%, 4=60.29%, 10=26.89%, 20=0.26%, 50=0.38%
  lat (msec)   : 100=0.27%, 250=0.42%, 500=0.42%, 750=0.21%, 1000=0.10%
  lat (msec)   : 2000=0.12%, >=2000=0.03%
  cpu          : usr=0.32%, sys=36.90%, ctx=1018693, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,874628,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1158: Wed Dec 13 16:45:33 2017
  write: IOPS=1649, BW=6597KiB/s (6755kB/s)(3865MiB/600001msec)
    slat (usec): min=11, max=3143.6k, avg=586.92, stdev=17474.34
    clat (usec): min=39, max=4084.4k, avg=9104.61, stdev=69703.21
     lat (usec): min=202, max=4084.6k, avg=9693.89, stdev=72110.31
    clat percentiles (usec):
     |  1.00th=[    865],  5.00th=[   1287], 10.00th=[   2024],
     | 20.00th=[   2409], 30.00th=[   2835], 40.00th=[   3163],
     | 50.00th=[   3392], 60.00th=[   3654], 70.00th=[   4228],
     | 80.00th=[   5014], 90.00th=[   6128], 95.00th=[   7111],
     | 99.00th=[ 119014], 99.50th=[ 354419], 99.90th=[1061159],
     | 99.95th=[1635779], 99.99th=[2432697]
   bw (  KiB/s): min=    8, max=40905, per=15.54%, avg=8003.47, stdev=6226.76, samples=990
   iops        : min=    2, max=10226, avg=2000.69, stdev=1556.67, samples=990
  lat (usec)   : 50=0.01%, 250=0.03%, 500=0.06%, 750=0.13%, 1000=2.61%
  lat (msec)   : 2=6.82%, 4=56.22%, 10=32.18%, 20=0.27%, 50=0.37%
  lat (msec)   : 100=0.24%, 250=0.40%, 500=0.32%, 750=0.14%, 1000=0.09%
  lat (msec)   : 2000=0.09%, >=2000=0.02%
  cpu          : usr=0.37%, sys=42.71%, ctx=873316, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,989567,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1159: Wed Dec 13 16:45:33 2017
  write: IOPS=1672, BW=6689KiB/s (6850kB/s)(3920MiB/600001msec)
    slat (usec): min=11, max=2633.0k, avg=578.72, stdev=17269.43
    clat (usec): min=11, max=3011.0k, avg=8979.73, stdev=68815.08
     lat (usec): min=237, max=3012.2k, avg=9560.80, stdev=71137.44
    clat percentiles (usec):
     |  1.00th=[    881],  5.00th=[   1336], 10.00th=[   1975],
     | 20.00th=[   2376], 30.00th=[   2802], 40.00th=[   3064],
     | 50.00th=[   3359], 60.00th=[   3621], 70.00th=[   4228],
     | 80.00th=[   4883], 90.00th=[   6063], 95.00th=[   7046],
     | 99.00th=[ 117965], 99.50th=[ 346031], 99.90th=[1082131],
     | 99.95th=[1568670], 99.99th=[2499806]
   bw (  KiB/s): min=    8, max=42064, per=15.78%, avg=8128.29, stdev=6379.01, samples=986
   iops        : min=    2, max=10516, avg=2032.02, stdev=1594.76, samples=986
  lat (usec)   : 20=0.01%, 250=0.04%, 500=0.07%, 750=0.12%, 1000=2.44%
  lat (msec)   : 2=7.78%, 4=56.46%, 10=31.22%, 20=0.26%, 50=0.32%
  lat (msec)   : 100=0.24%, 250=0.40%, 500=0.32%, 750=0.15%, 1000=0.06%
  lat (msec)   : 2000=0.10%, >=2000=0.02%
  cpu          : usr=0.35%, sys=43.06%, ctx=957715, majf=0, minf=7
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1003392,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1160: Wed Dec 13 16:45:33 2017
  write: IOPS=1816, BW=7264KiB/s (7438kB/s)(4256MiB/600001msec)
    slat (usec): min=12, max=3142.9k, avg=531.15, stdev=16191.13
    clat (usec): min=47, max=3146.2k, avg=8270.14, stdev=63500.94
     lat (usec): min=240, max=3146.4k, avg=8803.63, stdev=65645.85
    clat percentiles (usec):
     |  1.00th=[    873],  5.00th=[   1401], 10.00th=[   2114],
     | 20.00th=[   2409], 30.00th=[   2802], 40.00th=[   3064],
     | 50.00th=[   3359], 60.00th=[   3621], 70.00th=[   4228],
     | 80.00th=[   4883], 90.00th=[   5997], 95.00th=[   6980],
     | 99.00th=[  74974], 99.50th=[ 287310], 99.90th=[1002439],
     | 99.95th=[1434452], 99.99th=[2365588]
   bw (  KiB/s): min=    8, max=26781, per=17.00%, avg=8756.18, stdev=6125.17, samples=998
   iops        : min=    2, max= 6695, avg=2188.69, stdev=1531.26, samples=998
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.02%, 750=0.12%, 1000=2.32%
  lat (msec)   : 2=5.61%, 4=58.24%, 10=31.92%, 20=0.27%, 50=0.35%
  lat (msec)   : 100=0.22%, 250=0.38%, 500=0.27%, 750=0.12%, 1000=0.06%
  lat (msec)   : 2000=0.09%, >=2000=0.01%
  cpu          : usr=0.39%, sys=46.33%, ctx=803875, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1089611,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=50.3MiB/s (52.7MB/s), 5753KiB/s-7264KiB/s (5891kB/s-7438kB/s), io=29.5GiB (31.6GB), run=600001-600002msec

[-- Attachment #3: btrfs-old-4k --]
[-- Type: text/plain, Size: 13078 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
file1: Laying out IO file (1 file / 10240MiB)
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=23.9MiB/s][r=0,w=6130 IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1219: Wed Dec 13 16:25:59 2017
  write: IOPS=1737, BW=6949KiB/s (7116kB/s)(4072MiB/600003msec)
    slat (usec): min=12, max=4505.4k, avg=556.60, stdev=17945.08
    clat (usec): min=37, max=4509.3k, avg=8637.16, stdev=71426.10
     lat (usec): min=261, max=4509.5k, avg=9196.04, stdev=73807.97
    clat percentiles (usec):
     |  1.00th=[    873],  5.00th=[   1336], 10.00th=[   1762],
     | 20.00th=[   2343], 30.00th=[   2769], 40.00th=[   2999],
     | 50.00th=[   3326], 60.00th=[   3523], 70.00th=[   4015],
     | 80.00th=[   4752], 90.00th=[   5997], 95.00th=[   6980],
     | 99.00th=[  74974], 99.50th=[ 308282], 99.90th=[1098908],
     | 99.95th=[1518339], 99.99th=[2600469]
   bw (  KiB/s): min=    8, max=38012, per=17.25%, avg=8670.43, stdev=6123.73, samples=962
   iops        : min=    2, max= 9503, avg=2167.52, stdev=1530.93, samples=962
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.03%, 750=0.16%, 1000=2.57%
  lat (msec)   : 2=8.62%, 4=58.13%, 10=28.77%, 20=0.27%, 50=0.27%
  lat (msec)   : 100=0.24%, 250=0.32%, 500=0.29%, 750=0.12%, 1000=0.07%
  lat (msec)   : 2000=0.10%, >=2000=0.02%
  cpu          : usr=0.40%, sys=43.14%, ctx=761137, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1042323,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1220: Wed Dec 13 16:25:59 2017
  write: IOPS=1523, BW=6094KiB/s (6240kB/s)(3571MiB/600001msec)
    slat (usec): min=11, max=3106.5k, avg=636.61, stdev=19297.35
    clat (usec): min=8, max=3326.4k, avg=9841.72, stdev=76303.68
     lat (usec): min=153, max=3326.7k, avg=10480.72, stdev=78931.46
    clat percentiles (usec):
     |  1.00th=[    881],  5.00th=[   1336], 10.00th=[   2024],
     | 20.00th=[   2343], 30.00th=[   2802], 40.00th=[   3064],
     | 50.00th=[   3359], 60.00th=[   3589], 70.00th=[   4146],
     | 80.00th=[   4752], 90.00th=[   5866], 95.00th=[   6915],
     | 99.00th=[ 135267], 99.50th=[ 425722], 99.90th=[1182794],
     | 99.95th=[1551893], 99.99th=[2600469]
   bw (  KiB/s): min=    8, max=37835, per=15.13%, avg=7604.81, stdev=6137.33, samples=963
   iops        : min=    2, max= 9458, avg=1900.95, stdev=1534.30, samples=963
  lat (usec)   : 10=0.01%, 250=0.03%, 500=0.08%, 750=0.18%, 1000=2.16%
  lat (msec)   : 2=7.24%, 4=57.60%, 10=30.73%, 20=0.28%, 50=0.36%
  lat (msec)   : 100=0.23%, 250=0.35%, 500=0.31%, 750=0.19%, 1000=0.09%
  lat (msec)   : 2000=0.14%, >=2000=0.02%
  cpu          : usr=0.34%, sys=39.27%, ctx=1024014, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,914111,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1221: Wed Dec 13 16:25:59 2017
  write: IOPS=1579, BW=6317KiB/s (6468kB/s)(3701MiB/600001msec)
    slat (usec): min=11, max=3378.7k, avg=613.95, stdev=19369.95
    clat (usec): min=30, max=3382.7k, avg=9505.30, stdev=77110.08
     lat (usec): min=236, max=3382.0k, avg=10121.57, stdev=79761.57
    clat percentiles (usec):
     |  1.00th=[    889],  5.00th=[   1385], 10.00th=[   2040],
     | 20.00th=[   2376], 30.00th=[   2769], 40.00th=[   3032],
     | 50.00th=[   3326], 60.00th=[   3589], 70.00th=[   4178],
     | 80.00th=[   4817], 90.00th=[   6063], 95.00th=[   6980],
     | 99.00th=[ 111674], 99.50th=[ 383779], 99.90th=[1233126],
     | 99.95th=[1551893], 99.99th=[2835350]
   bw (  KiB/s): min=    8, max=32344, per=15.92%, avg=7999.22, stdev=6196.12, samples=950
   iops        : min=    2, max= 8086, avg=1999.49, stdev=1549.00, samples=950
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.06%, 750=0.14%, 1000=1.96%
  lat (msec)   : 2=7.38%, 4=57.50%, 10=31.13%, 20=0.25%, 50=0.32%
  lat (msec)   : 100=0.23%, 250=0.35%, 500=0.30%, 750=0.14%, 1000=0.08%
  lat (msec)   : 2000=0.13%, >=2000=0.03%
  cpu          : usr=0.36%, sys=40.24%, ctx=845216, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,947496,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1222: Wed Dec 13 16:25:59 2017
  write: IOPS=1504, BW=6019KiB/s (6163kB/s)(3527MiB/600003msec)
    slat (usec): min=11, max=3352.6k, avg=645.71, stdev=19845.49
    clat (usec): min=28, max=3701.9k, avg=9978.31, stdev=79408.41
     lat (usec): min=237, max=3702.1k, avg=10626.31, stdev=82130.21
    clat percentiles (usec):
     |  1.00th=[    857],  5.00th=[   1336], 10.00th=[   1975],
     | 20.00th=[   2343], 30.00th=[   2802], 40.00th=[   3032],
     | 50.00th=[   3326], 60.00th=[   3556], 70.00th=[   4113],
     | 80.00th=[   4752], 90.00th=[   5997], 95.00th=[   7046],
     | 99.00th=[ 132645], 99.50th=[ 434111], 99.90th=[1283458],
     | 99.95th=[1619002], 99.99th=[2600469]
   bw (  KiB/s): min=    7, max=35528, per=15.28%, avg=7681.10, stdev=5994.99, samples=940
   iops        : min=    1, max= 8882, avg=1920.22, stdev=1498.76, samples=940
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.03%, 750=0.14%, 1000=2.58%
  lat (msec)   : 2=7.65%, 4=57.94%, 10=29.71%, 20=0.30%, 50=0.35%
  lat (msec)   : 100=0.22%, 250=0.34%, 500=0.29%, 750=0.18%, 1000=0.10%
  lat (msec)   : 2000=0.15%, >=2000=0.03%
  cpu          : usr=0.32%, sys=38.58%, ctx=962514, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,902859,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1223: Wed Dec 13 16:25:59 2017
  write: IOPS=1752, BW=7011KiB/s (7179kB/s)(4108MiB/600002msec)
    slat (usec): min=12, max=4505.4k, avg=550.95, stdev=16994.58
    clat (usec): min=29, max=5045.7k, avg=8568.87, stdev=68091.35
     lat (usec): min=241, max=5045.9k, avg=9122.18, stdev=70418.08
    clat percentiles (usec):
     |  1.00th=[    889],  5.00th=[   1483], 10.00th=[   2180],
     | 20.00th=[   2442], 30.00th=[   2900], 40.00th=[   3195],
     | 50.00th=[   3392], 60.00th=[   3621], 70.00th=[   4146],
     | 80.00th=[   4817], 90.00th=[   5997], 95.00th=[   6915],
     | 99.00th=[  78119], 99.50th=[ 320865], 99.90th=[1027605],
     | 99.95th=[1417675], 99.99th=[2533360]
   bw (  KiB/s): min=    8, max=35574, per=16.97%, avg=8529.81, stdev=6213.26, samples=989
   iops        : min=    2, max= 8893, avg=2132.10, stdev=1553.27, samples=989
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.07%, 750=0.08%, 1000=1.69%
  lat (msec)   : 2=5.78%, 4=59.87%, 10=30.81%, 20=0.25%, 50=0.29%
  lat (msec)   : 100=0.21%, 250=0.33%, 500=0.30%, 750=0.11%, 1000=0.08%
  lat (msec)   : 2000=0.09%, >=2000=0.02%
  cpu          : usr=0.42%, sys=44.71%, ctx=804193, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1051592,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1224: Wed Dec 13 16:25:59 2017
  write: IOPS=1457, BW=5831KiB/s (5971kB/s)(3417MiB/600001msec)
    slat (usec): min=11, max=3408.5k, avg=667.13, stdev=20675.53
    clat (usec): min=31, max=3731.3k, avg=10298.98, stdev=83065.95
     lat (usec): min=220, max=3731.4k, avg=10968.38, stdev=85913.83
    clat percentiles (usec):
     |  1.00th=[    824],  5.00th=[   1254], 10.00th=[   1500],
     | 20.00th=[   2278], 30.00th=[   2573], 40.00th=[   2933],
     | 50.00th=[   3195], 60.00th=[   3425], 70.00th=[   3851],
     | 80.00th=[   4621], 90.00th=[   5932], 95.00th=[   6980],
     | 99.00th=[ 160433], 99.50th=[ 446694], 99.90th=[1300235],
     | 99.95th=[1736442], 99.99th=[2902459]
   bw (  KiB/s): min=    8, max=35558, per=15.21%, avg=7644.58, stdev=6275.71, samples=918
   iops        : min=    2, max= 8889, avg=1910.82, stdev=1568.89, samples=918
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.09%, 750=0.16%, 1000=3.44%
  lat (msec)   : 2=8.67%, 4=59.58%, 10=26.04%, 20=0.23%, 50=0.35%
  lat (msec)   : 100=0.26%, 250=0.38%, 500=0.35%, 750=0.16%, 1000=0.12%
  lat (msec)   : 2000=0.14%, >=2000=0.03%
  cpu          : usr=0.30%, sys=36.04%, ctx=992046, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,874718,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1225: Wed Dec 13 16:25:59 2017
  write: IOPS=1595, BW=6382KiB/s (6535kB/s)(3739MiB/600001msec)
    slat (usec): min=11, max=3222.8k, avg=607.54, stdev=18877.57
    clat (usec): min=27, max=3226.6k, avg=9412.18, stdev=74707.16
     lat (usec): min=167, max=3226.8k, avg=10022.04, stdev=77241.60
    clat percentiles (usec):
     |  1.00th=[    873],  5.00th=[   1336], 10.00th=[   1975],
     | 20.00th=[   2311], 30.00th=[   2737], 40.00th=[   2999],
     | 50.00th=[   3326], 60.00th=[   3523], 70.00th=[   4080],
     | 80.00th=[   4817], 90.00th=[   5997], 95.00th=[   6915],
     | 99.00th=[ 113771], 99.50th=[ 362808], 99.90th=[1199571],
     | 99.95th=[1535116], 99.99th=[2600469]
   bw (  KiB/s): min=    8, max=37675, per=16.03%, avg=8055.05, stdev=6107.68, samples=953
   iops        : min=    2, max= 9418, avg=2013.42, stdev=1526.91, samples=953
  lat (usec)   : 50=0.01%, 250=0.06%, 500=0.06%, 750=0.14%, 1000=2.53%
  lat (msec)   : 2=7.54%, 4=57.91%, 10=29.88%, 20=0.23%, 50=0.37%
  lat (msec)   : 100=0.25%, 250=0.33%, 500=0.35%, 750=0.12%, 1000=0.09%
  lat (msec)   : 2000=0.13%, >=2000=0.02%
  cpu          : usr=0.36%, sys=40.54%, ctx=986375, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,957247,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1226: Wed Dec 13 16:25:59 2017
  write: IOPS=1413, BW=5656KiB/s (5792kB/s)(3314MiB/600001msec)
    slat (usec): min=11, max=3209.1k, avg=688.54, stdev=20245.71
    clat (usec): min=43, max=3396.6k, avg=10618.11, stdev=82474.16
     lat (usec): min=234, max=3396.8k, avg=11308.90, stdev=85346.97
    clat percentiles (usec):
     |  1.00th=[    848],  5.00th=[   1221], 10.00th=[   1565],
     | 20.00th=[   2311], 30.00th=[   2737], 40.00th=[   3032],
     | 50.00th=[   3326], 60.00th=[   3523], 70.00th=[   4113],
     | 80.00th=[   4817], 90.00th=[   6128], 95.00th=[   7111],
     | 99.00th=[ 185598], 99.50th=[ 488637], 99.90th=[1317012],
     | 99.95th=[1702888], 99.99th=[2768241]
   bw (  KiB/s): min=    8, max=40056, per=14.11%, avg=7089.66, stdev=6368.59, samples=960
   iops        : min=    2, max=10014, avg=1772.08, stdev=1592.12, samples=960
  lat (usec)   : 50=0.01%, 250=0.01%, 500=0.19%, 750=0.18%, 1000=3.21%
  lat (msec)   : 2=9.05%, 4=55.64%, 10=29.63%, 20=0.27%, 50=0.35%
  lat (msec)   : 100=0.24%, 250=0.37%, 500=0.38%, 750=0.22%, 1000=0.10%
  lat (msec)   : 2000=0.15%, >=2000=0.03%
  cpu          : usr=0.31%, sys=36.04%, ctx=914036, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,848399,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=49.1MiB/s (51.5MB/s), 5656KiB/s-7011KiB/s (5792kB/s-7179kB/s), io=28.8GiB (30.9GB), run=600001-600003msec

[-- Attachment #4: xfs-new-4k --]
[-- Type: text/plain, Size: 12469 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
file1: Laying out IO file (1 file / 10240MiB)
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=102MiB/s][r=0,w=26.1k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1360: Wed Dec 13 16:02:00 2017
  write: IOPS=4178, BW=16.3MiB/s (17.1MB/s)(9793MiB/600001msec)
    slat (usec): min=3, max=765706, avg=201.29, stdev=1529.42
    clat (usec): min=60, max=769481, avg=3609.99, stdev=6000.14
     lat (usec): min=109, max=769867, avg=3815.72, stdev=6202.11
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   12], 99.90th=[   51], 99.95th=[   74],
     | 99.99th=[  313]
   bw (  KiB/s): min=  609, max=22883, per=12.66%, avg=16780.35, stdev=3058.72, samples=1198
   iops        : min=  152, max= 5720, avg=4194.72, stdev=764.69, samples=1198
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.09%, 4=87.62%, 10=11.70%, 20=0.25%, 50=0.22%
  lat (msec)   : 100=0.07%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.26%, sys=92.80%, ctx=216533, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2507016,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1361: Wed Dec 13 16:02:00 2017
  write: IOPS=4101, BW=16.0MiB/s (16.8MB/s)(9612MiB/600001msec)
    slat (usec): min=3, max=832923, avg=205.39, stdev=1646.45
    clat (usec): min=62, max=837254, avg=3677.76, stdev=6496.35
     lat (usec): min=106, max=837548, avg=3887.67, stdev=6720.67
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    7], 99.50th=[   19], 99.90th=[   69], 99.95th=[   95],
     | 99.99th=[  313]
   bw (  KiB/s): min=  384, max=22156, per=12.41%, avg=16450.03, stdev=3574.82, samples=1198
   iops        : min=   96, max= 5539, avg=4112.29, stdev=893.71, samples=1198
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.09%, 4=88.25%, 10=10.87%, 20=0.29%, 50=0.31%
  lat (msec)   : 100=0.12%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.27%, sys=91.00%, ctx=322342, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2460684,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1362: Wed Dec 13 16:02:00 2017
  write: IOPS=4158, BW=16.2MiB/s (17.0MB/s)(9747MiB/600001msec)
    slat (usec): min=3, max=829989, avg=201.98, stdev=1546.44
    clat (usec): min=82, max=833896, avg=3627.01, stdev=6048.97
     lat (usec): min=108, max=834076, avg=3833.49, stdev=6252.83
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   14], 99.90th=[   57], 99.95th=[   83],
     | 99.99th=[  300]
   bw (  KiB/s): min= 1611, max=43503, per=12.60%, avg=16705.43, stdev=3274.39, samples=1198
   iops        : min=  402, max=10875, avg=4176.00, stdev=818.59, samples=1198
  lat (usec)   : 100=0.01%, 250=0.11%, 500=0.03%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.11%, 4=87.72%, 10=11.40%, 20=0.25%, 50=0.25%
  lat (msec)   : 100=0.09%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.27%, sys=92.33%, ctx=253025, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2495350,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1363: Wed Dec 13 16:02:00 2017
  write: IOPS=4254, BW=16.6MiB/s (17.4MB/s)(9972MiB/600001msec)
    slat (usec): min=5, max=766806, avg=203.17, stdev=1368.21
    clat (usec): min=131, max=771002, avg=3541.83, stdev=5341.13
     lat (usec): min=138, max=771300, avg=3748.35, stdev=5520.27
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[    8], 99.90th=[   36], 99.95th=[   54],
     | 99.99th=[  300]
   bw (  KiB/s): min= 1731, max=22701, per=12.88%, avg=17083.49, stdev=2671.55, samples=1198
   iops        : min=  432, max= 5675, avg=4270.50, stdev=667.85, samples=1198
  lat (usec)   : 250=0.05%, 500=0.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.18%, 4=87.04%, 10=12.26%, 20=0.20%, 50=0.16%
  lat (msec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.21%, sys=94.41%, ctx=145712, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2552889,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1364: Wed Dec 13 16:02:00 2017
  write: IOPS=4055, BW=15.8MiB/s (16.6MB/s)(9506MiB/600001msec)
    slat (usec): min=3, max=847054, avg=208.13, stdev=1715.47
    clat (usec): min=55, max=850110, avg=3718.75, stdev=6733.35
     lat (usec): min=105, max=850414, avg=3931.37, stdev=6968.33
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    8], 99.50th=[   22], 99.90th=[   75], 99.95th=[  110],
     | 99.99th=[  317]
   bw (  KiB/s): min=  907, max=22252, per=12.29%, avg=16300.78, stdev=3677.63, samples=1197
   iops        : min=  226, max= 5563, avg=4074.82, stdev=919.42, samples=1197
  lat (usec)   : 100=0.01%, 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.12%, 4=87.75%, 10=11.24%, 20=0.31%, 50=0.33%
  lat (msec)   : 100=0.14%, 250=0.04%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.31%, sys=90.45%, ctx=396245, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2433447,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1365: Wed Dec 13 16:02:00 2017
  write: IOPS=4125, BW=16.1MiB/s (16.9MB/s)(9668MiB/600001msec)
    slat (usec): min=3, max=766661, avg=204.36, stdev=1540.93
    clat (usec): min=102, max=771014, avg=3656.38, stdev=6059.01
     lat (usec): min=107, max=771384, avg=3865.18, stdev=6265.22
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   17], 99.90th=[   63], 99.95th=[   86],
     | 99.99th=[  300]
   bw (  KiB/s): min= 1691, max=22549, per=12.49%, avg=16557.28, stdev=3511.44, samples=1198
   iops        : min=  422, max= 5637, avg=4138.97, stdev=877.87, samples=1198
  lat (usec)   : 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.09%, 4=87.64%, 10=11.51%, 20=0.30%, 50=0.29%
  lat (msec)   : 100=0.11%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.23%, sys=91.80%, ctx=295015, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2475045,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1366: Wed Dec 13 16:02:00 2017
  write: IOPS=4142, BW=16.2MiB/s (17.0MB/s)(9709MiB/600001msec)
    slat (usec): min=3, max=769630, avg=202.97, stdev=1605.67
    clat (usec): min=74, max=773586, avg=3641.33, stdev=6337.34
     lat (usec): min=106, max=773767, avg=3848.80, stdev=6552.80
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   15], 99.90th=[   58], 99.95th=[   86],
     | 99.99th=[  334]
   bw (  KiB/s): min=  224, max=22995, per=12.55%, avg=16640.38, stdev=3320.37, samples=1198
   iops        : min=   56, max= 5748, avg=4159.72, stdev=830.10, samples=1198
  lat (usec)   : 100=0.01%, 250=0.02%, 500=0.01%, 750=0.03%, 1000=0.01%
  lat (msec)   : 2=0.12%, 4=87.95%, 10=11.21%, 20=0.26%, 50=0.27%
  lat (msec)   : 100=0.09%, 250=0.03%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.33%, sys=91.99%, ctx=297864, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2485419,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1367: Wed Dec 13 16:02:00 2017
  write: IOPS=4130, BW=16.1MiB/s (16.9MB/s)(9681MiB/600001msec)
    slat (usec): min=3, max=836059, avg=203.03, stdev=1560.64
    clat (usec): min=60, max=839666, avg=3651.98, stdev=6149.70
     lat (usec): min=110, max=839905, avg=3859.61, stdev=6358.90
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    5], 95.00th=[    5],
     | 99.00th=[    6], 99.50th=[   16], 99.90th=[   62], 99.95th=[   87],
     | 99.99th=[  334]
   bw (  KiB/s): min=  937, max=22484, per=12.50%, avg=16579.11, stdev=3330.88, samples=1198
   iops        : min=  234, max= 5621, avg=4144.45, stdev=832.70, samples=1198
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.09%, 4=87.88%, 10=11.33%, 20=0.26%, 50=0.27%
  lat (msec)   : 100=0.10%, 250=0.02%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=1.28%, sys=91.92%, ctx=290466, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,2478319,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec

Disk stats (read/write):
  vdb: ios=1/27804, merge=0/1743, ticks=1/4511916, in_queue=3364804, util=25.86%

[-- Attachment #5: xfs-old-4k --]
[-- Type: text/plain, Size: 12705 bytes --]

file1: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.0
Starting 8 processes
Jobs: 8 (f=8): [W(8)][100.0%][r=0KiB/s,w=82.2MiB/s][r=0,w=21.0k IOPS][eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=1131: Wed Dec 13 16:14:21 2017
  write: IOPS=2908, BW=11.4MiB/s (11.9MB/s)(6816MiB/600001msec)
    slat (usec): min=5, max=787176, avg=316.82, stdev=3687.41
    clat (usec): min=74, max=1056.3k, avg=5171.76, stdev=14861.09
     lat (usec): min=135, max=1056.7k, avg=5491.61, stdev=15383.15
    clat percentiles (usec):
     |  1.00th=[  1532],  5.00th=[  2311], 10.00th=[  2573], 20.00th=[  2933],
     | 30.00th=[  3294], 40.00th=[  3556], 50.00th=[  3818], 60.00th=[  4047],
     | 70.00th=[  4359], 80.00th=[  4686], 90.00th=[  5342], 95.00th=[  5997],
     | 99.00th=[ 38536], 99.50th=[ 89654], 99.90th=[225444], 99.95th=[278922],
     | 99.99th=[505414]
   bw (  KiB/s): min=   16, max=37683, per=13.12%, avg=11655.03, stdev=4424.17, samples=1199
   iops        : min=    4, max= 9420, avg=2913.44, stdev=1106.04, samples=1199
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.23%
  lat (msec)   : 2=1.64%, 4=56.03%, 10=39.78%, 20=0.79%, 50=0.63%
  lat (msec)   : 100=0.40%, 250=0.37%, 500=0.07%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.79%, sys=74.80%, ctx=868689, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1744871,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1132: Wed Dec 13 16:14:21 2017
  write: IOPS=2802, BW=10.9MiB/s (11.5MB/s)(6567MiB/600001msec)
    slat (usec): min=5, max=1163.8k, avg=329.51, stdev=4078.35
    clat (usec): min=52, max=1271.4k, avg=5367.10, stdev=16476.69
     lat (usec): min=140, max=1271.5k, avg=5699.81, stdev=17066.63
    clat percentiles (usec):
     |  1.00th=[  1336],  5.00th=[  2278], 10.00th=[  2573], 20.00th=[  2900],
     | 30.00th=[  3261], 40.00th=[  3490], 50.00th=[  3752], 60.00th=[  4015],
     | 70.00th=[  4293], 80.00th=[  4621], 90.00th=[  5276], 95.00th=[  6063],
     | 99.00th=[ 47449], 99.50th=[106431], 99.90th=[254804], 99.95th=[316670],
     | 99.99th=[463471]
   bw (  KiB/s): min=   96, max=34464, per=12.64%, avg=11225.62, stdev=4880.80, samples=1198
   iops        : min=   24, max= 8616, avg=2806.20, stdev=1220.20, samples=1198
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.06%, 1000=0.33%
  lat (msec)   : 2=1.78%, 4=57.81%, 10=37.44%, 20=0.85%, 50=0.73%
  lat (msec)   : 100=0.44%, 250=0.42%, 500=0.10%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.79%, sys=71.65%, ctx=1044041, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1681268,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1133: Wed Dec 13 16:14:21 2017
  write: IOPS=2985, BW=11.7MiB/s (12.2MB/s)(6997MiB/600001msec)
    slat (usec): min=5, max=748162, avg=307.69, stdev=3307.29
    clat (usec): min=60, max=775858, avg=5038.81, stdev=13095.28
     lat (usec): min=165, max=776031, avg=5349.64, stdev=13543.69
    clat percentiles (usec):
     |  1.00th=[  1483],  5.00th=[  2311], 10.00th=[  2573], 20.00th=[  2933],
     | 30.00th=[  3294], 40.00th=[  3523], 50.00th=[  3785], 60.00th=[  4047],
     | 70.00th=[  4293], 80.00th=[  4686], 90.00th=[  5342], 95.00th=[  5997],
     | 99.00th=[ 36439], 99.50th=[ 82314], 99.90th=[204473], 99.95th=[246416],
     | 99.99th=[404751]
   bw (  KiB/s): min=   24, max=35440, per=13.46%, avg=11951.36, stdev=3904.50, samples=1200
   iops        : min=    6, max= 8860, avg=2987.58, stdev=976.12, samples=1200
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.04%, 1000=0.23%
  lat (msec)   : 2=1.69%, 4=56.62%, 10=39.10%, 20=0.82%, 50=0.68%
  lat (msec)   : 100=0.41%, 250=0.35%, 500=0.04%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.84%, sys=76.03%, ctx=786053, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1791122,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1134: Wed Dec 13 16:14:21 2017
  write: IOPS=2774, BW=10.8MiB/s (11.4MB/s)(6502MiB/600001msec)
    slat (usec): min=6, max=1165.7k, avg=332.59, stdev=4092.38
    clat (usec): min=62, max=1168.9k, avg=5421.32, stdev=16731.39
     lat (usec): min=142, max=1169.2k, avg=5757.12, stdev=17348.61
    clat percentiles (usec):
     |  1.00th=[  1467],  5.00th=[  2311], 10.00th=[  2606], 20.00th=[  2966],
     | 30.00th=[  3294], 40.00th=[  3556], 50.00th=[  3818], 60.00th=[  4047],
     | 70.00th=[  4359], 80.00th=[  4686], 90.00th=[  5342], 95.00th=[  6063],
     | 99.00th=[ 47449], 99.50th=[105382], 99.90th=[258999], 99.95th=[320865],
     | 99.99th=[476054]
   bw (  KiB/s): min=    8, max=40232, per=12.54%, avg=11136.08, stdev=4803.28, samples=1197
   iops        : min=    2, max=10058, avg=2783.74, stdev=1200.81, samples=1197
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.07%, 1000=0.25%
  lat (msec)   : 2=1.71%, 4=55.77%, 10=39.67%, 20=0.81%, 50=0.73%
  lat (msec)   : 100=0.43%, 250=0.42%, 500=0.10%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.74%, sys=71.57%, ctx=949754, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1664518,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1135: Wed Dec 13 16:14:21 2017
  write: IOPS=2692, BW=10.5MiB/s (11.0MB/s)(6311MiB/600002msec)
    slat (usec): min=6, max=951339, avg=344.14, stdev=4217.19
    clat (usec): min=45, max=955624, avg=5584.63, stdev=17319.60
     lat (usec): min=144, max=955868, avg=5931.93, stdev=17960.09
    clat percentiles (usec):
     |  1.00th=[  1713],  5.00th=[  2311], 10.00th=[  2573], 20.00th=[  2933],
     | 30.00th=[  3261], 40.00th=[  3523], 50.00th=[  3785], 60.00th=[  4015],
     | 70.00th=[  4293], 80.00th=[  4621], 90.00th=[  5276], 95.00th=[  6063],
     | 99.00th=[ 60031], 99.50th=[121111], 99.90th=[256902], 99.95th=[312476],
     | 99.99th=[497026]
   bw (  KiB/s): min=   48, max=23743, per=12.17%, avg=10806.28, stdev=5068.49, samples=1197
   iops        : min=   12, max= 5935, avg=2701.33, stdev=1267.12, samples=1197
  lat (usec)   : 50=0.01%, 250=0.02%, 500=0.01%, 750=0.05%, 1000=0.23%
  lat (msec)   : 2=1.45%, 4=57.54%, 10=37.96%, 20=0.85%, 50=0.77%
  lat (msec)   : 100=0.49%, 250=0.53%, 500=0.10%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.73%, sys=69.40%, ctx=1092449, majf=0, minf=12
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1615619,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1136: Wed Dec 13 16:14:21 2017
  write: IOPS=2808, BW=10.0MiB/s (11.5MB/s)(6582MiB/600001msec)
    slat (usec): min=5, max=1164.6k, avg=328.77, stdev=4029.61
    clat (usec): min=69, max=1167.1k, avg=5355.39, stdev=16199.32
     lat (usec): min=133, max=1167.3k, avg=5687.28, stdev=16774.26
    clat percentiles (usec):
     |  1.00th=[  1729],  5.00th=[  2343], 10.00th=[  2606], 20.00th=[  2966],
     | 30.00th=[  3326], 40.00th=[  3556], 50.00th=[  3818], 60.00th=[  4047],
     | 70.00th=[  4293], 80.00th=[  4621], 90.00th=[  5276], 95.00th=[  5997],
     | 99.00th=[ 46400], 99.50th=[102237], 99.90th=[235930], 99.95th=[299893],
     | 99.99th=[497026]
   bw (  KiB/s): min=   32, max=21088, per=12.69%, avg=11272.85, stdev=4574.15, samples=1197
   iops        : min=    8, max= 5272, avg=2817.91, stdev=1143.53, samples=1197
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.23%
  lat (msec)   : 2=1.43%, 4=56.67%, 10=39.15%, 20=0.79%, 50=0.71%
  lat (msec)   : 100=0.45%, 250=0.42%, 500=0.08%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.77%, sys=72.24%, ctx=899175, majf=0, minf=10
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1684919,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1137: Wed Dec 13 16:14:21 2017
  write: IOPS=2813, BW=10.0MiB/s (11.5MB/s)(6593MiB/600001msec)
    slat (usec): min=5, max=585328, avg=328.40, stdev=3937.41
    clat (usec): min=134, max=626978, avg=5346.05, stdev=15758.18
     lat (usec): min=141, max=627505, avg=5677.59, stdev=16313.91
    clat percentiles (usec):
     |  1.00th=[  1418],  5.00th=[  2278], 10.00th=[  2573], 20.00th=[  2933],
     | 30.00th=[  3294], 40.00th=[  3556], 50.00th=[  3818], 60.00th=[  4047],
     | 70.00th=[  4293], 80.00th=[  4686], 90.00th=[  5407], 95.00th=[  6063],
     | 99.00th=[ 42730], 99.50th=[104334], 99.90th=[250610], 99.95th=[316670],
     | 99.99th=[459277]
   bw (  KiB/s): min=  104, max=22252, per=12.67%, avg=11258.18, stdev=4395.06, samples=1200
   iops        : min=   26, max= 5563, avg=2814.34, stdev=1098.77, samples=1200
  lat (usec)   : 250=0.03%, 500=0.02%, 750=0.08%, 1000=0.27%
  lat (msec)   : 2=1.71%, 4=55.92%, 10=39.49%, 20=0.86%, 50=0.71%
  lat (msec)   : 100=0.38%, 250=0.42%, 500=0.10%, 750=0.01%
  cpu          : usr=0.74%, sys=72.58%, ctx=934703, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1687883,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
file1: (groupid=0, jobs=1): err= 0: pid=1138: Wed Dec 13 16:14:21 2017
  write: IOPS=2422, BW=9689KiB/s (9921kB/s)(5677MiB/600001msec)
    slat (usec): min=5, max=836561, avg=385.80, stdev=5028.24
    clat (usec): min=83, max=911210, avg=6206.53, stdev=20455.78
     lat (usec): min=143, max=911433, avg=6595.48, stdev=21201.82
    clat percentiles (usec):
     |  1.00th=[  1303],  5.00th=[  2245], 10.00th=[  2540], 20.00th=[  2900],
     | 30.00th=[  3294], 40.00th=[  3556], 50.00th=[  3818], 60.00th=[  4047],
     | 70.00th=[  4359], 80.00th=[  4752], 90.00th=[  5473], 95.00th=[  6259],
     | 99.00th=[ 86508], 99.50th=[149947], 99.90th=[295699], 99.95th=[354419],
     | 99.99th=[549454]
   bw (  KiB/s): min=   48, max=28440, per=10.94%, avg=9712.94, stdev=5482.06, samples=1198
   iops        : min=   12, max= 7110, avg=2428.04, stdev=1370.51, samples=1198
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.03%, 1000=0.36%
  lat (msec)   : 2=2.00%, 4=55.19%, 10=39.22%, 20=0.84%, 50=0.84%
  lat (msec)   : 100=0.61%, 250=0.73%, 500=0.14%, 750=0.01%, 1000=0.01%
  cpu          : usr=0.66%, sys=63.84%, ctx=1410998, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=0,1453347,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec

Disk stats (read/write):
  vdb: ios=0/39272, merge=0/258, ticks=0/3274077, in_queue=2287355, util=22.13%

* Re: [PATCH 14/19] xfs: convert to new i_version API
  2017-12-13 14:20 ` [PATCH 14/19] xfs: convert to " Jeff Layton
@ 2017-12-13 22:48   ` Dave Chinner
  2017-12-13 23:25     ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2017-12-13 22:48 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
>  fs/xfs/xfs_icache.c           | 4 ++--
>  fs/xfs/xfs_inode.c            | 2 +-
>  fs/xfs/xfs_inode_item.c       | 2 +-
>  fs/xfs/xfs_trans_inode.c      | 2 +-
>  5 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6b7989038d75..6b47de201391 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -264,7 +264,8 @@ xfs_inode_from_disk(
>  	to->di_flags	= be16_to_cpu(from->di_flags);
>  
>  	if (to->di_version == 3) {
> -		inode->i_version = be64_to_cpu(from->di_changecount);
> +		inode_set_iversion_queried(inode,
> +					   be64_to_cpu(from->di_changecount));

So we use the "kernel managed" (really not sure what that means)
set function here to read it off disk, but...

>  		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
>  		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
>  		to->di_flags2 = be64_to_cpu(from->di_flags2);
> @@ -314,7 +315,7 @@ xfs_inode_to_disk(
>  	to->di_flags = cpu_to_be16(from->di_flags);
>  
>  	if (from->di_version == 3) {
> -		to->di_changecount = cpu_to_be64(inode->i_version);
> +		to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));

... use the raw access mode to put it back on disk.

>  		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 43005fbe8b1e..4838462616fd 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -293,14 +293,14 @@ xfs_reinit_inode(
>  	int		error;
>  	uint32_t	nlink = inode->i_nlink;
>  	uint32_t	generation = inode->i_generation;
> -	uint64_t	version = inode->i_version;
> +	uint64_t	version = inode_peek_iversion_raw(inode);
>  	umode_t		mode = inode->i_mode;
>  
>  	error = inode_init_always(mp->m_super, inode);
>  
>  	set_nlink(inode, nlink);
>  	inode->i_generation = generation;
> -	inode->i_version = version;
> +	inode_set_iversion_queried(inode, version);

Again - raw mode to read, kernel managed to set.

>  	inode->i_mode = mode;
>  	return error;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 801274126648..be6d87980dd5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -833,7 +833,7 @@ xfs_ialloc(
>  	ip->i_d.di_flags = 0;
>  
>  	if (ip->i_d.di_version == 3) {
> -		inode->i_version = 1;
> +		inode_set_iversion(inode, 1);

But here you are using the "filesystem managed" mode to set the
new value. Why? How is this any different from reading the value
off disk and setting it?

> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -117,7 +117,7 @@ xfs_trans_log_inode(
>  	 */
>  	if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
>  	    IS_I_VERSION(VFS_I(ip))) {
> -		VFS_I(ip)->i_version++;
> +		inode_inc_iversion(VFS_I(ip));
>  		flags |= XFS_ILOG_CORE;
>  	}

And isn't this a case of "filesystem managed" iversion behaviour?

Basically, I can't make head or tail of why the different API
functions are used here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-13 20:14   ` Jeff Layton
  2017-12-13 22:10     ` Jeff Layton
@ 2017-12-13 23:03     ` Dave Chinner
  2017-12-14  0:02       ` Jeff Layton
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2017-12-13 23:03 UTC (permalink / raw)
  To: Jeff Layton
  Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, hch, neilb,
	amir73il, jack, viro

On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > This is great, thanks.
> > 
> > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > With this, we reduce inode metadata updates across all 3 filesystems
> > > down to roughly the frequency of the timestamp granularity, particularly
> > > when it's not being queried (the vastly common case).
> > > 
> > > The pessimal workload here is 1 byte writes, and it helps that
> > > significantly. Of course, that's not what we'd consider a real-world
> > > workload.
> > > 
> > > A tiobench-example.fio workload also shows some modest performance
> > > gains, and I've gotten mails from the kernel test robot that show some
> > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > the vm-scalability testsuite to be specific), with an earlier version of
> > > this set.
> > > 
> > > With larger writes, the gains with this patchset mostly vaporize,
> > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > 
> > > I'm happy to run other workloads if anyone can suggest them.
> > > 
> > > At this point, the patchset works and does what it's expected to do in
> > > my own testing. It seems like it's at least a modest performance win
> > > across all 3 major disk-based filesystems. It may also encourage others
> > > to implement i_version as well since it reduces the cost.
> > 
> > Do you have an idea what the remaining cost is?
> > 
> > Especially in the ext4 case, are you still able to measure any
> > difference in performance between the cases where i_version is turned on
> > and off, after these patches?
> 
> Attached is a fio jobfile + the output from 3 different runs using it
> with ext4. This one is using 4k writes. There was no querying of
> i_version during the runs.  I've done several runs with each and these
> are pretty representative of the results:
> 
> old = 4.15-rc3, i_version enabled
> ivers = 4.15-rc3 + these patches, i_version enabled
> noivers = 4.15-rc3 + these patches, i_version disabled
> 
> To snip out the run status lines:
> 
> old:
> WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> 
> ivers:
> WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> 
> noivers:
> WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> 
> So, I see some performance degradation with -o iversion compared to not
> having it enabled (maybe due to the extra atomic fetches?), but this set
> erases most of the difference.
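
To put rough numbers on "most of the difference", here is a small sketch
that computes how far each configuration falls short of the noivers run.
The bandwidth figures are copied from the run status lines quoted above;
the macro and function names are made up for illustration only.

```c
/* Aggregate WRITE bandwidth in MiB/s, taken from the quoted fio summaries. */
#define BW_OLD     85.6		/* 4.15-rc3, i_version enabled */
#define BW_IVERS   110.0	/* 4.15-rc3 + patches, i_version enabled */
#define BW_NOIVERS 117.0	/* 4.15-rc3 + patches, i_version disabled */

/* Shortfall of "measured" relative to "reference", as a percentage. */
double shortfall_pct(double reference, double measured)
{
	return (reference - measured) / reference * 100.0;
}

/* Cost of i_version before the patches, relative to having it disabled. */
double old_shortfall(void)
{
	return shortfall_pct(BW_NOIVERS, BW_OLD);
}

/* Remaining cost of i_version with the patches applied. */
double ivers_shortfall(void)
{
	return shortfall_pct(BW_NOIVERS, BW_IVERS);
}
```

With these inputs the unpatched kernel comes out about 27% below the
noivers figure, and the patched kernel about 6% below it. This is a
single set of runs, so treat the percentages as indicative only.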

So what is the performance difference when something is actively
querying the i_version counter as fast as it can (e.g. file being
constantly stat()d via NFS whilst being modified)? How does the
performance compare to the old code in that case?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 14/19] xfs: convert to new i_version API
  2017-12-13 22:48   ` Dave Chinner
@ 2017-12-13 23:25     ` Dave Chinner
  2017-12-14  0:10       ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2017-12-13 23:25 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro


So now I've looked at the last patch .....

On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > From: Jeff Layton <jlayton@redhat.com>
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> >  fs/xfs/xfs_icache.c           | 4 ++--
> >  fs/xfs/xfs_inode.c            | 2 +-
> >  fs/xfs/xfs_inode_item.c       | 2 +-
> >  fs/xfs/xfs_trans_inode.c      | 2 +-
> >  5 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6b7989038d75..6b47de201391 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> >  	to->di_flags	= be16_to_cpu(from->di_flags);
> >  
> >  	if (to->di_version == 3) {
> > -		inode->i_version = be64_to_cpu(from->di_changecount);
> > +		inode_set_iversion_queried(inode,
> > +					   be64_to_cpu(from->di_changecount));
> 
> So we use the "kernel managed" (really not sure what that means)
> set function here to read it off disk, but...

This stores the value from disk in the incore inode as "val << 1",
then sets the lowest bit to indicate that it has been "queried"
so that it will be incremented on the first modification.

Why do we initialise values read from disk as "queried"? This means
the i_version will change once every time it's brought into memory
and modified, regardless of whether anyone is looking at it. What
purpose does this serve?

> >  		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> >  		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> >  		to->di_flags2 = be64_to_cpu(from->di_flags2);
> > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> >  	to->di_flags = cpu_to_be16(from->di_flags);
> >  
> >  	if (from->di_version == 3) {
> > -		to->di_changecount = cpu_to_be64(inode->i_version);
> > +		to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
> 
> ... use the raw access mode to put it back on disk.

This writes the current inode->i_version value directly to disk,
including the "queried" flag.

Hence every time this inode cycles through memory and is modified,
we essentially shift the on-disk i_version value upwards by 1 slot
(i.e. double its value) when we read it back in from disk.

Seems like a bug - this is not a monotonically increasing counter
anymore - after ~60 modification cycles through memory it's going to
have a practically random value when pulled in off disk, not a
slowly increasing value.
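
To make the doubling concrete, here is a minimal userspace sketch. It
assumes the in-core layout described above: the counter is stored
shifted left by one, with the "queried" flag in bit 0. The sim_* names
are hypothetical stand-ins that only model the behaviour; they are not
the kernel helpers from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the in-core value: (counter << 1) | queried flag. */
struct sim_inode {
	uint64_t i_version;
};

/* Models loading from disk: shift the counter up, mark it as queried. */
static void sim_set_iversion_queried(struct sim_inode *inode, uint64_t val)
{
	inode->i_version = (val << 1) | 1;
}

/* Raw read: returns the in-core value with the flag bit still in it. */
static uint64_t sim_peek_iversion_raw(const struct sim_inode *inode)
{
	return inode->i_version;
}

/* Cooked read: strips the flag bit and returns only the counter. */
static uint64_t sim_peek_iversion(const struct sim_inode *inode)
{
	return inode->i_version >> 1;
}

/* One modification: clear the flag and advance the counter by one. */
static void sim_inc_iversion(struct sim_inode *inode)
{
	inode->i_version = (inode->i_version & ~(uint64_t)1) + 2;
}

/*
 * Repeat "read from disk, modify once, write back" for a number of
 * cycles and return the final on-disk value. With write_raw set, the
 * raw in-core value goes to disk, as in the hunk under review.
 */
uint64_t sim_cycles(uint64_t on_disk, int cycles, bool write_raw)
{
	struct sim_inode inode;

	for (int i = 0; i < cycles; i++) {
		sim_set_iversion_queried(&inode, on_disk);
		sim_inc_iversion(&inode);
		on_disk = write_raw ? sim_peek_iversion_raw(&inode)
				    : sim_peek_iversion(&inode);
	}
	return on_disk;
}
```

Starting from an on-disk value of 1, three cycles through the raw path
give 4, 10 and then 22, while the non-raw path gives 2, 3 and then 4.
In this model each raw cycle maps x to 2 * (x + 1), which is where the
roughly-60-cycle limit for a 64-bit value comes from.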

> >  		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 43005fbe8b1e..4838462616fd 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> >  	int		error;
> >  	uint32_t	nlink = inode->i_nlink;
> >  	uint32_t	generation = inode->i_generation;
> > -	uint64_t	version = inode->i_version;
> > +	uint64_t	version = inode_peek_iversion_raw(inode);
> >  	umode_t		mode = inode->i_mode;
> >  
> >  	error = inode_init_always(mp->m_super, inode);
> >  
> >  	set_nlink(inode, nlink);
> >  	inode->i_generation = generation;
> > -	inode->i_version = version;
> > +	inode_set_iversion_queried(inode, version);
> 
> Again - raw mode to read, kernel managed to set.

This, again, will double the i_version value. Shouldn't all the XFS
code just be using inode_peek_iversion(), not the _raw variant?

> 
> >  	inode->i_mode = mode;
> >  	return error;
> >  }
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 801274126648..be6d87980dd5 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -833,7 +833,7 @@ xfs_ialloc(
> >  	ip->i_d.di_flags = 0;
> >  
> >  	if (ip->i_d.di_version == 3) {
> > -		inode->i_version = 1;
> > +		inode_set_iversion(inode, 1);
> 
> But here you are using the "filesystem managed" mode to set the
> new value. Why? How is this any different from reading the value
> off disk and setting it?

Still don't understand why this is different to reading the inode
from disk....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-13 23:03     ` Dave Chinner
@ 2017-12-14  0:02       ` Jeff Layton
  2017-12-14 14:14         ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-14  0:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, hch, neilb,
	amir73il, jack, viro

On Thu, 2017-12-14 at 10:03 +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> > On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > > This is great, thanks.
> > > 
> > > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > > With this, we reduce inode metadata updates across all 3 filesystems
> > > > down to roughly the frequency of the timestamp granularity, particularly
> > > > when it's not being queried (the vastly common case).
> > > > 
> > > > The pessimal workload here is 1 byte writes, and it helps that
> > > > significantly. Of course, that's not what we'd consider a real-world
> > > > workload.
> > > > 
> > > > A tiobench-example.fio workload also shows some modest performance
> > > > gains, and I've gotten mails from the kernel test robot that show some
> > > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > > the vm-scalability testsuite to be specific), with an earlier version of
> > > > this set.
> > > > 
> > > > With larger writes, the gains with this patchset mostly vaporize,
> > > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > > 
> > > > I'm happy to run other workloads if anyone can suggest them.
> > > > 
> > > > At this point, the patchset works and does what it's expected to do in
> > > > my own testing. It seems like it's at least a modest performance win
> > > > across all 3 major disk-based filesystems. It may also encourage others
> > > > to implement i_version as well since it reduces the cost.
> > > 
> > > Do you have an idea what the remaining cost is?
> > > 
> > > Especially in the ext4 case, are you still able to measure any
> > > difference in performance between the cases where i_version is turned on
> > > and off, after these patches?
> > 
> > Attached is a fio jobfile + the output from 3 different runs using it
> > with ext4. This one is using 4k writes. There was no querying of
> > i_version during the runs.  I've done several runs with each and these
> > are pretty representative of the results:
> > 
> > old = 4.15-rc3, i_version enabled
> > ivers = 4.15-rc3 + these patches, i_version enabled
> > noivers = 4.15-rc3 + these patches, i_version disabled
> > 
> > To snip out the run status lines:
> > 
> > old:
> > WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> > 
> > ivers:
> > WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> > 
> > noivers:
> > WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> > 
> > So, I see some performance degradation with -o iversion compared to not
> > having it enabled (maybe due to the extra atomic fetches?), but this set
> > erases most of the difference.
> 
> So what is the performance difference when something is actively
> querying the i_version counter as fast as it can (e.g. file being
> constantly stat()d via NFS whilst being modified)? How does the
> performance compare to the old code in that case?
> 

I haven't benchmarked that with the latest set, but I did with the set
that I posted around a year ago. Basically I just ran a similar test to
this, and had another shell open calling statx(..., STATX_VERSION, ...)
on the file in a tight loop.

I did see some performance hit vs. the case where no one is viewing it,
but it was still significantly faster than the unpatched version that
was incrementing the counter every time.

That was on a different test rig, and the patchset has some small
differences now. I'll see if I can get some hard numbers with such a
testcase soon.
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 14/19] xfs: convert to new i_version API
  2017-12-13 23:25     ` Dave Chinner
@ 2017-12-14  0:10       ` Jeff Layton
  2017-12-14  2:17         ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-14  0:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> So now I've looked at the last patch .....
> 
> On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > From: Jeff Layton <jlayton@redhat.com>
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > >  fs/xfs/xfs_icache.c           | 4 ++--
> > >  fs/xfs/xfs_inode.c            | 2 +-
> > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 6b7989038d75..6b47de201391 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > >  	to->di_flags	= be16_to_cpu(from->di_flags);
> > >  
> > >  	if (to->di_version == 3) {
> > > -		inode->i_version = be64_to_cpu(from->di_changecount);
> > > +		inode_set_iversion_queried(inode,
> > > +					   be64_to_cpu(from->di_changecount));
> > 
> > So we use the "kernel managed" (really not sure what that means)
> > set function here to read it off disk, but...
> 
> This stores the value from disk in the incore inode as "val << 1",
> then sets the lowest bit to indicate that it has been "queried"
> so that it will be incremented on the first modification.
> 
> Why do we initialise values read from disk as "queried"? This means
> the i_version will change once every time it's brought into memory
> and modified, regardless of whether anyone is looking at it. What
> purpose does this serve?
> 

I don't think we want to store the QUERIED bit.

It's always possible that we crash at an inopportune time, after a query
was made against this value but before it hit the backing store.

If we always set the queried bit when we load it from disk, then we know
that that scenario is harmless, at the negligible expense of having to
bump it on the first write.
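
A minimal userspace sketch of that reasoning follows. It assumes the
counter is only bumped when the in-core "queried" flag is set, and that
the flag is lost across a crash. The reload_* names are hypothetical
and only model the behaviour; they are not the kernel helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the in-core value: (counter << 1) | queried flag. */
struct reload_inode {
	uint64_t raw;
};

/* Load the counter from disk, optionally marking it as already queried. */
static void reload_load(struct reload_inode *inode, uint64_t disk_val,
			bool mark_queried)
{
	inode->raw = (disk_val << 1) | (mark_queried ? 1 : 0);
}

/* A write: bump the counter only if someone queried it since the last bump. */
static bool reload_maybe_inc(struct reload_inode *inode)
{
	if (!(inode->raw & 1))
		return false;
	inode->raw = (inode->raw & ~(uint64_t)1) + 2;
	return true;
}

/* An observer reads the counter, which sets the queried flag. */
static uint64_t reload_query(struct reload_inode *inode)
{
	inode->raw |= 1;
	return inode->raw >> 1;
}

/*
 * Scenario: an observer saw disk_val before a crash, and the in-core
 * record of that query was lost. After reload, one write happens and
 * the observer queries again. Returns true if the observer can tell
 * that the file changed.
 */
bool reload_change_visible(uint64_t disk_val, bool mark_queried)
{
	struct reload_inode inode;
	uint64_t seen_before_crash = disk_val;

	reload_load(&inode, disk_val, mark_queried);
	reload_maybe_inc(&inode);
	return reload_query(&inode) != seen_before_crash;
}
```

In this model, loading with the flag set makes the post-crash write
visible to the observer, while loading with it clear hides the write.
The price is one extra bump on the first write after each load.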

> > >  		to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> > >  		to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = be64_to_cpu(from->di_flags2);
> > > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> > >  	to->di_flags = cpu_to_be16(from->di_flags);
> > >  
> > >  	if (from->di_version == 3) {
> > > -		to->di_changecount = cpu_to_be64(inode->i_version);
> > > +		to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
> > 
> > ... use the raw access mode to put it back on disk.
> 
> This writes the current inode->i_version value directly to disk,
> including the "queried" flag.
> 
> Hence every time this inode cycles through memory and is modified,
> we essentially shift the on-disk i_version value upwards by 1 slot
> (i.e. double its value) when we read it back in from disk.
> 
> Seems like a bug - this is not a monotonically increasing counter
> anymore - after ~60 modification cycles through memory it's going to
> have a practically random value when pulled in off disk, not a
> slowly increasing value.
> 

Good catch. That's definitely a bug. I'll fix it and test again. This
new API went through several iterations. I'll go back through it in more
detail.

I don't think it'll affect performance, but I'll test again
to be sure.

> > >  		to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 43005fbe8b1e..4838462616fd 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> > >  	int		error;
> > >  	uint32_t	nlink = inode->i_nlink;
> > >  	uint32_t	generation = inode->i_generation;
> > > -	uint64_t	version = inode->i_version;
> > > +	uint64_t	version = inode_peek_iversion_raw(inode);
> > >  	umode_t		mode = inode->i_mode;
> > >  
> > >  	error = inode_init_always(mp->m_super, inode);
> > >  
> > >  	set_nlink(inode, nlink);
> > >  	inode->i_generation = generation;
> > > -	inode->i_version = version;
> > > +	inode_set_iversion_queried(inode, version);
> > 
> > Again - raw mode to read, kernel managed to set.
> 
> This, again, will double the i_version value. Shouldn't all the XFS
> code just be using inode_peek_iversion(), not the _raw variant?
> 

Yes, indeed. Will fix.

> > 
> > >  	inode->i_mode = mode;
> > >  	return error;
> > >  }
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 801274126648..be6d87980dd5 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > >  	ip->i_d.di_flags = 0;
> > >  
> > >  	if (ip->i_d.di_version == 3) {
> > > -		inode->i_version = 1;
> > > +		inode_set_iversion(inode, 1);
> > 
> > > But here you are using the "filesystem managed" mode to set the
> > new value. Why? How is this any different from reading the value
> > off disk and setting it?
> 
> Still don't understand why this is different to reading the inode
> from disk....

This is allocating a brand new, never-before-seen inode. There's no
way this i_version could have ever been seen, so there's no need to flag
it as queried.

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-13 22:04   ` NeilBrown
@ 2017-12-14  0:27     ` Jeff Layton
  2017-12-16  4:17       ` NeilBrown
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-14  0:27 UTC (permalink / raw)
  To: NeilBrown, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Thu, 2017-12-14 at 09:04 +1100, NeilBrown wrote:
> On Wed, Dec 13 2017, Jeff Layton wrote:
> 
> > +/*
> > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > + * appear different to observers if there was a change to the inode's data or
> > + * metadata since it was last queried.
> > + *
> > + * It should be considered an opaque value by observers. If it remains the same
> > + * since it was last checked, then nothing has changed in the inode. If it's
> > + * different then something has changed. Observers cannot infer anything about
> > + * the nature or magnitude of the changes from the value, only that the inode
> > + * has changed in some fashion.
> 
> I agree that it "should be" considered opaque, but I have a suspicion
> that NFSv4 doesn't consider it opaque.
> There is something about write delegations and the server performing a
> GETATTR callback to the delegated client so that it can answer GETATTR
> from other clients without recalling the delegation.
> 
> Specifically section "10.4.3 Handling of CB_GETATTR" of RFC5661 contains
> the text:
> 
>    o  The client will create a value greater than c that will be used
>       for communicating that modified data is held at the client.  Let
>       this value be represented by d.
> 
> "c" here is a 'change' attribute.
> 
> Then:
> 
>    While the change attribute is opaque to the client in the sense that
>    it has no idea what units of time, if any, the server is counting
>    change with, it is not opaque in that the client has to treat it as
>    an unsigned integer, and the server has to be able to see the results
>    of the client's changes to that integer.  Therefore, the server MUST
>    encode the change attribute in network order when sending it to the
>    client.  The client MUST decode it from network order to its native
>    order when receiving it, and the client MUST encode it in network
>    order when sending it to the server.  For this reason, change is
>    defined as an unsigned integer rather than an opaque array of bytes.
> 
> This all suggests that nfsd needs to be certain that "incrementing" the
> change id will produce a new changeid, which has not been used before,
> and also suggests that nfsd needs to be able to control the changeid
> stored after writes that result from a delegation being returned.
> 
> I'd just like to say that this is one of the most annoying dumb features
> of NFSv4, because it is trivial to fix and I suggested a fix before
> NFSv4.0 was finalized.  Grumble.
> 
> Otherwise the patch set looks good.  I haven't gone over the code
> closely, but the approach is spot-on.

I don't think we have to do that. There are really only two states with
a client holding a write delegation, as far as the server is concerned.
Either:

a) the client has done no writes to the file, in which case it'll return
the same i_version that the server has when issued a CB_GETATTR

...or...

b) it has written to the file while holding the delegation, in which
case it'll return a different change attribute in its CB_GETATTR reply

The simplest thing for the server to do is to just increment the change
attribute _once_ when it gets back a CB_GETATTR with a different change
attr than it has.

That's sufficient to tell another client issuing a GETATTR that the
file has changed without needing to recall the delegation.

Prior to the delegation being returned, the client will send at least
one WRITE RPC, and that's enough to ensure that the next stat will
see the thing increase.
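
As a rough illustration of the "increment once" idea, here is a small
userspace sketch. The struct, its fields and the deleg_demo_* functions
are all hypothetical and do not correspond to existing nfsd code. It
also assumes the server remembers the change value it handed out when
it granted the delegation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-delegation bookkeeping on the server side. */
struct deleg_demo {
	uint64_t change;		/* change attribute the server reports */
	uint64_t granted_change;	/* value given out with the delegation */
	bool	 bumped;		/* already incremented for this delegation? */
};

/* Record the change attribute at the time the write delegation is granted. */
void deleg_demo_grant(struct deleg_demo *d, uint64_t change)
{
	d->change = change;
	d->granted_change = change;
	d->bumped = false;
}

/*
 * Another client sent GETATTR and the delegation holder answered
 * CB_GETATTR with cb_change. If that differs from what the holder was
 * given, it has written to the file: bump the server's value, but only
 * once. Returns the change attribute to report to the other client.
 */
uint64_t deleg_demo_getattr(struct deleg_demo *d, uint64_t cb_change)
{
	if (cb_change != d->granted_change && !d->bumped) {
		d->change++;
		d->bumped = true;
	}
	return d->change;
}
```

With a delegation granted at change value 10, a CB_GETATTR reply of 10
leaves the reported value at 10. The first reply with any other value
moves it to 11, and later replies keep it at 11 until the client's
WRITEs arrive and advance the counter through the normal path.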

-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 14/19] xfs: convert to new i_version API
  2017-12-14  0:10       ` Jeff Layton
@ 2017-12-14  2:17         ` Dave Chinner
  2017-12-14 11:16           ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2017-12-14  2:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, Dec 13, 2017 at 07:10:22PM -0500, Jeff Layton wrote:
> On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> > So now I've looked at the last patch .....
> > 
> > On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > > From: Jeff Layton <jlayton@redhat.com>
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > > >  fs/xfs/xfs_icache.c           | 4 ++--
> > > >  fs/xfs/xfs_inode.c            | 2 +-
> > > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > index 6b7989038d75..6b47de201391 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > > >  	to->di_flags	= be16_to_cpu(from->di_flags);
> > > >  
> > > >  	if (to->di_version == 3) {
> > > > -		inode->i_version = be64_to_cpu(from->di_changecount);
> > > > +		inode_set_iversion_queried(inode,
> > > > +					   be64_to_cpu(from->di_changecount));
> > > 
> > > So we use the "kernel managed" (really not sure what that means)
> > > set function here to read it off disk, but...
> > 
> > This stores the value from disk in the incore inode as "val << 1",
> > then sets the lowest bit to indicate that it has been "queried"
> > so that it will be incremented on the first modification.
> > 
> > Why do we initialise values read from disk as "queried"? This means
> > the i_version will change once every time it's brought into memory
> > and modified, regardless of whether anyone is looking at it. What
> > purpose does this serve?
> > 
> 
> I don't think we want to store the QUERIED bit.
> 
> It's always possible that we crash at an inopportune time and a query
> happened vs. this value before this thing hit the backing store.
> 
> If we always set the queried bit when we load it from disk, then we know
> that that scenario is harmless, at the negligible expense of having to
> bump it on the first write.

Reasonable. Needs documentation.

> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 801274126648..be6d87980dd5 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > > >  	ip->i_d.di_flags = 0;
> > > >  
> > > >  	if (ip->i_d.di_version == 3) {
> > > > -		inode->i_version = 1;
> > > > +		inode_set_iversion(inode, 1);
> > > 
> > > > But here you are using the "filesystem managed" mode to set the
> > > new value. Why? How is this any different from reading the value
> > > off disk and setting it?
> > 
> > Still don't understand why this is different to reading the inode
> > from disk....
> 
> This is allocating a brand new, never before seen inode. There's no
> way this i_version could have ever been seen, so there's no need to flag
> it as queried.

More documentation. People are going to need to know this stuff to
be able to implement/maintain this stuff in working order - it's no
longer a simple, obvious "just increment the counter on
modification" variable and that has potential ramifications for
filesystems that store this on disk.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 14/19] xfs: convert to new i_version API
  2017-12-14  2:17         ` Dave Chinner
@ 2017-12-14 11:16           ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-14 11:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Thu, 2017-12-14 at 13:17 +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 07:10:22PM -0500, Jeff Layton wrote:
> > On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> > > So now I've looked at the last patch .....
> > > 
> > > On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > > > From: Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > > > >  fs/xfs/xfs_icache.c           | 4 ++--
> > > > >  fs/xfs/xfs_inode.c            | 2 +-
> > > > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > > > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > > > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > index 6b7989038d75..6b47de201391 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > > > >  	to->di_flags	= be16_to_cpu(from->di_flags);
> > > > >  
> > > > >  	if (to->di_version == 3) {
> > > > > -		inode->i_version = be64_to_cpu(from->di_changecount);
> > > > > +		inode_set_iversion_queried(inode,
> > > > > +					   be64_to_cpu(from->di_changecount));
> > > > 
> > > > So we use the "kernel managed" (really not sure what that means)
> > > > set function here to read it off disk, but...
> > > 
> > > This stores the value from disk in the incore inode as "val << 1",
> > > then sets the lowest bit to indicate that it has been "queried"
> > > so that it will be incremented on the first modification.
> > > 
> > > Why do we initialise values read from disk as "queried"? This means
> > > the i_version will change once every time it's brought into memory
> > > and modified, regardless of whether anyone is looking at it. What
> > > purpose does this serve?
> > > 
> > 
> > I don't think we want to store the QUERIED bit.
> > 
> > It's always possible that we crash at an inopportune time and a query
> > happened vs. this value before this thing hit the backing store.
> > 
> > If we always set the queried bit when we load it from disk, then we know
> > that that scenario is harmless, at the negligible expense of having to
> > bump it on the first write.
> 
> Reasonable. Needs documentation.
> 

Will do.

FWIW, there's another reason to do it this way too: backward
compatibility. If we don't try to store the queried bit then we should
be able to go back and forth between legacy kernels and the ones with
the new i_version handling without any trouble. The older kernels will
just bump the count more frequently.
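
Since the semantics keep coming up, here is a small userspace model of the
scheme as described in this thread: the counter is kept as "val << 1" and
the low bit records whether anyone has queried it. This is only a sketch
using C11 atomics. The iv_* names and the IV_* constants are made up for
the example and are not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model constants, not the kernel's macros. */
#define IV_QUERIED	1ULL	/* low bit: value has been looked at */
#define IV_INCREMENT	2ULL	/* the counter lives in the upper 63 bits */

struct iv {
	_Atomic uint64_t raw;
};

/* Brand new inode: nobody can have seen the value, leave the flag clear. */
void iv_set(struct iv *v, uint64_t val)
{
	atomic_store(&v->raw, val << 1);
}

/* Value loaded from disk: assume it may have been seen before a crash. */
void iv_set_queried(struct iv *v, uint64_t val)
{
	atomic_store(&v->raw, (val << 1) | IV_QUERIED);
}

/* Hand out the counter and flag it, so the next change has to bump it. */
uint64_t iv_query(struct iv *v)
{
	return atomic_fetch_or(&v->raw, IV_QUERIED) >> 1;
}

/* What would be written to disk: the counter only, never the flag. */
uint64_t iv_to_disk(struct iv *v)
{
	return atomic_load(&v->raw) >> 1;
}

/*
 * Called on modification. Bump only if forced, or if the value has been
 * queried since the last bump. Returns true if the counter changed.
 */
bool iv_maybe_inc(struct iv *v, bool force)
{
	uint64_t cur = atomic_load(&v->raw);

	for (;;) {
		if (!force && !(cur & IV_QUERIED))
			return false;
		uint64_t next = (cur & ~IV_QUERIED) + IV_INCREMENT;
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&v->raw, &cur, next))
			return true;
	}
}
```

In this model a value installed with iv_set_queried() is bumped by the
first iv_maybe_inc(), while one installed with iv_set() is left alone
until something calls iv_query(). Because iv_to_disk() drops the flag,
the stored value stays a plain counter that an older kernel can keep
incrementing.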

> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 801274126648..be6d87980dd5 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > > > >  	ip->i_d.di_flags = 0;
> > > > >  
> > > > >  	if (ip->i_d.di_version == 3) {
> > > > > -		inode->i_version = 1;
> > > > > +		inode_set_iversion(inode, 1);
> > > > 
> > > > > But here you are using the "filesystem managed" mode to set the
> > > > new value. Why? How is this any different from reading the value
> > > > off disk and setting it?
> > > 
> > > Still don't understand why this is different to reading the inode
> > > from disk....
> > 
> > > This is allocating a brand new, never before seen inode. There's no
> > way this i_version could have ever been seen, so there's no need to flag
> > it as queried.
> 
> More documentation. People are going to need to know this stuff to
> be able to implement/maintain this stuff in working order - it's no
> longer a simple, obvious "just increment the counter on
> modification" variable and that has potential ramifications for
> filesystems that store this on disk.
> 
> 

Definitely. I'm finding that documenting this has been the hardest part.

Thanks for the review so far!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-14  0:02       ` Jeff Layton
@ 2017-12-14 14:14         ` Jeff Layton
  2017-12-14 15:14           ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-14 14:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: J. Bruce Fields, linux-fsdevel, linux-kernel, hch, neilb,
	amir73il, jack, viro

On Wed, 2017-12-13 at 19:02 -0500, Jeff Layton wrote:
> On Thu, 2017-12-14 at 10:03 +1100, Dave Chinner wrote:
> > On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> > > On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > > > This is great, thanks.
> > > > 
> > > > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > > > With this, we reduce inode metadata updates across all 3 filesystems
> > > > > down to roughly the frequency of the timestamp granularity, particularly
> > > > > when it's not being queried (the vastly common case).
> > > > > 
> > > > > The pessimal workload here is 1 byte writes, and it helps that
> > > > > significantly. Of course, that's not what we'd consider a real-world
> > > > > workload.
> > > > > 
> > > > > A tiobench-example.fio workload also shows some modest performance
> > > > > gains, and I've gotten mails from the kernel test robot that show some
> > > > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > > > the vm-scalability testsuite to be specific), with an earlier version of
> > > > > this set.
> > > > > 
> > > > > With larger writes, the gains with this patchset mostly vaporize,
> > > > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > > > 
> > > > > I'm happy to run other workloads if anyone can suggest them.
> > > > > 
> > > > > At this point, the patchset works and does what it's expected to do in
> > > > > my own testing. It seems like it's at least a modest performance win
> > > > > across all 3 major disk-based filesystems. It may also encourage others
> > > > > to implement i_version as well since it reduces the cost.
> > > > 
> > > > Do you have an idea what the remaining cost is?
> > > > 
> > > > Especially in the ext4 case, are you still able to measure any
> > > > difference in performance between the cases where i_version is turned on
> > > > and off, after these patches?
> > > 
> > > Attached is a fio jobfile + the output from 3 different runs using it
> > > with ext4. This one is using 4k writes. There was no querying of
> > > i_version during the runs.  I've done several runs with each and these
> > > are pretty representative of the results:
> > > 
> > > old = 4.15-rc3, i_version enabled
> > > ivers = 4.15-rc3 + these patches, i_version enabled
> > > noivers = 4.15-rc3 + these patches, i_version disabled
> > > 
> > > To snip out the run status lines:
> > > 
> > > old:
> > > WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> > > 
> > > ivers:
> > > WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> > > 
> > > noivers:
> > > WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> > > 
> > > So, I see some performance degradation with -o iversion compared to not
> > > having it enabled (maybe due to the extra atomic fetches?), but this set
> > > erases most of the difference.
> > 
> > So what is the performance difference when something is actively
> > querying the i_version counter as fast as it can (e.g. file being
> > constantly stat()d via NFS whilst being modified)? How does the
> > performance compare to the old code in that case?
> > 
> 
> I haven't benchmarked that with the latest set, but I did with the set
> that I posted around a year ago. Basically I just ran a similar test to
> this, and had another shell open doing statx(..., STATX_VERSION, ...);
> the thing in a tight loop.
> 
> I did see some performance hit vs. the case where no one is viewing it,
> but it was still significantly faster than the unpatched version that
> was incrementing the counter every time.
> 
> That was on a different test rig, and the patchset has some small
> differences now. I'll see if I can get some hard numbers with such a
> testcase soon.

I reran the test on xfs, with another shell running test-statx to query
the i_version value in a tight loop:

    WRITE: bw=110MiB/s (115MB/s), 13.6MiB/s-14.0MiB/s (14.2MB/s-14.7MB/s), io=64.5GiB (69.3GB), run=600001-600001msec

...contrast that with the run where I was not doing any queries:

   WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec

...vs the unpatched kernel:

   WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec

There is some clear performance impact when you are running frequent
queries of the i_version.

My gut feeling is that you could probably make the new code perform
worse than the old if you were to _really_ hammer the inode with queries
for the i_version (probably from many threads in parallel) while doing a
lot of small writes to it.

That'd be a pretty unusual workload though.
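
To put the three xfs bandwidth figures above side by side, here is a
throwaway helper that works out the relative differences. The numbers are
the bw= values from the runs above; the function names are just for this
example.

```c
#include <stdio.h>

/* Percentage change of val relative to base. */
double pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}

/* Print the relative differences between the three xfs runs (MiB/s). */
void print_xfs_summary(void)
{
	const double unpatched = 86.7;	/* unpatched kernel */
	const double queried = 110.0;	/* patched, statx in a tight loop */
	const double unqueried = 129.0;	/* patched, no queries */

	printf("queried vs unqueried:   %+.1f%%\n",
	       pct_change(unqueried, queried));
	printf("queried vs unpatched:   %+.1f%%\n",
	       pct_change(unpatched, queried));
	printf("unqueried vs unpatched: %+.1f%%\n",
	       pct_change(unpatched, unqueried));
}
```

By that arithmetic the constantly-queried run lands roughly 15% below the
unqueried one, and still roughly 27% above the unpatched kernel.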
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-14 14:14         ` Jeff Layton
@ 2017-12-14 15:14           ` J. Bruce Fields
  2017-12-15 15:15             ` Jeff Layton
  0 siblings, 1 reply; 46+ messages in thread
From: J. Bruce Fields @ 2017-12-14 15:14 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, linux-fsdevel, linux-kernel, hch, neilb, amir73il,
	jack, viro

On Thu, Dec 14, 2017 at 09:14:47AM -0500, Jeff Layton wrote:
> On Wed, 2017-12-13 at 19:02 -0500, Jeff Layton wrote:
> > On Thu, 2017-12-14 at 10:03 +1100, Dave Chinner wrote:
> > > On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> > > > On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > > > > This is great, thanks.
> > > > > 
> > > > > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > > > > With this, we reduce inode metadata updates across all 3 filesystems
> > > > > > down to roughly the frequency of the timestamp granularity, particularly
> > > > > > when it's not being queried (the vastly common case).
> > > > > > 
> > > > > > The pessimal workload here is 1 byte writes, and it helps that
> > > > > > significantly. Of course, that's not what we'd consider a real-world
> > > > > > workload.
> > > > > > 
> > > > > > A tiobench-example.fio workload also shows some modest performance
> > > > > > gains, and I've gotten mails from the kernel test robot that show some
> > > > > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > > > > the vm-scalability testsuite to be specific), with an earlier version of
> > > > > > this set.
> > > > > > 
> > > > > > With larger writes, the gains with this patchset mostly vaporize,
> > > > > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > > > > 
> > > > > > I'm happy to run other workloads if anyone can suggest them.
> > > > > > 
> > > > > > At this point, the patchset works and does what it's expected to do in
> > > > > > my own testing. It seems like it's at least a modest performance win
> > > > > > across all 3 major disk-based filesystems. It may also encourage others
> > > > > > to implement i_version as well since it reduces the cost.
> > > > > 
> > > > > Do you have an idea what the remaining cost is?
> > > > > 
> > > > > Especially in the ext4 case, are you still able to measure any
> > > > > difference in performance between the cases where i_version is turned on
> > > > > and off, after these patches?
> > > > 
> > > > Attached is a fio jobfile + the output from 3 different runs using it
> > > > with ext4. This one is using 4k writes. There was no querying of
> > > > i_version during the runs.  I've done several runs with each and these
> > > > are pretty representative of the results:
> > > > 
> > > > old = 4.15-rc3, i_version enabled
> > > > ivers = 4.15-rc3 + these patches, i_version enabled
> > > > noivers = 4.15-rc3 + these patches, i_version disabled
> > > > 
> > > > To snip out the run status lines:
> > > > 
> > > > old:
> > > > WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> > > > 
> > > > ivers:
> > > > WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> > > > 
> > > > noivers:
> > > > WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> > > > 
> > > > So, I see some performance degradation with -o iversion compared to not
> > > > having it enabled (maybe due to the extra atomic fetches?), but this set
> > > > erases most of the difference.
> > > 
> > > So what is the performance difference when something is actively
> > > querying the i_version counter as fast as it can (e.g. file being
> > > constantly stat()d via NFS whilst being modified)? How does the
> > > performance compare to the old code in that case?
> > > 
> > 
> > I haven't benchmarked that with the latest set, but I did with the set
> > that I posted around a year ago. Basically I just ran a similar test to
> > this, and had another shell open doing statx(..., STATX_VERSION, ...);
> > the thing in a tight loop.
> > 
> > I did see some performance hit vs. the case where no one is viewing it,
> > but it was still significantly faster than the unpatched version that
> > was incrementing the counter every time.
> > 
> > That was on a different test rig, and the patchset has some small
> > differences now. I'll see if I can get some hard numbers with such a
> > testcase soon.
> 
> I reran the test on xfs, with another shell running test-statx to query
> the i_version value in a tight loop:
> 
>     WRITE: bw=110MiB/s (115MB/s), 13.6MiB/s-14.0MiB/s (14.2MB/s-14.7MB/s), io=64.5GiB (69.3GB), run=600001-600001msec
> 
> ...contrast that with the run where I was not doing any queries:
> 
>    WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec
> 
> ...vs the unpatched kernel:
> 
>    WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec
> 
> There is some clear performance impact when you are running frequent
> queries of the i_version.
> 
> My gut feeling is that you could probably make the new code perform
> worse than the old if you were to _really_ hammer the inode with queries
> for the i_version (probably from many threads in parallel) while doing a
> lot of small writes to it.
> 
> That'd be a pretty unusual workload though.

It may be pretty common for NFS itself: if I'm understanding the client
code right (mainly nfs4_write_need_cache_consistency()), our client will
request the change attribute in every WRITE that isn't a pNFS write, an
O_DIRECT write, or associated with a delegation.

The goal of this series isn't to improve NFS performance, it's to save
non-NFS users from paying a performance penalty for something that NFS
requires for correctness.  Probably this series doesn't make much
difference in the NFS write case, and that's fine.  Still, might be
worth confirming that a workload with lots of small NFS writes is mostly
unaffected.

--b.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 09/19] ext4: convert to new i_version API
  2017-12-13 14:20 ` [PATCH 09/19] ext4: " Jeff Layton
@ 2017-12-14 21:52   ` Theodore Ts'o
  0 siblings, 0 replies; 46+ messages in thread
From: Theodore Ts'o @ 2017-12-14 21:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed, Dec 13, 2017 at 09:20:07AM -0500, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Acked-by: Theodore Ts'o <tytso@mit.edu>

						- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 16/19] fs: only set S_VERSION when updating times if necessary
  2017-12-13 14:20 ` [PATCH 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
@ 2017-12-15 12:59   ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-15 12:59 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro, David Howells

On Wed, 2017-12-13 at 09:20 -0500, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> We only really need to update i_version if someone has queried for it
> since we last incremented it. By doing that, we can avoid having to
> update the inode if the times haven't changed.
> 
> If the times have changed, then we go ahead and forcibly increment the
> counter, under the assumption that we'll be going to the storage
> anyway, and the increment itself is relatively cheap.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/inode.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 03102d6ef044..7f4215f4309c 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1634,17 +1634,18 @@ static int relatime_need_update(const struct path *path, struct inode *inode,
>  int generic_update_time(struct inode *inode, struct timespec *time, int flags)
>  {
>  	int iflags = I_DIRTY_TIME;
> +	bool dirty = flags & ~S_VERSION;
>  
>  	if (flags & S_ATIME)
>  		inode->i_atime = *time;
> -	if (flags & S_VERSION)
> -		inode_inc_iversion(inode);
>  	if (flags & S_CTIME)
>  		inode->i_ctime = *time;
>  	if (flags & S_MTIME)
>  		inode->i_mtime = *time;
> +	if (flags & S_VERSION)
> +		dirty |= inode_maybe_inc_iversion(inode, dirty);
>  
> -	if (!(inode->i_sb->s_flags & SB_LAZYTIME) || (flags & S_VERSION))
> +	if (dirty || !(inode->i_sb->s_flags & SB_LAZYTIME))

David Howells pointed out that the logic for setting dirty in here will
break SB_LAZYTIME handling. The patch is now fixed in my tree.

>  		iflags |= I_DIRTY_SYNC;
>  	__mark_inode_dirty(inode, iflags);
>  	return 0;
> @@ -1863,7 +1864,7 @@ int file_update_time(struct file *file)
>  	if (!timespec_equal(&inode->i_ctime, &now))
>  		sync_it |= S_CTIME;
>  
> -	if (IS_I_VERSION(inode))
> +	if (IS_I_VERSION(inode) && inode_iversion_need_inc(inode))
>  		sync_it |= S_VERSION;
>  
>  	if (!sync_it)
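
For anyone following along, here is one reading of the SB_LAZYTIME problem
in the generic_update_time() hunk above, as a userspace model. The M_*
flag values are placeholders rather than the kernel's S_* definitions, and
"version_bumped" stands in for the return value of
inode_maybe_inc_iversion(). The second function is only one possible
correction, written as an assumption for illustration. It is not
necessarily what ended up in the tree.

```c
#include <stdbool.h>

/* Placeholder flag values for this model only. */
enum { M_ATIME = 1, M_MTIME = 2, M_CTIME = 4, M_VERSION = 8 };

/*
 * Logic as posted: any timestamp flag makes "dirty" true up front, so
 * the inode is marked for sync even when lazytime is set.
 */
bool needs_sync_posted(int flags, bool lazytime, bool version_bumped)
{
	bool dirty = flags & ~M_VERSION;

	if (flags & M_VERSION)
		dirty |= version_bumped;
	return dirty || !lazytime;
}

/*
 * One possible correction (assumption): a timestamp update only forces
 * a sync when lazytime is off, and an actual i_version bump always does.
 */
bool needs_sync_alt(int flags, bool lazytime, bool version_bumped)
{
	bool dirty = false;

	if (flags & M_VERSION)
		dirty = version_bumped;
	if ((flags & (M_ATIME | M_MTIME | M_CTIME)) && !lazytime)
		dirty = true;
	return dirty;
}
```

With lazytime set and an atime-only update, needs_sync_posted() returns
true, which defeats the point of lazytime; needs_sync_alt() returns false
for that case.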

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed
  2017-12-13 14:20 ` [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
@ 2017-12-15 13:03   ` Jeff Layton
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-15 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro,
	Josef Bacik, Chris Mason, Omar Sandoval, David Howells

On Wed, 2017-12-13 at 09:20 -0500, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> At this point, we know that "now" and the file times may differ, and we
> suspect that the i_version has been flagged to be bumped. Attempt to
> bump the i_version, and only mark the inode dirty if that actually
> occurred or if one of the times was updated.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
>  fs/btrfs/inode.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index ac25389b39de..2e50a977fb06 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6106,19 +6106,20 @@ static int btrfs_update_time(struct inode *inode, struct timespec *now,
>  			     int flags)
>  {
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	bool dirty = flags & ~S_VERSION;
>  
>  	if (btrfs_root_readonly(root))
>  		return -EROFS;
>  
>  	if (flags & S_VERSION)
> -		inode_inc_iversion(inode);
> +		dirty |= inode_maybe_inc_iversion(inode, dirty);
>  	if (flags & S_CTIME)
>  		inode->i_ctime = *now;
>  	if (flags & S_MTIME)
>  		inode->i_mtime = *now;
>  	if (flags & S_ATIME)
>  		inode->i_atime = *now;
> -	return btrfs_dirty_inode(inode);
> +	return dirty ? btrfs_dirty_inode(inode) : 0;
>  }
>  
>  /*

I had some bogus handling for SB_LAZYTIME in the corresponding patch for
generic_update_time. I've fixed in my tree, but now I'm wondering...

Should btrfs not be dirtying the inode here if SB_LAZYTIME is set and
the only update is to the atime?
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-14 15:14           ` J. Bruce Fields
@ 2017-12-15 15:15             ` Jeff Layton
  2017-12-15 15:26               ` J. Bruce Fields
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff Layton @ 2017-12-15 15:15 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Dave Chinner, linux-fsdevel, linux-kernel, hch, neilb, amir73il,
	jack, viro

On Thu, 2017-12-14 at 10:14 -0500, J. Bruce Fields wrote:
> On Thu, Dec 14, 2017 at 09:14:47AM -0500, Jeff Layton wrote:
> > On Wed, 2017-12-13 at 19:02 -0500, Jeff Layton wrote:
> > > On Thu, 2017-12-14 at 10:03 +1100, Dave Chinner wrote:
> > > > On Wed, Dec 13, 2017 at 03:14:28PM -0500, Jeff Layton wrote:
> > > > > On Wed, 2017-12-13 at 10:05 -0500, J. Bruce Fields wrote:
> > > > > > This is great, thanks.
> > > > > > 
> > > > > > On Wed, Dec 13, 2017 at 09:19:58AM -0500, Jeff Layton wrote:
> > > > > > > With this, we reduce inode metadata updates across all 3 filesystems
> > > > > > > down to roughly the frequency of the timestamp granularity, particularly
> > > > > > > when it's not being queried (the vastly common case).
> > > > > > > 
> > > > > > > The pessimal workload here is 1 byte writes, and it helps that
> > > > > > > significantly. Of course, that's not what we'd consider a real-world
> > > > > > > workload.
> > > > > > > 
> > > > > > > A tiobench-example.fio workload also shows some modest performance
> > > > > > > gains, and I've gotten mails from the kernel test robot that show some
> > > > > > > significant performance gains on some microbenchmarks (case-msync-mt in
> > > > > > > the vm-scalability testsuite to be specific), with an earlier version of
> > > > > > > this set.
> > > > > > > 
> > > > > > > With larger writes, the gains with this patchset mostly vaporize,
> > > > > > > but it does not seem to cause performance to regress anywhere, AFAICT.
> > > > > > > 
> > > > > > > I'm happy to run other workloads if anyone can suggest them.
> > > > > > > 
> > > > > > > At this point, the patchset works and does what it's expected to do in
> > > > > > > my own testing. It seems like it's at least a modest performance win
> > > > > > > across all 3 major disk-based filesystems. It may also encourage others
> > > > > > > to implement i_version as well since it reduces the cost.
> > > > > > 
> > > > > > Do you have an idea what the remaining cost is?
> > > > > > 
> > > > > > Especially in the ext4 case, are you still able to measure any
> > > > > > difference in performance between the cases where i_version is turned on
> > > > > > and off, after these patches?
> > > > > 
> > > > > Attached is a fio jobfile + the output from 3 different runs using it
> > > > > with ext4. This one is using 4k writes. There was no querying of
> > > > > i_version during the runs.  I've done several runs with each and these
> > > > > are pretty representative of the results:
> > > > > 
> > > > > old = 4.15-rc3, i_version enabled
> > > > > ivers = 4.15-rc3 + these patches, i_version enabled
> > > > > noivers = 4.15-rc3 + these patches, i_version disabled
> > > > > 
> > > > > To snip out the run status lines:
> > > > > 
> > > > > old:
> > > > > WRITE: bw=85.6MiB/s (89.8MB/s), 9994KiB/s-11.1MiB/s (10.2MB/s-11.7MB/s), io=50.2GiB (53.8GB), run=600001-600001msec
> > > > > 
> > > > > ivers:
> > > > > WRITE: bw=110MiB/s (115MB/s), 13.5MiB/s-14.2MiB/s (14.1MB/s-14.9MB/s), io=64.3GiB (69.0GB), run=600001-600001msec
> > > > > 
> > > > > noivers:
> > > > > WRITE: bw=117MiB/s (123MB/s), 14.2MiB/s-15.2MiB/s (14.9MB/s-15.9MB/s), io=68.7GiB (73.8GB), run=600001-600001msec
> > > > > 
> > > > > So, I see some performance degradation with -o iversion compared to not
> > > > > having it enabled (maybe due to the extra atomic fetches?), but this set
> > > > > erases most of the difference.
> > > > 
> > > > So what is the performance difference when something is actively
> > > > querying the i_version counter as fast as it can (e.g. file being
> > > > constantly stat()d via NFS whilst being modified)? How does the
> > > > performance compare to the old code in that case?
> > > > 
> > > 
> > > I haven't benchmarked that with the latest set, but I did with the set
> > > that I posted around a year ago. Basically I just ran a similar test to
> > > this, and had another shell open doing statx(..., STATX_VERSION, ...);
> > > the thing in a tight loop.
> > > 
> > > I did see some performance hit vs. the case where no one is viewing it,
> > > but it was still significantly faster than the unpatched version that
> > > was incrementing the counter every time.
> > > 
> > > That was on a different test rig, and the patchset has some small
> > > differences now. I'll see if I can get some hard numbers with such a
> > > testcase soon.
> > 
> > I reran the test on xfs, with another shell running test-statx to query
> > the i_version value in a tight loop:
> > 
> >     WRITE: bw=110MiB/s (115MB/s), 13.6MiB/s-14.0MiB/s (14.2MB/s-14.7MB/s), io=64.5GiB (69.3GB), run=600001-600001msec
> > 
> > ...contrast that with the run where I was not doing any queries:
> > 
> >    WRITE: bw=129MiB/s (136MB/s), 15.8MiB/s-16.6MiB/s (16.6MB/s-17.4MB/s), io=75.9GiB (81.5GB), run=600001-600001msec
> > 
> > ...vs the unpatched kernel:
> > 
> >    WRITE: bw=86.7MiB/s (90.0MB/s), 9689KiB/s-11.7MiB/s (9921kB/s-12.2MB/s), io=50.8GiB (54.6GB), run=600001-600002msec
> > 
> > > There is some clear performance impact when you are running frequent
> > queries of the i_version.
> > 
> > My gut feeling is that you could probably make the new code perform
> > worse than the old if you were to _really_ hammer the inode with queries
> > for the i_version (probably from many threads in parallel) while doing a
> > lot of small writes to it.
> > 
> > That'd be a pretty unusual workload though.
> 
> It may be pretty common for NFS itself: if I'm understanding the client
> code right (mainly nfs4_write_need_cache_consistency()), our client will
> request the change attribute in every WRITE that isn't a pNFS write, an
> O_DIRECT write, or associated with a delegation.
> 
> The goal of this series isn't to improve NFS performance, it's to save
> non-NFS users from paying a performance penalty for something that NFS
> requires for correctness.  Probably this series doesn't make much
> difference in the NFS write case, and that's fine.  Still, might be
> worth confirming that a workload with lots of small NFS writes is mostly
> unaffected.

Just for yuks, I ran such a test this morning. I used the same fio
jobfile, but changed it to have:

    direct=1

...to eliminate client-side caching effects:

old:
   WRITE: bw=1146KiB/s (1174kB/s), 143KiB/s-143KiB/s (147kB/s-147kB/s), io=672MiB (705MB), run=600075-600435msec

patched:
   WRITE: bw=1253KiB/s (1283kB/s), 156KiB/s-157KiB/s (160kB/s-161kB/s), io=735MiB (770MB), run=600089-600414msec

So still seems to be a bit faster -- maybe because we're using an
atomic64_t instead of a spinlock now? Probably I should profile that at
some point...
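
For readers following the atomic64_t remark, here is a minimal userspace
sketch of the "only bump if someone looked" idea, written with C11 atomics.
It assumes the low bit of the stored value is a "queried" flag and the
counter lives in the upper 63 bits, as the later messages in this thread
suggest. The names fake_inode, QUERIED, INCREMENT, maybe_inc_iversion and
query_iversion are hypothetical and are not the kernel's API.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUERIED   ((uint64_t)1)  /* assumed: low bit means "value was sampled" */
#define INCREMENT ((uint64_t)2)  /* assumed: counter occupies the upper 63 bits */

/* Hypothetical stand-in for struct inode. */
struct fake_inode {
	_Atomic uint64_t i_version;
};

/*
 * Bump the counter only when forced, or when the value has been queried
 * since the last bump. Returns true if the counter was bumped.
 */
static bool maybe_inc_iversion(struct fake_inode *inode, bool force)
{
	uint64_t cur = atomic_load(&inode->i_version);

	for (;;) {
		if (!force && !(cur & QUERIED))
			return false;	/* nobody looked, nothing to do */

		/* Clear the flag and advance the counter in one step. */
		uint64_t next = (cur & ~QUERIED) + INCREMENT;

		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&inode->i_version, &cur, next))
			return true;
	}
}

/* Sample the counter and mark it as queried, without taking a lock. */
static uint64_t query_iversion(struct fake_inode *inode)
{
	uint64_t cur = atomic_load(&inode->i_version);

	while (!(cur & QUERIED)) {
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | QUERIED))
			break;
	}
	return cur >> 1;	/* strip the flag bit */
}
```

In this model, a write path that calls maybe_inc_iversion(inode, false)
does a single atomic load and returns early when no one has sampled the
value, which is one plausible reason the patched kernel could come out
ahead of a spinlock-protected increment.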
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 00/19] fs: rework and optimize i_version handling in filesystems
  2017-12-15 15:15             ` Jeff Layton
@ 2017-12-15 15:26               ` J. Bruce Fields
  0 siblings, 0 replies; 46+ messages in thread
From: J. Bruce Fields @ 2017-12-15 15:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, linux-fsdevel, linux-kernel, hch, neilb, amir73il,
	jack, viro

On Fri, Dec 15, 2017 at 10:15:29AM -0500, Jeff Layton wrote:
> On Thu, 2017-12-14 at 10:14 -0500, J. Bruce Fields wrote:
> > On Thu, Dec 14, 2017 at 09:14:47AM -0500, Jeff Layton wrote:
> > > > There is some clear performance impact when you are running frequent
> > > > queries of the i_version.
> > > 
> > > My gut feeling is that you could probably make the new code perform
> > > worse than the old if you were to _really_ hammer the inode with queries
> > > for the i_version (probably from many threads in parallel) while doing a
> > > lot of small writes to it.
> > > 
> > > That'd be a pretty unusual workload though.
> > 
> > It may be pretty common for NFS itself: if I'm understanding the client
> > code right (mainly nfs4_write_need_cache_consistency()), our client will
> > request the change attribute in every WRITE that isn't a pNFS write, an
> > O_DIRECT write, or associated with a delegation.
> > 
> > The goal of this series isn't to improve NFS performance, it's to save
> > non-NFS users from paying a performance penalty for something that NFS
> > requires for correctness.  Probably this series doesn't make much
> > difference in the NFS write case, and that's fine.  Still, might be
> > worth confirming that a workload with lots of small NFS writes is mostly
> > unaffected.
> 
> Just for yuks, I ran such a test this morning. I used the same fio
> jobfile, but changed it to have:
> 
>     direct=1
> 
> ...to eliminate client-side caching effects:
> 
> old:
>    WRITE: bw=1146KiB/s (1174kB/s), 143KiB/s-143KiB/s (147kB/s-147kB/s), io=672MiB (705MB), run=600075-600435msec
> 
> patched:
>    WRITE: bw=1253KiB/s (1283kB/s), 156KiB/s-157KiB/s (160kB/s-161kB/s), io=735MiB (770MB), run=600089-600414msec
> 
> So still seems to be a bit faster -- maybe because we're using an
> atomic64_t instead of a spinlock now? Probably I should profile that at
> some point...

That would be interesting!

But it looks like this and all your results are about what we expect,
and all the evidence so far is that the series is doing what we need it
to.

--b.


* Re: [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-14  0:27     ` Jeff Layton
@ 2017-12-16  4:17       ` NeilBrown
  2017-12-17 13:01         ` Jeff Layton
  2017-12-18 14:03         ` Jeff Layton
  0 siblings, 2 replies; 46+ messages in thread
From: NeilBrown @ 2017-12-16  4:17 UTC (permalink / raw)
  To: Jeff Layton, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro


On Wed, Dec 13 2017, Jeff Layton wrote:

> On Thu, 2017-12-14 at 09:04 +1100, NeilBrown wrote:
>> On Wed, Dec 13 2017, Jeff Layton wrote:
>> 
>> > +/*
>> > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>> > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
>> > + * appear different to observers if there was a change to the inode's data or
>> > + * metadata since it was last queried.
>> > + *
>> > + * It should be considered an opaque value by observers. If it remains the same
>> > + * since it was last checked, then nothing has changed in the inode. If it's
>> > + * different then something has changed. Observers cannot infer anything about
>> > + * the nature or magnitude of the changes from the value, only that the inode
>> > + * has changed in some fashion.
>> 
>> I agree that it "should be" considered opaque, but I have a suspicion
>> that NFSv4 doesn't consider it opaque.
>> There is something about write delegations and the server performing a
>> GETATTR callback to the delegated client so that it can answer GETATTR
>> from other clients without recalling the delegation.
>> 
>> Specifically section "10.4.3 Handling of CB_GETATTR" of RFC5661 contains
>> the text:
>> 
>>    o  The client will create a value greater than c that will be used
>>       for communicating that modified data is held at the client.  Let
>>       this value be represented by d.
>> 
>> "c" here is a 'change' attribute.
>> 
>> Then:
>> 
>>    While the change attribute is opaque to the client in the sense that
>>    it has no idea what units of time, if any, the server is counting
>>    change with, it is not opaque in that the client has to treat it as
>>    an unsigned integer, and the server has to be able to see the results
>>    of the client's changes to that integer.  Therefore, the server MUST
>>    encode the change attribute in network order when sending it to the
>>    client.  The client MUST decode it from network order to its native
>>    order when receiving it, and the client MUST encode it in network
>>    order when sending it to the server.  For this reason, change is
>>    defined as an unsigned integer rather than an opaque array of bytes.
>> 
>> This all suggests that nfsd needs to be certain that "incrementing" the
>> change id will produce a new changeid, which has not been used before,
>> and also suggests that nfsd needs to be able to control the changeid
>> stored after writes that result from a delegation being returned.
>> 
>> I'd just like to say that this is one of the most annoying dumb features
>> of NFSv4, because it is trivial to fix and I suggested a fix before
>> NFSv4.0 was finalized.  Grumble.
>> 
>> Otherwise the patch set looks good.  I haven't gone over the code
>> closely, but the approach is spot-on.
>
> I don't think we have to do that. There are really only two states with
> a client holding a write delegation, as far as the server is concerned.
> Either:
>
> a) the client has done no writes to the file, in which case it'll return
> the same i_version that the server has when issued a CB_GETATTR
>
> ...or...
>
> b) it has written to the file while holding the delegation, in which
> case it'll return a different CB_GETATTR to the server
>
> The simplest thing for the server to do is to just increment the change
> attribute _once_ when it gets back a CB_GETATTR with a different change
> attr than it has.
>
> That's sufficient to tell another client issuing a GETATTR that the
> file has changed without needing to recall the delegation.
>
> Prior to the delegation being returned, the client will send at least
> one WRITE RPC, and that's enough to ensure that the next stat will
> see the thing increase.

"increment" and "increase" are not words that mean anything for an
"opaque value".
NFSd is, presumably, an "observer" of i_version (as it isn't the
filesystem that controls it), so your text says it must treat i_version as
opaque.  That means it cannot detect an "increase" (only a change), and
it certainly cannot "increment" the value.

I think you need to allow observers to treat i_version as a 64 bit number
which will monotonically increase.  Any change to the file will result
in an increment of at least '1'.

Thanks,
NeilBrown



* Re: [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-16  4:17       ` NeilBrown
@ 2017-12-17 13:01         ` Jeff Layton
  2017-12-18 14:03         ` Jeff Layton
  1 sibling, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-17 13:01 UTC (permalink / raw)
  To: NeilBrown, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Sat, 2017-12-16 at 15:17 +1100, NeilBrown wrote:
> On Wed, Dec 13 2017, Jeff Layton wrote:
> 
> > On Thu, 2017-12-14 at 09:04 +1100, NeilBrown wrote:
> > > On Wed, Dec 13 2017, Jeff Layton wrote:
> > > 
> > > > +/*
> > > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > + * appear different to observers if there was a change to the inode's data or
> > > > + * metadata since it was last queried.
> > > > + *
> > > > + * It should be considered an opaque value by observers. If it remains the same
> > > > + * since it was last checked, then nothing has changed in the inode. If it's
> > > > + * different then something has changed. Observers cannot infer anything about
> > > > + * the nature or magnitude of the changes from the value, only that the inode
> > > > + * has changed in some fashion.
> > > 
> > > I agree that it "should be" considered opaque, but I have a suspicion
> > > that NFSv4 doesn't consider it opaque.
> > > There is something about write delegations and the server performing a
> > > GETATTR callback to the delegated client so that it can answer GETATTR
> > > from other clients without recalling the delegation.
> > > 
> > > Specifically section "10.4.3 Handling of CB_GETATTR" of RFC5661 contains
> > > the text:
> > > 
> > >    o  The client will create a value greater than c that will be used
> > >       for communicating that modified data is held at the client.  Let
> > >       this value be represented by d.
> > > 
> > > "c" here is a 'change' attribute.
> > > 
> > > Then:
> > > 
> > >    While the change attribute is opaque to the client in the sense that
> > >    it has no idea what units of time, if any, the server is counting
> > >    change with, it is not opaque in that the client has to treat it as
> > >    an unsigned integer, and the server has to be able to see the results
> > >    of the client's changes to that integer.  Therefore, the server MUST
> > >    encode the change attribute in network order when sending it to the
> > >    client.  The client MUST decode it from network order to its native
> > >    order when receiving it, and the client MUST encode it in network
> > >    order when sending it to the server.  For this reason, change is
> > >    defined as an unsigned integer rather than an opaque array of bytes.
> > > 
> > > This all suggests that nfsd needs to be certain that "incrementing" the
> > > change id will produce a new changeid, which has not been used before,
> > > and also suggests that nfsd needs to be able to control the changeid
> > > stored after writes that result from a delegation being returned.
> > > 
> > > I'd just like to say that this is one of the most annoying dumb features
> > > of NFSv4, because it is trivial to fix and I suggested a fix before
> > > NFSv4.0 was finalized.  Grumble.
> > > 
> > > Otherwise the patch set looks good.  I haven't gone over the code
> > > closely, but the approach is spot-on.
> > 
> > I don't think we have to do that. There are really only two states with
> > a client holding a write delegation, as far as the server is concerned.
> > Either:
> > 
> > a) the client has done no writes to the file, in which case it'll return
> > the same i_version that the server has when issued a CB_GETATTR
> > 
> > ...or...
> > 
> > b) it has written to the file while holding the delegation, in which
> > case it'll return a different CB_GETATTR to the server
> > 
> > The simplest thing for the server to do is to just increment the change
> > attribute _once_ when it gets back a CB_GETATTR with a different change
> > attr than it has.
> > 
> > That's sufficient to tell another client issuing a GETATTR that the
> > file has changed without needing to recall the delegation.
> > 
> > Prior to the delegation being returned, the client will send at least
> > one WRITE RPC, and that's enough to ensure that the next stat will
> > see the thing increase.
> 
> "increment" and "increase" are not words that mean anything for an
> "opaque value".
> NFSd is, presumably, an "observer" of i_version (as it isn't the
> filesystem that controls it), so your text says it must treat i_version as
> opaque.  That means it cannot detect an "increase" (only a change), and
> it certainly cannot "increment" the value.
> 
> I think you need to allow observers to treat i_version as a 64 bit number
> which will monotonically increase.  Any change to the file will result
> in an increment of at least '1'.

Here, I was mostly speaking about NFS in general. I think the above
method is the cheapest/best way to ensure that you don't end up with
reused change attributes, within the confines of the protocol.

With this implementation, it's probably safe enough to make a guarantee
that the value will increase wrt a previously sampled value if there was
a change. I'll have to think about how best to document that.
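
The "increment once" approach quoted above can be modeled in a few lines.
This is only a sketch of the description in this thread, not knfsd code:
the struct deleg_model type, its fields, and getattr_under_delegation()
are hypothetical names, and the model assumes the server remembers the
change attribute it handed out when the delegation was granted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file state while a write delegation is outstanding. */
struct deleg_model {
	uint64_t change;	/* server's current change attribute */
	uint64_t granted;	/* value the client saw at delegation time */
	bool bumped;		/* already advanced once for this delegation? */
};

/*
 * Called when another client issues GETATTR and the server has received
 * cb_change back from a CB_GETATTR to the delegation holder. Returns the
 * change attribute to report to the other client.
 */
static uint64_t getattr_under_delegation(struct deleg_model *f,
					 uint64_t cb_change)
{
	/* Case (b): the holder has modified data; advance exactly once. */
	if (cb_change != f->granted && !f->bumped) {
		f->change++;
		f->bumped = true;
	}
	/* Case (a): nothing written, report the unchanged value. */
	return f->change;
}
```

Note that the server only tests cb_change for inequality here, while it
still needs to "increment" its own value, which is the part that requires
treating the attribute as a number rather than an opaque token.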

Thanks,
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 08/19] ext2: convert to new i_version API
  2017-12-13 14:20 ` [PATCH 08/19] ext2: convert " Jeff Layton
@ 2017-12-18 12:47   ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2017-12-18 12:47 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed 13-12-17 09:20:06, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext2/dir.c   | 8 ++++----
>  fs/ext2/super.c | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
> index 987647986f47..c99f14fec3f3 100644
> --- a/fs/ext2/dir.c
> +++ b/fs/ext2/dir.c
> @@ -92,7 +92,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
>  	struct inode *dir = mapping->host;
>  	int err = 0;
>  
> -	dir->i_version++;
> +	inode_inc_iversion(dir);
>  	block_write_end(NULL, mapping, pos, len, len, page, NULL);
>  
>  	if (pos+len > dir->i_size) {
> @@ -293,7 +293,7 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
>  	unsigned long npages = dir_pages(inode);
>  	unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
>  	unsigned char *types = NULL;
> -	int need_revalidate = file->f_version != inode->i_version;
> +	bool need_revalidate = inode_cmp_iversion(inode, file->f_version);
>  
>  	if (pos > inode->i_size - EXT2_DIR_REC_LEN(1))
>  		return 0;
> @@ -319,8 +319,8 @@ ext2_readdir(struct file *file, struct dir_context *ctx)
>  				offset = ext2_validate_entry(kaddr, offset, chunk_mask);
>  				ctx->pos = (n<<PAGE_SHIFT) + offset;
>  			}
> -			file->f_version = inode->i_version;
> -			need_revalidate = 0;
> +			file->f_version = inode_query_iversion(inode);
> +			need_revalidate = false;
>  		}
>  		de = (ext2_dirent *)(kaddr+offset);
>  		limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 7646818ab266..dd7c3c81d918 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -184,7 +184,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
>  	if (!ei)
>  		return NULL;
>  	ei->i_block_alloc_info = NULL;
> -	ei->vfs_inode.i_version = 1;
> +	inode_set_iversion(&ei->vfs_inode, 1);
>  #ifdef CONFIG_QUOTA
>  	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
>  #endif
> @@ -1569,7 +1569,7 @@ static ssize_t ext2_quota_write(struct super_block *sb, int type,
>  		return err;
>  	if (inode->i_size < off+len-towrite)
>  		i_size_write(inode, off+len-towrite);
> -	inode->i_version++;
> +	inode_inc_iversion(inode);
>  	inode->i_mtime = inode->i_ctime = current_time(inode);
>  	mark_inode_dirty(inode);
>  	return len - towrite;
> -- 
> 2.14.3
> 
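
The readdir hunks above follow a common pattern: the open file caches the
directory version it last validated its position against, and revalidates
only when the directory has changed since. A minimal userspace sketch of
that pattern follows. The types dir_model and file_model, and the helpers
revalidate_pos() and readdir_prepare(), are hypothetical stand-ins, and
rounding the position down to an entry boundary is a simplification of
what ext2_validate_entry() does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the directory inode and its i_version. */
struct dir_model {
	uint64_t version;
};

/* Hypothetical stand-in for struct file: cached version plus position. */
struct file_model {
	uint64_t f_version;
	uint64_t pos;
};

/* Simplified revalidation: snap the position back to an entry boundary. */
static uint64_t revalidate_pos(uint64_t pos, uint64_t entry_size)
{
	return pos - pos % entry_size;
}

/*
 * Run before reading entries. Returns true if the directory changed since
 * the position was last validated, in which case the position is fixed up
 * and the cached version is resampled.
 */
static bool readdir_prepare(struct file_model *file,
			    const struct dir_model *dir,
			    uint64_t entry_size)
{
	bool need_revalidate = file->f_version != dir->version;

	if (need_revalidate) {
		file->pos = revalidate_pos(file->pos, entry_size);
		file->f_version = dir->version;
	}
	return need_revalidate;
}
```

This use only needs an equality test, so it works whether or not observers
are allowed to treat the version as an ordered number.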
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 12/19] ocfs2: convert to new i_version API
  2017-12-13 14:20 ` [PATCH 12/19] ocfs2: " Jeff Layton
@ 2017-12-18 12:49   ` Jan Kara
  0 siblings, 0 replies; 46+ messages in thread
From: Jan Kara @ 2017-12-18 12:49 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-fsdevel, linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Wed 13-12-17 09:20:10, Jeff Layton wrote:
> From: Jeff Layton <jlayton@redhat.com>
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ocfs2/dir.c          | 14 +++++++-------
>  fs/ocfs2/inode.c        |  2 +-
>  fs/ocfs2/namei.c        |  2 +-
>  fs/ocfs2/quota_global.c |  2 +-
>  4 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
> index febe6312ceff..fe2c430a7809 100644
> --- a/fs/ocfs2/dir.c
> +++ b/fs/ocfs2/dir.c
> @@ -1174,7 +1174,7 @@ static int __ocfs2_delete_entry(handle_t *handle, struct inode *dir,
>  				le16_add_cpu(&pde->rec_len,
>  						le16_to_cpu(de->rec_len));
>  			de->inode = 0;
> -			dir->i_version++;
> +			inode_inc_iversion(dir);
>  			ocfs2_journal_dirty(handle, bh);
>  			goto bail;
>  		}
> @@ -1729,7 +1729,7 @@ int __ocfs2_add_entry(handle_t *handle,
>  			if (ocfs2_dir_indexed(dir))
>  				ocfs2_recalc_free_list(dir, handle, lookup);
>  
> -			dir->i_version++;
> +			inode_inc_iversion(dir);
>  			ocfs2_journal_dirty(handle, insert_bh);
>  			retval = 0;
>  			goto bail;
> @@ -1775,7 +1775,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
>  		 * readdir(2), then we might be pointing to an invalid
>  		 * dirent right now.  Scan from the start of the block
>  		 * to make sure. */
> -		if (*f_version != inode->i_version) {
> +		if (inode_cmp_iversion(inode, *f_version)) {
>  			for (i = 0; i < i_size_read(inode) && i < offset; ) {
>  				de = (struct ocfs2_dir_entry *)
>  					(data->id_data + i);
> @@ -1791,7 +1791,7 @@ static int ocfs2_dir_foreach_blk_id(struct inode *inode,
>  				i += le16_to_cpu(de->rec_len);
>  			}
>  			ctx->pos = offset = i;
> -			*f_version = inode->i_version;
> +			*f_version = inode_query_iversion(inode);
>  		}
>  
>  		de = (struct ocfs2_dir_entry *) (data->id_data + ctx->pos);
> @@ -1869,7 +1869,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
>  		 * readdir(2), then we might be pointing to an invalid
>  		 * dirent right now.  Scan from the start of the block
>  		 * to make sure. */
> -		if (*f_version != inode->i_version) {
> +		if (inode_cmp_iversion(inode, *f_version)) {
>  			for (i = 0; i < sb->s_blocksize && i < offset; ) {
>  				de = (struct ocfs2_dir_entry *) (bh->b_data + i);
>  				/* It's too expensive to do a full
> @@ -1886,7 +1886,7 @@ static int ocfs2_dir_foreach_blk_el(struct inode *inode,
>  			offset = i;
>  			ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1))
>  				| offset;
> -			*f_version = inode->i_version;
> +			*f_version = inode_query_iversion(inode);
>  		}
>  
>  		while (ctx->pos < i_size_read(inode)
> @@ -1940,7 +1940,7 @@ static int ocfs2_dir_foreach_blk(struct inode *inode, u64 *f_version,
>   */
>  int ocfs2_dir_foreach(struct inode *inode, struct dir_context *ctx)
>  {
> -	u64 version = inode->i_version;
> +	u64 version = inode_query_iversion(inode);
>  	ocfs2_dir_foreach_blk(inode, &version, ctx, true);
>  	return 0;
>  }
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index 1a1e0078ab38..71ce64665a18 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -302,7 +302,7 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
>  	OCFS2_I(inode)->ip_attr = le32_to_cpu(fe->i_attr);
>  	OCFS2_I(inode)->ip_dyn_features = le16_to_cpu(fe->i_dyn_features);
>  
> -	inode->i_version = 1;
> +	inode_set_iversion(inode, 1);
>  	inode->i_generation = le32_to_cpu(fe->i_generation);
>  	inode->i_rdev = huge_decode_dev(le64_to_cpu(fe->id1.dev1.i_rdev));
>  	inode->i_mode = le16_to_cpu(fe->i_mode);
> diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
> index 3b0a10d9b36f..c045826b716a 100644
> --- a/fs/ocfs2/namei.c
> +++ b/fs/ocfs2/namei.c
> @@ -1520,7 +1520,7 @@ static int ocfs2_rename(struct inode *old_dir,
>  			mlog_errno(status);
>  			goto bail;
>  		}
> -		new_dir->i_version++;
> +		inode_inc_iversion(new_dir);
>  
>  		if (S_ISDIR(new_inode->i_mode))
>  			ocfs2_set_links_count(newfe, 0);
> diff --git a/fs/ocfs2/quota_global.c b/fs/ocfs2/quota_global.c
> index b39d14cbfa34..e7595a63da43 100644
> --- a/fs/ocfs2/quota_global.c
> +++ b/fs/ocfs2/quota_global.c
> @@ -289,7 +289,7 @@ ssize_t ocfs2_quota_write(struct super_block *sb, int type,
>  		mlog_errno(err);
>  		return err;
>  	}
> -	gqinode->i_version++;
> +	inode_inc_iversion(gqinode);
>  	ocfs2_mark_inode_dirty(handle, gqinode, oinfo->dqi_gqi_bh);
>  	return len;
>  }
> -- 
> 2.14.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 01/19] fs: new API for handling inode->i_version
  2017-12-16  4:17       ` NeilBrown
  2017-12-17 13:01         ` Jeff Layton
@ 2017-12-18 14:03         ` Jeff Layton
  1 sibling, 0 replies; 46+ messages in thread
From: Jeff Layton @ 2017-12-18 14:03 UTC (permalink / raw)
  To: NeilBrown, linux-fsdevel
  Cc: linux-kernel, hch, neilb, bfields, amir73il, jack, viro

On Sat, 2017-12-16 at 15:17 +1100, NeilBrown wrote:
> On Wed, Dec 13 2017, Jeff Layton wrote:
> 
> > On Thu, 2017-12-14 at 09:04 +1100, NeilBrown wrote:
> > > On Wed, Dec 13 2017, Jeff Layton wrote:
> > > 
> > > > +/*
> > > > + * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > > + * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > + * appear different to observers if there was a change to the inode's data or
> > > > + * metadata since it was last queried.
> > > > + *
> > > > + * It should be considered an opaque value by observers. If it remains the same
> > > > + * since it was last checked, then nothing has changed in the inode. If it's
> > > > + * different then something has changed. Observers cannot infer anything about
> > > > + * the nature or magnitude of the changes from the value, only that the inode
> > > > + * has changed in some fashion.
> > > 
> > > I agree that it "should be" considered opaque, but I have a suspicion
> > > that NFSv4 doesn't consider it opaque.
> > > There is something about write delegations and the server performing a
> > > GETATTR callback to the delegated client so that it can answer GETATTR
> > > from other clients without recalling the delegation.
> > > 
> > > Specifically section "10.4.3 Handling of CB_GETATTR" of RFC5661 contains
> > > the text:
> > > 
> > >    o  The client will create a value greater than c that will be used
> > >       for communicating that modified data is held at the client.  Let
> > >       this value be represented by d.
> > > 
> > > "c" here is a 'change' attribute.
> > > 
> > > Then:
> > > 
> > >    While the change attribute is opaque to the client in the sense that
> > >    it has no idea what units of time, if any, the server is counting
> > >    change with, it is not opaque in that the client has to treat it as
> > >    an unsigned integer, and the server has to be able to see the results
> > >    of the client's changes to that integer.  Therefore, the server MUST
> > >    encode the change attribute in network order when sending it to the
> > >    client.  The client MUST decode it from network order to its native
> > >    order when receiving it, and the client MUST encode it in network
> > >    order when sending it to the server.  For this reason, change is
> > >    defined as an unsigned integer rather than an opaque array of bytes.
> > > 
> > > This all suggests that nfsd needs to be certain that "incrementing" the
> > > change id will produce a new changeid, which has not been used before,
> > > and also suggests that nfsd needs to be able to control the changeid
> > > stored after writes that result from a delegation being returned.
> > > 
> > > I'd just like to say that this is one of the most annoying dumb features
> > > of NFSv4, because it is trivial to fix and I suggested a fix before
> > > NFSv4.0 was finalized.  Grumble.
> > > 
> > > Otherwise the patch set looks good.  I haven't gone over the code
> > > closely, but the approach is spot-on.
> > 
> > I don't think we have to do that. There are really only two states with
> > a client holding a write delegation, as far as the server is concerned.
> > Either:
> > 
> > a) the client has done no writes to the file, in which case it'll return
> > the same i_version that the server has when issued a CB_GETATTR
> > 
> > ...or...
> > 
> > b) it has written to the file while holding the delegation, in which
> > case it'll return a different CB_GETATTR to the server
> > 
> > The simplest thing for the server to do is to just increment the change
> > attribute _once_ when it gets back a CB_GETATTR with a different change
> > attr than it has.
> > 
> > That's sufficient to tell another client issuing a GETATTR that the
> > file has changed without needing to recall the delegation.
> > 
> > Prior to the delegation being returned, the client will send at least
> > one WRITE RPC, and that's enough to ensure that the next stat will
> > see the thing increase.
> 
> "increment" and "increase" are not words that mean anything for an
> "opaque value".
> NFSd is, presumably, an "observer" of i_version (as it isn't the
> filesystem that controls it), so your text says it must treat i_version as
> opaque.  That means it cannot detect an "increase" (only a change), and
> it certainly cannot "increment" the value.
> 
> I think you need to allow observers to treat i_version as a 64 bit number
> which will monotonically increase.  Any change to the file will result
> in an increment of at least '1'.
> 

One thing here...

I'm currently doing this:

static inline s64
inode_cmp_iversion(const struct inode *inode, const u64 old)
{
        return (s64)inode_peek_iversion(inode) - (s64)old;
}

But I don't think that'll handle wraparound correctly if we want to 
allow people to determine whether it's older or newer. I'll probably
change this to shift the old value left by one bit, and mask off the low
bit of the current inode->i_version.

That'll always give you a difference of 2 or more if they're different,
but it should return the correct sign, which is really all we care about
anyway.

Granted, we're unlikely to wrap around with a 64 bit value, but it's
hard to know for sure what values might be stored on disk on existing
filesystems.
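
To make the wraparound concern concrete, here is a small userspace sketch
comparing the two approaches. It assumes the raw stored value keeps the
counter in the upper 63 bits with a flag in the low bit, and that "old" is
a previously sampled 63-bit counter. The names QUERIED_FLAG, cmp_naive and
cmp_shifted are hypothetical. The final cast from an out-of-range unsigned
value to int64_t relies on two's complement conversion, which is what
mainstream compilers do.

```c
#include <stdint.h>

#define QUERIED_FLAG ((uint64_t)1)	/* assumed: low bit of the raw value */

/*
 * Naive comparison of the 63-bit counters. Once the counter wraps from
 * its maximum back to zero, the newer value looks older.
 */
static int64_t cmp_naive(uint64_t raw, uint64_t old)
{
	return (int64_t)(raw >> 1) - (int64_t)old;
}

/*
 * Shift the old sample left by one bit and mask the flag off the raw
 * value, so both operands use the full 64-bit range. The unsigned
 * subtraction wraps modulo 2^64, and the signed view of the result has
 * the right sign as long as the two values are less than 2^62 bumps
 * apart. Differing values always differ by at least 2.
 */
static int64_t cmp_shifted(uint64_t raw, uint64_t old)
{
	return (int64_t)((raw & ~QUERIED_FLAG) - (old << 1));
}
```

For example, with old at the largest 63-bit counter value and a raw value
of 0 (the counter has just wrapped), cmp_naive() returns a large negative
number while cmp_shifted() returns 2.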
-- 
Jeff Layton <jlayton@kernel.org>



Thread overview: 46+ messages
2017-12-13 14:19 [PATCH 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
2017-12-13 14:19 ` [PATCH 01/19] fs: new API for handling inode->i_version Jeff Layton
2017-12-13 22:04   ` NeilBrown
2017-12-14  0:27     ` Jeff Layton
2017-12-16  4:17       ` NeilBrown
2017-12-17 13:01         ` Jeff Layton
2017-12-18 14:03         ` Jeff Layton
2017-12-13 14:20 ` [PATCH 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
2017-12-13 21:52   ` Jeff Layton
2017-12-13 22:07     ` NeilBrown
2017-12-13 14:20 ` [PATCH 03/19] fat: convert to new i_version API Jeff Layton
2017-12-13 14:20 ` [PATCH 04/19] affs: " Jeff Layton
2017-12-13 14:20 ` [PATCH 05/19] afs: " Jeff Layton
2017-12-13 14:20 ` [PATCH 06/19] btrfs: " Jeff Layton
2017-12-13 14:20 ` [PATCH 07/19] exofs: switch " Jeff Layton
2017-12-13 14:20 ` [PATCH 08/19] ext2: convert " Jeff Layton
2017-12-18 12:47   ` Jan Kara
2017-12-13 14:20 ` [PATCH 09/19] ext4: " Jeff Layton
2017-12-14 21:52   ` Theodore Ts'o
2017-12-13 14:20 ` [PATCH 10/19] nfs: " Jeff Layton
2017-12-13 14:20 ` [PATCH 11/19] nfsd: " Jeff Layton
2017-12-13 14:20 ` [PATCH 12/19] ocfs2: " Jeff Layton
2017-12-18 12:49   ` Jan Kara
2017-12-13 14:20 ` [PATCH 13/19] ufs: use " Jeff Layton
2017-12-13 14:20 ` [PATCH 14/19] xfs: convert to " Jeff Layton
2017-12-13 22:48   ` Dave Chinner
2017-12-13 23:25     ` Dave Chinner
2017-12-14  0:10       ` Jeff Layton
2017-12-14  2:17         ` Dave Chinner
2017-12-14 11:16           ` Jeff Layton
2017-12-13 14:20 ` [PATCH 15/19] IMA: switch IMA over " Jeff Layton
2017-12-13 14:20 ` [PATCH 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
2017-12-15 12:59   ` Jeff Layton
2017-12-13 14:20 ` [PATCH 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing Jeff Layton
2017-12-13 14:20 ` [PATCH 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
2017-12-15 13:03   ` Jeff Layton
2017-12-13 14:20 ` [PATCH 19/19] fs: handle inode->i_version more efficiently Jeff Layton
2017-12-13 15:05 ` [PATCH 00/19] fs: rework and optimize i_version handling in filesystems J. Bruce Fields
2017-12-13 20:14   ` Jeff Layton
2017-12-13 22:10     ` Jeff Layton
2017-12-13 23:03     ` Dave Chinner
2017-12-14  0:02       ` Jeff Layton
2017-12-14 14:14         ` Jeff Layton
2017-12-14 15:14           ` J. Bruce Fields
2017-12-15 15:15             ` Jeff Layton
2017-12-15 15:26               ` J. Bruce Fields
