linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes
@ 2022-08-12 12:37 Lukas Czerner
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Lukas Czerner @ 2022-08-12 12:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, jlayton, jack, linux-fsdevel, ebiggers, david

ea_inodes are using i_version for storing part of the reference count so
we really need to leave it alone.

The problem can be reproduced by xfstest ext4/026 when iversion is
enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
inodes in ext4_mark_iloc_dirty().

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
---
v2, v3: no change

 fs/ext4/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..2a220be34caa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 	}
 	ext4_fc_track_inode(handle, inode);
 
-	if (IS_I_VERSION(inode))
+	/*
+	 * ea_inodes are using i_version for storing reference count, don't
+	 * mess with it
+	 */
+	if (IS_I_VERSION(inode) &&
+	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
 		inode_inc_iversion(inode);
 
 	/* the do_update_inode consumes one bh->b_count */
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 12:37 [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Lukas Czerner
@ 2022-08-12 12:37 ` Lukas Czerner
  2022-08-12 18:01   ` Eric Biggers
                     ` (2 more replies)
  2022-08-12 12:37 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 20+ messages in thread
From: Lukas Czerner @ 2022-08-12 12:37 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jlayton, jack, linux-fsdevel, ebiggers, david, Christoph Hellwig

Currently the I_DIRTY_TIME will never get set if the inode already has
I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
true, however ext4 will only update the on-disk inode in
->dirty_inode(), not on actual writeback. As a result if the inode
already has I_DIRTY_INODE state by the time we get to
__mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
into on-disk inode and will not get updated until the next I_DIRTY_INODE
update, which might never come if we crash or get a power failure.

The problem can be reproduced on ext4 by running xfstest generic/622
with -o iversion mount option.

Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
I_DIRTY_INODE. Also make sure that the case is properly handled in
writeback_single_inode() as well. Additionally changes in
xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.

Thanks Jan Kara for suggestions on how to make this work properly.

Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
v2: Reworked according to suggestions from Jan
v3: Update documentation, add comments, change flag to flags in
    xfs_fs_dirty_inode()

 Documentation/filesystems/vfs.rst |  2 ++
 fs/fs-writeback.c                 | 34 ++++++++++++++++++++-----------
 fs/xfs/xfs_super.c                | 12 +++++++++--
 include/linux/fs.h                |  9 ++++----
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6cd6953e175b..5d72b6ba4e63 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -274,6 +274,8 @@ or bottom half).
 	This is specifically for the inode itself being marked dirty,
 	not its data.  If the update needs to be persisted by fdatasync(),
 	then I_DIRTY_DATASYNC will be set in the flags argument.
+	If the inode has dirty timestamp and lazytime is enabled
+	I_DIRTY_TIME will be set in the flags.
 
 ``write_inode``
 	this method is called when the VFS needs to write an inode to
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 05221366a16d..638dbf143727 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1718,9 +1718,14 @@ static int writeback_single_inode(struct inode *inode,
 	 */
 	if (!(inode->i_state & I_DIRTY_ALL))
 		inode_cgwb_move_to_attached(inode, wb);
-	else if (!(inode->i_state & I_SYNC_QUEUED) &&
-		 (inode->i_state & I_DIRTY))
-		redirty_tail_locked(inode, wb);
+	else if (!(inode->i_state & I_SYNC_QUEUED)) {
+		if ((inode->i_state & I_DIRTY))
+			redirty_tail_locked(inode, wb);
+		else if (inode->i_state & I_DIRTY_TIME) {
+			inode->dirtied_when = jiffies;
+			inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
+		}
+	}
 
 	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
@@ -2369,6 +2374,17 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	trace_writeback_mark_inode_dirty(inode, flags);
 
 	if (flags & I_DIRTY_INODE) {
+
+		/* Inode timestamp update will piggback on this dirtying */
+		if (inode->i_state & I_DIRTY_TIME) {
+			spin_lock(&inode->i_lock);
+			if (inode->i_state & I_DIRTY_TIME) {
+				inode->i_state &= ~I_DIRTY_TIME;
+				flags |= I_DIRTY_TIME;
+			}
+			spin_unlock(&inode->i_lock);
+		}
+
 		/*
 		 * Notify the filesystem about the inode being dirtied, so that
 		 * (if needed) it can update on-disk fields and journal the
@@ -2378,7 +2394,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		trace_writeback_dirty_inode_start(inode, flags);
 		if (sb->s_op->dirty_inode)
-			sb->s_op->dirty_inode(inode, flags & I_DIRTY_INODE);
+			sb->s_op->dirty_inode(inode,
+				flags & (I_DIRTY_INODE | I_DIRTY_TIME));
 		trace_writeback_dirty_inode(inode, flags);
 
 		/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
@@ -2399,21 +2416,15 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 	 */
 	smp_mb();
 
-	if (((inode->i_state & flags) == flags) ||
-	    (dirtytime && (inode->i_state & I_DIRTY_INODE)))
+	if ((inode->i_state & flags) == flags)
 		return;
 
 	spin_lock(&inode->i_lock);
-	if (dirtytime && (inode->i_state & I_DIRTY_INODE))
-		goto out_unlock_inode;
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
 		inode_attach_wb(inode, NULL);
 
-		/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
-		if (flags & I_DIRTY_INODE)
-			inode->i_state &= ~I_DIRTY_TIME;
 		inode->i_state |= flags;
 
 		/*
@@ -2486,7 +2497,6 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 out_unlock:
 	if (wb)
 		spin_unlock(&wb->list_lock);
-out_unlock_inode:
 	spin_unlock(&inode->i_lock);
 }
 EXPORT_SYMBOL(__mark_inode_dirty);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9ac59814bbb6..13efc77a1e79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -653,7 +653,7 @@ xfs_fs_destroy_inode(
 static void
 xfs_fs_dirty_inode(
 	struct inode			*inode,
-	int				flag)
+	int				flags)
 {
 	struct xfs_inode		*ip = XFS_I(inode);
 	struct xfs_mount		*mp = ip->i_mount;
@@ -661,7 +661,15 @@ xfs_fs_dirty_inode(
 
 	if (!(inode->i_sb->s_flags & SB_LAZYTIME))
 		return;
-	if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
+
+	/*
+	 * Only do the timestamp update if the inode is dirty (I_DIRTY_SYNC)
+	 * and has dirty timestamp (I_DIRTY_TIME). I_DIRTY_TIME can be either
+	 * already set in i_state, or passed in flags possibly together with
+	 * I_DIRTY_SYNC.
+	 */
+	if ((flags & ~I_DIRTY_TIME) != I_DIRTY_SYNC ||
+	    !((inode->i_state | flags) & I_DIRTY_TIME))
 		return;
 
 	if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5113f65c786f..e220d26d771a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2376,13 +2376,14 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
  *			don't have to write inode on fdatasync() when only
  *			e.g. the timestamps have changed.
  * I_DIRTY_PAGES	Inode has dirty pages.  Inode itself may be clean.
- * I_DIRTY_TIME		The inode itself only has dirty timestamps, and the
+ * I_DIRTY_TIME		The inode itself has dirty timestamps, and the
  *			lazytime mount option is enabled.  We keep track of this
  *			separately from I_DIRTY_SYNC in order to implement
  *			lazytime.  This gets cleared if I_DIRTY_INODE
- *			(I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set.  I.e.
- *			either I_DIRTY_TIME *or* I_DIRTY_INODE can be set in
- *			i_state, but not both.  I_DIRTY_PAGES may still be set.
+ *			(I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But
+ *			I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already
+ *			in place because writeback might already be in progress
+ *			and we don't want to lose the time update
  * I_NEW		Serves as both a mutex and completion notification.
  *			New inodes set I_NEW.  If two processes both create
  *			the same inode, one of them will release its inode and
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-12 12:37 [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Lukas Czerner
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
@ 2022-08-12 12:37 ` Lukas Czerner
  2022-08-12 13:05   ` Christian Brauner
  2022-08-16 11:48   ` Jan Kara
  2022-08-12 13:04 ` [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Christian Brauner
  2022-08-12 18:42 ` Jeff Layton
  3 siblings, 2 replies; 20+ messages in thread
From: Lukas Czerner @ 2022-08-12 12:37 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jlayton, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong

From: Jeff Layton <jlayton@kernel.org>

The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).

Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, remove the handling of
Opt_i_version as well, since we're almost to 5.20 anyway.

Ideally, we'd couple this change with a way to disable the i_version
counter (just in case), but the way the iversion mount option was
implemented makes that difficult to do. We'd need to add a new mount
option altogether or do something with tune2fs. That's probably best
left to later patches if it turns out to be needed.

[ Removed leftover bits of i_version from ext4_apply_options() since it
now can't ever be set in ctx->mask_s_flags -- lczerner ]

Cc: Dave Chinner <david@fromorbit.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v3: Removed leftover bits of i_version from ext4_apply_options
v4: no change

 fs/ext4/inode.c |  5 ++---
 fs/ext4/super.c | 21 ++++-----------------
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a220be34caa..c77d40f05763 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 			return -EINVAL;
 		}
 
-		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
+		if (attr->ia_size != inode->i_size)
 			inode_inc_iversion(inode);
 
 		if (shrink) {
@@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 	 * ea_inodes are using i_version for storing reference count, don't
 	 * mess with it
 	 */
-	if (IS_I_VERSION(inode) &&
-	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
+	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
 		inode_inc_iversion(inode);
 
 	/* the do_update_inode consumes one bh->b_count */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..1c953f6d400e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1585,7 +1585,7 @@ enum {
 	Opt_inlinecrypt,
 	Opt_usrjquota, Opt_grpjquota, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
 	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
@@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	fsparam_flag	("barrier",		Opt_barrier),
 	fsparam_u32	("barrier",		Opt_barrier),
 	fsparam_flag	("nobarrier",		Opt_nobarrier),
-	fsparam_flag	("i_version",		Opt_i_version),
 	fsparam_flag	("dax",			Opt_dax),
 	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
 	fsparam_u32	("stripe",		Opt_stripe),
@@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case Opt_abort:
 		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
 		return 0;
-	case Opt_i_version:
-		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
-		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
-		ctx_set_flags(ctx, SB_I_VERSION);
-		return 0;
 	case Opt_inlinecrypt:
 #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
 		ctx_set_flags(ctx, SB_INLINECRYPT);
@@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
 	sb->s_flags &= ~ctx->mask_s_flags;
 	sb->s_flags |= ctx->vals_s_flags;
 
-	/*
-	 * i_version differs from common mount option iversion so we have
-	 * to let vfs know that it was set, otherwise it would get cleared
-	 * on remount
-	 */
-	if (ctx->mask_s_flags & SB_I_VERSION)
-		fc->sb_flags |= SB_I_VERSION;
-
 #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
 	APPLY(s_commit_interval);
 	APPLY(s_stripe);
@@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
 	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
 		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
-	if (sb->s_flags & SB_I_VERSION)
-		SEQ_OPTS_PUTS("i_version");
 	if (nodefs || sbi->s_stripe)
 		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
 	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
@@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
 
+	/* i_version is always enabled now */
+	sb->s_flags |= SB_I_VERSION;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
 	    (ext4_has_compat_features(sb) ||
 	     ext4_has_ro_compat_features(sb) ||
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes
  2022-08-12 12:37 [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Lukas Czerner
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
  2022-08-12 12:37 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
@ 2022-08-12 13:04 ` Christian Brauner
  2022-08-12 18:42 ` Jeff Layton
  3 siblings, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2022-08-12 13:04 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, ebiggers, david

On Fri, Aug 12, 2022 at 02:37:25PM +0200, Lukas Czerner wrote:
> ea_inodes are using i_version for storing part of the reference count so
> we really need to leave it alone.
> 
> The problem can be reproduced by xfstest ext4/026 when iversion is
> enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> inodes in ext4_mark_iloc_dirty().
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> ---

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-12 12:37 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
@ 2022-08-12 13:05   ` Christian Brauner
  2022-08-16 11:48   ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Christian Brauner @ 2022-08-12 13:05 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong

On Fri, Aug 12, 2022 at 02:37:27PM +0200, Lukas Czerner wrote:
> From: Jeff Layton <jlayton@kernel.org>
> 
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).
> 
> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
> 
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well, since we're almost to 5.20 anyway.
> 
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
> 
> [ Removed leftover bits of i_version from ext4_apply_options() since it
> now can't ever be set in ctx->mask_s_flags -- lczerner ]
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Benjamin Coddington <bcodding@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---

Since ext4 seems to ignore unknown mount options in ext4_parse_param()
removing seems good,
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
@ 2022-08-12 18:01   ` Eric Biggers
  2022-08-12 18:12   ` Eric Biggers
  2022-08-12 18:42   ` Eric Biggers
  2 siblings, 0 replies; 20+ messages in thread
From: Eric Biggers @ 2022-08-12 18:01 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, david,
	Christoph Hellwig

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE.

How can this be reconciled with the below code in __mark_inode_dirty(), which
this patch doesn't touch?

	/* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
	flags &= ~I_DIRTY_TIME;

Also inode_is_dirtytime_only(), which I thought I mentioned before:

	/*
	 * Returns true if the given inode itself only has dirty timestamps (its pages
	 * may still be dirty) and isn't currently being allocated or freed.
	 * Filesystems should call this if when writing an inode when lazytime is
	 * enabled, they want to opportunistically write the timestamps of other inodes
	 * located very nearby on-disk, e.g. in the same inode block.  This returns true
	 * if the given inode is in need of such an opportunistic update.  Requires
	 * i_lock, or at least later re-checking under i_lock.
	 */
	static inline bool inode_is_dirtytime_only(struct inode *inode)
	{
		return (inode->i_state & (I_DIRTY_TIME | I_NEW |
					I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME;
	}

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
  2022-08-12 18:01   ` Eric Biggers
@ 2022-08-12 18:12   ` Eric Biggers
  2022-08-16 11:21     ` Jan Kara
  2022-08-12 18:42   ` Eric Biggers
  2 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2022-08-12 18:12 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, david,
	Christoph Hellwig

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 6cd6953e175b..5d72b6ba4e63 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -274,6 +274,8 @@ or bottom half).
>  	This is specifically for the inode itself being marked dirty,
>  	not its data.  If the update needs to be persisted by fdatasync(),
>  	then I_DIRTY_DATASYNC will be set in the flags argument.
> +	If the inode has dirty timestamp and lazytime is enabled
> +	I_DIRTY_TIME will be set in the flags.

The new sentence is not always true, since with this patch if
__mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
the inode still has dirty timestamps.  Please be super clear about what the
flags actually mean -- I'm still struggling to understand this patch...

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
  2022-08-12 18:01   ` Eric Biggers
  2022-08-12 18:12   ` Eric Biggers
@ 2022-08-12 18:42   ` Eric Biggers
  2022-08-16 11:41     ` Jan Kara
  2 siblings, 1 reply; 20+ messages in thread
From: Eric Biggers @ 2022-08-12 18:42 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, david,
	Christoph Hellwig

On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> Currently the I_DIRTY_TIME will never get set if the inode already has
> I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
> true, however ext4 will only update the on-disk inode in
> ->dirty_inode(), not on actual writeback. As a result if the inode
> already has I_DIRTY_INODE state by the time we get to
> __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> into on-disk inode and will not get updated until the next I_DIRTY_INODE
> update, which might never come if we crash or get a power failure.
> 
> The problem can be reproduced on ext4 by running xfstest generic/622
> with -o iversion mount option.
> 
> Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> I_DIRTY_INODE. Also make sure that the case is properly handled in
> writeback_single_inode() as well. Additionally changes in
> xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
> 
> Thanks Jan Kara for suggestions on how to make this work properly.
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Sorry for so many separate emails.  One more thought: isn't there a much more
straightforward fix to this bug that wouldn't require changing the semantics of
the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
That would fix the bug by making the filesystem update the on-disk inode.

Perhaps you aren't doing that in order to strictly maintain the semantics of
'lazytime', where timestamp updates are only persisted at certain times?  Is
this useful even in the short window of time that an inode is dirty?

- Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes
  2022-08-12 12:37 [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Lukas Czerner
                   ` (2 preceding siblings ...)
  2022-08-12 13:04 ` [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Christian Brauner
@ 2022-08-12 18:42 ` Jeff Layton
  2022-08-16 11:52   ` Jan Kara
  3 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2022-08-12 18:42 UTC (permalink / raw)
  To: Lukas Czerner, linux-ext4; +Cc: tytso, jack, linux-fsdevel, ebiggers, david

On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> ea_inodes are using i_version for storing part of the reference count so
> we really need to leave it alone.
> 
> The problem can be reproduced by xfstest ext4/026 when iversion is
> enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> inodes in ext4_mark_iloc_dirty().
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> ---
> v2, v3: no change
> 
>  fs/ext4/inode.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 601214453c3a..2a220be34caa 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
>  	}
>  	ext4_fc_track_inode(handle, inode);
>  
> -	if (IS_I_VERSION(inode))
> +	/*
> +	 * ea_inodes are using i_version for storing reference count, don't
> +	 * mess with it
> +	 */
> +	if (IS_I_VERSION(inode) &&
> +	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>  		inode_inc_iversion(inode);
>  
>  	/* the do_update_inode consumes one bh->b_count */


I've spent some time writing tests for the i_version counter (still
quite rough right now), and what I've found is that this particular
inode_inc_iversion results in the counter being bumped on _reads_ as
well as writes, due to the atime changing. This call to
inode_inc_iversion seems to make no sense, as we aren't bumping the
mtime here.

I'm still working on and testing this, but I think we'll probably just
want to remove this inode_inc_iversion entirely, and leave the i_version
bumping for normal files to happen when the timestamps are updated. So
far, my testing seems to indicate that that does the right thing.

Hopefully I'll have some testcases + patches for this next week
sometime.

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 18:12   ` Eric Biggers
@ 2022-08-16 11:21     ` Jan Kara
  2022-08-21  6:14       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2022-08-16 11:21 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lukas Czerner, linux-ext4, tytso, jlayton, jack, linux-fsdevel,
	david, Christoph Hellwig

On Fri 12-08-22 11:12:27, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 6cd6953e175b..5d72b6ba4e63 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -274,6 +274,8 @@ or bottom half).
> >  	This is specifically for the inode itself being marked dirty,
> >  	not its data.  If the update needs to be persisted by fdatasync(),
> >  	then I_DIRTY_DATASYNC will be set in the flags argument.
> > +	If the inode has dirty timestamp and lazytime is enabled
> > +	I_DIRTY_TIME will be set in the flags.
> 
> The new sentence is not always true, since with this patch if
> __mark_inode_dirty(I_DIRTY_INODE) is called twice on an inode that has
> I_DIRTY_TIME, the second call will no longer include I_DIRTY_TIME -- even though
> the inode still has dirty timestamps.  Please be super clear about what the
> flags actually mean -- I'm still struggling to understand this patch...

Let me chime in here because I was the one who suggested the solution to
Lukas. There are two different things (which is why this is confusing I
guess):

1) I_DIRTY_TIME in the inode->i_state should mean: struct inode has times
updated after we last called ->dirty_inode() callback. Hence
inode_is_dirtytime_only() as well as the chunk:
                /* I_DIRTY_INODE supersedes I_DIRTY_TIME. */
                flags &= ~I_DIRTY_TIME;
you mention in the previous email are compatible with this meaning AFAICT.

2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
bit of a hack. Currently XFS relies on the fact that the only time its
->dirty_inode() callback needs to do anything is when VFS decides it is
time to writeback timestamps and XFS detects this situation by checking for
I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
callback (otherwise timestamp update can get lost). So the solution I've
suggested was to propagate the information "timestamp update needed" to XFS
through I_DIRTY_TIME in flags passed to ->dirty_inode().

I hope things are clearer now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-12 18:42   ` Eric Biggers
@ 2022-08-16 11:41     ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-16 11:41 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Lukas Czerner, linux-ext4, tytso, jlayton, jack, linux-fsdevel,
	david, Christoph Hellwig

On Fri 12-08-22 11:42:21, Eric Biggers wrote:
> On Fri, Aug 12, 2022 at 02:37:26PM +0200, Lukas Czerner wrote:
> > Currently the I_DIRTY_TIME will never get set if the inode already has
> > I_DIRTY_INODE with assumption that it supersedes I_DIRTY_TIME.  That's
> > true, however ext4 will only update the on-disk inode in
> > ->dirty_inode(), not on actual writeback. As a result if the inode
> > already has I_DIRTY_INODE state by the time we get to
> > __mark_inode_dirty() only with I_DIRTY_TIME, the time was already filled
> > into on-disk inode and will not get updated until the next I_DIRTY_INODE
> > update, which might never come if we crash or get a power failure.
> > 
> > The problem can be reproduced on ext4 by running xfstest generic/622
> > with -o iversion mount option.
> > 
> > Fix it by allowing I_DIRTY_TIME to be set even if the inode already has
> > I_DIRTY_INODE. Also make sure that the case is properly handled in
> > writeback_single_inode() as well. Additionally changes in
> > xfs_fs_dirty_inode() was made to accommodate for I_DIRTY_TIME in flag.
> > 
> > Thanks Jan Kara for suggestions on how to make this work properly.
> > 
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> 
> Sorry for so many separate emails.  One more thought: isn't there a much more
> straightforward fix to this bug that wouldn't require changing the semantics of
> the inode flags: on __mark_inode_dirty(I_DIRTY_TIME), if the inode already has
> i_state & I_DIRTY_INODE, just call ->dirty_inode with i_state & I_DIRTY_INODE?
> That would fix the bug by making the filesystem update the on-disk inode.

This is a good question and I was also considering this when we first
discussed the problem with Lukas. It should fix the bug for ext4 but
eventually I've decided against this because XFS would still need something
else to fix the problem (see my previous email) and for ext4 it seemed as
unnecessary overhead (see below).

> Perhaps you aren't doing that in order to strictly maintain the semantics of
> 'lazytime', where timestamp updates are only persisted at certain times?  Is
> this useful even in the short window of time that an inode is dirty?

The result of this for ext4 will be that after the inode is dirtied for
some reason, we will be logging every timestamp update to the journal
(effectively disabling lazytime for the inode) for the 30s time window that
the inode stays dirty before writeback code decides to do writeback (which
just results in clearing the I_DIRTY_INODE flag for ext4). Not too bad I
guess but I'd prefer to avoid this overhead.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-12 12:37 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
  2022-08-12 13:05   ` Christian Brauner
@ 2022-08-16 11:48   ` Jan Kara
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-16 11:48 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jlayton, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong

On Fri 12-08-22 14:37:27, Lukas Czerner wrote:
> From: Jeff Layton <jlayton@kernel.org>
> 
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).
> 
> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
> 
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well, since we're almost to 5.20 anyway.
> 
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
> 
> [ Removed leftover bits of i_version from ext4_apply_options() since it
> now can't ever be set in ctx->mask_s_flags -- lczerner ]
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Benjamin Coddington <bcodding@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> v3: Removed leftover bits of i_version from ext4_apply_options
> v4: no change
> 
>  fs/ext4/inode.c |  5 ++---
>  fs/ext4/super.c | 21 ++++-----------------
>  2 files changed, 6 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2a220be34caa..c77d40f05763 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>  			return -EINVAL;
>  		}
>  
> -		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
> +		if (attr->ia_size != inode->i_size)
>  			inode_inc_iversion(inode);
>  
>  		if (shrink) {
> @@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
>  	 * ea_inodes are using i_version for storing reference count, don't
>  	 * mess with it
>  	 */
> -	if (IS_I_VERSION(inode) &&
> -	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> +	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>  		inode_inc_iversion(inode);
>  
>  	/* the do_update_inode consumes one bh->b_count */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..1c953f6d400e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1585,7 +1585,7 @@ enum {
>  	Opt_inlinecrypt,
>  	Opt_usrjquota, Opt_grpjquota, Opt_quota,
>  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> -	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> +	Opt_usrquota, Opt_grpquota, Opt_prjquota,
>  	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
>  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
>  	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	fsparam_flag	("barrier",		Opt_barrier),
>  	fsparam_u32	("barrier",		Opt_barrier),
>  	fsparam_flag	("nobarrier",		Opt_nobarrier),
> -	fsparam_flag	("i_version",		Opt_i_version),
>  	fsparam_flag	("dax",			Opt_dax),
>  	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
>  	fsparam_u32	("stripe",		Opt_stripe),
> @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  	case Opt_abort:
>  		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
>  		return 0;
> -	case Opt_i_version:
> -		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> -		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> -		ctx_set_flags(ctx, SB_I_VERSION);
> -		return 0;
>  	case Opt_inlinecrypt:
>  #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
>  		ctx_set_flags(ctx, SB_INLINECRYPT);
> @@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
>  	sb->s_flags &= ~ctx->mask_s_flags;
>  	sb->s_flags |= ctx->vals_s_flags;
>  
> -	/*
> -	 * i_version differs from common mount option iversion so we have
> -	 * to let vfs know that it was set, otherwise it would get cleared
> -	 * on remount
> -	 */
> -	if (ctx->mask_s_flags & SB_I_VERSION)
> -		fc->sb_flags |= SB_I_VERSION;
> -
>  #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
>  	APPLY(s_commit_interval);
>  	APPLY(s_stripe);
> @@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
>  	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
>  		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> -	if (sb->s_flags & SB_I_VERSION)
> -		SEQ_OPTS_PUTS("i_version");
>  	if (nodefs || sbi->s_stripe)
>  		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
>  	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> @@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
>  		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
>  
> +	/* i_version is always enabled now */
> +	sb->s_flags |= SB_I_VERSION;
> +
>  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
>  	    (ext4_has_compat_features(sb) ||
>  	     ext4_has_ro_compat_features(sb) ||
> -- 
> 2.37.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes
  2022-08-12 18:42 ` Jeff Layton
@ 2022-08-16 11:52   ` Jan Kara
  2022-08-16 12:18     ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2022-08-16 11:52 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Lukas Czerner, linux-ext4, tytso, jack, linux-fsdevel, ebiggers, david

On Fri 12-08-22 14:42:36, Jeff Layton wrote:
> On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> > ea_inodes are using i_version for storing part of the reference count so
> > we really need to leave it alone.
> > 
> > The problem can be reproduced by xfstest ext4/026 when iversion is
> > enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> > inodes in ext4_mark_iloc_dirty().
> > 
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > ---
> > v2, v3: no change
> > 
> >  fs/ext4/inode.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 601214453c3a..2a220be34caa 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> >  	}
> >  	ext4_fc_track_inode(handle, inode);
> >  
> > -	if (IS_I_VERSION(inode))
> > +	/*
> > +	 * ea_inodes are using i_version for storing reference count, don't
> > +	 * mess with it
> > +	 */
> > +	if (IS_I_VERSION(inode) &&
> > +	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> >  		inode_inc_iversion(inode);
> >  
> >  	/* the do_update_inode consumes one bh->b_count */
> 
> 
> I've spent some time writing tests for the i_version counter (still
> quite rough right now), and what I've found is that this particular
> inode_inc_iversion results in the counter being bumped on _reads_ as
> well as writes, due to the atime changing. This call to
> inode_inc_iversion seems to make no sense, as we aren't bumping the
> mtime here.
> 
> I'm still working on and testing this, but I think we'll probably just
> want to remove this inode_inc_iversion entirely, and leave the i_version
> bumping for normal files to happen when the timestamps are updated. So
> far, my testing seems to indicate that that does the right thing.

I agree that inode_inc_iversion() may be overly agressive here but where
else does get iversion updated for things like inode owner update or
permission changes?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes
  2022-08-16 11:52   ` Jan Kara
@ 2022-08-16 12:18     ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2022-08-16 12:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: Lukas Czerner, linux-ext4, tytso, linux-fsdevel, ebiggers, david

On Tue, 2022-08-16 at 13:52 +0200, Jan Kara wrote:
> On Fri 12-08-22 14:42:36, Jeff Layton wrote:
> > On Fri, 2022-08-12 at 14:37 +0200, Lukas Czerner wrote:
> > > ea_inodes are using i_version for storing part of the reference count so
> > > we really need to leave it alone.
> > > 
> > > The problem can be reproduced by xfstest ext4/026 when iversion is
> > > enabled. Fix it by not calling inode_inc_iversion() for EXT4_EA_INODE_FL
> > > inodes in ext4_mark_iloc_dirty().
> > > 
> > > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > v2, v3: no change
> > > 
> > >  fs/ext4/inode.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 601214453c3a..2a220be34caa 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -5731,7 +5731,12 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> > >  	}
> > >  	ext4_fc_track_inode(handle, inode);
> > >  
> > > -	if (IS_I_VERSION(inode))
> > > +	/*
> > > +	 * ea_inodes are using i_version for storing reference count, don't
> > > +	 * mess with it
> > > +	 */
> > > +	if (IS_I_VERSION(inode) &&
> > > +	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > >  		inode_inc_iversion(inode)
> > >  
> > >  	/* the do_update_inode consumes one bh->b_count */
> > 
> > 
> > I've spent some time writing tests for the i_version counter (still
> > quite rough right now), and what I've found is that this particular
> > inode_inc_iversion results in the counter being bumped on _reads_ as
> > well as writes, due to the atime changing. This call to
> > inode_inc_iversion seems to make no sense, as we aren't bumping the
> > mtime here.
> > 
> > I'm still working on and testing this, but I think we'll probably just
> > want to remove this inode_inc_iversion entirely, and leave the i_version
> > bumping for normal files to happen when the timestamps are updated. So
> > far, my testing seems to indicate that that does the right thing.
> 
> I agree that inode_inc_iversion() may be overly agressive here but where
> else does get iversion updated for things like inode owner update or
> permission changes?
> 
> 								Honza

If we remove it here, then both the setattr and setxattr codepaths will
need to explicitly bump the iversion counter. Note that we update the
ctime in those paths too, so that gives us a guidepost as to when we
should update i_version. xfs will need similar changes, but btrfs turns
out to already do the right thing.

I'm planning to post my latest patches in just a bit.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-16 11:21     ` Jan Kara
@ 2022-08-21  6:14       ` Christoph Hellwig
  2022-08-22  8:33         ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2022-08-21  6:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Eric Biggers, Lukas Czerner, linux-ext4, tytso, jlayton,
	linux-fsdevel, david, Christoph Hellwig

On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> bit of a hack. Currently XFS relies on the fact that the only time its
> ->dirty_inode() callback needs to do anything is when VFS decides it is
> time to writeback timestamps and XFS detects this situation by checking for
> I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> callback (otherwise timestamp update can get lost). So the solution I've
> suggested was to propagate the information "timestamp update needed" to XFS
> through I_DIRTY_TIME in flags passed to ->dirty_inode().

Maybe we should just add a separate update_lazy_time method to make this
a little more clear?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE
  2022-08-21  6:14       ` Christoph Hellwig
@ 2022-08-22  8:33         ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-08-22  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Eric Biggers, Lukas Czerner, linux-ext4, tytso,
	jlayton, linux-fsdevel, david

On Sat 20-08-22 23:14:37, Christoph Hellwig wrote:
> On Tue, Aug 16, 2022 at 01:21:24PM +0200, Jan Kara wrote:
> > 2) I_DIRTY_TIME flag passed to ->dirty_inode() callback. This is admittedly
> > bit of a hack. Currently XFS relies on the fact that the only time its
> > ->dirty_inode() callback needs to do anything is when VFS decides it is
> > time to writeback timestamps and XFS detects this situation by checking for
> > I_DIRTY_TIME in inode->i_state. Now to fix the race, we need to first clear
> > I_DIRTY_TIME in inode->i_state and only then call the ->dirty_inode()
> > callback (otherwise timestamp update can get lost). So the solution I've
> > suggested was to propagate the information "timestamp update needed" to XFS
> > through I_DIRTY_TIME in flags passed to ->dirty_inode().
> 
> Maybe we should just add a separate update_lazy_time method to make this
> a little more clear?

Yes, we could do that if people prefer this. Although I'd say that good
documentation at the place in __mark_inode_dirty() where this gets used and
in documentation of .dirty_inode might clear the confusion as well.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-29  8:17     ` Lukas Czerner
@ 2022-08-29 10:16       ` Jeff Layton
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff Layton @ 2022-08-29 10:16 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: linux-ext4, tytso, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong,
	Christian Brauner

On Mon, 2022-08-29 at 10:17 +0200, Lukas Czerner wrote:
> On Fri, Aug 26, 2022 at 12:11:23PM -0400, Jeff Layton wrote:
> > On Wed, 2022-08-24 at 18:03 +0200, Lukas Czerner wrote:
> > > From: Jeff Layton <jlayton@kernel.org>
> > > 
> > > The original i_version implementation was pretty expensive, requiring a
> > > log flush on every change. Because of this, it was gated behind a mount
> > > option (implemented via the MS_I_VERSION mountoption flag).
> > > 
> > > Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> > > i_version flag much less expensive, so there is no longer a performance
> > > penalty from enabling it. xfs and btrfs already enable it
> > > unconditionally when the on-disk format can support it.
> > > 
> > > Have ext4 ignore the SB_I_VERSION flag, and just enable it
> > > unconditionally. While we're in here, remove the handling of
> > > Opt_i_version as well, since we're almost to 5.20 anyway.
> > > 
> > > Ideally, we'd couple this change with a way to disable the i_version
> > > counter (just in case), but the way the iversion mount option was
> > > implemented makes that difficult to do. We'd need to add a new mount
> > > option altogether or do something with tune2fs. That's probably best
> > > left to later patches if it turns out to be needed.
> > > 
> > > [ Removed leftover bits of i_version from ext4_apply_options() since it
> > > now can't ever be set in ctx->mask_s_flags -- lczerner ]
> > > 
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: Benjamin Coddington <bcodding@redhat.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: Darrick J. Wong <djwong@kernel.org>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > > Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> > > v3: Removed leftover bits of i_version from ext4_apply_options
> > > v4: no change
> > > 
> > >  fs/ext4/inode.c |  5 ++---
> > >  fs/ext4/super.c | 21 ++++-----------------
> > >  2 files changed, 6 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 2a220be34caa..c77d40f05763 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
> > >  			return -EINVAL;
> > >  		}
> > >  
> > > -		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
> > > +		if (attr->ia_size != inode->i_size)
> > >  			inode_inc_iversion(inode);
> > >  
> > >  		if (shrink) {
> > > @@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> > >  	 * ea_inodes are using i_version for storing reference count, don't
> > >  	 * mess with it
> > >  	 */
> > > -	if (IS_I_VERSION(inode) &&
> > > -	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > > +	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > >  		inode_inc_iversion(inode);
> > >  
> > >  	/* the do_update_inode consumes one bh->b_count */
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 9a66abcca1a8..1c953f6d400e 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -1585,7 +1585,7 @@ enum {
> > >  	Opt_inlinecrypt,
> > >  	Opt_usrjquota, Opt_grpjquota, Opt_quota,
> > >  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> > > -	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> > > +	Opt_usrquota, Opt_grpquota, Opt_prjquota,
> > >  	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
> > >  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
> > >  	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> > > @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
> > >  	fsparam_flag	("barrier",		Opt_barrier),
> > >  	fsparam_u32	("barrier",		Opt_barrier),
> > >  	fsparam_flag	("nobarrier",		Opt_nobarrier),
> > > -	fsparam_flag	("i_version",		Opt_i_version),
> > >  	fsparam_flag	("dax",			Opt_dax),
> > >  	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
> > >  	fsparam_u32	("stripe",		Opt_stripe),
> > > @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > >  	case Opt_abort:
> > >  		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
> > >  		return 0;
> > > -	case Opt_i_version:
> > > -		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> > > -		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> > > -		ctx_set_flags(ctx, SB_I_VERSION);
> > > -		return 0;
> > >  	case Opt_inlinecrypt:
> > >  #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
> > >  		ctx_set_flags(ctx, SB_INLINECRYPT);
> > > @@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
> > >  	sb->s_flags &= ~ctx->mask_s_flags;
> > >  	sb->s_flags |= ctx->vals_s_flags;
> > >  
> > > -	/*
> > > -	 * i_version differs from common mount option iversion so we have
> > > -	 * to let vfs know that it was set, otherwise it would get cleared
> > > -	 * on remount
> > > -	 */
> > > -	if (ctx->mask_s_flags & SB_I_VERSION)
> > > -		fc->sb_flags |= SB_I_VERSION;
> > > -
> > >  #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
> > >  	APPLY(s_commit_interval);
> > >  	APPLY(s_stripe);
> > > @@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> > >  		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
> > >  	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
> > >  		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> > > -	if (sb->s_flags & SB_I_VERSION)
> > > -		SEQ_OPTS_PUTS("i_version");
> > >  	if (nodefs || sbi->s_stripe)
> > >  		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
> > >  	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> > > @@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> > >  	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
> > >  		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
> > >  
> > > +	/* i_version is always enabled now */
> > > +	sb->s_flags |= SB_I_VERSION;
> > > +
> > >  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
> > >  	    (ext4_has_compat_features(sb) ||
> > >  	     ext4_has_ro_compat_features(sb) ||
> > 
> > Hi Lukas,
> > 
> > I know I had originally asked you to shepherd this patch into mainline,
> > but I think it may be better to wait on it for now. Since I asked that,
> > we've since found out that ext4 is bumping the i_version counter on
> > atime updates. It'd be best to get that fixed before we turn this on
> > unconditionally, since it could cause a performance regression in some
> > cases. I'll plan to pick this back up for my latest i_version series if
> > that sounds ok to you.
> > 
> > Sorry for the back and forth, and thanks again!
> 
> Hi Jeff,
> 
> sure, no problem. I can drop the patch. The rest of the series is still
> valid though.
> 
> Thanks!
> -Lukas
> 
> 

Yes, the rest is fine (AFAICT)

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-26 16:11   ` Jeff Layton
@ 2022-08-29  8:17     ` Lukas Czerner
  2022-08-29 10:16       ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Lukas Czerner @ 2022-08-29  8:17 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-ext4, tytso, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong,
	Christian Brauner

On Fri, Aug 26, 2022 at 12:11:23PM -0400, Jeff Layton wrote:
> On Wed, 2022-08-24 at 18:03 +0200, Lukas Czerner wrote:
> > From: Jeff Layton <jlayton@kernel.org>
> > 
> > The original i_version implementation was pretty expensive, requiring a
> > log flush on every change. Because of this, it was gated behind a mount
> > option (implemented via the MS_I_VERSION mountoption flag).
> > 
> > Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> > i_version flag much less expensive, so there is no longer a performance
> > penalty from enabling it. xfs and btrfs already enable it
> > unconditionally when the on-disk format can support it.
> > 
> > Have ext4 ignore the SB_I_VERSION flag, and just enable it
> > unconditionally. While we're in here, remove the handling of
> > Opt_i_version as well, since we're almost to 5.20 anyway.
> > 
> > Ideally, we'd couple this change with a way to disable the i_version
> > counter (just in case), but the way the iversion mount option was
> > implemented makes that difficult to do. We'd need to add a new mount
> > option altogether or do something with tune2fs. That's probably best
> > left to later patches if it turns out to be needed.
> > 
> > [ Removed leftover bits of i_version from ext4_apply_options() since it
> > now can't ever be set in ctx->mask_s_flags -- lczerner ]
> > 
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Benjamin Coddington <bcodding@redhat.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Darrick J. Wong <djwong@kernel.org>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> > Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---
> > v3: Removed leftover bits of i_version from ext4_apply_options
> > v4: no change
> > 
> >  fs/ext4/inode.c |  5 ++---
> >  fs/ext4/super.c | 21 ++++-----------------
> >  2 files changed, 6 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 2a220be34caa..c77d40f05763 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
> >  			return -EINVAL;
> >  		}
> >  
> > -		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
> > +		if (attr->ia_size != inode->i_size)
> >  			inode_inc_iversion(inode);
> >  
> >  		if (shrink) {
> > @@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
> >  	 * ea_inodes are using i_version for storing reference count, don't
> >  	 * mess with it
> >  	 */
> > -	if (IS_I_VERSION(inode) &&
> > -	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> > +	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> >  		inode_inc_iversion(inode);
> >  
> >  	/* the do_update_inode consumes one bh->b_count */
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 9a66abcca1a8..1c953f6d400e 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1585,7 +1585,7 @@ enum {
> >  	Opt_inlinecrypt,
> >  	Opt_usrjquota, Opt_grpjquota, Opt_quota,
> >  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> > -	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> > +	Opt_usrquota, Opt_grpquota, Opt_prjquota,
> >  	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
> >  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
> >  	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> > @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
> >  	fsparam_flag	("barrier",		Opt_barrier),
> >  	fsparam_u32	("barrier",		Opt_barrier),
> >  	fsparam_flag	("nobarrier",		Opt_nobarrier),
> > -	fsparam_flag	("i_version",		Opt_i_version),
> >  	fsparam_flag	("dax",			Opt_dax),
> >  	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
> >  	fsparam_u32	("stripe",		Opt_stripe),
> > @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  	case Opt_abort:
> >  		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
> >  		return 0;
> > -	case Opt_i_version:
> > -		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> > -		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> > -		ctx_set_flags(ctx, SB_I_VERSION);
> > -		return 0;
> >  	case Opt_inlinecrypt:
> >  #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
> >  		ctx_set_flags(ctx, SB_INLINECRYPT);
> > @@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
> >  	sb->s_flags &= ~ctx->mask_s_flags;
> >  	sb->s_flags |= ctx->vals_s_flags;
> >  
> > -	/*
> > -	 * i_version differs from common mount option iversion so we have
> > -	 * to let vfs know that it was set, otherwise it would get cleared
> > -	 * on remount
> > -	 */
> > -	if (ctx->mask_s_flags & SB_I_VERSION)
> > -		fc->sb_flags |= SB_I_VERSION;
> > -
> >  #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
> >  	APPLY(s_commit_interval);
> >  	APPLY(s_stripe);
> > @@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
> >  		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
> >  	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
> >  		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> > -	if (sb->s_flags & SB_I_VERSION)
> > -		SEQ_OPTS_PUTS("i_version");
> >  	if (nodefs || sbi->s_stripe)
> >  		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
> >  	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> > @@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
> >  	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
> >  		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
> >  
> > +	/* i_version is always enabled now */
> > +	sb->s_flags |= SB_I_VERSION;
> > +
> >  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
> >  	    (ext4_has_compat_features(sb) ||
> >  	     ext4_has_ro_compat_features(sb) ||
> 
> Hi Lukas,
> 
> I know I had originally asked you to shepherd this patch into mainline,
> but I think it may be better to wait on it for now. Since I asked that,
> we've since found out that ext4 is bumping the i_version counter on
> atime updates. It'd be best to get that fixed before we turn this on
> unconditionally, since it could cause a performance regression in some
> cases. I'll plan to pick this back up for my latest i_version series if
> that sounds ok to you.
> 
> Sorry for the back and forth, and thanks again!

Hi Jeff,

sure, no problem. I can drop the patch. The rest of the series is still
valid though.

Thanks!
-Lukas

> 
> Cheers,
> -- 
> Jeff Layton <jlayton@kernel.org>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-24 16:03 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
@ 2022-08-26 16:11   ` Jeff Layton
  2022-08-29  8:17     ` Lukas Czerner
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff Layton @ 2022-08-26 16:11 UTC (permalink / raw)
  To: Lukas Czerner, linux-ext4
  Cc: tytso, jack, linux-fsdevel, ebiggers, david, Benjamin Coddington,
	Christoph Hellwig, Darrick J . Wong, Christian Brauner

On Wed, 2022-08-24 at 18:03 +0200, Lukas Czerner wrote:
> From: Jeff Layton <jlayton@kernel.org>
> 
> The original i_version implementation was pretty expensive, requiring a
> log flush on every change. Because of this, it was gated behind a mount
> option (implemented via the MS_I_VERSION mountoption flag).
> 
> Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
> i_version flag much less expensive, so there is no longer a performance
> penalty from enabling it. xfs and btrfs already enable it
> unconditionally when the on-disk format can support it.
> 
> Have ext4 ignore the SB_I_VERSION flag, and just enable it
> unconditionally. While we're in here, remove the handling of
> Opt_i_version as well, since we're almost to 5.20 anyway.
> 
> Ideally, we'd couple this change with a way to disable the i_version
> counter (just in case), but the way the iversion mount option was
> implemented makes that difficult to do. We'd need to add a new mount
> option altogether or do something with tune2fs. That's probably best
> left to later patches if it turns out to be needed.
> 
> [ Removed leftover bits of i_version from ext4_apply_options() since it
> now can't ever be set in ctx->mask_s_flags -- lczerner ]
> 
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Benjamin Coddington <bcodding@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> v3: Removed leftover bits of i_version from ext4_apply_options
> v4: no change
> 
>  fs/ext4/inode.c |  5 ++---
>  fs/ext4/super.c | 21 ++++-----------------
>  2 files changed, 6 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2a220be34caa..c77d40f05763 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
>  			return -EINVAL;
>  		}
>  
> -		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
> +		if (attr->ia_size != inode->i_size)
>  			inode_inc_iversion(inode);
>  
>  		if (shrink) {
> @@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
>  	 * ea_inodes are using i_version for storing reference count, don't
>  	 * mess with it
>  	 */
> -	if (IS_I_VERSION(inode) &&
> -	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
> +	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
>  		inode_inc_iversion(inode);
>  
>  	/* the do_update_inode consumes one bh->b_count */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 9a66abcca1a8..1c953f6d400e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1585,7 +1585,7 @@ enum {
>  	Opt_inlinecrypt,
>  	Opt_usrjquota, Opt_grpjquota, Opt_quota,
>  	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
> -	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
> +	Opt_usrquota, Opt_grpquota, Opt_prjquota,
>  	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
>  	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
>  	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
> @@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
>  	fsparam_flag	("barrier",		Opt_barrier),
>  	fsparam_u32	("barrier",		Opt_barrier),
>  	fsparam_flag	("nobarrier",		Opt_nobarrier),
> -	fsparam_flag	("i_version",		Opt_i_version),
>  	fsparam_flag	("dax",			Opt_dax),
>  	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
>  	fsparam_u32	("stripe",		Opt_stripe),
> @@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  	case Opt_abort:
>  		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
>  		return 0;
> -	case Opt_i_version:
> -		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
> -		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
> -		ctx_set_flags(ctx, SB_I_VERSION);
> -		return 0;
>  	case Opt_inlinecrypt:
>  #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
>  		ctx_set_flags(ctx, SB_INLINECRYPT);
> @@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
>  	sb->s_flags &= ~ctx->mask_s_flags;
>  	sb->s_flags |= ctx->vals_s_flags;
>  
> -	/*
> -	 * i_version differs from common mount option iversion so we have
> -	 * to let vfs know that it was set, otherwise it would get cleared
> -	 * on remount
> -	 */
> -	if (ctx->mask_s_flags & SB_I_VERSION)
> -		fc->sb_flags |= SB_I_VERSION;
> -
>  #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
>  	APPLY(s_commit_interval);
>  	APPLY(s_stripe);
> @@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
>  	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
>  		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
> -	if (sb->s_flags & SB_I_VERSION)
> -		SEQ_OPTS_PUTS("i_version");
>  	if (nodefs || sbi->s_stripe)
>  		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
>  	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
> @@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
>  		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
>  
> +	/* i_version is always enabled now */
> +	sb->s_flags |= SB_I_VERSION;
> +
>  	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
>  	    (ext4_has_compat_features(sb) ||
>  	     ext4_has_ro_compat_features(sb) ||

Hi Lukas,

I know I had originally asked you to shepherd this patch into mainline,
but I think it may be better to wait on it for now. Since I asked that,
we've since found out that ext4 is bumping the i_version counter on
atime updates. It'd be best to get that fixed before we turn this on
unconditionally, since it could cause a performance regression in some
cases. I'll plan to pick this back up for my latest i_version series if
that sounds ok to you.

Sorry for the back and forth, and thanks again!

Cheers,
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v4 3/3] ext4: unconditionally enable the i_version counter
  2022-08-24 16:03 [PATCH v4 " Lukas Czerner
@ 2022-08-24 16:03 ` Lukas Czerner
  2022-08-26 16:11   ` Jeff Layton
  0 siblings, 1 reply; 20+ messages in thread
From: Lukas Czerner @ 2022-08-24 16:03 UTC (permalink / raw)
  To: linux-ext4
  Cc: tytso, jlayton, jack, linux-fsdevel, ebiggers, david,
	Benjamin Coddington, Christoph Hellwig, Darrick J . Wong,
	Christian Brauner

From: Jeff Layton <jlayton@kernel.org>

The original i_version implementation was pretty expensive, requiring a
log flush on every change. Because of this, it was gated behind a mount
option (implemented via the MS_I_VERSION mountoption flag).

Commit ae5e165d855d (fs: new API for handling inode->i_version) made the
i_version flag much less expensive, so there is no longer a performance
penalty from enabling it. xfs and btrfs already enable it
unconditionally when the on-disk format can support it.

Have ext4 ignore the SB_I_VERSION flag, and just enable it
unconditionally. While we're in here, remove the handling of
Opt_i_version as well, since we're almost to 5.20 anyway.

Ideally, we'd couple this change with a way to disable the i_version
counter (just in case), but the way the iversion mount option was
implemented makes that difficult to do. We'd need to add a new mount
option altogether or do something with tune2fs. That's probably best
left to later patches if it turns out to be needed.

[ Removed leftover bits of i_version from ext4_apply_options() since it
now can't ever be set in ctx->mask_s_flags -- lczerner ]

Cc: Dave Chinner <david@fromorbit.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
---
v3: Removed leftover bits of i_version from ext4_apply_options
v4: no change

 fs/ext4/inode.c |  5 ++---
 fs/ext4/super.c | 21 ++++-----------------
 2 files changed, 6 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a220be34caa..c77d40f05763 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5425,7 +5425,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
 			return -EINVAL;
 		}
 
-		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
+		if (attr->ia_size != inode->i_size)
 			inode_inc_iversion(inode);
 
 		if (shrink) {
@@ -5735,8 +5735,7 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 	 * ea_inodes are using i_version for storing reference count, don't
 	 * mess with it
 	 */
-	if (IS_I_VERSION(inode) &&
-	    !(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
+	if (!(EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL))
 		inode_inc_iversion(inode);
 
 	/* the do_update_inode consumes one bh->b_count */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..1c953f6d400e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1585,7 +1585,7 @@ enum {
 	Opt_inlinecrypt,
 	Opt_usrjquota, Opt_grpjquota, Opt_quota,
 	Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
-	Opt_usrquota, Opt_grpquota, Opt_prjquota, Opt_i_version,
+	Opt_usrquota, Opt_grpquota, Opt_prjquota,
 	Opt_dax, Opt_dax_always, Opt_dax_inode, Opt_dax_never,
 	Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_warn_on_error,
 	Opt_nowarn_on_error, Opt_mblk_io_submit, Opt_debug_want_extra_isize,
@@ -1694,7 +1694,6 @@ static const struct fs_parameter_spec ext4_param_specs[] = {
 	fsparam_flag	("barrier",		Opt_barrier),
 	fsparam_u32	("barrier",		Opt_barrier),
 	fsparam_flag	("nobarrier",		Opt_nobarrier),
-	fsparam_flag	("i_version",		Opt_i_version),
 	fsparam_flag	("dax",			Opt_dax),
 	fsparam_enum	("dax",			Opt_dax_type, ext4_param_dax),
 	fsparam_u32	("stripe",		Opt_stripe),
@@ -2140,11 +2139,6 @@ static int ext4_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	case Opt_abort:
 		ctx_set_mount_flag(ctx, EXT4_MF_FS_ABORTED);
 		return 0;
-	case Opt_i_version:
-		ext4_msg(NULL, KERN_WARNING, deprecated_msg, param->key, "5.20");
-		ext4_msg(NULL, KERN_WARNING, "Use iversion instead\n");
-		ctx_set_flags(ctx, SB_I_VERSION);
-		return 0;
 	case Opt_inlinecrypt:
 #ifdef CONFIG_FS_ENCRYPTION_INLINE_CRYPT
 		ctx_set_flags(ctx, SB_INLINECRYPT);
@@ -2814,14 +2808,6 @@ static void ext4_apply_options(struct fs_context *fc, struct super_block *sb)
 	sb->s_flags &= ~ctx->mask_s_flags;
 	sb->s_flags |= ctx->vals_s_flags;
 
-	/*
-	 * i_version differs from common mount option iversion so we have
-	 * to let vfs know that it was set, otherwise it would get cleared
-	 * on remount
-	 */
-	if (ctx->mask_s_flags & SB_I_VERSION)
-		fc->sb_flags |= SB_I_VERSION;
-
 #define APPLY(X) ({ if (ctx->spec & EXT4_SPEC_##X) sbi->X = ctx->X; })
 	APPLY(s_commit_interval);
 	APPLY(s_stripe);
@@ -2970,8 +2956,6 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PRINT("min_batch_time=%u", sbi->s_min_batch_time);
 	if (nodefs || sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME)
 		SEQ_OPTS_PRINT("max_batch_time=%u", sbi->s_max_batch_time);
-	if (sb->s_flags & SB_I_VERSION)
-		SEQ_OPTS_PUTS("i_version");
 	if (nodefs || sbi->s_stripe)
 		SEQ_OPTS_PRINT("stripe=%lu", sbi->s_stripe);
 	if (nodefs || EXT4_MOUNT_DATA_FLAGS &
@@ -4640,6 +4624,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 		(test_opt(sb, POSIX_ACL) ? SB_POSIXACL : 0);
 
+	/* i_version is always enabled now */
+	sb->s_flags |= SB_I_VERSION;
+
 	if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
 	    (ext4_has_compat_features(sb) ||
 	     ext4_has_ro_compat_features(sb) ||
-- 
2.37.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-08-29 10:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-12 12:37 [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Lukas Czerner
2022-08-12 12:37 ` [PATCH v3 2/3] fs: record I_DIRTY_TIME even if inode already has I_DIRTY_INODE Lukas Czerner
2022-08-12 18:01   ` Eric Biggers
2022-08-12 18:12   ` Eric Biggers
2022-08-16 11:21     ` Jan Kara
2022-08-21  6:14       ` Christoph Hellwig
2022-08-22  8:33         ` Jan Kara
2022-08-12 18:42   ` Eric Biggers
2022-08-16 11:41     ` Jan Kara
2022-08-12 12:37 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
2022-08-12 13:05   ` Christian Brauner
2022-08-16 11:48   ` Jan Kara
2022-08-12 13:04 ` [PATCH v3 1/3] ext4: don't increase iversion counter for ea_inodes Christian Brauner
2022-08-12 18:42 ` Jeff Layton
2022-08-16 11:52   ` Jan Kara
2022-08-16 12:18     ` Jeff Layton
2022-08-24 16:03 [PATCH v4 " Lukas Czerner
2022-08-24 16:03 ` [PATCH v4 3/3] ext4: unconditionally enable the i_version counter Lukas Czerner
2022-08-26 16:11   ` Jeff Layton
2022-08-29  8:17     ` Lukas Czerner
2022-08-29 10:16       ` Jeff Layton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).